I've been checking out OpenSolaris for a NAS possibility, and it's pretty nifty. A different dialect than I'm used to, but still nifty.
Unfortunately, it seems to have a nasty problem in file I/O. Here are some metrics (40GB file, with 32K and 64K record-sizes).
OpenFiler
                                     random   random
      KB  reclen   write     read      read    write
41943040      32  296238   118598     15682    62388
41943040      64  297141   118861     23731    86620

OpenSolaris
                                     random   random
      KB  reclen   write     read      read    write
41943040      32  259170  1179515      8458     7461
41943040      64  244747  1133916     13894    13001

Identical hardware, different operating system. I've figured out that the stellar read performance is due to the ZFS 'recordsize' being 128K. When I drop it down to 4K, similar to the block size of XFS on OpenFiler, the read performance is very similar. What I don't get is what's causing the large difference in random I/O. Random-write is exceedingly bad. With the recordsize dropped to 4K on ZFS, the random-read gets even worse; I haven't stuck with it long enough to see what it does to random-write.
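For anyone wanting to repeat the recordsize experiment, it's a one-line property change. The pool and file-system names below are placeholders, not what I actually used:

```shell
# Check the current recordsize (the ZFS default is 128K).
zfs get recordsize tank/data

# Drop it to 4K to match XFS's block size. Note this only applies
# to files written after the change; existing files keep the
# recordsize they were created with, so rewrite the test file.
zfs set recordsize=4K tank/data
```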
Poking into iostat shows that both OpenFiler and OpenSolaris are striping I/O across the four logical disks available to them. I know the storage side is able to pump the I/O, as witnessed by the random-write speed on OpenFiler. The chosen file size is larger than local RAM, so local caching effects are minimized.
As I mentioned back in the know-your-IO article series, random-read is the best analog of the I/O pattern your backup process follows when backing up large, disorganized piles of files. Cache/pre-fetch will help with this to some extent, but the above numbers give a fair idea of the lower bound of speed. OpenSolaris is w-a-y too slow. At least, the way I've got it configured, which is largely out-of-the-box.
Unfortunately, I don't know if this bottleneck is a driver issue (HP's fault) or an OS issue. I don't know enough of the internals of ZFS to hazard a guess.
Please post details of the underlying hardware, RAID volumes, how you create the ZFS pool, etc. The same for Linux/XFS and how you tested performance.
The hardware is all HP. A SmartArray P800 controller, four MSA shelves housing 7.2K RPM drives. The RAID card is configured to present four separate 7.5TB Raid5 LUNs. The server is a DL360G6 with 32GB of RAM.
The performance test is fairly simple. I'm using iozone, which I've used here before (see the benchmarking category in the sidebar), with the command, "iozone -s 40G -r 32k -r 64k", testing two separate record sizes.
OpenFiler runs Linux 2.6.31. The storage was on an XFS volume created on a Volume Group consisting of the four LUNs. I did not adjust stripe or block sizes. XFS defaults to a 4K block size; LVM on OpenFiler defaults to a stripe size of 128K, I think.
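For anyone reproducing the OpenFiler side, the setup amounts to something like the following. Device names and the volume size are placeholders, and as noted, I accepted the default stripe and block sizes:

```shell
# Build a volume group from the four LUNs (device paths are examples).
pvcreate /dev/sda /dev/sdb /dev/sdc /dev/sdd
vgcreate vg_store /dev/sda /dev/sdb /dev/sdc /dev/sdd

# Striped logical volume across all four LUNs; -i is the stripe
# count, -I the stripe size in KB (128K, the default mentioned above).
lvcreate -i 4 -I 128 -L 25T -n lv_store vg_store

# XFS with its default 4K block size.
mkfs.xfs /dev/vg_store/lv_store
```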
On OpenSolaris the initial testing was done with a similar arrangement. A zpool was created out of the four LUNs, and a new file-system created on that zpool. The defaults were accepted.
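The equivalent on the OpenSolaris side is shorter still. Again, the pool name and device names here are placeholders:

```shell
# A plain striped pool across the four LUNs -- no raidz, since
# the P800 already presents RAID5 volumes to the OS.
zpool create tank c0t1d0 c0t2d0 c0t3d0 c0t4d0

# zpool create also creates a file system at the pool root; a
# dedicated child file system keeps test data and tuning isolated.
zfs create tank/test
```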
iostat testing on both platforms shows that I/O was performed in parallel to all four LUNs. It's just that, for some reason, the random workloads on OpenSolaris were significantly slower than those on OpenFiler.
This morning I did some concurrency testing, running 200 parallel operations on OpenSolaris, first with a working file-set fitting into cache and again with a file-set exceeding cache. The in-cache test returned random-write performance very close to the single-threaded random-write test on OpenFiler. The cache-busting test returned a random-write speed exceeding the single-threaded random-write test by about 20%.
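For the curious, iozone's throughput mode is what drives a parallel workload like this; my exact invocation may have differed, but it was along these lines (the per-thread file size here is illustrative):

```shell
# 200 parallel workers, 32K records. -t enables throughput mode,
# -s is the file size per worker; -i 0 (sequential write) must be
# selected before -i 2 (random read/write) can run.
iozone -t 200 -s 256m -r 32k -i 0 -i 2
```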
It might be that your RAID controller is actually honoring ZFS's cache-flush request on every transaction commit. Try running, as root:
echo zfs_nocacheflush/W0t1 | mdb -kw
And see if it makes any difference for your random writes.
To revert to default:
echo zfs_nocacheflush/W0t0 | mdb -kw
To make it permanent set the following parameter in the /etc/system file:
set zfs:zfs_nocacheflush = 1
and reboot the server.
For more details see http://www.solarisinternals.com/wiki/index.php/ZFS_Evil_Tuning_Guide
Another thing: you are running "iozone -s 40G -r 32k -r 64k", so I would set ZFS's recordsize to 32K rather than the default 128K. For random writes, the default recordsize with the above iozone run will cause each 32KB write to become a 128KB read, modify, and 128KB write, and that may drive your numbers down badly.
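The read-modify-write penalty described above is easy to quantify: with a 128K recordsize and 32K writes, every write moves four times the data in each direction. A quick sanity check of the arithmetic (sizes in bytes):

```shell
recordsize=131072   # 128K, the ZFS default
iosize=32768        # 32K iozone record
# Each sub-record write reads and rewrites a whole record, so bytes
# moved are amplified by recordsize/iosize in each direction.
echo "amplification: $(( recordsize / iosize ))x"
```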
I disabled cache-flushing, changed the recordsize to 32K, and reran my tests. The write problem actually got worse (read and write both dropped to 7.2MB/s), which is concerning. What this tells me is that there is something fundamentally hinky with the ZFS / SmartArray driver interaction on the OpenSolaris platform. I had to shoehorn the driver in in the first place, it being actually a Sol10 driver, so I was already suspicious of it.
Starting with build 132, the CPQary driver is integrated into OpenSolaris.
So maybe you should try build 134?