I've been checking out OpenSolaris for a NAS possibility, and it's pretty nifty. A different dialect than I'm used to, but still nifty.
Unfortunately, it seems to have a nasty problem in file I/O. Here are some metrics (40GB file, with 32K and 64K record-sizes).
OpenFiler
                                     random   random
      KB  reclen   write     read      read    write
41943040      32  296238   118598     15682    62388
41943040      64  297141   118861     23731    86620

OpenSolaris
                                     random   random
      KB  reclen   write     read      read    write
41943040      32  259170  1179515      8458     7461
41943040      64  244747  1133916     13894    13001

Identical hardware, different operating system. I've figured out that the stellar read performance is due to the ZFS 'recordsize' being 128K. When I drop it down to 4K, similar to the block size of XFS on OpenFiler, the read performance is very similar. What I don't get is what's causing the large difference in random I/O. Random-write is exceedingly bad. With the recordsize dropped to 4K on ZFS, the random-read gets even worse; I haven't stuck with it long enough to see what it does to random-write.
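For anyone wanting to repeat the recordsize experiment, it's a one-line property change. The pool and file-system names below are placeholders, not what I actually used:

```shell
# Check the current recordsize (the ZFS default is 128K).
zfs get recordsize tank/data

# Drop it to 4K to match XFS's block size. Note this only applies
# to files written after the change; existing files keep the
# recordsize they were created with, so rewrite the test file.
zfs set recordsize=4K tank/data
```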
Poking into iostat shows that both OpenFiler and OpenSolaris are striping I/O across the four logical disks available to them. I know the storage side is able to pump the I/O, as witnessed by the random-write speed on OpenFiler. The chosen file size is larger than local RAM, so local caching effects are minimized.
As I mentioned back in the know-your-IO article series, random-read is the best analog of the I/O pattern your backup process follows when backing up large, disorganized piles of files. Cache/pre-fetch will help with this to some extent, but the above numbers give a fair idea of the lower bound of speed. OpenSolaris is w-a-y too slow. At least, the way I've got it configured, which is largely out-of-the-box.
Unfortunately, I don't know if this bottleneck is a driver issue (HP's fault) or an OS issue. I don't know enough of the internals of ZFS to hazard a guess.
Please post details of the underlying hardware, RAID volumes, how you create the ZFS pool, etc. The same for Linux/XFS and how you tested performance.
The hardware is all HP. A SmartArray P800 controller, four MSA shelves housing 7.2K RPM drives. The RAID card is configured to present four separate 7.5TB Raid5 LUNs. The server is a DL360G6 with 32GB of RAM.
The performance test is fairly simple. I'm using iozone, which I've used here before (see the benchmarking category in the sidebar), with the command, "iozone -s 40G -r 32k -r 64k", testing two separate record sizes.
OpenFiler runs Linux 2.6.31. The storage was on an XFS volume created on a Volume Group consisting of the four LUNs. I did not adjust stripe or block sizes. XFS defaults to a 4K block size; LVM on OpenFiler defaults to a stripe size of 128K, I think.
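For anyone reproducing the OpenFiler side, the setup amounts to something like the following. Device names and the volume size are placeholders, and as noted, I accepted the default stripe and block sizes:

```shell
# Build a volume group from the four LUNs (device paths are examples).
pvcreate /dev/sda /dev/sdb /dev/sdc /dev/sdd
vgcreate vg_store /dev/sda /dev/sdb /dev/sdc /dev/sdd

# Striped logical volume across all four LUNs; -i is the stripe
# count, -I the stripe size in KB (128K, the default mentioned above).
lvcreate -i 4 -I 128 -L 25T -n lv_store vg_store

# XFS with its default 4K block size.
mkfs.xfs /dev/vg_store/lv_store
```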
On OpenSolaris the initial testing was done with a similar arrangement. A zpool was created out of the four LUNs, and a new file-system created on that zpool. The defaults were accepted.
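The equivalent on the OpenSolaris side is shorter still. Again, the pool name and device names here are placeholders:

```shell
# A plain striped pool across the four LUNs -- no raidz, since
# the P800 already presents RAID5 volumes to the OS.
zpool create tank c0t1d0 c0t2d0 c0t3d0 c0t4d0

# zpool create also creates a file system at the pool root; a
# dedicated child file system keeps test data and tuning isolated.
zfs create tank/test
```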
iostat testing on both platforms shows that I/O was performed in parallel to all four LUNs. It's just that, for some reason, the random workloads on OpenSolaris were significantly slower than those on OpenFiler.
This morning I did some concurrency testing, running 200 parallel operations on OpenSolaris, first with a working file-set fitting into cache and again with a file-set exceeding cache. The in-cache test returned random-write performance very close to the single-threaded random-write test on OpenFiler. The cache-busting test returned a random-write speed exceeding the single-threaded random-write test by about 20%.
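For the curious, iozone's throughput mode is what drives a parallel workload like this; my exact invocation may have differed, but it was along these lines (the per-thread file size here is illustrative):

```shell
# 200 parallel workers, 32K records. -t enables throughput mode,
# -s is the file size per worker; -i 0 (sequential write) must be
# selected before -i 2 (random read/write) can run.
iozone -t 200 -s 256m -r 32k -i 0 -i 2
```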
It might be that your RAID controller is actually honoring ZFS's cache-flush request on every transaction commit. Try running, as root:
echo zfs_nocacheflush/W0t1 | mdb -kw
And see if it makes any difference for your random writes.
To revert to default:
echo zfs_nocacheflush/W0t0 | mdb -kw
To make it permanent set the following parameter in the /etc/system file:
set zfs:zfs_nocacheflush = 1
and reboot the server.
For more details see http://www.solarisinternals.com/wiki/index.php/ZFS_Evil_Tuning_Guide
Another thing: you are running "iozone -s 40G -r 32k -r 64k", so I would set ZFS's recordsize to 32K rather than the default 128K. For random writes, the default recordsize with the above iozone run will cause each 32KB write to become a 128KB read, modify, and 128KB write, and that may drive your numbers down badly.
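The read-modify-write penalty described above is easy to quantify: with a 128K recordsize and 32K writes, every write moves four times the data in each direction. A quick sanity check of the arithmetic (sizes in bytes):

```shell
recordsize=131072   # 128K, the ZFS default
iosize=32768        # 32K iozone record
# Each sub-record write reads and rewrites a whole record, so bytes
# moved are amplified by recordsize/iosize in each direction.
echo "amplification: $(( recordsize / iosize ))x"
```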
I disabled cache-flushing, changed the recordsize to 32K, and reran my tests. The write problem actually got worse (read and write both dropped to 7.2MB/s), which is concerning. What this tells me is that there is something fundamentally hinky with the ZFS / SmartArray driver interaction on the OpenSolaris platform. I had to shoehorn the driver in in the first place, it being actually a Sol10 driver, so I was already suspicious of it.
Starting with build 132, the CPQary driver is integrated into OpenSolaris.
So maybe you should try build 134?