Know your I/O: Caching

As I strongly suggested in the 'Components' article, there is a lot of caching and buffering going on during I/O operations. Each discrete device does at least a little of this, and some can do quite a lot. Knowing what effects caching, and cache settings, have on your storage will allow you to better match needs to reality.

But first, a look at what's at each level.

  1. Application: The application itself may do caching.

  2. File-cache: The file-cache of your OS of choice does caching, though most have options to bypass this if requested. This is usually one of the biggest caches in the system.

  3. Controller Driver: The storage driver at minimum buffers I/O requests, and may do its own caching.

  4. Server Controller: The controller hardware itself buffers I/O for increased parallelism, and to handle errors.

  5. Storage network switches: Strictly speaking, if your storage fabric is switched, the switch also buffers I/O, though it is designed to keep I/O as in-order as possible while still maintaining performance.

  6. Storage Virtualization device: Has at minimum a small cache for reorganizing I/O for efficiency before forwarding the I/O stream to the internal buffers that talk to the storage devices behind it. If it's fronting direct-attach storage, it may have significant on-board cache.

  7. Storage Bus Controller: If the Storage Bus Controller is a discrete device, it will do buffering at minimum but is unlikely to do much caching.

  8. Disk Bus Controller: Can do quite a bit of caching, but can be configured to not cache writes more than strictly needed to commit them. Allowing write-caching can improve perceived speed by quite a bit, at the risk of losing I/O in sudden power-loss situations. This is usually one of the biggest caches in the system.

  9. Disk: More buffer than cache, the disk caches just enough to enable efficient read/write patterns.

As you can see from this list, the big caches are to be found in:

  • The application itself.

  • The Operating System cache.

  • The Disk Bus Controller.

At the most basic level, all this caching means that a sudden loss of power can mean lost writes. If you don't trust your power environment, you really have to take that into account. Some enterprise storage arrays have built-in batteries that are just beefy enough to flush internal cache and shut down in the case of a power outage. Others have onboard batteries that'll preserve cache for hours or even days. Still others don't have anything built in at all, and you have to provide the insurance through external means.
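From the application side, the only insurance you can write yourself is to explicitly force writes down the stack. A minimal sketch in Python, using a hypothetical log file:

    import os

    # Force a write through the caches the application can control.
    # The path is hypothetical.
    fd = os.open("/var/data/journal.log", os.O_WRONLY | os.O_CREAT | os.O_APPEND)
    try:
        os.write(fd, b"committed record\n")
        os.fsync(fd)  # ask the kernel to flush file data and metadata to the device
    finally:
        os.close(fd)

Note that fsync() only pushes the data down to the device; whether the controller and disk write-caches are flushed in turn depends on their cache policies, which is exactly why battery-backed cache matters.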

The Disk Bus Controller Cache

As I've mentioned before, this is very commonly baked into the same device as the Storage Bus Controller, and even the Server Controller in the case of direct-attach RAID cards. This cache can be minimal (64MB) or quite large (8GB or more), depending on what the exact device is and how old it is. It may be shared between multiple controllers in the case of multi-controller storage devices like the EVA, or each controller can have its own independent cache (LeftHand, EqualLogic).

In general, this cache can be configured on an array or LUN basis. The write policies are generally named 'write-through' and 'write-back'. Under the 'write-back' policy, the host device is notified that a write is committed as soon as it enters the Disk Bus Controller's cache. Under the 'write-through' policy, the host device is notified that a write is committed only once it has been sent to the disk itself.

Write-through is the safer of these two options, as the I/O operation itself is kept in volatile memory for as little time as possible. If you need very high assurance that all written data is really written, then you need to use write-through policy. Or, if your controller doesn't have a battery-backed cache, write-through is pretty much your only sane choice.

Write-back is the faster of these two options since it doesn't have to wait for the physical disk to respond to a write. Using this policy means that you are willing to accept that writes committed to controller-cache are as good as hitting disk. Use this if you and your application managers have very high confidence in your power environment.
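To make the trade-off concrete, here's a toy model of the two policies in Python. The latency numbers are made up for illustration; the point is what the host waits for before it sees the 'committed' acknowledgment.

    import time

    DISK_LATENCY = 0.008    # assumed: ~8 ms for a physical disk write
    CACHE_LATENCY = 0.0001  # assumed: ~0.1 ms to land in controller cache

    def write(policy):
        """Toy model: returns how long the host waits for the 'committed' ack."""
        start = time.perf_counter()
        time.sleep(CACHE_LATENCY)       # the write enters controller cache
        if policy == "write-through":
            time.sleep(DISK_LATENCY)    # ack is withheld until the disk confirms
        # under write-back, the ack goes out now; the disk write happens later
        return time.perf_counter() - start

    for policy in ("write-back", "write-through"):
        waited = sum(write(policy) for _ in range(100))
        print(f"{policy:14s}: host waited {waited:.2f}s for 100 writes")

With these assumed numbers, the write-through host waits roughly 0.8s for its 100 writes while the write-back host waits around 0.01s. That gap is the entire temptation.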

When it comes to reading, not writing, the bigger your cache the better your performance. These controllers will cache frequently requested blocks, which can provide very significant performance improvements. The best-case scenario is when all the blocks in use at any given time are held in controller cache, though this is very rarely the case.
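Vendors don't publish their exact replacement algorithms, but a plain least-recently-used cache captures the basic behavior. A toy sketch (real controllers layer read-ahead and smarter replacement on top of this):

    from collections import OrderedDict

    class BlockCache:
        """Toy LRU block cache, like a controller caching hot read blocks."""
        def __init__(self, capacity_blocks):
            self.capacity = capacity_blocks
            self.blocks = OrderedDict()   # block number -> data
            self.hits = self.misses = 0

        def read(self, block_no, read_from_disk):
            if block_no in self.blocks:
                self.blocks.move_to_end(block_no)   # mark as recently used
                self.hits += 1
                return self.blocks[block_no]
            self.misses += 1
            data = read_from_disk(block_no)         # slow path: go to the spindles
            self.blocks[block_no] = data
            if len(self.blocks) > self.capacity:
                self.blocks.popitem(last=False)     # evict least-recently-used
            return data

    cache = BlockCache(capacity_blocks=2)
    fake_disk = lambda n: f"block-{n}".encode()
    for n in (1, 2, 1, 3, 1):
        cache.read(n, fake_disk)
    print(f"hits={cache.hits} misses={cache.misses}")  # hits=2 misses=3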

Be wary of individual device cache policies, though. As a specific example of this, the HP MSA1500cs disables its internal cache when doing array operations such as rebuilding a RAID5/6 set, adding a disk to an array, or changing the stripe-size of a LUN. When this happens, the array is much less tolerant of high I/O levels. Even if the operation underway is not one that uses controller CPU, the lack of a cache for reads makes performance very noticeably sluggish when handling a large write. Yet another reason to know how your storage devices behave when handling faults.

Operating System Cache

The cache in the operating system provides the fastest performance boost for reads, as it is the cache closest to the requesting application. The size of this cache is very much dependent upon the specific Operating System, but in general modern OSs use all 'unused' RAM for the file-cache. Some provide a block-cache instead of, or alongside, a file-cache; check your OS for specifics. Some operating systems take the radical step of swapping out least-used RAM to the swap-file in order to increase the size of the file-cache.

Sizing this cache correctly can provide very significant performance improvements for applications that do more reads than writes. Ideally, you want this cache to be the same size or larger than the size of all open files on that server. This way all reads except the very first one are fulfilled through cache, and don't have to go to disk. With 64-bit memory limits and the price of RAM these days, it is a LOT easier to size a file-cache to the open data-set than it is to size the Disk Bus Controller cache.
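If you're on Linux, you can watch how much RAM is currently acting as file-cache. A quick sketch reading /proc/meminfo:

    # Linux-specific: how much RAM the OS is currently using as file-cache.
    def file_cache_mb():
        with open("/proc/meminfo") as f:
            fields = dict(line.split(":", 1) for line in f)
        cached_kb = int(fields["Cached"].strip().split()[0])
        return cached_kb / 1024

    print(f"file-cache currently holds ~{file_cache_mb():.0f} MB")

Compare that number to your open data-set and you know whether reads are being served from RAM or from disk.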

This caching is something some applications would rather not have happen, generally due to data-integrity concerns or because the application is accepting responsibility for caching data. For this reason, operating systems provide Direct I/O as a way to bypass the cache. These I/O operations still pass through the Kernel's storage stack, so there is still some buffering going on. Databases are the usual applications requesting Direct I/O, as they use their own internal algorithms for determining what data needs to be cached and what can be read directly from disk.
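On Linux, Direct I/O is requested with the O_DIRECT open flag, which in turn demands block-aligned buffers and transfer sizes. A minimal sketch, assuming a hypothetical file that is at least one 4KB block long:

    import mmap
    import os

    BLOCK = 4096  # O_DIRECT requires block-aligned buffers, offsets, and lengths

    # Hypothetical file; O_DIRECT is Linux-specific.
    fd = os.open("/var/data/dbfile", os.O_RDONLY | os.O_DIRECT)
    buf = mmap.mmap(-1, BLOCK)                 # anonymous mmap is page-aligned
    with os.fdopen(fd, "rb", buffering=0) as raw:
        n = raw.readinto(buf)                  # this read bypasses the OS file-cache
    print(f"read {n} bytes straight past the page cache")

The alignment requirement is the tell: the kernel is handing your buffer directly to the storage stack rather than copying through its own cache.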

Keep in mind that at the moment Direct I/O operations are still subject to the cache policy of the Disk Bus Controller. This may change as kernel drivers and disk bus controllers improve their ability to handle different classes of I/O. I have noticed in Linux Kernel change-logs a drive toward a priority system for I/O requests, which is the first step along this path. I expect that the ability to set the cache policy on a per-block basis is probably in the intermediate future. I could be wrong, though.

Application Caching

Not all applications cache, and not all that do are configurable. People requesting high-performance storage may be making requests based on 10-year-old best-practices documents. It is impossible to know the caching details of all applications, but it is a good idea to research the details of the application requests you get. The entity requesting storage resources may not know anything about their application, which means you have to do the work to figure out how the application handles this. Or maybe they know entirely too much, at which point you can work with them to ensure that everyone's needs are met.

Databases do their own caching to a very great degree, and are in fact likely to use Direct I/O methods to ensure performance.

Web-servers can do their own caching, which can involve quite a bit of memory. While not strictly a storage problem, it does reduce the memory available to the file-cache and to other processes.

Mail servers vary, but mailers like Postfix rely pretty heavily on the OS cache for speed. The more transactional they are, the higher the dirty-buffer rate. Cross the OS's limits on dirty-buffer percentages and you can hit performance tar-pits.
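On Linux, those limits are the vm.dirty_* sysctls: background writeback starts at one threshold, and writing processes stall outright at the other. A quick way to see where they sit:

    # Linux-specific: the thresholds that govern dirty (unwritten) file-cache.
    # vm.dirty_background_ratio: % of RAM dirty before background writeback starts.
    # vm.dirty_ratio: % of RAM dirty before writing processes are forced to stall.
    for knob in ("dirty_background_ratio", "dirty_ratio"):
        with open(f"/proc/sys/vm/{knob}") as f:
            print(f"vm.{knob} = {f.read().strip()}%")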

Do what you can, and be prepared to educate if you need to.


In the next article in this series, I put all of this together in an example.


Know your I/O: Access Patterns

Know your I/O: The Components

Know your I/O: The Technology

Know your I/O: Caching

Know your I/O: Putting it together, Blackboard

Know your I/O: Putting it together, Exchange 2007 Upgrade