Know your I/O: The Components

This is about the various layers of the storage stack. Not all of them will be present in any given system, nor are they all required, and several will probably be baked into the same device. But each one adds something, most notably another layer of abstraction. Enterprise-class shared storage systems can get mightily abstract, which can make engineering them correctly that much harder.

Back in the beginning of Intel x86-based PC hardware, storage was simple. DOS told the disk what to write and where to write it, and the disk obligingly did so in the order it was told to do it. Heck, when writing to disk, DOS stopped everything until the disk told it that the write was done. Time passed, new drive interfaces evolved, and we have the complexity of today.

Disk

Down at the bottom level is the disk itself. I'm not going into all the various kinds of disks and what they mean at this level; that's for the next post. However, some things are true of nearly all disks these days:

  • They all have onboard cache.

  • The ones you'll be using for 'enterprise' work have onboard I/O reordering (Native Command Queuing or Tagged Command Queuing). The drives you're buying for home use may have it.

Onboard cache and NCQ mean that even the disks don't commit writes in the order they're told to do them. They'll commit them in whatever order provides the best performance, based on what the drive knows. You'll get more out of this from rotational media than solid-state, but even SSDs have it (there it's called 'write combining', since writes are very expensive on SSDs).

Disk Bus Controller

This is what the Disk talks to. It could be the SATA port on your motherboard, or it could be the enclosure controller in your LeftHand storage module. The capabilities of this controller vary wildly. Some, like the SATA support baked into your southbridge, only talk to a very few devices. Others, like the HSV controllers in my EVAs, talk to over 50 drives at a time. Even with such a disparate assortment of capabilities, there are still some commonalities:

  • Nearly all support some kind of RAID, especially the stand-alone controllers.

  • All reorder I/O operations for performance. Those with RAID support perform parallel operations wherever possible.

  • Stand-alone controllers have onboard cache for handling read requests and, to some extent, writes.

More advanced devices also have the ability to hide storage faults from higher levels of the stack. Management info will still reveal the fault, but the fact that storage has failed (RAID5 rebuild time!) can remain hidden.

Storage Bus Controller

This is what the Disk Bus Controller talks to, and it faces the storage fabric, whatever that may be. Sometimes this is baked into the Disk Bus Controller, as with the EVA HSV controllers. Other times it's a stand-alone unit, such as the LeftHand and EqualLogic storage redirectors. Your southbridge doesn't bother with this step. The features offered by these devices have varied over the years, but they commonly include:

  • Directing traffic to the correct Disk Bus Controllers. This might be a one-time redirection, or it could be continual.

  • LUN masking, which presents certain storage to certain devices.

  • Failover support between multiple controllers.

  • Protocol translation, such as between Fibre Channel (storage bus) and SAS (disk bus), or iSCSI support.

Storage Virtualization

Also sometimes called a 'Storage Router'. I haven't worked with this stuff, but it presents multiple Storage Bus Controllers as a single virtual controller. This is handy when you want a single device to manage all of your storage access, or when you need to grant access to a device that doesn't have sufficient access controls of its own. As with routers on IP networks, these add a smidge of latency. Features include:

  • Fibre Channel routing, connecting two separate fabrics without merging them.

  • Protocol translation, such as between Fibre Channel and SAS.

  • Fine-grained access control.

Server Controller

This is the device that talks to the storage bus and is plugged into your server. Frequently called a Host Bus Adapter, the specific device may have features of the Disk Bus Controller and Storage Bus Controller baked into it, depending on what it is designed to do. This device typically includes at minimum:

  • A certain amount of onboard cache.

  • The ability to reorder transactions for better performance.

More advanced versions, such as those attached to multi-device buses like Fibre Channel and SAS, also have a common feature set:

  • The ability to handle multiple paths to storage.

  • The ability to hide, to a point, certain storage events, such as path failovers and transient slow-downs, from the host operating system.

Controller Driver

This is the operating system code that talks to the controller; the storage stack in the kernel talks to the driver. The complexity of this code has increased significantly over the years, as has its role in the operating system's overall I/O stack. Different operating systems place it in different spots. At any rate, modern drivers do have a common feature set:

  • They reorder transactions for better performance, such as parallelizing operations between multiple controllers.

  • They interpret hardware storage errors for the operating system, transparently handle some of them, and provide the management channels needed by storage management software.

Storage Stack in the Kernel

This is the code that talks to the controller drivers. File-system drivers, and sometimes applications, talk directly to the storage stack. The kernel is the ultimate arbiter of who gets to access what inside of a server. The piece that sets the policy for how I/O gets handled is typically called the I/O scheduler. Linux has several schedulers available, and each can be tuned to some degree. Other operating systems have their own tunable parameters for manipulating scheduler behavior.

Some schedulers do in-kernel reordering of transactions, others explicitly do not.
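
As a concrete example, Linux exposes the scheduler choice for each block device through sysfs. Here's a minimal sketch in C that just reads it; the device name sda is an assumption, so adjust it for your system:

    /* Print the I/O schedulers available for one block device; the one
     * currently in use is shown in [brackets], e.g. "noop deadline [cfq]".
     * Assumes a disk named sda -- adjust for your hardware. */
    #include <stdio.h>

    int main(void)
    {
        const char *path = "/sys/block/sda/queue/scheduler";
        char line[256];
        FILE *f = fopen(path, "r");

        if (f == NULL) {
            perror(path);
            return 1;
        }
        if (fgets(line, sizeof line, f) != NULL)
            printf("%s", line);
        fclose(f);
        return 0;
    }

On most distributions you can switch schedulers on the fly by writing one of the listed names back into that same file, and each scheduler's tunables live in the iosched/ directory next to it.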

At this point the stack o' storage forks. On one hand we have the File-system Driver, and on the other we have applications leveraging Direct I/O to talk to the kernel without going through a file-system first. Databases are the main applications doing that, though the practice is diminishing somewhat these days as file-systems have become more accommodating of the need to bypass caching.

I/O Abstraction Layer

Not all operating systems support this, but this is what LVM, EVMS, and Microsoft Dynamic Disks provide. It allows the operating system to present multiple storage devices as a single device to a file-system driver. This is where 'software RAID' lives for the most part. File-systems like NSS and ZFS have this baked into them directly.

File-system Driver

This is the code that presents a file-system to applications on the server. It does all the file-system things you'd expect. These drivers provide a lot of features, but the ones I'm interested in are:

  • Provides a significant level of caching, possibly multiple GB worth.

  • Performs predictive reads (read-ahead) to improve read speeds (see the sketch after this list).

  • Handles the logical block ordering of files, mapping file offsets onto blocks on the underlying storage.

  • Provides a method (or not) for writes to bypass caching.
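
Those first two items aren't entirely out of the application's hands. On POSIX systems an application can hint the file-system cache about its access pattern with posix_fadvise(); here's a minimal sketch, with bigfile.dat standing in for whatever you're actually reading:

    /* Hint the file-system cache about our access pattern. These calls are
     * advisory: the kernel may ramp up read-ahead or drop cached pages,
     * but it's free to ignore us entirely. */
    #define _POSIX_C_SOURCE 200112L
    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    int main(void)
    {
        int fd = open("bigfile.dat", O_RDONLY);
        if (fd < 0) {
            perror("open");
            return 1;
        }

        /* We intend to read this file front to back: crank up read-ahead. */
        posix_fadvise(fd, 0, 0, POSIX_FADV_SEQUENTIAL);

        /* ... read the file here ... */

        /* We're done with it: let the cache evict these pages early rather
         * than crowding out data somebody else still wants. */
        posix_fadvise(fd, 0, 0, POSIX_FADV_DONTNEED);

        close(fd);
        return 0;
    }

Databases tend to want more than hints, which is where Direct I/O (next) comes in.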

Direct I/O Application Access

Some applications talk directly to the kernel for I/O operations. We hope they know what they're doing.
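
In practice that mostly means opening either a raw device or a regular file with O_DIRECT so that reads and writes skip the kernel's caching. Here's a minimal sketch against an ordinary file; the 4 KB alignment requirement and the file name directio.dat are assumptions:

    /* A bare-bones Direct I/O write: O_DIRECT skips the kernel's page cache
     * entirely, which is why the buffer, offset, and length must all be
     * aligned. 4096 bytes is assumed here; the real requirement depends on
     * the device and file-system. */
    #define _GNU_SOURCE              /* O_DIRECT is a Linux extension */
    #include <fcntl.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <unistd.h>

    #define ALIGN 4096

    int main(void)
    {
        void *buf;
        int fd;

        /* The buffer itself must be aligned, so plain malloc() won't do. */
        if (posix_memalign(&buf, ALIGN, ALIGN) != 0) {
            fprintf(stderr, "posix_memalign failed\n");
            return 1;
        }
        memset(buf, 'x', ALIGN);

        fd = open("directio.dat", O_WRONLY | O_CREAT | O_DIRECT, 0644);
        if (fd < 0) {
            perror("open");
            return 1;
        }

        /* One aligned block, straight past the file-system cache. */
        if (write(fd, buf, ALIGN) != ALIGN)
            perror("write");

        close(fd);
        free(buf);
        return 0;
    }

The payoff is that the application's own cache (a database's buffer pool, say) isn't duplicated by the file-system cache; the cost is that all the alignment and scheduling decisions the kernel was making for you are now your problem.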

File-based Application

Any application that uses files instead of direct I/O. This is everything from DB2 to dBase, to Apache, to AutoCAD, to Sendmail, to Ghost. Here, at the very top of the storage stack, is where I/O is initiated. It might hit disk, but there are enough layers of caching between here and 'Disk' that this isn't guaranteed. Even writes aren't guaranteed to hit disk if they're the right kind (such as a transient mail-spool file on the right file-system).
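
The flip side is that when an application genuinely needs its data on stable storage, it has to say so explicitly. Here's a minimal sketch of the difference, with important.dat as a stand-in file name:

    /* A buffered write() returns as soon as the data is in the kernel's
     * cache -- it may not be anywhere near a disk yet. If the application
     * genuinely needs the data on stable storage, it has to ask for it
     * with fsync(). */
    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    int main(void)
    {
        const char msg[] = "data we cannot afford to lose\n";
        int fd = open("important.dat", O_WRONLY | O_CREAT | O_APPEND, 0644);

        if (fd < 0) {
            perror("open");
            return 1;
        }

        /* Usually completes in microseconds: it only hits the cache. */
        if (write(fd, msg, strlen(msg)) < 0)
            perror("write");

        /* Blocks until the storage stack below reports the write as
         * committed -- however far down "committed" really means. */
        if (fsync(fd) < 0)
            perror("fsync");

        close(fd);
        return 0;
    }

Keep that in mind for the 2010 example below: which layer answers the fsync() determines how much 'committed' really means.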


See? There is a LOT of abstraction in the storage stack. The days when the operating system wrote directly to physical disk sectors are long, long gone. Even your smartphone, arguably the simplest storage pathway, has several components:

disk → chipset → driver → kernel → file-system driver → phone O/S

The early 1980s (IBM PC, MFM hard drive):

  1. User saves a Volkswriter file.

  2. DOS finds a free spot in the file-system, and tells the disk to write the data to specific blocks.

  3. The disk writes the data to the blocks specified.

  4. DOS returns control back to the user.

Compare this to:

2010 (HP DL360 G6, FC-attached EVA4400 with FC disks):

  1. User saves a Word file to a network share.

  2. Server OS caches the file in case the user will want it again.

  3. Server OS aggregates this file-save with whatever other Write I/O needs committing, grouping I/O wherever possible. Sends write stream to device driver for specific LUN.

  4. Since the write didn't specify bypass-caching, Server OS tells user the write committed. User rejoices.

  5. Device driver queues the writes for sending on the Fibre Channel bus, in and amongst the other FC traffic.

  6. EVA HSV controller receives the writes and caches them.

    • If the HSV controller was configured to cache writes, it informs the server that the write committed. If the user had specified bypass-caching, the Server OS informs the user that the write committed. User rejoices.

    • If the HSV controller was not configured to cache writes, the cache is immediately flushed, skipping step 7 and going straight to step 8.

  7. EVA HSV holds the writes in cache until it needs to flush them, at which point...

  8. HSV reorders pending writes for maximum efficiency.

  9. HSV sends write commands to individual disks.

  10. Disk receives the writes and inserts them into its internal command queue.

  11. Disk reorders writes for maximum efficiency.

  12. Disk commits the writes, and informs the controller it has done so.

    • If the HSV was not configured to cache writes, the HSV controller informs the Server that the write committed. If the user had specified bypass-caching, the Server OS informs the user the write committed. User rejoices.

We've come a long way in 30 years.

In my next article I'll talk about some technology specifics that I didn't go into here.


Know your I/O: Access Patterns

Know your I/O: The Components

Know your I/O: The Technology

Know your I/O: Caching

Know your I/O: Putting it together, Blackboard

Know your I/O: Putting it together, Exchange 2007 Upgrade