April 2010 Archives

What Apple sees of the future

By SysAdmin1138 on April 30, 2010 10:17 AM

Science Fiction author and armchair tech industry analyst Charles Stross has written an article on what the 'Apple/Adobe letter' signals for the future.

Go read it. It's good.

He sees the letter as a clear signal that Apple is actively trying to ensure that the Apple brand is relevant in a future in which computing is even more commoditized/cloudy than it is now. He sees PC hardware sales becoming even more of a loss-leader than they are now, and both Apple and HP (hello Palm purchase) have identified what the (profitable side of the) future looks like:

Wireless broadband everywhere
Very little local storage, basing everything on the cloud
Tight vendor controls on the software ecosystem, for safety. "Cross-platform" is for skeevy hackers.

There will still be traditional PC environments around, since Microsoft won't be able to allow a monopolistic stack like Apple's i$Device to develop for legal and technical reasons, but they won't be where all the money is being made. The real money in PC-land will be made in software, not hardware and OS. *I* suspect it'll get more annoying to rip apart a new machine to get linux on it. Because of naughtiness in the 90's, Microsoft won't be allowed to produce a vertically integrated Hardware/OS/Software stack like Apple is actively doing with iPhone/iPad.

The future has lots of mobile bandwidth, enough mobile bandwidth that having your primary data-storage be a few network hops away is not annoying; especially if there is a local, and large, cache available. The future is a lot more paternal, software will auto-update in the background without notifying you and will be hard to get around; you better hope updates don't trash your other software. The future has software cops preventing bad stuff from getting on your gear, and the software cops will be the device vendor (Apple, HP, Google).

At least, at the consumer level. How this all will interact with workplace environments is an open question. There are some tasks for which a full sized keyboard is really required, as well as 22" displays and high-volume printers. I strongly suspect there will be large computer-environment differences between home-computing and work-computing. We shall see how it develops.

Held to a higher standard

By SysAdmin1138 on April 28, 2010 2:07 PM

There are a variety of professions where mere strict adherence to the laws is not sufficient for maintaining a professional appearance. Some are subject to explicit professional ethical standards. Others, like Systems Administration, have an implicit ethical code. Sometimes I wish there was an explicit one to follow.

Yes, there is a moral standard I'm held to that is more than just 'don't get arrested'. People need to trust the guardians of their data, and that means meeting expectations for a position of high trust. Since there isn't a commonly accepted codified moral standard for Sysadmins, just exactly what the standard is changes from organization to organization.

This is one job where the mere accusation of wrong-doing can ruin a career. The accusation has to be meaningful in some way, it can't just be an office crank attempting to score points. I'm talking the, "Brought up on embezzlement charges, but the case was dropped due to lack of evidence," kind of accusation. Reputation matters, even to us suit-free IT geeks.

If I'm unlucky enough for something trust-bashing to make it to public-record, and therefore easy pickings for your standard pre-employment background-check, I may as well find a new line of work. While a future landlord wouldn't care that I was brought up on embezzlement charges but the case was thrown out on appeal, future employers care very much about that kind of thing. Such events can be purged from your public records, but... these days negative findings are sticky; it would not surprise me in the least that there are data-gathering firms out there that make sure that all negative findings are never purged from their own databases just so clients can know they happened. Once that kind of thing hits public record, my ability to be employed as a sysadmin in any organization of size is greatly reduced.

Heck, once the charge is laid it is entirely possible that I would be fired for cause. Never mind that the charge was dropped, or overturned on appeal. That's the downside of working in a trust-based industry.

And it's not just crime, it's internal politics as well. I have known IT workers who gleefully look at master contracts as their ticket to free software, baybeee! They take home installation media for whatever and the master license key provided by purchasing and install umpty hundred (thousand) dollars worth of software on their home machines. This sort of casual piracy can infect SysAdmins as well, since we're the kind of people who just might have a need for, say, Server 2008 Enterprise or Exchange 2010 in our homes. (why? ~~because we're nuts~~ Continuing education. Yeah, that's it.) This sort of behavior can turn supervisors and peers against a person.

And it isn't just piracy. Getting a reputation for exploiting your godlike access to casually browse other people's emails, or indulging in curiosity and peeking at the Budget Office's internal documents to see what the coming IT budget is likely to look like, can be just as damaging if it gets discovered. Users are, justly, paranoid about their privacy, and finding out that the sysadmin has been browsing their data for their own curiosity rather than as part of their job-duties is a sure-fire way to make enemies. We can obtain official sanction to look at other people's data a variety of ways, but if we exercise this access for purely personal reasons it is a violation.

This is the kind of thing that can trip up new sysadmins. Just because you have access doesn't mean you have authorization. I find navigating our large Shared volumes a bit tricky since I can see everything. Access is having Administrator rights to a whole system. Authorization is being asked by an employee's supervisor to go into a specific individual's mailbox to look for mails pertaining to topic X. Access does not directly imply authorization, and not everyone gets this.

This kind of thing can have significant consequences. If as part of an illicit information gathering regime (looking to see how a certain high-value IT purchasing contract is progressing without harassing actual people for updates) I discover that a certain individual in the Purchasing office has been doing something illegal, what do I do? I certainly had sufficient access to the data in question, and I am duty bound to report malfeasance whenever I run into it. The BOFH answer here is to shake down the employee in question in some way. Since BOFH is sysadmin dark humor, that's not really an answer. More realistically, what next? If I come forward with the evidence I have to provide some reason for why I was looking there in the first place, some reason other than "because I was snooping." As law enforcement will tell you, information found by way of an illegal activity is not admissible.

Losing the faith of your current employer is hazardous to your job, even if that activity won't splash on you enough to prevent you from finding work elsewhere. Annoy them enough, and they'll 'helpfully tip off' your future employer about your activities, which may cost you your new job before you actually start.

System Administrators are held to a higher moral standard than ye olde citizen. We don't have the benefit of having a codified professional standard to follow other than, 'keep your nose clean, and don't be evil.' There are some attempts to codify this standard, but they haven't penetrated the entire industry the way that, say, the standards for a Certified Public Accountant have. But that doesn't stop me from trying to live up to one.

Mandatory time off

By SysAdmin1138 on April 28, 2010 8:13 AM

The Governor signed the furlough bill, which I had been expecting for some time. The only reason she hadn't signed it yet is because she was waiting on analysis from the budget office. As the bill reads, all state agencies have to come up with 10 days in which to either close operations, or find some other way to save an equivalent chunk of salary money.

How does this apply to me? Well, we're not sure. The University President sent a message out to all staff last week describing what this bill would likely mean for WWU. It means we'll have to come up with $1,172,000 in salary savings one way or another, and if we can't do that we'll have to come up with whole days in which we can shut down operations.

I say 'close operations' even though the bill exempts anyone in a direct teaching function, so in theory we could still teach. However, that's teaching without any support staff whatsoever. Some can do it, others can't, and if things break in any way they'll stay broken until the next day. Suffice it to say, we can't plan on teaching on the furlough days.

Can we even find 10 days to shut down? Our biggest target is the summer/fall intersession where we have four weeks of no teaching, followed by the fall/winter intersession that's three vacation-heavy weeks as it is. Winter/spring, and spring/summer are only single weeks and... we can't afford to take a mandatory day off that week; there is simply too much changeover going on.

However, as the President said, we may not have to find 10 days. Maybe only five. Or if we're lucky, none. Agencies have to come up with a dollar figure cut in personnel expenses, which will come in the form of furloughs if the agency can't find any other way to reach it. Lay-offs are an option. As are leaving positions vacated through retirement open for longer, eliminating already open positions, and work-hour reductions.

So while there was much complaining in the office this morning about this bill, the exact nature of the impact we perceive to our jobs is solely in the hands of the WWU budget process. And we just don't know what that looks like yet.

Cable management arms

By SysAdmin1138 on April 26, 2010 3:27 PM | 7 Comments

This afternoon I racked three new 1U servers. These will be an extension of our current ESX cluster, and will facilitate our upgrade to vSphere.

They didn't come with cable management arms. The whole arm/no-arm debate has been raging in datacenter circles for years and years. We use them. Others don't. Vendors have to support both.

It could be that the arms are something of a minority thing, since getting them (at least from HP) requires a few extra steps. We'd like to have them for these servers since there are 9 cables attached to each server, and that's a lot of cables to undress/redress whenever you need to pull it out for maintenance. Happily, we're no longer snaking old skool KVM cables into these racks (YAY!), which is one less monster cable to worry about.

We need to determine if it is worth the effort (and budget) to get the arms for these servers. We'll see how this turns out.

What's your opinion on these arms? How do they work (if at all) in your environment?

Know your I/O, by someone else

By SysAdmin1138 on April 22, 2010 12:49 PM

Today I ran into another post that goes into a practical example of diagnosing I/O problems on a linux host. It includes actual math, unlike what I did earlier.

http://www.cmdln.org/2010/04/22/analyzing-io-performance-in-linux/

The author also included a series of links at the bottom of the post for 'further reading' about storage issues. Including a series of articles much like the one I just got done with, but with more of a virtualization point of view than I had.

Tape is dead, Long live Disk!

By SysAdmin1138 on April 21, 2010 4:07 PM

Except if you're using HP Data Protector.

Much as I'd like to jump on the backup-to-disk de-dup bandwagon, I can't. Can't afford it. It all comes down to the cost-per-GB of storage in the backup system.

With tape, Data Protector licenses on the following items:

Per tape-drive over 2
Per tape library with a capacity between 50 and 250 slots
Per tape library that exceeds 250 slots
Per media pool with more than some-big-number of tapes

With disk, DP licenses on the following items:

Per TB in the backup-to-disk system

Obviously, the Disk side is much easier to license. In our environment we had something like 500 SDLT320 tapes, and our library had 6 drives and 45 slots. We only had to license the 4 extra tape drives.

Then our library started crapping out, and we outgrew it anyway. Prime time to figure out what the future holds for our backup environment. TO DISK!

HOLY CRAP that's expensive.

HP licenses their B2D space by the Terabyte. After you do the math it comes down to about $5/GB. Without using a de-duplication technology, you can easily make 10 copies or more of every bit of data subject to backup. Which means that for every 1 GB of data in the primary storage, 10GB of data is in the B2D system, and that'll set us back a whopping $50/GB. So... about the de-duplication system...

Too bad it doesn't work for non-file data, and kinda sorta explicitly doesn't work for clustered systems. Since 70% or so of our backup data is sourced from clustered file-servers or is non-file data (Exchange, SQL backups), this means the gains from HP's de-dup technology are pretty minor. Looks like we're stuck doing standard backups at $50/GB (or more).

So, about that 'dead' tape technology! We've already shelled out for the tape-drive licenses so that's a sunk cost. The library we want doesn't have enough slots to force us to get that license. All that's left is the media costs. Math math math, and the amortized cost of the entire library and media set comes to about $0.25/GB. Niiice. Factor in the magnification factor, and each 1 GB of backup will cost $2.50/GB, a far, far cry from $50/GB.
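The disk-versus-tape comparison above boils down to one multiplication: the per-GB cost of the backup medium times the number of copies kept of each primary GB. A minimal sketch of that arithmetic, using only the rough figures quoted in this post (the function and variable names are mine, for illustration):

```python
# Effective cost per GB of primary data =
#   cost per GB on the backup medium * copies retained of each primary GB.
COPIES_RETAINED = 10  # "10 copies or more" without de-duplication, per the post


def effective_cost_per_primary_gb(cost_per_backup_gb: float,
                                  copies: int = COPIES_RETAINED) -> float:
    """Cost of protecting one GB of primary storage."""
    return cost_per_backup_gb * copies


disk_cost = effective_cost_per_primary_gb(5.00)  # B2D licensing, about $5/GB
tape_cost = effective_cost_per_primary_gb(0.25)  # amortized library + media, about $0.25/GB

print(f"disk: ${disk_cost:.2f} per primary GB")
print(f"tape: ${tape_cost:.2f} per primary GB")
print(f"disk costs {disk_cost / tape_cost:.0f}x what tape does")
```

The copy count cancels out of the final ratio, so the 20x gap holds whatever the real retention multiplier turns out to be. Only a de-dup system that shrinks the disk-side multiplier would change it.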

We still have SOME backup to disk space. This is needed since these LTO4 drives are HUNGRY critters, and the only way to feed them fast enough to prevent shoe-shining is to back everything up to disk, and then copy the jobs to tape directly from disk. So long as we have a week's worth of free-space, we're good. This is a sunk cost too, happily.

So. To-disk backups may be the greatest thing since the invention of the tape-changing robot, but our software isn't letting us take advantage of it.

Clean closets, proof

By SysAdmin1138 on April 20, 2010 9:43 AM | 2 Comments

This is actually the second round of closet cleaning we've done, but this one included a lot more really old stuff. Photographic proof is under the cut.

Continue reading Clean closets, proof.

New backup hardware

By SysAdmin1138 on April 19, 2010 8:24 AM

Friday represented the first production use of our new LTO4-based tape library. This replaced the old SDLT320 based Scalar 100 we've had for entirely too long. The simple fact that all of the media and drives are BRAND NEW should make our completion rate go very close to 100%. This excites me.

Friday we did a backup of our main file-serving cluster and the Blackboard content volume in a single job that streamed to a single tape drive.

Total data backed up: 6.41TB
Total time: 1475 minutes
Speed: 4669 MB/Min, or 77 MB/s

Still not flank speed for LTO4 (that's closer to 120 MB/s) but still markedly faster than the SDLT stuff we had been doing. The similar backup on the Scalar 100 took around 36 hours (2160 minutes) instead of the 24ish hours this one took, and it used 4 tape drives to do it.
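For anyone who wants to redo the comparison with their own job numbers, here is a small sketch of the conversions involved. It starts from the MB/Min figure the backup job reported rather than recomputing it from the data total, and the 120 MB/s native rate is the approximate figure quoted above, not a measured one:

```python
LTO4_NATIVE_MB_S = 120  # approximate native LTO4 streaming rate quoted above


def mb_per_min_to_mb_per_s(mb_per_min: float) -> float:
    """Convert a backup job's MB/min figure to MB/s."""
    return mb_per_min / 60


new_rate = mb_per_min_to_mb_per_s(4669)            # as reported by the new job
pct_of_native = new_rate / LTO4_NATIVE_MB_S * 100  # how close to streaming speed
wall_clock_speedup = 2160 / 1475                   # old job minutes / new job minutes

print(f"{new_rate:.1f} MB/s, about {pct_of_native:.0f}% of native LTO4 speed")
print(f"{wall_clock_speedup:.2f}x faster wall-clock, on one drive instead of four")
```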

Ahhhh, modern technology, how I've desired you.

*pets it*

Now to resist taking a fire ax to the old library. We have to surplus it through official channels, and they won't take it if it has been "obviously defaced". Ah well.

Old closets

By SysAdmin1138 on April 16, 2010 2:18 PM | 1 Comment

We've spent a good chunk of today cleaning out old gear. It is an axiom of the IT world that unused cubicles gather old, unused equipment, and supply closets are even worse. We did a big purge two months ago, and this is the second step of it. I've spent the intervening months wiping disk drives so we can get rid of them.

Today I cracked the case on an IBM AT. I haven't been inside one of those since 1996. This wasn't original issue though, as it had an AMD-386 CPU in it and a non-IBM motherboard. But still, an AT. That and the DDS2 tape changer are rivals for oldest-crap in that room.

An old HP Pentium Pro 200MHz server booted, which is good. It means we can at least wipe that era of drive now.

And more monitors than we knew were around are now awaiting the surplus people to haul them away. Including at least one HP-server monitor, dating from the era when all servers came with monitors. Scary, huh?

In 1996 in one of my first jobs after college, I was tasked with cleaning out another closet like this one. The department was upgrading their desktops to Pentium-class machines, and was surplussing anything older than a 486-33DX. As it happens, the remains of the previous replacement cycle were still in the back of the graveyard, and that was one big pile of IBM PC, XT, and AT machines. I was pulling network cards (10-base-2) for reuse in the newer stuff which shipped with the wrong kind of ethernet card (10-base-T), and tagging everything else for surplus. It was a deep closet. I got to know the Surplus people real well.

Also? Then as now, everything is dusty. These old boat-anchors I'm working with today spent some of their service-life parked on carpet and are therefore filled with dust. The stuff that only lived in a datacenter has only minimal dust.

2010-11 budget

By SysAdmin1138 on April 14, 2010 10:20 AM

The WA legislature finally, finally, got around to passing an amended budget for the second half of the biennium. They had to fill budget holes, and have spent the last three and a half weeks in Special Session arm-wrestling between the House and Senate versions. The main sticking point was over revenue-enhancers (sometimes referred to as 'new taxes'). Anyway, they reached an agreement, and the Governor should sign it real soon now. This means that we (WWU) now know what our budget cut is going to be (5.2%).

WWU's Budget Planning office has a nice chart up describing how our cut has evolved as the legislative session progressed: link.

In a letter to all staff this morning, the President said:

What remain possibilities at this point could affect 39 positions. Of these, 10 that are currently occupied would either be eliminated or have the FTE reduced. An additional 7 would be continued but funding sources would be changed. The remainder, 22, are currently vacant or will become so through retirement.

I'm safe. Technical Services staff has been told that any one of us will cause grievous pain in the event of a departure, so the cuts would have to be very bad for us to get passed an ax.

On the down side, this does mean that normal expansion-of-business upgrades will be much harder to fund. Exciting times.

The passing

By SysAdmin1138 on April 12, 2010 6:37 PM

At 16:40 this afternoon I issued the final 'cluster down' command on the WUF cluster. This 6 node NetWare cluster was born August 26, 2003 as NetWare 6.0. It replaced a trio of large file servers (Huey, Dewey, and Louie) that had been providing file-serving to campus, and allowed this critical function to be provided in a highly available way.

As of 16:40 April 12th, 2010, WWU Information Technology Services was no longer in the Novell File Serving business. Other entities on campus still provide this service. ITS continues to provide identity management and replica hosting to eDirectory.

The remains of WUF will be cleaned up over the next couple of days.

Storage Administration

By SysAdmin1138 on April 7, 2010 4:35 PM

That last series of articles might suggest I've been doing storage administration for a while. And I have. But every so often I run across an article that just reminds me that I'm still in the shallow end.

Like this article from The Register, going over Quantum's new mega-library, the i6000. I have a buddy who has an i2000 and I've petted it. Lovingly. *sigh* This new baby can store 8PB. Petabytes, baaybee. LTO5. Mmmm. Sexy.

Storage is a major concern just now. One of the main reasons that there are still IT stacks on campus that aren't centralized is storage. We have researchers, generally in the College of Science and Technology, that use departmental, rather than central, resources for storing their data. Departmental means servers, so CST represents the biggest non-ITS concentration of IT at WWU. They don't have any shared storage arrays over there, so they make do with large direct-attach-storage servers. A quick back-of-envelope calculation says that they have about as much storage in DAS as we have on our fastest SAN-attached storage array. Combine that with the chronic storage shortages central IT has had for the past, oh, 15 years and you have an entrenched set of servers over there.

If they were to join us in ~~the borg~~ ITS my area just might crack 100TB in disk space. Ooo. An i2000 with LTO4 still would be overkill for a storage network that large. And the i2000 can expand to several cabinets.

Yeeeah. WWU is still strictly small time when it comes to storage. In a lot of ways I'm a Stand Alone Storage Administrator.

Know your I/O: Putting it together, Exchange upgrade

By SysAdmin1138 on April 6, 2010 9:30 AM | 2 Comments

It has come time to upgrade the email system to Exchange 2010. What's more, this time you have to build in email archiving. Rumors of 'unlimited email quota' have already leaked to the userbase, who have started sending you and your team candy and salty-snacks to urge you to get it done faster. Clearly, the existing mail-quota regime is not loved. The email team already has figured out what they want for software and server count, they just want your opinion for how to manage the storage. Meetings are held, email is sent around. A picture emerges.

Current email data is 400GB, with a fairly consistent but shallow growth curve.
Number of users is 4500, which makes it around 91MB per user.
The 'free' webmails all have quotas measurable in gigs.
Long-standing Helpdesk policy for handling email-retentive users is to convince them that PST files do the same thing.
This is email. Any outage is noticed.
Depending on time of the business cycle, internet-sourced mail-volume comes to between 2 and 6 GB a day. It is unknown what internal-generated mail-volume comes to.
The Email Archiving product is integrated into Exchange. Mail older than a system defined threshold is migrated to the Archive system and a pointer left in the mailbox. Any deleted mail is migrated to Archive immediately.
Bitter experience has shown that recovering whole mail servers can take multiple days when it has to be done from tape backup. Hours is the new target.
New Helpdesk policy will be to convince people that PST files aren't trustworthy and to keep all their email they want to save inside their backed-up mailboxes.

And now for the analysis.

Read/Write Percentage: I/O monitoring on the existing Exchange 2007 system suggests a 70/30 ratio. The log-files are 100% writes of course. On the Archiving side, it is predicted to be 20/80 ratio, as very little old email is actually accessed.
Average and Peak I/O: Peak I/O happens during the full database backups, and dwarfs Average I/O by a factor of 10. On the archiving side, the factor is even higher.
I/O Access Type: Highly random, and typically pretty small individual accesses. Queuing volumes are very highly transactional. Log files are constantly updated. On the Archiving side, significantly random but mostly writes.
Latency Sensitivity: Significant. Outlook Cached-Mode shields users from quite a lot, but they do notice when it takes longer than 15 seconds to send an email to someone across the cube-wall. As Exchange is DB backed, slowdowns in the transaction logs slow down the entire system, so those are very highly latency sensitive. On the Archiving side, reads via Outlook need to be fast and are proxied through the Exchange system itself.
Storage Failure Handling: The latency sensitivity of the transaction logs suggests that the system has low tolerance for failure-induced slow-downs. The mailbox databases themselves have a higher tolerance but not overly so. On the Archiving side, as the amount of read access to that system is predicted to be much smaller than that of the 'online' email system, tolerance for slowdowns is higher.
Size and Growth: PRIMED FOR EXPLOSIVE GROWTH. Email growth has been repressed through draconian email quotas, which are now being removed. Users are used to GB-sized mailboxes on their private email systems. Some mailbox DB space will be liberated when the new Archiving system comes online and removes the 6+ month old email. Plan on no email being deleted for 6 months. 180 days, 3GB/day for internet-sourced email, call it 6GB/day for internal-sourced email, and you have 1.6TB for just your online mailbox databases and constantly growing as the average email size increases. The Archive system would grow 3.2TB a year for the first year.
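The sizing arithmetic in that last item can be laid out as a short sketch. The inputs are the planning figures from the scenario. The constant names are mine, and the use of 1024 GB per TB is an assumption that happens to reproduce the rounded numbers above:

```python
RETENTION_DAYS = 180      # mail stays in the online mailbox for 6 months
INTERNET_GB_PER_DAY = 3   # planning figure from the 2-6 GB/day range
INTERNAL_GB_PER_DAY = 6   # the "call it" estimate; internal volume is unmeasured
GB_PER_TB = 1024          # assumption: binary units

daily_gb = INTERNET_GB_PER_DAY + INTERNAL_GB_PER_DAY

online_gb = daily_gb * RETENTION_DAYS   # online mailbox DBs once quotas are lifted
archive_gb_first_year = daily_gb * 365  # all mail ends up in the archive eventually

current_mb_per_user = 400 * 1024 / 4500  # today's 400GB spread over 4500 users

print(f"online mailbox DBs: {online_gb} GB (~{online_gb / GB_PER_TB:.1f} TB)")
print(f"archive, first year: {archive_gb_first_year} GB "
      f"(~{archive_gb_first_year / GB_PER_TB:.1f} TB)")
print(f"current average mailbox: {current_mb_per_user:.0f} MB")
```

The internal-mail figure is the weakest input, since nobody has measured it. Doubling it to 12GB/day pushes the online estimate to roughly 2.6TB, which is a good argument for measuring it before buying disks.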

The three main attributes driving the storage system are: size of the entire system, latency tolerance, and disaster-recovery engineering. The average I/O and I/O access types of the online system strongly, strongly suggest 15K SAS drives for usage. On the Archive side, the transaction logs should be on 15K SAS, but the data volumes could survive on 7.2K SAS.

The existing storage infrastructure includes several SAN-based storage arrays. The existing email system has been on the fastest one (FC-based, not SAS) and never suffered a fault. Analyzing the usage of the existing FC array shows plenty of head-room in controller CPU and disk queue lengths. 2TB of space will be needed on this system if it will house the online Exchange mailbox databases. RAID5 is sufficient, and rebuilds have not affected I/O performance on this system so far.

Another array containing a mix of 7.2K SATA and smaller 7.2K SAS drives also exists. The reliability of the SAS drives meets the reliability demands of the application, and that's what the Email Admins want to use. However, they'll need 6TB of it to start with and the ability to add more as mail grows ever larger. Analysis shows that existing controller CPU demands are minimal, but disk queue lengths are showing signs of periodic saturation.

Exchange has some disaster-recovery mechanisms built into it, which the Email Admins opt to use instead of array-based mirroring. This will require mirroring the online database in a remote site. This remote site has a single storage array populated with 7.2K SATA drives already showing signs of regular saturation, and performance tanks when doing a rebuild.

The existing Fibre Channel Drive based storage array has enough room to handle the new online mail system. The SAS/SATA one will require the purchase of new 7.2K SAS disks to dedicate to the Archive system. The controller on this second system has enough horsepower to drive the added disks, and should not run into I/O contention with the already busy disks. The DR site will require the purchase of a brand new disk array, 7.2K SAS disks being the most probable choice.

The Archive systems will have their Transaction Log volumes on the FC SAN, and their data volumes on the SAS SAN. The Online system will have both transaction and data volumes on the FC SAN. The DR system will use periodic log-shipping, and keep both volume types on the local SAS disks.

Know your I/O: Access Patterns

Know your I/O: The Components

Know your I/O: The Technology

Know your I/O: Caching

Know your I/O: Putting it together, Blackboard

Know your I/O: Putting it together, Exchange 2007 Upgrade

Know your I/O: Putting it together, Blackboard

By SysAdmin1138 on April 6, 2010 8:55 AM

That was a lot of reading (and writing). How about a concrete example or two to demonstrate these concepts?

Scenario: Blackboard

It has come time to upgrade the elderly Blackboard infrastructure. E-Learning has seriously taken off since the last time any hardware was purchased for this environment, and there is a crying need for storage space. You, the intrepid storage person, have been called in to help figure out how to stop the sobbing. You and the Server person pass knowing looks going into the meeting, perhaps because that's also you.

The Blackboard Administration team has some idea what's going on. They have 300GB of storage right now. The application is seriously high-availability since professors have taken to using it for passing in homework and dealing with tests, the very definition of a highly-visible line-of-business application for a University campus. Past trends indicate that space is growing around 50GB a quarter and increasing as average file sizes grow and more and more teaching staff start using the system.

After asking a few key questions you learn a few things.

The read/write ratio is about 6:4 for the file storage.
The service is web-fronted, so whole files are read and written. Files are not held open and updated all day.
2 years of courses are held online for legal reasons, so only 1/8th of the data is ever touched in a quarter.
The busiest time of the quarter is in the three weeks up to and including finals week, as students hand in work.
The later in the quarter it gets, the more late-night access happens.
Once a quarter there is a purge of old courses.
The database backing this application has plenty of head-room and already meets DR requirements, so you don't have to worry about that.
Words can not explain how busy the Helpdesk gets when Blackboard is down.
Fast recovery from a disaster is a paramount concern. Lost work will bring the wrath of parents upon the University.

A nice list of useful facts. From this you can determine many things:

Read/Write percentage: This was explicitly spelled out, 60%/40%. What's more, since the storage is fronted by web-servers, write performance is almost completely hidden from end-users due to the very extensive app-level caching and no one expects uploading to be fast, just download.
Average and Peak I/O rates: Because only an eighth of the data is accessed during a quarter, and the need for fast recovery is there, the weekend backup is the largest I/O event by far. User generated I/O occurs in the weeks approaching finals week, but doesn't come to even a fifth of backup I/O.
Latency Sensitivity: As this is a web-fronted storage system that reads and writes whole files, this system is not significantly latency sensitive. As it can tolerate high latencies, this reduces the amount of hardware required to support it.
I/O Access Type: User generated I/O will be infrequent random accesses. System generated I/O, that backup again, will be large sequential. Due to the latency tolerance of the system, a degradation of random I/O speeds during the large sequential access is permissible.
Storage Failure Handling: More of an implementation detail, but the latency tolerance of the system allows much more flexibility in selecting an underlying storage system. If random I/O is noticeably degraded during the backup, then tests will need to be made to see how bad it gets when the disk array is rebuilding after a failure.
Size and Growth: The app managers know what they have and what past growth suggests. However, storage always grows more than expected. The app managers said outright that they're experiencing two kinds of growth: new users to the system, and changing usage patterns by existing users. In other words, whatever system gets created, ease of storage expansion needs to be a high priority.
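To put rough numbers on "storage always grows more than expected", here is a minimal projection sketch. The 300GB starting point, the 50GB per quarter trend and the two-year retention come from the scenario. The `growth_increase` knob is purely hypothetical, standing in for the "and increasing" part the app managers could not quantify:

```python
CURRENT_GB = 300            # storage in use today
GROWTH_GB_PER_QUARTER = 50  # past trend; the scenario says it is increasing
QUARTERS_RETAINED = 8       # 2 years of courses at 4 quarters a year


def project_linear(quarters: int) -> int:
    """Storage after `quarters` if growth stays at the past rate (a floor)."""
    return CURRENT_GB + GROWTH_GB_PER_QUARTER * quarters


def project_accelerating(quarters: int, growth_increase: float) -> float:
    """Same projection, but per-quarter growth itself rises by
    `growth_increase` (e.g. 0.10 for 10%) every quarter. Hypothetical knob."""
    total = float(CURRENT_GB)
    growth = float(GROWTH_GB_PER_QUARTER)
    for _ in range(quarters):
        total += growth
        growth *= 1 + growth_increase
    return total


active_fraction = 1 / QUARTERS_RETAINED  # share of data touched in any one quarter

for q in (4, 8, 12):
    print(f"{q:>2} quarters: linear {project_linear(q)} GB, "
          f"accelerating {project_accelerating(q, 0.10):.0f} GB")
```

The linear figure is best read as a floor. The gap between the two columns after a couple of years is the case for making expansion easy rather than sizing the array once and hoping.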

With this in mind and given the constraints of high availability (undoubtedly clustering of some kind) the shape of a system suggests itself. Direct-attach disk is off the table due to the clustering, so it has to be some kind of shared-access disk array. The I/O patterns and latency sensitivity do not suggest that high speed disks are needed, so those 15K SAS drives are probably overkill and SSDs not even in the same country. However, it does need to be highly reliable and still performant under the worst conditions: a disk failure during finals week.

The disaster-recovery question needs to be worked out with the backup people (which also may be you). This is an application where a live mirror on a separate storage system would be a good idea to maintain, as it would significantly reduce the downtime incurred if the file store were completely lost for some reason. Depending on the servers involved that kind of replication could cost quite a lot of money (in case the mirror is implemented in the storage array's mirroring software), or be free (in the case of Linux + DRBD). One for the application managers to figure out if they can afford.

The disks for this system need to be highly reliable and cheap, and that spells 7.2K RPM SAS. The storage quantities suggest that RAID10 could be a reasonable RAID level, but the latency tolerance suggests that RAID5/6 would be permissible. The need for shared storage means some kind of either iSCSI or Fibre Channel storage array, with iSCSI being the cheaper choice (presuming the network is prepared for it). The disk controller doesn't have to be terribly beefy, but still beefy enough to handle backup I/O while dealing with an array rebuild or disk-add.

This could be either a low to middle stand-alone storage array, or a modest increase in an existing one. Next step? Figuring out if this can fit in existing hardware or requires a new purchase. High availability doesn't require dedicated hardware! Or even a lot of it.

Know your I/O: Access Patterns

Know your I/O: The Components

Know your I/O: The Technology

Know your I/O: Caching

Know your I/O: Putting it together, Blackboard

Know your I/O: Putting it together, Exchange 2007 Upgrade

Know your I/O: Caching

By SysAdmin1138 on April 5, 2010 8:22 AM

As I strongly suggested in the 'Components' article, there is a lot of caching and buffering going on during I/O operations. Each discrete device does at least a little of this, and some can do quite a lot. Knowing what effects caching, and cache settings, have on your storage will allow you to better match needs to reality.

But first, a look at what's at each level.

Application: The application itself may do caching.
File-cache: The file-cache of your OS of choice does caching, though most have options to bypass this if requested. This is usually one of the biggest caches in the system.
Controller Driver: The storage driver at minimum buffers I/O requests, and may do its own caching.
Server Controller: The actual device buffers I/O for increased parallelism, and to handle errors.
Storage network switches: Strictly speaking, if your storage fabric is switched, the switch also buffers I/O. Though it is designed to be as in-order as possible while still maintaining performance.
Storage Virtualization device: Has a small cache at minimum to handle reorganizing I/O for efficiency, before forwarding the I/O stream to internal buffers talking to storage devices behind it. If it's fronting direct-attach storage, it may have significant on-board cache.
Storage Bus Controller: If the Storage Bus Controller is a discrete device, it will do buffering at minimum but is unlikely to do much caching.
Disk Bus Controller: Can do quite a bit of caching, but can be configured to not cache writes more than strictly needed to commit them. Allowing write-caching can improve perceived speed by quite a bit, at the risk of losing I/O in sudden power-loss situations. This is usually one of the biggest caches in the system.
Disk: More buffer than cache, the disk does cache enough to make efficient read/write patterns.

As you can see from this list, the big caches are to be found in:

The application itself.
The Operating System cache.
The Disk Bus Controller.

At the most basic level, all this caching means that a sudden loss of power can mean lost writes. If you don't trust your power environment, you really have to take that into account. Some enterprise storage arrays have built in batteries that are just beefy enough to flush internal cache and shut down in the case of power outage. Others have onboard batteries that'll preserve cache for hours or even days. Still others don't have anything built in at all, and you have to provide the insurance through external means.

The Disk Bus Controller Cache

As I've mentioned before, this is very commonly baked into the same device as the Storage Bus Controller, and even the Server Controller in the case of direct-attach RAID cards. This cache can be minimal (64MB) or quite large (8GB or more), depending on what the exact device is and how old it is. It may be used jointly between multiple controllers in the case of multi-controller storage devices like the EVA, or each controller can have its own independent cache (LeftHand, EqualLogic).

In general, this cache can be configured on an array or LUN basis. The write policies are generally named 'write-through' and 'write-back'. The 'write-back' policy is where the host device is notified that a write is committed when it enters the Disk Bus Controller's cache. The 'write-through' policy is where the host device is notified that a write is committed when it gets sent to the disk itself.

Write-through is the safer of these two options, as the I/O operation itself is kept in volatile memory for as little time as possible. If you need very high assurance that all written data is really written, then you need to use write-through policy. Or, if your controller doesn't have a battery-backed cache, write-through is pretty much your only sane choice.

Write-back is the faster of these two options since it doesn't have to wait for the physical disk to respond to a write. Using this policy means that you are willing to accept that writes committed to controller-cache are as good as hitting disk. Use this if you and your application managers have very high confidence in your power environment.
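
To make the difference in acknowledgement timing concrete, here is a toy Python sketch of the two policies. The latency numbers and function names are made up for illustration; they are not measurements from any real controller, and a real device has more moving parts than this.

```python
# Toy model of when a controller acknowledges a write under each policy.
# Both latency constants are assumed, illustrative numbers.
CACHE_LATENCY_MS = 0.1   # time to land a write in controller cache (assumed)
DISK_LATENCY_MS = 8.0    # time for the physical disk to take it (assumed)


def ack_latency_ms(policy):
    """Return how long the host waits before being told a write committed."""
    if policy == "write-back":
        # Host is told "done" as soon as the write is in controller cache.
        return CACHE_LATENCY_MS
    if policy == "write-through":
        # Host is told "done" only after the write has gone to disk.
        return CACHE_LATENCY_MS + DISK_LATENCY_MS
    raise ValueError(f"unknown policy: {policy!r}")


def at_risk_on_power_loss(policy, battery_backed):
    """True if an acknowledged write could be lost when power drops."""
    return policy == "write-back" and not battery_backed


for policy in ("write-through", "write-back"):
    print(policy, ack_latency_ms(policy), "ms until acknowledged")
```

The second function is the whole argument for battery-backed cache in one line: write-back is only risky when nothing is keeping that cache alive.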

When it comes to reading, not writing, the bigger your cache the better your performance. These controllers will cache frequently requested blocks, which can provide very significant performance improvements. The best-case usage scenario is if all the in-use blocks at any given time are held in controller cache, though this is very rarely the case.

Be wary of individual device cache policies, though. As a specific example of this, the HP MSA1500cs disables its internal cache when doing array operations such as rebuilding a RAID5/6 set, adding a disk to an array, or changing the stripe-size of a LUN. When this happens, the array is much less tolerant of high I/O levels. Even if the operation underway is not one that uses controller CPU, the lack of a cache for reads made performance very noticeably sluggish when handling a large write. Yet another reason to know how your storage devices behave when handling faults.

Operating System Cache

The cache in the operating system provides the fastest performance boost for reads, as it is the cache closest to the requesting application. The size of this cache is very much dependent upon the specific Operating System, but in general modern OSs use all 'unused' RAM for the file-cache. Some provide a block-cache instead of or alongside a file-cache; check your OS for specifics. Some operating systems take the radical step of swapping out least-used RAM to the swap-file in order to increase the size of the file-cache.

Sizing this cache correctly can provide very significant performance improvements for applications that do more reads than writes. Ideally, you want this cache to be the same size or larger than the size of all open files on that server. This way all reads except the very first one are fulfilled through cache, and don't have to go to disk. With 64-bit memory limits and the price of RAM these days, it is a LOT easier to size a file-cache to the open data-set than it is to size the Disk Bus Controller cache.
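
A small simulation shows why covering the whole open data-set matters so much. This is a sketch that assumes a plain least-recently-used cache and a made-up workload that reads the same 100 blocks in order, over and over; real operating system caches use smarter eviction than this, so treat it as the worst case for an undersized cache.

```python
from collections import OrderedDict


def lru_hit_count(accesses, cache_blocks):
    """Count cache hits for a sequence of block reads under plain LRU."""
    cache = OrderedDict()
    hits = 0
    for block in accesses:
        if block in cache:
            hits += 1
            cache.move_to_end(block)       # mark as most recently used
        else:
            cache[block] = True
            if len(cache) > cache_blocks:
                cache.popitem(last=False)  # evict the least recently used
    return hits


# Hypothetical workload: 100 distinct blocks read in order, 10 times over.
workload = list(range(100)) * 10

print(lru_hit_count(workload, 100))  # cache holds the whole data-set: 900
print(lru_hit_count(workload, 50))   # cache holds half of it: 0
```

With the cache as large as the data-set, only the first pass goes to disk. With half the cache, this particular access pattern evicts every block just before it is needed again, so nothing is ever served from cache.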

This caching feature is one that some applications would rather not happen, generally due to data-integrity concerns or because the application is accepting responsibility for caching data. For this reason, Direct I/O is provided by operating systems as a way to bypass the cache. These I/O operations still pass through the Kernel's storage stack, so there is still some buffering going on. Databases are the usual applications requesting Direct I/O, as they use their own internal algorithms for determining what data needs to be cached and which can be read directly from disk.

Keep in mind that at the moment Direct I/O operations are still subject to the cache policy of the Disk Bus Controller. This may change as kernel drivers and disk bus controllers improve their ability to have different classes of I/O. I have noticed a trend in Linux Kernel change-logs that there is a drive for a priority system in I/O requests, which is the first step along this path. I expect that the ability to set the cache policy on a per-block basis is probably in the intermediate future. I could be wrong, though.

Application Caching

Not all applications cache, not all applications are configurable. People requesting high performance storage may be making requests based on 10 year old best-practices documents. It is impossible to know the caching details of all applications, but it is a good idea to research the details of application requests you get. The entity requesting storage resources may not know anything about their application, which means you have to do the work to figure out how the application handles this. Or maybe they know entirely too much, at which point you can work with them to ensure that everyone's needs are met.

Databases do their own caching to a very great degree, and are in fact likely to use direct-I/O methods to ensure performance.

Web-servers can do their own caching, which can involve quite a bit of memory. While not strictly a storage problem, it does impact server resources for other processes that may use file-cache.

Mail servers vary, but mailers like Postfix rely pretty heavily on the OS cache for speed. The more transactional they are, the higher the dirty buffer rate. Cross OS limits for dirty buffer percentages and you can hit performance tar-pits.

Do what you can, and be prepared to educate if you need to.

In the next article in this series, I put all of this together in an example.

Sometimes wired is better

By SysAdmin1138 on April 2, 2010 10:15 PM | 1 Comment

At this moment a Windows 7 VM I have on my laptop is being copied to my home storage server. It's finally at a state where I can archive it in case of disaster. It works in VirtualBox just fine, my apps work, and it all runs stable. I've applied the license key, now to put a copy of this VM into the vault in case of sudden laptop death or some other calamity; the days of reusable license keys are now behind us so these precautions need to be taken.

All I can say right now is that I'm very glad I have wired networking in my house. If this 14GB copy had to happen over wifi I'd be up a lot longer than I am now. The copy is currently throttling on I/O waits on the receiving server (gotta get that thing upgraded, like soon. 8 yo hardware really is too slow) rather than pure network. I know from experience that I can expect between 2-3 MB/s transfer rate on the wireless side (no N yet, gotta change that too). Right now this transfer is averaging about 7 MB/s to an NFS export.

Wireless is good enough for most things, but when you need something streamed fast wired is better.

Know your I/O: The Technology

By SysAdmin1138 on April 2, 2010 7:44 AM | 2 Comments

We've all heard that SATA is good for sequential I/O and SCSI/SAS is better for random I/O, but why is that? And what are these new RAID variants that are showing up? Fibre Channel vs. iSCSI vs. FCoE? Books have been written on these topics, but this should help familiarize you with some of the high level details.

SATA vs SAS

When it gets down to the brass tacks, there are a few protocol differences that make SAS a more robust protocol. However, the primary difference between the two when it comes to performance is simple cost, not inherent betterness. It's a simple fact that a 15,000 RPM drive will provide faster random I/O performance than a 7,200 RPM drive, regardless of which one is SATA and which is SAS. It's a simple fact of the marketplace that the 15K drive will be SAS, and the 7.2K drive will be SATA.

This has a lot to do with tradition, and a healthy pinch of protocol differences.

First, tradition. Since the early days of enterprise computing on Intel hardware, the SCSI bus has reigned supreme. When the need for an interconnect able to connect more than two nodes to the same storage bus arrived, Fibre Channel won out. SCSI/FC allowed attaching many disks to a single bus, ATA allowed for... 2. The ATA spec was for, ahem, desktops. This is a mindset that is hard to break.

As for protocol differences, SAS is able to handle massive device counts much better than plain SATA can. SAS allows a single device to present multiple virtual devices to a server, where SATA can't. So in enterprise storage, especially shared storage, SAS is needed to provide the level of virtualization needed by a flexible storage environment. What's more, if a storage environment includes non-disks, such as a tape library with associated robotics, SATA has no capabilities for handling such devices. And finally, SAS has better descriptors for errors than SATA does, which improves the ability to deal with errors (depending on implementation, of course).

Combine the two, and you get a marketplace where everyone in the enterprise market has SAS already. While SATA drives can plug into a SAS backplane, why bother with SATA when you can use SAS? To my knowledge there is nothing stopping the storage vendors from offering 10K or even 15K SATA drives; I know people who'd love to use those drives at home. 10K SATA drives exist, the Western Digital VelociRaptor, though I only know of one maker marketing them.

The storage makers impose the performance restriction. Disks can be big, fast, cheap, or error-free; and you can only pick two of these attributes. SATA drives are, in general, aimed at the desktop and mid-market storage markets. Desktop class drives are Big and Cheap. Mid-market storage class drives are Big and Error Free. Enterprise and High Performance Computing drives are Fast and Error Free. The storage vendors have decided that SATA=Big, and SAS=Fast. And since FAST is what determines your random I/O performance, that is why SAS beats out SATA in random I/O.

Now you know.
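
For a rough feel of what the RPM difference means, here is a back-of-the-envelope sketch in Python. Average rotational latency is half a revolution, which follows directly from the spindle speed. The seek times are assumed ballpark figures, not taken from any datasheet, and the IOPS estimate ignores queuing and caching entirely.

```python
def avg_rotational_latency_ms(rpm):
    """Average wait for the sector to come around: half a revolution."""
    return 0.5 * 60_000.0 / rpm


def approx_random_iops(rpm, avg_seek_ms):
    """Rough random IOPS: one I/O per (seek + rotational latency)."""
    return 1000.0 / (avg_seek_ms + avg_rotational_latency_ms(rpm))


# Seek times below are assumptions for illustration only.
for label, rpm, seek_ms in (("7.2K", 7200, 8.5), ("15K", 15000, 3.5)):
    print(label,
          round(avg_rotational_latency_ms(rpm), 2), "ms rotational,",
          round(approx_random_iops(rpm, seek_ms)), "IOPS (approx)")
```

Under these assumptions the 15K drive comes out at better than twice the random I/O rate of the 7.2K drive, and none of that has anything to do with which connector is on the back.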

New RAID

If you're reading this blog, I'm assuming you know WTF RAID1 and RAID5 are, and can make a solid guess about RAID10, RAID01, and RAID50. In case you don't, the quick run-down:

RAID10: A set of mirrored drives which are then striped.
RAID01: A set of striped drives, which are then mirrored.
RAID50: A set of discrete RAID5 sets, which are then striped.

The new kid on the block, though not on the RAID spec-sheets, is RAID6. RAID6 is RAID5 with a second parity stripe-set. This means it can survive a double disk failure. You didn't really see RAID6 much until SATA drives started to penetrate the enterprise storage marketplace, and there is a good reason for that.
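
The capacity cost of each of these levels is simple arithmetic, sketched below in Python. The function names are mine, the RAID50 case assumes equally sized RAID5 sets, and the failure counts are the worst-case guarantee (RAID10 can survive more if the failures land in different mirrors).

```python
def usable_disks(level, disks, raid5_sets=2):
    """Number of disks' worth of usable space for a given RAID level."""
    if level == "RAID5":
        return disks - 1              # one disk's worth of parity
    if level == "RAID6":
        return disks - 2              # two disks' worth of parity
    if level in ("RAID10", "RAID01"):
        return disks // 2             # everything is mirrored once
    if level == "RAID50":
        return disks - raid5_sets     # one parity disk per RAID5 set
    raise ValueError(f"unsupported level: {level!r}")


def failures_always_survived(level):
    """Simultaneous disk failures survived no matter which disks fail."""
    return {"RAID5": 1, "RAID6": 2, "RAID10": 1,
            "RAID01": 1, "RAID50": 1}[level]


for level in ("RAID5", "RAID6", "RAID10", "RAID50"):
    print(level, usable_disks(level, 8), "of 8 disks usable,",
          "survives any", failures_always_survived(level), "failure(s)")
```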

Back when SATA started showing up in enterprise markets, the storage makers hadn't yet managed to get the build quality of their SATA drives to the standards they'd set for their SCSI drives. Therefore, they failed more often. And as the SATA drives were a lot bigger than their SCSI and FC brethren, a higher error rate meant a vastly higher chance of an error during a RAID5 rebuild. Thus, RAID6 entered the marketplace as a way to sell SATA drives and still survive the still-intrinsic faults.

These days SATA drives aren't as bad as they used to be, but the storage vendors are still sensitive to sacrificing the 'Cheap' from their products in the quest for lower error rates. The reason the mid-market drives are 'only' at 750GB while the consumer grade drives are hitting 2TB is that very error rate problem. An 8 drive RAID5 array of those consumer-grade 2TB disks gives a niiiice size of 14TB, but the odds of it being able to rebuild after replacing the bad drive are very, very small. A 20 drive RAID5 array of 750GB mid-market drives (yes, I know a 20 drive RAID5 array is bad, bear with me please) gives the same size, but has a far higher chance of surviving a rebuild.
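
The rebuild comparison can be roughed out with a little probability. A RAID5 rebuild has to read every surviving disk end to end, so the sketch below asks how likely it is to get through all those bits without one unrecoverable read error (URE). The URE rates are assumptions: 1 in 10^14 bits is a commonly quoted consumer-class spec and 1 in 10^15 a commonly quoted enterprise-class one. The model also treats errors as independent, which real drives don't promise.

```python
import math


def rebuild_success_probability(disks, disk_tb, ure_per_bit):
    """Chance of reading every surviving disk in a RAID5 set without
    hitting a single unrecoverable read error (URE)."""
    bytes_to_read = (disks - 1) * disk_tb * 1e12   # all surviving disks
    bits_to_read = bytes_to_read * 8
    # (1 - p) ** n, computed in log space to stay numerically stable.
    return math.exp(bits_to_read * math.log1p(-ure_per_bit))


# Assumed URE rates, see the note above.
consumer = rebuild_success_probability(8, 2.0, 1e-14)
midmarket = rebuild_success_probability(20, 0.75, 1e-15)
print(f"8 x 2TB consumer-grade: {consumer:.0%}")
print(f"20 x 750GB mid-market:  {midmarket:.0%}")
```

The exact percentages depend entirely on the assumed error rates, so don't read too much into them. The gap between the two arrays is the point: roughly the same amount of data has to be read either way, and the order-of-magnitude difference in error rate decides whether the rebuild is likely to finish.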

Storage Area Networking

Storage Area Networks came into the Intel datacenter on the heels of clustering, and soon expanded to include storage consolidation. A SAN allows you to share storage between multiple hosts, which is a central part of clustering for high availability. It also allows you to have a single storage array provide storage to a number of independent servers, allowing you to manage your storage centrally as a sharable resource rather than a static resource assigned to each server at time of purchase.

Fibre Channel was the technology that really launched the SAN into the wide marketplace. Earlier technologies allowed sharing between two servers (SCSI), or involved now-obscure interconnects borrowed from Mainframes. Fibre Channel managed to hit the right feature set to make it mainstream.

Fibre Channel had some nice ideas that have been carried forward into the newer iSCSI and SAS protocols. First and foremost, FC was designed for storage from the outset. The packet size was specified with this in mind, and in-order arrival was a big focus. FC beat the pants off of Ethernet for this kind of thing, which showed up during the early iSCSI years. What's more, FC was markedly faster than Ethernet, and supported higher data contention before bottlenecking.

The Ethernet based iSCSI protocol came about as a way to provide the benefits of a SAN without the eye-bleeding cost-per-port of Fibre Channel. The early years of iSCSI were somewhat buggy, and the reason can be summarized by a quote from a friend of mine who worked for a while building embedded firmware for iSCSI NICs:

"An operating system makes assumptions about its storage drivers. All data will be sent and received in order. They don't handle it well when that doesn't happen, some are worse than others. So when you start basing your storage stack on a network [TCP/IP] that has out-of-order arrival as an assumption, you get problems. If you absolutely have to go iSCSI, which I don't recommend, go with a hardware iSCSI adapter. Don't go with a software iSCSI driver."

This was advice from over five years ago, but it is illustrative of the time. These days operating systems have caught up to the fact that storage I/O can arrive out of order, and the software stacks are now much better than they were. In fact, if you can structure your network for it (increasing the MTU of your Ethernet from the standard 1500 bytes to something significantly larger than 4096 bytes, also known as 'jumbo frames') iSCSI provides a very simple and cheap way of getting the benefits of Storage Area Networking.
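
The jumbo frame point is easy to see with some quick arithmetic, sketched here in Python. It assumes 40 bytes of IPv4 and TCP headers per frame with no options, ignores iSCSI's own header, and uses 9000 bytes as the jumbo MTU because that is a commonly used value, not because anything above requires it.

```python
import math

TCP_IP_HEADER_BYTES = 40  # 20 IPv4 + 20 TCP, assuming no options


def frames_needed(payload_bytes, mtu):
    """Ethernet frames needed to carry one payload over TCP/IP."""
    per_frame = mtu - TCP_IP_HEADER_BYTES
    return math.ceil(payload_bytes / per_frame)


BLOCK = 4096  # one 4KB storage block; iSCSI's own header is ignored here
for mtu in (1500, 9000):
    print(f"MTU {mtu}: {frames_needed(BLOCK, mtu)} frame(s) "
          f"per {BLOCK}-byte block")
```

At the standard MTU a single 4KB block is split across three frames, each of which can be delayed or reordered on its own. With jumbo frames it fits in one.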

Serial Attached SCSI takes the same idea as iSCSI, making a SAN out of cheaper parts, and is seeing a lot of use for it. Unlike Fibre Channel, SAS is strictly a copper-based cabling system which restricts its distance. Think of it as a rack-local or row-local SAN. However, if you want a lot of storage in a relatively small space, SAS can provide. SAS switches similar to Fibre Channel switches are already on the market and able to connect multiple hosts to multiple targets. Like Fibre Channel, SAS also can connect tape libraries.

A new thing has been coming for a while called Fibre Channel over Ethernet, or FCoE. Unlike iSCSI, FCoE is not based on TCP/IP, it is an Ethernet protocol. The prime benefit of FCoE is to do away with the still very expensive-per-port Fibre Channel ports and use a standard Ethernet port. It will still require some enhancements on the Ethernet network, part of why it has been taking this long to ratify is standardizing what exactly needs to be done, but should be markedly cheaper to implement and maintain than traditional Fibre Channel. Unsurprisingly, Cisco is very into FCoE and Brocade somewhat lukewarm.

Read-only databases

By SysAdmin1138 on April 1, 2010 10:03 AM

I've been reading up on Active Directory read-only domain controllers (RODC), new in Server 2008. When I first glanced at them, they looked an awful lot like NDS read-only replicas which have been around since the advent of NetWare 4.0 too many years ago. Novell put r/o replicas into NDS in large part for complete X.500 compliance. However, their real use case was never made clear. The only case I could ever come up with is a kind of disaster-recovery site, where that R/O replica could be promoted to a R/W replica in an emergency. So why was Microsoft finally putting the last X.500 piece in now?

Turns out, it wasn't X.500, it was to solve a somewhat intractable problem with Active Directory domains; the satellite office problem. The Small Business Development Center is a part of the College of Business Education, and actually offices in downtown Bellingham. Before they got a reliable WAN connection to campus, they needed to be able to work when their internet connection was down. What we did was put all of those users into a single OU, made that OU a partition in eDirectory, and gave their NetWare server a copy of that replica. That way, only those security principals were ever at threat, and they could still log in and use resources local to them when their WAN link was down.

The same problem with AD is much trickier to solve, since you can't partition the AD database that way. You really had three options:

Tell the users to live with the outage.
Put a Domain Controller down there.
Declare the site a new Domain in the forest and put Domain Controllers down there.

Putting a DC down there meant that the site would have a full copy of your entire authentication database, which can represent a major security vulnerability if the site lacks any way to truly secure the DC's physical existence. AD Sites allow for more efficient use of WAN resources, but that doesn't change the fact that a full and complete copy of the domain was hosted there.

A Read-Only Domain Controller is NOT a full copy of the domain; it does not contain any passwords by default. Unlike a R/O NDS replica, users can actually authenticate against it; the server proxies the authentication against a normal DC if it can find one. You can set a password-caching policy to tell it which passwords to keep local copies of, so branch-local users can still log in when the WAN is down. That's... not useless. You're still having to keep the entire AD database down there complete with GPO SYSVOL goodness and all those groups, but at least if thieves run off with the RODC they'll only fully compromise local users.

It still isn't as robust as how eDirectory handles it, but at least it's a lot better than it used to be. Especially if politics prevent you from being able to declare a new domain.

Know your I/O: The Components

By SysAdmin1138 on April 1, 2010 9:32 AM

This is about the various layers of the storage stack. Not all of these will be present in any given system, nor are they required. Multiple things on this list will probably be baked into the same device. But they do add things, notably layers of abstraction. Enterprise-class shared storage systems can get mightily abstract, which can make engineering them correctly harder.

Back in the beginning of Intel x86-based PC hardware, storage was simple. DOS told the disk what to write and where to write it, and the disk obligingly did so in the order it was told to do it. Heck, when writing to disk, DOS stopped everything until the disk told it that the write was done. Time passed, new drive interfaces evolved, and we have the complexity of today.

Disk

Down at the bottom level is the disk itself. I'm not going into all the various kinds of disks and what they mean at this level; that's for the next post. However, some things are true of nearly all disks these days:

They all have onboard cache.
The ones you'll be using for 'enterprise' work have onboard I/O reordering (Native Command Queuing or Tagged Command Queuing). The drives you're buying for home use may have it.

Onboard cache and NCQ mean that even the disks don't commit writes in the order they're told to do them. They'll commit them in the order that provides the best performance, based on the data they have. You'll get more out of this from rotational media than solid-state, but even SSDs have it (it is called 'Write Combining', as writes are very expensive on SSDs).
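
To see why reordering pays off on rotational media, here is a small Python sketch that compares servicing a queue in the order it arrived against a simple elevator sweep. The queue and head position are made-up numbers, and a real drive's NCQ logic also accounts for rotational position, which this ignores.

```python
def head_travel(start, order):
    """Total distance the head moves servicing requests in this order."""
    travel, position = 0, start
    for block in order:
        travel += abs(block - position)
        position = block
    return travel


def elevator_order(start, pending):
    """Sweep upward from the head position, then back down."""
    upward = sorted(b for b in pending if b >= start)
    downward = sorted((b for b in pending if b < start), reverse=True)
    return upward + downward


# Hypothetical queue of block addresses, in the order the host sent them.
queue = [98, 183, 37, 122, 14, 124, 65, 67]
head = 53

print(head_travel(head, queue))                        # as-told order: 640
print(head_travel(head, elevator_order(head, queue)))  # reordered: 299
```

Same eight requests, less than half the head movement. The cost is that the write you sent first is not necessarily the write that lands first.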

Disk Bus Controller

This is what the Disk talks to. This could be the SATA port on your motherboard. Or it could be the Enclosure controller in your LeftHand storage module. The capabilities of this controller vary wildly. Some, like the SATA support baked into your southbridge only talk to a very few devices. Others, like the HSV controllers in my EVAs, talk to over 50 drives at a time. Even with such a disparate assortment of capabilities, there are still some commonalities:

Nearly all support some kind of RAID, especially the stand-alone controllers.
All reorder I/O operations for performance. Those with RAID support perform parallel operations wherever possible.
Stand-alone controllers have onboard cache for handling both read requests, and writes to some extent.

More advanced devices also have the ability to hide storage faults from higher levels of the stack. Management info will still reveal the fault, but the fact that storage has failed (RAID5 rebuild time!) can remain hidden.

Storage Bus Controller

This is what the Disk Bus Controller talks to, and faces the storage fabric, whatever it may be. Sometimes this is baked into the Disk Bus Controller, such as with the EVA HSV controllers. Other times, it's a stand-alone unit, such as the LeftHand and EqualLogic storage redirectors. Your southbridge doesn't bother with this step. The features offered by these devices have varied over the years, but offered features include:

Directing traffic to the correct Disk Bus Controllers. This might be a one time redirection, or it could be continual.
LUN masking, which presents certain storage to certain devices.
Failover support between multiple controllers.
Protocol translation, such as between Fibre Channel (storage bus) and SAS (disk bus), or iSCSI support.

Storage Virtualization

Also sometimes called a 'Storage Router'. I haven't worked with this stuff, but it presents multiple Storage Bus Controllers as a single virtual controller. This is handy when you want a single device to manage all of your storage access, or need to grant access to a device that doesn't have sufficient access controls on it. As with routers on IP networks, they too increase latency by a smidge. Features include:

Fibre Channel routing, connecting two separate fabrics without merging them.
Protocol translation, such as between Fibre Channel and SAS.
Fine grained access control.

Server Controller

This is the device that talks to the storage bus and is plugged into your server. Frequently called a Host Bus Adapter, the specific device may have features of the Disk Bus Controller and Storage Bus Controller baked into it, depending on what it is designed to do. This device typically includes at minimum:

A certain amount of onboard cache.
The ability to reorder transactions for better performance.

More advanced versions, such as those attached to multi-device buses such as Fibre Channel and SAS, also have a common feature set:

The ability to handle multiple paths to storage.
The ability to hide certain storage events, such as path failovers and transient slow-downs, from the host operating system, at least up to a point.

Controller Driver

This is the operating system code that talks to the controller. The storage stack in the kernel talks to the driver. The complexity of this code has increased significantly over the years, as has its place in the overall I/O stack in the operating system. Different operating systems place it in different spots. At any rate, modern drivers do have a common feature set:

They reorder transactions for better performance, such as parallelizing operations between multiple controllers.
They interpret hardware storage errors for the operating system and can transparently handle some of them, as well as provide the management channels needed by storage management software.

Storage Stack in the Kernel

This is the code that talks to the controller drivers. File-system drivers and sometimes applications talk directly to the storage stack. The kernel is the ultimate arbiter of who gets to access what inside of a server. The piece that does this arbitration is typically called the I/O scheduler, which sets the policy for how I/O gets handled. Linux has several schedulers available, and each can be tuned to some degree. Other operating systems have tunable parameters for manipulating scheduler behavior.

Some schedulers do in-kernel reordering of transactions, others explicitly do not.

At this point the stack o' storage forks. On the one hand we have the File-system Driver, and on the other we have applications leveraging Direct I/O to talk I/O to the kernel without going through a file-system first. Databases are the majority application doing that, though its use is somewhat diminishing these days as file-systems have become more accommodating of the need to bypass caching.

I/O Abstraction Layer

Not all operating systems support this, but this is what LVM, EVMS, and Microsoft Dynamic Disks provide. It allows the operating system to present multiple storage devices as a single device to a file-system driver. This is where 'software RAID' lives for the most part. File-systems like NSS and ZFS have this baked into them directly.
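
The simplest form of this abstraction is plain concatenation: several devices laid end to end and presented as one run of blocks. The Python sketch below shows the lookup involved. The class, device names, and sizes are hypothetical, and real volume managers such as LVM work in extents and can do a good deal more (striping, mirroring, snapshots) than this.

```python
from bisect import bisect_right
from itertools import accumulate


class LinearVolume:
    """Toy logical volume: several devices concatenated end to end."""

    def __init__(self, devices):
        # devices: list of (name, size_in_blocks) tuples
        self.names = [name for name, _ in devices]
        # Running totals mark where each device ends in logical space.
        self.ends = list(accumulate(size for _, size in devices))

    def locate(self, logical_block):
        """Map a logical block number to (device name, block on device)."""
        if not 0 <= logical_block < self.ends[-1]:
            raise IndexError("block outside the volume")
        index = bisect_right(self.ends, logical_block)
        device_start = self.ends[index - 1] if index else 0
        return self.names[index], logical_block - device_start


# Hypothetical device names and sizes.
volume = LinearVolume([("sda", 1000), ("sdb", 500), ("sdc", 2000)])
print(volume.locate(0))      # ('sda', 0)
print(volume.locate(1000))   # ('sdb', 0)
print(volume.locate(1499))   # ('sdb', 499)
print(volume.locate(1500))   # ('sdc', 0)
```

The file-system driver above this layer only ever sees one device with 3500 blocks. Which physical device a given block lives on is hidden from it completely.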

File-system Driver

This is the code that presents a file-system to applications on the server. It does all the file-system things you'd expect. These drivers provide a lot of features, but the ones I'm interested in are:

Provides a significant level of caching, possibly multiple GB worth.
Performs predictive reads to improve read speeds.
Handles logical block-order of files.
Provides a method (or not) for writes to bypass caching.

Direct I/O Application Access

Some applications talk directly to the kernel for I/O operations. We hope they know what they're doing.

File-based Application

Any application that uses files instead of direct I/O. This is everything from DB2 to Dbase, to Apache, to AutoCAD, to Sendmail, to Ghost. Here at the very top of the storage stack I/O is initiated. It might hit disk, but there are enough layers of caching between here and 'Disk' that this isn't guaranteed. Even writes aren't guaranteed to hit disk if they're the right kind (such as a transient mail-spool file on the right file-system).

See? There is a LOT of abstraction in the storage stack. The days when the operating system wrote directly to physical disk sectors are long, long gone. Even your smartphone, arguably the simplest storage pathway, has several components:

disk → chipset → driver → kernel → file-system driver → phone O/S

1980 (IBM PC, MFM hard drive):

User saves a Volkswriter file.
DOS finds a free spot in the file-system, and tells the disk to write the data to specific blocks.
The disk writes the data to the blocks specified.
DOS returns control back to the user.

Compare this to:

2010 (HP DL360 G6, FC-attached EVA4400 with FC disks):

1. User saves a Word file to a network share.
2. Server OS caches the file in case the user will want it again.
3. Server OS aggregates this file-save with whatever other Write I/O needs committing, grouping I/O wherever possible. Sends write stream to device driver for specific LUN.
4. Since the write didn't specify bypass-caching, Server OS tells user the write committed. User rejoices.
5. Device driver queues the writes for sending on the Fibre Channel bus, in and amongst the other FC traffic.
6. EVA HSV controller receives the writes and caches them.

    If the HSV controller was configured to cache writes, it informs the server that the write committed. If the user had specified bypass-caching, the Server OS informs the user that the write committed. User rejoices.
    If the HSV controller was not configured to cache writes, the cache is immediately flushed, skipping step 7 and going straight to step 8.

7. EVA HSV holds the writes in cache until it needs to flush them, at which point...
8. HSV reorders pending writes for maximum efficiency.
9. HSV sends write commands to individual disks.
10. Disk receives the writes and inserts them into its internal command queue.
11. Disk reorders writes for maximum efficiency.
12. Disk commits the write, and informs the controller it has done so.

    If the HSV was not configured to cache writes, the HSV controller informs the Server that the write committed. If the user had specified bypass-caching, the Server OS informs the user the write committed. User rejoices.

We've come a long way in 30 years.

In my next article I'll talk about some technology specifics that I didn't go into here.
