That was a lot of reading (and writing). How about a concrete example or two to demonstrate these concepts?
It has come time to upgrade the elderly Blackboard infrastructure. E-Learning has seriously taken off since the last time any hardware was purchased for this environment, and there is a crying need for storage space. You, the intrepid storage person, have been called in to help figure out how to stop the sobbing. You and the Server person pass knowing looks going into the meeting, perhaps because that's also you.
The Blackboard Administration team has some idea what's going on. They have 300GB of storage right now. The application is seriously high-availability since professors have taken to using it for passing in homework and dealing with tests, the very definition of a highly-visible line-of-business application for a University campus. Past trends indicate that space is growing around 50GB a quarter and increasing as average file sizes grow and more and more teaching staff start using the system.
After asking a few key questions you learn a few things.
The read/write ratio is about 6:4 for the file storage.
The service is web-fronted, so whole files are read and written. Files are not held open and updated all day.
2 years of courses are held online for legal reasons, so only 1/8th of the data is ever touched in a quarter.
The busiest time of the quarter is in the three weeks up to and including finals week, as students hand in work.
The later in the quarter it gets, the more late-night access happens.
Once a quarter there is a purge of old courses.
The database backing this application has plenty of head-room and already meets DR requirements, so you don't have to worry about that.
Words cannot explain how busy the Helpdesk gets when Blackboard is down.
Fast recovery from a disaster is a paramount concern. Lost work will bring the wrath of parents upon the University.
A nice list of useful facts. From this you can determine many things:
Read/Write percentage: This was explicitly spelled out, 60%/40%. What's more, since the storage is fronted by web-servers, write performance is almost completely hidden from end-users by the very extensive app-level caching. No one expects uploads to be fast, just downloads.
Average and Peak I/O rates: Because only an eighth of the data is accessed during a quarter, and the need for fast recovery is there, the weekend backup is the largest I/O event by far. User-generated I/O peaks in the weeks approaching finals week, but doesn't come to even a fifth of backup I/O.
Latency Sensitivity: As this is a web-fronted storage system that reads and writes whole files, this system is not significantly latency sensitive. As it can tolerate high latencies, this reduces the amount of hardware required to support it.
I/O Access Type: User-generated I/O will be infrequent random accesses. System-generated I/O, that backup again, will be large sequential. Due to the latency tolerance of the system, a degradation of random I/O speeds during the large sequential access is permissible.
Storage Failure Handling: More of an implementation detail, but the latency tolerance of the system allows much more flexibility in selecting an underlying storage system. If random I/O is noticeably degraded during the backup, then tests will need to be made to see how bad it gets when the disk array is rebuilding after a failure.
Size and Growth: The app managers know what they have and what past growth suggests. However, storage always grows more than expected. The app managers said outright that they're experiencing two kinds of growth: new users to the system, and changing usage patterns by existing users. In other words, whatever system gets built, ease of storage expansion needs to be a high priority.
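To put rough numbers on that growth, here is a minimal sketch of a capacity projection. The 300GB starting point and 50GB a quarter come from the Blackboard team; the `acceleration` parameter and its 10% value are purely my assumption, standing in for "and increasing", and the function name is made up for this example. It treats the 50GB figure as net growth after the quarterly purge.

```python
def project_capacity(current_gb, growth_gb, quarters, acceleration=0.0):
    """Return the projected size in GB at the end of each quarter.

    acceleration is a hypothetical per-quarter increase in the growth
    increment itself (0.10 means each quarter grows 10% more than the last).
    """
    sizes = []
    size = current_gb
    step = growth_gb
    for _ in range(quarters):
        size += step
        sizes.append(round(size, 1))
        step *= 1 + acceleration
    return sizes


# Two years out (8 quarters), with and without the assumed acceleration.
linear = project_capacity(300, 50, 8)
accelerating = project_capacity(300, 50, 8, acceleration=0.10)
print("Flat 50GB/quarter:     ", linear[-1], "GB")
print("Growth rising 10%/qtr: ", accelerating[-1], "GB")
```

The gap between the two results is the point: a small change in the growth assumption moves the two-year number a lot, which is why easy expansion matters more than hitting an exact size up front.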
With this in mind and given the constraints of high availability (undoubtedly clustering of some kind), the shape of a system suggests itself. Direct-attach disk is off the table due to the clustering, so it has to be some kind of shared-access disk array. The I/O patterns and latency sensitivity do not suggest that high speed disks are needed, so those 15K SAS drives are probably overkill and SSDs are not even in the same country. However, it does need to be highly reliable and still performant under the worst conditions: a disk failure during finals week.
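A quick back-of-the-envelope check on the biggest I/O event, the weekend backup, helps show why fast disks aren't called for. This sketch assumes a full backup of the whole file store and an 8-hour window; both the window and the function name are my assumptions, not something the Blackboard team stated.

```python
def required_throughput_mb_s(data_gb, window_hours):
    """Sustained MB/s needed to read data_gb within window_hours."""
    if window_hours <= 0:
        raise ValueError("window_hours must be positive")
    megabytes = data_gb * 1024           # 1 GB taken as 1024 MB
    seconds = window_hours * 3600
    return megabytes / seconds


# Today's 300GB, and a hypothetical future 700GB, in an assumed 8-hour window.
for size_gb in (300, 700):
    rate = required_throughput_mb_s(size_gb, 8)
    print(f"{size_gb}GB in 8 hours needs about {rate:.1f} MB/s sustained")
```

Sequential rates in that range are modest, but remember that this is the average over the window; the real test is whether the array can hold that rate while it is also rebuilding.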
The disaster-recovery question needs to be worked out with the backup people (which also may be you). This is an application where a live mirror on a separate storage system would be a good idea to maintain, as it would significantly reduce the downtime incurred if the file store were completely lost for some reason. Depending on the servers involved, that kind of replication could cost quite a lot of money (if the mirror is implemented in the storage array's mirroring software) or be free (in the case of Linux + DRBD). Whether they can afford it is one for the application managers to figure out.
The disks for this system need to be highly reliable and cheap, and that spells 7.2K RPM SAS. The storage quantities are modest enough that RAID10 could be a reasonable RAID level, but the latency tolerance suggests that RAID5/6 would be permissible. The need for shared storage means either an iSCSI or a Fibre Channel storage array, with iSCSI being the cheaper choice (presuming the network is prepared for it). The disk controller doesn't have to be terribly beefy, but it does have to be beefy enough to handle backup I/O while dealing with an array rebuild or disk-add.
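To compare the RAID options on capacity alone, here is a small sketch using the standard usable-space rules: RAID10 keeps half the disks, RAID5 loses one disk to parity, and RAID6 loses two. The eight 1000GB disks are a hypothetical shelf chosen only for illustration, and the function name is mine.

```python
def usable_capacity_gb(level, disks, disk_gb):
    """Usable capacity of a single RAID group, ignoring hot spares and
    filesystem overhead."""
    if level == "RAID10":
        if disks < 4 or disks % 2:
            raise ValueError("RAID10 needs an even number of disks, 4 or more")
        return disks // 2 * disk_gb
    if level == "RAID5":
        if disks < 3:
            raise ValueError("RAID5 needs at least 3 disks")
        return (disks - 1) * disk_gb
    if level == "RAID6":
        if disks < 4:
            raise ValueError("RAID6 needs at least 4 disks")
        return (disks - 2) * disk_gb
    raise ValueError(f"unsupported RAID level: {level}")


# Hypothetical shelf: eight 1000GB 7.2K drives.
for level in ("RAID10", "RAID5", "RAID6"):
    print(level, usable_capacity_gb(level, 8, 1000), "GB usable")
```

Capacity is only half the decision. RAID5/6 gives more space per disk, but rebuilds are harder on the array, which is exactly the worst-case finals-week scenario that needs testing.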
This could be either a low-end to mid-range stand-alone storage array, or a modest expansion of an existing one. Next step? Figuring out if this can fit in existing hardware or requires a new purchase. High availability doesn't require dedicated hardware! Or even a lot of it.