The Department of Government Efficiency, Musk's vehicle, made news by "discovering" the General Services Administration uses tapes, and plans to save $1M by switching to something else (disk, or cloud-based storage). Long-time readers of this blog may remember I used to talk a lot about storage and tape backup. Guess it's time to get my antique Storage Nerd hat out of the closet (this is my first storage post since 2013) to explain why tape is still relevant in an era of 400Gb/s backbone networks and 30TB SMR disks.
The SaaS revolution has utterly transformed the office automation space. The job I had in 2005, in the early years of this blog, only exists in small pockets anymore. So many office systems have been SaaSified that the old problems I used to blog about around backups and storage tech are much less pressing in the modern era. Where those problems persist is in places with decades of old file data, starting in the mid-to-late 1980s, that is still being hauled around. Even when I was still doing this work in the late 2000s, the needle was shifting toward large arrays of cheap disks replacing tape libraries.
Where you still see tape in use is in offices with policies for "off-site" or "offline" storage of key office data. A lot of that is done on disk these days too, but some offices have kept their tape libraries. I suspect a lot of what DoGE found falls into this category of offices retaining tape infrastructure. Is disk cheaper here? Marginally; the true savings will be much less than the $1M headline figure.
But there is another area where tape continues to be the economical option, and it's another area DoGE is going to run into: large scientific datasets.
To explain why, I want to use a contrasting example: A vacation picture you took on an iPhone in 2011, put into Dropbox, shared twice, and haven't looked at in 14 years. That file has followed you to new laptops and phones, unseen, unloved, but available. A lot goes into making sure it's available.
All the big object-stores like S3, and the file-sync-and-share services (Dropbox, Box, Microsoft OneDrive, Google Drive, Proton Drive, etc), use a common architecture, because that architecture has proven reliable at avoiding visible data-loss:
- Every uploaded file is split into 4KB blocks (the size is relevant to disk technology, which I'm not going into here)
- Each block is written between 3 and 7 times to disk in a given datacenter or region, the exact replication factor changes based on service and internal realities
- Each block is replicated to more than one geographic region as a disaster resilience move, generally at least 2, often 3 or more
The end result of the above is that the 1MB vacation picture is written to disk 6 to 14 different times. The nice thing about this arrangement is you can lose an entire rack-row of a datacenter and not lose data; you might lose 2 of your 5 copies of a given block, but you have 3 left to rebuild from, and your other region still has full copies.
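If you want the arithmetic spelled out, here's a minimal sketch in Python, treating the block size and replication figures from the list above as illustrative assumptions rather than any one provider's real numbers:

```python
import math

BLOCK_SIZE = 4 * 1024          # 4KB blocks, per the architecture above
PHOTO_SIZE = 1 * 1024 * 1024   # the 1MB vacation picture

blocks = math.ceil(PHOTO_SIZE / BLOCK_SIZE)   # 256 blocks

# Low end: 3 copies per region. High end: 7 copies per region. Two regions.
for copies_per_region in (3, 7):
    regions = 2
    total = copies_per_region * regions
    print(f"{copies_per_region} copies/region x {regions} regions = "
          f"{total} on-disk copies of each of the {blocks} blocks")
```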
But I mentioned this 1MB file has been kept online for 14 years. Assuming an average disk lifespan of 5 years, each block has been migrated to new hardware 3 times in those years, meaning each 4KB block of that file has been resident on somewhere between 24 and 56 hard drives; more, if your provider replicates to more than 2 discrete geographic regions. Those drives have been spinning and using power (and therefore requiring cooling) the entire time.
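Extend that over the retention window and the drive count stacks up quickly (same illustrative assumptions as the sketch above):

```python
import math

YEARS_ONLINE = 14
DRIVE_LIFESPAN = 5                                     # years, as assumed above
migrations = math.ceil(YEARS_ONLINE / DRIVE_LIFESPAN)  # roughly 3 fleet refreshes
drives_per_copy = migrations + 1                       # the original drive plus each new home

low, high = 3 * 2, 7 * 2                               # 6 to 14 on-disk copies, from above
print(f"each 4KB block has lived on {low * drives_per_copy} "
      f"to {high * drives_per_copy} hard drives")      # 24 to 56
```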
These systems go to all of this effort because they need to be sure that all files are available all the time, when you need them, where you need them, as fast as possible. If a person in that vacation photo retires and you suddenly need that picture for the Retirement Montage at their going-away party, you don't want to wait hours for it to come off tape. You want it now.
Contrast this with a scientific dataset. Once the data has stopped being used for Science!, it can safely be archived until someone else needs it. This is the use-case behind AWS S3 Glacier: you pay a lot less to store data, so long as you're willing to accept delays measured in hours before you can access it. This is also the use-case where tape shines.
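As a rough illustration of that trade-off, here's what an archive retrieval looks like with boto3 against a Glacier-class object. The bucket and key names are hypothetical, and Bulk is the slow, cheap retrieval tier:

```python
import boto3

s3 = boto3.client("s3")

# Hypothetical names, for illustration only.
BUCKET = "lab-cold-archive"
KEY = "datasets/2011-survey/part-0001.tar"

# Ask S3 to stage the archived object back onto disk for 7 days.
# Bulk retrievals are the cheapest tier and take on the order of hours.
s3.restore_object(
    Bucket=BUCKET,
    Key=KEY,
    RestoreRequest={"Days": 7, "GlacierJobParameters": {"Tier": "Bulk"}},
)

# Later, poll until the temporary copy is ready: the Restore header flips
# from ongoing-request="true" to ongoing-request="false".
restore_status = s3.head_object(Bucket=BUCKET, Key=KEY).get("Restore", "")
print("ready" if 'ongoing-request="false"' in restore_status else "still restoring")
```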
A lab gets done chewing on a dataset sized at 100TB, which is pretty chonky for 2011. They send it to cold storage. Their IT section dutifully copies the 100TB dataset onto LTO-5 tapes at 1.5TB per tape, for a stack of 67 tapes, and removes the dataset from their disk-based storage arrays.
Time passes, as with the Dropbox-style data. LTO drives can read media from one or two generations prior. Assuming the lab's IT section keeps up on tape technology, it would be the advent of LTO-7 in 2015 that prompts a great restore-and-rearchive effort for all LTO-5 and older media. LTO-7 can do 6TB per tape, for a much smaller stack of 17 tapes.
LTO-8 changed this, dropping to a one-generation lookback. So when LTO-8 comes out in 2017 with a 12TB native capacity, a restore-and-rearchive effort runs again, shrinking our stack of tapes from 17 to 9. LTO-9 comes out in 2021 with 18TB per tape, and the stack drops to 6 tapes to hold 100TB.
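The tape-stack math across those generations is simple enough; a quick sketch using native (uncompressed) cartridge capacities:

```python
import math

DATASET_TB = 100   # the lab's 2011 dataset

# Native (uncompressed) capacity per cartridge, in TB.
lto_native_tb = {"LTO-5": 1.5, "LTO-7": 6, "LTO-8": 12, "LTO-9": 18}

for generation, capacity in lto_native_tb.items():
    print(f"{generation}: {math.ceil(DATASET_TB / capacity)} tapes")
# LTO-5: 67, LTO-7: 17, LTO-8: 9, LTO-9: 6
```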
All in all, our cold dataset had to relocate to new media three times, same as the disk-based copies. However, keeping stacks of tape in a climate-controlled room is vastly cheaper than a room of powered, spinning disk. The actual reality is somewhat different: the few data-archive people I know mention they do great restore-and-rearchive runs about every 8 to 10 years, largely driven by changes in drive connectivity (SCSI, SATA, FibreChannel, InfiniBand, SAS, etc), OS and software support, and corporate purchasing cycles. Keeping old drives around for as long as possible is fiscally smart, so the true number of recopy events for our example dataset is likely one.
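To put a very rough number on "vastly cheaper", here's a back-of-the-envelope power comparison. Every figure in it (drive size, wattage, cooling overhead, electricity price) is a round, assumed number purely for illustration, not a measurement:

```python
# Back-of-the-envelope: keeping 100TB spinning vs. sitting on a shelf.
DATASET_TB = 100
DRIVE_TB = 10            # assumed drive size
REPLICATION = 3          # assumed on-disk copies
WATTS_PER_DRIVE = 8      # typical-ish 3.5" drive draw
COOLING_OVERHEAD = 2.0   # assume cooling roughly doubles the power bill
PRICE_PER_KWH = 0.12     # assumed electricity price, USD
YEARS = 14

drives = DATASET_TB * REPLICATION / DRIVE_TB
kwh = drives * WATTS_PER_DRIVE * COOLING_OVERHEAD * 24 * 365 * YEARS / 1000
print(f"~{drives:.0f} drives spinning for {YEARS} years: "
      f"~${kwh * PRICE_PER_KWH:,.0f} in power and cooling alone")
# The tape stack on a shelf draws nothing; its cost is shelf space and HVAC.
```

And that sketch leaves out re-buying the drive fleet every five years, which the tape shelf also avoids.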
So another lab wants to use that dataset and puts in a request. A day later, the data is on a disk array, ready for use. Done. The carrying costs for that data over the intervening 14 years are significantly lower than under the always-available model of S3 and Dropbox.
Tape: still quite useful in the right contexts.