Recently in backup Category

Except if you're using HP Data Protector.

Much as I'd like to jump on the backup-to-disk de-dup bandwagon, I can't. Can't afford it. It all comes down to the cost-per-GB of storage in the backup system.

With tape, Data Protector licenses on the following items:
  • Per tape-drive over 2
  • Per tape library with a capacity between 50 and 250 slots
  • Per tape library that exceeds 250 slots
  • Per media pool with more than some-big-number of tapes
With disk, DP licenses on the following items:
  • Per TB in the backup-to-disk system
Obviously, the Disk side is much easier to license. In our environment we had something like 500 SDLT320 tapes, and our library had 6 drives and 45 slots. We only had to license the 4 extra tape drives.

Then our library started crapping out, and we outgrew it anyway. Prime time to figure out what the future holds for our backup environment. TO DISK!

HOLY CRAP that's expensive.

HP licenses their B2D space by the Terabyte. After you do the math it comes down to about $5/GB. Without using a de-duplication technology, you can easily make 10 copies or more of every bit of data subject to backup. Which means that for every 1 GB of data in the primary storage, 10GB of data is in the B2D system, and that'll set us back a whopping $50/GB. So... about the de-duplication system...

Too bad it doesn't work for non-file data, and kinda sorta explicitly doesn't work for clustered systems. Since 70% or so of our backup data is sourced from clustered file-servers or is non-file data (Exchange, SQL backups), this means the gains from HP's de-dup technology are pretty minor. Looks like we're stuck doing standard backups at $50/GB (or more).

So, about that 'dead' tape technology! We've already shelled out for the tape-drive licenses so that's a sunk cost. The library we want doesn't have enough slots to force us to get that license. All that's left is the media costs. Math math math, and the amortized cost of the entire library and media set comes to about $0.25/GB. Niiice. Factor in the 10x magnification factor, and each 1 GB of primary data will cost $2.50 to back up, a far, far cry from $50/GB.
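
The arithmetic behind the disk and tape figures is simple enough to sketch. This is an illustration only: the function name `cost_per_primary_gb` is made up here, and the 10 retained copies is the rough estimate from this post, not a measured value.

```python
def cost_per_primary_gb(cost_per_backup_gb, copies_retained):
    """Cost to protect 1 GB of primary data when the backup system
    holds `copies_retained` full copies of it (no de-duplication)."""
    return cost_per_backup_gb * copies_retained

# Figures from the post: ~$5/GB for licensed B2D space, ~$0.25/GB for
# the amortized tape library plus media, and roughly 10 retained copies.
disk = cost_per_primary_gb(5.00, 10)
tape = cost_per_primary_gb(0.25, 10)

print(f"disk: ${disk:.2f} per primary GB")
print(f"tape: ${tape:.2f} per primary GB")
print(f"disk costs {disk / tape:.0f}x as much as tape")
```

The copy count multiplies both sides equally, so the ratio between disk and tape stays the same no matter how many copies are retained.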

We still have SOME backup to disk space. This is needed since these LTO4 drives are HUNGRY critters, and the only way to feed them fast enough to prevent shoe-shining is to back everything up to disk, and then copy the jobs to tape directly from disk. So long as we have a week's worth of free-space, we're good. This is a sunk cost too, happily.

So. To-disk backups may be the greatest thing since the invention of the tape-changing robot, but our software isn't letting us take advantage of it. 

New backup hardware

Friday represented the first production use of our new LTO4-based tape library. This replaced the old SDLT320 based Scalar 100 we've had for entirely too long. The simple fact that all of the media and drives are BRAND NEW should make our completion rate go very close to 100%. This excites me.

Friday we did a backup of our main file-serving cluster and the Blackboard content volume in a single job that streamed to a single tape drive.

Total data backed up: 6.41TB
Total time: 1475 minutes
Speed: 4669 MB/Min, or 77 MB/s

Still not flank speed for LTO4 (that's closer to 120 MB/s) but still markedly faster than the SDLT stuff we had been doing. The similar backup on the Scalar 100 took around 36 hours (2160 minutes) instead of the 24ish hours this one took, and it used 4 tape drives to do it.
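
For anyone wanting to redo the conversion, here is a small sketch. The 120 MB/s native LTO4 rate is the figure quoted above; the function and variable names are just for illustration.

```python
LTO4_NATIVE_MB_PER_S = 120  # native streaming rate quoted in the post

def mb_per_min_to_mb_per_s(mb_per_min):
    """Convert a backup job's reported MB/min into MB/s."""
    return mb_per_min / 60

rate = mb_per_min_to_mb_per_s(4669)   # reported job speed
hours = 1475 / 60                     # reported job duration in minutes
fraction_of_native = rate / LTO4_NATIVE_MB_PER_S

print(f"{rate:.1f} MB/s over {hours:.1f} hours")
print(f"{fraction_of_native:.0%} of native LTO4 speed")
```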

Ahhhh, modern technology, how I've desired you.

*pets it*

Now to resist taking a fire ax to the old library. We have to surplus it through official channels, and they won't take it if it has been "obviously defaced". Ah well.

Spent money

It has been a week plus since we spent a lot of money and the question has been raised, what are we doing with that storage? Exactly?

It isn't fancy storage. In fact, it is the cheapest performant storage we could budget for. It's not SATA, but it is 7.2K RPM SAS. And there are 35TB of it. It's a server, with direct attached storage. Not a dedicated storage unit. Not fibre channel. An off the shelf server, a high quality RAID card, a bunch of storage shelves, and a pair of network ports.

A final decision hasn't been made yet for how we're presenting this storage to consumers, but iSCSI of some kind is the 90% likely choice. Whether that's Linux (a.k.a. the free option) or something else (a.k.a. the pay option) remains to be seen. The whole point of this storage is to be cheap per GB.

We're also adding a pair of Fibre Channel Drive enclosures to our EVA4400 to provide true high-speed low-latency service at a much reduced cost versus the EVA6100. Yes FC drives are EOL in a very short while, but the EVA4400 doesn't support SAS (yet). This is where our ESX cluster is likely to expand into when the time comes, that kind of stuff.

And a new tape library. It's an HP StorageWorks MSL4048. It is LTO4, fibre-attached, and has lots of slots. The native capacity of this guy is 37.5TB which is a whole lot larger than the 7.9TB of our current SDLT320 unit. It only has two drives for now which will limit flexibility somewhat, but it is upgradeable to four drives later when the money tree starts producing again. If we really need to, we can stack another MSL4048 on top of it for even more storage. Because it is drive-limited, we kind of have to stage all backups to disk and then copy from disk to tape; we won't be doing any backups directly to tape.

When it'll get here is anyone's guess. Purchasing is currently digging out of an avalanche of last-minute orders just like ours, so they're w-a-y backed up down there.

Spending money

Today we spent more money in one day than I've ever seen done here. Why? Well-substantiated rumor had it that the Governor had a spending freeze directive on her desk. Unlike last year's freeze, this one would be the sort passed down during the 2001-02 recession; nothing gets spent without OFM approval. Veterans of that era noted that such approval took a really long time, and only sometimes came. Office scuttlebutt was mum on whether or not consumable purchases like backup tapes would be covered.

We cut purchase orders today and rushed them through Purchasing. A Purchasing who was immensely snowed under, as can be well expected. I think final signatures get signed tomorrow.

What are we getting? Three big things:
  1. A new LTO4 tape library. I try not to gush lovingly at the thought, but keep in mind I've been dealing with SDLT320 and old tapes. I'm trying not to let all that space go to my head. 2 drives, 40-50 slots, fibre attached. Made of love. No gushing, no gushing...
  2. Fast, cheap storage. Our EVA6100 is just too expensive to keep feeding. So we're getting 8TB of 15K fast storage. We needs it, precious.
  3. Really cheap storage. Since the storage area networking options all came in above our stated price-point, we're ending up with direct-attached. Depending on how we slice it, between 30-35TB of it. Probably software iSCSI and all the faults inherent in the setup. We still need to dicker over software.
But... that's all we're getting for the next 15 months at least. Now when vendors cold call me I can say quite truthfully, "No money, talk to me in July 2011."

The last thing we have is an email archiving system. We already know what we want, but we're waiting on determination of whether or not we can spend that already ear-marked money.

Unfortunately, I'll be finding out a week from Monday. I'll be out of the office all next week. Bad timing for it, but can't be avoided.

Budget plans

Washington State has a $2.6 Billion deficit for this year. In fact, the finance people point out that if something isn't done the WA treasury will run dry some time in September and we'll have to rely on short-term loans. As this is not good, the Legislature is attempting to come up with some way to fill the hole.

As far as WWU is concerned, we know we'll be passed some kind of cut. We don't know the size, nor do we know what other strings may be attached to the money we do get. So we're planning for various sizes of cuts.

One thing that is definitely getting bandied about is the idea of 'sweeping' unused funds at end-of-year in order to reduce the deficits. As anyone who has ever worked in a department subject to a budget knows, the idea of having your money taken away from you for being good with your money runs counter to every bureaucratic instinct. I have yet to meet the IT department that considers themselves fully funded. My old job swept funds this way; our Fiscal year ended 12/15, which meant that we bought a lot of stuff in October and November with the funds we'd otherwise have to give back (a.k.a. "Christmas in October"). Since WWU's fiscal year starts 7/1, this means that April and May will become 'use it or lose it' time.

Sweeping funds is a great way to reduce fiscal efficiency.

In the end, what this means is that the money tree is actually producing at the moment. We have a couple of crying needs that may actually get addressed this year. It's enough to completely fix our backup environment, OR do some other things. We still have to dicker over what exactly we'll fix. The backup environment needs to be made better at least somewhat, that much I know. We have a raft of servers that fall off of cheap maintenance in May (i.e. they turn 5). We have a need for storage that costs under $5/GB but is still fast enough for 'online' storage (i.e. not SATA). As always, the needs are many, and the resources few.

At least we HAVE resources at the moment. It's a bad sign when you have to commiserate with your end-users over not being able to do cool stuff, or tell researchers they can't do that particular research since we have nowhere to store their data. Baaaaaad. We haven't quite gotten there yet, but we can see it from where we are.

Our tape library is showing its years, and it's time to start moving the mountain required to get it replaced with something. So this afternoon I spent some quality time with Google, a spreadsheet, and some oldish quotes from HP. The question I was trying to answer is: what's the optimal mix of backup to tape and backup to disk using HP Data Protector? The results were astounding.

Data Protector licenses backup-to-disk capacity by the amount of space consumed in the B2D directories. You have 15TB parked in your backup-to-disk archives, you pay for 15TB of space.

Data Protector has a few licenses for tape libraries. They have costs for each tape drive over 2, another license for libraries with between 61 and 250 slots, and another license for unlimited slots. Unlike BackupExec and others, there is no license for fibre-attached libraries.

Data Protector does not license per backed up host, which is theoretically a cost savings.

When all is said and done, DP costs about $1.50 per GB in your backup to disk directories. In our case the price is a bit different since we've sunk some of those costs already, but they're pretty close to a buck fiddy per GB for Data Protector licensing alone. I haven't even gotten to physical storage costs yet, this is just licensing.

Going with an HP tape library (easy for me to spec, which is why I put it into the estimates), we can get an LTO4-based tape library that should meet our storage growth needs for the next 5 years. After adding in the needed DP licenses, the total cost per GB (uncompressed, mind) is on the order of $0.10 per GB. Holy buckets!

Calming down some: taking our current backup volume and apportioning the price of the largest tape library I estimated over that volume, the price rises to $1.01/GB. Which means that as we grow our storage, the price-per-GB drops as less of the infrastructure is being apportioned to each GB. That's a rather shocking difference in price.
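
The drop from $1.01/GB to $0.10/GB is just a fixed cost being spread over more data. Here is a sketch of that relationship using only the two per-GB prices above. The function names are invented for illustration, and the model assumes the library-plus-license cost stays fixed as the volume grows and that the $0.10 figure is spread over the library's full uncompressed capacity.

```python
def amortized_cost_per_gb(fixed_cost, stored_gb):
    """Fixed infrastructure cost spread evenly over the data stored."""
    return fixed_cost / stored_gb

def implied_fill_fraction(price_when_full, price_now):
    """With the same fixed cost, price per GB is inversely proportional
    to volume, so the ratio of prices gives the ratio of volumes."""
    return price_when_full / price_now

fill = implied_fill_fraction(0.10, 1.01)
print(f"current backup volume is about {fill:.0%} of the capacity priced at $0.10/GB")
```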

Clearly, HP really really wants you to use their de-duplication features for backup-to-disk. Unfortunately for HP, their de-duplication technology has some serious deficiencies when presented with our environment so we can't use it for our largest backup targets.

But to answer the question I started out with, what kind of mix should we have, the answer is pretty clear. As little backup-to-disk space as we can get away with. The stuff has some real benefits, as it allows us to stage backups to disk and then copy to tape during the day. But for long term storage, tape is by far the more cost-effective storage medium. By far.

Bad tapes

It seems that HP Data Protector and BackupExec 10 have different opinions on what constitutes a bad tape. BackupExec seems to survive them better. This means that as we cycle old media into the new Data Protector environment we're getting the occasional bad tape. We've been averaging 3 bad tapes per 40 tape rotation.

While that may not sound like a lot, it really is. Our very large backups are extremely vulnerable to bad tapes, since all it takes is one bad tape to kill an entire backup session. When you're doing a backup of 1.3TB of data, you don't want those backups to fail.

Take that 1.3TB backup. We're backing up to SDLT320 media, so we're averaging somewhere between 180GB and 220GB a tape depending on what kinds of files are being backed up. So that's 7-8 tapes for this one backup. How likely is it that this 7 to 8 tape backup will include at least one of the 3 bad tapes?

When the first tape is picked the chance is 3 in 40 (7.5%).
When the second tape is picked, assuming the first tape was good, the chance is 3 in 39 (7.69%).
When the third tape is picked, presuming the first two were good, the chance is 3 in 38 (7.89%).
When the 7th tape is picked, presuming the first six were good, the chance has increased to 3 in 34 (8.82%)

8.82% doesn't sound like much. However, the risk compounds with every tape picked. The chance of the whole backup dodging all three bad tapes is the product of each pick being good, and the chance of hitting at least one is whatever is left over:

1 - (37/40)×(36/39)×(35/38)×(34/37)×(33/36)×(32/35)×(31/34) = 1 - 0.55223 = 0.44777 or 44.78%

So with 3 bad tapes in a given 40 tape set, the chance of this one 7 tape backup having at least one of them in the tape set is nearly 45%. For an 8 tape backup the probability increases to 49.80%.

The true probability is a different number, since these backups are taken concurrent with other backups. So when the 7th tape gets picked, the number of available tapes is much less than 34, and the number of bad tapes still waiting to be found may not be 3. Also, these backups are multiplexed so the true tape set may be as high as 9 tapes for this backup if that one backup target is slow in sending data to the backup server.

So the true probability is not 44.78%, it changes on a week to week basis. However, 44.78% (or roughly 50%) is a good baseline. Some weeks it'll be a lot more. Others, such as weeks where the bad tapes are found by other processes and the target server is streaming fast, less.
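
The exact at-least-one-bad-tape odds for a fixed tape set can be checked with a few lines of Python. This is a sketch under simplifying assumptions (tapes picked at random from a fixed pool, one job at a time, no concurrent backups); the function name is made up for illustration.

```python
from fractions import Fraction

def chance_of_bad_tape(pool_size, bad_tapes, tapes_used):
    """Probability that at least one bad tape lands in a backup that
    draws `tapes_used` tapes at random, without replacement."""
    all_good = Fraction(1)
    for picked in range(tapes_used):
        good_left = pool_size - bad_tapes - picked
        tapes_left = pool_size - picked
        # Chance this pick is good, given every earlier pick was good.
        all_good *= Fraction(good_left, tapes_left)
    return float(1 - all_good)

for n in (7, 8, 9):
    print(f"{n} tapes: {chance_of_bad_tape(40, 3, n):.2%}")
```

Changing the pool size or the number of bad tapes shows how quickly the odds move once other jobs start pulling tapes from the same pool.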

We have a couple more weeks until we've cycled through all of our short-retention media. At that point our error rate should drop a lot. Until then, it's like dodging artillery shells.

That TCP Windowing fault

Here is the smoking gun, let me show you it (new window).

That's an entire TCP segment. Packet 339 there is the end of the TCP window as far as the NetWare side is concerned. Packet 340 is a delayed ACK, which is a normal TCP timeout. Then follows a somewhat confusing series of packets and the big delay in packet 345.

That pattern, the 200ms delay, and 5 packets later a delay measurable in full seconds, is common throughout the capture. They seem to happen on boundaries between TCP windows. Not all windows, but some windows. Looking through the captures, it seems to happen when the window has an odd number of packets in it. The Windows server is ACKing after every two packets, which is expected. It's when it has to throw a Delayed ACK into the mix, such as for the odd packet at the end of a 27 packet window, that we get our unstable state.
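
A toy model of the odd-window pattern, as a sketch only: it assumes the receiver ACKs every second packet and otherwise waits out a 200ms delayed-ACK timer, and that the sender waits for the whole window to be ACKed before sending more. The names are hypothetical, and it does not attempt to explain the multi-second stall that follows.

```python
DELAYED_ACK_MS = 200  # delay seen in the capture

def delayed_ack_wait_ms(window_packets, ack_every=2, delay_ms=DELAYED_ACK_MS):
    """Extra wait at the end of a window when the last packet has no
    partner to trigger an immediate ACK."""
    unpaired = window_packets % ack_every
    return delay_ms if unpaired else 0

for packets in (26, 27, 28):
    print(packets, "packets ->", delayed_ack_wait_ms(packets), "ms")
```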

The same thing happened on a different server (NW65SP8) before I turned off "Receive Window Auto Tuning" on the Server 2008 server. After I turned that off, the SP8 server stopped doing that and started streaming at expectedly high data-rates. The rates still aren't as good as they were when doing the same backup to the Server 2003 server, but at least it's a lot closer. 28 hours for this one backup versus 21, instead of over 5 days before I made the change.

The packets you see are for an NW65 SP5 server after the update to the Windows server. Clearly there are some TCP/IP updates in the later NetWare service-packs that help it talk to Server 2008's TCP/IP stack.

Sniffing packets

When I first started this sysadmin gig 'round about 1997, Windows based packet sniffers were still in their infancy. In fact, the word 'sniffer' was (and probably still is) a trademarked term for the software and hardware package for, er, sniffing packets. Sniffer. So when I needed to figure out a problem on the network, I went to the Network Guys who plugged their Sniffer into any available port on the 10baseT hub I needed analysis on and went to work. They told me what was wrong. Like a JetDirect card transmitting packets whenever it sensed a packet on the wire, thus bringing the network to its knees. Things like that.

Time passed and Sniffer was bought by Network Associates. Who then added a zero to the price because that package really did have a lock on the market. The next rev then more than doubled the already inflated price. So when it came time to renew/upgrade (our Sniffer couldn't handle Fast Ethernet), the price was eye-watering. So. On came the free sniffers.

At first I was using Ether Boy, a now long lost packet sniffer. But eventually I found Ethereal (now Wireshark), and I went to work. By the time I left my old job in 2003 I already had a rep for knowing WTF I was looking at, and the network guys didn't bat an eyelash when I asked for a span port. This ability was very handy when diagnosing slow Novell logins.

Fast forward to now. Right now I'm trying to figure out why the heck a certain NetWare server is so slow talking to the Data Protector media agent. It isn't obviously a TSA problem, but I've had problems with DP and NW talking to each other on the TCP level so that's where I'm looking now. Unfortunately for me, the desktop-grade GigE NIC I have on the span isn't, shall we say, resourced enough to sniff a full GigE stream without at least a few buffer overruns. So I'm not getting ALL of the packets.

When I asked for the span port, the telecom guy said he greatly respected my ability to dig in to TCP issues. And said it in the voice of, "I think you're better at that kind of troubleshooting than we are." Which is a bit disconcerting to hear from your telecom router-gods. But there it is. What it means is that I can't very well ask for help interpreting these traces.

So far I've been able to determine that there is something hinky going on with network delays. There are some 200ms delays in there, which hints strongly at a failed protocol negotiation somewhere. But there are some rather longer delays, and it could be due to window size negotiation problems. Server 2008, the media-agent server, has a much newer TCP/IP stack than NetWare so it is entirely possible that they just don't work well together. I don't understand that quite well enough to manually deconstruct what's going on, so that's what I'm googling on right now.

And why Saturday? Because of course the volume that's doing this is our single largest, and the weekend is when it is in the failed state where I can pry the hood off and look. Who knows, I may resort to posting packets and crowd sourcing the problem.

Update 12/23/09: Found it.

It has been no secret that we've been trying to migrate away from BackupExec (10d) to HP's Data Protector. Originally this was due to cost reasons, but we were either sold a bill of goods, or there was a fundamental misunderstanding somewhere along the way. Your choice as to which it really was. In short, the costs have been about the same or even a bit more than staying with BE. However, sunk costs are sunk, so once the switch was made DP became the cheaper course of the future.

Which brings us to the current state. We've finally pried loose funds to license the Scalar 100 we have for our tape backup solution, and we're in the process of getting that working with DP. As with all backup software, it behaves a bit differently than others.

And now, a digression.

It is my opinion that all backup software everywhere is fundamentally cantankerous, finicky, and locked in obscure traditions. The traditions I'm speaking of are sourced in the ancestral primary supported platforms, and the UI and rotation metaphors created 5, 10, 15, 20 years ago. The cantankerous and finicky parts come from a combination of supporting the cantankerous and finicky tape hardware and the process of getting data off of servers for backup.

I know there are people who love their backup solutions with an unholy passion. I do not understand this. Perhaps I just haven't worked with the right environment.

Back to my point.

Data Protector continues the tradition of cantankerous, finicky software locked into an obscure tradition. I am not surprised by this, but it is still disheartening. How it interacts with our tape robot is sub-optimal in many ways, which will ultimately require hand editing a text file to configure timeout parameters in an optimal way. This reflects DP's origin as a UNIX based backup ported to Windows. It only got a usable GUI very recently, and until DP 6.10 came out required rsh and rcp for their Unix deployment server. I kid you not. DP6.11 at least supports ssh and scp.

It's also not working well with our NetWare backups. I've blogged about this one before, but didn't end up posting the solution to the last round of problems. It turned out to be an out of date driver on the part of the DP backup-to-disk server, as it wasn't ACKing packets fast enough. Updated the driver, the backup started flying. Now that we've got the Scalar in the mix, and backing up to a new server, some new problems have emerged. So far they look to be in the NetWare TSA stack rather than on the DP side (at least, that's what the symptoms look like. I still need to look at packets to be sure), which is unfortunate since 1: Novell isn't going to fix the TSAs on NetWare, and 2: We're getting rid of NetWare in the near future. But not near enough that we can just forget the backups until we migrate. Suck-up-and-deal appears to be our solution. (DP does have OES2 agents, by the way.)

Our Windows backups all are looking decent, though. That's something anyway. At least, when the Scalar isn't throwing monkey wrenches into DP's little world.

About this Archive

This page is an archive of entries from June 2010 listed from newest to oldest.
