Thursday, October 15, 2009

It's the little things

Right now our Microsoft migration schedule is hung up on backup licenses. Backing up clustered servers requires extensions, which we didn't notice back when we priced out the project. It is things like these that make for cost-overruns. The long and the short of it is, we're not migrating anything until we can legally back up the new environment. Period. That's just how it is.

As most of the budget arm-wrestling happens above me, I only get bits and pieces. Since we don't spend our own money but other people's, we have to convince those other people that this money needs to be spent. I understand there was some pushback when the quote came in, and we've been educating them about what exactly it would mean if we don't do this.

I understand the order is in the works, and we're just waiting on license codes. But until they arrive (electronic delivery? What's dat?) we simply cannot move forward. That's just how it is.



Wednesday, July 15, 2009

Where DIY belongs

The question of "When should you build it yourself and when should you get it off the shelf?" is one that varies from workplace to workplace. We heard several different variants of that when we were interviewing for the Vice Provost for IT last year. Some candidates only did home-brew when no off the shelf package was available, others looked at the total cost of both and chose from there. This is a nice proxy question for, "What is the role of open source in your environment," as it happens.

Backups are one area where duct tape and baling wire are to be discouraged most emphatically.

And now, a moment on tar. It is a very versatile tool, and is what a lot of unixy backup packages are built around. The main problem with backup and restore is not getting data to the backup medium; it is keeping track of what data is on which medium. In these days of backup-to-disk, de-duplication is also in the mix, and that's something tar can't do yet. So while you can build a tar-and-bash backup system from scratch without paying a cent, it will be lacking in certain very useful features.

Also? Tar doesn't work nearly as well on Windows.
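
To make the "what data is on which medium" point concrete, here is a minimal sketch in Python, assuming a hypothetical JSON catalog file kept next to the archives. The function names and the catalog layout are mine, not taken from any backup package, and a real product tracks far more than this.

```python
import json
import tarfile
from pathlib import Path


def backup_with_catalog(source_dir, archive_path, catalog_path):
    """Tar up source_dir and record which archive each file landed in."""
    source_dir = Path(source_dir)
    archive_path = Path(archive_path)
    catalog_path = Path(catalog_path)

    # Getting the data onto the medium is the easy part.
    with tarfile.open(archive_path, "w") as tar:
        tar.add(source_dir, arcname=source_dir.name)

    # The hard part is remembering what went where. Read the member list
    # back and merge it into the catalog (hypothetical format:
    # member name -> list of archives that contain it).
    catalog = {}
    if catalog_path.exists():
        catalog = json.loads(catalog_path.read_text())
    with tarfile.open(archive_path, "r") as tar:
        for member in tar.getmembers():
            if member.isfile():
                catalog.setdefault(member.name, []).append(str(archive_path))
    catalog_path.write_text(json.dumps(catalog, indent=2))
    return catalog


def find_archives(catalog_path, member_name):
    """Answer the restore question: which archives hold this file?"""
    catalog = json.loads(Path(catalog_path).read_text())
    return catalog.get(member_name, [])
```

Even this toy version shows where the work goes: the tar call is three lines, and everything else is bookkeeping. Lose the catalog and you are back to listing every archive until the file turns up.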

Your backup system is one area where you really do not want to invest a lot of developer creativity. You need it to be bulletproof, fault tolerant, able to handle a variety of data-types, and easy to maintain. Even the commercial packages fail some of these points some of the time, and the home brew systems fall apart much more often relative to these. The big backup boys have agents that allow backups of Oracle DBs, Linux filesystems, Exchange, and Sharepoint all to the same backup system. A home-brew application would have to get very creative to do the same thing, and the problem gets even worse when it comes to restore.

Disaster Recovery is another area in which duct tape and baling wire are to be discouraged most emphatically.

There are battle-tested open-source packages out there that will help with this (DRBD for one), depending on your environment. They're even widely used, so finding someone to replace the sysadmin who just had a run-in with a city bus is not that hard. Rsync can do a lot as well, so long as the scale is small. Most single systems can have something cobbled together.

Problems arise when you start talking Windows or very complex installations, or when money is a major issue. If you throw enough money at a problem, most disaster recovery problems become a lot less complex. There is a lot of industry investment in DR infrastructure, so the tools are out there. Doing it on a shoe-string means that your disaster recovery also hangs by a shoe-string. If you're doing DR just to satisfy your auditors and don't plan on ever actually using it, that's one thing. But if you really expect to recover from a major disaster on that shoe-string you'll be sorely surprised when that string snaps.

Business Continuity is an area where duct tape and baling wire should be flatly refused.

BC is in many ways DR with a much shorter recovery time. If you had problems getting your DR funded correctly, BC shouldn't even be on the timeline. Again, if it is just so you can check a box on some audit report, that's one thing. Expecting to run on such a rig is quite another.

And finally, if you do end up cobbling together backup, disaster recovery, or business continuity systems from their component parts, testing the system is even more important. In many cases testing DR/BC takes a production outage of some kind, which makes it hard to schedule tests. But testing is the only way to find out if your shoe-string can stand the load.



Tuesday, March 31, 2009

When perfection is the standard

The disaster recovery infrastructure is an area where perfection is the standard, and anything less than perfection is a fault that needs fixing. It shares this distinction with other things like Air Traffic Control and sports officiating. In any area where perfection is the standard, any failure of any kind brings wincing. There are ways to manage around faults, but there really shouldn't be faults in the first place.

In ATC there are constant cross-checks and procedures to ensure that true life-safety faults only happen after a series of faults. In sports officiating, the advent of 'instant replay' rules helps officials see what actually happened from angles other than the ones they had, all as a way to improve the results. In DR, any time a backup or replication process fails, it leaves an opening through which major data-loss can occur. Each of these has its unavoidable, "Oh *****," moments, which leads to frustration when they happen too often.

At my old job we had taken some paperwork steps towards documenting DR failures. We didn't have anything like a business-continuity process, but we did have tape backup. When backups failed, there was a form that needed to be filled out and filed, explaining why the fault happened and what can be done to help it not happen again. I filled out a lot of those forms.

Yeah, perfection is the standard for backups. We haven't come even remotely close to perfection for many, many months. Some of it is simple technology faults, like DataProtector and NetWare needing tweaking to talk to each other well or over-used tape drives giving up the ghost and requiring replacement. Some of it is people faults, like forgetting to change out the tapes on Friday so all the weekend fulls fail due to a lack of non-scratch media. Some of it is management process faults, like discovering the sole tape library fell off of support and no one noticed. Some of it is market-place faults, like discovering the sole tape library will be end-of-lifed by the vendor in 10 months. Some of these haven't happened yet, but they are areas that can fail.

If the stimulus fairy visits us, backup infrastructure is top of the list for spending.



Tuesday, February 17, 2009

tsatest and incrementals

Today I learned how to tell TSATEST to do an incremental backup. I also learned that the /path parameter requires the DOS namespace name. Example:

tsatest /V=SHARE: /path=FACILI~1 /U=.username.for.backup /c=2

That'll do an incremental (files with the Archive bit set) backup of that specific directory, on that specific volume.
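
For anyone who hasn't met the Archive bit, here is a sketch in Python of the selection rule an incremental applies. The 0x20 value is the DOS-style archive attribute; the function names and the input format are made up for illustration, and this is not how TSATEST itself is implemented.

```python
ARCHIVE_BIT = 0x20  # DOS-style "archive" file attribute


def select_incremental(entries):
    """Return the names whose Archive bit is set (changed since last backup).

    entries is an iterable of (name, attribute_flags) pairs; where those
    flags come from is left to the caller (hypothetical input).
    """
    return [name for name, attrs in entries if attrs & ARCHIVE_BIT]


def clear_archive_bit(attrs):
    """What a backup typically does after saving a file: clear the bit."""
    return attrs & ~ARCHIVE_BIT
```

Any write to a file sets the bit again, which is how the next incremental knows to pick it up.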



Wednesday, February 11, 2009

Performance tuning Data Protector for NetWare

HP Data Protector has a client for NetWare (and OES2, but I'm not backing up any of those yet). This is proving to take a bit of TSA tuning to work out right. I haven't figured out where the problem exactly is, but I've worked around it.

The following settings are what I've got running right now, and they seem to work. I may tweak later:

tsafs /readthreadsperjob=1
tsafs /readaheadthrottle=1

This seems to get around a contention issue I'm seeing with more aggressive settings, where the TSAFS memory will go to the max allowed by the /cachememorythreshold setting and sit there, not passing data to the DP client. This makes backups go really long. The above settings somehow prevent this from happening.

If these prove stable, I may up the readaheadthrottle setting to see if it halts on that. This is an EVA6100 after all, so I should be able to go up to at least 18 if not 32 for that setting.



Tuesday, February 03, 2009

More on DataProtector 6.10

We've had DP6.10 installed for several weeks now and have some experience with it. Yesterday I configured a Linux Installation Server so I can push agents out to Linux hosts without having to go through the truly horrendous install process that DP6.00 forced you to do when not using an Installation Server. This process taught me that DataProtector grew up in the land of UNIX, not Linux.

One of the new features of DP6.10 is that they now have a method for pushing backup agents to Linux/HP-UX/Solaris hosts over SSH. This is very civilized of them. It uses public key and the keychain tool to make it workable.

The DP6.00 method involved tools that make me cringe. Like rlogin/rsh. These are just like telnet in that the username and password are transmitted over the wire in the clear. For several years now we've had a policy in place that states that protocols that require cleartext transmission of security principals like this are not to be used. We are not alone in this. I am very happy HP managed to get DP updated to a 21st century security posture.

Last Friday we also pointed DP at one of our larger volumes on the 6-node cluster. Backup rates from that volume blew our socks off! It pulled data at about 1600MB/Minute (a hair under 27MB/Second). For comparison, the native transfer rate of the SDLT320 (the drive we have in our tape library, which DP isn't attached to yet) is 16MB/Second. Even allowing for the 1:1.2 to 1:1.4 compression ratios typical of this sort of file data, the rate DP can pull data is still faster than the tape drive can take it.

The old backup software didn't even come close to these speeds, typically running in the 400MB/Min range (7MB/Sec). The difference is that the old software is using straight up TSA, where DP is using an agent. This is the difference an agent makes!
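
The arithmetic behind those comparisons, as a small Python sketch. The function names are mine, and the compression model (native rate times ratio) is a simplifying assumption rather than anything from a drive spec sheet.

```python
def mb_per_min_to_mb_per_sec(rate_mb_per_min):
    """Convert a backup rate from MB/minute to MB/second."""
    return rate_mb_per_min / 60.0


def effective_tape_rate(native_mb_per_sec, compression_ratio):
    """Rate of uncompressed data a drive can absorb at a given ratio."""
    return native_mb_per_sec * compression_ratio


dp_rate = mb_per_min_to_mb_per_sec(1600)   # DataProtector with its agent
old_rate = mb_per_min_to_mb_per_sec(400)   # old software over plain TSA
tape_low = effective_tape_rate(16, 1.2)    # 16MB/s native at 1:1.2
tape_high = effective_tape_rate(16, 1.4)   # 16MB/s native at 1:1.4

print(f"DataProtector: {dp_rate:.1f} MB/s, old software: {old_rate:.1f} MB/s")
print(f"Tape with compression: {tape_low:.1f} to {tape_high:.1f} MB/s")
```

Under those assumptions the new rate lands above the compressed tape range and the old rate lands well below it.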



Tuesday, January 06, 2009

DataProtector 6.00 vs 6.10

A new version of HP DataProtector is out. One of the nicest new features is that they've greatly optimized the object/session copy speeds.

No matter what you do for a copy, DataProtector will have to read all of one Disk Media (50GB by default) to do the copy. So if you multiplex 6 backups into one Disk Writer device, it'll have to look through the entire media for the slices it needs. If you're doing a session copy, it'll copy the whole session. But object copies have to be demuxed.
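
A toy model of why a multiplexed medium has to be read end to end, sketched in Python. The (object_id, payload) slice layout is invented for illustration and is not DataProtector's actual media format.

```python
def demux_object(media, wanted_object):
    """Pull one object's slices out of a multiplexed medium.

    media is a list of (object_id, payload) slices in the order they were
    written. Because slices from different backups are interleaved, every
    slice has to be inspected even though only some are kept.
    """
    slices_read = 0
    kept = []
    for object_id, payload in media:
        slices_read += 1
        if object_id == wanted_object:
            kept.append(payload)
    return b"".join(kept), slices_read


# Six backups written round-robin into one medium (toy data).
media = [(f"obj{n % 6}", bytes([n])) for n in range(24)]
data, slices_read = demux_object(media, "obj2")
print(f"kept {len(data)} of {slices_read} slices read")
```

In this model an object copy reads six times as much as it keeps, which is the cost the paragraph above describes.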

DP6.00 did not handle this well. Consistently, each Data Reader device consumed 100% of one CPU for a speed of about 300 MB/Minute. This blows serious chunks, and is completely unworkable for any data-migration policy framework that takes the initial backup to disk, then spools the backup to tape during daytime hours.

DP6.10 does this a lot better. CPU usage is a lot lower; it no longer pegs one CPU at 100%. Also, network speeds vary between 10-40% of GigE speeds (750 to 3000 MB/Minute), which is vastly more reasonable. DP6.10, unlike DP6.00, can actually be used for data migration policies.
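
Where the 750 and 3000 figures come from, as a quick Python sketch. It uses the raw bit rate and ignores protocol overhead, so treat the results as ceilings; the function name is mine.

```python
GIGE_MEGABITS_PER_SEC = 1000


def link_share_mb_per_min(link_megabits_per_sec, fraction):
    """MB/minute moved when using `fraction` of a link's raw bit rate."""
    megabytes_per_sec = link_megabits_per_sec / 8
    return megabytes_per_sec * 60 * fraction


low = link_share_mb_per_min(GIGE_MEGABITS_PER_SEC, 0.10)
high = link_share_mb_per_min(GIGE_MEGABITS_PER_SEC, 0.40)
print(f"10% of GigE: {low:.0f} MB/min, 40% of GigE: {high:.0f} MB/min")
```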



Friday, September 19, 2008

Moving storage around

The EVA6100 went in just fine with that one hitch I mentioned, and now comes all the work we need to do now that we have actual space again. We're still arguing over how much space to add to which volumes, but once we decide all but Blackboard will be very easy to add.

Blackboard needs more space on both the SQL server and the Content server, and as the Content server is clustered it'll require an outage to manage the increase. And it'll be a long outage, as 300GB of weensy files takes a LONG time to copy. The SQL server uses plain old Basic partitions, so I don't think we can expand that partition, so we may have to do another full LUN copy which will require an outage. That has yet to be scheduled, but needs to happen before we get through much of the quarter.

Over on the EVA4400 side, I'm evacuating data off of the MSA1500cs onto the 4400. Once I'm done with that, I'm going to be:
  1. Rebuilding all of the Disk Arrays.
  2. Creating LUNs expressly for Backup-to-Disk functionality.
  3. Flashing the Active/Active firmware (the 7.00 firmware rev) onto it.
  4. Getting the two Backup servers installed with the right MPIO widgetry to take advantage of active/active on the MSA.
But first we need the DataProtector licensing updates to beat their way through the forest of paperwork and get ordered. Otherwise, we can't use more than 5TB of disk, and that's WAY wimpy. I need at LEAST 20, and preferably 40TB. Once that licensing is in place, we can finally decommission the out-of-license BackupExec server and use the 6 slot tape library with DataProtector instead. This should significantly increase how much data we can throw at backup devices during our backup window.

What has yet to be fully determined is exactly how we're going to use the 4400 in this scheme. I expect to get between 15-20TB of space out of the MSA once I'm done with it, and we have around 20TB on the 4400 for backup. Which is why I'd really like that 40TB license please.

Going Active/Active should do really good things for how fast the MSA can throw data at disk. As I've proven before the MSA is significantly CPU bound for I/O to parity LUNs (Raid5 and Raid6), so having another CPU in the loop should increase write throughput significantly. We couldn't do Active/Active before since you can only do Active/Active in a homogeneous OS environment, and we had Windows and NetWare pointed at the MSA (plus one non-production Linux box).

In the mean time, I watch progress bars. TB of data takes a long time to copy if you're not doing it at the block level. Which I can't.



Tuesday, June 24, 2008

Backing up NSS, note for the future

According to this documentation, the storing of NSS/NetWare metadata in xattrs is turned off by default. You turn it on for OES2 servers through the "nss /ListXattrNWMetadata" command. This allows Linux-level utilities (e.g. cp, tar) to access and copy the NSS metadata. This also allows backup software that isn't SMS enabled for OES2 to back up the NSS information.

This is handy, as HP DataProtector doesn't support NSS backup on Linux. I need to remember this.



Monday, May 12, 2008

DataProtector 6 has a problem, continued

I posted last week about DataProtector and its Enhanced Incremental Backup. Remember that "enhincrdb" directory I spoke of? Take a look at this:

[Image: file sizes in the enhincr directory]

See? This is an in-progress count of one of these directories. 1.1 million files, 152MB of space consumed. That comes to an average file-size of 133 bytes. This is significantly under the 4kb block-size for this particular NTFS volume. On another server with a longer serving enhincrdb hive, the average file-size is 831 bytes. So it probably increases as the server gets older.

On the up side, these millions of weensy files won't actually consume more space for quite some time as they expand into the blocks the files are already assigned to. This means that fragmentation on this volume isn't going to be a problem for a while.

On the down side, it's going to park (in this case) 152MB of data on 4.56GB of disk space. It'll get better over time, but in the next 12 months or so it's still going to be horrendous.
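
Here is how that 4.56GB falls out, as a Python sketch. I'm assuming about 1.14 million files (the count that makes 152MB average out to 133 bytes), decimal units, and a 4kb block counted as 4000 bytes to match the figure above; with 4096-byte blocks the answer comes out a little higher.

```python
import math


def allocated_bytes(file_count, avg_file_size, block_size):
    """Disk space consumed when every file occupies whole blocks."""
    blocks_per_file = max(1, math.ceil(avg_file_size / block_size))
    return file_count * blocks_per_file * block_size


files = 1_140_000            # assumed count, see the lead-in
on_disk = allocated_bytes(files, 133, 4000)
payload = files * 133
print(f"{on_disk / 1e9:.2f} GB allocated for {payload / 1e6:.0f} MB of data")
```

The same function makes it easy to try other block sizes against the same file count.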

This tells me two things:
  • When deciding where to host the enhincrdb hive on a Windows server, format that particular volume with a 1k block size.
  • If HP supported NetWare as an Enhanced Incremental Backup client, the 4kb block size of NSS would cause this hive to grow beyond all reasonable proportions.
Some file-systems have real problems dealing with huge numbers of files in a single directory. Ext3 is one of these, which is why the b-tree hashed indexes were introduced. Reiser does better in this case out of the box. NSS is pretty good about this, as all GroupWise installs before GW became available for non-NetWare platforms created this situation by the sheer design of GW. Unlike NSS, ext3 and reiser can be formatted with different block-sizes, which makes it easier to correctly engineer a file-system to host the enhincrdb data.

Since it is highly likely that I'll be using DataProtector for OES2 systems, this is something I need to keep in mind.



Wednesday, May 07, 2008

DataProtector 6 has a problem

We're moving our BackupExec environment to HP DataProtector. Don't ask why, it made sense at the time.

One of the niiiice things about DP is what's called "Enhanced Incremental Backup". This is a de-duplication strategy that only backs up files that have changed, and only stores the changed blocks. From these incremental backups you can construct synthetic full backups, which are just pointer databases to the blocks for that specified point-in-time. In theory, you only need to do one full backup, keep that backup forever, do enhanced incrementals, then periodically construct synthetic full backups.

We've been using it for our BlackBoard content store. That's around... 250GB of file store. Rather than keep 5 full 275GB backup files for the duration of the backup rotation, I keep 2 and construct synthetic fulls for the other 3. In theory I could just go with 1, but I'm paranoid :). This greatly reduces the amount of disk-space the backups consume.

Unfortunately, there is a problem with how DP does this. The problem rests on the client side of it. In the "$InstallDir$\OmniBack\enhincrdb" directory it constructs a file hive. An extensive file hive. In this hive it keeps track of file state data for all the files backed up on that server. This hive is constructed as follows:
  • The first level is the mount point. Example: enhincrdb\F\
  • The 2nd level is a set of directories named 00-FF, which contain the file state data itself
On our BlackBoard content store, it had 2.7 million files in that hive, and consumed around 10.5GB of space. We noticed this behavior when C: ran out of space. Until this happened, we'd never had a problem installing backup agents to C: before. Nor did we find any warnings in the documentation that this directory could get so big.
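
Measuring a hive like this is easy enough to script. A rough sketch in Python, where hive_stats is a name I made up and the argument is whatever path your own enhincrdb directory lives at:

```python
import os


def hive_stats(hive_root):
    """Count files and bytes under a directory tree such as enhincrdb.

    Returns (file_count, total_bytes, average_file_size).
    """
    file_count = 0
    total_bytes = 0
    for dirpath, _dirnames, filenames in os.walk(hive_root):
        for name in filenames:
            file_count += 1
            total_bytes += os.path.getsize(os.path.join(dirpath, name))
    average = total_bytes / file_count if file_count else 0.0
    return file_count, total_bytes, average
```

Note that this reports the bytes stored in the files, not the space allocated on disk, which on a volume full of tiny files can be very different numbers. Expect it to take a while on a hive with millions of entries.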

The last real full backup I took of the content store backed up just under 1.7 million objects (objects = directory entries in NetWare, or inodes in unix-land). Yet the enhincrdb hive had 2.7 million objects. Why the difference? I'm not sure, but I suspect it was keeping state data for 1 million objects that no longer were present in the backup. I have trouble believing that we managed to churn over 60% of the objects in the store in the time I have backups, so I further suspect that it isn't cleaning out state data from files that no longer have a presence in the backup system.

DataProtector doesn't support Enhanced Incrementals for NetWare servers, only Windows and possibly Linux. Due to how this is designed, were it to support NetWare it would create absolutely massive directory structures on my SYS: volumes. The FACSHARE volume has about 1.3TB of data in it, in about 3.3 million directory entries. The average FacStaff User volume (we have 3) has about 1.3 million, and the average Student User volume has about 2.4 million. Our Student user volumes have a high churn rate due to students coming and going. If FACSHARE were to share a cluster node with one Student user volume and one FacStaff user volume, they would have a combined count of 7.0 million directory entries (3.3 + 2.4 + 1.3). This would generate, at first, a \enhincrdb directory with 7.0 million files. Given our regular churn rate, within a year it could easily be over 9.0 million.

When you move a volume to another cluster node, it will create a hive for that volume in the \enhincrdb directory tree. We're seeing this on the BlackBoard Content cluster. So given some volumes moving around, it is quite conceivable that each cluster node will have each cluster volume represented in its own \enhincrdb directory. Which will mean over 15 million directory-entries parked there on each SYS volume, steadily increasing as time goes on and taking who knows how much space.

And as anyone who has EVER had to do a consistency check of a volume that size knows (be it vrepair, chkdsk, fsck, or nss /poolrebuild), it takes a whopper of a long time when you get a lot of objects on a file-system. The old Traditional File System on NetWare could only support 16 million directory entries, and DP would push me right up to that limit. Thank heavens NSS can support w-a-y more than that. You better hope that the file-system that the \enhincrdb hive is on never has any problems.

But, Enhanced Incrementals only apply to Windows so I don't have to worry about that. However.... if they really do support Linux (and I think they do), then when I migrate the cluster to OES2 next year this could become a very real problem for me.

DataProtector's "Enhanced Incremental Backup" feature is not designed for the size of file-store we deal with. For backing up the C: drive of application servers or the inetpub directory of IIS servers, it would be just fine. But for file-servers? Good gravy, no! Unfortunately, those are the servers in most need of de-dup technology.



Thursday, September 14, 2006

Backups for OES

One of the things that has prevented us from seriously considering a move to OES-Linux has been the backup problem. Apparently there has been some movement on that issue. At Brainshare this year SyncSort was quite prominent in pointing out that they had full support for backing up NSS volumes on Linux.

Today over at Cool Blogs, Richard Jones posted about the progress of this technology in the industry. The short version is that Novell implemented SMS on Linux, and for vendors that already had a solid Linux client it required them to completely rewrite it. Which would explain why it has taken almost two years for the big storage players to come out with supported product. Novell has taken steps to support the really big storage players in UnixLand (IBM, et al.) in their clients, using extended attributes (Xattrs).

Turns out that xattr thing was slipped into a patch on the 11th of August. I wonder if that's the same package that had shadow volumes included?



Wednesday, June 15, 2005

Veritas Panther

We received a marketing mail from Veritas, hyping the beta for their "panther" product. The boss asked us if we wanted to take a look at it. So I checked it out.

Oy. My verdict? Pointless.

What it is, as distilled from the marketing
Salvage that integrates with your backup system, but for Windows. Since it hooks into the backup system, it has a higher capacity than the Salvage that has been with NetWare since the, oh NW2.1x days in the 80's.

First and foremost, it only works on files kept on Windows servers. Since all of our fileserving is done from NetWare, that means we can't use it for anything but keeping our developers happy, and the lone FrontPage/SharePoint server.

Second, even if it did support NetWare, it doesn't make a lot of sense. NetStorage in NW65SP3 has the ability to salvage files from the web, a key feature of Panther. Changing the low-water mark for when you add storage to your NSS pools is very probably more cost-effective than the Panther product would be. So instead of adding storage when free-space crosses the 15% line, add space when it crosses the 30% line. The extra expense is very probably cheaper than the Veritas product would be.



Thursday, December 09, 2004

Backup speeds

The GigE switch is in, the jacks are wired. Now to plug servers into it and see if we get any increased speed out of the thing. I'm hoping we will, but the challenge of getting a new network cable into production systems is a touch tricky. Tonight half our Exchange cluster will land on a server on GigE, which will give us a better idea how I/O vs CPU bound the Exchange backup is.



Monday, August 30, 2004

Speeeeeed with data!

All values Megs/Minute:

[Using TSAFS]
Run  StuSrv1/Stu1  FacSrv1/Share1  FacSrv2/User3  FacSrv2/Share3  StuSrv2/Class1
1    908           2106            1565           2492            846
2    930           2105            1536           2447            885
3    925           2070            1582           2519            -

[Using TSA600]
Run  StuSrv1/Stu1  FacSrv1/Share1  FacSrv2/User3  FacSrv2/Share3  StuSrv2/Class1
1    238           875             625            1160            221
2    265           904             656            1160            218
3    272           895             670            1184            -
4    287           -               -              -               -




The format of the test is nodename/volume. Each dataset had more data than known buffers in the data path in an effort to minimize the acceleration gained by such. In most tests, subsequent runs of the same dataset were faster than previous runs due to the caching on the server itself and on the EVA back-end SAN. The data were gathered using the TSATEST shipping in the NW6SP4 service-pack. The FacSrv2 tests are unique in that first User3 was run at 5GB then Share3 was run at 5GB in order to bust buffers where possible. The only variable between datasets was the TSA loaded and any itty bitty file changes that may have happened during normal usage.

What this test clearly shows is that TSAFS is faster than TSA600. In all cases tested the speed-up was north of 200%.
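
For anyone who wants to check the ratios, a short Python sketch that averages the runs per volume and divides. The numbers are the MB/minute results from this post; reading the short rows as "Class1 not run" is my assumption, based on the magnitudes.

```python
from statistics import mean

# MB/minute per run, copied from the results in this post.
tsafs = {
    "StuSrv1/Stu1": [908, 930, 925],
    "FacSrv1/Share1": [2106, 2105, 2070],
    "FacSrv2/User3": [1565, 1536, 1582],
    "FacSrv2/Share3": [2492, 2447, 2519],
    "StuSrv2/Class1": [846, 885],
}
tsa600 = {
    "StuSrv1/Stu1": [238, 265, 272, 287],
    "FacSrv1/Share1": [875, 904, 895],
    "FacSrv2/User3": [625, 656, 670],
    "FacSrv2/Share3": [1160, 1160, 1184],
    "StuSrv2/Class1": [221, 218],
}


def speedup(volume):
    """Mean TSAFS rate divided by mean TSA600 rate for one volume."""
    return mean(tsafs[volume]) / mean(tsa600[volume])


for volume in tsafs:
    print(f"{volume}: {speedup(volume):.2f}x")
```

Every volume comes out above 2x, with the StuSrv volumes showing the largest ratios.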

Other observations

One thing I did observe is that the FacSrv servers are faster than the StuSrv servers when it comes to TSA backups. This did merit investigation and I have come to the conclusion that the PSM file used has impacts on speed. Spurious interrupts (displayable by "display interrupts" from console) increment dramatically when CPQACPI.PSM (v1.06 5/12/03) is loaded, as it is on the StuSrv-series of servers. The FacSrv series of servers are old enough that they load CPQMPK.PSM instead, and the spurious interrupts on those servers are worlds lower than the numbers reported on the StuSrv series. I predict that backup speed improvements are to be had by changing which PSM we load.

It also looked like data characterization played a role in how fast things backed up. User volumes went slower than 'shared' volumes (Class1 is the student version of Share1). This may have something to do with the average file size being smaller on User volumes, or perhaps the churn-rate on the user-volumes leads to noticeable fragmentation.

Another test I ran once just because I was curious was to back up Share1 from FacSrv1 and Share3 from FacSrv2 at the same time with TSAFS and monitor the results. Both servers backed up 5GB of data and when they hit the 5GB mark I recorded the rate. When the first server hit that mark I let the backup continue in order to provide the same contentious I/O path for the slower server. The totals were:

FacSrv1/Share1 @ 1862 MB/Min
FacSrv2/Share3 @ 2065 MB/Min
Aggregate throughput of SAN I/O channel: 3927 MB/Min

This tells me a number of things:
  • Parallel backups won't drop absolute performance below the rated tape-drive performance spec
  • The SAN storage really is fast
  • A third stream (Share2?) could be added and still probably maintain real-world network speeds.
Conclusions

We have some work to do. For one, we need to either get Compaq to address the spurious interrupt issue, or drop back to CPQMPK.PSM in order to get good results out of the StuSrv series. Once this is done, I hope that the StuSrv series will be able to provide performance matching or besting that of the older FacSrv line.

The networking infrastructure needs attention. The maximum theoretical throughput for 100Mb ethernet is 750 megs/minute, and each one of the TSAFS backups went faster than that. The maximum speeds observed are fast enough to dent even Gig Ethernet, though at those speeds tape-drive latency comes into play. The current 100Mb connections are not adequate.
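
A quick Python check of the measured rates against the wire, using raw bit rates with no allowance for protocol overhead. The per-volume figures are the first TSAFS run for each volume from the results above; the variable names are mine.

```python
FAST_ETHERNET_MB_PER_MIN = 100 / 8 * 60    # 100Mb ethernet, raw bit rate
GIG_ETHERNET_MB_PER_MIN = 1000 / 8 * 60    # Gig Ethernet, raw bit rate

# First-run TSAFS rates, MB/minute.
tsafs_first_runs = {
    "StuSrv1/Stu1": 908,
    "FacSrv1/Share1": 2106,
    "FacSrv2/User3": 1565,
    "FacSrv2/Share3": 2492,
    "StuSrv2/Class1": 846,
}

# Volumes that a 100Mb link could not carry at the measured rate.
over_limit = {
    volume: rate
    for volume, rate in tsafs_first_runs.items()
    if rate > FAST_ETHERNET_MB_PER_MIN
}

# The two-stream parallel test, as a share of one Gig Ethernet link.
parallel_aggregate = 1862 + 2065
gige_share = parallel_aggregate / GIG_ETHERNET_MB_PER_MIN

print(f"{len(over_limit)} of {len(tsafs_first_runs)} volumes exceed 100Mb")
print(f"Two parallel streams use {gige_share:.0%} of one GigE link")
```

By this rough measure two fast streams already take about half of a single Gig Ethernet link before any overhead is counted.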

The backup server needs to be robust. With the possibility of multiple high-rate streams coming to the server, its own Gig Ethernet connection may become saturated if all four tape-drives are streaming from a fast source. With speeds up that high, PCI-bus contention actually becomes a factor here so the server has to be built with high I/O in mind. No "PC on steroids" here. Best case would be multiple PCI busses, or at the very minimum 64-bit PCI; neither option shows up on sub $2000 hardware.

Our storage back-end is robust. The EVA on the back end can pitch data at amazing speeds. It is very nice to see rates like that, even on volumes that have had 12 months of heavy usage to fragment the bejebers out of them. We can scale this system out a lot further before running into caps.

We need to verify if our backup software solution is compatible with TSAFS. We use BackupExec for Windows, and that answer is not quite clear yet. BackupExec for NetWare is already there, so at least part of the product line knows how to handle it. We need a decision on how to handle open files before progressing on that front. But TSAFS versus BEREMOTE is something I can't test easily.

TSAFS is the way to go, but it is only one piece of the puzzle.



Friday, August 27, 2004

Speeeeeeed

Backing Up Vol:   SHARE1:

Read Count: 90253
Last Read Size: 17566
Total Bytes Read: 4664767995
Raw Data MB/min: 2231.01
Backup Sets: 26424
Ave. Open Time: 000uss
Ave. Close Time: 000us
Effective MB/min: 2183.85
Holy fiber channel, Batman! I've never seen a stream go that fast. This is a TSATEST backup using TSAFS (TSA600 was giving me still-impressive numbers around 900 MB/min for this dataset), with an EVA300 back-end.



Tuesday, August 10, 2004

Lice-ridden backups

There is a reason I hate them.

I hate them because if they fail, we have a hole that we hope no one notices.

I hate them because they can always be a little faster than they have been going.

I hate them because they take too many tapes.

I hate them because the backup agents never perform the way I NEED them to.

I hate them because two otherwise identical servers can back-up at completely different speeds.

I hate them because we never have enough time to get them all in.

I hate them because we can never leverage the resources we need to get the damned things to work well.

I hate them because our backup server is a homebrew Frankenstein that probably can't keep up with the brand spanking new tape library.

And most of all, I hate them because we get to manage the backup infrastructure with no budget of our own.



Wednesday, August 04, 2004

Backup fun, again

No crashes last night! I have hopes that the newer code is doing its job. Some of the incrementals are taking way too long, so I changed one of them to Full since it's taking that long anyway. We'll see tonight how things go.

So far, so good!



Tuesday, August 03, 2004

Backup fun

The backup fun continues. Had a few more crashes last night, and this has now become a big problem.

It turns out that there are a few other folks out there who are experiencing crashes on their clusters when TSAFS is loaded. Interesting, thought I. In my pre-incident checkout, I found that Novell has some newer TSAs than what I have loaded. Rather than burn an incident to get told, "update, you idiot," I updated. We'll see how things run tonight.

http://support.novell.com/cgi-bin/search/searchtid.cgi?/2969042.htm

The TID doesn't have the list of issues, but here is an extract of some of the neat stuff this update does:
TSAFS:
7. TSAFS was limited to 1024 bytes for the path names. This limitation has been
removed.
10. Fixed a problem with remote selective restores. An error would be returned:
"Could not connect to the TSA. Check the TSA is loaded correctly
SMDR : The required entry for protocol or service is not found.
Register appropriate protocol or service."
11.Fixed some pagefault processor abends.
13. Fixed a problem where ACCESS DENIED errors were returned backing up certain
directories. This happend because the zOpen used by SMS is requesting read access to
any object and the starting point for the backup is a location that the
specified user has visibility in but no read access.
16. fixes the invalid name parameters that caused abends. This was due to invalid
data structures that were being passed into tsafs.nlm. They were in the wrong
format.
17. Fixed abends that would result if the server was in an out of memory condition.
This was caused by the tsafs.nlm read ahead cache pointer that was getting
messed up. Check the server to see if it's getting low on ram.
19. Fixed abends in tsafs doing restores

TSA600:
2. Fixes a problem with the archive bit not being cleared. This would result
in too much data being backed up.
3. Fixed a problem with the tsa600.nlm leaking memory on restores and copies
because RestoreFullPaths was not calling Free.
7. Fixed abend problems with the tsas amd smdr
10. Fixed problems with slow incremental backups. The full backups would be fast.
Only the incrementals would be slow.

SMDR:
1. IPX dependency of SMDR removed. [YAY!]
13. Smdr has been enhanced to resolve the server/cluster poolname to an ip
address thru the sys:etc\hosts file.
14. IP address mgmt support added.
17. SMDR has been modified so that it does not need to parse the SYS:ETC\HOSTS file on it own.
It uses the gethostbyname() function to do that. In addition, this method
expands its name resolution to use DNS as well.



Wednesday, June 09, 2004

Backup speeds

They managed to find the correct combination of driver and voodoo and things are now talking. He was chortling over a 700 MB/min backup speed earlier, which I fully understand. When you've been looking at backups that barely peak over 200 megs/minute, seeing one spew forth at 700 megs a minute is cause for giggling with glee. And seeing a Verify go at 1600 megs/minute is cause for laughing out loud with joy.

Depending on how things go, we may end up picking up the fibre-channel interface for the Scalar. It'll allow our backups to stream that much faster, plus allow for future expansion in the backup-heads. The server that is driving it right now is a wee bit under-powered when it comes to PCI bus, so we'll start bottlenecking well before we hit the theoretical limits on dual GigE ports.



Friday, March 12, 2004

Tonight's backup tapes have been changed around. And I found out that an abend on one of the servers will prevent backup processing until we get reboots. Not like I haven't done thousands of those over the years. Aie. This is why I don't run backup software (except for agents) on servers that have anything resembling uptime requirements.


