Migration headaches

Today marked the day we cut over the largest volume we have, the Fac/Staff shared volume. That's 1.9TB of 'disorganized file data' (a.k.a. bog standard file-server) to migrate. This is the last of the major volumes to move, and we saved it for last intentionally. Because of this, we have our system down by now. Unfortunately, a wrench was thrown into the works. But before I get to the wrench, a description of how we migrated this puppy from NetWare.

1. M-18 days: we perform an initial sync of the data via robocopy.
2. M-17 days: when the first sync completes (it took about 29 hours), we perform a delta sync.
3. M-16 days: we perform another delta sync, 24 hours after the previous one, so we can get a feel for how long a daily 'copy the changed files' job will take.
4. M-16 days: create a daily copy job (robocopy source dest /mir /r:1 /xo /log:e:somewhere).
5. M-14 days: we perform the rights migration, and open up the new share to everyone with sufficient rights to change permissions on the volume. Tell these people to fix broken rights on the Microsoft share.
6. M-12 days: after feedback from the techs, release guidance on how to reorganize directories to work better with Microsoft permissions.
7. M-12 to M-1 days: technicians reorganize data and repermission as needed, with our assistance.
8. M-12 hours: we do a delta sync.
9. Migration: change login scripts, kick off the terminal delta sync to get the net change.
10. M+2 hours: 8am arrives, the script is done, we are done. Yay! Start working problems as reported.
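
The daily copy job can be wrapped in a small script. What follows is a minimal Python sketch, not the script used in this migration; the function names are hypothetical. It builds the same robocopy command line as above and interprets robocopy's exit code, which is a bit mask where any value of 8 or higher means at least one file or directory failed to copy.

```python
import subprocess


def build_robocopy_command(source, dest, log_path, retries=1):
    """Return the robocopy argument list for a mirror-style delta sync."""
    return [
        "robocopy", source, dest,
        "/MIR",              # mirror the tree, including deletions
        f"/R:{retries}",     # number of retries per failed file
        "/XO",               # skip source files older than the destination copy
        f"/LOG:{log_path}",  # write output to a log file
    ]


def sync_succeeded(exit_code):
    """Robocopy exit codes are a bit mask; 8 and above mean copy failures."""
    return 0 <= exit_code < 8


def run_sync(source, dest, log_path):
    """Run the sync (Windows only) and report whether every file copied."""
    result = subprocess.run(build_robocopy_command(source, dest, log_path))
    return sync_succeeded(result.returncode)
```

Checking the exit code this way matters because a nonzero code is not necessarily a failure: 1 just means files were copied.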

The problem occurred between steps 8 and 9. One department decided that migration-night was the perfect time to reorganize over 150GB of data. They would have struggled to find a worse time for it. The result of this is that the terminal delta-sync in step 9 will end up taking far, far longer than the 2 hours budgeted.
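
A rough back-of-envelope estimate shows why the budget broke. This sketch assumes the terminal sync moves data at about the same average rate as the initial 29-hour copy of 1.9TB, and that all 150GB of reorganized data has to be copied again; both are assumptions, and real delta-sync throughput may differ.

```python
def estimate_copy_hours(changed_gb, baseline_gb=1900.0, baseline_hours=29.0):
    """Estimate copy time from the average rate of the initial full sync."""
    rate_gb_per_hour = baseline_gb / baseline_hours
    return changed_gb / rate_gb_per_hour


hours = estimate_copy_hours(150)
print(f"{hours:.1f} hours")  # prints "2.3 hours"
```

Under those assumptions, that one department's changes alone use up the whole 2-hour window before anyone else's changes are copied.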

The problem here is that when people start logging in at 8am, not all of their data is there. Some people worked right up until the M-12 hour mark reorganizing data and were surprised when it wasn't on the new system yet. These people were alphabetically below the department that moved 150GB of data last night, so they hadn't been synced yet. So they're seeing and working with old files while the new ones copy in.

The worry for me is PST and MDB files, which tend to be open all day. The copy script will not be able to replace these open files, so their owners will in effect experience data-loss because of this department. There is not much we can do about that. We can trawl through the log file for the files listed as failed-to-copy-due-to-lock and hand-copy them afterwards, after clearing the locks, in which case they'll lose whatever data they committed to these files during the morning. So these files? There WILL be data-loss, guaranteed.
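
Pulling the locked files out of the log can be scripted. This is a minimal sketch that assumes robocopy's usual error-line layout, a timestamp followed by `ERROR 32 (0x00000020) Copying File <path>`, where 32 is the Windows sharing-violation code. The function name is hypothetical, and the pattern should be checked against a real log before trusting it.

```python
import re

# Assumed robocopy error-line layout for a file locked by another process.
LOCKED = re.compile(r"ERROR 32 \(0x00000020\) Copying File (.+)$")


def locked_files(log_lines):
    """Return each locked file path once, in the order first seen."""
    paths = {}
    for line in log_lines:
        match = LOCKED.search(line.rstrip())
        if match:
            # A retried file is logged more than once; keep one entry.
            paths.setdefault(match.group(1), None)
    return list(paths)
```

The resulting list is the set of files to hand-copy once the locks are cleared.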

The other problem we ran into is that one department set up their rights to lock us godlike admins out of certain directories, something you can do on Microsoft filesystems since there is no equivalent to Novell's "Supervisor" trustee right. We didn't notice this until step 9, when the log files filled up with 'access denied' errors. Each of those triggers a 30-second wait before the retry, which further delayed execution of the terminal sync script. Obviously, those files will not get synced.
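
In hindsight, a pre-flight walk of the tree under the migration account might have caught this before migration night. The sketch below is one hypothetical way to do that, not something we ran. It relies on `os.walk` reporting directories it cannot list through its `onerror` callback, so it only finds directories that cannot be listed, not individual files with restrictive permissions.

```python
import os


def find_unreadable_dirs(root):
    """Walk root and return (path, reason) for each directory that cannot be listed."""
    problems = []

    def record(err):
        # os.walk passes the OSError raised while listing a directory.
        problems.append((err.filename, err.strerror))

    for _dirpath, _dirnames, _filenames in os.walk(root, onerror=record):
        pass  # We only care about the errors collected along the way.
    return problems
```

An empty result means every directory could be listed by the account running the script.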

I hate hate hate it when this kind of thing happens.

4 Comments

natxo asenjo | February 26, 2010 11:59 PM

it sure sounds like you must have been busy :-)

I just have one question: why did you not isolate the (old) file server (in a vlan or with firewall rules, for instance) so that your clients could not interfere with the migration?

SysAdmin1138 replied to comment from natxo asenjo | February 27, 2010 7:35 AM

A 'cold' migration, where we kick everyone out until the job is done, would have taken between 28 and 36 hours to complete. Upper management was not willing to approve that kind of downtime. There are two times in a year when we can pull a stunt like that: summer break (late Aug to late Sept) and winter break (Christmas). We were left with a 'hot' migration, where we have to do most of the work with users in the system.

Our entire migration strategy was designed to minimize how much downtime we actually had to incur, and it has worked really really well for all the other volumes we've done (over 3TB of data). So when one department reorganized over 150GB of data on migration night, it blew our carefully crafted plans right out of the water.

mjohnsonabg | December 18, 2012 9:15 AM

Hello

Looking for the scripts that you used in this migration. Can you point me in the right direction?

Thank you

SysAdmin1138 replied to comment from mjohnsonabg | December 19, 2012 5:28 AM

We wrote them ourselves. I wrote one, and a coworker wrote the other. I'm not certain I still have the sync-script, or the one that compiles trustee data, but I definitely don't have the one that takes trustee.nlm output and converts it into icacls calls.
