Downtime

We've had some curious events happen on the cluster lately. The situation described here is pretty close to what we've had happen. The MyFiles service for students (a.k.a. NetStorage for you non-WWU people) has been crashing lately, forcing abends. Apparently our nodes have learned a new trick related to this: they flush a bit of I/O to the mounted volumes when we leave the abend screen. This is bad, since by the time we leave the abend screen the volumes have already been moved to other servers, and an error gets thrown.

So we have a chance of random file-system corruption! Whee! That means we need to fsck/PoolRebuild the things, and that takes time. We did some last night during our normal Tuesday night maintenance window (I be tired), but didn't get to all of it. As ATUS just mailed out, we'll be finishing it off Friday night. Last night's fun didn't discover anything significant, just the normal file-system entropy of a few corrupted file names and some files missing their parent links [1].

Friday night starting at midnight, the U: drives for two thirds of our students will go away. This being a holiday weekend, I expect police-dispatch (our off-hours 'helpdesk') to get only a couple of calls as a result. We're also doing the big shared volume on the Fac/Staff side (our largest single volume), and that'll probably be down from midnight to pretty close to the 10am mentioned in the mail.

[1] NSS is different from POSIX file-systems in that each node has both child and parent links. This is nifty in that it makes inherited permissions easier to implement. The files in question could still be accessed since their parent node, a directory, had a child-link to them. If that child-link had also been missing, they'd have been truly orphaned files.
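
For the curious, here's a rough sketch of the distinction in that footnote. This is a toy model in Python, not how NSS actually stores anything on disk: the `Node` class, the `classify` function, and the status labels are all made up for illustration. It just shows why a file that lost its parent link is still reachable as long as its directory still lists it, and what "truly orphaned" means by comparison.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Node:
    """Toy file-system node with links in both directions (hypothetical model)."""
    name: str
    parent: Optional[str] = None                        # parent link: name of the containing directory
    children: List[str] = field(default_factory=list)   # child links: only directories have these


def classify(nodes: Dict[str, Node], root: str) -> Dict[str, str]:
    """Label each node by which of its two links survived."""
    # Every name that some directory still lists as a child.
    listed = {child for node in nodes.values() for child in node.children}

    status: Dict[str, str] = {}
    for name, node in nodes.items():
        if name == root:
            status[name] = "root"
        elif node.parent is not None and name in listed:
            status[name] = "ok"
        elif name in listed:
            # Still reachable by walking down from the directory.
            status[name] = "missing parent link"
        elif node.parent is not None:
            # Not mentioned in the post; included only to cover the remaining case.
            status[name] = "missing child link"
        else:
            status[name] = "orphaned"
    return status


nodes = {
    "/": Node("/", children=["docs"]),
    "docs": Node("docs", parent="/", children=["a.txt", "b.txt"]),
    "a.txt": Node("a.txt", parent="docs"),
    "b.txt": Node("b.txt"),   # parent link lost, but docs still lists it
    "c.txt": Node("c.txt"),   # no parent link and no directory lists it
}
result = classify(nodes, "/")
```

In this made-up example, `b.txt` is the kind of file last night's rebuild turned up (parent link gone, still accessible through its directory), while `c.txt` is what a truly orphaned file would look like.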