August 2010 Archives

The budget crisis gets deeper

We were told last week that Olympia is requiring WWU to find another 4% to cut from this fiscal year, and another 10% for next fiscal. Fortunately (?) this is within our own internal budget forecasting so we at least have a plan for dealing with it, mostly. The hard part will be the 4% right now.

This is leading to creative thinking. We've done a lot of that over the last two years but now we're scraping the bottom of the barrel. We got told late last week that Technical Services will no longer be able to use the ADMCS supply closet for office supplies and we have to make our own. There are all of 7 of us in this department, and we don't go through a lot of Post-Its, pens, and DVD blanks. What we do go through is paper and toner, because one or two of us still print off 200-600 page manuals once in a while (the rest of us just keep the PDFs around).

And yet, I just turned in a pair of hardware quotes that came to low six figures. A lot of us are confused as to why we're even bothering if money is that tight, but apparently The Powers That Be are confident that there really is money. I do know that there are different flavors of money out there; Capital Funds can't be swept to fix operational budget holes, for instance. Apparently the money for these quotes is coming out of a similarly protected fund, but I don't know what it is or how it works. It'll be nice to get that hardware as it'll keep me busy for the better part of a month.

And of course, the 10% for next fiscal is causing everyone to sweat. Technical Services hasn't had to take a layoff yet in the two rounds we've had so far, and it just might be our turn. A 10% cut to our budget is either a person, or a handful of Furlough Days.

A personal first

Yesterday I got indirectly Slashdotted. A question I answered over on ServerFault last week turned out to be my biggest earning answer of all time. Yesterday, someone submitted the question to Slashdot, and Slashdot posted it.

Before it got slashdotted I was sitting at around 42 upvotes, which beats my previous top answer on datacenter fire-suppression systems (31 upvotes). As of right now I've just about doubled that number, and there is an outside chance I might crack 100 by the end of the week. We'll see.

Interestingly, very few people seem to have clicked through to my profile and followed on to this blog, though I might have gained a new subscriber or two.

Watching the number of visits on that question (the number is on the right side of the screen once you get to the question) shows just how heavy traffic was. 25K visits (it already had 2K when the question posted) in a day is a very solid showing, though pretty weak by Slashdot standards. My MyWeb servers of old could have handled loads like that.

Windows Media Services permissions

One vexing problem that I only just solved is how to regulate who has the ability to change settings in Windows Media Services 2008. The documentation is hard to find. The clue came here:

You have to go into Component Services to change DCOM permissions. Only, you can't do that on 2008 R2. The relevant Security tab is grayed out. You can fix this:

  1. Open RegEdit
  2. Search HKLM for "Windows Media Services" in "Data", not keys or values.
  3. You should find something hiding in HKLM\Software\Classes\AppID
  4. On that CLSID, right click on the key and go to Permissions
  5. Click Advanced.
  6. Go to the Ownership tab and give your user Owner. Check the "apply to child objects" box. Apply.
  7. Go to the Permissions tab, and give your user Full Control. Apply.
  8. Open Component Services
THEN you can go to "Windows Media Services" to change the Access Permissions. Add a group here so you don't have to give local-server Administrator access to the users who want to use Media Services.
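The registry search in step 2 can also be scripted. What follows is only a rough sketch, assuming the AppID entries sit under HKLM\SOFTWARE\Classes\AppID and that the friendly name is stored in each key's default value. The function names and the sample GUIDs are made up for illustration. It only finds the key; taking ownership and changing permissions is still the manual procedure above.

```python
def matching_appids(entries, needle="Windows Media Services"):
    """Return the AppID key names whose default value contains `needle`.

    `entries` is an iterable of (key_name, default_value) pairs.
    """
    needle = needle.lower()
    return [name for name, value in entries
            if isinstance(value, str) and needle in value.lower()]


def read_appid_entries():
    """Yield (key_name, default_value) for each AppID subkey (Windows only)."""
    import winreg  # standard library, but only present on Windows

    path = r"SOFTWARE\Classes\AppID"  # assumed location of the AppID keys
    with winreg.OpenKey(winreg.HKEY_LOCAL_MACHINE, path) as root:
        subkey_count = winreg.QueryInfoKey(root)[0]
        for index in range(subkey_count):
            name = winreg.EnumKey(root, index)
            try:
                with winreg.OpenKey(root, name) as subkey:
                    value, _value_type = winreg.QueryValueEx(subkey, "")
            except OSError:
                continue  # no default value set, or access denied
            yield name, value


# Made-up sample data so the filter can be tried without a registry.
sample = [
    ("{00000000-0000-0000-0000-000000000001}", "Windows Media Services"),
    ("{00000000-0000-0000-0000-000000000002}", "Some Other Service"),
]
print(matching_appids(sample))
```

On the server itself you would call `matching_appids(read_appid_entries())` from an elevated prompt instead of using the sample list.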

I/O Operations, what they mean

A lot of storage is rated by the I/O Operations it can handle. This is a bit counter-intuitive when MB/s is a lot easier to measure. The difference shows up when you compare 1000 2KB accesses with 1000 32KB accesses. They both pound the storage quite a bit, but the latter has a higher transfer rate. Storage bottlenecks more on random accesses than it does on raw throughput, at least with rotational media.

Figuring out the theoretical I/O operations a specific drive can handle is fairly simple. Let's take a look at a top-tier 15K RPM 6Gb/s SAS drive, the Seagate Cheetah 15K.7. The important stats are:

Average Latency: 2.0 ms
Random seek read time: 3.4 ms
Random seek write time: 3.9 ms

Note, capacity doesn't matter. The maximum theoretical I/O operations per second this drive can sustain is defined by the formula (times in milliseconds):

1000 / (average-latency+random-seek)

So for Write I/O, it can sustain 169 I/O ops, and for Read I/O it can sustain 185 I/O ops. When used in RAID configs, I/O operation capacity aggregates. Parity and mirror RAID will consume I/O ops as overhead though.

Significantly sequential I/O access patterns won't require the random-seek measurement, since the drive is just accessing the next block on the disk. In that case, the drive can pull off 500 I/O ops. As you can see, how many I/O ops a drive can actually sustain depends on the I/O access patterns.
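The arithmetic above fits in a few lines of Python. This is just an illustration using the three Cheetah 15K.7 figures quoted earlier; the function name `max_iops` is made up for the example.

```python
def max_iops(avg_latency_ms, seek_ms=0.0):
    """Theoretical I/O operations per second for one rotational drive.

    Each random operation costs one average rotational latency plus one
    seek; a purely sequential operation skips the seek (seek_ms=0).
    """
    return 1000.0 / (avg_latency_ms + seek_ms)


AVG_LATENCY_MS = 2.0   # Cheetah 15K.7 figures quoted above
READ_SEEK_MS = 3.4
WRITE_SEEK_MS = 3.9

read_iops = max_iops(AVG_LATENCY_MS, READ_SEEK_MS)
write_iops = max_iops(AVG_LATENCY_MS, WRITE_SEEK_MS)
sequential_iops = max_iops(AVG_LATENCY_MS)

print(int(read_iops), int(write_iops), int(sequential_iops))  # 185 169 500
```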

Time for a real-world example! Our EVA6100 has 80 Fibre Channel drives in it that spin at 10K RPM. After I do the above math for one drive and then multiply by 80, I get a theoretical maximum I/O Operation capacity of 11,120. After doing some monitoring of IO Ops throughout a day, I find that we can do sustained IO ops of about 8K (those Exchange Online Defrags are very hard on storage!) and occasionally burst up to about 10.5K for very short periods (5-10 seconds).

Now for our EVA4400. It has 48 7.2K RPM drives in it. Its theoretical maximum I/O op count is about 3.5K. Doing the same monitoring as I did with the 6100 I see that we have one 15 minute period where we drove 4.3K I/O operations. Above max. How is this possible? At that time the SQL backups were firing which is a significantly sequential operation since that volume is not significantly fragmented. The theoretical max for 100% sequential operations on these drives is about 11.5K. The EVA architecture introduces a significant amount of randomization even in otherwise sequential access patterns, but enough pure sequential I/O happened to gain some of the benefits.
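The 11.5K sequential figure can be reproduced if you assume that average rotational latency is the time for half a revolution. That assumption is mine, but it does give 2.0 ms at 15K RPM, which matches the Cheetah spec sheet. The function names below are made up for the sketch, and it ignores the randomization the EVA adds.

```python
def avg_rotational_latency_ms(rpm):
    """Average rotational latency: the time for half a revolution."""
    ms_per_revolution = 60_000.0 / rpm
    return ms_per_revolution / 2.0


def array_sequential_iops(drive_count, rpm):
    """Aggregate best-case IOPS when no seeks are needed at all."""
    per_drive = 1000.0 / avg_rotational_latency_ms(rpm)
    return drive_count * per_drive


# Sanity check against the 15K RPM drive: half a turn takes 2.0 ms.
print(avg_rotational_latency_ms(15_000))        # 2.0

# EVA4400: 48 drives at 7.2K RPM, 100% sequential access.
print(round(array_sequential_iops(48, 7_200)))  # 11520

# The observed 4.3K burst sits between the random and sequential ceilings.
print(3_500 < 4_300 < array_sequential_iops(48, 7_200))  # True
```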

This is why finding a Solid State Drive capable of sustaining 25,000 I/O ops all by itself (Seagate again, just an example not an endorsement) is such a major thing. ONE DRIVE can out-perform our entire EVA6100. Of course, you only get 200GB on that drive, whereas our 6100 has two orders of magnitude more storage.

As a theoretical exercise, if we filled the 4400 with those 200GB Seagate SSD drives, we'd get a maximum storage capacity of about 6.5TB, but it would theoretically be able to handle 1.2 million I/O operations. We'd hit controller limits well before we got that far. As it is right now, our controller CPU rarely moves above 25% even during busy times.
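For the curious, here is the back-of-the-envelope math behind that exercise, using only numbers already quoted in this post. The variable names are mine, and the "spindles per SSD" comparison is an extra illustration rather than anything I measured.

```python
SSD_IOPS = 25_000          # the single-SSD figure quoted above
EVA4400_BAYS = 48
EVA6100_DRIVES = 80
EVA6100_MAX_IOPS = 11_120  # theoretical maximum worked out earlier

ssd_array_iops = EVA4400_BAYS * SSD_IOPS
per_spindle_iops = EVA6100_MAX_IOPS / EVA6100_DRIVES
spindles_per_ssd = SSD_IOPS / per_spindle_iops

print(ssd_array_iops)           # 1200000
print(per_spindle_iops)         # 139.0
print(round(spindles_per_ssd))  # 180
```

In other words, it would take roughly 180 of the 6100's 10K RPM spindles to match one of those SSDs on random I/O.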

Clearing up confusion: Local System

Today I spent a bit too long explaining the different ways that Windows XP displays the security principal identified by SID S-1-5-18.

[Image: The System account when viewed from CACLS]
[Image: The System account according to Explorer]
[Image: The System account according to Services]
All three of these entities are the exact same Security entity in Windows XP. However, they have different names.

  • NT Authority\System
  • System
  • Local System
Same thing.

In a domain context they're still the same thing. The machine account is represented by "Network Service", which is the same as "NT Authority\Network Service". It has far fewer local privileges than System, but it presents the machine's credentials on the network and so has visibility in AD. It also requires a login.

Same thing, three different ways of saying it. Classic Microsoft.
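If you ever need to explain this to a script instead of a person, the trick is to compare SIDs rather than display names. Here is a minimal sketch. The alias table is hand-built from the names in this post plus the well-known SID for Network Service (S-1-5-20), and the function names are made up; a real tool would ask Windows to resolve the names instead.

```python
# Display names taken from the post; the SID is the stable identifier.
ALIASES_BY_SID = {
    "S-1-5-18": {"NT Authority\\System", "System", "Local System"},
    "S-1-5-20": {"NT Authority\\Network Service", "Network Service"},
}


def sid_for(name):
    """Return the SID a display name refers to, or None if unknown."""
    wanted = name.strip().lower()
    for sid, aliases in ALIASES_BY_SID.items():
        if wanted in {alias.lower() for alias in aliases}:
            return sid
    return None


def same_principal(name_a, name_b):
    """True when both display names resolve to the same known SID."""
    sid_a = sid_for(name_a)
    return sid_a is not None and sid_a == sid_for(name_b)


print(same_principal("System", "NT Authority\\System"))   # True
print(same_principal("Local System", "Network Service"))  # False
```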

Reverting LVM snapshots

Yesterday I learned that LVM gained the ability to revert snapshots when the 2.6.33 kernel was released. While nifty, I hadn't been under the impression that this functionality was lacking. Why that's so is interesting in and of itself.

My first exposure to a copy-on-write filesystem that allowed snapshots was Novell's NSS filesystem. We never used it in production but I did play around with it a fair amount, and it was useful in the terminal stages of migrating off of NetWare. It was very nifty. And also lacked the ability to revert.

Then I started using VMware Workstation a lot. It has a snapshot ability built into it. It even does a kind of copy-on-write in the form of the differencing disks it uses to support snapshots. Snapshots are golden in that they allow you to undo things to VMs. Repeatedly installing something that can only install once? A snapshot right before you do the install will allow you to back out of it and retry. And a whole bunch of other very nifty usage scenarios. Of COURSE a snapshot facility should have a revert.

Of course, reverting an entire filesystem means that all data changed since the last snapshot is now gone. That's kind of the point. But it does raise the question, what kind of use-cases exist where such a thing is actually desired?
  • If you have a VM product that doesn't do snapshots inside of it (some Xen versions), this is a way to fake it.
  • If you need to take an 'instant backup' before a large application install, this allows a way to return to a known config without having to tar/restore an entire filesystem. Potentially very good time savings.
  • If you have a Time Machine backup going to a dedicated partition and the Time Machine archive keeps getting mysteriously deleted every month or two, this is a way to get the old archive back without having to do a full backup, and it keeps the backup history.
And others, I'm sure. What's the command to revert a snapshot? It's hiding in lvconvert:

lvconvert --merge volumeGroup/TMArchive_snap

Very simple. You'll also need a newer LVM Tools package in order to get the functionality in lvconvert.
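To show the whole round trip (take the snapshot, then merge it back), here is a small Python wrapper sketch. The origin volume name `TMArchive` and the `5G` snapshot size are my assumptions for illustration; only the `lvconvert --merge` line comes from above. By default it just prints the commands, since actually running them needs root and a real volume group.

```python
import subprocess


def snapshot_cmd(volume_group, origin_lv, snapshot_name, size):
    """Build the lvcreate call that snapshots origin_lv."""
    return ["lvcreate", "--snapshot", "--name", snapshot_name,
            "--size", size, f"{volume_group}/{origin_lv}"]


def merge_cmd(volume_group, snapshot_name):
    """Build the lvconvert call that merges the snapshot back into its origin."""
    return ["lvconvert", "--merge", f"{volume_group}/{snapshot_name}"]


def run(cmd, dry_run=True):
    """Print the command; only execute it when dry_run is False (needs root)."""
    print(" ".join(cmd))
    if not dry_run:
        subprocess.run(cmd, check=True)


# Hypothetical names and size; adjust to your own volume group.
run(snapshot_cmd("volumeGroup", "TMArchive", "TMArchive_snap", "5G"))
run(merge_cmd("volumeGroup", "TMArchive_snap"))
```

Keep in mind that if the origin volume is still open (mounted, say), the merge is deferred until the next time the volume is activated.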

Trusting your admins

Every so often I see or hear from environments where the top level executives demand absolute privacy. They don't want the SysAdmins to have any access into their data. This can happen in tiny little shoestring-IT non-profits and large companies.

In short, they don't trust the people they've given the keys to the kingdom to behave ethically. I can understand this in the shoestring-IT non-profit where the SysAdmin is most likely a volunteer. But in larger corporations where the SysAdmin is a paid position? I don't buy that.

In Microsoft-land, 'Administrator' is very similar to 'root' in Unix-land in that they can get anywhere. Novell allowed locking out Admin, but Microsoft and Unix don't. Admin/root can always get places if they really want to. Doing the same in a Microsoft environment generally requires a completely separate authentication/administration domain.

You need to trust your system administrators. If you don't trust them to not poke their nose into things that are not directly business related, then you need new system administrators. Professional ethics say that I don't go perusing through confidential budget deliberation documents so I can get advance notice of impending budget cuts so I can start spending my budget now. Or digging in my boss's email to figure out who is being considered for layoff lists. That's BOFH stuff, and we don't do that for a reason.

If Management finds out that they have a SysAdmin who has been doing that kind of thing, they are perfectly within their rights to fire their ass. For cause. They will not get an office 'farewell' party.

One of the harder things for newer sysadmins to grasp is the concept of, "Just because you are able to see information does not imply you are permitted to." Yes, I read other people's email, but only when troubleshooting specific problems or when I've been invited in.

And yet... sensitive information leakage from a company comes from privileged users more than it does normal users, in large part due to the privileged users having access to more company data. It makes some sense to firewall off certain documents from your regular IT staff.

This can still be handled with correct IT rights structures. You shouldn't have umpteen people in Domain Admins. Ideally you should have the absolutely trusted few in there, and everyone else granted rights to their specific areas. We've built tools that allow proxying specific Domain Admin tasks to people who aren't in Domain Admins, just so we can keep that membership low. There are three of us who can get anywhere in the Microsoft environment, and a slightly larger list of people who can get to any file in the Microsoft environment (that list includes the Domain Admins and a few mid-level managers in the Desktop organization). It's hard to do out of the box, which is one of the weaknesses of AD/Windows.

Lazy IT is what allows every person who needs to join computers to the domain to be put into Domain Admins. Lazy IT is what grants helpdesk technicians unrestricted access into every mailbox. Lazy IT is a major information security threat. Lazy IT is what drives CxOs to want to firewall themselves from IT for privacy reasons.

Don't be Lazy IT. Have professional ethics, and ensure the sysadmins you hire also have them.

One of the trickier things we're dealing with these days is multi-function devices. Specifically, copiers that can email or save-to-server PDF/BMP/JPG/TIFF images of documents. You'd think this would be easy, but no.

On the one hand, we COULD just let these devices email blindly. Which would allow anonymous users to send butt-scans to the University President, a thing we generally try to avoid.

Or we could configure them with a user ID and password to send authenticated SMTP messages, and still butt-scan bomb the University President.

Ideally everyone would have a specific login on these devices so there would be a full audit trail. This kind of thing can be done, but there are caveats. Swipe-card systems require all the devices to use the same back-end processor, and probably mean all the devices come from the same company. We can't use our universal login and passwords since that would require a full keyboard on these devices, and that just isn't going to happen any time soon.

The solution we've come up with isn't a good one, but it's still better than banning the enhanced functionality of these devices. Scanning can indeed reduce the amount of paper we push around. We're disabling the butt-scan vulnerability and forcing potential butt-scanners to drop the scans on a specific file-share where they'll have to forward it from a real email account.

I suspect "email my PDF" will be disabled until such time as we get a card-swipe system, or some other way to individually authenticate copier users.