January 2010 Archives

Evolving best-practice

As of this morning, everyone's home-directory is now on the Microsoft cluster. The next Herculean task is to sort out the shared volume. And this, this is the point where past-practice runs smack into both best-practice, and common-practice.

You see, since we've been a NetWare shop since, uh, I don't know when, we have certain habits ingrained into our thinking. I've already commented on some of it, but that thinking will haunt us for some time to come.

The first item I've touched on already, and that's how you set permissions at the top of a share/volume. In the Land of NetWare, practically no one has any rights to the very top level of the volume. This runs contrary to both Microsoft and Posix/Unix ways of doing it, since both environments require a user to have at least read rights to that top level for anything to work at all. NetWare got around this problem by creating traverse rights based on rights granted lower down the directory structure. Therefore, giving a right 4 directories deep gave an implicit 'read' to the top of the volume. Microsoft and Posix both don't do this weirdo 'implicit' thing.
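To make that concrete, here is a toy model in Python (no real filesystem APIs, and the names are made up) showing why a deep grant made the NetWare volume root visible, while the same grant on NTFS or POSIX leaves it dark:

```python
# Toy model contrasting NetWare-style implicit traverse rights with the
# NTFS/POSIX requirement for explicit rights on every ancestor directory.
# None of this is a real API; it only illustrates the two behaviors.

def netware_can_reach(grants, user, path):
    """NetWare: a grant anywhere at or below `path` gives the user
    implicit visibility of `path` itself (the traverse right)."""
    return any(u == user and (g == path or g.startswith(path + "/"))
               for u, g in grants)

def ntfs_can_reach(grants, user, path):
    """NTFS/POSIX: the user needs a grant covering `path`; a grant on an
    ancestor inherits downward, but a deep grant never reaches upward."""
    return any(u == user and (path == g or path.startswith(g + "/"))
               for u, g in grants)

# One grant, four directories deep.
grants = [("alice", "vol1/dept/projects/BIOL1234")]

# NetWare lets alice see the volume root; NTFS does not.
print(netware_can_reach(grants, "alice", "vol1"))  # True
print(ntfs_can_reach(grants, "alice", "vol1"))     # False
```

Which is exactly the problem: the same grant that let a user browse down from the volume root on NetWare gets them nothing at the top of an NTFS share.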

The second item is the fact that Microsoft Windows allows you to declare a share pretty much anywhere, while NetWare limited the 'share' to the volume. This changed a bit when Novell introduced CIFS to NetWare, as that introduced the ability to declare a share anywhere; however, NCP networking was still root-of-volume only. At the same time, Novell's 'map root' allowed you to pretend there is a share anywhere, but it isn't conceptually the same. The side-effect of being able to declare a share anywhere is that, if you're not careful, Windows networks suffer from rampant share-proliferation.

In our case, past-practice has been to restrict who gets access to top-level directories, greatly limit who can create top-level directories, and generally grow more permissive/specific rights-wise the deeper you get in a directory tree. Top level is zilch, first tier of directories is probably read-only, second tier is read/write. Also, we have one (1) shared volume upon which everyone resides for ease of sharing.

Now, common-practice among Microsoft networks is something I'm not that familiar with. What I do know is that shares proliferate, and many, perhaps most, networks have the shares as the logical equivalent of what we use top-level directories for. Where we may have a structure like this, \\cluster-facshare\facshare\HumRes, Microsoft networks tend to develop structures like \\cluster-facshare\humres instead. Microsoft networks rely a lot on browsing to find resources. It is common for people to browse to \\cluster-facshare\ and look at the list of shares to get what they want. We don't do that.

One thing that really gets in the way of this model is Apple OSX. You see, the Samba version on OSX machines can't browse cluster-shares. If we had 'real' servers instead of virtual servers, this browse-to-the-resource trick would work. But since we have a non-trivial number of Macs all over the place, we have to pay attention to the fact that all a Mac sees when it browses to \\cluster-facshare\ is a whole lot of nothing. We're already running into this, and we only have our user-directories migrated so far. We have to train our Mac users to enter the share name as well. For this reason, we really need to stick to the top-level-directory model as much as possible, instead of the more commonly encountered MS model of shares. Maybe a future Mac Samba version will fix this. But 10.6 hasn't fixed it, so we're stuck for another year or two. Or maybe until Apple shoves Samba 4 into OSX.

Since we're on a fundamentally new architecture, and can't use common-practice, our sense of best-practice is still evolving. We come up with ideas. We're trying them out. Time will tell just how far our heads are up our butts, since we can't tell from here just yet. So far we're making extensive use of advanced NTFS permissions (those permissions beyond just read, modify, full-control) in order to do what we need to do. Since this is a deviation from how the Windows industry does things, it is pretty easy for someone who is not completely familiar with how we do things to mess things up out of ignorance. We're doing it this way due to past-practice and all those Macs.

In 10 years I'm pretty sure we'll look a lot more like a classic Windows network than we do now. 10 years is long enough for even end-users to change how they think, and is long enough for industry-practice to erode our sense of specialness into a more compliant shape.

In the meantime, as the phone ringing off the hook today foretold, there is a LOT of learning, decision-making, and mind-changing to go through.

Storage tiers

Events have pushed us to give a serious look at cheaper storage solutions. What's got our attention most recently is HP's new LeftHand products. That's some nice looking kit, there. But there was an exchange there that really demonstrated how the storage market has changed in the last two years:

HP: What kind of disk are you thinking of?
US: Oh, probably mid tier. 10K SAS would be good enough.
HP: Well, SAS only comes in 15K, and the next option down is 7.2K SATA. And really, the entire storage market is moving to SAS.

Note the lack of Fibre Channel drives. Those, it seems, are being deprecated. Two years ago the storage tiers looked like this:
  1. SATA
  2. SAS
  3. FC
Now the top end has been replaced.
  1. SATA
  2. SAS
  3. SSD
We don't have anything that requires SSD-levels of performance. Our VMWare stack could run quite happily on sufficient SAS drives.

Back in 2003 when we bought that EVA3000 for the new 6 node NetWare cluster, clustering required shared storage. In 2003, shared storage meant one of two things:
  1. SCSI and SCSI disks, if using 2 nodes.
  2. Fibre Channel and FC Disks if using more than 2 nodes.
With 6 nodes in the cluster, Fibre Channel was our only choice. So that's what we have. Here we are 6+ years later, and our I/O loads are very much mid-tier. We don't need HPC-level I/O ops. CPU on our EVA controllers rarely goes above 20%. Our I/O is significantly randomized, so SATA is no good. But we need a lot of it, so SSDs become prohibitive. Therefore SAS is what we should be using if we buy new.

Now if only we had some LTO drives to back it all up.

Migrating knowledge bases

This morning we moved the main class volume from NetWare to Windows. We knew we were going to have problems with this since some departments hadn't migrated key groups into AD yet, so the rights-migration script we wrote just plain missed bits. Those have been fixed all morning.

However, it is becoming abundantly clear that we're going to have to retrain a large portion of campus Desktop IT in just what it means to be dealing with Windows networking. We'd thought we'd done a lot of it, but it turns out we were wrong. It doesn't help that some departments had delegated 'access control' rights to professors to set up creative permissioning schemes; this morning the very heated calls were coming in from the professors, not the IT people.

There are two things that are tripping people up. One has been tripping people up on the Exchange side since forever, but the second one is new.
  1. In AD, you have to log out and back in again for new group-memberships to take effect.
  2. NTFS permissions do not grant the pass-through right that NSS permissions do. So if you grant a group rights to \Science\biology\BIOL1234, members of that group will NOT be able to pass through Science and Biology to get to BIOL1234.
We have a few spots here and there where for one reason or another rights were set at the 2nd level directories instead of the top level dirs. Arrangements like that just won't work in NTFS without busting out the advanced permissions.

An area where we haven't had problems yet, but I'm pretty certain we will, is places where rights are granted and then removed. With NSS that could be done two ways: an Inherited Rights Filter, or a direct trustee grant with no permissions. With NTFS the only way to do that is to block rights inheritance, copy the rights you want, and remove the ones you don't. That sounds simple, but here is the case I'm worried about:


  • At 'HumRes' the group grp.hr is granted 'Read', and the HR director is granted Modify directly on their user (bad practice, I know, but it's real-world).
  • At 'JobReview' the group grp.hr.jobreclass is granted 'Modify'.
  • At 'VPIT' inheritance is blocked and the rights are copied.
  • At 'JohnSmith' the HR user AngieSmith is granted a Deny right due to a conflict of interest.

Time passes. The old director retires, the new director comes in. The IT person gets informed that the new director can't see everything even though they have Modify to the entire \HumRes tree. That IT person will come to us and ask, "WTH?" and we will reply with, "Inheritance is blocked at that level; you will need to explicitly grant Modify for the new director on that directory."
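Here's a toy model of that scenario (no real NTFS API; the walk logic is simplified) showing how the inheritance block eats the new director's top-level grant:

```python
# Toy model of the inheritance-block sleeper issue. A grant at the top
# of the tree never reaches a directory where inheritance was blocked
# and the old rights were copied in at blocking time.

def effective_rights(tree, path):
    """Walk from the root to `path`, accumulating inherited grants.
    A node marked 'block' discards everything inherited so far and
    keeps only its own explicit (copied) grants."""
    rights = {}
    parts = path.split("/")
    for i in range(len(parts)):
        node = tree.get("/".join(parts[: i + 1]), {})
        if node.get("block"):
            rights = {}
        rights.update(node.get("grants", {}))
    return rights

tree = {
    "HumRes": {"grants": {"OldDirector": "Modify", "grp.hr": "Read"}},
    "HumRes/JobReview": {"grants": {"grp.hr.jobreclass": "Modify"}},
    # Inheritance blocked here; rights copied as they stood at the time.
    "HumRes/JobReview/VPIT": {
        "block": True,
        "grants": {"OldDirector": "Modify", "grp.hr": "Read"},
    },
}

# The new director is granted Modify at the top of the tree...
tree["HumRes"]["grants"]["NewDirector"] = "Modify"

print(effective_rights(tree, "HumRes/JobReview")["NewDirector"])          # Modify
# ...but the blocked directory never sees it:
print("NewDirector" in effective_rights(tree, "HumRes/JobReview/VPIT"))   # False
```

The copied grants still name the old director, so everything looks fine until the person behind the account changes.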

So this is a bit of a sleeper issue.

Meanwhile, we're dealing with a community of users who know in their bones that granting access to 'JohnSmith' means they can browse down from \HumRes to that directory just on that access-grant alone. Convincing them that it doesn't work that way, and working with them to rearrange directory structures to accommodate that lack will take time. Lots of time.

A fluff piece

Much has been made in certain circles about the lack of a right-side shift key on certain, typically Asian designed, keyboards. This got me thinking. So I took a look at my own keyboards. The one I'm typing on right now at work has obvious signs of wear, where the textured black plastic has been worn smooth and shiny. Also, some letters are missing. What can I learn by looking at the wear on my keyboard?
  • I use the left-shift key almost exclusively.
  • I use both thumbs on the space bar, with a slight preference for my right thumb.
  • The text on the M and C keys is completely erased, as is the entire left-hand home row, plus the U and O keys.
  • The right Ctrl and Alt keys show almost no sign of use.
Now you know. And I'm a lefty. It shows.

Like many people my age, I learned to type on those old IBM clicky keyboards. I don't miss those keyboards, but it does mean I tend to use more force per key-press than I strictly need to. Especially if I'm on a roll with something and let my fingers do the driving. I don't think I could use one of those old keyboards any more; the noise would get to me. I make enough noise as it is, I don't need people two offices down to hear how fast I type.

The things you learn

We had cause to learn this one the hard way this past week. We didn't know that Windows Server 2008 (64-bit) and Symantec Endpoint Protection just don't mix well. It affected SMBv1 clients; SMBv2 clients (Vista, Win7) were unaffected.

The presentation of it at the packet-level was pretty specific, though. XP clients (and Samba clients) would get to the second step of the connection setup process for mapping a drive and time out.

  1. -> Syn
  2. <- Syn/Ack
  3. -> NBSS, Session Request, to $Server<20> from $Client<00>
  4. <- NBSS, Positive Session Response
  5. -> SMB, Negotiate Protocol Request
  6. <- Ack
  7. [70+ seconds pass]
  8. -> FIN
  9. <- FIN/Ack
Repeat two more times, and 160+ seconds later the client times out. The timeouts between the retries are not consistent, so the time it takes varies. Also, sometimes the server issues the correct "Negotiate Protocol Response" packet and the connection continues just fine. There was no sign in any of the SEP logs that it was dropping these connections, and the Windows Firewall was quiet as well.

In the end it took a call to Microsoft. Once we got to the right network person, they knew immediately what the problem was.

ForeFront is now going on those servers. It really should have been on a month ago, but because these cluster nodes were supposed to go live for fall quarter, they were fully staged up in August, before we even had the ForeFront clients. We never remembered to replace SEP with ForeFront.

NTFS and fragmentation

I've known for a while that filesystem fragmentation can seriously affect NTFS performance. This isn't just run of the mill, "frag means using random access patterns for what should be sequential I/O," performance degradation. I'm seeing it on volumes backed by EVA arrays which purposely randomize I/O to the disk spindles. Clearly something in the meta-data handling degrades significantly when frag gets to a certain point.

Today I found out why that is.


As the number of fragments increases, the MFT has to track more and more of them. Once the number of fragments exceeds what can be stored directly in the MFT record, NTFS starts adding indirection layers to track the file's extents.

If you have a, say, 10GB file on your backup-to-disk system, and that file has 50,000 fragments, you are absolutely at the 'stage 4' listed in that blog post. Meta-data operations on that file, such as tracking down the next extent to read from if you're doing a restore or copy, will be correspondingly more expensive than a 10GB file with 4 fragments. At the same time, attempting to write a large file that requires such massive fragmentation in turn requires a LOT more meta-data operations than a big-write on an empty filesystem.
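A crude way to see the shape of the problem (the capacity figure below is made up, not real MFT geometry) is to count how many metadata records have to be chased just to map the file:

```python
# Illustrative model only: EXTENTS_PER_RECORD is an invented round
# number, not NTFS's actual record layout. It shows why mapping a badly
# fragmented file touches far more metadata than a contiguous one.

EXTENTS_PER_RECORD = 100  # pretend one MFT record can map ~100 extents

def metadata_records(fragments):
    """Rough count of MFT records to chase to map a whole file. Once
    the extent list spills out of the base record, every additional
    record is another indirection to follow."""
    needed = -(-fragments // EXTENTS_PER_RECORD)  # ceiling division
    return 1 + max(0, needed - 1)

# A 4-fragment file fits in its base record...
print(metadata_records(4))       # 1
# ...a 50,000-fragment backup-to-disk file emphatically does not.
print(metadata_records(50_000))  # 500
```

Every one of those extra records is metadata I/O that happens before a single byte of the file's actual data gets read or written.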

And this, boys and girls, is why you really really really want to avoid large fragmentation on your NTFS-based backup-to-disk directories. Really.

The costs of backup upgrades

Our tape library is showing its years, and it's time to start moving the mountain required to get it replaced with something. So this afternoon I spent some quality time with google, a spread-sheet, and some oldish quotes from HP. The question I was trying to answer is what's the optimal mix of backup to tape and backup to disk using HP Data Protector. The results were astounding.

Data Protector licenses backup-to-disk capacity by the amount of space consumed in the B2D directories. You have 15TB parked in your backup-to-disk archives, you pay for 15TB of space.

Data Protector has a few licenses for tape libraries. There are costs for each tape drive over 2, another license for libraries with between 61-250 slots, and another for unlimited slots. There is no license for fibre-attached libraries, as BackupExec and others have.

Data Protector does not license per backed up host, which is theoretically a cost savings.

When all is said and done, DP costs about $1.50 per GB in your backup to disk directories. In our case the price is a bit different since we've sunk some of those costs already, but they're pretty close to a buck fiddy per GB for Data Protector licensing alone. I haven't even gotten to physical storage costs yet, this is just licensing.

Going with an HP tape library (easy for me to spec, which is why I put it into the estimates), we can get an LTO4-based tape library that should meet our storage growth needs for the next 5 years. After adding in the needed DP licenses, the total cost per GB (uncompressed, mind) is on the order of $0.10 per GB. Holy buckets!

Calming down some: taking our current backup volume and apportioning the price of the largest tape library I estimated over that volume, the price rises to $1.01/GB. Which means that as we grow our storage, the price-per-GB drops, as less of the infrastructure is being apportioned to each GB. That's a rather shocking difference in price.
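To make the apportionment math concrete, here's a quick sketch. The library's fixed cost below is an assumed round number, not the real quote; the $1.50/GB disk figure is the one from above:

```python
# Back-of-the-envelope sketch with assumed numbers (the real quotes
# differ): per-GB cost of Data Protector backup-to-disk licensing
# versus a tape library whose fixed cost is spread over backup volume.

B2D_LICENSE_PER_GB = 1.50  # DP backup-to-disk licensing, per GB parked

def tape_cost_per_gb(volume_gb, library_fixed_cost=150_000):
    """Fixed library + DP tape-license cost spread over the volume.
    `library_fixed_cost` is an assumed figure, not a real quote."""
    return library_fixed_cost / volume_gb

# As the backed-up volume grows, tape's per-GB price falls.
# Disk's per-GB license cost never does.
for vol in (150_000, 500_000, 1_500_000):  # GB
    print(f"{vol / 1000:>6.0f} TB: tape ${tape_cost_per_gb(vol):.2f}/GB "
          f"vs disk ${B2D_LICENSE_PER_GB:.2f}/GB")
```

The disk-side licensing scales linearly with what you park there; the tape-side infrastructure is mostly a fixed cost you amortize. That's the whole argument in two lines of arithmetic.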

Clearly, HP really really wants you to use their de-duplication features for backup-to-disk. Unfortunately for HP, their de-duplication technology has some serious deficiencies when presented with our environment so we can't use it for our largest backup targets.

But to answer the question I started out with, what kind of mix should we have, the answer is pretty clear. As little backup-to-disk space as we can get away with. The stuff has some real benefits, as it allows us to stage backups to disk and then copy to tape during the day. But for long term storage, tape is by far the more cost-effective storage medium. By far.

Looking for a new laptop

I've been needing a new laptop for a while now. The left mouse button on the trackpad is getting a bit deaf, and my batteries for it just died a couple weeks ago. I've been waiting for the Arrandale launch for some time, as I wanted both more power and less power usage in my new laptop. The specs for it are pretty simple:
  • No smaller than 15" screen. My eyes are beginning to get old.
  • Vertical screen resolution no smaller than 800px.
  • 2GB RAM minimum
  • The option of an add-in graphics card or integrated (I won't be gaming with this thing)
  • A wireless card with good Linux support
  • 4 hours of independently benchmarked battery life
  • 7200 RPM disk options
  • Core i5-mobile
The above laptop doesn't quite exist. There are plenty in the 16+" category that all have primo graphics cards, 4-hour battery life if they're lucky, and cost over $1200. There are a few in the desktop-replacement category, which don't have such hot-shot graphics cards but are still pretty expensive. Then there are the 'ultra-portables', which never come in a screen size larger than 14".

Unfortunately for me, Intel released some ASUS laptops to benchmarkers a while back. Intel lifted the benchmark embargo earlier this week and the results were disappointing. More processing power, absolutely. Less juice sucked, not so much. Or more specifically, the processor performed the same amount of work for about the same power requirement, but performed that work faster. Since the work-per-battery-charge ratio is not changing, this new processor is not going to give us really performant laptops that can run for 6 hours. At least not without improved battery tech, that is.

So, very sadly, I just may have to put off this purchase until this summer. That's when the power-enhanced versions of these chips will drop. At that point, laptop makers will start making laptops I want in sufficient quantities to allow competition.

Desktop virtualization

Virtualizing the desktop is something of a rage lately. Last year when we were still wondering how the Stimulus Fairy would bless us, we worked up a few proposals to do just that. Specifically, what would it take to convert all of our labs to a VM-based environment?

The executive summary of our findings: It costs about the same amount of money as the normal regular PC and imaging cycle, but saves some labor compared to the existing environment.

Verdict: No cost savings, so not worth it. Labor savings not sufficient to commit.

Every dollar we saved in hardware in the labs was spent in the VM environment. Replacing $900 PCs with $400 thin clients (not their real prices) looks cheap, but when you're spending $500/seat on ESX licensing/Storage/Servers, it isn't actually cheaper. The price realities may have changed from 12 months ago, but the simple fact remains that the stimulus fairy bequeathed her bounty upon the salary budget to prevent layoffs rather than spending on swank new IT infrastructure.
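The arithmetic really is as simple as it sounds. Using the (not-their-real-prices) figures from above:

```python
# Per-seat cost comparison using the illustrative prices from the text
# (explicitly not the real quotes): the thin-client hardware savings
# are eaten entirely by the VM infrastructure behind it.

PC_SEAT = 900            # traditional lab PC, per seat
THIN_CLIENT = 400        # thin-client hardware, per seat
VM_INFRA_PER_SEAT = 500  # ESX licensing + storage + servers, per seat

vdi_seat = THIN_CLIENT + VM_INFRA_PER_SEAT
print(vdi_seat)            # 900
print(vdi_seat - PC_SEAT)  # 0 -- no hardware savings at all
```
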

The labor savings came in the form of a unified hardware environment minimizing the number of 'images' needing to be worked up. This minimized the amount of time spent changing all the images in order to install a new version of SPSS for instance. Or, in our case, integrating the needed changes to cut over from Novell printing to Microsoft printing.

This is fairly standard for us. WWU finds it far easier to commit people resources to a project than financial ones. I've joked in the past that $5 in salary is equivalent to $1 cash outlay when doing cost comparisons. Our time-management practices generally don't allow hour-by-hour accounting for changed business practices.

On living as Root

Yesterday on Slashdot, one of the Ask Slashdot questions was: " In your experience, do IT administrators abuse their supervisory powers?"

That's a good question. BOFH humor aside, it has been my experience that the large majority of us don't do so intentionally. Most of what happens is the petty stuff that even regular helpdesk staff do, like take home enterprise license keys. We shouldn't do that, and licensing technology is improving to the point where such pilferage is becoming a lot easier to detect; at some point Microsoft will blacklist some large org's enterprise key for having been pirated and woe unto the IT department that lets that happen.

But what about IT administrators?

First, IT Administrators come in many types. But I'll focus on my own experiences living with enhanced privs. As it happens, I've spent the large majority of my IT career with a user account with better than average privs.

File Access

I can see everything! One of the harder things to keep in mind is what files I can see as me, and what files I can see in my role as sysadmin. This can be hard, especially when I'm rooting about out of curiosity. We still add my user to groups even though I can see everything, and I consciously limit myself to only those directories those groups have access to when privately data-mining. You want this. This is one of the hardest things for a new sysadmin to get used to.

With my rights it is very easy for me to pry into HIPAA-protected documents, confidential HR documents, labor-relations negotiation documents, and all sorts of data. I don't go there unless directed as part of the normal execution of my duties, such as setting access controls, troubleshooting inaccessible files, and restoring data.

I haven't met any sysadmins who routinely spelunk into areas they're not supposed to. They are out there, sadly enough, but it isn't a majority by any stretch.

Email Access

I read your email, but only as part of my duties. Back when we were deploying the Symantec Brightmail anti-spam appliances I read a lot of mail tagged as 'suspect'. I mean, a lot of it. It took a while to tune the settings. Even just subject-lines can be damning. For instance, the regular mails from Victoria's Secret were getting flagged as 'suspect', so anyone who ordered from them and used their work account as the email account was visible to me. A BOFH would look for the male names, print out the emails, and post them on the office bulletin board for general mockery. Me? I successfully forgot who got what.

One gray area is the 'plain view' problem. If I'm asked to set or troubleshoot Outlook delegates on a specific mailbox, I have to open their mailbox. During that time certain emails are in plain view as I navigate to the menu options I need to go to in order to deal with delegates. Some of those emails can be embarrassing, or downright damning. So far I don't officially notice those mails. Very happily, I've yet to run into anything outright illegal.

Another area that has me looking for specific emails is phishing. If we identify a phishing campaign, the Exchange logs are very good at identifying people who responded to it. I then take that list and look for specific emails in specific mailboxes to see what exactly the response was. While this also has the plain-view problem described above, it does allow us to identify people who gave legitimate password info, and those replying with derision and scorn (a blessed majority). Those that reply with legitimate login info get Noticed.

Internet Monitoring

This varies a LOT from organization to organization. WWU doesn't restrict internet access except in a few cases (no outbound SMTP, no outbound SMB), so we're not a good example. My old job was putting into place internet blockers and an explicit login before access to the internet was granted, which allowed very detailed logs to be kept on who went where. As it happened, IT was not the most privileged group; that honor was held by the Attorney's office.

While IT was restricted, I knew the firewall guys; they worked two cubes down. So if I needed to access something blocked, I could walk down the hall and talk to them. I'd have to provide justification, of course, but it'd generally get granted. The fact that I was one of the people involved with Information Security, and had helped them build the filters, undoubtedly helped in this.

But the Slashdot questioner does make a good point. Such IT sites do generally get let through the filters. I strongly suspect this is because the IT users are very close to the managers setting filtering policy so are able to make the convincing, "but these sites are very useful in support of my job," arguments. Sites such as serverfault and stackoverflow are very useful for solving problems without expensive vendor contracts. Sites supporting the function of non-IT departments are not so lucky.

Whether or not the grand high IT Admins get unrestricted access to the internet depends a LOT on the organization in question. My old place was good about that.

Firewall Exceptions

This is much more of a .EDU thing, since we're vastly more likely to have a routable IPv4 address on our workstation than your non-educational employers. In smaller organizations, where your server guys are the same guys who run the network, the good-ole-boys network comes into play and exceptions are much more common. For larger orgs like ours, where server-admin and network-admin are split out, it depends on how buddy-buddy the two are.

This is one area where privilege hath its perks.

As it happens, I do have the SSH port on my workstation accessible from the internet. The firewall guys let me have that exception because I also defend servers with that exception, and therefore I know what I'm doing. Also, it allows me into our network in case either VPN server is on the fritz. And considering that I manage one of the VPN servers, having a back-door in is useful.

Other areas

Until a couple weeks ago the MyWeb service this blog used to be served from was managed by me. Which meant I got to monitor the log files for obvious signs of abuse. Generally, if something didn't break the top 10 accessed files, I officially didn't notice. If a specific file broke 25% of total traffic, I had to take notice. Sometimes those files were obviously fine (home-shot video, pre-YouTube); others (MP3 archives, DIVX movies) were not so innocent.

One day the user in question was a fellow IT admin. This was also the first time I saw staff doing this, so the protocols were non-existent. What I did was print off the report in question, circle the #1 spot they occupied, and wrote a note that said, in brief:
If this had been a student, the Provost would have been notified and their accounts suspended. The next time I'll have to officially notice.
And then put it on their chair. It never happened again.

Another area is enterprise software licenses. I mentioned that at the top of this post, but as more and more software gets repackaged for point-n-click distribution, fewer and fewer IT staff need to know these impossible-to-memorize numbers. Also helping this trend is the move towards online license servers, where packages (I'm thinking of SPSS right now) need to be told what the license server is before they'll function; you can't take something like that home with you.

Things like Vista or Windows 7 activation codes are another story, but Microsoft has better tracking than they did in the XP days. If you activate our code on your home network, Microsoft will notice. The point at which they'll take action is not known, but when they do, all the Vista or Windows 7 machines we have will start throwing the, "your activation code has been reported as stolen, please buy a new one," message.

The software industry as a whole is reconfiguring away from offline activation keys, so this 'perk' of IT work is already going by the way-side. Yes, taking them home has always been illegal in the absence of a specific agreement to allow that. And yet, many offices had an unofficial way of not noticing that IT staff were re-using their MS Office codes at home.

That got long. But then, there are a lot of facets to professional ethics and privilege in the IT Admin space.