June 2009 Archives

Super users

| 1 Comment
Having been a 'super user' for most of my career, I don't have the same perspective other people do when it comes to interacting with corporate IT. Because of what I do, I see everything. That's part of my job, so I have to know it's all there.

However, how each company handles their elevated privilege accounts varies. Some of it depends on what system you're working in, of course.

Take a Windows environment. I see three big ways to handle the elevated user problem:
  1. One Administrator account, used by all admins. Each admin has a normal user account, and logs in as Administrator for their adminly work.
    • Advantages: Only one elevated account to keep track of.
    • Disadvantages: Complete lack of auditing if there is more than one admin around. Also, unless said admin has two machines, or has a VM for adminly work, they're logged in as Administrator more often than they're logged in as themselves.
  2. One Administrator account, and each admin's normal user account is elevated to Administrator-level rights. Administrator itself is relegated to a glorified utility account, useful for backups, other automation, or if you need to leave a server logged in for some reason.
    • Advantages: Audit trail. Changes are done in the name of the actual admin who performed the change.
    • Disadvantages: These users really need to be exempted from any Identity Management system. Since there are only going to be a few of them, this may not matter. Also, these users need to treat these passwords like the Administrator password.
  3. Each admin gets two accounts, normal and elevated. As with the above, Administrator is a glorified utility account, but each admin gets two accounts: a normal account for everyday use (me.normal) and an elevated account (me.super) for functions that need that kind of access.
    • Advantages: Provides an audit trail, and allows the admin's normal account to be subject to identity management safely. Easy availability of the 'normal' account allows faster troubleshooting of permissions issues (which are hard to check when you can see everything).
    • Disadvantages: Admin users are juggling two accounts again, with the same problems as option 1.
I personally haven't seen the third option in actual use anywhere, even though that's my favorite one. Unixy environments are a bit different. The ability to 'sudo' seems to be the key determiner of elevated access, with ultimate trust granted to those who learn the root password outright. Sudo is the preferred method of doing elevated functions due to its logging capability.
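
For the curious, here is roughly what provisioning an option-3 account pair could look like on the Windows side, using PowerShell and ADSI. The OU, the account name, and the group are placeholders I made up for illustration, and the password handling is hand-waved; treat it as a sketch, not a recipe.

# Sketch: create the elevated twin (option 3) of an existing normal account.
# "OU=Admins", "jdoe.super", and the group below are placeholders for whatever your AD uses.
$OU = [ADSI]"LDAP://OU=Admins,DC=example,DC=edu"
$Super = $OU.Create("user", "CN=jdoe.super")
$Super.Put("sAMAccountName", "jdoe.super")
$Super.SetInfo()

$Super.SetPassword("pick-something-long-and-random")   # treat this like the Administrator password
$Super.psbase.InvokeSet("AccountDisabled", $false)
$Super.SetInfo()

# Grant the elevation through group membership; the real Administrator stays a utility account.
$Admins = [ADSI]"LDAP://CN=Domain Admins,CN=Users,DC=example,DC=edu"
$Admins.Add($Super.psbase.Path)

The point is that the elevated account is a second, personal account, so the audit trail still names the human, while the normal account stays safely inside whatever identity-management system you run.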

What other methods have you seen in use?

Changes are coming

| 7 Comments
Due to technical reasons I'll be getting to in a moment, this blog will be moving off of WWU's servers in the next few weeks. I have high confidence that the redirects I'll be putting in place will work and keep any existing links to the existing content ultimately pointing at its new home. In fact, those of you reading by way of the RSS or Atom feeds won't even notice. Images I link in will probably load a bit slower(+), and that's about it.

And now for the technical reasons. I've been keeping it under my hat since it has politics written all over it and I so don't go there on this blog. But WWU has decided (as of last September actually) that they're dropping the Novell contract and going full Microsoft to save money. And really, I've seen the financials. Much as it pains this red heart, the dollars speak volumes. It really is cheaper to go Microsoft, to the tune of around $83,000. In this era of budget deficits, that's most of an FTE. Speaking as the FTE most likely to get cut in this department, that makes it kind of personal.

Microsoft? The cheap option?

Yes, go fig. But that's how the pricing is laid out. We were deep enough into the blue beast already (Exchange, MS-SQL, an embryonic but present and growing SharePoint, Office on every Windows desktop) that going deeper wasn't much of an extra cost per year. To put it even more bluntly, "Novell did not provide enough value for the cost."

The question of what's happening to our SLES servers is still up for debate. We could get those support certificates from Microsoft directly. Or buy them retail from Novell. I don't know what we're doing there.

Which means that we're doing a migration project to replace the WUF 6-node NetWare cluster with something on Windows that does the same things. NetStorage is the hardest thing to replace (I know I'm going to miss it); the file-serving and printing are challenging but certainly manageable. The "myweb" service will continue, served by a LAMP server with the home directories Samba-mounted to it, so it will still be Apache. It could have been done with IIS, but that would have been an ugly hack.

As soon as we get hardware (7/1 is when the money becomes available) we'll be hitting the fast phase of the project. We hope to have it all in place by fall quarter. We'll still maintain the eDirectory replica servers for the rest of the Novell stuff on campus that is not supported (directly) by me. But for all intents and purposes, Technical Services will be out of the NetWare/OES business by October.

OH MY GOD! YOU'RE LEAVING! THAT'S WHY YOU'RE MOVING THE BLOG!

No, no. That's not the reason I'm moving this blog. Unfortunately for this blog, there was exactly one regular user of the SFTP service we provided(*). Me. So that's one service we're not migrating. It could be done with cygwin's SSH server and some cunning scripting to synchronize the password database in cygwin with AD, if I really wanted to. But... it's just me. Therefore, I need to find an alternate method for Blogger to push data at the blog.

Couple that with some discreet hints from some fellow employees that just maybe, perhaps, a blog like mine really shouldn't be run from Western's servers, and you have another reason. Freedom of information and publish-or-perish academia notwithstanding, I am staff, not tenured faculty. Even with that disclaimer at the top of the blog page (that you RSS readers haven't seen since you subscribed) that says I don't speak for Western, what I say unavoidably reflects on the management of this University. I've kept this in mind from the start, which is why I don't talk about contentious issues the University is facing in any terms other than how they directly affect me. It's also why I haven't mentioned the dropping of the Novell contract until now, when it is effectively written in stone.

So. It's time to move off of Western's servers. The migration will probably happen close to the time we cut over MyWeb to the new servers. Which is fitting, really, as this was the first web-page on MyWeb. This'll also mean that this blog will no longer be served to you by a NetWare 6.5 server. Yep, for those that didn't know, this blog's web-server is Apache2 running on NetWare 6.5.

(+) Moving from a server with an effective load-average of 0.25 to one closer to 3.00 (multi-core, though) does make a difference. Also, our pipes are pretty clean relatively speaking.

(*) Largely because when we introduced this service, NetWare's openssh server relied on a function in libc that liked to get stuck and render the service unusable until a reboot. MyWeb was also affected by that. That was back in 2004-06. The service instability drove users away, I'm sure. NetStorage is more web-like anyway, which users like better.

It happens

Someone burnt a bagel this morning. It got smokey enough to trigger the alarms, so the building evacuated. It happened before I got to the office, so when I got here the facilities guys were wrestling the biiiiig fans into the hallways to get a good cross-breeze going to cut down the burnt smell. Now everyone who comes in is dropping by asking what burned ;). According to my office-mate who was here when it happened, a burnt bagel produces a surprising amount of smoke.

This particular toaster oven is a complete loss. Not surprising.

We missed a step or something in decommissioning our Exchange 2003 servers. As a result, we have a whole lot of... stuff going 'unresolvable' due to how Outlook and Exchange work. There is an attribute on users and groups called LegacyExchangeDN. Several processes store this value as the DN of the object, and if that object was created in Exchange 2003 (or earlier) it's set to a location that no longer exists.

The fix is to add an X500 address to the object. That way, when the resolver attempts to resolve that DN it'll turn up the real object. So how do you add an X500 address to over 5000 objects? Powershell!

# Grab the distribution groups to check (ours are all named grp.*).
$TargetList = Get-DistributionGroup "grp.*"

foreach ($target in $TargetList) {
    $DN     = $target.SamAccountName
    $Leg    = $target.LegacyExchangeDN
    $Email  = $target.EmailAddresses
    $Has500 = 0

    # Only objects whose LegacyExchangeDN still points at the old Exchange 2003 location need fixing.
    if ($Leg -eq "/O=WWU/OU=WWU/cn=Recipients/cn=$DN") {
        # Check whether the matching X500 proxy address is already there.
        foreach ($addy in $Email) {
            if ($addy -eq [Microsoft.Exchange.Data.CustomProxyAddress]("X500:" + $Leg)) {
                $Has500 = 1
            }
        }
        if ($Has500 -eq 0) {
            # Add the old DN as an X500 proxy address so lookups against it resolve again.
            $Email += [Microsoft.Exchange.Data.CustomProxyAddress]("X500:" + $Leg)
            $target.EmailAddresses = $Email
            Set-DistributionGroup -Instance $target
            Write-Host "$DN had X500 added" -ForegroundColor Cyan
        } else {
            Write-Host "$DN already had X500 address" -ForegroundColor Red
        }
    } else {
        Write-Host "$DN has correct LegacyExchangeDN" -ForegroundColor Yellow
    }
}


It's customized for our environment, but it should be obvious where you'd need to change things to get it working for you. When doing users, use "get-mailbox" and "set-mailbox" instead of "get-distributiongroup" and "set-distributiongroup". It's surprisingly fast.

My contribution to the community!

ForeFront and spam

ForeFront has an option to set a custom X-header to indicate spam. The other options are subject-line markup and quarantine on the ForeFront servers. What they never document is what the header gets set to. As it happens, if the message is spam it gets set like this:
X-WWU-JunkIt: This message appears to be spam.
Very basic. And not documented. Now that we know what it looks like we can create a Transport Rule that'll direct such mail to the junk folder in Outlook. Handy!
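
For example, a rule built from the Exchange Management Shell should do it. This is a sketch from memory rather than our production rule: the predicate and action names ("HeaderContains", "SetScl", and their properties) are how I remember the Exchange 2007 cmdlets working, so check them against Get-TransportRulePredicate and Get-TransportRuleAction before trusting it. The idea is to stamp a high SCL on anything carrying that header; anything above the org's junk threshold lands in Outlook's Junk folder.

# Sketch only: match the ForeFront header and set a high SCL on the message.
$Cond = Get-TransportRulePredicate HeaderContains
$Cond.MessageHeader = "X-WWU-JunkIt"
$Cond.Words = @("This message appears to be spam.")

$Act = Get-TransportRuleAction SetScl
$Act.SclValue = 9    # anything over the SCLJunkThreshold gets junked client-side

New-TransportRule -Name "ForeFront-tagged spam to Junk" -Conditions @($Cond) -Actions @($Act)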

Why ext4 is very interesting

The new ext4 file-system is one I'm very interested in, especially since the state of reiserfs in the Linux community's gestalt is so in flux. Unlike ext3, it does a lot of things nicer:
  • Large directories. It can now handle very large directories. One source says unlimited directory sizes, and another less reliable source says sizes up to 64,000 files. This is primo for, say, a mailer. GroupWise would be able to run on this, if one was doing that.
  • Extent-mapped files. More of a feature for file-systems that contain a lot of big files rather than tens of millions of 4k small files. This may be an optional feature. Even so, it replicates an XFS feature which makes ext4 a top contender for the MythTV and other linux-based PVR products. Even SD video can be 2GB/hour, HD can be a lot larger.
  • Delayed allocation. Somewhat controversial, but if you have very high faith in your power environment, it can make for some nice performance gains. It allows the system to hold off on block allocation until a process either finishes writing, or triggers a dirty-cache limit, which further allows a decrease in potential file-frag and helps optimize disk I/O to some extent. Also, for highly transient files, such as mail queue files, it may allow some files to never actually hit disk between write and erase, which further increases performance. On the down side, in case of sudden power loss, delayed writes are simply lost completely.
  • Persistent pre-allocation. This is a feature XFS has that allows a process to block out a defined size of disk without actually writing to it. For a PVR application, the app could pre-allocate a 1GB hunk of disk at the start of recording and be assured that it would be as contiguous as possible. Certain bittorrent clients 'preallocate' space for downloads, though some fake it by writing 4.4GB of zeros and overwriting the zeros with the true data as it comes down. This would allow such clients to truly pre-allocate space in a much more performant way.
  • Online defrag. On traditional rotational media, and especially when there isn't a big chunk of storage virtualization between the server and the disk such as that provided by a modern storage array, this is much nicer. This is something else that XFS had, and ext is now picking up. Going back to the PVR again, if that recording pre-allocated 1GB of space and the recording is 2 hours long, at 2GB/hour it'll need a total of 4 GB. In theory each 1GB chunk could be on a completely different part of the drive. An online defrag allows that file to be made contiguous without file-system downtime.
    • SSD: A bad idea for SSD media, where you want to conserve your writes and where fragmentation carries no performance penalty anyway. Hopefully, the e2fsprogs tools used to make the filesystem will be smart enough to detect SSDs and set features appropriately.
    • Storage Arrays: Not a good idea for extensive disk arrays like the HP EVA line. These arrays virtualize storage I/O to such an extent that file fragmentation doesn't cause the same performance problems a simple 6-disk RAID5 array would experience.
  • Higher resolution time-stamps. This doesn't yet have much support anywhere, but it allows timestamps with nanosecond resolution. This may sound like a strange thing to put into a 'nifty features' list, but this does make me go "ooo!".

Also, as I mentioned before, it looks like openSUSE 11.2 will make ext4 the default filesystem.

IPv6 and the PCI DSS standards

| 2 Comments

The Payment Card Industry Data Security Standard (PCI DSS) applies to a couple of servers we manage. In those standards is section 1.3.8, which reads:

Implement IP masquerading to prevent internal addresses from being translated and revealed on the Internet, using RFC 1918 address space. Use network address translation (NAT) technologies—for example, port address translation (PAT).

With the testing procedure listed as:

For the sample of firewall and router components, verify that NAT or other technology using RFC 1918 address space is used to restrict broadcast of IP addresses from the internal network to the Internet (IP masquerading).

Which is sound practice, really. But we're running into a problem here that may become more of an issue once IPv6 gets deployed more widely. We're a University that received its netblock back when they were still passing out Class B networks to the likes of us (140.160.0.0/16 in case you care). IPv4 address starvation is not something we experience. Because of this, NAT and IP-Masq have very little presence on our network.

We also believe in firewalls. Just because the address of my workstation is not in an RFC 1918 netblock doesn't mean you can get uninvited packets to me. This is even more the case for the servers that handle credit-card data.

It is my belief that the intent of this particular standard-line is to prevent scouting of internal networks in aid of directed penetration attempts. Another line that should probably be in this standard to support this would be something similar to:
Implement separate DNS servers for public Internet and Internal usage, and prevent public Internet access to the internal DNS servers.
Because the DNS servers we use internally are the same ones listed in the Name Server records for the WWU.EDU domain, you can do a lot of recon of our internal networks from home. We don't allow zone transfers, of course, but enough googling around our various sites and reverse-IP-lookups will reveal the general structure of our network, such as which subnets contain most of our servers and which are behind the innermost firewalls.

This is a long way of saying that our IPv4 network functions a lot like the network envisioned when IPv6 was first ratified. Because of this, we're running into some problems with the PCI standards that IPv6 will probably run into as well.

Take the requirement to have the PCI-subject servers on RFC 1918 addresses. RFC 1918 only applies to IPv4; IPv6's equivalent is RFC 4193, so the standard will have to be modified to mandate RFC 4193 addresses for IPv6. Until that happens, the strictest reading means no PCI server can move to pure IPv6. Servers that have both IPv4 and v6 addresses on them are an interesting case, where the v4 address may be an RFC 1918 address but the v6 address is NOT private. To my knowledge, the standards are unclear on this topic.
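
For reference, the IPv4 half of that test is easy enough to script if you ever need to audit which side of the line an address falls on. A quick PowerShell sketch; the function name is mine, nothing standard:

# Sketch: does an IPv4 address fall inside the RFC 1918 private ranges?
function Test-Rfc1918 {
    param([System.Net.IPAddress]$IP)
    $b = $IP.GetAddressBytes()
    ($b[0] -eq 10) -or
    ($b[0] -eq 172 -and $b[1] -ge 16 -and $b[1] -le 31) -or
    ($b[0] -eq 192 -and $b[1] -eq 168)
}

Test-Rfc1918 "140.160.20.5"   # False -- an example address in our public class B, hence the headache
Test-Rfc1918 "10.20.30.40"    # True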

We had to create NAT gateways for our PCI servers, and create RFC1918 addresses for them just for PCI compliance. The NAT gateway is behind the innermost firewall. These are our only servers behind a NAT gateway of any kind.

In the beginning, IPv6 expressly did NOT have NAT; it was designed to get rid of NAT. However, in recent years the case for IPv6 NAT has been pressed, and there is movement to get something like that working. In my opinion, a lot of that push was to allow NAT to continue as an obscuration-gateway (or low-cost 'firewall' of sorts) between internal resources and external hostile actors. I strongly suspect that when the PCI DSS standard gets its IPv6 update, it will continue to mandate some form of IP masquerade.

Not done yet

We just received another email from President Shepard. He has been informed by the governor that the state revenue forecast for 2009-11 is being revised downwards again, and to expect reductions at some point. No specifics were given, but they'll come later.

Ah, nuts.

At least they're predicting that we've hit the trough. That's something, anyway.

But still, this'll result in more layoffs. Ick.

Ah, yes. Size

From the ISC SANS Diary:
Explaining Defense in Depth
Once an organization reaches a certain size, you end up with a situation where separate groups are responsible for firewalls, IDS, anti-virus, etc. Often these groups will not share a common chain of command.
Ayep. WWU is large enough we have entirely separate functions doing firewalls and network access controls, and server ACLs and patching. This creates some interesting meetings when functions clash, like, say, Network Access Control that requires a client/agent on each desktop. Happily, we all report to the same CIO, and the managers of each department talk to each other regularly.

This is part of the natural progression of IT departments. In The Beginning, One Guy Does It All. From creating patch cables, keeping the firewall running, patching servers, to fixing broken Outlook installs, and maybe also fixing broken fax machines. Add a couple of people and you start getting some specialization. Then an actual manager gets involved to act as interface between the tech-nerds and the end users.

The first specialty to cleave off of the IT blob is likely to be either a desktop/everything-else or network/everything-else, depending on the people in the IT blob. The second specialty to cleave off will be either network or desktop, whichever didn't cleave first. By this point, you may even have six people in IT; a manager, a network guy who probably also does phones, a server guy, two helpdesk-types, and a jack-of-all-trades of some kind who may do programming as well.

Poof, growth happens. Depending on the business, a dedicated programming section may form. The helpdesk will formalize into a call-out center. Another server-jock or two will be picked up. The phones-and-network section will pick up more people, but not as many as the helpdesk. This kind of thing will happen to any company.

When a dedicated security management center arrives depends on the industry and corporate culture. WWU doesn't have one yet, outside of University Police. Banks have had them for decades. The security manager may be outside of the IT stack completely, and have their own IT assets. Or the security manager will be a subordinate of the CIO.

Which means that when a company of a certain larger size decides to try and implement a security-related something that crosses major functional groups, it's a good test of inter-departmental communication! This kind of project can really help illuminate where the fundamental communications breakdowns are. Don't forget ownership issues! If the Desktop Support group 'owns' the desktop, and the security office wants to roll out a new asset-inventory agent on all desktop and server assets, it can cause pissing matches. Oh yes.

But, we're pretty good at talking with each other here. Anything that firmly crosses functional groups like a NAC does will still have project delays due to the need for cross-talk, but we do get working on it. There is a happy lack of personal grudges between the managers of each of our functional groups.

End of an era

From internal email:
As some of you may know, the WWU dial-up modem Student and Faculty/Staff pools are being discontinued on Monday, June 29, 2009.
That's right. We're getting out of the modem business. We still had a couple regular users. If I remember right, the modem gear we had was bought with student-tech-fee money a LONG time ago and was beginning to fail. And why replace modem gear in this age? We're helping the few modem users onto alternate methods.

Like many universities, we were the sole ISP in town for many years. That changed when the Internet became more than just the playground of universities and college kids, but it was true for a while. The last vestige of that goes away on the 29th.

Historical data-center

As I've mentioned several times here, our data-center was designed and built in the 1999-2000 time frame. Before then, Technical Services had offices in Bond Hall up on campus. The University decided to move certain departments that had zero to very little direct student contact off of campus as a space-saving measure. Technical Services was one of those departments. As were Administrative Computing Services, Human Resources, Telecom, and Purchasing.

At that time, all of our stuff was in the Bond Hall data-center and switching room. This predates me (December 2003), so I may be wrong on some of this stuff. That's a tiny area, and the opportunity to design a brand new data-center from scratch was a delightful one for those who were here to partake of it.

At the time, our standard server was, if I've got the history right, the HP LH3. Like this:
An HP LH3
This beast is 7U high. We were in the process of replacing them with HP ML530's, another 7U server, when the data-center move came, but I'm getting a bit ahead of myself. This means that the data-center was planned with 7U servers in mind. Not the 1-4U rack-dense servers that were very common at that time.

Because the 2U flat-panel monitor and keyboard drawers for rack-dense racks were so expensive, we decided to use plain old 15-17" CRTs and keyboard drawers in the racks themselves. These take up 14U. But that's not a problem!

A 42U rack can take 4x of those 7U servers, and one of the 14U monitor/keyboard combinations for a total of...42U! A perfect fit! The Sun side of the house had their own servers, but I don't know anything about those. With four servers per rack, we put a Belkin 4-port PS/2 KVM switch in each (USB was still too newfangled in this era; our servers didn't really have USB ports in them yet). As I said, a perfect fit.

And since we could plan our very own room, we planned for expansion! A big room. With lots of power overhead. And a generator. And a hot/cold aisle setup.

Unfortunately... the designers of the room decided to use a bottom-to-top venting strategy for the heat. With roof mounted rack fans.
Rack fans

And... solid back doors.

Rack back doors

We got away with this because we only HAD four servers per rack, and those servers were dual-processor 1GHz servers. So we only had 8 cores running in the entire rack. This thermal environment worked just fine. I think each rack probably drew no more than 2kW, if that much.

If you know anything about data-center air-flow, you know where our problems showed up when we moved to rack-dense servers in 2004-8 (and a blade rack). We've managed to get some fully vented doors in there to help encourage a more front-to-back airflow. We've also put some air-dams on top of the racks to discourage over-the-top recirculation.

And picked up blanking panels. When we had 4 monster servers per rack we didn't need blanking panels. Now that we're mostly on 1U servers, we really need blanking panels. Plus a cunning use of plexi-glass to provide a clear blanking panel for the CRTs still in the racks.

And now, we have a major, major budget crunch. We had to fight to get the fully perforated doors, and that was back when we had money. Now we don't have money, and still need to improve things. We're not baking servers right now, but temperatures are such that we can't raise the temp in the data-center very much to save on cooling costs. Doing that will require spending some money, and that's very hard right now.

Happily, rack-dense servers and ESX have allowed us to consolidate down to a lot fewer racks, where we can concentrate our good cooling design. Those are hot racks, but at least they aren't baking themselves like they would with the original kit.

Editing powershell scripts in VIM

Vim, the vi replacement on openSUSE, has a downloadable PowerShell syntax file.

vim with syntax coloring for powershell

The fact that powershell scripting has gotten enough traction that someone felt the need to code up a powershell syntax file for vim is very interesting. I'm perfectly willing to reap the rewards.

We go through paper

Spring quarter is done, and grades are turned in. Time to take a look at student printing counts.

Things are changing! Printing costs do keep rising, and for a good discussion of the realities check this article from ATUS News. The numbers in there are from Fall quarter. Now I'll give some Spring quarter numbers (the departmental breakdown is different from the article's, since I'm personally not sure which labs belong to which department):

Numbers are from 3/29/09-6/17/09 (8:17am)
Total Pages Printed: 1,840,186
ATUS Labs: 51.5%
Library Printers: 22.9%
Housing Labs: 14%
Comms Bldg: 6.8%
Average pages printed per user: 152
Median pages printed per user: 112
Number of users who printed anything: 12,119
Number of users over 500 pages: 292
Number of users over 1000 pages: 7
"Over quota" pages printed: 181,756

Students get a free quota of 500 pages each quarter. Getting more when you run out is as simple as contacting a helpdesk and getting your quota upped. As you can see, 2.4% of users managed to account for 9.9% of all pages printed.

It is a common misconception that the Student Tech Fee pays for the 500 pages each student gets. In actuality, that cost is eaten by each of the departments running a student printer, which means that printing has historically been a gratis benefit. There is a proposal going forward that would change things so you'd have to pay $.05/page for upping your quota. The top print-user of this quarter would have had to pay a bit over $69 in additional quota if we were using the pay system. If ALL of the users who went over 500 pages had to pay for their additional quota, that would have generated a bit under $9,088 in recovered printing costs.
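
For the curious, the math behind those numbers is nothing fancier than this. The per-user counts really come out of PCounter; the 1,880-page figure is a hypothetical stand-in for our top printer.

# Quarter totals from above, at the proposed $.05/page overage rate.
$FreeQuota = 500
$PerPage   = 0.05
$OverQuotaPages = 181756

"Recovered if all overage were paid: {0:C2}" -f ($OverQuotaPages * $PerPage)    # a bit under $9,088

# A single heavy user: a hypothetical 1,880 pages printed in the quarter.
$Printed = 1880
"Owed by this user: {0:C2}" -f ([math]::Max(0, $Printed - $FreeQuota) * $PerPage)   # about $69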

Of course, if added print quota were a cost-item, it would depress that overage activity. Another curve-ball is the desire of certain departments to offer color printing, while also recapturing their printing costs for those printers. So far we haven't done a good job of that, but we are in the process of revising the printing system to better handle it. It'll require some identity-management code changes to handle the quota management process differently, but it is at least doable.

Once you start involving actual money in this transaction, it opens up a whole 'nother can o' worms. First off, cash handling. Our help-desks do not want to be in the business of dealing with cash, so we'll need some kind of swipe kiosk to handle that. Also, after-hours quota-bumps will need to be handled, since our helpdesks are not 24/7 and sometimes students run out of quota at 6am during finals week with an 8am class.

Second, accounting. Not all labs are run by ATUS. Some labs are run by departments that charge lab-fees that ostensibly also support lab printing. Some way to apportion recovered printing costs to departments will have to be created. Those departments that charge lab-fees may find it convenient to apply a portion of those fees to their student's page-quota.

Third, extra-cost printing like color. PCounter allows printers to be configured to not allow the use of 'free quota' for printing. So in theory, things like color or large-format printers could be configured to only accept paid-for quota. These devices will have a cost higher than $.05/page, perhaps even $.25/page for color, or much higher for the LFPs. It could also mean some departmental lab-managers decide to stop accepting free quota in their labs and go charge-only.

Fourth, carry-over of paid quota. If a student forks over $20 for added quota, that should carry over until it is used. Also, it is very likely that they'll demand a refund of any unspent print-quota on graduation. Some refund mechanism will also need to be created.

I don't know at what step the proposal is, all I know is that it hasn't been approved yet. But it is very likely to be approved for fall.

I want my SSH

| 5 Comments
It seems that anything that runs over TCP can be tunneled over HTTPS. This is not a good thing when it comes to network security. As an example of the kind of whack-a-mole this can result in, I give you the tale of SSH. With a twist.

Power User: Surfs web freely. It is good.

Corporate Overlords: "In the interests of corporate productivity, we will be blocking certain sites." Starts blocking Facebook, MySpace, Twitter, and all sorts of other popular time-wasters like espn.com.

Power User: Is thwarted. Hunts up an open proxy server on the net. Surfs Facebook in the clear again. It is good.

Corporate Overlords: Informs network security office, creating it from scratch if need be, that it has come to their attention that the blocks are being circumvented, and that This Needs To Stop. Make it no longer so.

Network Security: "Yessir, will do sir. Will need funds for this."

Corporate Overlords: Supplies funds.

Network Security: Adds known-proxies to the firewall block list.

Power User: Is thwarted. Googles harder, finds an open proxy not on the list. Unrestricted internet access restored. It is good.

Network Security: Subscribes to a service that supplies open-proxy block lists.

Power User: Is thwarted. Googles harder. Can't find accessible open proxy anywhere. Decides to make their own. Downloads and installs Squid on their home Linux server. Connects to home server over SSH, tunnels to squid over SSH. Unrestricted internet access restored. It is good.

Network Security: Notices uptick in TCP/22 traffic. Helpdesk tech gets busted for surfing YouTube while on the job. Machine dissection reveals SSH tunnel. Blocks TCP/22 at the router.

Power User: Is thwarted. When home next, moves SSH port to TCP/8080. Gets to work, uses TCP/8080 for SSH session. Unrestricted internet access restored. It is good.

Corporate Overlords: "In the interests of productivity, instant messaging clients not on the corporate approved lists are now banned."

Power User: Is not affected. Continues surfing in the clear. It is good.

Corporate Overlords: "In the interests of productivity, all unapproved non-HTTP off-network internet access is now banned."

Power User: Is thwarted. Moves SSH to TCP/80. Unrestricted internet access restored. It is good.

Network Security: Implements deep packet inspection on the firewall to make sure TCP/80 and TCP/443 traffic really is HTTP.

Power User: Is thwarted. Spends a week, gets SSH-over-HTTP working. Unrestricted internet access restored. It is good.

Network Security: Implements mandatory HTTP proxy, possibly enforcing it via WCCP.

Power User: Is thwarted, cache mucks up ssh sessions. Moves to HTTPS. Unrestricted internet access restored. It is good.

Network Security: Subscribes to a firewall block-list that lists broadband DHCP segments. Blocks all unapproved access to these IP blocks.

Power User: Is thwarted. Buys ClearWire WiMax modem. Attaches to work machine via 2nd NIC. Unrestricted internet access restored. It is very good, as access is faster than crappy corporate WAN. Should have done this much earlier.

Network Security: Developer busted for having a Verizon 3G USB modem attached to machine. Buys desktop inventorying software. Starts inventorying all workstations. Catches several others.

Power User: Sees notice about inventorying. Starts bringing home laptop to work to attach to ClearWire modem. Workstation is squeaky clean when inventoried. Uses USB stick to transfer files between both machines. Unrestricted internet access maintained. It is good.

Network Security: Starts random inspections of cubes for unauthorized networking and computing gear. Catches wide array of netbooks and laptops.

Power User: Hides ClearWire modem in cunningly constructed wooden plant-stand. Buys hot-key selectable KVM switch to hide in desk. Hides netbook in back of file-drawer, runs KVM cable to workstation keyboard/mouse/monitor. Runs USB hub to netbook, hides hub in plain sight next to keyboard. Is smug. Unrestricted internet access maintained. It is good.

Now that 3G and WiMax are coming out, it is a lot harder to maintain productivity-related network blocks. The corporate firewall is no longer the sole gateway between users and their productivity destroying social networking sites. A Netbook with an integrated 3G modem in it will give them their fix. As will most modern SmartPhones these days.

As for information leakage, that's another story altogether. The defensive surface in an environment that includes ubiquitous wireless networking now includes the corporate computing hardware, not just the border network gear. This is where USB/FireWire attachment policies come into play. A workstation with two NICs can access a second network, so the desktop asset inventorying software needs to alarm when it discovers desktop machines with more than one IP-enabled interface; a sketch of that check is below.
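
A sketch of what that alarm could look like, assuming WMI is reachable on the desktops. The computer name is a placeholder for whatever list your inventory system feeds you:

# Sketch: flag any machine with more than one IP-enabled interface.
$Computer = "some-desktop"   # placeholder
$Nics = Get-WmiObject Win32_NetworkAdapterConfiguration -ComputerName $Computer -Filter "IPEnabled = True"
if (@($Nics).Count -gt 1) {
    Write-Host "$Computer has $(@($Nics).Count) IP-enabled interfaces" -ForegroundColor Red
    foreach ($Nic in $Nics) {
        "  " + $Nic.Description + " -> " + [string]::Join(", ", $Nic.IPAddress)
    }
}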

And yet... the only way to be sure to catch the final end-game I lined out above, an air-gapped external network connection, is through physical searches of workspaces. That's a lot of effort to go to just to prevent information leakage, but if you're in the kind of industry where that's really important, the effort will be invested. In such environments being caught breaking network policy like that can be grounds for termination. And yes, this is a lot of work for the security office.

All in all, it is easier to prevent information leakage than it is to prevent productivity losses due to internet-based goofing off. Behold the power of slack.

What does this have to do with SSH, which is what I titled this post after? You see, SSH is just a tool. It is a very useful tool for dealing with abstract policies of the http-restricting kind, but just a tool. It can get around all sorts of half-assed corporate block attempts. It has been the focus of many security articles over the years, and because of this it is frequently specifically called out in corporate security policies.

Focusing policies on banning tools is short-sighted, as evidenced by the 3G/WiMax end-run around the corporate firewall. Since technology moves so fast, policies do need to be somewhat abstract in order to not be rendered invalid due to innovation. A policy banning the use of SSH to bypass the web filters does nothing for the person caught surfing using their own Netbook and their own 3G modem. A policy banning the use of any method of circumventing corporate block policy does. A block list is an implementation of policy, not the policy itself.

openSUSE 11.2 milestone 2

I'm doing some testing of openSUSE. Pitching in, getting my open-source groove on. One of the new things is decidedly minor, but it makes good eye-candy. They have a new title font!

New title font

See?

Also, as of milestone 2, ext4 is the default file-system.

Fire protection done right

What kinds of things do you need to consider when deciding on fire protection for your data-center?

Check local fire-codes. Really. In 2003 I was involved in setting up a new data-center for my old job. My job was more about moving the gear safely and setting it up in the new location than wrangling with the architects and contractors who were building it.

Imagine my surprise when I found sprinkler heads in the data-center during my first walk through. I got about a quarter of the way to indignant outrage before my boss short-circuited me with logic. It seems that local fire code actually covers data-centers, and it mandates sprinklers. I was assured, assured I tell you, that they wouldn't go off unless the FM-200 system failed to snuff the fire. I was dubious, but the fire inspectors really did mean that.

Anyway, there are a series of things you need in a fire suppression system.

  • An Emergency Power Off function. If there is a fire, the EPO will drop power to the room hard. Yes, that'll cause data damage, but so does fire. If the fire is electrical in nature, this may stop it. Also, if all the gear is de-powered, a water dump does less damage.
  • A sealed room. You want a sealed room for correct HVAC anyway. You don't want to rely on building HVAC unless the building was designed with that room in mind in the first place. Also, this allows you to use...
  • A gas-based suppression system. FM-200 is a popular choice for this. Unlike the halon systems of old, it isn't as environmentally evil and doesn't leave a mess behind. OldJob had a Halon dump in the 80's due to a burned bag of popcorn ("The $20,000 bag of popcorn"). It was... bad.
  • A water-based backup suppression system. If the FM-200 fails, you still need to get the fire out. After the EPO has fired and the FM-200 has dumped, if there is still a fire you need old-fashioned water.
  • Water detection sensors in/on the floor. If you have any water pipes overhead, you need water sensors in the floor. This is more of an asset-protection thing, but if you DO have sprinklers you need water sensors to detect leaks. Also good for detecting leaks in your HVAC chillers.
  • Call-out capabilities. If the fire system trips, you want to notify Facilities people as well as data-center staff and management. Obviously, this system should NOT rely upon assets in the data-center that's on fire. This can be hard.

There may be more, but that's off the top of my head. The EPO can be a destructive option, so I don't know how wide-spread they are. But they make all kinds of sense in a room where a water dump is possible.

If you have to retrofit a pre-existing room, some of the above may not be possible. As a fire-inspector once told me, to extinguish a fire you need one of three things:

  • Remove the fuel
  • Remove the oxidizer
  • Cool the reaction below the combustion point

The system I lined out above does all three. The EPO removes fuel and can cool the reaction to below the combustion point. The FM-200 partially removes oxygen, but mostly cools the reaction below the combustion point. The water dump smothers the fire due to lack of oxygen, and also cools it. For a high-value asset like a data-center, you want at least two of these.

Because of this, I'd say that your top priority is to see if you can get a gas-based extinguishing system in place as it does far less damage than water does (even with an EPO on your power-distribution-unit or main breaker panel). A truly good system, no matter what the actual suppression technology, has a flexible notification system that allows more than just the facilities supervisor to be notified of the fire-suppression systems activating.

As for hand-held extinguishers, use Class C extinguishers. But be careful. Dry chemical style extinguishers blow a powder everywhere. And that powder is somewhat corrosive. In the typical high airflow data-center, a fired extinguisher's residue can get everywhere. If the powder gets inside server intakes, it can cause higher equipment failure rates for the next several years and the total cost may be more than the system that was on fire. We've had demonstrations of extinguishing fires at our workplace, and have seen how messy it can get. When you buy your extinguishers for in-center usage, use the gas-style Class C extinguishers.

Explaining LDAP.

| 1 Comment
The question was asked recently...

"How would you explain LDAP to a sysadmin who'd've heard of it, but not interacted with it before."

My first reaction illustrates my own biases rather well. "How could they NOT have heard of it before??" goes the rant in my head. Active Directory, the choice of enterprise Windows deployments everywhere, includes both X.500 and LDAP. Anyone doing unified authentication on Linux servers is using either LDAP or WinBind, which also uses LDAP. It seems that any PHP application doing authentication probably has an LDAP back end to it(1). So it seems somewhat far-fetched to suppose a sysadmin who doesn't know what LDAP is and can do.

But then, I remind myself, I've been playing with X.500 directories since 1996, so LDAP was just another view on the same problem to me. Almost as easy as breathing. This proposed sysadmin has probably been working in a smaller shop. Perhaps with a bunch of Windows machines in a Workgroup, or a pack of Linux application servers that users don't generally log in to. It IS possible to be in IT and not run into LDAP. This is what makes this particular question an interesting challenge, since I've been doing it long enough that the definition no longer springs readily to mind. Unfortunately for the person I'm about to info-dump upon, I get wordy.

LDAP.... Lightweight Directory Access Protocol. It came into existence as a way to standardize TCP access to X.500 directories. X.500 is a model of directory service that things like Active Directory and Novell eDirectory/NDS implemented. Since X.500 was designed in the 1980's it is, perhaps, unwarrantedly complex (think ISDN), and LDAP was a way to simplify some of that complexity. Hence the 'lightweight'.

LDAP, and X.500 before it, is hierarchical in organization, but doesn't have to be. Objects in the database can be organized into a variety of containers, or just one big flat blob of objects. That's part of the flexibility of these systems. Container types vary between directory implementations, and can be an Organizational Unit (OU=), a DNS domain component (DC=), or even a Common Name (CN=), among others. The name of an object is called the Distinguished Name (DN), and is composed of the object's own name plus all the containers up to the root. An example would be:

CN=Fred,OU=Users,DC=organization,DC=edu

This would be the object called Fred, in the 'Users' Organizational Unit, which is contained in the organization.edu domain.

Each directory has a list of classes and attributes allowable in the directory, and this is called a Schema. Objects have to belong to at least one class, and can belong to many. Belonging to a class grants the object the ability to define specific attributes, some of which are mandatory, similar to primary keys in database tables.

Fred is a member of the User class, which itself inherits from the Top class. The Top class is the class that all other classes inherit from, as it defines the bare minimum attributes needed to define an object in the database. The User class can then define additional attributes that are distinct to the class, such as "first name", "password", and "groupMembership".
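
If you want to poke at this from the PowerShell side against an AD-style directory, something like the following shows an object's class chain and which attributes it actually has populated. Fred's DN here is the made-up example from above:

# Sketch: inspect an object's classes and populated attributes via ADSI.
$Fred = [ADSI]"LDAP://CN=Fred,OU=Users,DC=organization,DC=edu"
$Fred.objectClass                                       # the inheritance chain, Top down to the most specific class
$Fred.psbase.Properties.PropertyNames | Sort-Object     # the attributes that have values on this object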

The LDAP protocol additionally defines a syntax for searching the directory for information. The return format is also defined. Let's look at the case of an authentication service such as a web-page or Linux login.

A user types in "fred" at the Login prompt of an SSH login to a Linux server. The Linux server then asks for a password, which the user provides. The PAM back-end then queries the LDAP server for objects of class User named "fred", and gets one, located at CN=Fred,OU=Users,DC=organization,DC=edu. It then queries the LDAP server for objects of class Group that are named CN=LinuxServerAccess,OU=Servers,DC=Organization,DC=EDU, and pulls the membership attributes. It finds that Fred is in this group, and is therefore allowed to log in to that server. It then makes a third connection to the LDAP server and attempts to authenticate as Fred, with the password Fred supplied at the SSH login. Since Fred did not fat-finger his password, the LDAP server allows the authenticated login. The Linux server detects the successful login, and logs out of LDAP, finally permitting Fred to log in by way of SSH.
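
For the curious, those lookups are just LDAP searches. Here is roughly what the first one looks like from PowerShell using System.DirectoryServices, again assuming an AD-style directory and the made-up organization.edu names; memberOf is AD's way of exposing the group membership the story checks via the Group object.

# Sketch: find the user object "fred" and see where he lives and what groups he belongs to.
$Root = [ADSI]"LDAP://DC=organization,DC=edu"
$Searcher = New-Object System.DirectoryServices.DirectorySearcher($Root)
$Searcher.Filter = "(&(objectClass=user)(cn=fred))"
[void]$Searcher.PropertiesToLoad.Add("distinguishedName")
[void]$Searcher.PropertiesToLoad.Add("memberOf")

$Result = $Searcher.FindOne()
if ($Result -ne $null) {
    $Result.Properties["distinguishedname"]    # CN=Fred,OU=Users,DC=organization,DC=edu
    $Result.Properties["memberof"]             # the group DNs the login check cares about
}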

As I said before, these databases can be organized any which way. Some are organized based on the Organizational Chart of the organization, not with all the users in one big pile like the above example. In that case, Fred's distinguished-name could be as long as:

CN=Fred,ou=stuhelpdesk,ou=DesktopSupport,ou=InfoTechSvcs,ou=AcadAffairs,dc=organization,dc=edu

How to organize the directory is up to the implementers, and is not included in the standards.

The higher-performing LDAP systems, such as the systems that can scale to 500,000 objects or more, tend to index their databases in much the same way that relational databases do. This greatly speeds up common searches. Searching for an object's location is often one of the fastest searches an LDAP system can perform. Because of this, LDAP is very frequently the back end for authentication systems.

LDAP is in many ways a specialized identity database. If done right, on identical hardware it can easily outperform even relational databases in returning results.

Any questions?
Yeah, I get wordy.

(1) Yes, this is wrong. MySQL contains a lot, if not most, of these PHP-application logins on the greater internet. I said I had my biases, right?

Outlook for everything

| 2 Comments
Back when I worked in a GroupWise shop, we'd get the occasional request from end users to see if Outlook could be used against the GroupWise back-end. Back in those days there was a MAPI plug-in for Outlook that allowed it to talk natively to GroupWise, and it worked rather well. Then Microsoft made some changes, Novell made some changes, and the plug-in broke. GW admins still remember the plug-in, because it allowed a GroupWise back end to look exactly like an Exchange back end to the end users.

Through the grapevine I've heard tales of Exchange to GroupWise migrations almost coming to a halt when it came time to pry Outlook out of the hands of certain highly placed noisy executives. The MAPI plugin was very useful in quieting them down. Also, some PDA sync software (this WAS a while ago) only worked with Outlook, and using the MAPI plugin was a way to allow a sync between your PDA and GroupWise. I haven't checked if there is something equivalent in the modern GW8 era.

It seems like Google has figured this out. A plugin for Outlook that'll talk to Google Apps directly. It'll allow the die-hard Outlook users to still keep using the product they love.

Yesterday we got some concerned mails from one of the groups who send mail by way of one of our web-servers. It's a somewhat critical function they perform, so we paid attention to it. It seems they were getting bounce-messages from comcast.net. The bounce said that the incoming IP address did not have a reverse lookup (PTR record) and they don't talk to people like that.

This was confusing. Because we really do have a PTR record for that particular mailer. And yet, getting bounces. So one of the Webdevs calls Comcast to ask politely what the heck, and the Comcast support person walks them through a series of steps to demonstrate what went wrong. According to them, or so implied the webdev who doesn't speak SMTP as well as we do, the problem was that 'wwu.edu' does not resolve to an IP address.

There are reasons we haven't done this, and they have to do with mail delivery. Certain stupid mailers will deliver to a resolvable host before searching MX records, and if "wwu.edu" is resolvable, they'll attempt delivery to THAT instead of where they should. The server that runs 'www.wwu.edu' is the one we'd have to point 'wwu.edu' to, and it is not a mail host. Far from it. This seemed to be a strange requirement of Comcast.

I cracked it earlier today. You see, if you take a look at the Name Server records for the "wwu.edu" domain, you will find three records:

140.160.242.13
140.160.240.12
216.186.4.245

It's that last one that's the problem. For some reason, our offsite DNS didn't have that particular reverse-lookup domain replicated to it. So if Comcast used it for resolving the incoming IP, it would get 'UNKNOWN' and block the connection. If they picked one of the other two, it would resolve and delivery would continue. Tada! The Comcast error message really was true, we just didn't realize one of our DNS servers didn't have all the data it needed. Oops.
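
A quick check that would have caught this sooner: ask each advertised nameserver for the mailer's PTR record directly, instead of letting the local resolver pick one for you. The PTR name below is a placeholder since I'm not publishing the mailer's address; you build it by reversing the octets and appending in-addr.arpa.

# Sketch: query each of our three advertised nameservers for the same PTR record.
# For an address 140.160.A.B the PTR name is B.A.160.140.in-addr.arpa (placeholder below).
$PtrName = "B.A.160.140.in-addr.arpa"
$NameServers = "140.160.242.13", "140.160.240.12", "216.186.4.245"

foreach ($NS in $NameServers) {
    Write-Host "--- asking $NS ---"
    nslookup -type=PTR $PtrName $NS
}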

Finals week

The time of the academic year when we touch nothing substantive, and all problems are given a heightened priority. Most of the time it is pretty quiet. We plan for stuff we can plan for, but implementation is left for next week.

Grades have to be turned in on Tuesday, so we can't touch Blackboard before then. So all the intersession stuff gets done between Wednesday and Sunday. Summer session starts Tuesday the 23rd.

Power update

Last night's shutdown went fine. We lost a single hard-drive somewhere, probably on the Solaris/Linux side since I know we didn't lose any on the Windows/NetWare side. The power guys found the problem, and were able to get the UPS up and running. We're running protected right now.

What was the problem? Well, I know what I was told yesterday, and I'm not able to translate that into anything intelligible. I'm not an electrical engineer. If I caught it right, and it is decidedly possible I didn't, the circuit breakers in the UPS cabinet were configured to trip for the wrong condition. An overly conservative condition. This was done when the UPS was installed back in 1999-2000, and we only just discovered it because we haven't had to take the UPS down since then.

We get to do a last set of maintenance in two weeks. This is where they move the breaker to a new electrical panel. This will be done with the UPS on bypass, and shouldn't interrupt the load. We'll be keeping an eagle eye, of course, but don't expect any problems.

When good power becomes bad power

A good thing is happening. We're replacing the generator backing up the datacenter with a unit large enough to run both HVAC units. When the room was built in the 1999/2000 timeframe, it was presumed that one unit would be enough to keep the room cool. That's true to a point, but it didn't take into account localized hot-spots due to very hot-running servers like the ones in our ESX cluster. Testing we've done shows that the temps fall out of tolerance within 30 to 45 minutes of running on only one HVAC unit. So, we're setting things up to run both HVAC units. Good! It'd be even better if we could get a newer UPS, since this one was nearly EOL when we bought it, but that's something for another capital request.

Because we're replacing a generator, there will be some unavoidable periods of time when the room is not fully protected and we'll be running on naked utility power. Like I said, this is unavoidable. Happily, utility power is pretty stable this time of year. We're having a bit of a hot snap right now, so there is some concern about AC-related brown-outs, but it isn't quite that hot yet. That's why the work is scheduled for the cool part of the day.

Murphy did not agree with us. Yesterday, they spliced the new generator transfer switch into the Bypass circuit of the UPS. This should have been a non-event since the main circuit was just fine and feeding load. Unfortunately for us, the monitor card on the UPS saw the Bypass circuit failing as a UTILITY FAIL event. What's more, it erroneously fired the ON_BATTERY event even though the UPS was not actually on battery. This started the shutdown timers on the servers with the UPS shutdown-service client on them. This is why things got Very Exciting around 8:57am yesterday, as these servers shut themselves down. On the plus side, things worked as they should. On the negative side, we were trusting a signal source that it turns out we shouldn't trust. Crap.

Then came this morning. This morning they were splicing the new transfer switch into the mains circuit, and during this time the UPS would be on Bypass, leaving us on naked utility power. Once done, the new generator would be supporting the UPS. The next outage would be similar, putting the UPS on bypass while they cut over to the new electrical panel downstairs.

Unfortunately for us, when the work was completed and we went through the UPS startup procedure, two things happened. First, we discovered that the Input breaker had tripped some time between when we shut the UPS down and opened the doors to start it back up. We (actually the WWU Facilities electricians, I was just shoulder surfing at this point) flipped the breaker to the On position, which gave the datacenter a transient power flicker on the order of 50ms-100ms, which didn't bring anything down. Second, when we got to the part of the startup procedure that says 'tell the UPS to turn on,' it failed with an error to the effect of, "incorrect phase rotation, startup aborted." This caused the electricians some great concern, and they went about validating their wiring.

Which tested out fine. The phases well and truly are wired in correctly. They have very high confidence in this. Which leaves something in the UPS being wonky. So they call the UPS vendor, who ends up sending a technician up from Seattle to look things over. He should be here any time now. Meanwhile, we've been on naked utility power since 7am this morning.

The electricians are very concerned about that Input breaker tripping. This is a 50kVA 3-phase UPS, and when those short out the arc they generate is more accurately described as an explosion. The breaker caught it, as it should, but it shows that a highly energetic event was avoided. They do not have confidence that we can bring the UPS up without a blip in power to the main load, if not a full-on surge should it fail the wrong way.

The decision was made to prepare for shutting the whole machine room down. This is not a decision made lightly; this is the week before finals, so uptime is even more critical right now. This decision will have to be made by the Vice Provost or the President, and we haven't had word yet what they've decided. We hear they're considering the full shutdown to start at 1am. We're still planning for it.

This would mark the first time since the datacenter went production back in 2000 that we've had to gracefully shut the whole thing down. The closest we've come was last September when we had to shutdown the EVA3000 in order to upgrade it to an EVA6100, and all servers connected to it had to be shut down. We're guessing that it'll take 45 minutes to get everything down, and close to 90 minutes to bring it all up in the correct order. When the room is down is when the electricians will attempt to restart the UPS.

This is an all-hands thing, and we'll have to get in contact with the University parties that have servers in there so they can either shut down for the night, or be here to shut down in person. We've designated a pair of admins to sleep through the event so they can be fresh for the morning disasters while the rest of us sleep in.

Of course, the powers-that-be may decide to risk another UPS restart with load. Who knows.

Once a decision has been made, I'm fairly certain an all-points email will go out if the full shutdown is needed. This is why we get paid the big money.

EDIT: It is official. We're taking everything down starting at 1am tonight.