Recently in sysadmin Category

Being the WTF person

At both this job and my last one I have ended up becoming the WTF person. The WTF person is the person people go to when things are acting strangely, they can't figure it out, and need another set of eyes. Preferably a set of eyes with a reputation for pulling rabbits out of hats.

WTF people are the kind of people that end up on level 2 or 3 tech support, because that's who you want to have at that level. People who solve weird stuff.

At a place like ours where the support relationships are largely informal, at least among people who dink around with servers, the concept of L2 or L3 support doesn't really exist. It manifests as phone-calls or emails from people with strange questions, looking for leads in their own inquiries. Or in the case of my immediate co-workers, a head poked around the door, and, "I'm lost, can you take a look?"

As I alluded to before, becoming the WTF person takes time. You have to make some awesome saves so people notice, and then continue to crack weird, hard to describe problems. It helps a lot to have a deep understanding of the technology you work with. I suspect being ebullient about how you found the problem and describing the problem once it was resolved helps in this.

Once you get there, though, you do get passed some strange, strange things. I've been asked advice on figuring out how something broke in that specific way when the symptoms described... have no causal relationship I can think of. I also get passed weird questions in areas I don't know much about (MS Office for one), but at least those can be deflected.

Honest to goodness bugs are perhaps the hardest to figure out. These are problems that take a few conditions to set up, and it isn't always clear that those conditions are in place. This skill got a lot of work back when I was working on the OES2 SP1 beta. On software that's already been through a beta-test and perhaps a service-pack or two, the bug conditions can be very arcane.

One-man IT shops tend to attract WTF people, simply due to the breadth and complexity of the environment. People who thrive in such environments definitely are WTF people. They do a little bit of everything, which sets them up to make connections that other people miss.

At the other end of the IT spectrum, highly specialized IT people in large organizations, you still find WTF people. They're perhaps not as common, but they do exist. And strange but awesome synchronicities can occur if WTF people from different specialties start hammering on a problem together. This kind of thing sometimes happens when I talk to L2/3 vendor-support.

I'm proud to see this happen, even if in the moment I'm also going WTF?? in my head.
I'm spending most of today sheepdogging a vendor installing an application. This vendor is VPNing in, and such access is a key part of the product's support contract.

This is something I've noticed recently. Several of the server-based off the shelf apps I've installed lately have had a requirement that the vendor have access to the server in some way. Some of it is so they can do the install. Some of it is so they can update it so we don't have to. Some of it is just in case we ever call for support and need their help.

I have a theory for why this is. I have a sneaking suspicion that it's because that's how these vendors support installs in environments where the sysadmin is a desktop person who got handed a server and was told, "make it work." This kind of vendor-based hand-holding lowers the ongoing maintenance burden on the client side of the equation, which can lead to more sales. But I'm not sure if that's it or not.

This is causing some grumbling in the ranks, since it means untrusted parties have to be allowed to log in to servers in the domain. Before this recent spate of applications, vendors demanding such access had their apps relegated to servers not in the domain at all. This doesn't work when the app requires domain access. Console access to servers is a sensitive thing for us, so we don't like to hand it out on demand to vendors.

Especially when we weren't involved in the purchase process to begin with. Many a time we've been told:

Client: We spent umpty thousand dollars on this app. Install it.
Us: *reads install document, cringes* They need Administrator access to the whole box and a tunnel into the inner Banner fortress. I don't want to.
Client: What part of umpty thousand dollars don't you understand? Make it work.
Us to Management: Insecure! Violates best practices!
Management to Us: It's too late to get a refund, and upper management was involved in the decision. Make it work.
Us: Wilco.
Or words to that effect.

Ahem.

How are y'all handling this kind of thing, presuming you're also seeing it and it isn't just me getting lucky?
We have a user type that is pretty much unique to the higher-educational world.

The Emeritus professor.

I'm unclear on what, exactly, Emeritus professors get in the way of continued access to WWU resources, but I do know they can have things like email accounts. As you can probably guess, this population is not the most technically savvy bunch. They are also a unique population that requires a fair amount of exception-processing in several procedures.

We've been in a multi-year process of eliminating the 'cc.wwu.edu' domain from Campus usage. Way back in the beginning all WWU email came from cc.wwu.edu. Then the Microsoft Mail system came in and Faculty/Staff moved to a different domain, but students stayed on cc.wwu.edu. When we upgraded from MS-Mail to Exchange 5.5 Fac/Staff moved to @wwu.edu instead and that's where we are today. Students were migrated to '@students.wwu.edu' coming on two years ago as part of the Windows Live @ EDU program. The only people on @cc.wwu.edu were people who opted out of Exchange, preferring the more pure text-mode email interface of pine over telnet/ssh.

Getting people off of cc.wwu.edu has been a long process. The fact that most of our Emeritus were over there, and had been there since time began, caused a fair amount of work. The fact that some professors had published articles and books with their @cc.wwu.edu addresses in them caused a fair amount of pain as well. We worked through them (thank you, ATUS!) and we're almost ready to turn cc.wwu.edu off for good.

So of course today I dig up another Emeritus with a Contact in Exchange that forwards to a cc.wwu.edu address.

Another area where Emeritus caused some pain is when we turned off our modem dial-up service. Our only consistent users were a small handful of faculty and Emeritus.

"Accounts for Life" is a tricky service to provide. Yes indeedy.

It has been cold

In most of the rest of the US June is actually a summer month, but not here in the Pacific Northwest. For us, Summer typically starts on July 12th, give or take a day. I typically make it longer for me by visiting the warmer parts of the country over the 4th of July weekend. But this June has been unusually gloomy and chilly. Take a look at the monthly temp chart from the Seattle airport. We're usually 1-5 degrees (F) colder than Seattle depending on a variety of things, but the trend is still the same.

[Figure: KSEA June temperature record]
The green band represents the normal high and low. As you can see, this time of year our highs should be in the 70's, but instead they've hung in the 60's. We had a nice patch late last week, but overall the month has been markedly colder than normal. You can see where we set a record low high-temp back on the 19th.

Even during normal years we only have three months with an expected high above 70 (very roughly, June 15 through September 15). What this means is that we're actually a pretty good candidate for ambient-air datacenter cooling. Those kinds of systems didn't really exist in any meaningful way back when this building was built, but if we were to build this building again something like that would be considered.

Universities in general have an environmentalist bent to them, and WWU is not immune. We have the Huxley College of the Environment, one of the first such programs in existence. The last few buildings we've built on campus have been LEED certified to various degrees. With that kind of track-record, an ambient air system for a new large data-center is something of a gimme.

Heck, I would not be surprised if a Capital Request gets put in sometime in the 5-10 year range to try and convert our current system to at least be partially ambient. We're running up against a power and cooling wall right now. Virtualization has helped with that quite a bit, but our UPS has been running in the 70-85% range for several years now. We're going to have to address that at some point. Since that'll also require shutting the room down for a while (eeeeek!) may as well redo cooling while they've got the availability.

We'll see if that actually happens.
My thoughts on this quote:

"Theoretical risks and real risks are generally the same thing when you're talking about IT security."

In large part, this is correct. Especially when getting audited. We have regular audits here, both internal and external. We have servers that handle credit-card data, so we have to deal with PCI compliance as well. So yeah, we know about this. We're also familiar with the debate.

In order to get our PCI stuff certified we have to have security scans performed against our credit-card processing servers. In order to do this, we grant a specified IP address full and unrestricted access to an internal IP list. The third party then scans that from wherever they are, and sends us the report full of red Xes.

The internal debate goes like this. I'm not naming names for obvious reasons. I like my job.

Tech: Why do we have to let them in to scan? That's, like, completely bypassing the security provided by our firewall. Both firewalls. It's not like a regular hacker has that kind of access. These servers cannot normally be reached from the internet at all! They should be scanning THAT!

Manager: Because that's what the PCI standard says they have to do.

Tech: It makes no sense!

The reason is that they're testing how vulnerable we are if our other servers get hacked and the attackers have enhanced access to that subnet. That's also very unlikely in our case (see also: two firewalls), but the fact remains that it still has to be checked. Because we've never been attacked that way (that we know of), that kind of attack is seen as theoretical rather than real.

All it takes is one attacker, or a group of attackers, to REALLY WANT SOMETHING for theoretical attacks to become real. The concerted attacker, as opposed to the casual attacker, is the one that'll employ novel methods of getting what they want. Door-rattlers looking for phat pipes for their warez repos are looking for any fat pipes they can find and the resources they expend per target are pretty small. Someone looking to break in for a specific reason is targeting us specifically, and the resources they'll expend to get it are a LOT higher.

It is the concerted attacker that'll spend the time to worm their way from internet-facing systems, to intranet-facing systems, to get to secure-net facing systems. It is this kind of attacker that'll do targeted phishing against the users most likely to have inner-firewall access of some kind and then attempt to create VPN sessions with those credentials to do scanning from a far more advantageous network position. It is the concerted attacker that'll do targeted DNS hijacks in order to get better information. These are not the kinds of things that Joe Warezer or Ben BotHerder are going to bother with.

It is also true that the concerted attacker can be vastly more damaging than their younger cousin who is just looking to leech resources or reputation. So yeah, it's a very low likelihood of running into that kind of threat, but the risks of not doing something about it are pretty high. That's what makes the theoretical real. 
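
The low-likelihood, high-damage argument above can be put into rough numbers by multiplying likelihood by impact. This is only a sketch: the function name and every probability and cost figure are hypothetical, picked to show the shape of the comparison rather than anything measured here.

```python
def expected_annual_loss(annual_probability: float, impact_cost: float) -> float:
    """Likelihood times impact: a crude way to compare unlike threats."""
    return annual_probability * impact_cost

# Hypothetical figures, chosen only to show the shape of the argument.
casual = expected_annual_loss(annual_probability=0.50, impact_cost=2_000)
concerted = expected_annual_loss(annual_probability=0.01, impact_cost=500_000)

print(f"casual attacker:    {casual:,.0f}")
print(f"concerted attacker: {concerted:,.0f}")
```

With numbers like these the rare, targeted attack still carries the larger expected cost, which is the sense in which the theoretical becomes real.
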
I've talked about this before, and I'm sure I'll do it again. We do need to reduce some of the excessive packaging on the things we get. I can completely understand the need to swaddle a $57,000 storage controller in enough packaging to survive a 3 meter drop. What I don't understand is shipping the 24 hard drives that go with that storage controller in individual boxes. It wouldn't take much engineering to come up with a 6-pack foam holder for hard-drives. It would seriously reduce bulk, which makes it easier and cheaper to ship, and there is less material used in the whole process. But I guess that extra SKU is too much effort.

Today I turned this:
[Image: HP-BoxesA.jpg]

Into this:

[Image: HP-BoxesB.jpg]

The big box at the top of the stack contained 24 individual hard-drive boxes. Each box had:
  • 1 hard-drive.
  • 1 anti-static bag requiring a knife to open.
  • 2 foam end-pieces to hold the drive in place in the box.
  • 1 piece of paper of some kind, white.
  • 1 cardboard box, requiring a knife to open.
When I was done slotting all of those in, I had a large pile of cardboard boxes, a big jumble of green foam bits, a slippery pile of anti-static bags, and a neat pile of paper. The paper and cardboard can easily be recycled. The anti-static bags and foam bits... not so much. Although, the foam bits were marked type 4 plastic (LDPE), which means they were possibly made from recyclable materials, right?

Right?

I'd still like to use less of it.

TCP problems

My testing for a cheap NAS solution has progressed to the option that costs the most money, Windows 2008 running KernSafe's iStorage. As it happens, it works really well when the iSCSI initiator is Windows but Linux clients don't really want to talk to it. Windows: 30-50 MB/s. Linux: 3-5 MB/s. Biiiig difference there.

Looking at packets I'm noticing a similar pattern on the wire to one I'd seen before. Back when I was troubleshooting exactly why NetWare backups to DataProtector were horrible I came across this problem. It seems that TCP Windowing is fundamentally broken between Server 2008 and NetWare, which leads to really bad throughput, which in turn is very bad for half-TB backups. The receiving server seemed to feel the need to ACK after every two packets, which in turn really slowed things down. And that's what the Linux clients are doing for iSCSI to Server 2008.

It has to be something affecting basic TCP services but not complex protocols. Using smbclient to upload a 4GB DVD iso runs at 50MB/s but the iSCSI throughput on the same client is a piddly 3-5MB/s. I'm sure some kind of tuning on either side might be able to jar things loose, heaven knows Linux 2.6.31 is a heck of a lot more current on TCP settings than NetWare 6.5 SP8 is. I just haven't found it yet.
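
A rough way to see why a tiny effective window hurts: throughput can't exceed the bytes in flight divided by the round-trip time. The sketch below is only that back-of-the-envelope bound. The function name, the 1460-byte segment size, and the round-trip time are assumptions for illustration, not measurements from this test rig.

```python
def max_throughput_mb_s(bytes_in_flight: int, rtt_seconds: float) -> float:
    """Upper bound on throughput (MB/s) when at most bytes_in_flight are unacknowledged."""
    return bytes_in_flight / rtt_seconds / 1_000_000

SEGMENT_BYTES = 1460      # assumed payload per packet on standard Ethernet
ASSUMED_RTT = 0.001       # hypothetical 1 ms round trip, not a measured value

# Sender stalls after two packets until an ACK comes back.
two_packet_bound = max_throughput_mb_s(2 * SEGMENT_BYTES, ASSUMED_RTT)

# Same link, but a full 64 KiB allowed in flight.
full_window_bound = max_throughput_mb_s(64 * 1024, ASSUMED_RTT)

print(f"two packets in flight: {two_packet_bound:.2f} MB/s")
print(f"64 KiB in flight:      {full_window_bound:.2f} MB/s")
```

If the sender really is limited to a couple of packets per round trip, an order-of-magnitude gap like the one between the Windows and Linux initiators is about what this bound would predict.
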

Conversely, Server 2008 talking as an initiator to a Linux iSCSI target works at pretty much line speed. I'm testing this for completeness's sake. We need something that can serve up to 30TB via both iSCSI and SMB. My findings aren't fully complete yet, but in general:
  • OpenFiler: GREAT iSCSI host, completely blows for SMB in our environment.
  • OpenSolaris: Great iSCSI host, just can't convince the kernel-mode CIFS to join our domain. Also, worst-of-breed random I/O performance.
  • OpenFiler + Windows: OpenFiler for iSCSI, Windows (mounting an iSCSI share) for SMB. Should work GREAT. Current best bet for the future.
  • OpenSolaris + Windows: As previous option, but I/O problems make it less attractive.
  • Windows + KernSafe: GREAT SMB performance, solid iSCSI for Windows hosts. Linux hosts will take lots of tuning (perhaps, or it could be intractable).
Proxy ARP is enabled on our routers. I'm 100% certain this has saved the bacon of many of the technicians here on campus, since our address space is in the Class-B range (140.160.0.0/16), and Windows knows this so it applies a default subnet mask of 255.255.0.0 when setting up a static IP address. Without Proxy ARP, a tech who doesn't fix this will soon find that talking to anything on campus doesn't work, but talking to, say, sysadmin1138.net works just fine. With Proxy ARP, it all works just fine and the tech is never the wiser.
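
To make the mask problem concrete, here is a small sketch of the decision a host makes before sending a packet: ARP for the destination directly, or hand it to the gateway. The helper names, the host address 140.160.10.20, and the /24 "correct" prefix are assumptions for illustration; only the 140.160.0.0/16 range and the classful default come from the description above.

```python
import ipaddress

def classful_prefix(address: str) -> int:
    """Return the historical classful default prefix length for an IPv4 address."""
    first_octet = int(address.split(".")[0])
    if first_octet < 128:
        return 8     # Class A
    if first_octet < 192:
        return 16    # Class B
    if first_octet < 224:
        return 24    # Class C
    raise ValueError("Class D/E addresses have no default mask")

def is_on_link(host_ip: str, prefix: int, destination: str) -> bool:
    """True if the host would ARP for the destination instead of using its gateway."""
    network = ipaddress.ip_network(f"{host_ip}/{prefix}", strict=False)
    return ipaddress.ip_address(destination) in network

HOST = "140.160.10.20"          # hypothetical statically addressed server
default_prefix = classful_prefix(HOST)

# With the classful default, every campus address looks local, so the host ARPs for it.
print(is_on_link(HOST, default_prefix, "140.160.243.16"))
# With an assumed correct /24, the same destination goes to the gateway instead.
print(is_on_link(HOST, 24, "140.160.243.16"))
# Off-campus traffic (documentation address) goes to the gateway either way.
print(is_on_link(HOST, default_prefix, "192.0.2.10"))
```

The first case is the one Proxy ARP quietly papers over: the host asks the wire for a MAC address it should never have asked for, and the router answers on the real owner's behalf.
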

We just had this crop up on a server, only with a twist.

It turns out that the F5 BigIP will also issue a Proxy ARP for Virtual Servers that are configured on it. Which means that for some addresses on some subnets, we actually have two network devices issuing Proxy ARP packets. This, as you can well imagine, is sub-optimal. How it works is this, from a Layer 2 point of view...

Mailer: Who has 140.160.243.16? Tell Mailer.
BigIP: 140.160.243.16 is BigIP
Mailer to BigIP: TCP/25 to 140.160.243.16 [SYN]
Cisco: 140.160.243.16 is Cisco
BigIP to Mailer: TCP/43124 to Mailer [SYN/ACK]
Mailer to Cisco: TCP/25 to 140.160.243.16 [ACK]
Cisco to Mailer: [Reset]

What you're seeing is an ARP update in the middle of the TCP 3-way handshake. The Mailer server dutifully updates its ARP table for 140.160.243.16, which takes it down a different network path than the BigIP expects, and gets a TCP Reset issued.

What was throwing us on this one was that the connection would reset, but subsequent attempts would work just fine. This is because we were still within the ARP timeout value when the second attempt was made, and things just worked, at least for a little while.

Setting the network mask correctly forces the Mailer to realize that 140.160.243.16 is NOT a local address, so the traffic goes through the gateway and everything works.
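
The race is easier to see as a toy simulation. Everything here is a modeling assumption rather than a description of how the Cisco or the BigIP actually behave: the `Device` class, the `connect` helper, and in particular the rule that a device resets any ACK belonging to a handshake whose SYN it never saw.

```python
class Device:
    """A network device that only completes handshakes whose SYN it saw (modeling assumption)."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.seen_syn = set()

    def receive(self, conn_id: int, flag: str) -> str:
        if flag == "SYN":
            self.seen_syn.add(conn_id)
            return "SYN/ACK"
        # An ACK for a handshake this device never saw gets reset.
        return "ESTABLISHED" if conn_id in self.seen_syn else "RST"


def connect(arp_cache, devices, vip, conn_id, late_arp_reply=None):
    """Send SYN then ACK to whichever device the ARP cache names at that moment."""
    devices[arp_cache[vip]].receive(conn_id, "SYN")
    if late_arp_reply is not None:
        # Second proxy-ARP answer lands mid-handshake and overwrites the entry.
        arp_cache[vip] = late_arp_reply
    return devices[arp_cache[vip]].receive(conn_id, "ACK")


VIP = "140.160.243.16"
devices = {"BigIP": Device("BigIP"), "Cisco": Device("Cisco")}
arp_cache = {VIP: "BigIP"}          # BigIP's proxy-ARP reply arrived first

first = connect(arp_cache, devices, VIP, conn_id=1, late_arp_reply="Cisco")
second = connect(arp_cache, devices, VIP, conn_id=2)   # cache entry still valid

print(first, second)
```

In this model the first attempt is reset because the SYN and the ACK took different paths, while the retry succeeds because the cached entry no longer changes mid-handshake, which matches the "fails once, then works for a while" symptom.
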

A network problem

I have a server attempting to talk SMTP to our internal smart-host. But it seems our hardware load-balancer is getting in the way. When sniffing the switch-port the server is on, the conversation goes like this:

Server -> Mailer [SYN]
Mailer -> Server [SYN, ACK]
Server -> Mailer [ACK]
Mailer -> Server [RST, ACK]
[3 seconds pass]
Mailer -> Server [SYN, ACK]
Server -> Mailer [RST]
[6 seconds pass]
Mailer -> Server [SYN, ACK]
Server -> Mailer [RST]

What's going on here?

Well, the first three packets are the classic TCP 3-way handshake. The Mailer then issues an Acknowledge-Reset (RST, ACK) packet, which shuts down the conversation. Then things get weird. Three seconds pass, and the Mailer retransmits the second packet. The Server, having shut down the TCP conversation like it was told to in the 4th packet, just issues a RESET packet telling the sender there is no connection to ACK and to stop trying. This repeats 6 seconds later.
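
The 3 and 6 second gaps look like a TCP retransmission timer that starts at 3 seconds and doubles after each timeout, which is what a sender does when it believes its SYN, ACK was never acknowledged. The sketch below just reproduces that schedule. The function name is made up, and the 3-second starting value is inferred from the capture above, not checked against the Mailer's or the load-balancer's settings.

```python
from typing import List

def retransmit_gaps(initial_rto: float, retransmissions: int) -> List[float]:
    """Gaps between successive retransmissions when the timer doubles after each timeout."""
    return [initial_rto * 2 ** n for n in range(retransmissions)]

# Assumed 3-second initial timer, two retransmissions as seen on the wire.
gaps = retransmit_gaps(initial_rto=3.0, retransmissions=2)
print(gaps)
```

If that reading is right, whatever is sending the repeated SYN, ACK packets still thinks the handshake is half-open, which would point at something other than the device that sent the RST.
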

So how did the Mailer forget it had torn down the TCP connection? That is the mystery. I haven't had a chance to get a sniffer on the Mailer side of things yet, so I'm not certain what it's seeing. It could be the load-balancer is throwing a fit, and the follow-on packets at 3 and 6 seconds are from the Mailer server itself somehow.

Strange things.

Worst-case thinking

Worst-case thinking is something that Sysadmins are kind of prone to. We all know what level of disaster would cause us to lose everything, and it's not a good feeling. At my last job I was asked once what my worst-case scenario was. And it was a truck-bomb in the wrong spot that would cause our datacenter to suddenly drop a few floors, as well as do serious damage to most of our offices (and note, this was asked AFTER 9/11).

Fixing that was easy: don't allow traffic on that road. But that wasn't an option for us. So we just lived with it.

Having been around enough people worrying about this, I know how the thinking goes: if we mitigate the worst case, we also mitigate the bad cases. Let's take a look at this, shall we?

If we HAD been able to stop traffic on that road, it would have done nothing for certain other just as costly incidents. A direct hit by a tornado would render the building structurally uncertain for a week or two as the engineers assessed its soundness, and that would cost us quite a lot thank you. A sprinkler release on the floor above the datacenter could cause water to fall into the datacenter, which would be bad. A fire on the same floor as the DC would cause a sprinkler release in the datacenter (no FM-200 system there!) and short a bunch of stuff out. None of this would have been mitigated by stopping traffic on that one road.

WWU is the kind of enterprise where physical presence is required for most of our business. Few disasters would limit our ability to teach without also affecting the classrooms themselves, which narrows the kind of disaster we have to plan for. As it happens, cutting two fiber runs would stop most network-based instruction, so that's the disaster we plan for. This building sinking into the bog it was built on is... a dark fantasy, and only likely in the kind of earthquake that'd also do serious damage to campus itself.

So yes. Good risk-management involves looking at the probable risks, not looking at the worst-case risks and hoping good overall coverage follows from that.

About this Archive

This page is an archive of entries from June 2010 listed from newest to oldest.
