Thursday, October 29, 2009

A matter of policy

This has been a long-standing policy in Technical Services, dating to the previous VP-IT and endorsed by the current one. This policy concerns email like this, generally from a manager of some kind:
"[Person X] no longer works here. Please change their password and give it to [Person Y] so they can handle email. And please set an out-of-office rule notifying people of [Person X's] absence."
To which we politely decline. What we will do is set the out-of-office rule, that's just fine. We'll also either give a PST extract of Person X's mailbox, or if there really is no other way (the person was the Coordinator of the Z's for 20+ years and handled all the communications themselves before retiring/dying) we'll grant read-access to the mailbox to another person, and effectively turn the Person X account into a group account but lacking send-as rights.

What we will categorically not do is change a password for an inactive user and give the login to someone else. It comes down to identity theft. If we give Person Y the login info for Person X, Person Y can send email impersonating Person X. And that is wrong on a number of levels.

We resist giving access to the mailbox as well, since a non-trivial proportion of end-users give their work email as the email address for web-registration pages all over the internet. And thus that's where the "password reminder" emails get sent. Having access to someone else's mailbox is a good way to start the process of hacking an identity.

Yes, we do occasionally get a high level manager pushing us on this. But once we explain our rationale, they've backed down so far. There is a reason we say no when we say no.



Wednesday, October 28, 2009

You can tell I've been at this a while

Last night while I was sleeping, I had a dream. In my dream I was at my desk at work. I picked up my flashlight for some reason and just then the power decided to drop. DARKNESS. And the UPS alarm in the distance. This was concerning since my workstation is on a power outlet attached to the datacenter UPS, so if my computer was out, chances were real good the entire datacenter was also down. Very bad.

Happily I just happened to have my flashlight in hand! So I powered on and went to the datacenter door. But my access card wouldn't work. The card-reader has its own internal battery, so it not reading me at all, or even giving me the access-denied angry-beep, was doubly bad. Happily, coworker dropped by and could get in so I ghosted on in behind him. The room was noisy and had all the right lights. But the UPS was still alarming. Not surprising, it's supposed to do that.

Then I woke up. I checked the clock, still had power. And there was a beep in the distance.

A smoke alarm was crying for a new battery. At 5:30am. It's just a single beep, but it seems my unconscious mind interpreted that as a UPS alarm even though those are usually three beeps.



Thursday, October 22, 2009

Windows 7 releases!

Or rather, its retail availability is today. We're on a Microsoft agreement, so we've had it since late August. And boy do I know that. I've been having a trickle of calls and emails ever since the beta released about various ways Win7 isn't working in my environment and whether I have any thoughts about that. Well, I do. As a matter of fact, Technical Services and ATUS both have thoughts on that:

Don't use it yet. We're not ready. Things will break. Don't call us when it does.

But as with any brand new technology there is demand. Couple that with the loose 'corporate controls' inherent in a public Higher Ed institution and we have it coming in anyway. And I get calls when people can't get to stuff.

The main generator of calls is our replacement of the Novell Login Script. I've spoken about how we feel about our login script in the past. Back on July 9, 2004 I had a long article about that. The environment has changed, but it still largely stands. Microsoft doesn't have a built-in login script the same way NetWare/OES has had since the 80's, but there are hooks we can leverage. One of my co-workers has built a cunning .VBS file that we're using for our login script, and it does the kinds of things we need out of a login script:
  • Run a series of small applications we need to run, which drive the password change notification process among other things.
  • Maps drives based on group membership.
  • Maps home directories.
  • Allows shelling out to other scripts, which allows less privileged people to manage scripts for their own users.
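The drive-mapping part of that list can be sketched in a few lines. This is only an illustration in Python, not the actual .VBS script; the group names, drive letters, and UNC paths are all made up for the example.

```python
# Hypothetical group-to-drive table. The group names, drive letters and
# UNC paths are invented for illustration; they are not the real ones.
GROUP_DRIVES = {
    "Staff": ("S:", r"\\fileserver\shared"),
    "Accounting": ("P:", r"\\fileserver\accounting"),
}


def drives_for_user(username, groups, home_root=r"\\fileserver\home"):
    """Return {drive_letter: unc_path} for one user.

    Group-based mappings come first, then the home directory mapping.
    Groups that have no entry in GROUP_DRIVES are simply skipped.
    """
    mappings = {}
    for group in groups:
        if group in GROUP_DRIVES:
            letter, path = GROUP_DRIVES[group]
            mappings[letter] = path
    # Every user gets a home directory, regardless of group membership.
    mappings["H:"] = home_root + "\\" + username
    return mappings


if __name__ == "__main__":
    for letter, path in sorted(drives_for_user("jdoe", ["Staff", "Students"]).items()):
        print(letter, path)
```

A real login script would then hand each pair to whatever performs the actual mapping, and that is the step that fails when the file server and the client can't agree on authentication.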
A fair amount of engineering did go into that script, but it works. Mostly. And that's the problem. It works well enough that at least one department on campus decided to put Vista in their one computer lab and rely on this script to get drive mappings. So I got calls shortly after quarter-start to the effect of, "your script don't work, how can this be fixed." To which my reply was (summarized), "You're on Vista and we told y'all not to do that. This isn't working because of XYZ, you'll have to live with it." And they have, for which I am grateful.

Which brings me to XYZ and Win7.

The main incompatibility has to do with the NetWare CIFS stack. Which I describe here. The NetWare CIFS stack doesn't speak NTLMv2, only LM and NTLM. In this respect, it is similar to much older Samba versions. This conflicts with Vista and Windows 7, which both default their LAN Manager Authentication Level to "NTLMv2 Responses Only." Which means that out of the box both Vista and Win7 will require changes to talk to our NetWare servers at all. This is fine so long as they're domained, since we've set a Group Policy to change that level down to something the NetWare servers speak.
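To make the level mismatch concrete, here is a small Python sketch of the six LAN Manager authentication levels and which of them can still talk to a server that only understands LM and NTLM. The level descriptions are paraphrased from the Windows policy setting, and the function name is made up for this example.

```python
# LAN Manager authentication levels (the LmCompatibilityLevel value),
# with descriptions paraphrased from the Windows security policy.
LM_LEVELS = {
    0: "Send LM and NTLM responses",
    1: "Send LM and NTLM, use NTLMv2 session security if negotiated",
    2: "Send NTLM response only",
    3: "Send NTLMv2 response only",
    4: "Send NTLMv2 response only, refuse LM",
    5: "Send NTLMv2 response only, refuse LM and NTLM",
}


def works_with_lm_ntlm_only_server(level):
    """True if a client at this level still sends a response that a
    server without NTLMv2 support (LM and NTLM only) can verify."""
    if level not in LM_LEVELS:
        raise ValueError(f"unknown LAN Manager authentication level: {level}")
    # Levels 3 and up send NTLMv2 responses only.
    return level <= 2


if __name__ == "__main__":
    for level, description in LM_LEVELS.items():
        verdict = "ok" if works_with_lm_ntlm_only_server(level) else "fails"
        print(level, description, "->", verdict)
```

The Vista and Win7 default corresponds to level 3 in this table, so the Group Policy has to push clients down to level 2 or lower before they can authenticate.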

That's not all of it, though. Windows 7 introduced some changes into the SMB/CIFS stack that make talking to NetWare a bit less of a sure thing even with the LAN Man Auth level set right. Perhaps this is SMB2 negotiations getting in the way. I don't know. But for whatever reason, the NetWare CIFS stack and Win7 don't get along as well as Vista's SMB/CIFS stack did.

The main effect of this is that the user's home-directory will fail to mount a lot more often on Win7 than on Vista. Also, other static drive mappings will fail more often. It is for reasons like these that we are not recommending removing the Novell Client and relying on our still-in-testing Windows Login Script.

That said, I can understand why people are relying on the crufty script rather than the just-works Novell Login Script. Due to how our environment works, the Vista/Win7 Novell Client is dog slow. Annoyingly slow. So annoyingly slow that not getting some drives when you log in is preferable to dealing with it.

This will all change once we move the main file-serving cluster to Windows 2008. At that point, the Windows script should Just Work (tm). At that point, getting rid of the Novell Client will allow a more functional environment. We are not at that point yet.



Thursday, October 15, 2009

It's the little things

Right now our Microsoft migration schedule is hung up on backup licenses. Backing up clustered servers requires extensions, which we didn't notice back when we priced out the project. It is things like these that make for cost-overruns. The long and the short of it is, we're not migrating anything until we can legally back up the new environment. Period. That's just how it is.

As most of the budget arm-wrestling happens above me, I only get bits and pieces. Since we don't spend our money, we spend other people's money, we have to convince other people that this money needs to be spent. I understand there was some pushback when the quote came in, and we've been educating about what exactly it would mean if we don't do this.

I understand the order is in the works, and we're just waiting on license codes. But until they arrive (electronic delivery? What's dat?) we simply can not move forward. That's just how it is.



Friday, September 25, 2009

More thoughts on the Novell support change

Something struck me in comments on the last post about this that I think needs repeating on a full post.

Novell spent quite a bit of time attempting to build up their 'community' forums for peer-support. Even going so far as to seed the community with supported 'sysops' who helped catalyze others into participating, and creating a vibrant peer support community. This made sense because it built both goodwill and brand loyalty, but also reduced the cost-center known as 'support'. All those volunteers were taking the minor-issue load off of the call-in support! Money saved!

Fast forward several years. Novell bought SuSE and got heavily into Open Source. Gradually, as the OSS products started to take off commercially, the support contracts became the main money maker instead of product licenses. Just as suddenly, this vibrant goodwill-generating peer-support community is taking vital business away from the revenue-stream known as 'support'. Money lost!

Just a simple shift in the perception of where 'support' fits in the overall cost/revenue stream makes this move make complete sense.

Novell will absolutely be keeping the peer support forums going because they do provide a nice goodwill bonus to those too cheap to pay for support. However.... with 'general support' product-patches going behind a pay-wall, the utility of those forums decreases somewhat. Not all questions, or even most of them for that matter, require patches. But anyone who has called in for support knows the first question to be asked is, "are you on the latest code," and that applies to forum posts as well.

Being unable to get at the latest code for your product version means that the support forum volunteers will have to troubleshoot your problem based on code they may already be well past, or not have had recent experience with. This will necessarily degrade their accuracy, and therefore the quality of the peer support offered. This will actively hurt the utility of the peer-support forums. Unfortunately, this is as designed.

For users of Novell's active-development but severe underdog products such as GroupWise, OES2, and Teaming+Conferencing, the added cost of paying for a maintenance/support contract can be used by internal advocates of Exchange, Windows, and SharePoint as evidence that it is time to jump ship. For users of Novell's industry-leading products such as Novell Identity Management, it will do exactly as designed and force these people into maintaining maintenance contracts.

The problem Novell is trying to address is the kind of company that only buys product licenses when it needs to upgrade, and doesn't bother with maintenance unless it is very sure that a software upgrade will fall within the maintenance period. I know many past and present Novell shops who pay for their software this way. It has its disadvantages because it requires convincing upper management to fork over big bucks every two to five years, and you have to justify Novell's existence every time. The requirement to have a maintenance contract in order for your highly skilled staff to get at TIDs and patches, something that used to be both free and very effective, is a real-world major added expense.

This is the kind of thing that can catalyze migration events. A certain percentage will pony up and pay for support every year, and grumble about it. Others, who have been lukewarm towards Novell for some time due to adherence to the underdog products, may take it as the sign needed to ditch these products and go for the industry leader instead.

This move will hurt their underdog-product market-share more than it will their mid-market and top-market products.

If you've read Novell financial statements in the past few years you will have noticed that they're making a lot more money on 'subscriptions' these days. This is intentional. They, like most of the industry right now, don't want you to buy your software in episodic bursts every couple years. They want you to put a yearly line-item in your budget that reads, "Send money to Novell," that you forget about because it is always there. These are the subscriptions, and they're the wave of the future!



Thursday, September 24, 2009

Very handy but terrible plugin

Yes, this plugin is a terrible idea.

But then, so are appliances with built in self-signed SSL certificates you can't change. You take what you can get.



Tuesday, September 08, 2009

DNS and AD Group Policy

This is aimed a bit more at local WWU users, but it is more widely applicable.

Now that we're moving to an environment where the health of Active Directory plays a much greater role, I've been taking a real close look at our DNS environment. As anyone who has ever received any training on AD knows, DNS is central to how AD works. AD uses DNS the way WinNT used WINS, the way IPX used SAPs, or NetWare uses SLP. Without it things break all over the place.

As I've stated in a previous post, our DNS environment is very fragmented. As we domain more and more machines, the 'univ.dir.wwu.edu' domain becomes the spot where the vast majority of computing resources is resolvable. Right now, the BIND servers are authoritative for the in-addr.arpa reverse-lookup domains, which is why the IP address I use for managing my AD environment resolves to something not in the domain. What's more, the BIND servers are the DNS servers we pass out to every client.

That said, we've done the work to make it work out. The BIND servers have delegation records to indicate that the AD DNS root domain of dir.wwu.edu is to be handled by the AD DNS servers. Windows clients are smart enough to notice this and do the DNS registration of their workstation name against the AD DNS servers and not the BIND servers. However, the in-addr.arpa domains are authoritative on the BIND servers, so the clients' attempts to register the reverse-lookup records all fail. Every client on our network has Event Log entries to this effect.

Microsoft has DNS settings as a possible target for management through Group Policy. This could be used to help ensure our environment stays safe, but will require analysis before we do anything. Changes will not be made without a testing period. What can be done, and how can it help us?

Primary DNS Suffix
Probably the simplest setting of the lot. This would allow us to force all domained machines to consider univ.dir.wwu.edu to be their primary DNS domain and treat it accordingly for Dynamic DNS updates and resource lookups.

Dynamic Update
This forces/allows clients to register their names into the domain's DNS domain of univ.dir.wwu.edu. Most already do this, and this is desirable anyway. We're unlikely to deviate from default on this one.

DNS Suffix Search List
This specifies the DNS suffixes that will be applied to all lookup attempts that don't end in a period. This is one area that we probably should use, but don't know what to set. univ.dir.wwu.edu is at the top of the list for inclusion, but what else? wwu.edu seems logical, and admcs.wwu.edu is where a lot of central resources are located. But most of those are in univ.dir.wwu.edu now. So. Deserves thought.

Primary DNS Suffix Devolution
This determines whether to include the component parts of the primary DNS suffix in the DNS search list. If we set the primary DNS suffix to be univ.dir.wwu.edu, then the DNS resolver will also look in dir.wwu.edu, and wwu.edu. I believe the default here is 'True'.
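The interplay between the search list and devolution is easier to see in code. Below is a simplified Python sketch of how a resolver might build the list of names to try for a short name like "fileserver". It assumes devolution stops at the second-level domain and that an explicit search list replaces devolution entirely; the function names and the host name are invented for the example, and the real Windows resolver has more rules than this.

```python
def devolve(primary_suffix, min_labels=2):
    """Return the primary suffix plus its devolved parents, stopping
    once only min_labels labels remain (wwu.edu, never just edu)."""
    labels = primary_suffix.split(".")
    return [".".join(labels[i:]) for i in range(len(labels) - min_labels + 1)]


def candidate_names(name, search_list=None, primary_suffix=None, devolution=True):
    """List the fully qualified names to try, in order."""
    if name.endswith("."):
        # A trailing period means the name is already fully qualified.
        return [name]
    if search_list:
        # Assumption: an explicit search list is used instead of devolution.
        suffixes = list(search_list)
    elif primary_suffix:
        suffixes = devolve(primary_suffix) if devolution else [primary_suffix]
    else:
        suffixes = []
    return [f"{name}.{suffix}." for suffix in suffixes]


if __name__ == "__main__":
    for fqdn in candidate_names("fileserver", primary_suffix="univ.dir.wwu.edu"):
        print(fqdn)
```

With devolution on and no search list, the sketch tries univ.dir.wwu.edu, then dir.wwu.edu, then wwu.edu. Each extra suffix is another query that has to fail before the next one is tried, which is why the order of a search list matters for resolution time.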

Register PTR Records
If the in-addr.arpa domains remain on the BIND servers, we should probably set this to False. At least so long as our BIND servers refuse dynamic updates that is.
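For anyone who hasn't looked at reverse lookups in a while, the PTR record a client tries to register lives under in-addr.arpa with the octets reversed. A quick Python illustration using the standard library's ipaddress module; the address is just one from our range used as an example.

```python
import ipaddress


def ptr_name(ip):
    """Return the reverse-lookup (PTR) owner name for an IP address."""
    return ipaddress.ip_address(ip).reverse_pointer


if __name__ == "__main__":
    # The octets come out reversed, followed by in-addr.arpa.
    print(ptr_name("140.160.247.31"))
```

Since that name falls inside a zone the BIND servers own, a dynamic update for it goes to servers that refuse it, and the client logs the failure.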

Registration Refresh Interval
Determines how frequently to update Dynamic registrations. Deviation from default seems unlikely.

Replace Addresses in Conflicts
This is a setting for handling how multiple registrations for the same IP (here defined as multiple A records pointing to the same IP) are to be handled. Since we're using insecure DNS updates at the moment, this setting deserves some research.

DNS Servers
If the Win/NW side of Tech Services wishes to open warfare with the Unix side of Tech Services we'll set this to use the AD DNS servers for all domained machines. For this setting overrides client-side DNS settings with the DNS servers defined in the Group Policy. No exceptions. A powerful tool. If we set this at all, it'll almost definitely be the BIND DNS servers. But I don't think we will. Also, it may be true that Microsoft has removed this from the Server 2008 GPO, as it isn't listed on this page.

Register DNS Records with Connection-Specific DNS Suffix
If a machine has more than one network connection (very, very few non VMWare host-machines will) allow them to register those connections against their primary DNS suffix. Due to the relative dearth of such configs, we're unlikely to change this from default.

TTL Set in the A and PTR Records
Since we're likely to turn off PTR updates, this setting is redundant.

Update Security Level
As more and more stations domain, there will come a time when we may wish to cut out the non-domained stations from updating into univ.dir.wwu.edu. If that time comes, we'll set this to 'secure only'. Until then, won't touch it.

Update Top Level Domain Zones
This allows clients to update a TLD like .local. Since our tree is not rooted in a TLD, this doesn't apply to us.

Some of these can have wide ranging effects, but are helpful. I'm very interested in the search-list settings, since each of our desktop techs has tens of DNS domains to choose from depending on their duty area. Something here might greatly speed up resource resolution times.



Tuesday, September 01, 2009

Pushing a feature

One of the things I have missed since Novell went from SLE9 to SLE10 is the machine name in the title-bar for YaST. It used to look like this:

The old YaST titlebar

With that handy "@[machinename]" in it. These days it is much less informative.

The new YaST titlebar

If you're using SSH X-forwarding to manage remote servers, it is entirely possible you'll have multiple YaST windows open. How can you tell them apart? Back in the 9 days it was simple, the window told you. Since 10, the marker has gone away. This hasn't changed in 11.2 either. I would like this changed so I put in a FATE request!

If you'd also like this changed, feel free to vote up feature 306852! You'll need a novell.com login to vote (the opensuse.org site uses the same auth back end so if you have one there you have one on OpenFATE).

Thank you!



Friday, August 28, 2009

Fabric merges

When doing a fabric merge with Brocade gear, when they say that the Zone configuration needs to be exactly the same on both switches, they mean that. The merge process does no parsing, it just compares the zone config. If the metaphorical diff returns anything it doesn't merge. So if one zone has a swapped order of two nodes but is otherwise identical, it'll not merge.
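The difference between a literal compare and a parsed compare can be shown with a toy example. This is a Python sketch with made-up zone names and WWNs, not Brocade's actual merge logic; it just demonstrates how two configs can be equivalent and still fail an order-sensitive comparison.

```python
# Two toy zone configs. The zone name and WWNs are invented; the second
# switch has the same members in zone_esx1 but in swapped order.
SWITCH_A = {"zone_esx1": ["10:00:00:00:c9:aa:aa:01", "50:00:1f:e1:bb:bb:00:08"]}
SWITCH_B = {"zone_esx1": ["50:00:1f:e1:bb:bb:00:08", "10:00:00:00:c9:aa:aa:01"]}


def literal_match(cfg_a, cfg_b):
    """Order-sensitive comparison, like a plain diff of the two configs."""
    return cfg_a == cfg_b


def order_only_differences(cfg_a, cfg_b):
    """Zones with identical members on both switches but in a different order."""
    shared = cfg_a.keys() & cfg_b.keys()
    return sorted(
        zone for zone in shared
        if cfg_a[zone] != cfg_b[zone] and sorted(cfg_a[zone]) == sorted(cfg_b[zone])
    )


if __name__ == "__main__":
    print("literal match:", literal_match(SWITCH_A, SWITCH_B))
    print("differs only by member order:", order_only_differences(SWITCH_A, SWITCH_B))
```

Running something like the second function against dumps of both configs before attempting the merge would point at exactly which zones need their members reordered.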

Yes, this is very conservative. And I'm glad for it, since failure here would have brought down our ESX cluster and that's a very wince-worthy collection of highly visible services. But it took a lot of hacking to get the config on the switch I'm trying to merge into the fabric to be exactly right.



Tuesday, August 18, 2009

Didn't know that

The integrated network card in the HP DL380-G2 doesn't have a Windows Server 2008 driver. Anywhere. And the forum post that says you can use the 2003 driver on it lies, unless there is some even sneakier way of getting a driver in than I know of.

This is a problem, as that's one of our Domain Controllers. But not much of one, since it's one of the three DC's in the empty root (our forest is old enough for that particular bit of discredited advice) and all it does is global-catalog work. And act as our ONLY DOMAIN CONTROLLER on campus. In the off chance that a back-hoe manages to cut BOTH fiber routes to campus, it's the only GC up there.

Also, since it couldn't boot from a USB-DVD drive I had to do a parallel install of 2008 on it. So I still had my perfectly working 2003 install available. So I just dcpromoed the 2003 install and there we are!

Once we get a PCI GigE card for that server I can try getting 2008 working again.



Thursday, August 13, 2009

Why we still use WINS when we have AD

WINS... the Windows Internet Name Service. Introduced in, I believe, Windows NT 3.5 in order to allow Windows name resolution to work across different IP subnets. NetBIOS relies on broadcasts for name resolution, and WINS allowed it to work by using a unicast to the WINS server to find addresses. In theory, DNS in Active Directory (now nine years old!) replaced it.

Not for us.

There are two things that drive the continued existence of WINS on our network, and will ensure that I'll be installing the Server 2008 WINS server when I upgrade our Domain Controllers in the next two weeks:
  1. We still have a lot of non-domained workstations
  2. Our DNS environment is mind-bogglingly fragmented
Here is a list of domains we have, and this is just the domains we're serving with DHCP. There are a lot more:
  • admcs.wwu.edu
  • ac.bldg.wwu.edu
  • ae.bldg.wwu.edu
  • ah.bldg.wwu.edu
  • ai.bldg.wwu.edu
  • cv.bldg.wwu.edu
  • es.bldg.wwu.edu
  • om.bldg.wwu.edu
  • rh.bldg.wwu.edu
  • rl.bldg.wwu.edu
  • archives.wwu.edu
  • bh319lab.wwu.edu
  • bldg.wwu.edu
  • canada.wwu.edu
  • ci.wwu.edu
  • clsrm.wwu.edu
  • cm.wwu.edu
  • crc.wwu.edu
  • etd110.lab01.wwu.edu
  • fm.wwu.edu
  • hh101lab.wwu.edu
  • hh112lab.wwu.edu
  • hh154lab.wwu.edu
  • hh245lab.wwu.edu
  • history.wwu.edu
  • lab03.wwu.edu
  • math.wwu.edu
  • mh072lab.wwu.edu
  • psych.wwu.edu
  • soclab.wwu.edu
  • spmc.wwu.edu
  • ts.wwu.edu
There are more we're serving with DHCP, I just got bored making the list. The thing is, a lot of those networks, and especially the labs, contain 100% domained workstations. Since we only have the one domain, this means all those computers are in a flat DNS structure. In effect, each domained workstation on campus has two DNS names: the one on our BIND servers, and the one in the MS-DNS servers.

That said, for those machines that AREN'T in the domain the only way they can find anything is to use WINS. We will be using it until the University President says unto the masses, "Thou Shalt Domain Thy PC, Or Thou Shalt Be Denied Service." Until then, WINS will continue to be the best way to find Windows resources on campus.



Tuesday, August 11, 2009

Changing the CommandView SSL certificate

One of the increasingly annoying things that IT shops have to put up with is web based administration portals using self-signed SSL certificates. Browsers are increasingly making this setup annoying, and for a good reason. Which is why I try and get these pages signed with a real key if they allow me to.

HP's Command View EVA administration portal annoyingly overwrites the custom SSL files when it does an upgrade. So you'll have to do this every time you apply a patch or otherwise update your CV install.
  1. Generate an SSL certificate with the correct data.
  2. Extract the certificate into base-64 form (a.k.a. PEM format) in separate 'certificate' and 'private key' files.
  3. On your command view server overwrite the %ProgramFiles%\Hewlett-Packard\sanworks\Element Manager for StorageWorks HSV\server.cert file with the 'certificate' file
  4. Overwrite the %ProgramFiles%\Hewlett-Packard\sanworks\Element Manager for StorageWorks HSV\server.pkey file with the 'private key' file
  5. Restart the CommandView service
At that point, CV should be using your generated certificates. Keep these copied somewhere else on the server so you can quickly copy them back in when you update Command View.
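Since this has to be redone after every Command View update, the copy-back step is worth scripting. Here is a minimal Python sketch, assuming you keep your good server.cert and server.pkey in a backup folder of your choosing; the function name and the backup location are hypothetical, and restarting the service is left as a manual step.

```python
import shutil
from pathlib import Path

# The two file names Command View reads, per the steps above.
CERT_FILES = ("server.cert", "server.pkey")


def restore_cv_certs(backup_dir, cv_dir):
    """Copy the saved certificate and private key over the files that
    the Command View update put back. Returns the names restored."""
    backup_dir, cv_dir = Path(backup_dir), Path(cv_dir)
    # Check everything is present first, so we never replace only
    # one half of the certificate/key pair.
    for name in CERT_FILES:
        if not (backup_dir / name).is_file():
            raise FileNotFoundError(f"missing backup file: {backup_dir / name}")
    restored = []
    for name in CERT_FILES:
        shutil.copy2(backup_dir / name, cv_dir / name)
        restored.append(name)
    return restored
```

You would point cv_dir at the Element Manager for StorageWorks HSV folder from steps 3 and 4, then restart the CommandView service as in step 5.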



Non-paid work hours

Ars Technica has an article up today about workers who put in a lot of unpaid hours thanks to their mobile devices. This isn't a new dynamic by any means, we had a lot of this crop up when Corporate web-mail started becoming ubiquitous, and before that with the few employees using remote desktop software (PCAnywhere anyone?) to read email from home over corporate dialup. The BlackBerry introduced the phenomenon to the rest of the world, and the smartphone revolution is bringing this to the masses.

My old workplace was union, so was in the process of figuring out how to compensate employees for after-hours call-out shortly after we got web-mail working. There were a few state laws and similar rulings that directed how it should be handled, and ultimately they decided on no less than 2-hours overtime pay for issues handled on the phone, and no less than 4-hours overtime pay for issues requiring a site-visit. Yet, no payment for being officially on-call with a mandatory response time; it was seen that actually responding to the call was the payment. Even if being on-call meant not being able to go to a child's 3-hour dance recital.

Now that I'm an exempt employee, I don't get anything like overtime. If I spend 36 hours in a weekend shoving an upgrade into our systems through sheer force of will, I don't automatically get Monday off or a whonking big extra line-item on my next paycheck. It's between me and my manager how many hours I need to put in that week.

As for on-call, we don't have a formal on-call schedule. All of us agree we don't want one, and strive to make the informal one work for us all. No one wants to plan family vacations around an on-call schedule, or skip out of town sporting events for their kids just so they can be no more than an hour from the office just in case. It works for us, but all it'll take to force a formal policy is one bad apple.

For large corporations with national or global workforces, such gentleman's agreements aren't really doable. Therefore, I'm not at all surprised to see some lawsuits being spawned because of it. Yes, some industries come with on-call rotations baked in (systems administration being one of them). Others, such as tech-writing, don't generally have much after-hours work, and yet I've seen second hand such after hours work (working on docs, conference calls, etc) consume an additional 6 hours a day.

Paid/unpaid after hours work gets even more exciting if there are serious timezone differences involved. East Coast workers with the home-office on the West Coast will probably end up with quite a few 11pm conference calls. Reverse the locations, and the West Coast resident will likely end up with a lot of 5am conference calls. Companies that have drunk deeply from the off-shoring well have had to deal with this, but have had the benefit of different labor laws in their off-shored countries.

"Work" is now very flexible. Certain soulless employers will gleefully take advantage of that, which is where the lawsuits come from. In time, we may get better industry standard practice for this sort of thing, but it's still several years away. Until then, we're on our own.



Friday, August 07, 2009

Identity Management in .EDU land

We have a few challenges when it comes to an identity management system. As with any attempt to automate identity management, it is the exceptions that kill projects. This is an extension of the 80/20 rule, where 80% of the cases will be dead easy to manage, and the 20% that are special are where most of the business-rules meeting-time will be spent.

In our case, we have two major classes of users:
  • Students
  • Employees
And a few minor classes littered about like Emeritus Professors. I don't quite know enough about them to talk knowledgeably.

The biggest problem we have is how to handle the overlaps. Student workers. Staff who take classes. We have a lot of student workers, but staff who take classes are another story. The existence of these types of people makes it impossible to treat the two big classes as exclusive.

Banner handles this case pretty well from what I understand. The systems I manage, however, are another story. With eDirectory and the Novell Client, we had two big contexts named Students and Users. If your object was in one, that's the login script you ran. Active Directory was until recently Employee-only because of Exchange. We put the students in there (with no mailboxes of course) two years ago, largely because we could and it made the student-employee problem easier to manage.

One of the thorniest questions we have right now is defining, "when is a student a student with a job, and when is a student an employee taking classes." Unfortunately, we do not have a handy business rule to solve that. A rule, for example, like this one:
If a STUDENT is taking less than M credit-hours of classes, and is employed in a job-class of C1-F9, then they shall be reclassed EMPLOYEE.
That would be nice. But we don't have it, because the manual exception-handling process this kicks off is not quite annoying enough to warrant the expense of deciding on an automatable threshold. Because this is a manual process, people rarely get moved back across the Student/Employee line in a timely way. If the migration process were automated, certain individuals would probably flop over the line every other quarter.
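If such a rule did exist, automating it would be the easy part. Here is a Python sketch of the example rule; since nobody has decided on M, it stays a parameter, and reading "C1-F9" as an ordering by letter and then number is purely an assumption made for this illustration.

```python
def parse_job_class(job_class):
    """Split a job class like 'D4' into ('D', 4) so classes can be ordered.
    Assumes one letter followed by a number, which is a guess."""
    return (job_class[0].upper(), int(job_class[1:]))


def classify(credit_hours, job_class, m_threshold, low="C1", high="F9"):
    """Apply the example rule: a student taking fewer than M credit-hours
    who holds a job in the C1-F9 range is reclassed as EMPLOYEE."""
    if job_class is None:
        # No job at all, so the rule never applies.
        return "STUDENT"
    in_range = parse_job_class(low) <= parse_job_class(job_class) <= parse_job_class(high)
    if credit_hours < m_threshold and in_range:
        return "EMPLOYEE"
    return "STUDENT"


if __name__ == "__main__":
    # M is set to 6 here only to have a number to run with.
    print(classify(3, "D4", m_threshold=6))
    print(classify(12, "D4", m_threshold=6))
```

Run every quarter against enrollment data, a function like this is exactly what would flop a borderline person back and forth over the line, which is the hard part the business rule would have to address.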

This is one nice example of the sorts of discussions you have to have when rolling out an identity management automation system. If we were given umpty thousand dollars to deploy Novell IDM in order to replace our home-built system, we'd have to start having these kinds of discussions again. Even though we've had some kind of identity provisioning system since the early 90's. Because we DO have an existing one, some of the thornier questions of data-ownership and workflow are already solved. We'd just have to work through the current manual-intervention edge cases.



Monday, August 03, 2009

Robust NTP environments

Due to my background as a NetWare guy, time-synchronization is something I pay attention to. Early versions of NDS were touchy about that, since the time-stamp was used in the conflicting-edits resolution process. NetWare didn't use a full up NTP client for this, Novell built their own form of it based on NTP code and called it TimeSync. Unlike NTP, TimeSync did what it could to ensure the entire environment was within a second or two of a single time. Because of the lower time resolution, it synced a lot faster than NTP did, and this was considered a good thing since out-of-sync time was considered an outage.

With that in mind, it is no surprise that I like to have a solid time-sync process in place on my networks. One of the principles of Novell's TimeSync config was the concept of a time-group. A group of servers who coordinated time between themselves, and a bunch of clients who poll the members of the time-group for correct time. Back before internet connections were as ubiquitous as air, this was a good way for an office network to maintain a consensus time. Later on, TimeSync gained the ability to talk over TCP/IP and could use NTP sources for external time, which allowed TimeSync to hook into the Coordinated Universal Time (UTC) system.

You can create much the same kind of network with NTP as you could with TimeSync. It requires more than one time server, but your clients only have to directly speak with one of the time servers in the group. Yet the same type of robustness can be had.

The concept is founded in the "peer" association for NTP. The definition of this command is rather dry:
For type s addresses (only), this command mobilizes a persistent symmetric-active mode association with the specified remote peer.
And doesn't tell you much. This is much clearer:
Symmetric active/passive mode is intended for configurations where a clique of low-stratum peers operate as mutual backups for each other. Each peer operates with one or more primary reference sources, such as a radio clock, or a set of secondary (stratum, 2) servers known to be reliable and authentic. Should one of the peers lose all reference sources or simply cease operation, the other peers will automatically reconfigure so that time and related values can flow from the surviving peers to all hosts in the subnet. In some contexts this would be described as a "push-pull" operation, in that the peer either pulls or pushes the time and related values depending on the particular configuration.
Unlike TimeSync, if all the peers lose their upstreams (the internet connection is down) then the entire infrastructure goes out of sync. This can be mitigated somewhat through judicious use of the 'maxpoll' parameter; set it high enough, and it can be hours (or days if you set it really high) before the peer even notices it can't talk to its upstream and will continue to report in-sync time to clients.

It is also a very good idea to use ACLs in your ntp.conf file to restrict what types of connections clients can mobilize. It is quite possible to be evil to NTP servers. You can turn on enough options to allow trouble-shooting, but not allow config changes.

It is a very good idea for your peers to be cryptographically associated with each other as well. There are at least two methods for this with NTP: symmetric key, available since v3, and v4's Autokey. Symmetric key is a somewhat easier to set up preshared-key system, Autokey is public-key based and more secure, and either is preferable to nothing.

Here is a pair of /etc/ntp.conf files for a hypothetical set of WWU time-servers (items like drift-file and logging options have been omitted):
# First time server (140.160.11.86)
server 0.north-america.pool.ntp.org maxpoll 13
server 1.north-america.pool.ntp.org maxpoll 13
peer 140.160.247.31 key 1

enable auth monitor
keys /etc/ntp.keys
trustedkey 1
requestkey 1

restrict default ignore
restrict 140.160.0.0 mask 255.255.0.0 nomodify nopeer
restrict 140.160.247.31

# Second time server (140.160.247.31)
server 2.north-america.pool.ntp.org maxpoll 13
server 3.north-america.pool.ntp.org maxpoll 13
peer 140.160.11.86 key 1

enable auth monitor
keys /etc/ntp.keys
trustedkey 1
requestkey 1

restrict default ignore
restrict 140.160.0.0 mask 255.255.0.0 nomodify nopeer
restrict 140.160.11.86

The 'maxpoll' values ensure that once time has been synchronized for long enough, the time between polls of the upstream NTP servers will be about 137 minutes (2^13 seconds). Hopefully, any internet outages should be shorter than that. Setting maxpoll to even higher values will allow longer times between polls, and therefore longer internet outage tolerance. This can get QUITE long; I've seen some NTP servers that poll twice a week.
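The maxpoll arithmetic is easy to check by hand, but a short sketch makes the scale obvious. ntpd treats the value as an exponent: the interval is two to that power, in seconds. The Python below is only an illustration; the function name is made up, and the three exponents shown are ntpd's default maxpoll (10), the value used in the configs above (13), and the documented upper limit (17).

```python
def poll_interval_seconds(poll_exponent: int) -> int:
    """ntpd poll intervals are powers of two, measured in seconds."""
    return 2 ** poll_exponent


# 10 is ntpd's default maxpoll, 13 is what the configs above use,
# and 17 is the largest value ntpd accepts.
for exponent in (10, 13, 17):
    seconds = poll_interval_seconds(exponent)
    print(f"maxpoll {exponent}: {seconds} s, {seconds / 60:.1f} min, {seconds / 3600:.1f} h")
```

So maxpoll 13 works out to 8192 seconds, a bit over two and a quarter hours, and each step up doubles the outage you can ride through before the server notices.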

The key settings set up a symmetric-key crypto system. The "key 1" option on the peer line indicates that the designated connection should use crypto validation. The actual data passed isn't encrypted; the crypto is used for identity validation. This prevents spoofing of time, which can lead to wildly off time values.

The 'restrict' lines tell ntpd to ignore off-campus requests for time (it'll still listen on the port, but it drops those packets without replying), allow on-campus users to get time and do time tracing but nothing else, and allow full access to the peer time server. In theory, inbound NTP traffic should be stopped at the border firewall, but just in case, ntpd will refuse any that gets through.
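To see how the three 'restrict' lines sort clients into buckets, here is a simplified model in Python. It assumes the most specific matching entry wins, which is how the list behaves for a layout like this one; it is not a reimplementation of ntpd, and the table and function names are invented for the illustration. The entries mirror the first server's config.

```python
import ipaddress

# (network, flags) pairs mirroring the first server's 'restrict' lines.
# An empty flag set means no restrictions at all.
RESTRICTIONS = [
    (ipaddress.ip_network("0.0.0.0/0"), {"ignore"}),                  # restrict default ignore
    (ipaddress.ip_network("140.160.0.0/16"), {"nomodify", "nopeer"}),  # on-campus clients
    (ipaddress.ip_network("140.160.247.31/32"), set()),                # the peer time server
]


def flags_for(source: str) -> set:
    """Return the flags of the most specific entry that contains the source address."""
    address = ipaddress.ip_address(source)
    matches = [(network, flags) for network, flags in RESTRICTIONS if address in network]
    _, flags = max(matches, key=lambda entry: entry[0].prefixlen)
    return flags


print(flags_for("8.8.8.8"))          # off campus
print(flags_for("140.160.11.86"))    # on campus
print(flags_for("140.160.247.31"))   # the peer
```

An off-campus address falls through to the default entry, an on-campus address picks up nomodify and nopeer, and only the peer gets an empty flag set.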

This is a two-server setup, but three or more servers could easily be involved. For a network our size (large) and complexity (simple), two to three time-servers is probably all we need. The peered time-servers will all report in-sync so long as one still considers itself in-sync with an upstream time-server.

Because peers sync time amongst themselves, clients only have to talk to a single time-server to get valid time. Of course, that introduces a single-point-of-failure in the system if that time-host ever has to go down. Because of this, I strongly recommend configuring NTP clients to use at least two upstreams.

Enjoy high quality time!



Thursday, July 30, 2009

Datacenter environment

We're having a major heat-wave. Sea-Tac airport set an all-time record high yesterday at 103 degrees. Bellingham too: the old record of 94, set in 2007, was surpassed by yesterday's reading of 96. Today is cooler, but still well above average for out here.

Much as I was tempted to show up for work today in a tank top, shorts, and flip-flops, I resisted. First of all, I did have a meeting up on campus with some executive-types higher than me so I had to keep up appearances. Also, flip-flops aren't that good for hiking the mile plus to campus.

Of course, today is a day when I get to do a surprise server rebuild in the datacenter! I just spent the last hour standing on a vent tile setting up a server reformat. While I'm not wearing flip-flops, I am wearing shorts. I was cold, so I went for a walk around the building to warm up, and it performed admirably.

Happily, since we have a data-center in the building, the building itself has AC. Not all buildings here do. In fact, the building I had that meeting in did not have any AC, just some moving air.

We have enough AC in the datacenter that the room isn't any hotter today than it gets in mid January. That's nice to have.



Monday, July 27, 2009

Service delivery in .EDU-land

Matt of standalone-sysadmin fame asked:
I take it from the terminology ("fall quarter") that you work at a university.

How often do you re-engineer your infrastructure, or roll out new servers? Do you align them to the school quarters? I'm interested in knowing how other people make decisions on roll-outs.
Until a couple weeks ago, this blog was hosted on a server named, "myweb.facstaff.wwu.edu," which should give you a real good idea of where I work ;). So yes, a university. We're also on quarters, not semesters, so our school year is a bit different than those that have only three terms a year instead of four.

For things that will require disruptive downtime for critical systems that'll exceed a few hours, we keep those to the times we're not teaching. We have on the order of 21,000 actual students kicking around (the FTE count is much smaller, we have a lot of part-timers) so outages get noticed. We have students actively printing and handing in homework to Blackboard at 4am, so 'dark of night' is only good for so many things.

The biggest is the summer intersession, which this year runs from 8/25 @ Noon (the point grades are due from faculty) to roughly 9/18 (when students start moving into the dorms). It is reserved for the big and disruptive projects: things like completely migrating every file we have to new hardware, upgrading the ERP system we use (SCT Banner), replacing the router core, upgrading our SAN-based disk-arrays, or upgrading Blackboard. Winter break and Spring break are the other times during the year when this kind of activity can take place.

Winter has a couple weeks to work with, but we're generally rather short-staffed during that period so we try not to do big stuff. Spring is just a few days, so things like a quick point-level upgrade to Blackboard could be done, something that doesn't require extensive testing, validation, or data conversion. Summer intersession is where the big heavy lifting can take place, and we do try and work our various vacations around this particular time of the year.

But we can and do roll new stuff out during session. It is a lot easier if the new thing isn't disruptive to established work-flow, or if it just adds functionality to something they're already using. Anything student-visible gets extra scrutiny, as the potential for massive amounts of work on the part of our helpdesk is a lot higher. A lot of our decisions have significant input from the "How much extra work will our Helpdesk experience as a result of this change?" question.

Also, the work varies. Some years we have a lot going on in the summer. This year we only have the one major project. In years when we have a lot going on, we've started planning the summer project season as early as March. Some things, like the router core update and the Banner updates, are known about 18 months or more in advance due to budgeting requirements. Other things, like Blackboard updates and oddly enough this Novell -> Windows migration project, aren't really committed to until May or later.

As for determining when what gets updated/upgraded, that's the responsibility of the maintainers of that application, infrastructure, or hardware to start. Due to the budget cycle, big ticket items are generally known about very far in advance of the actual project implementation stage. Everything eventually falls into the project coordination sphere, which is a very large part of the Technical Services Manager's job (you too can be my new boss! But wouldn't THAT be awkward?). The TS Manager coordinates with the Academic Computing director and the Administrative Computing director, as well as the Vice Provost of course, to mutually set priorities and allocate resources.

p.s.: The Technical Services page for Organization Size is horribly horribly wrong. We have more servers than that for both MS and Linux. We have fewer NetWare servers, and by now fewer Unix servers. And way more disk space than that.



Tuesday, July 21, 2009

Digesting Novell financials

It's a perennial question, "why would anyone use Novell any more?" Typically coming from people who only know Novell as "That NetWare company," or perhaps, "the company that we replaced with Exchange." These are the same people who are convinced Novell is a dying company who just doesn't know it yet.

Yeah, well. Wrong. Novell managed to turn the corner and wean themselves off of the NetWare cash-cow. Take the last quarterly statement, which you can read in full glory here. I'm going to excerpt some bits, but it'll get long. First off, their description of their market segments. I'll try to include relevant products where I know them.

We are organized into four business unit segments, which are Open Platform Solutions, Identity and Security Management, Systems and Resource Management, and Workgroup. Below is a brief update on the revenue results for the second quarter and first six months of fiscal 2009 for each of our business unit segments:



Within our Open Platform Solutions business unit segment, Linux and open source products remain an important growth business. We are using our Open Platform Solutions business segment as a platform for acquiring new customers to which we can sell our other complementary cross-platform identity and management products and services. Revenue from our Linux Platform Products category within our Open Platform Solutions business unit segment increased 25% in the second quarter of fiscal 2009 compared to the prior year period. This product revenue increase was partially offset by lower services revenue of 11%, such that total revenue from our Open Platform Solutions business unit segment increased 18% in the second quarter of fiscal 2009 compared to the prior year period.

Revenue from our Linux Platform Products category within our Open Platform Solutions business unit segment increased 24% in the first six months of fiscal 2009 compared to the prior year period. This product revenue increase was partially offset by lower services revenue of 17%, such that total revenue from our Open Platform Solutions business unit segment increased 15% in the first six months of fiscal 2009 compared to the prior year period.

[sysadmin1138: Products include: SLES/SLED]



Our Identity and Security Management business unit segment offers products that we believe deliver a complete, integrated solution in the areas of security, compliance, and governance issues. Within this segment, revenue from our Identity, Access and Compliance Management products increased 2% in the second quarter of fiscal 2009 compared to the prior year period. In addition, services revenue was lower by 45%, such that total revenue from our Identity and Security Management business unit segment decreased 16% in the second quarter of fiscal 2009 compared to the prior year period.

Revenue from our Identity, Access and Compliance Management products decreased 3% in the first six months of fiscal 2009 compared to the prior year period. In addition, services revenue was lower by 40%, such that total revenue from our Identity and Security Management business unit segment decreased 18% in the first six months of fiscal 2009 compared to the prior year period.

[sysadmin1138: Products include: IDM, Sentinel, ZenNAC, ZenEndPointSecurity]



Our Systems and Resource Management business unit segment strategy is to provide a complete “desktop to data center” offering, with virtualization for both Linux and mixed-source environments. Systems and Resource Management product revenue decreased 2% in the second quarter of fiscal 2009 compared to the prior year period. In addition, services revenue was lower by 10%, such that total revenue from our Systems and Resource Management business unit segment decreased 3% in the second quarter of fiscal 2009 compared to the prior year period. In the second quarter of fiscal 2009, total business unit segment revenue was higher by 8%, compared to the prior year period, as a result of our acquisitions of Managed Object Solutions, Inc. (“Managed Objects”) which we acquired on November 13, 2008 and PlateSpin Ltd. (“PlateSpin”) which we acquired on March 26, 2008.

Systems and Resource Management product revenue increased 3% in the first six months of fiscal 2009 compared to the prior year period. The total product revenue increase was partially offset by lower services revenue of 14% in the first six months of fiscal 2009 compared to the prior year period. Total revenue from our Systems and Resource Management business unit segment increased 1% in the first six months of fiscal 2009 compared to the prior year period. In the first six months of fiscal 2009 total business unit segment revenue was higher by 12% compared to the prior year period as a result of our Managed Objects and PlateSpin acquisitions.

[sysadmin1138: Products include: The rest of the ZEN suite, PlateSpin]



Our Workgroup business unit segment is an important source of cash flow and provides us with the potential opportunity to sell additional products and services. Our revenue from Workgroup products decreased 14% in the second quarter of fiscal 2009 compared to the prior year period. In addition, services revenue was lower by 39%, such that total revenue from our Workgroup business unit segment decreased 17% in the second quarter of fiscal 2009 compared to the prior year period.

Our revenue from Workgroup products decreased 12% in the first six months of fiscal 2009 compared to the prior year period. In addition, services revenue was lower by 39%, such that total revenue from our Workgroup business unit segment decreased 15% in the first six months of fiscal 2009 compared to the prior year period.

[sysadmin1138: Products include: Open Enterprise Server, GroupWise, Novell Teaming+Conferencing]

The reduction in 'services' revenue is, I believe, a reflection of a decreased willingness for companies to pay Novell for consulting services. Also, Novell has changed how they advertise their consulting services, which seems to have had an impact as well. That's the economy for you. The raw numbers:


Three months ended (in thousands)

                                                    April 30, 2009                               April 30, 2008
                                        Net revenue   Gross profit      Operating    Net revenue   Gross profit      Operating
                                                                    income (loss)                                income (loss)
Open Platform Solutions                    $ 44,112       $ 34,756       $ 21,451       $ 37,516       $ 26,702       $ 12,191
Identity and Security Management             38,846         27,559         18,306         46,299         24,226         12,920
Systems and Resource Management              45,354         37,522         26,562         46,769         39,356         30,503
Workgroup                                    87,283         73,882         65,137        105,082         87,101         77,849
Common unallocated operating costs                -        (3,406)      (113,832)              -        (2,186)      (131,796)
Total per statements of operations        $ 215,595      $ 170,313       $ 17,624      $ 235,666      $ 175,199        $ 1,667

Six months ended (in thousands)

                                                    April 30, 2009                               April 30, 2008
                                        Net revenue   Gross profit      Operating    Net revenue   Gross profit      Operating
                                                                    income (loss)                                income (loss)
Open Platform Solutions                    $ 85,574       $ 68,525       $ 40,921       $ 74,315       $ 52,491       $ 24,059
Identity and Security Management             76,832         52,951         35,362         93,329         52,081         29,316
Systems and Resource Management              90,757         74,789         52,490         90,108         74,847         58,176
Workgroup                                   177,303        149,093        131,435        208,840        173,440        155,655
Common unallocated operating costs                -        (7,071)      (228,940)              -        (4,675)      (257,058)
Total per statements of operations        $ 430,466      $ 338,287       $ 31,268      $ 466,592      $ 348,184       $ 10,148

So, yes. Novell is making money, even in this economy. Not lots, but at least they're in the black. Their biggest growth area is Linux, which is making up for deficits in other areas of the company. Especially the sinking 'Workgroup' area. Once upon a time, "Workgroup," constituted over 90% of Novell revenue.
Revenue from our Workgroup segment decreased in the first six months of fiscal 2009 compared to the prior year period primarily from lower combined OES and NetWare-related revenue of $13.7 million, lower services revenue of $10.5 million and lower Collaboration product revenue of $6.3 million. Invoicing for the combined OES and NetWare-related products decreased 25% in the first six months of fiscal 2009 compared to the prior year period. Product invoicing for the Workgroup segment decreased 21% in the first six months of fiscal 2009 compared to the prior year period.
Which is to say, companies dropping OES/NetWare constituted the large majority of the losses in the Workgroup segment. Yet that loss was almost wholly made up by gains in other areas. So yes, Novell has turned the corner.
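For anyone who wants to check Novell's quoted percentages against the raw numbers, the year-over-year change is just (current - prior) / prior. The sketch below uses the three-month net revenue figures from the table above; the dictionary layout and function name are mine, not anything from the filing.

```python
# Net revenue in thousands of dollars for the three months ended April 30:
# segment -> (2009, 2008), copied from the table above.
NET_REVENUE = {
    "Open Platform Solutions": (44_112, 37_516),
    "Identity and Security Management": (38_846, 46_299),
    "Systems and Resource Management": (45_354, 46_769),
    "Workgroup": (87_283, 105_082),
}


def percent_change(current: float, prior: float) -> float:
    """Year-over-year change, as a percentage of the prior-year figure."""
    return (current - prior) / prior * 100.0


for segment, (current, prior) in NET_REVENUE.items():
    print(f"{segment}: {percent_change(current, prior):+.1f}%")

total_current = sum(current for current, _ in NET_REVENUE.values())
total_prior = sum(prior for _, prior in NET_REVENUE.values())
print(f"Total: {percent_change(total_current, total_prior):+.1f}%")
```

The four segment results round to the +18%, -16%, -3%, and -17% that Novell quotes for the quarter, and the segment revenues sum to the totals in the last row of the table.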

Another thing to note in the section about Linux:
The invoicing decrease in the first six months of 2009 reflects the results of the first quarter of fiscal 2009 when we did not sign any large deals, many of which have historically been fulfilled by SUSE Linux Enterprise Server (“SLES”) certificates delivered through Microsoft.
Which is pretty clear evidence that Microsoft is driving a lot of Novell's Operating System sales these days. That's quite a reversal, and a sign that Microsoft is officially more comfortable with this Linux thing.



Monday, July 20, 2009

Powershell and ODBC

One nice thing about PowerShell is that it can talk to databases without a predefined ODBC connection. That makes scripts a lot more portable! I approve. However, I had trouble finding out how to set up the connection and read data. So here is what I have.

##### Key variables
$SQLServerName="sqlserver"
$SQLDatabase="YourDatabaseInTheServer"

##### Start the database connection and set up environment
$DbString="Driver={SQL Server};Server=$SQLServerName;Database=$SQLDatabase;"
$DBConnection=New-Object System.Data.Odbc.OdbcConnection
$DBCommand=New-Object System.Data.Odbc.OdbcCommand
$DBConnection.ConnectionString=$DbString
$DBConnection.Open()
$DBCommand.Connection=$DBConnection

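# $MBServer and $MBStore are set elsewhere in the larger script
$InsertStatement="INSERT into Mbox_DB (MBServer, MBStore) values ('$MBServer', '$MBStore')"
$DBCommand.CommandText=$InsertStatement
$DBResult=$DBCommand.ExecuteNonQuery()

# String values need quotes in the WHERE clause, same as in the INSERT
$SelectStatement="SELECT MBDBID From Mbox_DB WHERE (MBServer='$MBServer') AND (MBStore='$MBStore')"
$DBCommand.CommandText=$SelectStatement
$DBResult=$DBCommand.ExecuteScalar()

##### Clean up
$DBConnection.Close()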

Yes, this is part of a larger script I'm writing. When that finishes, I'll probably post it too.
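As an aside, the same insert-then-look-up pattern is sketched below in Python, with the standard-library sqlite3 module and an in-memory database standing in for SQL Server. The table and column names follow the snippet above; the helper functions and the sample values are invented. The point of the sketch is the '?' placeholders: the driver handles quoting, so a value containing an apostrophe doesn't break the statement the way string interpolation can.

```python
import sqlite3

# In-memory SQLite stands in for the real SQL Server database.
connection = sqlite3.connect(":memory:")
connection.execute(
    "CREATE TABLE Mbox_DB (MBDBID INTEGER PRIMARY KEY, MBServer TEXT, MBStore TEXT)"
)


def add_store(conn, server, store):
    """Insert one server/store pair, letting the driver quote the values."""
    conn.execute("INSERT INTO Mbox_DB (MBServer, MBStore) VALUES (?, ?)", (server, store))
    conn.commit()


def lookup_store_id(conn, server, store):
    """Return the MBDBID for a server/store pair, or None if there is no match."""
    row = conn.execute(
        "SELECT MBDBID FROM Mbox_DB WHERE MBServer = ? AND MBStore = ?",
        (server, store),
    ).fetchone()
    return None if row is None else row[0]


add_store(connection, "exchange1", "Dean's Office Store")
print(lookup_store_id(connection, "exchange1", "Dean's Office Store"))
```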



Wednesday, July 15, 2009

Where DIY belongs

The question of "When should you build it yourself and when should you get it off the shelf?" is one that varies from workplace to workplace. We heard several different variants of that when we were interviewing for the Vice Provost for IT last year. Some candidates only did home-brew when no off the shelf package was available, others looked at the total cost of both and chose from there. This is a nice proxy question for, "What is the role of open source in your environment," as it happens.

Backups are one area where duct tape and baling wire is to be discouraged most emphatically.

And now, a moment on tar. It is a very versatile tool, and is what a lot of unixy backup packages are built around. The main problem with backup and restore is not getting data to the backup medium; it is keeping track of what data is on which medium. In these days of backup-to-disk, de-duplication is also in the mix, and that's something tar can't do yet. So while you can build a tar-and-bash backup system from scratch without paying a cent, it will be lacking in certain very useful features.

Also? Tar doesn't work nearly as well on Windows.

Your backup system is one area where you really do not want to invest a lot of developer creativity. You need it to be bullet proof, fault tolerant, able to handle a variety of data-types, and easy to maintain. Even the commercial packages fail some of these points some of the time, and home-brew systems fail them much more often. The big backup boys have agents that allow backups of Oracle DBs, Linux filesystems, Exchange, and Sharepoint all to the same backup system; a home-brew application would have to get very creative to do the same thing. The problem gets even worse when it comes to restore.

Disaster Recovery is another area in which duct tape and baling wire are to be discouraged most emphatically.

There are battle-tested open-source packages out there that will help with this (DRBD for one), depending on your environment. They're even widely used so finding someone to replace the sysadmin who just had a run in with a city bus is not that hard. Rsync can do a lot as well, so long as the scale is small. Most single systems can have something cobbled together.

Problems arise when you start talking Windows or very complex installations, or when money is a major issue. If you throw enough money at a problem, most disaster recovery problems become a lot less complex. There is a lot of industry investment in DR infrastructure, so the tools are out there. Doing it on a shoe-string means that your disaster recovery also hangs by a shoe-string. If you're doing DR just to satisfy your auditors and don't plan on ever actually using it, that's one thing. But if you really expect to recover from a major disaster on that shoe-string you'll be sorely surprised when that string snaps.

Business Continuity is an area where duct tape and baling wire should be flatly refused.

BC is in many ways DR with a much shorter recovery time. If you had problems getting your DR funded correctly, BC shouldn't even be on the timeline. Again, if it is just so you can check a box on some audit report, that's one thing. Expecting to run on such a rig is quite another.

And finally, if you do end up cobbling together backup, disaster recovery, or business continuity systems from their component parts, testing the system is even more important. In many cases testing DR/BC takes a production outage of some kind, which makes it hard to schedule tests. But testing is the only way to find out if your shoe-string can stand the load.



Friday, July 10, 2009

Email reputation

One of the hot new things in anti-spam technology is something that's rather old. Yes, the Realtime Blackhole List is back. Only these RBLs aren't the old school DNS servers of yesteryear; they are maintained by the big anti-spam vendors and are completely proprietary. The new name is "IP Reputation," and that's what is showing up on the marketing glossies.

The idea is that you deploy a network of sensors (say, every anti-spam appliance you ship, or software-package installed) that relay spam/ham information back to home base. Home base then builds a profile of the behaviors of the incoming IP connections. Once certain completely proprietary thresholds are crossed, the anti-spam vendor then publishes that particular IP address's reputation to their service. The installed base then queries the reputation service on every incoming TCP connection to see how to handle that connection.

The responses vary from vendor to vendor, but include:
  • Outright blocking. Do not accept traffic from this IP address. The connection is terminated before any SMTP commands can be issued. Do not pass EHLO. Do not collect 220-ESMTP.
  • Defer. Issue a 421 error message. Smart mailers will attempt redelivery later. Bots are generally too stupid to try this and just pass on to the next address on their list.
  • Throttle. Get very slow in accepting mail. Take a long time to issue 250-Ready statuses after SMTP commands.
The nice thing about IP reputation is that it is fast and cheap. Instead of having to lexically scan every incoming email for spamminess, you can just look at the source's reputation and block a very large percentage of messages. When we turned this on for our spam product a while back, the reputation filter accounted for between 90% and 95% of all messages ultimately blocked as spam. Clean email is the single most expensive mail to pass since it has to go through every single stage of the spam/ham test pipeline, and blocking things earlier in the pipeline is a good way to shed load.
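The load-shedding effect is easier to see in a sketch. The Python below is a toy model, not any vendor's algorithm: the 0.0 to 1.0 reputation scale, the thresholds, and the function names are all made up. What it shows is the ordering, where the cheap reputation check runs first and the expensive content scan only runs for connections that survive it.

```python
from typing import Callable

# Hypothetical thresholds on a 0.0 (clean) to 1.0 (known spam source) scale.
BLOCK_AT = 0.9
DEFER_AT = 0.7
THROTTLE_AT = 0.5


def connection_action(reputation: float) -> str:
    """Pick a connection-level response from the sender's reputation score."""
    if reputation >= BLOCK_AT:
        return "block"      # drop the connection before any SMTP dialogue
    if reputation >= DEFER_AT:
        return "defer"      # answer 421 and let real mailers retry later
    if reputation >= THROTTLE_AT:
        return "throttle"   # accept, but respond slowly
    return "accept"


def handle_message(reputation: float, content_scan: Callable[[], bool]) -> str:
    """Run the cheap reputation check first; scan content only if needed."""
    action = connection_action(reputation)
    if action in ("block", "defer"):
        return action       # the expensive content scan never runs
    return "spam" if content_scan() else "deliver"
```

A caller would pass in whatever lexical scanner it already has as content_scan; for blocked or deferred connections that function is never invoked, which is where the savings come from.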

Not all optimizations are without side effects, and this one wasn't. The former student email server, titan, got itself 'greylisted' due to spam quantities. Around 50% of the message traffic into Exchange from this system was ultimately blocked as Spam according to the old anti-spam appliances we had (we'd routed its mail through the 'outbound' queue on those appliances so it wouldn't be subject to reputation tests, but would still scan email). As part of the migration of student email to OutlookLive.Edu, we set up forwards from the old cc.wwu.edu addresses to the new addresses. The spam-checkers on titan were of poor enough quality that enough spam got through to cause OutlookLive to start grey-listing Titan, causing mail to really back up on it.

That's not the only thing. Certain mailers managed by departments other than ITS here at WWU have managed to get themselves greylisted or outright blacklisted on these proprietary reputation lists. The one common denominator we've found is that certain specific UNIXy mailers do not apply their anti-spam processes to mail that is subjected to a .forward. At least, not without specific config telling it to scan that traffic. So if a person on one of these mailers has a .forward sending all mail into Exchange, the full spam-filled feed heads to Exchange and the reputation of that mailer gets dinged.

Which is a long way of saying that, ahem:

In this era of IP reputation, outbound spam filtering is now just as required as inbound.

Really. Go do it. It'll help prevent blacklistings, and that sucks for anyone subjected to it.



Tuesday, June 30, 2009

Super users

Having been a 'super user' for most of my career, I do not have the same perspective other people do when it comes to interacting with corporate IT. Because of what I do, I see everything. That's part of my job, so that's what I see. I have to know it is there.

However, how each company handles their elevated privilege accounts varies. Some of it depends on what system you're working in, of course.

Take a Windows environment. I see three big ways to handle the elevated user problem:
  1. One Administrator account, used by all admins. Each admin has a normal user account, and logs in as Administrator for their adminly work.
    • Advantages: Only one elevated account to keep track of.
    • Disadvantages: Complete lack of auditing if there is more than one admin around. Also, unless said admin has two machines, or has a VM for adminly work, they're logged in as Administrator more often than they're logged in as themselves.
  2. One Administrator account, admins' user accounts are elevated to Administrator. Each admin's normal account is elevated. Administrator is relegated to a glorified utility account, useful for backups, other automation, or if you need to leave a server logged in for some reason.
    • Advantages: Audit trail. Changes are done in the name of the actual admin who performed the change.
    • Disadvantages: These users really need to be exempted from any Identity Management system. Since there are only going to be a few of them, this may not matter. Also, these users need to treat these passwords like the Administrator password.
  3. Each admin gets two accounts, normal and elevated. As with the above, Administrator is a glorified utility account. But each admin gets two accounts: a normal account for every day use (me.normal) and an elevated account (me.super) for functions that need that kind of access.
    • Advantages: Provides audit trail, and allows the admin's normal account to be subject to identity-management safely. Easy availability of 'normal' account allows faster troubleshooting of permissions issues (hard to check when you can see everything).
    • Disadvantages: Admin users are juggling two accounts again, with the same problems as option 1.
I personally haven't seen the third option in actual use anywhere, even though that's my favorite one. Unixy environments are a bit different. The ability to 'sudo' seems to be the key determiner of elevated access, with ultimate trust granted to those who learn the root password outright. Sudo is the preferred method of doing elevated functions due to its logging capability.

What other methods have you seen in use?



Monday, June 22, 2009

IPv6 and the PCI DSS standards

The Payment Card Industry Data Security Standard (PCI DSS) applies to a couple of servers we manage. In those standards is section 1.3.8. It reads:

Implement IP masquerading to prevent internal addresses from being translated and revealed on the Internet, using RFC 1918 address space. Use network address translation (NAT) technologies—for example, port address translation (PAT).

With the testing procedure listed as:

For the sample of firewall and router components, verify that NAT or other technology using RFC 1918 address space is used to restrict broadcast of IP addresses from the internal network to the Internet (IP masquerading).

Which is sound practice, really. But we're running into an issue here that may become more of an issue once IPv6 gets deployed more widely. We're a University that received its netblock back when they were still passing out Class B networks to the likes of us (140.160.0.0/16 in case you care). IPv4 address starvation is not something we experience. Because of this, NAT and IP-Masq have very little presence on our network.

We also believe in firewalls. Just because the address of my workstation is not in an RFC 1918 netblock, doesn't mean you can get uninvited packets to me. This is even more the case for the servers that handle credit-card data.

It is my belief that the intent of this particular standard-line is to prevent scouting of internal networks in the aid of directed penetration attempts. Another line that should probably be in this standard to support this, would be something similar to:
Implement separate DNS servers for public Internet and Internal usage, and prevent public Internet access to the internal DNS servers.
Because the DNS servers we use internally are the same ones listed in our Name Server records for the WWU.EDU domain, you can do a lot of recon of our internal networks from home. We don't allow zone transfers, of course, but enough googling around our various sites and reverse-IP-lookups will reveal the general structure of our network, such as which subnets contain most of our servers and which are behind the innermost firewalls.

This is a long way of saying that our IPv4 network functions a lot like the network envisioned when IPv6 was first ratified. Because of this, we're running into some problems with the PCI standards that IPv6 will probably run into as well.

Take the requirement to have the PCI-subject servers on an RFC1918 IP number. RFC1918 only applies to IPv4. IPv6's closest equivalent is RFC4193, so the standard will have to be modified to mandate that IPv6 addresses come from RFC4193 space. Until then, for strictest compliance no PCI servers can move to pure IPv6. Servers that have both IPv4 and v6 numbers on them are an interesting case, where the v4 number may be an RFC1918 number, but the v6 number is NOT private. To my knowledge, the standards are unclear on this topic.
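The dual-stack case is easy to test for mechanically. Here is a small Python sketch using the standard ipaddress module; the function names are mine, and the check is deliberately literal: RFC1918 for IPv4 means 10.0.0.0/8, 172.16.0.0/12, and 192.168.0.0/16, and RFC4193 for IPv6 means fc00::/7. The 2001:db8:: address in the usage line is from the documentation prefix and simply stands in for a globally routable v6 number.

```python
import ipaddress

RFC1918_NETWORKS = [
    ipaddress.ip_network(block)
    for block in ("10.0.0.0/8", "172.16.0.0/12", "192.168.0.0/16")
]
RFC4193_NETWORK = ipaddress.ip_network("fc00::/7")


def is_private_by_rfc(address: str) -> bool:
    """True if the address sits in RFC1918 (IPv4) or RFC4193 (IPv6) space."""
    ip = ipaddress.ip_address(address)
    if ip.version == 4:
        return any(ip in network for network in RFC1918_NETWORKS)
    return ip in RFC4193_NETWORK


def non_private_addresses(addresses):
    """Return the addresses on a host that fall outside both private ranges."""
    return [address for address in addresses if not is_private_by_rfc(address)]


# A dual-stacked host with a private v4 number and a public-style v6 number.
print(non_private_addresses(["10.20.30.40", "2001:db8::1"]))
```

A host like the one in the usage line would pass a v4-only reading of the standard while still exposing a non-private v6 address, which is exactly the gap described above.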

We had to create NAT gateways for our PCI servers, and create RFC1918 addresses for them just for PCI compliance. The NAT gateway is behind the innermost firewall. These are our only servers behind a NAT gateway of any kind.

In the beginning, IPv6 expressly did NOT have NAT; it was designed to get rid of NAT. However, in recent years the case for IPv6 NAT has been pressed, and there is movement to get something like that working. In my opinion, a lot of that push was to allow NAT to continue as an obscuration-gateway (or low-cost stateless 'firewall') between internal resources and external hostile actors. I strongly suspect that when the PCI DSS standard gets its IPv6 update, they will continue to mandate some form of IP Masquerade.



Thursday, June 18, 2009

Historical data-center

As I've mentioned several times here, our data-center was designed and built in the 1999-2000 time frame. Before then, Technical Services had offices in Bond Hall up on campus. The University decided to move certain departments that had zero to very little direct student contact off of campus as a space-saving measure. Technical Services was one of those departments. As were Administrative Computing Services, Human Resources, Telecom, and Purchasing.

At that time, all of our stuff was in the Bond Hall data-center and switching room. This predates me (December 2003), so I may be wrong on some of this stuff. That's a tiny area, and the opportunity to design a brand new data-center from scratch was a delightful one for those who were here to partake of it.

At the time, our standard server was, if I've got the history right, the HP LH3. Like this:
[Image: An HP LH3]
This beast is 7U's high. We were in the process of replacing them with HP ML530's, another 7U server, when the data-center move came, but I'm getting a bit ahead of myself. This means that the data-center was planned with 7U servers in mind. Not the 1-4U rack-dense servers that were very common at that time.

Because the 2U flat-panel monitor and keyboard drawers for rack-dense racks were so expensive, we decided to use plain old 15-17" CRTs and keyboard drawers in the racks themselves. These take up 14U. But that's not a problem!

A 42U rack can take 4x of those 7U servers, and one of the 14U monitor/keyboard combinations for a total of...42U! A perfect fit! The Sun side of the house had their own servers, but I don't know anything about those. With four servers per rack, we put in a Belkin 4-port PS-2 KVM switch (USB was still too new fangled in this era, our servers didn't really have USB ports in them yet) in each. As I said, a perfect fit.

And since we could plan our very own room, we planned for expansion! A big room. With lots of power overhead. And a generator. And a hot/cold aisle setup.

Unfortunately... the designers of the room decided to use a bottom-to-top venting strategy for the heat. With roof mounted rack fans.
[Image: Rack fans]

And... solid back doors.

[Image: Rack back doors]

We got away with this because we only HAD four servers per rack, and those servers were dual processor 1GHz servers. So we only had 8 cores running in the entire rack. This thermal environment worked just fine. I think each rack probably drew no more than 2kW, if that much.

If you know anything about data-center air-flow, you know where our problems showed up when we moved to rack-dense servers in 2004-8 (and a blade rack). We've managed to get some fully vented doors in there to help encourage a more front-to-back airflow. We've also put some air-dams on top of the racks to discourage over-the-top recirculation.

And picked up blanking panels. When we had 4 monster servers per rack we didn't need blanking panels. Now that we're mostly on 1U servers, we really need blanking panels. Plus a cunning use of plexi-glass to provide a clear blanking panel for the CRTs still in the racks.

And now, we have a major, major budget crunch. We had to fight to get the fully perforated doors, and that was back when we had money. Now we don't have money, and still need to improve things. We're not baking servers right now, but temperatures are such that we can't raise the temp in the data-center very much to save on cooling costs. Doing that will require spending some money, and that's very hard right now.

Happily, rack-dense servers and ESX have allowed us to consolidate down to a lot fewer racks, where we can concentrate our good cooling design. Those are hot racks, but at least they aren't baking themselves like they would with the original kit.



Monday, June 15, 2009

Fire protection done right

What kinds of things do you need to consider when deciding on fire protection for your data-center?

Check local fire-codes. Really. In 2003 I was involved in setting up a new data-center for my old job. My job was more moving the gear safely and setting it up in the new location, not wrangling with the architects and contractors who were building it.

Imagine my surprise when I found sprinkler heads in the data-center during my first walk through. I got about a quarter of the way to indignant outrage before my boss short-circuited me with logic. It seems that local fire code actually covers data-centers, and it mandates sprinklers. I was assured, assured I tell you, that they wouldn't go off unless the FM-200 system failed to snuff the fire. I was dubious, but the fire inspectors really did mean that.

Anyway, there are a series of things you need in a fire suppression system.

  • An Emergency Power Off function If there is a fire, the EPO will drop power to the room hard. Yes, that'll cause data damage, but so does fire. If the fire is electrical in nature, this may stop it. Also, if all the gear is de-powered, a water dump does less damage.
  • A sealed room You want sealed for correct HVAC anyway. You don't want to rely on building HVAC unless the building was designed with that room in mind in the first place. Also, this allows you to use...
  • A gas-based suppression system FM-200 is a popular choice for this. Unlike the halon systems of old, it isn't as environmentally evil and doesn't leave a mess behind. OldJob had a Halon dump in the 80's due to a burned bag of popcorn ("The $20,000 bag of popcorn"). It was... bad.
  • A water based backup suppression system If the FM-200 fails, you need to get the fire out. After the EPO has fired, and the FM-200 dumps, if there is still a fire then you need old fashioned water.
  • Water detection sensors in/on the floor If you have any water pipes overhead, you need water sensors in the floor. This is more of an asset-protection thing, but if you DO have sprinklers you need water sensors to detect leaks. Also good for detecting leaks in your HVAC chillers.
  • Call-out capabilities If the fire system trips, you want to notify both Facilities people, as well as data-center staff and management. Obviously, this system should NOT rely upon assets in the data-center that's on fire. This can be hard.

There may be more, but that's off the top of my head. The EPO can be a destructive option, so I don't know how wide-spread they are. But they make all kinds of sense in a room where a water dump is possible.

If you have to retrofit a pre-existing room, some of the above may not be possible. As a fire-inspector once told me, to extinguish a fire you need one of three things:

  • Remove the fuel
  • Remove the oxidizer
  • Cool the reaction below the combustion point

The system I lined out above does all three. The EPO removes fuel and can cool the reaction to below the combustion point. The FM-200 partially removes oxygen, but mostly cools the reaction below the combustion point. The water dump smothers the fire due to lack of oxygen, and also cools it. For a high-value asset like a data-center, you want at least two of these.

Because of this, I'd say that your top priority is to see if you can get a gas-based extinguishing system in place as it does far less damage than water does (even with an EPO on your power-distribution-unit or main breaker panel). A truly good system, no matter what the actual suppression technology, has a flexible notification system that allows more than just the facilities supervisor to be notified of the fire-suppression systems activating.

As for hand-held extinguishers, use Class C extinguishers. But be careful. Dry chemical style extinguishers blow a powder everywhere. And that powder is somewhat corrosive. In the typical high airflow data-center, a fired extinguisher's residue can get everywhere. If the powder gets inside server intakes, it can cause higher equipment failure rates for the next several years and the total cost may be more than the system that was on fire. We've had demonstrations of extinguishing fires at our workplace, and have seen how messy it can get. When you buy your extinguishers for in-center usage, use the gas-style Class C extinguishers.



Friday, June 12, 2009

Explaining LDAP.

The question was asked recently...

"How would you explain LDAP to a sysadmin who'd've heard of it, but not interacted with it before."

My first reaction illustrates my own biases rather well. "How could they NOT have heard of it before??" goes the rant in my head. Active Directory, choice of enterprise Windows deployments everywhere, includes both X500 and LDAP. Anyone doing unified authentication on Linux servers is using either LDAP or WinBind, which also uses LDAP. It seems that any PHP application doing authentication probably has an LDAP back end to it [1]. So it seems somewhat disingenuous to suppose a sysadmin who didn't know what LDAP was and could do.

But then, I remind myself, I've been playing with X500 directories since 1996 so LDAP was just another view on the same problem to me. Almost as easy as breathing. This proposed sysadmin probably has been working in a smaller shop. Perhaps with a bunch of Windows machines in a Workgroup, or a pack of Linux application servers that users don't generally log in to. It IS possible to be in IT and not run into LDAP. This is what makes this particular question an interesting challenge, since I've been doing it long enough that the definition no longer comes readily to mind. Unfortunately for the person I'm about to info-dump upon, I get wordy.

LDAP.... Lightweight Directory Access Protocol. It came into existence as a way to standardize TCP access to X500 directories. X500 is a model of directory service that things like Active Directory and Novell eDirectory/NDS implemented. Since X500 was designed in the 1980's it is, perhaps, unwarrantedly complex (think ISDN), and LDAP was a way to simplify some of that complexity. Hence the 'lightweight'.

LDAP, and X500 before it, are hierarchical in organization, but don't have to be. Objects in the database can be organized into a variety of containers, or just one big flat blob of objects. That's part of the flexibility of these systems. Container types vary between directory implementations, and can be an Organizational Unit (OU=), a DNS domain component (DC=), or even a Common Name (CN=), if not more. The name of an object is called the Distinguished Name (DN), and is composed of all the containers up to root. An example would be:

CN=Fred,OU=Users,DC=organization,DC=edu

This would be the object called Fred, in the 'Users' Organizational Unit, which is contained in the organization.edu domain.
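If it helps to see that DN pulled apart mechanically, here is a small Python sketch. It is deliberately naive: it assumes no escaped commas and no multi-valued components, both of which real DNs are allowed to have, and the function names are mine.

```python
def parse_dn(dn):
    """Split a DN into (attribute, value) pairs, object first, root last.

    Simplification: ignores escaped commas and multi-valued components.
    """
    pairs = []
    for component in dn.split(","):
        attribute, _, value = component.strip().partition("=")
        pairs.append((attribute.upper(), value))
    return pairs


def dns_domain(dn):
    """Join the DC= components back into a dotted DNS domain name."""
    return ".".join(value for attribute, value in parse_dn(dn) if attribute == "DC")
```

Feeding it the DN above gives the CN, the OU, and two DC components in that order, and dns_domain() reassembles the DC pieces into organization.edu.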

Each directory has a list of classes and attributes allowable in the directory, and this is called a Schema. Objects have to belong to at least one class, and can belong to many. Belonging to a class grants the object the ability to define specific attributes, some of which are mandatory similar to Primary Keys in database tables.

Fred is a member of the User class, which itself inherits from the Top class. The Top class is the class that all other classes inherit from, as it defines the bare minimum attributes needed to define an object in the database. The User class can then define additional attributes that are distinct to the class, such as "first name", "password", and "groupMembership".

The LDAP protocol additionally defines a syntax for searching the directory for information. The return format is also defined. Let's look at the case of an authentication service such as a web-page or Linux login.

A user types in "fred" at the Login prompt of an SSH login to a Linux server. The Linux server then asks for a password, which the user provides. The PAM back-end then queries the LDAP server for objects of class User named "fred", and gets one, located at CN=Fred,OU=Users,DC=organization,DC=edu. It then queries the LDAP server for objects of class Group that are named CN=LinuxServerAccess,OU=Servers,DC=Organization,DC=EDU, and pulls the membership attributes. It finds that Fred is in this group, and therefore allowed to log in to that server. It then makes a third connection to the LDAP server and attempts to authenticate as Fred, with the password Fred supplied at the SSH login. Since Fred did not fat finger his password, the LDAP server allows the authenticated login. The Linux server detects the successful login, and logs out of LDAP, finally permitting Fred to log in by way of SSH.
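The search, group check, and bind sequence can be sketched in Python without a real directory. Everything below is a toy: the DIRECTORY dictionary stands in for the LDAP server, the password is made up, and search() and bind() are hypothetical helpers rather than any real LDAP library's API. It only shows the order of operations.

```python
# In-memory stand-in for the LDAP server. All entries are illustrative.
FRED_DN = "CN=Fred,OU=Users,DC=organization,DC=edu"
DIRECTORY = {
    FRED_DN: {
        "objectClass": "user", "cn": "Fred", "password": "correct-horse",
    },
    "CN=LinuxServerAccess,OU=Servers,DC=organization,DC=edu": {
        "objectClass": "group", "cn": "LinuxServerAccess", "member": [FRED_DN],
    },
}


def search(object_class, cn):
    """Stand-in for a search with a filter like (&(objectClass=...)(cn=...))."""
    return [dn for dn, attrs in DIRECTORY.items()
            if attrs["objectClass"] == object_class
            and attrs["cn"].lower() == cn.lower()]


def bind(dn, password):
    """Stand-in for an authenticated bind as the given DN."""
    entry = DIRECTORY.get(dn)
    return entry is not None and entry.get("password") == password


def login(username, password, access_group="LinuxServerAccess"):
    users = search("user", username)           # 1. find the user object
    if len(users) != 1:
        return False
    user_dn = users[0]
    groups = search("group", access_group)     # 2. check group membership
    if not groups or user_dn not in DIRECTORY[groups[0]]["member"]:
        return False
    return bind(user_dn, password)             # 3. authenticate as the user
```

Note that the password is never compared by the login code itself. It is handed to bind(), which is how the real flow works too: the directory decides whether the credentials are good.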

As I said before, these databases can be organized any which way. Some are organized based on the Organizational Chart of the organization, not with all the users in one big pile like the above example. In that case, Fred's distinguished-name could be as long as:

CN=Fred,ou=stuhelpdesk,ou=DesktopSupport,ou=InfoTechSvcs,ou=AcadAffairs,dc=organization,dc=edu

How to organize the directory is up to the implementers, and is not included in the standards.

The higher performing LDAP systems, such as the systems that can scale to 500,000 objects or higher, tend to index their databases in much the same way that relational databases do. This greatly speeds up common searches. Searching for an object's location is often one of the fastest searches an LDAP system can perform. Because of this LDAP very frequently is the back end for authentication systems.

LDAP is in many ways a specialized identity database. If done right, on identical hardware it can easily outperform even relational-databases in returning results.

Any questions?
Yeah, I get wordy.

1: Yes, this is wrong. MySQL contains a lot, if not most, of these PHP-application logins on the greater internet. I said I had my biases, right?



Tuesday, June 09, 2009

Email delivery problems to Comcast.net

Yesterday we got some concerned mails from one of the groups who sends mail by way of one of our web-servers. It's a somewhat critical function they do, so we paid attention to it. It seems they were getting bounce-messages from comcast.net. The bounce said that the incoming IP address did not have a reverse lookup (PTR record) and they don't talk to people like that.

This was confusing. Because we really do have a PTR record for that particular mailer. And yet, getting bounces. So one of the Webdevs calls Comcast to ask politely what the heck, and the Comcast support person walks them through a series of steps to demonstrate what went wrong. According to them, or so implied the webdev who doesn't speak SMTP as well as we do, the problem was that 'wwu.edu' does not resolve to an IP address.

There are reasons we haven't done this, and they have to do with mail delivery. Certain stupid mailers will deliver to a resolvable host before searching MX records, and if "wwu.edu" is resolvable, it'll attempt delivery to THAT instead of where it should. The server that runs 'www.wwu.edu' is the one that we'd have to point 'wwu.edu' to, and it is not a mail host. Far from. This seemed to be a strange requirement of Comcast.

I cracked it earlier today. You see, if you take a look at the NameServer records for the "wwu.edu" domain you will find three records.

140.160.242.13
140.160.240.12
216.186.4.245

It's that last one that's the problem. For some reason, our offsite DNS didn't have that particular reverse-lookup domain replicated to it. So if Comcast used it for resolving the incoming IP, it would get 'UNKNOWN' and block the connection. If they picked one of the other two, it would resolve and delivery would continue. Tada! The Comcast error message really was true, we just didn't realize one of our DNS servers didn't have all the data it needed. Oops.
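The lesson generalizes: ask every authoritative name server the same PTR question, not just whichever one your resolver happens to pick. Here is a Python sketch of that check. The lookup callable is a hypothetical hook you would back with a real DNS query tool, fake_lookup only imitates the failure described above, and the 192.0.2.25 mailer address is a documentation placeholder, not our real one.

```python
import ipaddress

# The three name servers listed for the domain.
NAME_SERVERS = ["140.160.242.13", "140.160.240.12", "216.186.4.245"]


def ptr_name(address):
    """Build the in-addr.arpa (or ip6.arpa) name queried for a PTR record."""
    return ipaddress.ip_address(address).reverse_pointer


def servers_missing_ptr(address, lookup, servers=NAME_SERVERS):
    """Return the name servers that cannot answer the PTR query.

    lookup(server, name) is a hypothetical callable returning a hostname,
    or None when that server has no answer.
    """
    name = ptr_name(address)
    return [server for server in servers if lookup(server, name) is None]


def fake_lookup(server, name):
    """Imitates the outage: the offsite server lacks the reverse zone."""
    return None if server == "216.186.4.245" else "mailer.example.edu"
```

With the fake in place, servers_missing_ptr("192.0.2.25", fake_lookup) returns only the offsite server, which is the answer it took us a day and a call to Comcast to find.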



Thursday, June 04, 2009

Power update

Last night's shutdown went fine. We lost a single hard-drive somewhere, probably on the Solaris/Linux side since I know we didn't lose any on the Windows/NetWare side. The power guys found the problem, and were able to get the UPS up and running. We're running protected right now.

What was the problem? Well, I know what I was told yesterday, and I'm not able to translate that into anything intelligible. I'm not an electrical engineer. If I caught it right, and it is decidedly possible I didn't, the circuit breakers in the UPS cabinet were configured to trip for the wrong condition. An overly conservative condition. This was done when the UPS was installed back in 1999-2000, and we only just discovered it because we haven't had to take the UPS down since then.

We get to do a last set of maintenance in two weeks. This is where they move the breaker to a new electrical panel. This will be done with the UPS on bypass, and shouldn't interrupt the load. We'll be keeping an eagle eye, of course, but don't expect any problems.



Wednesday, June 03, 2009

When good power becomes bad power

A good thing is happening. We're replacing the generator backing up the datacenter with a unit large enough to run both HVAC units. When the room was built in the 1999/2000 timeframe, it was presumed that one would be enough to keep the room cool enough. That's true to a point, but it didn't take into account localized hot-spots due to very hot running servers like the ones in our ESX cluster. Testing we've done shows that the temps do fall out of tolerance between 30 to 45 minutes after running on only one HVAC unit. So, we're setting things up to run both HVAC units. Good! It'd be even better if we could get a newer UPS since this one was nearly EOL when we bought it. But that's something for another capital request.

Because we're replacing a generator, this means some unavoidable periods of time when the room is not fully protected and we'll be running on naked utility power. Like I said, this is unavoidable. Happily, utility power is pretty stable this time of year. We're having a bit of a hot-snap right now, so there is some concern about AC-related brown-outs but it isn't quite that hot yet. That's why the work is scheduled to be done during the cool part of the day.

Murphy did not agree with us. Yesterday, they spliced in the new generator transfer switch to the Bypass circuit of the UPS. This should have been a non-event since the main circuit was just fine and feeding load. Unfortunately for us, the monitor card on the UPS saw the Bypass circuit failing as a UTILITY FAIL event. What's more, it erroneously fired the ON_BATTERY event even though the UPS was not actually on battery. This started the shutdown timers on the servers with the UPS shutdown-service client on them. This is why things got Very Exciting around 8:57am yesterday, as these servers shut themselves down. On the plus side, things worked as they should. On the negative side, we were trusting a signal source that it turns out we shouldn't trust. Crap.

Then this morning. This morning they were splicing in the new transfer switch to the mains circuit, and during this time the UPS would be on Bypass leaving us on naked utility power. Once done, the new generator would be supporting the UPS. The next outage would be similar, putting the UPS on bypass, while they cut over to the new electrical panel downstairs.

Unfortunately for us, when the work was completed and we went through the UPS startup procedure, two things happened. First, we discovered that the Input breaker had tripped some time between when we shut the UPS down and opened the doors to start it back up. We (actually the WWU Facilities electricians, I was just shoulder surfing at this point) flipped the breaker to the On position, which gave the datacenter a transient power flicker on the order of 50ms-100ms, which didn't bring anything down. Second, when we got to the part of the startup procedure that says 'tell the UPS to turn on,' it failed with an error to the effect of, "incorrect phase rotation, startup aborted." This caused the electricians some great concern, and they went about validating their wiring.

Which tested out fine. The phases well and truly are wired in correctly. They have very high confidence in this. Which leaves something in the UPS being wonky. So they call the UPS vendor, who ends up sending a technician up from Seattle to look things over. He should be here any time now. Meanwhile, we've been on naked utility power since 7am this morning.

The electricians are very concerned about that Input breaker tripping. This is a 50KVA 3-phase UPS, and when those short out the arc they generate is more accurately described as an explosion. The breaker caught it, as it should, but it shows a highly energetic event was avoided. They do not have confidence that we can bring the UPS up without a blip in power to the main load. If not a full on surge if it fails the wrong way.

The decision was made to prepare for shutting the whole machine room down. This is not a decision made lightly, this is the week before finals so uptime is even more critical right now. This decision will have to be made by the Vice Provost or the President, and we haven't had word yet what they've decided. We hear they're considering the full shutdown to start at 1am. We're still planning for it.

This would mark the first time since the datacenter went production back in 2000 that we've had to gracefully shut the whole thing down. The closest we've come was last September when we had to shutdown the EVA3000 in order to upgrade it to an EVA6100, and all servers connected to it had to be shut down. We're guessing that it'll take 45 minutes to get everything down, and close to 90 minutes to bring it all up in the correct order. When the room is down is when the electricians will attempt to restart the UPS.

This is an all-hands thing, and we'll have to get in contact with the University parties that have servers in there so they can either shutdown for the night, or be here to shut down in person. We've designated a pair of admins to sleep through the event so they can be fresh for the morning disasters while the rest of us sleep in.

Of course, the powers-that-be may decide to risk another UPS restart with load. Who knows.

Once a decision has been made about what to do, I'm fairly certain an all-points email will go out if we decide the full shutdown is needed. This is why we get paid the big money.

EDIT: It is official. We're taking everything down starting at 1am tonight.



Wednesday, May 20, 2009

Doing some cleanup

Every once in a while it's nice to try and take out the trash. I did some fiddling around with some custom scripts and got a list of trustee assignments on the cluster and cross-checked them with our user groups to find groups that do not have any direct trustees assigned anywhere on the cluster. It was a sizeable list, about 30% of the groups in the groups context aren't there to manage file access (any more).

I then ran another one to give me a list of empty groups. There was some congruence between the two lists, but not as much as I thought. Unless the group was there for a documented reason (the "allows student workers to print to a departmental print-object" kind of thing), if it had no trustees, was empty, and otherwise seemed ignored, it got tossed.
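Boiled down, the selection rule is plain set arithmetic. A Python sketch, with the caveat that the inputs here are hypothetical: the real lists came out of my custom scripts against the cluster and the directory, and the group names in the usage note are invented.

```python
def cleanup_candidates(all_groups, groups_with_trustees, members, documented):
    """Groups that have no trustee assignments, no members, and no documented purpose.

    members maps group name -> list of member names (missing or empty means empty).
    """
    no_trustees = set(all_groups) - set(groups_with_trustees)
    empty = {group for group in all_groups if not members.get(group)}
    return sorted((no_trustees & empty) - set(documented))
```

For example, given the groups OldLab, PrintStudents, Finance and Ghost, where only Finance has trustees, only Finance and Ghost have members, and PrintStudents is documented, the only candidate returned is OldLab. The "otherwise seemed ignored" test stays a human judgment call.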

Some of these groups had comments in them about what they gave access to. I like that! It allowed me to delete the serious looking group that gave access to a directory on a server that's been dead for 6 years and has no modern equivalent.

Unfortunately, there are a whole bunch of groups that I can't determine if they're still good. That'll have to be done by the desktop folk who create most of the groups. While I have hopes, I do not have high hopes that it'll get done.



Monday, May 11, 2009

Rebuilds and nwadmin

Friday afternoon the Kala server, one of our three primary eDirectory replica servers, died. In an event I've never seen before, one hard drive of a mirrored pair failed in such a way that bad data got committed. This server had to be rebuilt.

Happily for me, this is a procedure I can do without having to look things up in the Novell KB. This is part of the reason the letters "CNE" follow my name. The procedure is pretty straight-forward and I've done it before.
  1. Remove dead server's objects from the tree
  2. Designate a new server as the Master for any replica this server was the master of (all of them, as it happened)
  3. Install server fresh
The details change somewhat over time, but that's the same workflow it has been since the NetWare 4 days. In my case I did hit the KB to see if there was a way to do step 2 in iMonitor. I couldn't find one, so I did it through DSREPAIR which works just fine.

As for the install... this server is an HP BL20P G3, which means I used the procedure I documented a while back (Novell, local copy). A few minor steps changed (the INSERT Linux I used then now correctly handles SmartArray cards), but otherwise that's what I did. Still works.

For a wonder, our SSL administrator still had the custom SSL certificate we created for this server three years ago. That saved me the step of creating a CSR and setting up all the Subject Alternative Names we needed.

And today I fired up NWADMIN for the first time in not nearly long enough to associate the SLP scope to this server, since it was one of our two DA's. I could probably have done the same thing in iManager with "Other" attributes, but... why risk not getting all the right attributes associated when I have a tool that has all the built-in rules. This is the one thing that I still have NWAdmin around for. SLP-on-NetWare management.



Wednesday, April 22, 2009

A new version of BIND

I saw on the SANS log today that the ISC is starting work on BIND10. A list of the new stuff can be found here. A couple of those items are very interesting to me. Specifically the Modularity and Clustering items.

Modularity:
...the selection of a variety of back-ends for data storage, be it the current in-memory database, a traditional SQL-based server, an embedded database engine or back-ends for specific applications such as a high performance, pre-compiled answer database.
Which makes me think of eDirectory backed DNS. Novell has had this for ages with NetWare, and from what I recall it was based on BIND. But... BIND8. BIND10 would formalize this in the linux base, which would further allow Novell to publish a more 'pure' eDir-integrated BIND.

Clustering:
run on multiple but related systems simultaneously, using a pluggable, open-source architecture to enable backbone communications between individual members of the cluster. These coordination services would enable a server farm to maintain consistency and coherence.
This is exactly what AD-integrated DNS and the DNS on NetWare have been doing for over 8 years now. Glad to see BIND catch up.

The big thing about using a database of some kind as the back-end for DNS is that you no longer have to create Secondary servers and muck about with Zone Transfers. For domains that change on a second by second basis, such as an AD DNS domain with dynamic updates enabled and thousands of computers during morning power-on, it is entirely possible for a BIND secondary-server to be missing many, many DNS updates. Microsoft has known about this issue, which is why they have their own directory-integrated DNS service.

This also shows just how creaky the NetWare DNS service really is. That's based on BIND8 code, which is now over 10 years old. Very creaky.

I'm looking forward to BIND10. It is a needed update that addresses DNS as it is done today, and would better enable BIND to handle large Active Directory domains.



Tuesday, April 21, 2009

Zen Asset Inventory

A while back we installed Zen Asset Inventory (but not Asset Management) since it came with our Novell bundle, and inventory is a nice thing to have. At the beginning of this quarter it started to crash while inventorying certain workstations. After sending the logs to Novell, it turned out to be crashing on a lot of workstations.

Novell said that the reason for the crashes was excessive duplicate workstations. ZAM is supposed to handle this, but it seems 2 years of quarterly lab reimaging has finally overwhelmed the de-dup process. The fix is fairly straight forward, but very labor intensive:
  1. Clean out the Zenworks database
  2. Force a WorkstationOID change on all workstations
The second took quite a while. Those steps are:
  1. Stop the Collection Client service
  2. Delete a specific registry key
  3. Start the Collection Client service
These three steps can be done by way of Powershell (or the 'pstools' suite of command-line utilities if you want to rock it old school). One at a time. As we have on the order of 3,700 workstations, this took a few days and I'm sure I missed some. I did get all of the lab machines, though. That's important.

Cleaning out the database proved to be more complicated than I thought. At first I thought I just had to delete all the workstations from the Manager tool. But that would be wrong. Actually looking at the database tables showed a LOT of data in a supposedly clean database.

The very first thing I tried was to remove all the workstations from the database by way of the manager, and restart inventory. The theory here is that this would eliminate all the duplicate entries, so we'd just start the clock ticking again until the imaging caught us out. Since I had modified our imaging procedures, this shouldn't happen again any way. Tada!

Only the inventory process started crashing. Crap.

The second thing I tried was to strobe through the Lab workstations with the WorkstationOID-reset script I worked up in PowerShell (this is not something I could have done without an Active Directory domain, by the way). These are the stations with the most images, and getting them reset should clear the problem. Couple that with a clearing of the database by way of the Manager, and we should be good!

Only the inventory process started crashing. It took a bit longer, but it still crashed pretty quickly.

Try number three... run the powershell script across the ENTIRE DOMAIN. This took close to four days. Empty the database via Manager again, restart.

It crashed. It took until the second day to crash, but it still crashed.

As I had reset the WorkstationOID on all domained machines (or at least a very large percentage of them), the remaining dups were probably in the non-domained labs I have no control over. So why the heck was I still getting duplication problems with a supposedly clean database? So I went into SQL Studio to look at the database tables themselves. The NC_Workstation table itself had over 15,000 workstations in it. Whaaa?

However, this would explain the duplication problems I'd been having! If it had been doing the de-dup processing on historical data that included a freighter full of duplicates already, it was going to crash. Riiiiight. So. How do I clean out the tables? Due to foreign key references and full tables elsewhere, I had to build a script that would purge leaf tables, then core tables. The leaf tables (things like NC_BIOS) could be Truncated, handy when a table contains over a million rows. Core tables (NC_Component) have to be deleted line-by-line, which for the 2.7 million row NC_Component table took close to 24 hours to fully delete and reindex.
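The leaf-then-core ordering is a dependency sort. A Python sketch of how such a purge order can be worked out follows. The table names are the real ones mentioned above, but the foreign-key relationships in REFERENCES are assumed for illustration, since I'm not reproducing the actual schema here.

```python
from graphlib import TopologicalSorter

# Referencing table -> tables it points at via foreign keys.
# These relationships are assumptions, not the real ZAM schema.
REFERENCES = {
    "NC_BIOS": {"NC_Component"},
    "NC_Component": {"NC_Workstation"},
    "NC_Workstation": set(),
}


def purge_order(references):
    """Order tables so each is purged only after every table that references it."""
    must_go_first = {table: set() for table in references}
    for child, parents in references.items():
        for parent in parents:
            # The referencing (leaf) table has to be emptied before its parent.
            must_go_first.setdefault(parent, set()).add(child)
    return list(TopologicalSorter(must_go_first).static_order())
```

With these assumed relationships, purge_order(REFERENCES) puts NC_BIOS first and NC_Workstation last. The sort only decides the order; whether a given table can be truncated or has to be deleted row by row is a separate question.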

With a squeaky clean database, and the large majority of WorkstationOID values reset enterprise wide, I have restarted the inventory process. The Zenworks database is growing at a great pace as the Component tables repopulate. This morning we have 3,750 workstations and growing. We inventoried close to 3,300 stations yesterday and didn't get a single inventory crash. This MAY have fixed it!

I'm keeping these SQL scripts for later use if I need 'em.

The key learning here? Removing the workstations from the Manager doesn't actually purge the workstation from the database itself.



Thursday, April 16, 2009

A Mac botnet?

Ars Technica has an article up about a detected botnet based on Mac OSX machines. This is interesting stuff since you don't SEE this kind of thing all that often. OSX is the #2 operating system after Windows, but it is a distant #2. Also interestingly, the infection vector appears to be pirated software, a vector that brings a tear of nostalgia to my eye for its sheer antiquity. Clearly this would be a slow growing botnet, but that's OK since a large percentage of Mac users don't bother with AV software since they're not running Windows and "don't need it".

What would be more impressive would be a drive-by downloader ala IE, but with Safari instead. I don't remember hearing any press about anything other than proof-of-concept with that, though.



Wednesday, April 15, 2009

Windows 7 forces major change

I've said before that you'll have to pry the login-script out of our cold dead hands. The simple Novell login-script is the single most pervasive workstation management tool we have, since EVERYONE needs the Novell Client to talk to their file servers. It's one reason we have computer labs when others are paring down or getting rid of theirs. People can live without the Zen agents if they work at it, but they can't live without the Novell Client. Therefore, we do a lot of our workstation management through the login-script.

The Vista client has been vexing in this regard since it is so painfully slow in our clustered environment. The reason it is slow is the same reason the first WinXP clients were slow: the Microsoft and Novell name-resolution processes compete in bad ways. As each drive letter we map is its own virtual-server, every time you attempt to display a Save/Open box or open Windows Explorer it has to resolve-timeout-resolve each and every drive letter. This means that opening a Save/Open box on a Vista machine running the Novell client can take upwards of 5 minutes to display thanks to the timeouts. Novell knows about this issue, and has reported it to Microsoft. This is something Microsoft has to fix, and they haven't yet.

This is vexing enough that certain highly influential managers want to make sure that the same thing doesn't happen again for Windows 7. As anyone who follows any piece of the tech media knows, Windows 7 has been deemed, "Vista done right," and we expect a lot faster uptake of Win7 than WinVista. So we need to make sure our network can accommodate that on release-day. Make it so, said the highly placed manager. Yessir, we said.

So last night I turned CIFS on for all the file services on the cluster. It was that or migrate our entire file-serving function to Windows. The choice, as you can expect, was an easy one.

This morning our Mac users have been decidedly gleeful, as CIFS has long password support where AFP didn't. The one sysadmin here in techservices running Vista as his primary desktop has uninstalled the Novell Client and is also cheerful. Happily for us, the directive from said highly placed manager was accompanied by a strong suggestion to all departments that domaining PCs into the AD domain would be a Really Good Idea. This allows us to use the AD login-script, as well as group-policies, for those Windows machines that lack a Novell Client.

Ultimately, I expect the Novell Client to slowly fade away as a mandatory install. So that clientless-future I said we couldn't take part in? Microsoft managed to push us there.



Friday, March 27, 2009

Computer labs in a ubiquitous computing world

Ars Technica has an article up called, When every student has a laptop, why run computer labs?

It's a good question. But before I go into it, I should mention something. What I do for WWU doesn't have a lot to do with our labs. The biggest interaction I have with them is for printing and maybe some Zen or GPO policies. I also know some of the people who support them, and I sit in meetings where other people gripe about them. So I'm speaking as someone who works around people who deal with them, not as someone who deals with them or has any decision making power.

Why run computer labs?

In the beginning it was to provide computers to students who didn't have one.
Then, it was to provide on-campus computers to students who didn't have a laptop.

Now that almost every student has a computer, and most of those are laptops, it makes less sense. Centralized printers where they can print off assignments from their own hardware? Yes. 60 seat general computing labs? Um.

The point is made in the Ars Technica article that specialized software that students generally wouldn't have, such as SPSS or the full Adobe Acrobat suite, is a good reason to have them. This is true. We have not only the general computing labs run by ATUS, but we also have special purpose labs run by ATUS and the various colleges. We now have a lab that has a large format printer, something I guarantee no student has in their dorm or apartment, and a flat-bed scanner. One non-ATUS lab has VMWare Workstation installed on all the workstations. Some of the general computing labs are actual classrooms some of the time.

In our specific case, we have one software package in universal use that greatly encourages the existence of the general computing lab.

The Novell Client.

In order to get drive-map access to the NetWare cluster, you need that. This is not a package you want to inflict on a home machine without the victim knowing what they're in for. So we need to provide computers with the client installed so students can get at their files simply. WebDav through NetStorage goes some of the way, but it can be tricky to set up.

If we were a pure Windows network, it wouldn't be so bad. Both OSX and all the major Linuxes come with Samba pre-installed, which eases access to Windows networks. Printing isn't quite as convenient, but at least you can get at your files easy enough once you're inside the firewall.

In the end, except for our NCP dependencies, we could possibly close some of our GC labs to save money. However, we do track lab utilization, and those numbers may tell a different story. I know some students don't bother hauling their laptop to campus so long as they can use a lab machine for a quick social-networking fix. If we start closing labs those students will start hauling their gear to campus and we can save money. I still think we need to provide general access printers at various spots, which is something that Novell iPrint is rather good for. We also need to provide access to the special software packages that are needed for teaching, things like SPSS and MatLab.

The role of the computer lab has changed now that all but a few students have laptops. We still need them for specialized teaching functions, but general access to computing is no longer a primary function. The convenience factor of simple internet access drives some usage, and it may even be a majority. But the labs aren't going away any time soon. Their printers, even less so.



Tuesday, March 17, 2009

Stimulus fairy wish list

There is a list on my whiteboard. It is the wish list of infrastructure projects I'd really like to see if the stimulus fairy decides to pay WWU a visit. We're higher-ed. And the stimulus bill includes funding for higher-ed. It could happen! Heck, I have two coworkers who are in the Seattle area right now listening to a demo just in case aforementioned fairy does drop by.

Anyway, the wish-list (or Santa! Give me hardware!)
  • 2 more enclosures for the existing EVA6100
  • A new EVA4400 with all eight enclosures
  • HP EVA replication software, so we can mirror the 6100 to the new 4400.
  • Data Protector licensing for everything we need
  • A new tape library, the one we have is creaky
  • New 64-bit servers for the main file-serving cluster
These would make me a happy, happy geek. Said coworkers are working on something else above and beyond these. But I'm not saying what that is. If the fairy does drop by, I will.

However, and there is always one, there is a problem with air-lifting big wads of cash into an IT environment and then spending it all. When it comes time to replace the existing EVA6100, we will have to pay for something equivalent. Since it would have had the EVA replication software on it, it is now about twice as expensive as a simple hardware replacement would suggest. The replication software would quickly become line-of-business with significant future expenses deriving from its purchase.

Maintenance has to be factored in to anything we spend fairy-money on. There is a certain amount of money we can spend on catching up our IT deferred maintenance backlog, like that creaky tape library I just mentioned, but that won't come close to the fairy-money numbers being bandied about. As hard as it is to leave big piles of cash lying by the side of the road, there is some money that it is safer not to touch. Such as anything with a yearly maintenance fee. Or version upgrade fees for the upgrade we'll need to do in 3-6 years.

Those organizations who live on grant-money know this very well. However, here at ITS at WWU, other than Student Tech Fee funds we don't live on grant-money. The stimulus-fairy counts as grant money that could leave behind a future liability that STF can't come close to being able to cover.

We'll see what happens.



Thursday, March 05, 2009

Anatomy of an adware install

A bit of analysis I had to do in the past couple days. I'm sharing because I don't do this all that often. I'm pretty handy with wireshark, so I got asked to interpret a capture of an infection process.

The sequence of events, as near as I can figure:
  • User runs the bad file
  • postcard.exe checks http://whatismyip.com/autmation/n09230945.asp to get the local IP address
  • File throws them at Hallmark.com displaying the ecard. Awww.
  • Hallmark throws the user some advertising from a bunch of places.
  • 5 minutes pass where nothing happens
  • Postcard.exe does an HTTP POST to 85.12.43.102 (Netherlands) with encrypted data
  • 85.12.43.102 replies with a bunch of encrypted data. Presumably, this is the command file.
  • Postcard.exe opens three connections to 82.98.235.205 (Belgium), getting a trio of windows files of some kind. I think they're DLL files that complement postcard.exe. That or the chopped up pieces of javawm.exe.
  • Postcard.exe does an HTTP POST to 85.17.169.56 (Netherlands) with a bunch of HTTP headers populated with crypted data.
  • 85.17.169.56 replies with an HTTP 200/OK, a bunch of HTTP headers that contain redirection servers, stats servers, and other information useful for adware, as well as a 143KB file of some kind.
  • Infected computer connects to 83.149.75.33 (Netherlands) and does an HTTP GET with a series of parameters. This is probably a status message of some kind. Remote side returns 404-not-found.
  • 5 minutes pass where nothing happens on the network, but the local machine falls deeper into the clutches of the adware czars.
  • Someone launches IE, and it goes to http://runonce.msn.com/, the default XP home page. Probably just to see what happens.
  • HTTP connection to Key Bank, redirected to https://www.key.com/, where I can't see squat. SSL doing its job.
  • Parallel to the KeyBank connection, an SSL connection to 216.236.233.68, an IP hosted in Denmark. This resolves to "key.tcliveus.com", which is very probably legitimate traffic directed by www.key.com.
  • Connection to 83.149.115.156 (Netherlands), almost definitely the adware. Phoning in that IE went to http://runonce.msn.com/. The reply directs the client to connect to 82.98.235.58. Meanwhile, keybank session continues.
  • SSL Connection to 66.235.132.62, a host in the 2o7.net advertising network. Very probably legitimate from Key Bank.
  • HTTP connection to 82.98.235.58 (Netherlands), as directed. Supplies URL given to it by 83.149.115.156. Server returns the URL http://privacyscanner15.com/sysgd09_2/3/10232 (don't go there). Meanwhile, Keybank session continues.
  • HTTP connection to 209.249.222.48, which is privacyscanner15.com, with the supplied URL.
  • Key Bank session finishes cleanly.
  • HTTP connections to privacyscanner15.com, clearly rendering the page, pulling graphics and the evil javascripts.
  • Key Bank session resumes. SSLed, so I have no idea what's going on.
  • HTTP connection to 83.149.75.33, but I can't tell what it does because…
  • End of capture.
Ripping into the javascript with a very, very handy Firefox plugin called "JavaScript Deobfuscator", I hit the page from my Linux machine to see what those scripts did. If you click "yes", it forces the download of an executable file that contains a Trojan. I haven’t unpacked it to see what it does.

This is pretty clearly the trace of an adware installer. However, the adware points the user to a site where they'll get further infected first thing. Depending on how gullible the user is, they may or may not fall for it.

All the Netherlands addresses come from the same netblock owner, a place called “LeaseWeb”.
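A quick way to eyeball how the addresses in the capture cluster is to bucket them by prefix. The sketch below groups the IPs listed above by /16. The /16 boundary is an assumption chosen purely for illustration; real netblock ownership comes from a whois lookup, which is how LeaseWeb turned up.

```python
import ipaddress
from collections import defaultdict

# Addresses and countries exactly as noted in the capture walkthrough above.
OBSERVED = {
    "85.12.43.102": "Netherlands",
    "82.98.235.205": "Belgium",
    "85.17.169.56": "Netherlands",
    "83.149.75.33": "Netherlands",
    "83.149.115.156": "Netherlands",
    "82.98.235.58": "Netherlands",
}


def group_by_prefix(addresses, prefix_len=16):
    """Group IPv4 addresses by their enclosing network of the given length."""
    groups = defaultdict(list)
    for ip in addresses:
        # strict=False lets us pass a host address and get its network back
        network = ipaddress.ip_network(f"{ip}/{prefix_len}", strict=False)
        groups[str(network)].append(ip)
    return dict(groups)


if __name__ == "__main__":
    for network, members in group_by_prefix(OBSERVED).items():
        print(network, members)
```

This is only a first pass for spotting neighbours in a trace. It says nothing about who actually owns the space.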



Monday, February 23, 2009

The Internet SAFETY Act

I'm sure this has made the rounds, but I've been out sick for the past week and thus not as caught up on my tech media as I normally would be. But a bill has been introduced to the US Congress that would:

SEC. 5. RETENTION OF RECORDS BY ELECTRONIC COMMUNICATION SERVICE PROVIDERS.

    Section 2703 of title 18, United States Code, is amended by adding at the end the following:
    `(h) Retention of Certain Records and Information- A provider of an electronic communication service or remote computing service shall retain for a period of at least two years all records or other information pertaining to the identity of a user of a temporarily assigned network address the service assigns to that user.'.
    At minimum this means keeping DHCP records for 2 years. What's a bit more unclear is whether or not just IP address is sufficient to meet the standard of, 'identity of a user'. I don't think it is, though the courts will have to clarify this. This tells me that we'd have to retain records associating IP address with authenticated user.
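To make the "IP address plus authenticated user" idea concrete, here is a minimal sketch of the kind of record we would have to keep and query. Everything in it is hypothetical: the Lease fields, the 2 x 365 day reading of "two years", and the sample address (taken from the 192.0.2.0/24 documentation range). It is not how our DHCP servers actually log anything.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List, Optional

RETENTION = timedelta(days=2 * 365)  # assumed reading of "at least two years"


@dataclass(frozen=True)
class Lease:
    ip: str
    mac: str
    user: Optional[str]  # None when the device never authenticated
    start: datetime
    end: datetime


def who_had(leases: List[Lease], ip: str, when: datetime) -> Optional[Lease]:
    """Return the lease that covered this IP at this moment, if we kept one."""
    for lease in leases:
        if lease.ip == ip and lease.start <= when < lease.end:
            return lease
    return None


def prune(leases: List[Lease], now: datetime) -> List[Lease]:
    """Drop only the leases that ended longer ago than the retention window."""
    return [lease for lease in leases if now - lease.end <= RETENTION]


SAMPLE = [
    Lease("192.0.2.10", "00:11:22:33:44:55", "alice",
          datetime(2009, 2, 1, 8), datetime(2009, 2, 1, 16)),
    Lease("192.0.2.10", "66:77:88:99:aa:bb", None,
          datetime(2009, 2, 1, 16), datetime(2009, 2, 2, 0)),
]
```

The second sample lease is the awkward case: the address was handed out, but no login was ever tied to it, so a DHCP log alone would identify a MAC address and nothing more.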

    For commercial ISPs this is an easier bar to pass, as you need a username and password or some such equivalent to get on their networks in the first place and be provisioned with an address. For entities like us who are sort-of ISPs for our students, and have very permissive usage policies for our faculty (sex-researchers have a legitimate business need to search for, you know, sex), it's a bit less cut and dried. What isn't yet clear, but is getting a lot of internet buzz, is whether or not home users fall under this requirement as well.

    Bills such as these make a fundamentally false assumption about the internet:
    The end points always require authentication prior to usage.
    So long as vendor-neutrality holds, anyone who can get on the network at all can pass traffic over it. The Internet's protocols have no header value for signifying whether the originating node is an authenticated access or anonymous access, they just don't care. Authentication is optional on the Internet, not mandatory.

    This bill would indirectly require mandatory authentication for network access. Yes, this is a trend in the business world these days (google term: NAC), but there are whole classes of network users out there that aren't even looking into this. The locally owned independent coffee shop, with the commercial DSL line and free WiFi, the Hotel with 200 guests sharing the same business Comcast line, these are the sorts of 'anonymous' network access where NAC solutions aren't likely to ever be in place.

    Ultimately, by the time I'm 50 I expect the Internet to have converted to a mandatory-auth scheme for access. However, we're not there yet, not even close. This bill needs to be fought.



    Wednesday, February 11, 2009

    High availability

    64-bit OES provides some options for highly available file serving. Now that we've split the non-file services out of the main 6-node cluster, all that cluster is doing is NCP and some trivial other things. What kinds of things could we do with this should we get a pile of money to do whatever we want?

    Disclaimer: Due to the budget crisis, it is very possible we will not be able to replace the cluster nodes when they turn 5 years old. It may be easier to justify eating the greatly increased support expenses. Won't know until we try and replace them. This is a pure fantasy exercise as a result.

    The stats of the 6-node cluster are impressive:
    • 12 P4 cores, with an average of 3GHz per core (36GHz).
    • A total of 24GB of RAM
    • About 7TB of active data
    The interesting thing is that you can get a similar server these days:
    • HP ProLiant DL580 (4 CPU sockets)
    • 4x Quad Core Xeon E7330 Processors (2.40GHz per core, 38.4GHz total)
    • 24 GB of RAM
    • The usual trimmings
    • Total cost: No more than $16,547 for us
    With OES2 running in 64-bit mode, this monolithic server could handle what six 32-bit nodes are handling right now. The above is just a server that matches the stats of the existing cluster. If I were to really replace the 6 node cluster with a single device I would make a few changes to the above. Such as moving to 32GB of RAM at minimum, and using a 2-socket server instead of a 4-socket server; 8 cores should be plenty for a pure file-server this big.
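The "total GHz" figures in the two lists above are just cores multiplied by clock speed. A small sketch of that arithmetic follows; the function name is mine. Summed clock speed is a crude yardstick, since it ignores how much work each core gets done per clock.

```python
def aggregate_ghz(cores, ghz_per_core):
    """Naive aggregate clock: cores times per-core clock speed."""
    return cores * ghz_per_core


# Existing cluster: 12 P4 cores averaging 3GHz each.
cluster_ghz = aggregate_ghz(12, 3.0)

# DL580 as quoted: 4 sockets x 4 cores at 2.40GHz each.
single_box_ghz = aggregate_ghz(4 * 4, 2.40)

if __name__ == "__main__":
    print(f"cluster:    {cluster_ghz:.1f} GHz")
    print(f"single box: {single_box_ghz:.1f} GHz")
```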

    A single server does have a few things to recommend it. By doing away with the virtual servers, all of the NCP volumes would be hosted on the same server. Right now each virtual-server/volume pair causes a new connection to each cluster node. Right now if I fail all the volumes to the same cluster node, that cluster node will legitimately have on the order of 15,000 concurrent connections. If I were to move all the volumes to a single server itself, the concurrent connection count would drop to only ~2500.

    Doing that would also make one of the chief annoyances of the Vista Client for Novell much less annoying. Due to name cache expiration, if you don't look at Windows Explorer or that file dialog in the Vista client once every 10 minutes, it'll take a freaking-long time to open that window when you do. This is because the Vista client has to enumerate/resolve the addresses of each mapped drive. Because of our cluster, each user gets no less than 6 drive mappings to 6 different virtual servers. Since it takes Vista 30-60 seconds per NCP mapping to figure out the address (it has to try Windows resolution methods before going to Novell resolution methods, and unlike WinXP there is no way to reverse that order), this means a 3-5 minute pause before Windows Explorer opens.

    By putting all of our volumes on the same server, it'd only pause 30-60 seconds. Still not great, but far better.
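The connection counts and the Explorer pause both fall out of simple multiplication, sketched below. The function names are mine, and the inputs are the rough figures quoted above (about 2,500 users, 6 drive mappings, 30-60 seconds per mapping). Straight multiplication puts the six-mapping pause at 3 to 6 minutes, which is the same ballpark as the 3-5 minutes we actually see.

```python
def concurrent_connections(users, mappings_per_user):
    """Each mapped virtual server costs one connection per user."""
    return users * mappings_per_user


def pause_range_seconds(mappings, per_mapping=(30, 60)):
    """Best and worst case wait while Vista resolves every mapped drive."""
    low, high = per_mapping
    return mappings * low, mappings * high


if __name__ == "__main__":
    print(concurrent_connections(2500, 6))  # everything failed over to one node
    print(concurrent_connections(2500, 1))  # one real server, no virtual servers
    print(pause_range_seconds(6))           # today: six virtual servers
    print(pause_range_seconds(1))           # single-server case
```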

    However, putting everything on a single server is not what you call "highly available". OES2 is a lot more stable now, but it still isn't to the legendary stability of NetWare 3. Heck, NetWare 6.5 isn't at that legendary stability either. Rebooting for patches takes everything down for minutes at a time. Not viable.

    With a server this beefy it is quite doable to do a cluster-in-a-box by way of Xen. Lay a base of SLES10-Sp2 on it, run the Xen kernel, and create four VMs for NCS cluster nodes. Give each 64-bit VM 7.75GB of RAM for file-caching, and bam! Cluster-in-a-box, and highly available.
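The 7.75GB figure makes sense once you budget the box. This sketch assumes the server carries the 32GB mentioned earlier as the minimum I'd want. That total is my assumption for the example, not a quote.

```python
TOTAL_RAM_GB = 32.0   # assumed physical RAM in the host
VM_COUNT = 4          # NCS cluster nodes running as Xen guests
RAM_PER_VM_GB = 7.75  # per-VM allocation from the paragraph above


def host_headroom_gb(total, vm_count, per_vm):
    """RAM left for the Xen host after every guest gets its share."""
    return total - vm_count * per_vm


if __name__ == "__main__":
    print(host_headroom_gb(TOTAL_RAM_GB, VM_COUNT, RAM_PER_VM_GB))
```

Under that assumption, four guests take 31GB and leave 1GB for the host itself.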

    However, this is a pure fantasy solution, so chances are real good that if we had the money we would use VMWare ESX instead of Xen for the VMs. The advantage to that is that we don't have to keep the VM/Host kernel versions in lock-step, which reduces downtime. There would be some performance degradation, and clock skew would be a problem, but at least uptime would be good; no need to perform a CLUSTER DOWN when updating kernels.

    Best case, we'd have two physical boxes so we can patch the VM host without having to take every VM down.

    But I still find it quite interesting that I could theoretically buy a single server with the same horsepower as the six servers driving our cluster right now.



    Monday, January 26, 2009

    Software install and maintenance contracts

    In this modern era, it is becoming more and more common for vendors in the Windows world to either require, or strongly suggest, that the vendor perform a software install on a server. In the past this required either sending a physical body out to the location, or using something like PC-Anywhere to do the install. Now, a wide variety of web-based remote-control packages are on the market that greatly simplify getting the knowledgeable install-geek onto the server in question.

    More and more often, vendors are offering maintenance and update contracts contingent on console access. While these greatly simplify maintaining a package for small offices who don't have the IT oomph to really do it themselves, these are a great pain for those of us who manage the servers themselves. What's really bad is when web software (typically IIS and .NET based) is subject to these sorts of contracts.

    We have a small number of IIS-based web-servers that are shared with a variety of departments. ATUS and ADMCS are the biggest consumers, of course, but other departments have their own stuff on there. This also includes several 3rd party apps we've put in over the years. These servers have a lot on them.

    What happens when we get more than one software package with this sort of contract attempting to run on these IIS servers? It means that, at least in theory, multiple vendors have nearly unrestricted access to these web-servers. As these servers are general-purpose servers and not dedicated to this one application, this is a pretty major data-security issue.

    This isn't quite as big a problem when the application is a more traditional client/server app, or the app resides on its own dedicated server. We don't like that kind of app-server running with AD credentials on console, but we can work around that. Web-servers, though, run a lot of apps.

    In the UNIX world, I have heard of vendors requesting the ability to SSH into a server in order to do installs. The context of this was humor, as in, "look at the stupid vendor." In general, if a vendor asks for local root to a server for a simple install, the answer will be a resounding no way. So what makes the Windows world different, that they'll permit a third party root-access to their own servers? Perhaps because it takes a lot less skill to do Windows administration at least half-way right, so vendors have to compensate for less comprehensively trained system administrators. Unfortunately, it makes them less nimble when they do run into shops with strict controls on what runs on the web servers, and who is allowed console access to them.



    Wednesday, January 14, 2009

    NTP on NetWare

    A while back I did some work setting up an ntp peer-group on a pair of SLES servers (SLES9 and SLES10). That worked pretty good, and I managed to get autokey security working, which I thought was nifty. Then my thoughts turned to the OES environment.

    If/when we get off of NetWare and move to the Linux kernel, NTP becomes the only way to do timesync. So I figured I'd see how amenable NetWare's xntpd was to secure configuration. Turns out it can do it, but there are some caveats.

    First of all, it seems that the NTP for NetWare is based on NTPv3, not NTPv4, which means it doesn't support autokey and only supports symmetric keys. This also means that some other items on the ntp.conf file on the sles servers couldn't be carried over.

    As it happens, the following sys:/etc/ntp.conf file works pretty well:
    server ntpserver1
    server ntpserver2 minpoll 6 maxpoll 13
    peer ntppeer1 key 1

    enable auth monitor
    keys sys:\etc\ntp.keys
    trustedkey 1
    requestkey 1

    restrict default ignore
    restrict 140.160.0.0 mask 255.255.0.0 nomodify nopeer
    restrict 127.0.0.1
    restrict [ip of ntpserver1]
    restrict [ip of ntpserver2]
    restrict [ip of ntppeer1]

    Populating the ntp.keys file couldn't be done from NetWare directly, I had to do that on a SLES server and copy it over. But once I did that, the ntppeer1 server and the NetWare server correctly authenticated to each other.

    Interestingly, when I pointed an NTPv4 linux machine at the NetWare NTP setup I got complaints on the NetWare server about the incoming timehost not having the correct key and not being able to sync time. This is interesting because this linux machine was NOT one of the specified time-hosts. When I put in the 'restrict' line above with the 'nopeer' flag on it, those messages stopped.
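How the 'restrict' lines decide what an incoming host is allowed to do can be sketched as a most-specific-match lookup. This is my understanding of the matching behaviour written as plain Python, not anything lifted from the NetWare xntpd code. The masks from the config above are written as prefix lengths, and the per-server restrict lines are left out because their addresses are placeholders in the config.

```python
import ipaddress

# (network, flags) pairs taken from the ntp.conf above.
RESTRICT_RULES = [
    ("0.0.0.0/0", {"ignore"}),                   # restrict default ignore
    ("140.160.0.0/16", {"nomodify", "nopeer"}),  # mask 255.255.0.0
    ("127.0.0.1/32", set()),                     # no flags: unrestricted
]


def flags_for(source_ip, rules=RESTRICT_RULES):
    """Return the flags of the most specific rule that matches source_ip."""
    addr = ipaddress.ip_address(source_ip)
    matches = []
    for net, flags in rules:
        network = ipaddress.ip_network(net)
        if addr in network:
            matches.append((network.prefixlen, flags))
    # The default rule (0.0.0.0/0) always matches, so matches is never empty.
    _, flags = max(matches, key=lambda m: m[0])
    return flags


if __name__ == "__main__":
    print(flags_for("140.160.1.2"))  # an on-campus host
    print(flags_for("192.0.2.1"))    # anything else falls through to the default
    print(flags_for("127.0.0.1"))
```

Read this way, the stray Linux box on campus matched the 140.160.0.0 line, so it picked up 'nopeer' and stopped generating the key complaints.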

    The above configuration was successful in enabling a peer relationship between the two timehosts. This is loosely analogous to a PRIMARY group in traditional NetWare TIMESYNC setup. Should one or both of these hosts lose connection to the non-WWU time-servers (which are in essence equivalent to REFERENCE servers in Timesync, but unlike Timesync you can have more than one), they can negotiate time between themselves. This is important, as it prevents them from going out of sync, which would have dire consequences if allowed to happen more than a few minutes.



    Tuesday, January 06, 2009

    DataProtector 6.00 vs 6.10

    A new version of HP DataProtector is out. One of the nicest new features is that they've greatly optimized the object/session copy speeds.

    No matter what you do for a copy, DataProtector will have to read all of one Disk Media (50GB by default) to do the copy. So if you multiplex 6 backups into one Disk Writer device, it'll have to look through the entire media for the slices it needs. If you're doing a session copy, it'll copy the whole session. But object copies have to be demuxed.

    DP6.00 did not handle this well. Consistently, each Data Reader device consumed 100% of one CPU for a speed of about 300 MB/Minute. This blows serious chunks, and is completely unworkable for any data-migration policy framework that takes the initial backup to disk, then spools the backup to tape during daytime hours.

    DP6.10 does this a lot better. CPU usage is a lot lower, it no longer pegs one CPU at 100%. Also, network speeds vary between 10-40% of GigE speeds (750 to 3000 MB/Minute), which is vastly more reasonable. DP6.10, unlike DP6.00, can actually be used for data migration policies.
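The MB/Minute figures line up with GigE like this. The sketch assumes a nominal 1000 Mbit/s line rate with decimal megabytes and no protocol overhead, which is the convention the 750 to 3000 numbers imply. The 50GB media size is the default mentioned above.

```python
GIGE_MBIT_PER_SEC = 1000                 # nominal line rate, overhead ignored
GIGE_MB_PER_MIN = GIGE_MBIT_PER_SEC / 8 * 60


def mb_per_min_at(percent_of_gige):
    """Throughput in MB/Minute at a given percentage of GigE line rate."""
    return GIGE_MB_PER_MIN * percent_of_gige / 100


def minutes_to_read(media_gb, mb_per_min):
    """Minutes to read one disk media end to end (decimal GB)."""
    return media_gb * 1000 / mb_per_min


if __name__ == "__main__":
    for label, rate in [("DP6.00", 300.0),
                        ("DP6.10 low", mb_per_min_at(10)),
                        ("DP6.10 high", mb_per_min_at(40))]:
        print(f"{label}: {rate:.0f} MB/min, "
              f"{minutes_to_read(50, rate):.0f} min per 50GB media")
```

At 300 MB/Minute a single 50GB media takes the better part of three hours to read, which is why DP6.00 was unworkable for spooling to tape during the day.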



    Monday, December 08, 2008

    When you can't trust tcpdump

    I just spent a good chunk of today bothering the telecom people to try and figure out why one server of mine couldn't talk to any off campus NTP servers. I had two servers, one was talking just fine, the other wasn't. For proof, I had packet traces showing that the non-working server was not getting an appropriate "ntp server" return packet.

    And yet, after the telecom people sniffed the border firewall connections they saw UDP/123 packets on both sides. In other words, it transited the firewall just peachy. And also our IDS/IPS. So, clearly it was getting in. But it wasn't showing up on the tcpdump output on the server.

    Then about 15 minutes ago I turned off the "restrict default ignore" line on that server.

    And it started syncing off campus just fine. With packets.

    WHY was tcpdump not showing the packets? That's what I want to know! Somehow, the UDP packets were being dropped before tcpdump saw them. Strange.



    Wednesday, December 03, 2008

    OES2 SP1 ships!

    Full announcement.

    It's out!



    Saturday, November 29, 2008

    10,000 hours

    I read an excerpt of a book a week or so ago. Always dangerous, as it lacks context. But the general principle of the book was the observation that to get really really good at something requires about 10,000 hours of practice. There are no 'naturals', just people who are naturally more pig-headed than others who can get to 10K hours.

    10K hours is 2-4 hours a day for 10 years.

    The studies were about things like child prodigies, or top tier athletes who get Olympic gold at age 22, and retire by 30. That sort of thing. It seems that almost all of these people started their thing by age 6, and by age 8 there was already a break between the kids who'd ultimately reach the peak of their field and those who'd merely be very good. The ones destined for peak were giving 2-3 hours a day at age 8, where the other group had cut back.

    I believe this also applies to technical expertise. As anyone who has done any job searching in my field knows, there are real breaks for experience levels; 1-3 years, 4-6 years, 10+. Those of us in the 10+ area (and by now I am there with NetWare, and by the end of December I can claim that with Windows) are pretty much technical experts. We've put in the time over the years to get good.

    However, we work in a field where, "Change or Die," is an accurate mantra. The IT industry of 2008 is markedly different than it was in 1998. Windows NT installs right now are laughed at. Very, very little of the operating systems and software in active use in 1998 is still able to be on a support contract. It is hard to be a 10K-hour expert in something in our field, you have to put in 8 hours a day for 5 years.
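For the curious, the hour counts work out roughly as follows. The 250 working days per year is my assumption for the full-time case; the daily-practice case assumes practicing every day of the year.

```python
def total_hours(hours_per_day, days_per_year, years):
    """Accumulated practice hours at a steady daily rate."""
    return hours_per_day * days_per_year * years


if __name__ == "__main__":
    # Daily practice over a decade, 365 days a year.
    print(total_hours(2, 365, 10), "to", total_hours(4, 365, 10))
    # Full-time work: 8 hours a day, assumed 250 working days, 5 years.
    print(total_hours(8, 250, 5))
    # Hours per day needed to hit exactly 10,000 in ten years of daily practice.
    print(round(10000 / (365 * 10), 2))
```

So 2-4 hours a day over ten years brackets the 10K mark, and five years of full-time work lands right on it.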

    My first real exposure to NetWare was in a class I took for my CNA back in the Autumn of 1996. That was on NetWare 4.0, so at least my first experience was with NDS. In fact, my first job with NetWare was with 3.x, so I had to learn bindery on-the-job.

    I consider myself to be an expert in NetWare. I've been actively administering it for 11 years now, so if I'm not across the 10K line I'm really close to it. This is only possible because the 'change or die' mantra has not applied to NetWare over the years. Let's take a look at the biggest disruptions of how things work in NetWare (kernel). These aren't incremental changes; they are fundamental re-learnings of how things work. Sort of like what all the Windows engineers had to go through when Active Directory came onto the scene.
    1. The move to TCP/IP. This by far is the biggest disruption since 1996. NetWare 5.0(?) introduced the ability to do NCP over TCP/IP natively, and not tunneled IPX-over-IP. This required replacing IPX SAP, something the routers just did, with SLP, a service that needed configuration and setup.
    2. The NSS file-system. This was a much lesser move than the TCP/IP one, as it worked on a general level (trustees, quotas, etc) the same as TFS did. Tweaking it for performance, however, was a dark art for many years and much learning was derived out of this.
    3. Protected memory. A concept familiar to anyone who has used Windows or Linux, and all NetWare admins are by now administering one or both of these OS's. While some modules can't use it for whatever reason (iPrint, NetStorage) others (GroupWise) could.
    4. Native File Access Pack. NetWare could do AFP since the NW3 days, the same for NFS. SMB was another story. It was with NetWare 5.1 that NFAP came on to the scene, and NetWare 6.0 where it came built in and performed much better. The ability to use protocols other than NCP for your Windows clients was embraced by many shops.
    There were more changes, but in my mind these are the biggest four. You will note the complete lack of OES in this list. That's because this is a list of the changes to NetWare, and OES-Linux is not NetWare. OES-Linux represents the sort of "change everything you expect" that the rest of the industry does, that the Novell ecosystem hasn't had.

    Over the last 12 years NetWare has remained markedly static. This has allowed enough time for people who don't do this every waking moment to achieve a high level of expertise with NetWare. While this is good for NetWare, it unfortunately shows how NetWare has lagged behind the rest of the industry.

    It is my opinion that OES-Linux represents a decade of pent up change that needed to happen in NetWare but didn't. This is why old time NetWare admins are having such trouble moving to Linux: they're being asked to support an Operating System that they don't have anywhere near the same level of expertise in, and that is uncomfortable. I know I'm moving from an OS that I know exceedingly well to one where there are still, "here be monsters," marked out on my mental map. I'm also having to give up, "10+ year experience with NetWare," in favor of, "2-4 years of experience with Linux," and that doesn't feel good professionally.

    But... that is the nature of our field. Just when we get really good at something, it's time to throw it out and learn something new. That something may be an incremental change from what we know (Windows 2003 vs Windows 2000) or a complete break (NT Domains vs AD Tree). But, learn we must. Us NetWare wonks have just been sheltered from it for some time.



    Friday, October 24, 2008

    Microsoft out-of-band patch, exploits released

    Earlier today, Bugtraq saw a couple of messages with links to actual exploit code for this patch. Now anyone can play!

    On the up side, stuff built with this code will in all probability be detectable with IPS technologies. But that doesn't help devices in places that lack IPS, such as your local Starbucks.



    Wednesday, October 22, 2008

    An old theme made new

    Yesterday on Slashdot was a link to an article that sounds a lot like one I published two years ago tomorrow. The main point of the article is that, given the unrecoverable-read-error rate of your standard SATA drive (one error per 10^14 bits, or 12.5TB, read) and the ever increasing sizes of SATA drives, Raid 5 arrays can get to 12.5TB pretty quickly. Heck, high-end home media servers chock full of HD content can get there very fast.

    While it doesn't say this in the specs page for that new Seagate drive, if you look on page 18 of the accompanying manual you can see the "Nonrecoverable read error" rate of the same 10^14 as I talked about two years ago. So, no improvement in reliability. However.... For their enterprise-class "Savvio" drives, they list a "Nonrecoverable Read Error" rate of 10^16 (1 in 1.25PB), which is better than the 10^15 (125TB) they were doing two years ago on their FC disks. So clearly, enterprise users are juuuust fine for large RAID5 arrays.
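
    A rough sketch of the arithmetic behind those figures, in Python. The function names are made up for illustration, the terabytes are decimal, and the probability part assumes every bit fails independently at exactly the quoted rate, which real drives won't strictly obey.

```python
import math

def tb_per_error(bits_per_error):
    """Decimal terabytes read, on average, per unrecoverable read error."""
    return bits_per_error / 8 / 10**12

def chance_of_error(tb_read, bits_per_error):
    """Chance of at least one unrecoverable error while reading tb_read TB.

    Assumes each bit fails independently at a rate of 1/bits_per_error.
    """
    bits_read = tb_read * 10**12 * 8
    # 1 - (1 - p)**n, computed with log1p/expm1 so the tiny p is not rounded away
    return -math.expm1(bits_read * math.log1p(-1.0 / bits_per_error))

for label, rate in (("SATA", 10**14), ("older FC", 10**15), ("Savvio", 10**16)):
    print(f"{label}: one error per {tb_per_error(rate):g} TB read, "
          f"{chance_of_error(12.5, rate):.1%} chance of an error reading 12.5 TB")
```

    Under that independence assumption, reading a full 12.5TB off 10^14-class drives has roughly a 63% chance of tripping at least one unrecoverable error, while the same read on 10^16-class drives comes in at about 1%.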

    As I said before, the people who are going to be bitten by this will be home media servers. Also, whiteboxed homebrew servers for small/medium businesses will be at risk. So those of you who have to justify buying the really expensive disks, when there are el-cheepo 1.5TB drives out there? You can use this!



    Thursday, October 02, 2008

    MSA performance in the new config

    Today I reconfigured the MSA1500 to run in Active/Active mode. While there, I also rearranged our disk arrays. We have 41 500GB, 7.2K RPM drives in there. I created two 20-disk arrays, and filled each array with Raid 0+1 LUNs. This yielded 9TB of useful space. That extra drive will stay extra until we get an odd number of new drives.
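
    The space math can be sketched like this. It is only an illustration: the function name is made up, and it assumes the 9TB figure is the array reporting in binary units, which is a guess and ignores any controller overhead.

```python
def mirrored_usable_bytes(drives, drive_bytes):
    """Usable bytes when every drive is paired with a mirror (Raid 0+1)."""
    return (drives // 2) * drive_bytes

# Two 20-disk arrays of 500GB (decimal) drives; the 41st drive sits idle.
usable = 2 * mirrored_usable_bytes(20, 500 * 10**9)
print(usable / 10**12, "TB in decimal units")
print(round(usable / 2**40, 2), "TB in binary units")
```

    Half of the 20TB raw goes to mirroring, leaving 10TB in decimal units, or about 9.09 in binary units.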

    Yes, a profligate waste of space but at least it'll be fast. It also had the added advantage of not needing to stripe in like Raid5 or Raid6 would have. This alone saved us close to two weeks flow time to get it back into service.

    Another benefit to not using a parity RAID is that the MSA is no longer controller-CPU bound for I/O speeds. Right now I have a pair of writes, each effectively going to a separate controller, and the combined I/O is on the order of 100Mbs while controller CPU loads are under 80%. Also, more importantly, Average Command Latency is still in the 20-30ms range.

    The limiting factor here appears to be how fast the controllers can commit I/O to the physical drives, rather than how fast the controllers can do parity-calcs. CPU not being saturated suggests this, but a "show perf physical" on the CLI shows the queue depth on individual drives:
    [Figure: queue depth chart]
    The drives with a zero are associated with LUNs being served by the other controller, and thus not listed here. But a high queue depth is a good sign of I/O saturation on the actual drives themselves. This is encouraging to me, since it means we're finally, finally, after two years, getting the performance we need out of this device. We had to go to an active/active config with a non-parity RAID to do it, but we got it.



    Friday, September 19, 2008

    Monitoring ESX datacenter volume stats

    A long while back I mentioned I had a perl script that we use to track certain disk space details on my NetWare and Windows servers. That goes into a database, and it can make for some pretty charts. A short while back I got asked if I could do something like that for the ESX datacenter volumes.

    A lot of googling later I found how to turn on the SNMP daemon for an ESX host, and a script or two to publish the data I need by SNMP. It took some doing, but it ended up pretty easy to do. One new perl script, the right config for snmpd on the ESX host, setting the ESX host's security policy to permit SNMP traffic, and pointing my gathering script at the host.

    The perl script that gathers the local information is very basic:
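    #!/usr/bin/perl -w
    # Usage: vmfsspace.specific <VMFS volume UUID> <friendly name>

    use strict;
    # \Q...\E backslash-escapes the path before it is handed to the shell.
    my $vmfsvolume = "\Q/vmfs/volumes/$ARGV[0]\E";
    my $vmfsfriendly = $ARGV[1];
    my $capRaw = 0;
    my $capBlock = 0;
    my $blocksize = 0;
    my $freeRaw = 0;
    my $freeBlock = 0;
    my $freespace = "";
    my $totalspace = "";
    # Read the volume details from vmkfstools through a pipe.
    open(my $vmkfs, "/usr/sbin/vmkfstools -P $vmfsvolume|") or die "Cannot run vmkfstools: $!";
    while (<$vmkfs>) {
        # Captures: capacity (bytes), capacity (blocks), block size (bytes), free (bytes), free (blocks).
        if (/Capacity ([0-9]*).*\(([0-9]*).* ([0-9]*)\), ([0-9]*).*\(([0-9]*).*avail/) {
            $capRaw = $1;
            $capBlock = $2;
            $blocksize = $3;
            $freeRaw = $4;
            $freeBlock = $5;
            $freespace = $freeBlock;
            $totalspace = $capBlock;
            $blocksize = $blocksize/1024;    # block size in KB
            #print ("1 = $1\n2 = $2\n3 = $3\n4 = $4\n5 = $5\n");
            # One value per line: friendly name, total blocks, free blocks, block size.
            print ("$vmfsfriendly\n$totalspace\n$freespace\n$blocksize\n");
        }
    }
    close($vmkfs);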


    Then append the following lines to the /etc/snmp/snmpd.conf file (in my case):

    exec .1.3.6.1.4.1.6876.99999.2.0 vmfsspace /root/bin/vmfsspace.specific 48cb2cbc-61468d50-ed1f-001cc447a19d Disk1

    exec .1.3.6.1.4.1.6876.99999.2.1 vmfsspace /root/bin/vmfsspace.specific 48cb2cbc-7aa208e8-be6b-001cc447a19d Disk2


    The first parameter after exec is the OID to publish. The script returns an array of values, one element per line, that are assigned to .0, .1, .2 and on up. I'm publishing the details I'm interested in, which may be different than yours. That's the 'print' line in the script.

    The script itself lives in /root/bin/ since I didn't know where better to put it. It has to have execute rights for Other, though.

    The big unique-ID looking number is just that, a UUID. It is the UUID assigned to the VMFS volume. The VMFS volumes are multi-mounted between each ESX host in that particular cluster, so you don't have to worry about chasing the node that has it mounted. You can find the number you want by logging in to the ESX host on the SSH console, and doing a long directory on the /vmfs/volumes folder. The friendly name of your VMFS volume is symlinked to the UUID. The UUID is what goes in to the snmpd.conf file.

    The last parameter ("Disk1" and "Disk2" above) is the friendly name of the volume to publish over SNMP. As you can see, I'm very creative.

    These values are queried by my script and dropped into the database. Since the ESX datacenter volumes only get space consumed when we provision a new VM or take a snapshot, the graph is pretty chunky rather than curvy like the graph I linked to earlier. If VMware ever changes how the vmkfstools command returns data, this script will break. But until then, it should serve me well.
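
    To make that fragility concrete, here is a minimal Python sketch of the same parsing step. It reuses the Perl pattern, with + in place of * so empty matches can't slip through. The sample line is invented to fit the shape the pattern expects, and the function and key names are made up; none of it is copied from a real host.

```python
import re

# Same shape as the Perl pattern: capacity bytes, capacity blocks,
# block size in bytes, free bytes, free blocks.
CAPACITY_RE = re.compile(
    r"Capacity ([0-9]+).*\(([0-9]+).* ([0-9]+)\), ([0-9]+).*\(([0-9]+).*avail"
)

# Invented sample in the shape the pattern expects, not real vmkfstools output.
SAMPLE = ("Capacity 499826819072 (476672 file blocks * 1048576), "
          "252085010432 (240407 blocks) avail")

def parse_capacity(line):
    """Return the volume figures from one output line, or None if it does not match."""
    match = CAPACITY_RE.search(line)
    if match is None:
        return None
    cap_raw, cap_blocks, block_bytes, free_raw, free_blocks = map(int, match.groups())
    return {
        "total_blocks": cap_blocks,
        "free_blocks": free_blocks,
        "block_kb": block_bytes // 1024,
        "pct_free": 100.0 * free_raw / cap_raw,
    }

print(parse_capacity(SAMPLE))
print(parse_capacity("output in some future, different format"))
```

    The second call returns None. The Perl version fails the same quiet way: if the pattern stops matching, it prints nothing at all, so the gap shows up in the graph rather than as an error.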



    Moving storage around

    The EVA6100 went in just fine with that one hitch I mentioned, and now comes all the work we need to do now that we have actual space again. We're still arguing over how much space to add to which volumes, but once we decide all but Blackboard will be very easy to add.

    Blackboard needs more space on both the SQL server and the Content server, and as the Content server is clustered it'll require an outage to manage the increase. And it'll be a long outage, as 300GB of weensy files takes a LONG time to copy. The SQL server uses plain old Basic partitions, which I don't think we can expand, so we may have to do another full LUN copy, which will require an outage. That has yet to be scheduled, but needs to happen before we get through much of the quarter.

    Over on the EVA4400 side, I'm evacuating data off of the MSA1500cs onto the 4400. Once I'm done with that, I'm going to be:
    1. Rebuilding all of the Disk Arrays.
    2. Creating LUNs expressly for Backup-to-Disk functionality.
    3. Flashing the Active/Active firmware on to it, the 7.00 firmware rev.
    4. Getting the two Backup servers installed with the right MPIO widgetry to take advantage of active/active on the MSA.
    But first we need the DataProtector licensing updates to beat their way through the forest of paperwork and get ordered. Otherwise, we can't use more than 5TB of disk, and that's WAY wimpy. I need at LEAST 20, and preferably 40TB. Once that licensing is in place, we can finally decommission the out-of-license BackupExec server and use the 6 slot tape library with DataProtector instead. This should significantly increase how much data we can throw at backup devices during our backup window.

    What has yet to be fully determined is exactly how we're going to use the 4400 in this scheme. I expect to get between 15-20TB of space out of the MSA once I'm done with it, and we have around 20TB on the 4400 for backup. Which is why I'd really like that 40TB license please.

    Going Active/Active should do really good things for how fast the MSA can throw data at disk. As I've proven before, the MSA is significantly CPU bound for I/O to parity LUNs (Raid5 and Raid6), so having another CPU in the loop should increase write throughput significantly. We couldn't do Active/Active before since you can only do Active/Active in a homogeneous OS environment, and we had Windows and NetWare pointed at the MSA (plus one non-production Linux box).

    In the mean time, I watch progress bars. TB of data takes a long time to copy if you're not doing it at the block level. Which I can't.



    Sunday, September 14, 2008

    EVA6100 upgrade a success

    Friday night four HP techs arrived to put together the EVA6100 from a pile of parts and the existing EVA3000. It took them 5 hours to get it to the point where we could power-on and see if all of our data was still there (it was, yay), and a few hours after that on our behalf to put everything back together.

    There was only one major hitch for the night, which meant I got to bed around 6am Saturday morning instead of 4am.

    For EVA, and probably all storage systems, you present hosts to them and selectively present LUNs to those hosts. These host-settings need to have an OS configured for them, since each operating system has its own quirks for how it likes to see its storage. While the EVA6100 has a setting for 'vmware', the EVA3000 did not. Therefore, we had to use a 'custom' OS setting and a 16 digit hex string we copied off of some HP knowledge-base article. When we migrated to the EVA6100 it kept these custom settings.

    Which, it would seem, don't work for the EVA6100. It caused ESX to whine in such a way that no VMs would load. It got very worrying for a while there, but thanks to an article on vmware's support site and some intuition we got it all back without data loss. I'll probably post what happened and what we did to fix it in another blog post.

    The only service that didn't come up right was secure IMAP for Exchange. I don't know why it decided to not load. My only theory is that our startup sequence wasn't right. Rebooting the HubCA servers got it back.



    Thursday, September 11, 2008

    Fixing DNS issues

    I've noticed some slow DNS on my station for the last few weeks and finally got down to checking it out. In the wake of the cache-poisoning scare of late July, we had to upgrade our DNS servers to something a bit less scarily old. I believe this required an operating system rev. The last time this happened to me, we figured out that the DNS server in question had auto-negotiated itself to 10-HalfDuplex, and the switch thought it was 100-FullDuplex. You can imagine what that did to throughput.

    I fired up wireshark and started tracking my DNS requests. A pattern soon emerged. The first entry in my resolv.conf list was taking anywhere from .5 to 5.2 seconds to resolve most queries. This is hella slow for a DNS server. Since I don't manage these machines, I let the admin who did manage 'em know about it. He couldn't find anything wrong with the DNS servers on a first glance.

    Another thing I noticed when looking at the resolver requests I was passing was a lot of IPv6 requests. Almost all of them were for Active Directory related queries, as I've turned off IPv6 support in my web-browser. I still haven't quite figured out how to disable IPv6 on my openSUSE 10.3 machine here.

    As it happens, said DNS admin came back in and said to look at things again. So I dropped into nslookup and started throwing queries and watching the response times in wireshark, and sure enough they were zippy again. He turned off IPv6 support on the DNS servers.
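
    For a quick check without a packet capture, something like the following Python sketch times lookups per address family. It is not what I ran, just an illustration: the function names are made up, and it goes through the system resolver (hosts file, caches and all), so it measures what applications see rather than what one specific DNS server does.

```python
import socket
import time

def time_lookup(name, family, resolver=socket.getaddrinfo):
    """Return (seconds, succeeded) for one lookup of name limited to one address family."""
    start = time.perf_counter()
    try:
        resolver(name, None, family)
        succeeded = True
    except socket.gaierror:
        succeeded = False
    return time.perf_counter() - start, succeeded

def compare_families(name, resolver=socket.getaddrinfo):
    """Time an IPv4-only and an IPv6-only lookup of the same name."""
    families = (("IPv4", socket.AF_INET), ("IPv6", socket.AF_INET6))
    return {label: time_lookup(name, family, resolver) for label, family in families}

if __name__ == "__main__":
    for label, (seconds, succeeded) in compare_families("localhost").items():
        status = "ok" if succeeded else "failed"
        print(f"{label}: {seconds:.3f}s ({status})")
```

    Swap "localhost" for a name you care about. If the IPv6 lookup is consistently slower, or only fails after a long pause, that points the same direction the wireshark trace did.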

    Looks like we'll probably need to have a conversation on campus about IPv6 sooner rather than later. Vista comes with it turned on by default, and happily we don't have much of that yet. But these newer linux distros all have it turned on by default.



    Wednesday, September 10, 2008

    That darned budget

    This is where I whine about not having enough money.

    It has been a common complaint amongst my co-workers that WWU wants enterprise level service for a SOHO budget. Especially for the Win/Novell environments. Our Solaris stuff is tied in closely to our ERP product, SCT Banner, and that gets big budget every 5 years to replace. We really need the same kind of thing for the Win/Novell side of the house, such as this disk-array replacement project we're doing right now.

    The new EVAs are being paid for by Student Tech Fee, and not out of a general budget request. This is not how these devices should be funded, since the scope of this array is much wider than just student-related features. Unfortunately, STF is the only way we could get them funded, and we desperately need the new arrays. Without the new arrays, student service would be significantly impacted over the next fiscal year.

    The problem is that the EVA3000 contains between 40-45% directly student-related storage. The other 55-60% is Fac/Staff storage. And yet, the EVA3000 was paid for by STF funds in 2003. Huh.

    The summer of 2007 saw a Banner Upgrade Project, when the servers that support SCT Banner were upgraded. This was a quarter million dollar project and it happens every 5 years. They also got a disk-array upgrade to a pair of StorageTek (SUN, remember) arrays, DR replicated between our building and the DR site in Bond Hall. I believe they're using Solaris-level replication rather than Array-level replication.

    The disk-array upgrade we're doing now got through the President's office just before the boom went down on big expensive purchases. It languished in the Purchasing department due to summer-vacation related under-staffing. I hate to think how late it would have gone had it been subjected to the added paperwork we now have to go through for any purchase over $1000. Under no circumstances could we have done it before Fall quarter. Which would have been bad, since we were too short to deal with the expected growth of storage for Fall quarter.

    Now that we're going deep into the land of VMWare ESX, centralized storage-arrays are line of business. Without the STF funded arrays, we'd be stuck with "Departmental" and "Entry-level" arrays such as the much maligned MSA1500, or building our own iSCSI SAN from component parts (a DL385, with 2x 4-channel SmartArray controller cards, 8x MSA70 drive enclosures, running NetWare or Linux as an iSCSI target, with bonded GigE ports for throughput). Which would blow chunks. As it is, we're still stuck using SATA drives for certain 'online' uses, such as a pair of volumes on our NetWare cluster that are low usage but big consumers of space. Such systems are not designed for the workloads we'd have to subject them to, and are very poor performers when doing things like LUN expansions.

    The EVA is exactly what we need to do what we're already doing for high-availability computing, yet is always treated as an exceptional budget request when it comes time to do anything big with it. Since these things are hella expensive, the budgetary powers-that-be balk at approving them and like to defer them for a year or two. We asked for a replacement EVA in time for last year's academic year, but the general-budget request got denied. For this year we went, IIRC, both with general-fund and STF proposals. The general fund got denied, but STF approved it. This needs to change.

    By October, every person between me and Governor Gregoire will be new. My boss is retiring in October. My grandboss was replaced last year, my great grand boss also has been replaced in the last year, and the University President stepped down on September 1st. Perhaps the new people will have a broader perspective on things and might permit the budget priorities to be realigned to the point that our disk-arrays are classified as the critical line-of-business investments they are.



    Disk-array migrations done right

    We have two new HP EVA systems: an EVA4400 with FATA drives that we'll be putting into our DR datacenter in Bond Hall, and an upgrade of our EVA3000 into an EVA6100 + 2 new enclosures. The 4400 is a brand new device, so is sitting idle right now (officially). It will be replacing the MSA1500 we purchased two years ago, and will fulfill the duties the MSA should have been doing but is too stupid to do.

    We've set up the 4400 already, and as part of that we had to upgrade our CommandView version from the 4.something it was with the EVA3000 to CommandView 8. As a side effect of this, we lost licensing for the 3000 but that's OK since we're replacing that this weekend. I'm assuming the license codes for the 6100 are in the boxes the 6100 parts are in. We'll find that out Friday night, eh?

    One of the OMG NICE things that comes with the new CommandView is a 60 day license for both ContinuousAccess EVA and BusinessCopy EVA. ContinuousAccess is the EVA to EVA replication software, and is the only way to go for EVA to EVA migrations. We started replicating LUNs on the 6100 to the 4400 on Monday, and they just got done replicating this morning. This way, if the upgrade process craters and we lose everything, we have a full block-level replica on the 4400. So long as we get it all done by 10/26/2008, which we should do.

    On a lark we priced out what purchasing both products would cost. About $90,000, and that's with our .edu discount. That's a bit over half the price of the HARDWARE, which we had to fight tooth and nail to get approved in the first place. So. Not getting it for production.

    But the 60 day license is the only way to do EVA to EVA migrations. In 5 years when the 6100 falls off of maintenance and we have to forklift replace a new EVA in, it will be ContinuousAccess EVA (eval) that we'll use to replicate the LUNs over to the new hardware. Then on migration date we'll shut everything down ("quiesce I/O"), make sure all the LUN presentations on the new array look good, break the replication groups, and rezone the old array out. Done! Should be a 30 minute outage.

    Without the eval license it'd be a backup-restore migration, and that'd take a week.



    Wednesday, September 03, 2008

    EVA4400 + FATA

    Some edited excerpts of internal reports I've generated over the last (looks at watch) week. The referenced testing operations involve either a single stream of writes, or two streams of writes in various configurations:
    Key points I've learned:
    • The I/O controllers in the 4400 are able to efficiently handle more data than a single host can throw at it.
    • The FATA drives introduce enough I/O bottlenecks that multiple disk-groups yield greater gains than a single big disk-group.
    • Restripe operations do not cause anywhere near the problems they did on the MSA1500.
    • The 4400 should not block-on-write the way the MSA did, so the NetWare cluster can have clustered volumes on it.
    The "Same LUN" test showed that Write speeds are about half that of the single threaded test, which gives about equal total throughput to disk. The Read speeds are roughly comparable, giving a small net increase in total throughput from disk. Again, not sure why. The Random Read tests continue to perform very poorly, though total throughput in parallel is better than the single threaded test.

    The "Different LUN, same disk-group," test showed similar results to the "Same LUN" test in that Write speeds were about half of single threaded yielding a total Write throughput that closely matches single-threaded. Read speeds saw a difference, with significant increases in Read throughput (about 25%). The Random Read test also saw significant increases in throughput, about 37%, but still is uncomfortably small at a net throughput of 11 MB/s.

    The "Different LUN, different disk-group," test did show some I/O contention. For Write speeds, the two writers showed speeds that were 67% and 75% of the single-threaded speeds, yet showed a total throughput to disk of 174 MB/s. Compare that with the fastest single-threaded Write speed of 130 MB/s. Read performance was similar, with the two readers showing speeds that were 90% and 115% of the single-threaded performance. This gave an aggregate throughput of 133 MB/s, which is significantly faster than the 113 MB/s turned in by the fastest Reader test.
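
    To put a number on "significantly faster", here is the gain from running two streams against different disk-groups, worked out in Python from the aggregate and fastest single-threaded figures quoted above. The function name is made up for illustration.

```python
def gain_pct(aggregate_mb_s, single_mb_s):
    """Percent increase of aggregate throughput over the best single stream."""
    return (aggregate_mb_s - single_mb_s) / single_mb_s * 100

print(f"Write: {gain_pct(174, 130):.1f}% over the fastest single writer")
print(f"Read:  {gain_pct(133, 113):.1f}% over the fastest single reader")
```

    That works out to roughly a third more Write throughput and a bit under a fifth more Read throughput than the best single stream managed.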

    Adding disks to a disk-group appears to not significantly impact Write speeds, but does significantly impact Read speeds. The Read speed dropped from 28 MB/s to 15 MB/s. Again, a backup-to-disk operation wouldn't notice this sort of activity. The Random Read test showed a similar reduction in performance. As Write speeds were not affected by restripe, the sort of cluster hard-locks we saw with the MSA1500 on the NetWare cluster will not occur with the EVA4400.

    And finally, a word about controller CPU usage. In all of my testing I've yet to saturate a controller, even during restripe operations. It was the restripe ops that killed the MSA, and the EVA doesn't seem to block nearly as hard. Yes, read performance is dinged, but not nearly to the levels that the MSA does. This is because the EVA keeps its cache enabled during restripe-ops, unlike the MSA.
    One thing I alluded to in the above is that Random Read performance is rather bad. And yes, it is. Unfortunately, I don't yet know if this is a feature of testing methodology or what, but it is worrisome enough that I'm figuring it into planning. The fastest random-read speed turned in for a 10GB file, 64KB nibbles, came to around 11 MB/s. This was on a 32-disk disk-group on a Raid5 vdisk. Random Read is the test that most closely approximates file-server or database loads, so it is important.

    HP has done an excellent job tuning the caches for the EVA4400, which makes Write performance exceed Read performance in most cases. Unfortunately, you can't do the same reordering optimization tricks for Read access that you can for Writes, so Random Read is something of a worst-case scenario for these sorts of disks. HP's own documentation says that FATA drives should not be used for 'online' access such as file-servers or transactional databases. And it turns out they really meant that!

    That said, these drives' sequential write performance is excellent, making them very good candidates for Backup-to-Disk loads so long as fragmentation is constrained. The EVA4400 is what we really wanted two years ago, instead of the MSA1500.

    Still no word on whether we're upgrading the EVA3000 to an EVA6100 this weekend, or next weekend. We should know by end-of-business today.



    Wednesday, August 27, 2008

    Woot!

    The EVAs are scheduled to deliver today! This means that we are very probably going to be taking almost every IT system we have down starting late Friday 9/5 and going until we're done. We have a meeting in a few minutes to talk strategy.

    There was some fear that the gear wouldn't get here in time for the 9/5 window. The 9/12 window has one of the key, key people needed to handle the migration in Las Vegas for VMWorld, and he won't be back until 9/21 which also screws with the 9/19 window. The 9/19 window is our last choice, since that weekend is move-in weekend and the outage will be vastly more noticeable with students around. Being able to make the 9/5 window is great! We need these so badly that if we didn't get the gear in time, we'd have probably done it 9/12 even without said key player.

    The one hitch is if HP can't do 9/5-6 for some reason. Fret. Fret.



    Monday, August 25, 2008

    Dynamic partitions in Server 2008 and Cluster

    It would seem, and I've yet to trace down definitive proof of this, that Windows Server 2008 Clustering still has the Basic Partitioning dependency in it. This limits Windows LUNs to 2TB, among other annoyances. Such as the fact that resizing one of those puppies requires a full copy onto a larger LUN rather than extending the one you already have. How... 1999.



    Email sizes

    The question has been raised internally that perhaps we need to reassess what we've set for email message-size limits. When we set our current limit, we did it to the apparent de facto standard for mail size limits, which is about 10 meg.

    This, perhaps, is not what it should be for an institution of higher-ed where research is performed. We have certain researchers on campus that routinely play with datasets larger than 10MB, sometimes significantly larger. And these researchers would like to electronically distribute these datasets to other researchers, and the easiest means of doing that by far is email. The primary reason we have mail-servers serving the (for example) chem.wwu.edu domain is to give these folk much larger message size limits. Otherwise, these folk would have their primary email in Exchange.

    The routine answer I've heard for handling really large file sizes is to use, "alternate means," to send the file. We don't have an FTP server for staff use, since we have a policy that forbids the use of unauthenticated protocols for transmitting passwords and things. We could do something like Novell does with ftp.novell.com/incoming and create a drop-box that anyone with a WWU account can read, but that's sort of a blunt-force solution and by definition half of a half-duplex method. Our researchers would like a full duplex method, and email represents that.

    So what are you all using for email size limits? Do you have any 'out of band' methods (other than snail mail) for handling larger data sizes?



    Tuesday, August 19, 2008

    IPv6 uptake

    Not too long ago I asked the question about what our plans were about IPv6. While the telecom guys didn't actually laugh at me, it was clear the question was considered a bit silly. After all, we are the proud owners of a full out class B (140.160.0.0/16) so IPv4 address exhaustion is not something we're likely to run into very soon. Certainly not by 2014 when we should be 'out' of IPv4 address space on the internet. Will IANA repossess our 'unused' spaces? Don't know, probably not.
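
    For a sense of scale, Python's standard ipaddress module can size up that block. This is just an illustration; the variable name and the sample host address are picked arbitrarily.

```python
import ipaddress

campus = ipaddress.ip_network("140.160.0.0/16")

print(campus.num_addresses)                           # addresses in the block
print(ipaddress.ip_address("140.160.1.1") in campus)  # membership test for one address
```

    A /16 holds 65,536 addresses, which is a lot of headroom for a single campus.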

    That said, us moving to IPv6 will require a few things, none of them internal processes:
    • A bill by the State Legislature mandating IPv6 uptake by all State agencies. We're not subject to the already existing Federal rule.
    • Enough of the general internet is routing IPv6 that the IPv4-over-IPv6 tunneling causes enough headaches we need to move due to user revolts.
    • Some new widget, be it server tech or some kind of net-attached device, only supports IPv6 and we need to get it running.
    Of course, if the powers that be here decided that it must be done, and our telecom people fail to talk them out of it, it could still happen.



    Monday, August 18, 2008

    Enabling autokey auth in NTP on SLES10

    The NTP protocol permits the use of crypto to authenticate clients and servers to each other, as well as between time servers. By default, SLES10 is set up to allow the v3 method of using symmetric keys, but not the v4 method that uses public/private keys. If you want to use the v4 method, this is the tip for you.

    Background

    By default SLES runs NTP inside a chroot jail. This can be changed from the YaST NTP config screen if you wish. This is a more secure method of running NTP. The chroot jail's root is at /var/lib/ntp/.

    Additionally, ntp runs with an AppArmor profile loaded against it for added security.

    Getting NTPv4 auth to work

    There are 4 steps to get this to work.

    1. Copy the .rnd file to the chroot jail
    2. Run ntp-keygen
    3. Modify the AppArmor profile for /usr/sbin/ntpd to allow read access to the new files
    4. Modify the /etc/ntp.conf file to enable v4 auth.

    Copy the .rnd file to the chroot jail

    By default, there should be a .rnd file at /root/.rnd. If so, copy this to /var/lib/ntp/etc/.rnd. If there is no file there, one can be generated through use of openssl.

    timehost:~ # openssl rand -out /var/lib/ntp/etc/.rnd 1

    Run ntp-keygen

    Change-directory to /var/lib/ntp/etc, and execute the following command:

    timehost:/var/lib/ntp/etc # ntp-keygen -T

    This will drop a pair of files in the directory where you run it, so running it while in /var/lib/ntp/etc saves you the step of copying them to this directory.

    Modify the AppArmor profile

    This is done through YaST

    1. Launch YaST
    2. Go to the "Novell AppArmor" section, and enter the "Edit Profile" tool.
    3. Select "/usr/sbin/ntpd" and click Next.
    4. Click the "Add Entry" button and select File.
    5. Browse to /var/lib/ntp/etc/.rnd and click the "Read" permissions check-box, and click OK
    6. Repeat the previous two steps to add the two files created by ntp-keygen, named "ntpkey_cert_[hostname]" and "ntpkey_host_[hostname]".
      1. Note: AppArmor behavior changes between SP1 and SP2. In SP1 you can use the link files, in SP2 you need to specify the link targets.
    7. Click Done on the main Profile Dialog
    8. Agree to reload the AppArmor profile

    Modify /etc/ntp.conf

    The YaST tool for NTP doesn't allow for v4 configurations, so this has to be done on the command line. Open the /etc/ntp.conf file with your editor of choice, and insert the following lines before your "server" lines:

    keysdir /var/lib/ntp/etc/
    crypto randfile /var/lib/ntp/etc/.rnd

    Then append the word "autokey" to the server and peer lines of your choice. At this point, you should be able to restart ntpd, and it will use authentication. This is a very basic NTPv4 configuration, but it should lay the groundwork for more complex configs.
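
    As a rough sketch of what the relevant part of the file ends up containing, the helper below prints the two crypto lines from this tip followed by autokey-enabled server lines. The function name and the example.edu hosts are placeholders made up for illustration; substitute your own time servers.

```shell
# Sketch only: print the ntp.conf lines described in this tip.
# ntp_autokey_conf and the example.edu host names are made-up placeholders.
ntp_autokey_conf() (
    jail_etc="/var/lib/ntp/etc"

    # These two lines go before any "server" lines.
    printf 'keysdir %s/\n' "$jail_etc"
    printf 'crypto randfile %s/.rnd\n' "$jail_etc"

    # One autokey-enabled "server" line per host name given.
    for host in "$@"; do
        printf 'server %s autokey\n' "$host"
    done
)

ntp_autokey_conf time1.example.edu time2.example.edu
```

    The output is meant to be compared against, or pasted by hand into, /etc/ntp.conf rather than appended blindly.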



    Thursday, August 14, 2008

    Virtualization and Fileservers

    There are some workloads that fit well within a VM of any kind, and others that are very tricky. Fileservers are one workload that is not a good candidate for VM. In some cases they qualify as highly transactional. In others, the memory required to do fileserving well makes them very expensive. When you can fit 40 web-servers on a VM host, but only 4 fileservers, it makes the calculus obvious.

    This is on my mind since we're running into memory problems on our NetWare cluster. We've just plain outgrown the 32-bit memory space for file-cache. NW can use memory above the 4GB line (it does have PAE support), but memory access above there is markedly slower than it is below the line. Last I heard, the conventional wisdom is that 12GB is about the point where it starts earning you performance gains again. eek!

    So, I'm looking forward to 64-bit memory spaces and OES2. 6GB should do us for a few years. That said, 6GB of actually-used RAM in a virtual-host means that I could fit... two of them on a VM server with 16GB of RAM.

    16GB of RAM in, say, an ESX cluster is enough to host 10 other servers. Especially with memory deduplication. In the case of my hypothetical 6GB file-servers, 5.5GB of that RAM will be consumed by file-cache that will be unique to that server, and thus will gain very little from memory de-dup.

    In the end, how well a fileserver fits in a VM environment is based on how large a 'working set' your users have. If the working set is large enough, it can mean that you'll get small gains from virtualization. However, I realize fileserving on the scale we do it is somewhat rare, so for departmental fileservers VM can be a good-sized win. As always, know your environment.

    In light of the budgetary woes we'll be having, I don't know what we'll do. Last I heard the State is projected to have a $2.7 billion deficit for the 2009-2011 (fiscal year starts July 1) budget cycle. So it may very well be possible that the only way I'll get access to 64-bit memory spaces is in an ESX context. That may mean a 6 node cluster on 3 physical hosts. And that's assuming I can get new hardware at all. If it gets bad enough I'll have to limp along until 2010 and play partitioning games to load-balance my data-loads across all 6 nodes. By 2011 all of our older hardware falls off of cheap-maintenance and we'll have to replace it, so worst-case that's when I can do my migration to 64-bit. Arg.



    Friday, August 01, 2008

    Older, but still a goodie.

    Bruce Schneier, whom I've met once, had an essay last year on the state of the art of password guessing. Not cracking, à la rainbow tables, but guessing. If you want to generate passwords that have to be cracked rather than guessed, this is the essay for you.



    Friday, July 25, 2008

    Handling eDirectory core-files on linux

    If you've been getting core files generated by ndsd on your Linux servers, and want to call Novell Support about it, there are a few things you can do to maximize what Novell will get out of the files themselves. You may not get much, but these will help the people with the debug symbols figure out what's going on.

    Packaging the Core


    First and foremost, the tool to package core files for delivery to Novell is already on your system. TID3078409 describes the details of how to use 'novell-getcore.sh'. It is included on 8.7.3.x installations as well as 8.8.x installations.

    Running it looks like this:
    edirsrv1:~ # novell-getcore -b /var/opt/novell/eDirectory/data/dib/core.31448 /opt/novell/eDirectory/sbin/ndsd
    Novell GetCore Utility 1.1.34 [Linux]
    Copyright (C) 2007 Novell, Inc. All rights reserved.


    [*] User specified binary that generated core: /opt/novell/eDirectory/sbin/ndsd
    [*] Processing '/var/opt/novell/eDirectory/data/dib/core.31448' with GDB...
    [*] PreProcessing GDB output...
    [*] Parsing GDB output...
    [*] Core file /var/opt/novell/eDirectory/data/dib/core.31448 is a valid Linux core
    [*] Core generated by: /opt/novell/eDirectory/sbin/ndsd
    [*] Obtaining names of shared libraries listed in core...
    [*] Counting number of shared libraries listed in core...
    [*] Total number of shared libraries listed in core: 72
    [*] Corefile bundle: core_20080725_092227_linux_ndsd_edirsrv1
    [*] Generating GDBINIT commands to open core remotely...
    [*] Generating ./opencore.sh...
    [*] Gathering package info...
    [*] Creating core_20080725_092227_linux_ndsd_edirsrv1.tar...
    [*] GZipping ./core_20080725_092227_linux_ndsd_edirsrv1.tar...
    [*] Done. Corefile bundle is ./core_20080725_092227_linux_ndsd_edirsrv1.tar.gz


    Once you have the packaged core, you can upload it to ftp.novell.com/incoming as part of your service-request.

    Including More Data


    If you're lucky enough to be able to cause the core file to drop on demand, or it just plain happens often enough that repetition isn't a problem, there is one more thing you can do to include better data in the core you ship to Novell. TID3113982 describes a setting you can add to the ndsd launch script (/etc/init.d/ndsd) that'll include more data. The TID describes what is being done pretty well. In essence, you're using an alternate malloc call that fails with better information than the normal one. You don't want to run with this set for very long, especially in busy environments, as it impacts performance. But if you have a repeatable core, the information it can provide is better than a 'naked' core. Setting MALLOC_CHECK_=2 is my recommendation.

    Be sure to unset this once you're done troubleshooting. As I said, it can impact performance of your eDirectory server.
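
    The TID has the exact edit for /etc/init.d/ndsd, and it isn't reproduced here. As a general illustration of how the variable behaves: glibc reads MALLOC_CHECK_ from the environment of the process being started, so it can be scoped to a single command instead of being set globally. The with_malloc_check name is made up for this sketch.

```shell
# Sketch only: run one command with glibc's malloc checking turned on.
# with_malloc_check is a hypothetical helper name, not a Novell tool.
# With MALLOC_CHECK_=2, glibc calls abort() when it detects heap
# corruption, which is what yields the more informative core file.
with_malloc_check() {
    MALLOC_CHECK_=2 "$@"
}

# The variable is visible only to the command that was launched:
with_malloc_check sh -c 'echo "inside: MALLOC_CHECK_=$MALLOC_CHECK_"'
echo "outside: MALLOC_CHECK_=${MALLOC_CHECK_:-unset}"
```

    Scoping it this way also makes the "unset it when you're done" step harder to forget, since nothing lingers in the calling shell.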



    Monday, July 14, 2008

    An Exchange 2007 problem

    While I was on vacation we had a few more instances of email going into a black hole. This is not good. I had suspected this was happening, but proof accumulated while I was broiling in the mid-west.

    After doing a lot of message tracing in Exch2007, I noticed one trend. When an email to a group hits the Hub server, it attempts to dereference the group into a list of mailboxes to deliver to. It uses Global Catalogs for this function. When the GC used was one in our empty root rather than the child domain that everything lives in, this one group didn't return any people. The tracking code was, "dereferenced, 0 recipients". Which is a fail-by-success.

    After a LOT of digging, I threw an LDAP browser at the GCs. What I noticed is that the GC entry for this one group was subtly different on the empty-root GC and the child-domain GC. Specifically, on the empty-root GC the object had no "member" attributes.

    It turns out the problem was that the group in question was set to a Global group, rather than a Universal group. Ahah! Global groups apparently don't publish Member info globally, just in the domain itself. Universal groups are just that, Universal, and publish enterprise wide. Right. Gotcha.

    Exch2003 did not manifest this, as it stayed in the domain pretty solidly. I don't know how many of our groups are still Global groups, but this one is going to take some clean-up to fix.
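
    For the clean-up, the scope of a group can be read from its groupType attribute, which is a bit field: 0x2 marks a Global group and 0x8 a Universal one. The sketch below decodes a groupType value and builds an LDAP filter for Global groups using Active Directory's bitwise-AND matching rule. The function names are made up for illustration, and treating any group with a mail attribute as mail-enabled is an assumption.

```shell
# Sketch only; group_scope and global_mail_group_filter are hypothetical names.

# Decode the scope bits of an Active Directory groupType value.
# 0x2 = Global, 0x4 = Domain Local, 0x8 = Universal.
# Security groups also have the high bit set, so the value is often negative.
group_scope() {
    if   [ $(( $1 & 8 )) -ne 0 ]; then echo "universal"
    elif [ $(( $1 & 4 )) -ne 0 ]; then echo "domain-local"
    elif [ $(( $1 & 2 )) -ne 0 ]; then echo "global"
    else echo "unknown"
    fi
}

# LDAP filter for groups that have a mail attribute and the Global bit set.
# 1.2.840.113556.1.4.803 is AD's bitwise-AND matching rule.
global_mail_group_filter() {
    echo '(&(objectCategory=group)(mail=*)(groupType:1.2.840.113556.1.4.803:=2))'
}

group_scope -2147483646   # a Global security group
group_scope 8             # a Universal distribution group
global_mail_group_filter
```

    The filter string can be handed to whatever LDAP browser or search tool is handy to list the groups that still need converting.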



    Wednesday, June 18, 2008

    Firefox3 and IT laziness

    I'm just now loading FF3. Like IE8, they got a lot more paranoid about bad SSL certs. They've gone beyond just coloring the toolbar orange, and are now fully blocking bad SSL sites.

    This is a bad thing for us IT wonks. Every appliance and web-app comes with SSL functionality these days, and all too many of these use a self-signed cert. Unfortunately, FF3 blocks access to these sites by default and you have to add an exception (a multi-click, "are you SURE you want to do this?" procedure) for each one. All but about 5 of our HP servers have iLO cards whose certificates are self-signed, so that's A LOT of sites that'll need to get added.

    I'm sure there is a, "Don't be paranoid about SSL Certs," setting somewhere in about:config, but I haven't looked.

    Like IE8, this will push the IT administration industry to be less lazy about SSL compliance. We need that, but it'll be 5 years before we really get there, as that's how long it'll take to phase out old "self-signed is good enough for internal use" software and widgets.



    Monday, June 16, 2008

    A good article on trustees

    Over on the Novell Cool Solutions site, Marcel Cox just posted an article about how Trustees are handled on the Novell Filesystems (TFS and NSS). If you wanted to know the fundamentals of how ACLs are done on NSS volumes and how it relates to eDirectory, this is a good start.



    Wednesday, June 11, 2008

    Shrinking data-centers

    This is the 901st post of this blog. Huh.

    ComputerWorld had a major article recently, "Your Next Data Center", subtitled, "Companies are outgrowing their data centers faster than they ever predicted. It's time to rethink and rebuild."

    That is not the case with us. Ours is shrinking, and I'm not alone in this. I know other people who are experiencing the same thing.

    The data-center we have right now was built sometime between 1999 and 2000. I'm not sure exactly when it was, as I wasn't here. I like to think they planned 20 years of growth into it, as that's how long the previous data-center lasted.

    When I first started here in late 2003, the workhorse servers supporting the largest percentage of our Intel-based servers were HP ML530 G1's (here are the G2's, the same size as the G1's), with some older HP LH3 servers still in service. The freshly installed 6-node Novell NetWare cluster had 3 ML530's, and 3 rack-dense BL380's. If I'm remembering right, at that time we had two other rack-dense servers. The rest were these 7U monsters, and we could cram 4 to a rack.

    With the 7U ML530's as the primary machine, it would seem that the planners of our data-center did not take 'rack dense' into consideration. This was certainly the case with the rack they decided to install, as they planned a very old-school bottom-to-top venting scheme; something I've spent considerable time and innovation trying to revise. They also heard about the stats like "20% growth in number of servers year-over-year," and planned enough floor space to handle it.

    Right this moment we're poised to occupy a lot LESS rack-space than we once did. For this, I thank two major trends, and a third chronic one:
    1. Replacing the 7U monsters with 1U servers
    2. Virtualization
    3. No budget for massive server expansions
    We're still consuming the same amount of power as we were 2 years ago, but the number of rack units drawing power has dropped. We still have most of those ML530's, but they've all been relegated to 2nd or 3rd line duties like test/deployment servers or single function utility servers. They're all coming off of maintenance soon (they're like 5-7 years old now) so I'm not 100% sure what we're replacing them with. Probably more VM servers if we can kick the money tree hard enough.

    One thing we have been having growing pains over is power draw. The reason we're drawing the same as we were 2 years ago is largely due to us coming close to the rated max for our UPS, and replacing the UPS is a major, major capital-request process nightmare. It would seem that upgrading our UPS triggers certain provisions in the local building code that will require us to bring the data-center up to latest code. The upgrades required to do that are prohibitive, and most likely would require us to relocate all of our gear during the construction process. Since I haven't heard any rumors of us starting the capital-request process, I'm guessing we're not due for another UPS any time soon. This... concerns me.

    One side-effect of being power-limited is that our cooling capacity isn't anywhere NEAR stressed yet.

    But when it comes to square footage, we have lots of empty space. We are not shoe-horning servers into every available rack-unit. We haven't resorted to housing servers in with the sys-admin staff.



    Friday, May 23, 2008

    Problem with SLES10-SP2

    Just this morning Novell posted a new TID:

    Updates catalogs missing after updating libzypp

    I've heard on the grape-vine that this particular libzypp update was put into the SLES10-SP1 channel in order to prepare for SP2's release. Those fine folk out there that have turned on Auto Updating on their SLE[S|D] boxes have very probably already been bit by it. I hope Novell gets this one fixed, and posts recovery steps, soon.



    Thursday, May 22, 2008

    A question of scale

    This morning I ran across this article:

    Honda's 68MPG FCX Fuel-Cell Sedan to See Limited Service in '08

    This is interesting in and of itself. But in the main body of the article is this very interesting sentence:
    Honda's FCX prototype uses a 95kW (127HP) electric motor which is powered by a 100kW Proton Exchange Membrane Fuel Cell (PEFC), 171 liter hydrogen fuel tank and a bank of lithium-ion batteries.
    The UPS attached to our datacenter is 50kW. This one car will have to push out enough electricity to run TWO of our datacenters in order to have enough oomph to satisfy the normal American consumer. Interesting!



    Monday, May 12, 2008

    DataProtector 6 has a problem, continued

    I posted last week about DataProtector and its Enhanced Incremental Backup. Remember that "enhincrdb" directory I spoke of? Take a look at this:

    File sizes in the enhincr directory

    See? This is an in-progress count of one of these directories. 1.1 million files, 152MB of space consumed. That comes to an average file-size of 133 bytes. This is significantly under the 4KB block-size for this particular NTFS volume. On another server with a longer-serving enhincrdb hive, the average file-size is 831 bytes. So it probably increases as the server gets older.

    On the up side, these millions of weensy files won't actually consume more space for quite some time as they expand into the blocks the files are already assigned to. This means that fragmentation on this volume isn't going to be a problem for a while.

    On the down side, it's going to park (in this case) 152MB of data on 4.56GB of disk space. It'll get better over time, but in the next 12 months or so it's still going to be horrendous.
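
    The arithmetic behind that figure, as a rough sketch. It assumes, as this post does, that every file occupies at least one full block on disk. The allocated_bytes helper is made up for illustration, and it uses the rounded 1.1 million file count, so it lands near rather than exactly on the 4.56GB above.

```shell
# Sketch only: disk space consumed when every file takes at least one block.
# allocated_bytes is a hypothetical helper name.
#   $1 = number of files, $2 = average file size in bytes, $3 = block size
allocated_bytes() (
    # Round the average file size up to a whole number of blocks.
    blocks_per_file=$(( ($2 + $3 - 1) / $3 ))
    echo $(( $1 * blocks_per_file * $3 ))
)

allocated_bytes 1100000 133 4096   # 4KB blocks
allocated_bytes 1100000 133 1024   # 1KB blocks
```

    With 4KB blocks that works out to about 4.5 billion bytes on disk; with 1KB blocks, about 1.1 billion.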

    This tells me two things:
    • When deciding where to host the enhincrdb hive on a Windows server, format that particular volume with a 1k block size.
    • If HP supported NetWare as an Enhanced Incremental Backup client, the 4KB block size of NSS would cause this hive to grow beyond all reasonable proportions.
    Some file-systems have real problems dealing with huge numbers of files in a single directory. Ext3 is one of these, which is why the b-tree hashed indexes were introduced. Reiser does better in this case out of the box. NSS is pretty good about this, as all GroupWise installs before GW became available for non-NetWare platforms created this situation by the sheer design of GW. Unlike NSS, ext3 and reiser can be formatted with different block-sizes, which makes it easier to correctly engineer a file-system to host the enhincrdb data.
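
    On the Linux side, that could look something like the following. This is a dry-run sketch: it only prints the command, since formatting is destructive, and /dev/sdX1 is a placeholder rather than a real device. The helper name is made up for illustration.

```shell
# Sketch only: print (not run) the command to make an ext3 file-system
# with 1KB blocks. mkfs_small_block_cmd is a hypothetical helper name and
# the device path is a placeholder. The -b option sets the block size.
mkfs_small_block_cmd() {
    echo "mkfs.ext3 -b 1024 $1"
}

mkfs_small_block_cmd /dev/sdX1
```

    A volume holding millions of tiny files may also need more inodes than the default provides, which is worth checking before committing to a layout.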

    Since it is highly likely that I'll be using DataProtector for OES2 systems, this is something I need to keep in mind.



    Wednesday, May 07, 2008

    DataProtector 6 has a problem

    We're moving our BackupExec environment to HP DataProtector. Don't ask why, it made sense at the time.

    One of the niiiice things about DP is what's called "Enhanced Incremental Backup". This is a de-duplication strategy that only backs up files that have changed, and only stores the changed blocks. From these incremental backups you can construct synthetic full backups, which are just pointer databases to the blocks for that specified point-in-time. In theory, you only need to do one full backup, keep that backup forever, do enhanced incrementals, then periodically construct synthetic full backups.

    We've been using it for our BlackBoard content store. That's around... 250GB of file store. Rather than keep 5 full 275GB backup files for the duration of the backup rotation, I keep 2 and construct synthetic fulls for the other 3. In theory I could just go with 1, but I'm paranoid :). This greatly reduces the amount of disk-space the backups consume.

    Unfortunately, there is a problem with how DP does this. The problem rests on the client side of it. In the "$InstallDir$\OmniBack\enhincrdb" directory it constructs a file hive. An extensive file hive. In this hive it keeps track of file state data for all the files backed up on that server. This hive is constructed as follows:
    • The first level is the mount point. Example: enhincrdb\F\
    • The 2nd level are directories named 00-FF which contain the file state data itself
    On our BlackBoard content store, it had 2.7 million files in that hive, and consumed around 10.5GB of space. We noticed this behavior when C: ran out of space. Until this happened, we'd never had a problem installing backup agents to C: before. Nor did we find any warnings in the documentation that this directory could get so big.
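
    To see how big a hive has become, something like the following would count the state files under each mount-point directory. It is a sketch that assumes a POSIX shell and find are available on the client, which is not a given on Windows. hive_counts is a made-up name, and the hive path is passed in as an argument rather than guessed.

```shell
# Sketch only: count state files per mount point in an enhincrdb hive.
# hive_counts is a hypothetical helper name.
#   $1 = path to the enhincrdb directory
# Prints one "<mount point> <file count>" line per first-level directory.
hive_counts() (
    hive_root=$1
    for mount_dir in "$hive_root"/*/; do
        # Skip the unexpanded pattern when the hive has no subdirectories.
        [ -d "$mount_dir" ] || continue
        # Count regular files in the 00-FF directories underneath.
        count=$(find "$mount_dir" -type f | wc -l | tr -d ' ')
        printf '%s %s\n' "$(basename "$mount_dir")" "$count"
    done
)

# Usage: hive_counts /path/to/enhincrdb
```

    Running it periodically and comparing against the object count of the last real full backup would show whether stale state data is piling up.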

    The last real full backup I took of the content store backed up just under 1.7 million objects (objects = directory entries in NetWare, or inodes in unix-land). Yet the enhincrdb hive had 2.7 million objects. Why the difference? I'm not sure, but I suspect it was keeping state data for 1 million objects that no longer were present in the backup. I have trouble believing that we managed to churn over 60% of the objects in the store in the time I have backups, so I further suspect that it isn't cleaning out state data from files that no longer have a presence in the backup system.

    DataProtector doesn't support Enhanced Incrementals for NetWare servers, only Windows and possibly Linux. Due to how this is designed, were it to support NetWare it would create absolutely massive directory structures on my SYS: volumes. The FACSHARE volume has about 1.3TB of data in it, in about 3.3 million directory entries. The average FacStaff User volume (we have 3) has about 1.3 million, and the average Student User volume has about 2.4 million. Due to how our data works, our Student user volumes have a high churn rate due to students coming and going. If FACSHARE were to share a cluster node with one Student user volume and one FacStaff user volume, they have a combined directory-entry count of 7.0 million directory entries. This would generate, at first, a \enhincrdb directory with 7.0 million files. Given our regular churn rate, within a year it could easily be over 9.0 million.

    When you move a volume to another cluster node, it will create a hive for that volume in the \enhincrdb directory tree. We're seeing this on the BlackBoard Content cluster. So given some volumes moving around, it is quite conceivable that each cluster node will have each cluster volume represented in its own \enhincrdb directory. Which will mean over 15 million directory-entries parked there on each SYS volume, steadily increasing as time goes on and taking who knows how much space.

    And as anyone who has EVER had to do a consistency check of a volume that size knows (be it vrepair, chkdsk, fsck, or nss /poolrebuild), it takes a whopper of a long time when you get a lot of objects on a file-system. The old Traditional File System on NetWare could only support 16 million directory entries, and DP would push me right up to that limit. Thank heavens NSS can support w-a-y more than that. You better hope that the file-system that the \enhincrdb hive is on never has any problems.

    But, Enhanced Incrementals only apply to Windows so I don't have to worry about that. However.... if they really do support Linux (and I think they do), then when I migrate the cluster to OES2 next year this could become a very real problem for me.

    DataProtector's "Enhanced Incremental Backup" feature is not designed for the size of file-store we deal with. For backing up the C: drive of application servers or the inetpub directory of IIS servers, it would be just fine. But for file-servers? Good gravy, no! Unfortunately, those are the servers in most need of de-dup technology.



    Tuesday, May 06, 2008

    Being annoyed by rug?

    Rug/zmd in SLES10-SP1 is still a headache maker. Novell knows this, but I strongly suspect that we'll have to wait until SLES11 before we get anything improved. OpenSUSE now has zypper, which works pretty well, and I think you can use it in SLES if you want, but I haven't tried.

    One of the chief annoyances of rug is that the zmd.db file kept in /var/lib/zmd/zmd.db gets corrupted far too easily. And when that happens, rug can take HOURS to return anything. If it returns anything at all.

    The fix for it is easy: stop zmd, delete the zmd.db file, restart zmd. Since I'm doing this fairly often, I've whipped up a bash script to do it for me.

    nukezmd
    #!/bin/bash
    #
    # For killing ZMD when it is clearly hung. An all too often occurrence.
    #

    # First get the PID of ZMD

    printf "Getting PID... "
    PIDZMD=$(rczmd showpid)
    printf "%s\n" "$PIDZMD"

    # Then unconditionally kill it, if a PID was actually reported

    if [ -n "$PIDZMD" ]; then
        printf "Killing zmd hard... \n"
        kill -9 "$PIDZMD"
    fi

    # Remove the old, inconsistent database

    printf "Nuking old database... \n"
    rm -f /var/lib/zmd/zmd.db

    # Restart ZMD, which will build a new, consistent database

    printf "Restarting ZMD\n"
    rczmd start
    Simple, to the point. Works.



    Monday, May 05, 2008

    Back-scatter spam

    There was a recent Slashdot post on this. We've had a fair amount of this sort of spam. And the victims are at pretty high levels of our organization, too. Last week the person who is responsible for us even having a BlackBerry Enterprise Server asked us to figure out a way to prevent these emails from being forwarded to their BlackBerry. When a spam campaign is rolling, that person can get a bounce-message every 5-15 minutes for up to 8 hours, into the wee hours of the night. And that's just the mails that get PAST our anti-spam appliance. We set up some forwarding filters, but we haven't heard back about how effective they are.

    This is a hard thing to guard against. You can't use the reputation of the sender IP address, since they're all legitimate mailers being abused by the spam campaign and are returning delivery status notifications per spec. So the spam filtering has to be by content, which is a bit less effective. In one case, of the 950-odd DSN's we received for a specific person during a specific spam campaign, only 15 made it to the inbox. But that 15 was enough above what they normally saw (about 3 a day) that they complained.

    Backscatter is a problem. However, our affected users have so far been sophisticated enough users of email to realize that this was more likely forgery than something wrong with their computer. So, we haven't been asked to "track down those responsible." This is a relief for us, as we've been asked that in the past when forged spams have come to the attention of higher level executives.

    If it becomes a more wide-spread problem, we will be told to Do Something by the powers that be. Unfortunately, there isn't a lot that can be done. Blocking these sorts of DSNs is doable, but that's an expensive thing to manage in terms of people time. In 6-12 months we can expect the big anti-spam vendors to include options to just block DSN's uniformly, but until that time comes (and we have the budget for the added expenses) we'd have to do it through dumb keyword filters. Not a good solution. And it would also cause legitimate bounce messages to fail to arrive.



    Wednesday, April 30, 2008

    Legal processes

    Yesterday we received a Litigation Hold request. For those of you who don't know, this is the order given as part of a lawsuit ordering us to take steps to preserve data that could be used as part of the Discovery process of the suit. This is something that is becoming more and more common these days.

    Our department has been pretty lucky so far. Since I started here in late 2003 this is the first Litigation Hold request we've had to deal with. We've had a few "public records requests" come through which are handled similarly, but this is the first one involving data that may be introduced under sworn testimony.

    This morning we had an article pointed out to us by the Office of Financial Management at the state. WWU is a State agency, so OFM is in our chain of bureaucracy.

    Case Law/Rule Changes Thrust Electronic Document Discovery into the Spotlight

    It's an older PDF, but it does give a high level view of the sorts of things we should be doing when these requests come in. Two of the things that we don't have any processes for are the sequestration of held data and chain of custody preservation. We are now building those.

    Guideline #4 has the phrase, "Consultants are particularly useful in this role," referring to overseeing the holding process and standing up before a court to testify that the data was handled correctly. This is very true! Trained professionals are the kind of people to know the little nuances that hostile lawyers can use to invalidate gathered evidence. Someone who has done a lot of reading and been to a few SANS classes is not that person.

    Just because it is possible to represent yourself in court as your own lawyer doesn't make it a good idea. In fact, it generally is a very bad idea. The same thing applies to the above phrase. You want someone who knows what the heck they're doing when they climb up there onto the witness stand.

    This is going to be an interesting learning experience.



    Thursday, April 17, 2008

    And a gripe

    2.5 hours is too freakin' long for "rug lu" to tell me which patches need application to this particular OES2 server. This needs fixing. I hope it's fixed in SLES10 SP2.



    Tuesday, April 15, 2008

    Beta attitudes

    One thing I've noticed while working on this beta is a change in attitude. Specifically, attitude regarding problems. I've run into problems so far that would have had me throwing things across the room by now. Yet, instead I get that 'ahah!' feeling and proceed to figure out how it went poink exactly like that. And then report it. That feels good.

    All of my prior bug-hunting has been post-release, when we ran into issues in production. Now, it's in pre-release and the bugs and issues I find now will be fixed by release (or at least documented so people know to expect it to break that way).

    It's an interesting change in attitude.



    Friday, April 11, 2008

    On email, what comes in it

    A friend recently posted the following:
    80-90% of ALL email is directory harvesting attacks. 60-70% of the rest is spam or phishing. 1-5% of email is legit. Really makes you think about the invisible hand of email security, doesn't it?
    Those of us on the front lines of email security (which isn't quite me, I'm more of a field commander than a front line researcher) suspected as much. And yes, most people, nay, the vast majority, don't realize exactly what the signal-to-noise ratio is for email. Or even suspect the magnitude. I suspect that the statistic of, "80% of email is crap," is well known, but I don't think people even realize that the number is closer to, "95% of email is crap."

    Looking at statistics on the mail filter in front of Exchange, it looks like 5.9% of incoming messages for the last 7 days are clean. That is a LOT of messages getting dropped on the floor. This comes to just shy of 40,000 legitimate mail messages a day. For comparison, mail coming in from Titian (the student email system, and unpublished backup MTA) has a 'clean' rate of 42.5%, or 2800ish legit messages a day.
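
    Working backwards from those numbers gives a feel for the total volume. The sketch below is plain arithmetic on the rounded figures above, so the results are estimates; total_from_clean is a name made up for the example.

```shell
# Sketch only: estimate total incoming messages per day from the number of
# clean messages and the clean rate. total_from_clean is a hypothetical name.
#   $1 = clean messages per day
#   $2 = clean rate in tenths of a percent (5.9% -> 59), to stay in integers
total_from_clean() {
    echo $(( $1 * 1000 / $2 ))
}

total_from_clean 40000 59    # Exchange filter: roughly 678,000 a day
total_from_clean 2800 425    # Titian: roughly 6,600 a day
```

    So the filter in front of Exchange is discarding somewhere around 640,000 messages a day.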

    People expect their email to be legitimate. Directory-harvesting attacks do constitute the majority of discrete emails; these are the messages you receive that have weird subjects, come from people you don't know, but don't have anything in the body. They're looking to see which addresses result in 'no person by that name here' messages and which seemingly deliver. This is also why people unfortunate enough to have usernames or emails like "fred@" or "cindy@" have the worst spam problems in any organization.

    As I've mentioned many times, we're actively considering migrating student email to one of the free email services offered by Google or Microsoft. This is because historically student email has had a budget of "free", and our current strategy is not working. It is not working because the email filters aren't robust enough to meet expectations. Couple that with the expectation of effectively unlimited mail quota (thank you Google) and student email is no longer a "free" service. We can either spend $30,000 or more on an effective commercial anti-spam product, or we can give our email to the free services in exchange for valuable demographic data.

    It's very hard to argue with economics like that.

    One thing that you haven't seen yet in this article is viruses. In the last 7 days, our border email filter saw that 0.108% of incoming messages contain viruses. This is a weensy bit misleading, since the filter will drop connections with bad reputations before even accepting mail and that may very well cut down the number of reported viruses. But the fact remains that viruses in email are not the threat they once were. All the action these days is on subverted and outright evil web-sites, and social engineering (a form of virus of the mind).

    This is another example of how expectation and reality differ. After years of being told, and in many cases living through the after-effects of it, people know that viruses come in email. The fact that the threat is so much more based on social engineering hasn't penetrated as far, so products aimed at the consumer call themselves anti-virus when in fact most of the engineering in them was pointed at spam filtering.

    Anti-virus for email is ubiquitous enough these days that it is clear that the malware authors out there don't bother with email vectors for self-propagating software any more. That's not where the money is. The threat has moved on from cleverly disguised .exe files to cunningly wrought (in their minds) emails enticing the gullible to hit a web site that will infest them through the browser. These are the emails that border filters try to keep out, and it is a fundamentally harder problem than .exe files were.

    The big commercial vendors get the success rate they do for email cleaning in part because they deploy large networks of sensors all across the internet. Each device or software-install a customer turns on can potentially be a sensor. The sensors report back to the mother database, and proprietary and patented methods are used to distill out anti-spam recipes/definitions/modules for publishing to subscribed devices and software. There is nothing saying that an open-source product can't do this, but the mother-database is a big cost that someone has to pay for and is a very key part of this spam fighting strategy. Bayesian filtering only goes so far.

    And yet, people expect email to just be clean. Especially at work. That is a heavy expectation to meet.



    Wednesday, April 02, 2008

    From Slashdot: Should users manage their own PC's?

    Should IT Shops Let Users Manage Their Own PCs?

    It's a very Web 2.0 concept. And there is some merit to it. Back in the day when workstation lock-downs were getting common in workplace settings (ZENworks was good for that), there was a debate about some of this. At my old job one thing we wanted to lock down was the wall-paper. That one thing would help reinforce the idea that this was a WORK PC, not a home PC. The counter argument to this is that such user environment things are mostly harmless, so permitting them allows the lock-down to be less intrusive on the user.

    This is another step in that direction. Workplaces have PC configuration standards for a variety of good reasons. You want all machines plugged into your network to not be festering hives of scum and malware, and these sorts of standards can prevent that. On the other end of the scale, high end users know the tools of their field better than your general IT desktop support person does and in theory can do more with the tools they know versus the tools forced upon them.

    On the control end of the spectrum, you keep IT costs down by standardizing the configs in your enterprise. This keeps the Total Cost of Ownership down, a big thing for companies with the right internal costing controls (*nudge nudge*). One tech can support many more end users that way, since the range of things they support is kept to a minimum.

    On the freedom end of the spectrum, the end user gets exactly the tools they want to do their job. They're happier that way. And since they support themselves, IT costs are controlled. One tech can support many more end users that way, since the bits they're supporting are significantly reduced.

    The 'freedom' end of things runs smack into some standard industry practices, such as volume licensing and big-buy discounts. Dell, for instance, sells PCs cheaper if you buy them by the gross rather than in singles as users are onboarded. Specialized packages like AutoCAD also come cheaper if you buy them in packs of 10 rather than one at a time. Licenses all too often these days are timed and enforced, so you could have end users forgetting to renew the license on their Scrivener install and being non-productive for a few days while purchasing gets them a renewed license. The big 'endpoint management suites', what they seem to be calling the AntiVirus/Firewall package these days, all assume enterprise central control.

    On the other hand, users like being treated like reasoning, intelligent people who are capable of making choices about their work environment. This makes for happier workers.

    Also working in this favor is the trend to webify everything in the workplace. The days when you have a whonking big file-server to store all the company data on are slowly going away, and being replaced with things like SharePoint (which can get just as big, don't get me wrong). The fights we've had in the past about how to roll out a new Novell Client to all our desktops would be moot in such an environment as the 'client' is called 'Firefox' (or Gnome, or Office 2007).

    On the downside of the 'freedom' end of things is piracy. Tools like Zen Asset Management are there to make sure that the software in use is actually legal. In this freedom environment there is the significantly increased probability of someone bringing their 'backup' copy of something from home to install on their work machine and creating legal liability for the company if they get audited.

    Another downside is interoperability problems. The Microsoft Office users create document-macros that the WordPerfect Office users can't run, and the OpenOffice users can't read the WordPerfect files. The Microsoft Office users publish things to SharePoint, where the OpenOffice users drop their stuff onto a handy WebDAV server somewhere. Office peer-pressure will still work on software selection to a point, even if you absolutely love Package Q for your day-to-day work you won't use it if the software everyone else in the office uses can't do a thing with it.

    The trade-off here is balancing the chaos and increased direct costs 'freedom' will introduce to the IT environment versus the productivity bonuses and intangible benefits (morale). That will decidedly depend on the culture of the office, and what it is that they do. I know some people who would leave their current jobs just to get the freedom to order the machine they want and use the software they want to use, even if it means somewhat fewer benefits.

    A friend of mine recently changed jobs. The old job was at Microsoft. Since Microsoft is a software development firm of some significant size, they try to dog-food their own stuff wherever possible, even if the tool is a poor fit for the task at hand. She spent a lot of time clubbing her software to do what it didn't really want to do, all the while knowing that there were two non-Microsoft packages that did exactly what she wanted. The new job is not with Microsoft, and the first day there they gave her an order sheet to order the software she wanted; they wanted results and trusted her to turn them in in an understandable format. Thus, the joys of freedom.

    So, to answer the question, it depends. It depends on corporate culture to a significant degree, as well as the sector the company is in, as well as the work being done. In highly creative areas such as design, the benefits can be great. In highly regimented areas such as accounting, perhaps not so much or at least a high degree of freedom won't be worth the ultimate costs.



    Tuesday, March 25, 2008

    IPv6 vs IPX

    In a session last week came the following comment from a presenter (paraphrased):
    How many of you in the room have been at this long enough to do IPX? Ok, great. Now how many of you have done anything with IPv6? Doesn't that look JUST like IPX?
    And he's right, to a point. IPX addresses are of the form network-number:node-number, such as:

    00008021:0002a540d0e1

    Where 'node number' is the MAC address of the network card in question. It's up to the routers to figure out where network-numbers live, and advertised services issue full-network broadcasts to advertise said service, which is the primary reason that IPX just doesn't scale if WAN links are in the mix. But that's by the by.
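
    To make the network-number:node-number split concrete, here is a minimal sketch that pulls the example address above apart. The names `IPXAddress`, `parse_ipx`, and `format_mac` are made up for illustration; they are not part of any Novell tool.

```python
from typing import NamedTuple


class IPXAddress(NamedTuple):
    """Hypothetical container for the two halves of an IPX address."""
    network: int  # 32-bit network number, assigned by the admin
    node: bytes   # 48-bit node number, i.e. the NIC's MAC address


def parse_ipx(text: str) -> IPXAddress:
    """Split 'network:node' written as 8 and 12 hex digits."""
    network_hex, node_hex = text.split(":")
    if len(network_hex) != 8 or len(node_hex) != 12:
        raise ValueError(f"not an IPX address: {text!r}")
    return IPXAddress(int(network_hex, 16), bytes.fromhex(node_hex))


def format_mac(node: bytes) -> str:
    """Render the node number the way a MAC address is usually written."""
    return ":".join(f"{octet:02x}" for octet in node)


addr = parse_ipx("00008021:0002a540d0e1")
print(hex(addr.network), format_mac(addr.node))
```

    The node half comes straight off the network card, so the only number an admin ever has to assign is the network number.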

    IPv6 addresses work similarly:

    2001:0db8:85a3:08d3:1319:8a2e:0370:7334

    The last 64 bits are the interface identifier, which by default is built from the 48-bit MAC address, and the bits ahead of it constitute the network number. Except... the IPv6 designers knew about the failings of IPX and worked around them. Those last bits don't have to be derived from the MAC address, though as I understand it that address has to exist for each physical interface. Unlike IPX, IPv6 has the ability to have 'secondary' addresses. The lack of this ability was the main reason that Novell Cluster Services only worked on IP networks, which caused its own wave of grief when clustering was introduced in the NetWare 5.1 era. Secondary IPv6 numbers don't have to follow the MAC format, which in my opinion is a good thing!
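
    For the auto-configured case, the usual derivation (modified EUI-64) stretches the 48-bit MAC into a 64-bit interface identifier: flip the universal/local bit in the first octet and wedge ff:fe into the middle. A rough sketch using Python's standard `ipaddress` module follows. The function names are my own, and the MAC is just the node number from the IPX example, not a real host.

```python
import ipaddress


def eui64_interface_id(mac: str) -> int:
    """Build the 64-bit modified EUI-64 interface identifier from a MAC."""
    octets = bytes.fromhex(mac.replace(":", "").replace("-", ""))
    if len(octets) != 6:
        raise ValueError("expected a 48-bit MAC address")
    # Flip the universal/local bit (0x02) of the first octet...
    head = bytes([octets[0] ^ 0x02]) + octets[1:3]
    # ...and insert ff:fe between the two halves of the MAC.
    return int.from_bytes(head + b"\xff\xfe" + octets[3:], "big")


def slaac_address(prefix: str, mac: str) -> ipaddress.IPv6Address:
    """Combine a /64 network prefix with the MAC-derived identifier."""
    network = ipaddress.IPv6Network(prefix)
    if network.prefixlen != 64:
        raise ValueError("this derivation assumes a /64 prefix")
    return ipaddress.IPv6Address(
        int(network.network_address) | eui64_interface_id(mac)
    )


addr = slaac_address("2001:db8:85a3:8d3::/64", "00:02:a5:40:d0:e1")
print(addr)
```

    Same idea as IPX, in other words: the network half is routed, the host half falls out of the hardware, and nobody has to hand out host numbers.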

    Yes, when I first read about IPv6 addressing I had that same, "wow, this is just like IPX," moment the BrainShare presenter had. Only, more scalable, and more flexible.



    Tuesday, March 18, 2008

    BrainShare Tuesday

    Today started off with a bit of panic, as I hadn't set my alarm. Me being a west-coaster, 7:20 (when I woke up) is an entirely reasonable time to get up as far as my body is concerned. Only, I needed to get dressed and breakfasted before my first session at 8:30. Aie! I had to eat quick, but I got there. Didn't get a chance to check work email, though.

    ATT326: Advanced Linux Troubleshooting
    An ATT, therefore hard to summarize. But I learned about a few new commands I didn't know about before. Like strace. And vimdiff.

    TUT130: Challenges in Storage I/O in Virtualization
    Another nice one, but an emergency at work (printing down in a dorm, during finals week) distracted me heavily during the first half of it. Which resulted in the following note in my notes:
    NPIV looks really nifty. Look into it.
    NPIV being how you can use fibre-channel zoning to zone off VMs, rather than HBAs. Highly useful. I also learned about a neat new thing called Virtual Fabrics. Virtual Fabrics work kind of like VLANs for fabrics. You can segregate your fabrics into fabrics that share hardware but nothing else. Handy if your, say, Solaris admins don't want you mucking about with their zoning, while saving money through consolidated hardware.

    TUT216: OES2 SP1 Architectural Overview
    There is a LOT of new stuff in SP1.
    • It will include eDir 8.8.4 (8.8.3 will ship this summer sometime)
    • NCP and eDir will be fully 64-bit
    • OES2 SP1 will be based on SLES 10 SP2, which will be releasing about the same time
    • AFP Support
      • AFP 3.1
      • Uses Diffie-Hellman 1 for password exchange, meaning the 8-character password problem is solved.
      • Fully SMP-safe
      • Has cross-protocol locking with NCP. CIFS doesn't have cross-protocol locking yet, but IIRC, Samba does
      • Does not need LUM enabled users
    • CIFS Support
      • NTLMv1, but v2 is a possibility if enough people ask, so file those enhancement requests!!
      • CIFS is separate from Samba, therefore cannot be used in conjunction with Domain Services for Windows
      • As with AFP, fully SMP safe
    • eDir 8.8.4
      • LDAP auditing enhanced
      • "newer auth protocols", but they didn't say what.
    I should also mention that they're still deploying Novell Integrated Samba, which is what you'll have to use to get Domain Services For Windows. Samba still doesn't scale as far as I'd like ('only' 700-800 concurrent users), so that may be an issue for higher ed types who want high concurrency CIFS and also DSFW on the same box.

    TUT211: Enhanced Protocol Support in OES2 SP1
    This is the session where they went into detail about the AFP and CIFS support. They said that netatalk, the existing AFP stack on Linux, gets really slow once you go over 20 concurrent users. Whoa! I can soooo understand why Novell felt the need to make a new one.
    • The 8 character password limit has been fixed! They now support DH1 for passing passwords.
    • The 'afptcp' daemon can use one password protocol at a time, so you can only use DH1, or one of the other three I can't remember.
    • Support for OSX 10.1 and 10.2 is scanty, and 10.5 is limited but users may not notice anyway.
    • Passwords will be case sensitive.
    • Kerberos will be in a future release
    • Performance is faster than NetWare, partly due to the ability to multi-thread
    • Can register services by way of SLP
    • Only supports NSS for the time being, the other Linux file-systems will be a future feature.
    • Can support 500 concurrent users, and 1000+ in the future. This fits our current AFP loads.
    • We can configure more about how it works than we could on NetWare, such as how many worker threads to spawn.
    • Has meaningful debug logs!
    • Has a new command, 'afpstat' that works like 'netstat' for giving a snapshot of afp connections.
    And then some CIFS stuff. We can't use it for political reasons so I didn't pay attention. Sorry.

    Tonight was the night formerly known as 'Sponsor Night,' but has a new name now that everyone who gets a booth is no longer a 'sponsor'. Some are sponsors, some are exhibitors. I can't keep track. Anyway, today was their party. "World of Novellcraft!" Homage to vid-gaming.

    Lots of Wii, lots of Rock Band, some Halo, lots of women dressed in Renaissance Festival gear getting their pictures taken by the 90%+ male audience. I've blogged before about my ambivalence about Sponsor Night. I lasted until about 7, when I came back to the hotel.

    Tomorrow I have an actual LUNCH BREAK in my schedule! Ooo! And Collective Soul (not Soul Asylum, not Soul Coughing) plays the concert! I've been listening to two of their CD's for the past two months so I think I may even know a few songs by now.



    Monday, March 17, 2008

    Today at Brainshare

    Monday. Opening day. I had trouble getting to sleep last night due to a poor choice of bed-time reading (don't read action, don't read action, don't read action). And had to get up at 6am body time in order to get breakfast before the morning keynote. There be zombies.

    Breakfast was uninspired. As per usual, the hashbrowns had cooled to a gelid mass before I found everything and got a seat.

    The Monday keynotes are always the CxO talks about strategy and where we're going. Today a mess of press releases from Novell gives a good idea what the talks were about. Hovsepian was first, of course, and was actually funny. He gave some interesting tidbits of knowledge.
    • Novell's group of partners is growing, adding a couple hundred new ones since last year. This shows the Novell 'ecosystem' is strong.
    • 8700 new customers last year
    • Novell press mentions are now only 5% negative.
    Jeff Jaffe came on to give the big wow-wow speech about Novell's "Fossa" project, which I'm too lazy to link to right now. The big concern is agility. He also identified several "megatrends" in the industry:
    • High Capacity Computing
    • Policy Engines
    • Orchestration
    • Convergence
    • Mobility
    I'm not sure what 'Convergence' is, but the others I can take a stab at. Note the lack of 'virtualization' in this list. That's soooo 2007. The big problem is now managing the virtualization, thus Orchestration. And Policy Engines.

    Another thing he mentioned several times in association with Fossa and agility, is mergers and acquisitions. This is not something us Higher Ed types ever have to deal with, but it is an area in .COM land that requires a certain amount of IT agility to accommodate successfully. He mentioned this several times, which suggests that this strategy is aimed squarely at for-profit industry.

    Also, SAP has apparently selected SLES as their primary platform for the SMB products.

    Pat Hume from SAP also spoke. But as we're on Banner, and it'll take a sub-megaton nuclear strike to get us off of it, I didn't pay attention and used the time to send some emails.

    Oh, and Honeywell? They're here because they have hardware that works with IDM. That way the same ID you use for your desktop login can be tied to the RFID card in your pocket that gets you into the datacenter. Spiffy.

    ATT375 Advanced Tips & Tricks for Troubleshooting eDir 8.8
    A nice session. Hard to summarize. That said, they needed more time as the laptops with VMware weren't fast enough for us to get through many of the exercises. They also showed us some nifty iMonitor tricks. And where the high-yield shoot-your-foot-off weapons are kept.

    BUS202 Migrating a NetWare Cluster to OES2
    Not a good session. The presenter had a short slide deck, and didn't really present anything new to me other than areas where other people have made major mistakes. And to PLAN on having one of the Linux migrations go all lost-data on you. He recommended SAN snapshots. It shortly digressed into "Migrating a NetWare Cluster to Linux HA", which is a different session altogether. So I left.

    TUT215 Integrating Macintosh with Novell
    A very good session. The CIO of Novell Canada was presenting it, and he is a skilled speaker. Apparently Novell has written a new AFP stack from scratch for OES2 SP1, since NETATALK is comparatively dog slow. And, it seems, the AFP stack is currently outperforming the NCP stack on OES2 SP1. Whoa! Also, the Banzai GroupWise client for Mac is apparently gorgeous. He also spent quite a long time (18 minutes) on the Kanaka client from Condrey Consulting. The guy who wrote that client was in the back of the room and answered some questions.



    Tuesday, February 26, 2008

    The future of the IT career path

    There was an article in Computerworld a week or so ago that just caught my eye.

    IT career paths you never dreamed of

    The short of it is that IT as we've known it, a separate stack, is being integrated into the general business functions. Things like software-as-a-service, outsourcing, and freakishly fast WAN pipes mean there is less call for people like internal application developers, systems analysts, and system administrators. Those that remain have a decided focus on project management and on the business.

    I see some truth to this. I've known for years now that the kind of job I fit best in, only exists in organizations larger than a certain size. Organizations smaller than a certain size tend to be subject to, "the computer guy," being in charge of everything computery. WWU is large enough that I can specialize in one field, file-server maintenance and upkeep, without having to be 'the computer guy' to a bunch of people.

    This also means that my desktop support skills have atrophied from where they once were. Since everyone thinks that, "working in computers," means in reality, "desktop support," I have a hard time convincing family that I only know a little more than they do about why their Thunderbird broke in just that way. Doctors have this problem too, I hear.

    Anyway. The article mentions that newer job titles are including the word, "architect," in them. And I really agree with this, since any company needs people with an enterprise view of their IT infrastructure. I'm one of those people for Western, especially when it comes to the file servers. It is people like us who sheepdog consultants hired to implement new technologies.

    Which brings up another thing about the article. The article is rather .COM centered, which I understand. Us .EDU types really do live in a different world (where ELSE are you going to get 4000 people pounding the exact same file server at the exact same time?). The idea of hiring consultants (very expensive temp workers) to do the heavy lifting during upgrades is something we laugh ourselves silly over, since we barely have the money to BUY the new upgrade (even with our hefty .EDU discounts) much less pay someone else to put it in for us. Something simpler like outsourcing 90% of our on-site helpdesk work through a SE Asian call-center and remote-control apps is something we could possibly do, but the union those helpdesk techs belong to would pitch a fit. The same thing applies for a contract service to manage printers. Similar sorts of things apply to the non-profits of the world (the .ORG world), though perhaps not the union angle.

    But out there in the for-profit world, and the for-profits larger than SOHO or SMB, that's another story entirely. I don't know how much longer there is going to be a call for file-server jocks.



    Monday, February 04, 2008

    Today's 18 year olds...

    Over the time I've been here there has occasionally been a list posted in the break-room. This list is the, "Incoming freshmen today...." list of things they know, experience, or haven't experienced. It contains things like:
    • Were born in 1990
    • ...have never known life without cable or satellite TV.
    • ...probably have never seen a rotary dial phone.
    • ...have had internet access for most of their school life.
    And other such things. Ostensibly this is to help foster an understanding of where incoming freshmen are coming from, but generally it just causes faculty and staff to feel a bit old. In tech circles this sparks conversations about the first computers we used.

    Which got me thinking about a few things. One of the items that is frequently put forth about Kids These Days (tm) is that they don't KNOW anything, they just know how to FIND things. There is some debate about this, but it is a common sentiment. I believe that kids these days (KTD) have figured out keyword based searching, and the search engines have gotten good enough at mind-reading that arcane search incantations aren't needed nearly as often as they were in the past.

    Before Google, there was AltaVista. This was an era of the internet where boolean search incantations were needed to really narrow down to what you wanted. I didn't switch to Google for a long time because Google didn't have the NEAR search term, which I used on AltaVista as a way to narrow results to be more relevant. I didn't know at the time that Google effectively threw that term in on every search.

    Those of us who lived through that era of the internet built up searching skills. I remember some searches I did back then that were pretty complex. I can't remember the exact terms used, but they looked like this:

    bootes AND (antaries OR proxima) AND (fulcrum NEAR pinnacle)

    I had a logic class in college, so these sorts of parenthetical statements made sense to me. Still do, I just don't end up needing to uncork the boolean logic to find what I need anymore as the search engines have gotten good enough that I don't NEED to do it. I know Google allows much of the above, but I haven't had to do it so I don't know the syntax for it.
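
    For anyone who never had to write one of these, here is a rough sketch of what that incantation asks for, written as a few Python helpers run against plain text. This is a toy of my own and nothing like how AltaVista actually worked inside; in particular the 10-word window for NEAR is an assumption, as are all the helper names.

```python
import re


def tokenize(text):
    """Lower-case the text and split it into bare words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def term(word):
    """Match documents that contain the word anywhere."""
    return lambda words: word in words


def all_of(*tests):
    """AND: every sub-test must match."""
    return lambda words: all(test(words) for test in tests)


def any_of(*tests):
    """OR: at least one sub-test must match."""
    return lambda words: any(test(words) for test in tests)


def near(a, b, window=10):
    """NEAR: both words present, within `window` words of each other."""
    def check(words):
        pos_a = [i for i, w in enumerate(words) if w == a]
        pos_b = [i for i, w in enumerate(words) if w == b]
        return any(abs(i - j) <= window for i in pos_a for j in pos_b)
    return check


# bootes AND (antaries OR proxima) AND (fulcrum NEAR pinnacle)
query = all_of(
    term("bootes"),
    any_of(term("antaries"), term("proxima")),
    near("fulcrum", "pinnacle"),
)

close = "Bootes and Proxima were listed beside the fulcrum of the pinnacle chart"
far = "bootes proxima fulcrum " + "filler " * 20 + "pinnacle"
print(query(tokenize(close)), query(tokenize(far)))
```

    The second document has every keyword in it and still fails, because fulcrum and pinnacle are too far apart. That is the narrowing NEAR bought you.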

    So I posit that yes, KTD don't know anything, but neither are their search skills robust.

    Which brings me to Novell. I got to thinking what a NetWare administrator in 1990 had to know to do their job, and how I could fit into such a hypothetical time.

    Right now if I don't know the answer to a problem I have a few methods to figure it out.
    1. Hit the online Novell Knowledge Base over at novell.com/support
    2. Hit the peer-support forums over at forums.novell.com (or nntp://forums.novell.com/ if you prefer old-school)
    3. Pay for a support incident
    4. Ask around the office
    In 1990 the options were similar, but a key player was missing:
    1. Hit the peer-support forums over on CompuServe, which required a modem and a CompuServe account.
    2. See if the problem is mentioned in the book-shelf of manuals, which was a big investment to own.
    3. Pay for a support incident.
    4. Ask around the office.
    When I first started this Novell Administrator gig in 1997 most of the admins I knew had CompuServe accounts, even though the support forums had officially moved to NNTP. There was still plenty of traffic on the CS servers, though those died out fairly quickly. The office I started in had a subscription to a monthly publication from Novell of their support knowledge base, which I made extensive use of. Somewhere in there Novell made the archives web-searchable and I stopped using the CD's.

    As I see it, a NetWare admin of 1990 was on average more knowledgeable about their product than the NetWare admin of 2008. Such administrators avoided the cost of paying for support incidents by having the manuals in hard-copy form, and plonking down real money for CompuServe accounts. If I have a weird problem I'll hit up the Novell KB to see if there is a TID on it, then check the support forums to see if it is mentioned there, before I'll expend an incident on the thing. In time I've managed to teach myself how NetWare works in some very basic ways, simply by troubleshooting oddball problems. This is why I typically end up talking to backline support when I call in, unless the problem is a known issue in the private KB. My skills are probably on par with what was normal 'back in the day'.

    I think this holds true for a lot of the tech field. Back then there was a lot of stuff you just had to KNOW. Or failing that, have spent the money to get the backup resources in place (manuals, support contracts). These days a base understanding of how things work is the key to phrasing the right search queries in the online knowledge bases, and less rote memorization (training) can be effective in solving a greater list of problems.

    Prosthetic memory! Prosthetic training! The tools of geeks everywhere.



    Friday, January 25, 2008

    A needed patch.

    Novell has released a patch for the "ConsoleOne sorting problem."

    The sorting problem happens when you have eDir 8.8 installed. Suddenly C1 starts sorting things by creation date rather than as you've ever seen it before. This is... confusing. ConsoleOne 1.3h helped some of it for us, but not all. And now, we have a patch!

    Let ConsoleOne Sort Correctly!



    Saturday, January 19, 2008

    Good migration

    At home I just migrated the Linux server to new hardware. This has to be one of the easiest migrations I've ever done for that service. Now just the obsessive tweaking needs doing; all the major functions are moved.

    That server is running Slackware. I'm not using SuSE at home for a couple of reasons:
    1. I've been using Slack since college
    2. Diversity is good when figuring out how to run Linux
      1. Slackware doesn't have anything approaching YaST.
      2. Getting a new service online with Slackware takes about five times longer than it does with SuSE, but at the end of it you know how it bloody well works.
    3. It's easier to crib from existing config files that way.
    I've also done a major rework of the internal network, which required a small rewrite of the network start scripts to handle it correctly.

    I got my first wireless access point in November of 2000. Way back then, they hadn't quite figured out all the short-cuts to cracking WEP so it required a certain amount of traffic to analyze. This was a Linksys B AP, and a Linksys wireless card. Together they had el-crappo for range (by today's standards).

    With that in mind I segregated my network.

    Internet <- Cisco 675 DSL -> Wired network <- Linux server -> Wireless network

    Didn't have cable in our area yet back then. The Cisco handled everything I needed. Unfortunately, it was badly behaved. It had the nasty habit of ARPing through the whole DHCP range, one addr per second, continually.

    At that point in time I had one wireless device. The always-on Windows server was on the wired network, and the Linux server was configured to proxy things. So the only traffic on the wireless network was from my laptop; no ARP ARP ARP ARP ARP and no Windows browse packets. In other words, it was a network that was hard to crack. Oh yeah, baby.

    Fast forward a couple years. I move out here, we get cable instead of DSL.
    Another year or two, and the 802.11b AP died so we moved to a G AP.
    Another year, and I added a certain linux-based media server (wireless for long reasons) and my wife got a PowerBook.

    The 10Mb ethernet card in the back of that Linux machine (a Pentium 2 450MHz machine) was really... concerning me. Comcast is still under 10Mb, but... it's the principle of the thing. It was a bloody ISA card for pete's sake.

    So today I flattened the network. It's structured the same, but rather than have separate subnets I'm just using brctl to bridge the two; I like being able to easily sniff my wireless traffic. We no longer have an always-on Windows box. And WPA-PSK is a heckova lot harder to crack than WEP ever was. So, I figure it's safe. Plus, if the Linux machine ever dies I only have to move one cable to get things back online.

    Now the internet seems faster when browsing on the laptops. I guess that 10Mb card was actually slowing things down a bit.



    Wednesday, January 16, 2008

    NetWare library patches

    Novell recently split the libc and clib patches for NetWare. For a long time patches like "nwlib6a" included both. Now, they're split.

    This just caused me a problem. It turns out that if you have libcsp6b (the LibC patch) applied and not nwlib6k (the CLib patch), there is an abend possibility. It happened yesterday. It turns out that in that case, a badly formed network broadcast can cause an abend. This caused three of my six cluster nodes to fall on their butts at the same time. That was fun. Strange (but good) thing is, I had already applied both patches to these three servers but hadn't gotten around to rebooting them yet. So, by killing themselves they actually fixed the problem.

    The abend, key details:

    EIP in SERVER.NLM at code start +0015FD27h

    Heh heh heh. Oops.

    And now a bit of history. Long time NetWare admins can ignore this part.

    Q: Why are there two C libraries?

    CLIB is the library NetWare started with. It began life in the dark and misty past, probably in the late 1980's. It is the deepest, darkest bowels of NetWare from the era when Novell was it when it came to office networking. Being so old, its APIs are very mature. Applications developed against CLIB generally speaking just plain work.

    CLIB is also deprecated since it is highly proprietary, and doesn't play well with others. "Just plain works" in this instance means an assumption of 8.3 names, with kludging to support long file names if at all possible. CLIB applications have a tendency to have IPX dependencies for no good reason.

    LIBC was created, IIRC, around the release of NetWare 5.0 when it became possible for NetWare to operate in a "pure IP" environment. LIBC was designed with the concept of POSIX semantics in mind, which CLIB was not. LIBC was created from scratch with long file name support. By now, as of NetWare 6.5 SP7, most of the NetWare kernel is written against LIBC rather than CLIB.

    As an example of LIBC vs CLIB, take the 'MyWeb' service this blog is served by. When I did this the first time, it was on NetWare 6.0, using Apache 1.3. Apache 1.3 was linked against CLIB and was very stable. The service notes for the Apache Modules I needed to run to make it work made it clear that supporting long file-names on remote servers was something that only recently started working.

    When the migration to NetWare 6.5 came around, it meant I had to migrate MyWeb to Apache 2.0. Apache 2.0 is linked against LIBC and used a different Apache module to make things work. I had troubles. The LibC functions were not nearly as mature as their CLIB counterparts, and it showed. 3.5 years later things are now a lot more stable than back then.



    Monday, December 17, 2007

    Not dead.

    Wow, last post was the 30th? Jeez. I was on vacation all last week, which accounts for some of it. And it's looking like I'll be out sick for at least a pair of days with a crud I got while wandering about. Not sharing that with work, nosir.

    On my list of things to do during the winter inter-session is to get eDir 8.8 deployed in the production tree. I just need to have ALL the servers in the tree (all, not just replica holders due to backlink updates) up and talking when I do the first one, and that could take some scheduling. This is the first step to OES2, which will be deployed on the eDir servers first.

    As soon as I get some new hardware, since they're getting old.



    Friday, November 30, 2007

    OES2 SP1 timing

    Novell just posted the third draft of their OES2 Best Practices guide. Which you can locate here. In that guide is this text:
    Domain Services for Windows, which is scheduled to ship with OES 2 SP1 (currently scheduled for late 2008), will also offer some clear advantages.
    "Late 2008" means they WILL NOT have SP1 out by August of 2008. This means that the upgrade of our 6 node cluster to OES will have to wait until 2009. Grrarrr!

    Another 21 months of a 32-bit operating system on the single biggest storage consumer on campus. We'll have at least one hardware refresh before then for some of the nodes, and... boy I hope they have NetWare drivers for that. The very limited testing I did with NetWare-in-Xen was not encouraging from a performance stand-point. If it looks like I'll have to deploy that way for the next servers we get in the cluster, I'll have to do more real testing to characterize the performance hit (if any). The idea of a 64-bit memory space for file-caching makes me drool. Not getting it for 21 months is painful.

    That said, if Novell releases the eDirectory enabled AFP server for OES2-Linux outside of the service-pack I could still make the 2008 window. That's our only dependency for SP1.

    Update (09/08/08): Looks like 'late October' is the date for SP1's release. Should be in public beta before then.

    Update (12/03/08): It's out!



    Wednesday, November 28, 2007

    I/O starvation on NetWare, HP update

    Last week I talked about a problem we're having with the HP MSA1500cs and our NetWare cluster. The problem is still there, of course. I've opened cases with both HP and Novell to handle this one. HP because I really think that such command latencies are a defect, and Novell since they're having starvation issues with clusters.

    This morning I got a voice-mail from HP, an update for our case. Greatly summarized:
    The MSA team has determined that your device is working perfectly, and can find no defects. They've referred the case to the NetWare software team.
    Or...
    Working as designed. Fix your software. Talk to Novell.
    Which I'm doing. Now to see if I can light a fire on the back-channels, or if we've just made HP admit that these sorts of command latencies are part of the design and need to be engineered around in software. Highly frustrating.

    Especially since I don't think I've made back-line on the Novell case yet. They're involved, but I haven't been referred to a new support engineer yet.



    Monday, November 26, 2007

    Adding attachments to an open HP Support case

    I don't think this is documented anywhere. But I just learned how to add updates to the HP case-file. Including attachments.
    To: support_am@hp.com
    Subject:

    CASE_ID_NUM: [case number, such as 36005555555]
    MESSAGE: [text]
    Any attachments to it will be automatically imported into the case. LOOKING at the case itself is a lot more complicated, and I'm still not sure of the steps. But this should be of use to some of you.
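For anyone who wants to script this, here is a minimal sketch of building such a message with Python's standard library. The recipient address and the CASE_ID_NUM/MESSAGE body layout come from the format above; the function names, the sender address, and the attachment type are my own assumptions, and actually sending the message (via smtplib or your mail client) is left out. The subject is left unset because the format above shows it blank.

```python
from email.message import EmailMessage

HP_CASE_ADDRESS = "support_am@hp.com"  # address given in the post


def build_case_update(case_id: str, text: str, sender: str) -> EmailMessage:
    """Build a case-update email in the body layout described above.

    The function name and the sender argument are hypothetical helpers,
    not anything HP documents.
    """
    msg = EmailMessage()
    msg["From"] = sender
    msg["To"] = HP_CASE_ADDRESS
    # Body layout from the post: case number first, then the free-text update.
    msg.set_content(f"CASE_ID_NUM: {case_id}\nMESSAGE: {text}\n")
    return msg


def attach_file(msg: EmailMessage, filename: str, data: bytes) -> None:
    """Attach raw bytes as a generic binary attachment (assumed type)."""
    msg.add_attachment(
        data,
        maintype="application",
        subtype="octet-stream",
        filename=filename,
    )
```

Something like `build_case_update("36005555555", "Perf capture attached.", "me@example.edu")` followed by `attach_file(...)` gives a message object that can be handed to whatever SMTP relay you normally use.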



    Wednesday, November 21, 2007

    I/O starvation on NetWare

    The MSA1500cs we've had for a while has shown a bad habit. It is visible when you connect a serial cable to the management port on the MSA1000 controller and run "show perf" after starting performance tracking. The line in question is "Avg Command Latency:", which is a measure of how long it takes to execute an I/O operation. Under normal circumstances this metric stays between 5 and 30ms. When things go bad, I've seen it go as high as 270ms.

    This is a problem with our cluster nodes. Our cluster nodes can see LUNs on both the MSA1500cs and the EVA3000. The EVA is where the cluster has been housed since it started, and the MSA has taken up two low-I/O-volume volumes to make space on the EVA.

    IF the MSA is in the high Avg Command Latency state, and
    IF a cluster node is doing a large Write to the MSA (such as a DVD ISO image, or B2D operation),
    THEN "Concurrent Disk Requests" in Monitor go north of 1000

    This is a dangerous state. If this particular cluster node is housing some higher trafficked volumes, such as FacShare:, the laggy I/O is competing with regular (fast) I/O to the EVA. If this sort of mostly-Read I/O is concurrent with the above heavy Write situation it can cause the cluster node to not write to the Cluster Partition on time and trigger a poison-pill from the Split Brain Detector. In short, the storage heart-beat to the EVA (where the Cluster Partition lives) gets starved out in the face of all the writes to the laggy MSA.

    Users definitely noticed when the cluster node was in such a heavy usage state. Writes and Reads took a loooong time on the LUNs hosted on the fast EVA. Our help-desk recorded several "unable to map drive" calls when the nodes were in that state, simply because a drive-mapping involves I/O and the server was too busy to do it in the scant seconds it normally does.

    This is sub-optimal. This also doesn't seem to happen on Windows, but I'm not sure of that.

    This is something that a very new feature in the Linux kernel could help with, and that's the concept of 'priority I/O' in the storage stack. I/O with a high priority, such as cluster heart-beats, gets serviced faster than I/O of regular priority. That could prevent SBD abends. Unfortunately, as the NetWare kernel is no longer under development and just under Maintenance, this is not likely to be ported to NetWare.
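To make the idea concrete, here is a toy model of why priority matters when a heart-beat write lands behind a deep backlog. This is only a sketch of the queueing concept, not how the Linux I/O scheduler or the NetWare storage stack is actually implemented; the class, the request names, and the backlog of 1000 (echoing the "Concurrent Disk Requests" figure above) are all made up for illustration.

```python
import heapq
from itertools import count

HIGH, NORMAL = 0, 1  # lower number is serviced first


class IoQueue:
    """Toy request queue: plain FIFO, or priority-aware when enabled."""

    def __init__(self, use_priority: bool):
        self.use_priority = use_priority
        self._heap = []
        self._seq = count()  # tie-breaker keeps arrival order within a priority

    def submit(self, name: str, priority: int = NORMAL) -> None:
        # Without priority support every request is treated the same.
        effective = priority if self.use_priority else NORMAL
        heapq.heappush(self._heap, (effective, next(self._seq), name))

    def drain(self) -> list:
        """Return request names in the order they would be serviced."""
        order = []
        while self._heap:
            _, _, name = heapq.heappop(self._heap)
            order.append(name)
        return order


def heartbeat_position(use_priority: bool, backlog: int = 1000) -> int:
    """How many requests are serviced before the heart-beat write."""
    queue = IoQueue(use_priority)
    for i in range(backlog):
        queue.submit(f"msa-write-{i}")
    queue.submit("sbd-heartbeat", HIGH)
    return queue.drain().index("sbd-heartbeat")
```

In the FIFO case the heart-beat waits behind the entire backlog of slow writes; with priority enabled it goes to the front, which is the behavior that would keep the Split Brain Detector from timing out.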

    I/O starvation. This shouldn't happen, but neither should 270ms I/O command latencies.



    Monday, November 19, 2007

    I didn't realize it was this bad.

    A while back Novell held an online survey about YaST usage. They've just released results.

    Right at the top, in the demographics section are the results for the 'gender' question.

    Men = 97.7%, Women = 2.3%

    Ow. Women are 2.3%? Jeez.

    These sorts of surveys are FAR from scientific. But still, such a STRONG bias is rather disheartening. I know the BrainShare crowd is somewhere between 4:1 and 6:1 Men-to-Women (don't have exact numbers). That said, most of the women I meet there are there for either Identity Management or GroupWise. The audiences for sessions on high Linux geekery (like for Clusters or HA computing) are... very male.

    Just looking at that chart makes me wince. Yeesh.



    Thursday, November 15, 2007

    Encryption & key demands

    As some of you know, the UK has passed a law which authorizes jail time for people who refuse to turn over encryption keys. If I'm remembering right, 2-3 years. This is a bill that's been making the rounds for quite some time, and got passage as a terror bill. Nefarious elements have figured out that modern encryption technologies really can flummox even the US National Security Agency deep crack mainframes and they therefore use it. There was a reason that encryption technologies were classified a munition and therefore export-restricted.

    Those of you who've been with Novell/NetWare long enough remember this. Back in the day the NICI and other PKI components came in three flavors, Domestic (strong, 128bit), International (weak, 40bit? 56bit?), and basic (none). Things have loosened up since then.

    Part of the problem of encryption is that while the private keys may be strong, securing them is tricky. When the feds raid your house and grab every device capable of both digital computation and communication to throw into the evidence locker, their computer forensics people can get your private keys. However, if your private keys are further locked away, such as PGP, it won't do them much good. To gain access to your key-ring they'll need the pass-phrase.

    That's where the new law in the UK comes in. Police have two options to figure out your pass-phrase. They can intercept it somehow, or they can point a jail term at your head and demand the pass-phrase.

    That doesn't work in the US thanks to the Bill of Rights, and the 5th Amendment. This is the amendment that states that you have a right to not self incriminate, and by extension this means that police can't force you to divulge information that could be detrimental to you. As it happens, the people who wrote this amendment had the English legal system in mind when they came up with the idea, what with us being an ex-colony and all that. So if you performed safe encryption handling, didn't write the pass phrase anywhere and made a point of making sure it never hit disk in the clear, the US Government can't penalize you for not telling them the pass phrase. A US law similar to the UK law would face a much harder judicial battle than it got in the UK.

    Which isn't the case in the UK. As one crypto expert I spoke with once said, the UK law amounts to, "rubber-hose cryptography." Which is an allusion to the fact that a sufficient application of pain (i.e. torture) can cause someone to fork over their own encryption keys, which is a concern in certain totalitarian regimes.

    The accepted response to 'rubber-hose' crypto methods is to use a 'duress key'. This key will either destroy the crypted data, or reveal harmless data (40GB of soft porn!). The problem with such a key is that it works best if such a key is not known to exist. Forensics analysis can show what kind of crypto is in use, and if that particular type supports the use of a duress key, the interrogators can work that into their own information extraction methods. Also, any forensics person worth their salt works on a COPY of the data (as the RIAA knows all too freaking well, digital data is very easy to duplicate), so having the duress key destroy the data isn't a loss. In a judicial framework, having the key given destroy the (copy of the) data can earn the person a, "hampering a lawful investigation," charge and even more jail time.

    All that said and done, there are still PLENTY of ways for the US Government to gain access to pass-phrases. I've heard of at least one case where a key-logger was installed on a machine for the express purpose of intercepting the key-strokes of the pass-phrase. If the pass phrase exists in the physical realm in any way (outside of your head), they can execute search warrants on it. Some crypto programs don't handle pass-phrases well. Also, if you have a Word document that was crypted, then decrypted so you could view it, the temp files Word saves every 5-10 minutes are in-the-clear and recoverable through sufficient disk analysis. The end-user needs to know about safe handling of in-the-clear data.

    All of which is expensive work. If the Government can save several thousand dollars in tech time by simply asking you the pass phrase and throwing you in the clink if you don't give it, that's what they'll do. If the person under investigation is known to be very crypto savvy (uses a Linux machine, with an encrypted file-system that requires a hand-entered password to even load, and uses PGP or similar on top of that to defend against attacks when the file-system is mounted) it becomes WAY cheaper to go the Judicial route than the tech route.

    Yeah, 2-3 years may be much better than the 20-life you'd face on a terrorism charge. But you'd be in custody the whole time, and they'll be spending that 2-3 years going over your encrypted data the hard way. And if actual actionable evidence surfaces to support a terrorism charge, you can bet your bippy that you'd be hauled into a court-house for a new trial, only this time facing 20-life. If you're in the UK. Here in the US they'll just keep you under surveillance until they get the pass phrase or enough other evidence to hold you down in custody and give them an excuse to throw everything you've ever touched into evidence lock-up.



    Monday, November 05, 2007

    HP support problems.

    We had another unfortunate incident with HP support this morning. We found some critical infrastructure had quietly expired from warranty, so was not covered. How it is supposed to work is that when things near expiration we add them to a separate Support Contract we have with HP to cover stuff not on warranty.

    One of the biggest problems we have is that HP Support verification requires two factor authentication. You need both the serial number of the device (and for multi-device systems like blade racks or SAN racks it isn't always clear which S/N you need) and the model number of the device (ditto, with multi-device systems). The brand new servers we've received have a handy tag on the front with both numbers, but devices older than about a year do NOT.

    Having a single S/N key to support is not hard to do. Dell has been doing this for YEARS (the 'express service code'), so it can be done. It also makes the support verification problem a lot easier.

    HP also used to inform us when major things were coming off of support. As my boss just pointed out, doing so is a revenue thing for them, as they were always able to talk us into paying them money to keep things supported. A couple years ago they stopped doing that, and since then we've had several instances of key machines quietly going unsupported.

    My experiences with HP support:
    General Web Support: Very bad. Hard to find information. Even HP techs have trouble.
    On-site Support: Very good.
    Phone Support: Pretty good.
    Downloading Drivers: Bad. It's on the web-site, so it's hard to find exactly what you're looking for.
    Finding Documentation: Mixed. For some things like servers it is OK. For Storage things it is very bad.
    It hasn't quite gotten to the point where I'll CALL them before trying to find things myself, but it is getting close. Their web-site is THAT BAD.



    Monday, October 15, 2007

    Peer-to-peer sharing

    One feature that has shown up in some applications and widgets lately has gained some traction internally. That is the concept of peer to peer sharing of disk space without going through all the pain of getting things approved and formally set up. The general idea is this one.

    I want to share U:\SharedStuff\ApacheGroup\ to five other users. U: is my home directory, which is actually map-rooted so I don't see the top level directory. So I go to a web page and tell it I want to share this directory, to these people, for this long. Go.

    It struck me that this sort of thing can be engineered with NetWare and OES. The key components are eDirectory, NSS, and NetStorage.

    The web server takes the request and translates $Path into a real path by referencing the HomeDirectory attribute of the user who requested the share. Then, using LDAP it creates two objects:

    A Group Object
    • Created and named dynamically
    • [AuxClass] Attribute with user-defined name
    • [AuxClass] Attribute with the creator
    • [AuxClass] Attribute with the expiry date
    • Since this is eDirectory, group memberships apply immediately rather than taking a logout/login cycle to refresh the access token like in MS networks.
    A Storage Location Object
    • Created & named dynamically
    • Associated to the created group
    • Assigned to the specified users
    • This allows the share to show up in NetStorage
    The web server sends a request to a file daemon that handles the actual trustee assignment.

    There is a small constellation of maintenance tasks that also need to be created, such as a janitor process to deal with expirations, a helpdesk view to track who has what shares, a historic view to see what shares got deleted recently that suddenly need to be back RIGHT NOW, something to interface this with whatever disk or directory quota systems are in use.
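As a rough illustration of the bookkeeping involved, here is a small sketch of the path translation, the share record, and the janitor's expiry check. It deliberately does not talk to eDirectory, NSS, or NetStorage; the class and function names, the example home directory string, and the group naming scheme are all assumptions made up for the example.

```python
import uuid
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass
class Share:
    """What the group and storage location objects would need to record."""
    group_name: str   # created and named dynamically
    display_name: str # user-defined name
    creator: str
    real_path: str
    members: list
    expires: date


def resolve_path(home_directory: str, user_path: str, drive: str = "U:") -> str:
    """Translate a map-rooted path (U:\\...) into a path under the home directory."""
    if not user_path.upper().startswith(drive.upper()):
        raise ValueError(f"{user_path!r} is not on the {drive} drive")
    relative = user_path[len(drive):].replace("\\", "/").strip("/")
    return f"{home_directory.rstrip('/')}/{relative}"


def create_share(creator, home_directory, user_path, members,
                 display_name, days, today):
    """Build the share record; the real thing would create the LDAP objects."""
    return Share(
        group_name=f"share-{uuid.uuid4().hex[:8]}",  # hypothetical naming scheme
        display_name=display_name,
        creator=creator,
        real_path=resolve_path(home_directory, user_path),
        members=list(members),
        expires=today + timedelta(days=days),
    )


def expired_shares(shares, today):
    """Janitor pass: shares whose expiry date has passed."""
    return [s for s in shares if today > s.expires]
```

The janitor would call `expired_shares()` on a schedule, remove the trustee assignment and the two objects for each result, and keep the record around for the historic view.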

    The use of NetStorage allows WebDAV to be used as an access method, which allows the shares to be seen. The really brave may be able to leverage DFS to create actual directory structures reflecting the shares in the actual directories so drive mappings can be used; unfortunately I have no idea if a DFS database that large is a good idea.

    Users would love this. No need to go through management to get a directory set up on the shared space. You just set up and go. Great for adhoc groups, or small private gatherings.

    Unfortunately, this sort of share model is one that a lot of sys-admins are familiar with. If you've ever had a chance to examine the network of a small business with under 15 users, all of whom call themselves 'not that good with computers', you know what I'm talking about. This model of sharing is the one that Windows for Workgroups was designed for, and is still the default mode for plain old WinXP. Excessive use of peer to peer sharing like that can lead to one unholy mess, especially if a key person leaves (or in the case of the Windows example, one hard drive crashes hard).

    If left unchecked, you can get whole business processes designed with the assumption that [username] will never retire. That already happens to an alarming extent, but this would make the dependency more invisible to those of us charged with making it all work again when it breaks. You can have shared spaces that are business critical to the company living 100% inside a user's self-managed space, and vulnerable to deletion on termination of that employee.

    This is all part of the balance we as system administrators have to keep between end user functionality, and data protection. Desktop techs fight a constant battle to get users to save data on the server where it is backed up, and Novell puts out things like iFolder to help that whole thing become more invisible. We created shared directories to draw a big line between 'my stuff' and 'us stuff'.

    That said, data-access habits are changing all the time. My own boss prefers to email a 150KB Excel spreadsheet to all of us, even though all of us have ready access to a shared directory set up just for that. SharePoint integrates with Office to make the web-server look like a file-server. We still have to adapt with the times.

    User-directed sharing is something I can see as highly desirable among the student population and faculty as well. Among staff, I'm less sure it's a good idea outside of the 'trivial' personal use we're allowed.



    Tuesday, September 25, 2007

    The perils of a manual process

    Yesterday I found the root cause of a rather perplexing problem. We had a user, happily for me one of the other sysadmins at WWU, who couldn't get his eDir password changed. No matter how many times he ran the identity management process, his AD password would change but his eDir password would not, even though the event reported success.

    A word of note:

    We do not use Novell Identity Management. We've built our own. When Novell first came out with DirXML 1.0, we already had the foundation of what we have right now. So, when I talk about IDM, I'm actually referring to our own self-built system not Novell's IDM.

    To troubleshoot, I ran many tests. The longest one was to turn on dstrace logging on the root replica server, and push changes to the object. I'd push a change, watch the logs, then parse through the log for the user's object.
    • Changing it via LDAP made a sync.
    • Changing it via the IDM did not make a sync.
    • Changing it via iManager made a sync.
    • Changing it via ConsoleOne on the IDM server made a sync
    This would point to some flaw in the IDM process. This is unlikely, as the password change logic has been largely unchanged for close to 7 years. The underlying libraries have also been unchanged for close to 3 years. Very unlikely to be that. What it could be, though, is some odd-ball untrapped error.

    To figure out what that is, I needed to sniff packets. PKTSCAN to the rescue. On the IDM server I turned off connections to all but the server holding the Master replicas of everything. Then on the master replica server I loaded PKTSCAN. I turned on sniffing, made the change, waited 5 seconds just to be safe, turned off the sniff, saved it, and loaded it in Wireshark. The two cases I tested:
    • Change the concurrent connections attribute through IDM
    • Change the concurrent connections attribute through ConsoleOne on the IDM server
    This is what showed my problem. When I did it through IDM, it was attempting to change the Concurrent Connections attribute of T=WWU. Ahem. When I did it through ConsoleOne, it was attempting to change the Concurrent Connections attribute of CN=[username].OU=Users.O=WWU. AHAH!

    Looking at the details of T=WWU, I saw that it had an aux class associated with it. It was posixAccount. Thus, was I illuminated.

    This particular sysadmin requested to have his account 'turned on for linux'. Which is code for having the posixAccount aux-class associated and the uid, gid, cn, and shell attributes added. This is still a manual process for us since requests are few and far between, though that is changing. It would seem that when I did it, I right-clicked on the wrong object. Whoopsie poo! Easily fixed, though.

    I removed the aux-class from the tree root object, and suddenly... IDM changes started applying to the right object! Hooray! I think the IDM code was keying off of commonName rather than CN for some reason, which is why the aux-class got in the way.



    Monday, September 24, 2007

    Neat eDir trick

    One thing that I learned at BrainShare years ago is that eDir 8.7 permits LDAP clients to register against events. Probably the most widely applicable devnet thing is the LDAP Classes for Java. From my understanding, this sort of technology is used in both Novell Identity Manager and NSure Audit.

    So, what the heck is it? From the documentation:
    The event system extension allows the client to specify the events for which it wants to receive notification. This information is sent in the extension request. If the extension request specifies valid events, the LDAP server keeps the connection open and uses the intermediate extended response to notify the client when events occur. Any data associated with an event is also sent in the response. If an error occurs when processing the extended request or during the subsequent processing of events, the server sends an extended response to the client containing error information and then terminates the processing of the request.
    It's an extension to LDAP that Novell created to permit event monitoring. It monitors events in eDirectory, from object changes, to internal eDirectory statuses like obituary processing. For example, you can set up a connection and tell the LDAP server to tell you of all changes to the "member" attribute, and track all group modifications. Or track the "last login time" attribute, and create a robust login audit log.

    Stuff like this is downright handy in identity management situations. If a change is made to "phoneNumber" in the Identity tree, that change can be trapped by the monitor, and propagated to the production eDir tree, Active Directory, and NIS+. What's now a batch process can be event based.
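The shape of that event-based flow can be sketched without touching the real thing. The example below is an in-memory stand-in only: it does not use Novell's LDAP event extension or the LDAP Classes for Java, and the class, the method names, and the dictionaries standing in for the three target directories are all hypothetical.

```python
class ChangeMonitor:
    """Stand-in for an event source: handlers subscribe per attribute."""

    def __init__(self):
        self._handlers = {}

    def subscribe(self, attribute, handler):
        self._handlers.setdefault(attribute, []).append(handler)

    def publish(self, dn, attribute, value):
        """Deliver one change event; returns how many handlers ran."""
        handlers = self._handlers.get(attribute, [])
        for handler in handlers:
            handler(dn, attribute, value)
        return len(handlers)


def make_propagator(target):
    """Build a handler that copies the changed value into a target store."""
    def propagate(dn, attribute, value):
        target.setdefault(dn, {})[attribute] = value
    return propagate


# Dictionaries standing in for the downstream directories.
production_edir, active_directory, nis_plus = {}, {}, {}

monitor = ChangeMonitor()
for store in (production_edir, active_directory, nis_plus):
    monitor.subscribe("phoneNumber", make_propagator(store))
```

Publishing a "phoneNumber" change through the monitor updates all three stores immediately, while changes to attributes nobody subscribed to are ignored, which is the difference between this and a nightly batch run.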

    I'm not a java programmer so I'm limited in what *I* can do with it. However, I have coworkers who DO speak java, and can probably do wonderful things with it.



    Tuesday, September 04, 2007

    Expanding the EVA

    Our EVA3000 is full. All shelves have disks in them. In order to add space we need to replace our existing 143GB drives with 300GB drives. This is a rather expensive way to gain more space, as that extra 157GB of space costs the same as 300GB of space. But, that's what we have to do.

    And wow does it take a while.

    First I have to ungroup the disk. This can take up to two days. Then I pull the drive, and put the new one in. And regroup on top of it, which takes another up to two days. All the group/ungroup operations are competing for I/O with regular production.

    Total time to add 157GB to the SAN? Looks to be 3 days and change.

    We need a newer EVA.



    Saturday, August 25, 2007

    Measuring sysadmin productivity

    There was another thread on Slashdot today that caught my attention:

    http://ask.slashdot.org/askslashdot/07/08/25/1753220.shtml

    The asker asked:
    RailGunSally writes "I am a (strictly technical) member of a large *nix systems admin team at a Fortune 150. Our new IT Management Overlord is a hardcore bean-counter from hell. We in the trenches have been tasked with providing 'metrics' on absolutely everything from system utilization to paper clip recycling. Of course, measuring productivity is right up there at the top of the list. We're stumped as to a definition of the basic unit of productivity for a *nix admin. There is a school of thought in our group that holds that if the PHBs are simple enough to want to operate purely from pie charts and spreadsheets, then we should just graph some output from /dev/random and have done with it. I personally love the idea, but I feel the need for due diligence, so I put the question to the Slashdot community: How does one reasonably quantify admin productivity?"
    I don't have a "bean-counter from hell" boss, but this is a topic I spent a bit of time thinking about at my last job. How do you measure the productivity of a sysadmin? The question at my previous job was how do you determine which employee holds more value than another. This is not an easy thing.

    Productivity at its most abstract is the rate at which an employee adds value to an organization. The tricky part is determining how to measure that rate and the value itself. In manufacturing, it is easier as 'widgets-per-hour' is generally OK. IBM and Microsoft attempted to do this to programming back in the development phase for OS/2, with the infamous "KLOC", or "thousand lines of code."

    System Administration is something that doesn't lend itself well to such quantification. A significant part of our job is quite literally, fire-watch; do nothing until something breaks and then spring into action to contain and correct the damage. While we're waiting for something to break, we're also working on projects to get new or upgraded systems online.

    What I have seen done is to have to account for every minute of my day. Every moment of my day has to be chargeable against something; a project, a department, or other time-tracking tool. It is also my experience that such managers take a dim view of entries such as these:

    9:50-10:00 Bathroom
    11:45-12:00 Time-sheet entry
    15:45-16:00 Time-sheet entry

    The questioner asked, "what is the basic unit of productivity for an *nix admin?"

    I could come up with a funny name for this fictional unit, but in essence there isn't one. To fully quantify an admin's productivity requires fully quantified metrics for:
    • The impact of server and service downtime.
    • The value gained from meetings.
    • The seasonal variations in business (in our case, when are classes in session? When are finals? When do grades need to be reported? When are parents on campus? Things like that.)
    • Bureaucratic friction (how much 'process' is required to get things done?)
    I have yet to run into a business where the above are fully quantified. Through knowledge of the above you can determine the productivity of any single cog in the whole mechanism. This is the best way to determine these things.

    Trying to reduce the complexity of the problem to certain 'proxy' metrics, metrics that are easy to track but also tend to mirror the much more complex metric, is the method of choice in these circumstances. Yet what proxy metric will do? Trouble-tickets resolved per week is one method, but it overlooks the differing complexity of some trouble-tickets (misplaced file versus install BlackBoard 9.4). Projects completed is another way, but as with trouble-tickets the complexity of some projects differs and projects can be canned from on-high without notice.

    It is for reasons like this that Unions really like seniority. It is a simple supposition:

    IF (timeAtCompany($NAME)) > (timeAtCompany($OTHERNAME)) THEN moreValuable($NAME)

    Plus, it is hard for managers to game. Time of service is easy!

    Yet every single tech-worker I've spoken with hates this system because we've all seen the flaw in it. If you've spent any amount of time at a company with more than 4 IT workers, there will be at least one of them who is not very good, is just marking time until retirement, or is there for some reason besides doing a good job. These people have a tendency to have a lot of years of service, so are hard to get rid of. Just because you've been at a company in one general role is no guarantee of increased knowledge, skill, or value.

    Sysadmin productivity is not something that can be measured easily. It is similar to trying to measure the productivity of a department-level Project Manager. It can be done, but it is a very squishy measurement.

    Which just means we'll end up justifying every minute we're at work, and have the boss decide what productivity means through intuition.


