January 2008 Archives

I don't think it works that way

We've been having some abend issues on the cluster lately, something with the network services rather than file serving. It seems to be triggered more by iPrint/NDPS than by MyFiles, but both are associated with it. The abend itself is in WS2_32.NLM, so it's in the network stack. I have a call open with Novell.

I finally, finally managed to get a meaningful packet-capture after it fails, and I found some traffic that... doesn't look right. Take a look:

-> NCP Connection Destroy
<- R OK (FIN,PSH,ACK)
-> ACK (to R OK)
<- RST,ACK

Note the last three packets. The responding server is tearing down the connection twice for some reason. Compare this with a 'normal' tear-down:

-> NCP Connection Destroy
<- R OK
-> ACK (to R OK)

The first example I gave is the last traffic on the wire before the server abends, so is of course highly suspicious. The pattern that leaps right out is that the responding server is issuing the FIN,PSH,ACK and RST,ACK pair, rather than the sending server, and doing so before the sending server can say "I got it" to the connection close packet.
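
To help catch it in the act, a scan over a parsed capture could flag the double tear-down automatically. This is only a sketch: the (direction, flags) tuples are a made-up representation of the trace, and a real version would derive them from the packet-capture tool's output.

```python
# Sketch: flag the double tear-down pattern described above.
# The (direction, flags) event tuples are an illustrative stand-in for
# a parsed capture, not the output of any real tool.

def double_teardown(events):
    """Return True if the responder sends both a FIN,PSH,ACK and an
    RST after an NCP Connection Destroy, i.e. tears down twice."""
    saw_destroy = saw_fin = False
    for direction, flags in events:
        if direction == '->' and 'DESTROY' in flags:
            saw_destroy = True
        elif direction == '<-' and saw_destroy and 'FIN' in flags:
            saw_fin = True
        elif direction == '<-' and saw_fin and 'RST' in flags:
            return True  # responder reset an already-closing connection
    return False

suspect = [('->', 'DESTROY'), ('<-', 'FIN,PSH,ACK'),
           ('->', 'ACK'), ('<-', 'RST,ACK')]
normal = [('->', 'DESTROY'), ('<-', 'FIN,PSH,ACK'), ('->', 'ACK')]
print(double_teardown(suspect))  # True
print(double_teardown(normal))   # False
```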

Now I need to catch it in the act again to prove this theory.

BrainShare social networking

I am going to BrainShare this year!

It has been interesting to watch the social networking thingy related to BrainShare over the years.

Two years ago, and for many years before that, the primary social group for BrainShare was novell.community.brainshare. This was an NNTP (you remember Usenet?) group hosted on the same servers that host the Novell Support Forums. BrainShare 2006 saw an increase in a certain kind of anti-Novell traffic that was already fairly common in the lead-up to BrainShare 2005. The denizens of the group tend to be old time Novell hands, and as you can imagine they were pretty upset about Novell's plans for NetWare. A few very vocal people managed to raise enough of a stink that there wasn't a lot going on in the group for 2006. Unsurprisingly, novell.community.brainshare was removed from the NNTP servers around May 2006 (though the Google Groups version of it is still around, see the link).

Last year Novell came up with BrainShare Connect as the social networking thingy. It had forums, blogs, and various other things to try and get attendees hooked up with each other and interacting. It got a reasonable amount of traffic, but many folks who had been regulars of the NNTP group were not there. I checked in every few days to see if anything new was up. For 2006 and 2005 I had checked the NNTP group daily, since there really was that much going on.

This year BrainShare Connect is back, but... they didn't do it right. The same outsourced firm is handling it, but even though it has Web 2.0 stamped all over it the interface is markedly worse than last year. There are no blogs. There are no polls. The interest finders are... weak and obfuscated. The forums are implemented on phpBB, but done wrong. As an example of the wrong, take a look at this screen shot of me Replying to a thread:

Reply pop-over obscuring everything

What am I replying to? I can't tell. That window can't be moved or resized. I'd better hope my memory is good. I don't know if this is a new phpBB feature (a new version came out a while ago) or some customized mod from WingateWeb. Whatever it is, it isn't a good thing. Being able to see what you're replying to greatly eases the flow of conversation.

And the logout screen is particularly interesting, too.

The logout window with weird buttons

Whatever happened to "Cancel/OK"? Hasn't that been a de facto standard since, like, the original Mac came out 24 years ago? Proceed? I think that's the first time I've ever seen that particular word in that particular spot in an application developed by professionals.

The NNTP group had plenty going for it, but it was spoiled by a few vociferous critics. In the last few months Novell has released a brand new HTTP interface for the support forums that is worlds better than what was there before. Novell could bring this function back in-house if they really wanted to, and I'd support that decision. That said, I do understand why they need/want WingateWeb to handle that function. I just wish they did it better.

A needed patch.

Novell has released a patch for the "ConsoleOne sorting problem."

The sorting problem happens when you have eDir 8.8 installed. Suddenly C1 starts sorting things by creation date rather than by name, the way it always did before. This is... confusing. ConsoleOne 1.3h helped some of it for us, but not all. And now, we have a patch!

Let ConsoleOne Sort Correctly!

Distributed identity systems are hot these days. OpenID has been around for a while, and Yahoo! just jumped on that bandwagon, possibly to stick it to Microsoft, who is deploying LiveID. Blogger just started allowing non-Google logins for things like comments.

These systems work by splitting apart authentication (verify who you are) and authorization (what you're allowed to do). Single-Sign-On systems work this way as well, but these systems take that to a much greater scale. Once you've been authenticated by the trusted third party, you are authorized to access the specified resources. In the web domain this is easily handled through cookies.
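
The split can be sketched in a few lines. This is a toy, not how LiveID or OpenID actually work on the wire: the signed string stands in for the cookie the trusted third party hands out, and the little ACL stands in for each site's own authorization data.

```python
import hmac, hashlib

# Toy sketch of the authn/authz split. The key name and ACL contents
# are illustrative assumptions, not any real system's design.
IDP_KEY = b'secret-shared-with-the-identity-provider'

def issue_token(user):
    # Authentication: the identity provider asserts "this is user X"
    # by signing the name. Think of this as the cookie value.
    sig = hmac.new(IDP_KEY, user.encode(), hashlib.sha256).hexdigest()
    return f'{user}:{sig}'

# Authorization: each relying site keeps its OWN list of what an
# authenticated identity may touch.
ACL = {'alice': {'wiki', 'forums'}}

def allowed(token, resource):
    user, sig = token.rsplit(':', 1)
    good = hmac.new(IDP_KEY, user.encode(), hashlib.sha256).hexdigest()
    if not hmac.compare_digest(sig, good):
        return False                         # authentication failed
    return resource in ACL.get(user, set())  # authorization check

tok = issue_token('alice')
print(allowed(tok, 'wiki'))     # True
print(allowed(tok, 'payroll'))  # False
```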

I noticed this text on the LiveID page I linked to:
Microsoft's Windows XP has an option to link a Windows user account with a Windows Live ID (appearing with its former names), logging users into Windows Live ID whenever they log into Windows.
I did not know that. Shows what I pay attention to. What this tells me is that it is possible to synchronize your local WinXP login with a LiveID. This makes me glower, because I inherently trust my local system differently than I do miscellaneous web services. Yes, the authenticator is the piece I need to worry about, as it is how I get to prove I'm me, and that's just in one spot. But still: one compromised account (my LiveID account) and everything is shot.

Let's take it a bit further. It would probably be easy to get LiveID working inside of SharePoint, especially since a developer SDK has been released to do just that. This would permit LiveIDs access into SharePoint. Handy for collaborating with colleagues working for other companies or universities.

Now what if Microsoft managed to kerberize LiveID? That would make it possible to use LiveID to log in against any Kerberos-enabled service, as well as almost anything ActiveDirectory-enabled. It'd probably take a tree-level (or maybe domain-level) trust established to the foreign tree (LiveID in this case) to make it work, but it could be done. Use LiveID to log into Exchange with Outlook, or map a share. Use your corporate login to work on your partner's ordering system.

This scares me on principle, not just because it's Microsoft I'm talking about here. Yes, it can be a great productivity enhancer, but the devil lurks in the failure modes. Identity theft is big business now, and anything that extends the reach of a single ID makes that ID that much more valuable. Social Security Numbers are a big deal to us Americans since we can't renumber them, so we have to protect them as hard as we can. Until we get a better handle on identity theft, these sorts of "One ID to rule them all" systems just make me wince.

Good migration

At home I just migrated the linux server to new hardware. This has to be one of the easiest migrations I've ever done for that service. Now just the obsessive tweaking needs doing; all the major functions are moved.

That server is running Slackware. I'm not using SuSE at home for a couple of reasons:
  1. I've been using Slack since college
  2. Diversity is good when figuring out how to run Linux
    1. Slackware doesn't have anything approaching YaST.
    2. Getting a new service online with Slackware takes about five times longer than it does with SuSE, but at the end of it you know how it bloody well works.
  3. It's easier to crib from existing config files that way.
I've also done a major rework of the internal network, which required a small rewrite of the network start scripts to handle it correctly.

I got my first wireless access point in November of 2000. Way back then, they hadn't quite figured out all the short-cuts to cracking WEP, so cracking it still required a certain amount of traffic to analyze. This was a Linksys B AP and a Linksys wireless card. Together they had el-crappo for range (by today's standards).

With that in mind I segregated my network.

Internet <- Cisco 675 DSL -> Wired network <- Linux server -> Wireless network

We didn't have cable in our area yet back then. The Cisco handled everything I needed. Unfortunately, it was badly behaved. It had the nasty habit of ARPing through the whole DHCP range, one address per second, continually.

At that point in time I had one wireless device. The always-on Windows server was on the wired network, and the linux server was configured to proxy things. So the only traffic on the wireless network was from my laptop; no ARP ARP ARP ARP ARP and no Windows browse packets. In other words, it was a network that was hard to crack. Oh yeah, baby.

Fast forward a couple years. I move out here, we get cable instead of DSL.
Another year or two, and the 802.11b AP died so we moved to a G AP.
Another year, and I added a certain linux-based media server (wireless for long reasons) and my wife got a PowerBook.

The 10Mb ethernet card in the back of that Linux machine (a Pentium 2 450MHz machine) was really... concerning me. Comcast is still under 10Mb, but... it's the principle of the thing. It was a bloody ISA card, for pete's sake.

So today I flattened the network. It's structured the same, but rather than have separate subnets I'm just using brctl to bridge the two; I like being able to easily sniff my wireless traffic. We no longer have an always-on Windows box. And WPA-PSK is a heckova lot harder to crack than WEP ever was. So, I figure it's safe. Plus, if the linux machine ever dies I only have to move one cable to get things back online.
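
The flattening itself boils down to a handful of brctl calls. Here's a sketch; the interface names are assumptions about my hardware (yours will differ), and building the commands as data makes them easy to review or dry-run before letting them loose.

```python
# Sketch: fold the wired and wireless segments into one bridge.
# Interface and bridge names (eth0, eth1, br0) are illustrative.
import subprocess

def bridge_commands(bridge, interfaces):
    """Build the brctl/ifconfig command sequence as plain data."""
    cmds = [['brctl', 'addbr', bridge]]
    for iface in interfaces:
        cmds.append(['brctl', 'addif', bridge, iface])
    # bring the bridge itself up last
    cmds.append(['ifconfig', bridge, 'up'])
    return cmds

def flatten(bridge='br0', interfaces=('eth0', 'eth1'), dry_run=True):
    for cmd in bridge_commands(bridge, interfaces):
        if dry_run:
            print(' '.join(cmd))      # review before committing
        else:
            subprocess.run(cmd, check=True)

flatten()  # dry run: prints the commands instead of running them
```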

Now the internet seems faster when browsing on the laptops. I guess that 10Mb card was actually slowing things down a bit.

NetWare library patches

Novell recently split the libc and clib patches for NetWare. For a long time patches like "nwlib6a" included both. Now, they're split.

This just caused me a problem. It turns out that if you have libcsp6b (the LibC patch) applied and not nwlib6k (the CLib patch), there is an abend possibility. It happened yesterday. It turns out that in that case, a badly formed network broadcast can cause an abend. This caused three of my six cluster nodes to fall on their butts at the same time. That was fun. Strange (but good) thing is, I had already applied both patches to these three servers but hadn't gotten around to rebooting them yet. So, by killing themselves they actually fixed the problem.

The abend, key details:

EIP in SERVER.NLM at code start +0015FD27h

Heh heh heh. Oops.

And now a bit of history. Long time NetWare admins can ignore this part.

Q: Why are there two C libraries?

CLIB is the library NetWare started with. It began life in the dark and misty past, probably in the late 1980s. It lives in the deepest, darkest bowels of NetWare, dating from the era when Novell was it when it came to office networking. Being so old, its APIs are very mature. Applications developed against CLIB, generally speaking, just plain work.

CLIB is also deprecated, since it is highly proprietary and doesn't play well with others. "Just plain works" in this instance means an assumption of 8.3 names, with kludging to support long file names if at all possible. CLIB applications have a tendency to carry IPX dependencies for no good reason.

LIBC was created, IIRC, around the release of NetWare 5.0 when it became possible for NetWare to operate in a "pure IP" environment. LIBC was designed with the concept of POSIX semantics in mind, which CLIB was not. LIBC was created from scratch with long file name support. By now, as of NetWare 6.5 SP7, most of the NetWare kernel is written against LIBC rather than CLIB.

As an example of LIBC vs CLIB, take the 'MyWeb' service this blog is served by. When I did this the first time, it was on NetWare 6.0, using Apache 1.3. Apache 1.3 was linked against CLIB and was very stable. The service notes for the Apache modules I needed to run to make it work made it clear that supporting long file-names on remote servers was something that had only recently started working.

When the migration to NetWare 6.5 came around, it meant I had to migrate MyWeb to Apache 2.0. Apache 2.0 is linked against LIBC and used a different Apache module to make things work. I had troubles. The LIBC functions were not nearly as mature as their CLIB counterparts, and it showed. 3.5 years later, things are now a lot more stable than they were back then.

Disk-space over time

I've mentioned before that I do SNMP-based queries against NetWare and drop the resulting disk-usage data into a database. The current incarnation of this database went live in August of 2004, so I have just under three and a half years of data in it now. You can see some real trends in how we manage data in the charts.
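
The collection side of this is simple: walk the right OIDs, parse, insert. A sketch of that step; the OID strings and the sample snmpwalk-style output here are placeholders, not the actual NetWare MIB values.

```python
# Sketch: parse snmpwalk-style output for volume-usage counters and
# drop the numbers into a table. SAMPLE and the OIDs are placeholders.
import re, sqlite3

SAMPLE = """\
enterprises.23.2.28.2.14.1.3.1 = INTEGER: 412345678
enterprises.23.2.28.2.14.1.3.2 = INTEGER: 98765432
"""

def parse_usage(text):
    """Return (oid, value) pairs from snmpwalk-style INTEGER lines."""
    rows = []
    for line in text.splitlines():
        m = re.match(r'(\S+)\s+=\s+INTEGER:\s+(\d+)', line)
        if m:
            rows.append((m.group(1), int(m.group(2))))
    return rows

db = sqlite3.connect(':memory:')  # the real one is a long-lived DB
db.execute('CREATE TABLE usage (sampled DATE DEFAULT CURRENT_DATE,'
           ' oid TEXT, kb_used INTEGER)')
db.executemany('INSERT INTO usage (oid, kb_used) VALUES (?, ?)',
               parse_usage(SAMPLE))
print(db.execute('SELECT COUNT(*) FROM usage').fetchone()[0])  # 2
```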

To show you what I'm talking about, I'm going to post a chart based on the student-home-directory data. We have three home-directory volumes for students, with between 7000 and 8000 home directories on each. We load-balance by number of directories rather than by least size. The chart:

Chart showing student home directory disk space usage, carved up by quarter.

As you can see, I've marked up our quarters. Winter/Spring is one segment on this chart since Spring Break is hard to isolate at these scales. We JUST started Winter 2008, so the last dot on the chart is data from this week. If you squint (or zoom in like I can) you can see that the last dot is elevated from the dot before it, reflecting this week's classes.

There are several sudden jumps on the chart. Fall 2005. Spring 2005. Spring 2007 was a big one. Fall 2007 just as large. These reflect student delete processes. Once a student hasn't been registered for classes for a specified period of time (I don't know what it is off hand, but I think 2 terms) their account goes on the 'ineligible' list and gets purged. We do the purge once a quarter except for Summer. The Fall purge is generally the biggest in terms of numbers, but not always. Sometimes the number of students purged is so small it doesn't show on this chart.
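
As I understand it (and the two-term threshold really is my guess), the eligibility rule works out to something like this; the term names are made up for illustration:

```python
# Sketch of the purge-eligibility rule. The two-term gap is a guess,
# and the term list is illustrative.
TERMS = ['Winter07', 'Spring07', 'Summer07', 'Fall07', 'Winter08']

def ineligible(last_registered, current='Winter08', gap=2):
    """True once a student has gone `gap` or more terms unregistered,
    which puts them on the purge list for the next quarterly run."""
    return TERMS.index(current) - TERMS.index(last_registered) >= gap

print(ineligible('Spring07'))  # True: three terms unregistered
print(ineligible('Fall07'))    # False: only one term unregistered
```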

We do get some growth over the summer, which is to be expected. The only time when classes are not in session is generally from the last half of August to the first half of September. Our printing volumes are also w-a-y down during that time.

Because the Winter purge is so tiny, Winter quarter tends to see the biggest net-gain in used disk-space. Fall quarter's net-gain sometimes comes out a wash due to the size of that purge. Yet if you look at the slopes of the lines for Fall, correcting for the purge of course, you see it matches Winter/Spring.

Somewhere in here, and I can't remember where, we increased the default student directory-quota from 200MB to 500MB. We've found Directory Quotas to be a much better method of managing student directory sizes than User Quotas. If I remember my architectures right, directory quotas are only possible because of how NSS is designed.

If you take a look at the "Last Modified Times" chart in the Volume Inventory for one of the student home-directory volumes you get another interesting picture:

Chart showing the Last Modified Times for one student volume.

We have a big whack of data aged 12 months or newer. That said, we have non-trivial amounts of data aged 12 months or older. This represents where we'd get big savings when we move to OES2 and can use Dynamic Storage Technology (formerly known as 'shadowvolumes'). Because these are students and students only stick around for so long, we don't have a lot of stuff in the "older than 2 years" column that is very present on the Faculty/Staff volumes.

Being the 'slow, cheap' storage device is a role well suited to the MSA1500 that has been plaguing me. If for some reason we fail to scare up funding to replace our EVA3000 with another EVA less filled to capacity, this could buy a couple of years of life on the EVA3000. Unfortunately, we can't go to OES2 until Novell ships an eDirectory-enabled AFP server for Linux, currently scheduled for late 2008 at the earliest.

Anyway, here is some insight into some of our storage challenges! Hope it has been interesting.

I've spoken before about my latency problems on the MSA1500cs. Since my last update I've spoken with Novell at length. Their own back-line HP people were thinking firmware issues too, and recommended I open another case with HP support. And if HP again tries to lay the blame on NetWare, to point their techs at the NetWare backline tech, who will then have a talk about why exactly it is that NetWare isn't the problem in this case.

This time when I opened the case I mentioned that we see performance problems on the backup-to-disk server, which is Windows. Which is true: when the problem occurs, B2D speeds drop through the floor; last Friday a 525GB backup that normally completes in 6 hours took about 50 hours. Since I'm seeing problems on more than one operating system, clearly this is a problem with the storage device.
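
To put that slowdown in concrete terms, here's the same 525GB job expressed as average throughput:

```python
# Average throughput for the B2D job described above.
def throughput_mb_s(gb, hours):
    """Average MB/s for a job of `gb` gigabytes taking `hours` hours."""
    return gb * 1024 / (hours * 3600)

print(round(throughput_mb_s(525, 6), 1))   # 24.9 MB/s on a good day
print(round(throughput_mb_s(525, 50), 1))  # 3.0 MB/s when it goes bad
```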

The first line tech agreed, and escalated. The 2nd line tech said (paraphrased):
I'm seeing a lot of parity RAID LUNs out there. This sort of RAID uses CPU on the MSA1000 controllers, so the results you're seeing are normal for this storage system.
Which, if true, puts the onus of putting up with a badly behaved I/O system onto NetWare again. The tech went on to recommend RAID1 for the LUNs that need high performance when doing array operations that disable the internal cache. Which, as far as I can figure, would work. We're not bottlenecking on I/O to the physical disks, the bottleneck is CPU on the MSA1000 controller that's active. Going RAID1 on the LUNs would keep speeds very fast even when doing array operations.

That may be where we have to go with this. Unfortunately, I don't think we have 16TB of disk-drives available to fully mirror the cluster. That'll be a significant expense. So, I think we have some rethinking to do regarding what we use this device for.

Where NetWare Fits

NetWare 6.5 still holds top honors in one server niche. Even though it is a 32-bit operating system. That niche is the "large file-server" segment. I define "large" as, "lots of data, way-lots of concurrent users". Yeah, that's highly scientific. But "way-lots" means "over 1000 concurrent" to my thinking.

We regularly run between 1200 and 6000 concurrent connections on our cluster nodes. This is a density that just doesn't happen all that often in the market. If you have 6000 users close enough together to all talk to the same file-server at LAN speeds using a protocol designed for file-serving (such as NCP, SMB/CIFS, or AFP), you're a big organization. 6000 is a large corporate campus, a large governmental entity of some kind, or a larger .EDU like us. Nationally, the number of 'large' file-servers like that is peanuts compared to the number of 'workgroup' (i.e. under 300 concurrent users) servers out there.

It is therefore no surprise to me that Novell is not devoting a lot of engineering to supporting the top end of this market. While it may pay well, there just isn't enough revenue coming from these customers to try and handle the hardest-to-test use-case: very high concurrency. I find it disappointing because I AM one of those customers (a larger .EDU), but I understand the business drivers supporting the decision.

For the moment, NetWare 6.5 (32-bit) is the top dog, performance-wise, for our environment. That isn't going to stay true for much longer. It would not surprise me to find out that a Windows Enterprise Server (x86_64) with 16GB of RAM can out-perform a NetWare 6.5 (32-bit) server with 4GB of RAM, simply due to the added room for a file-cache. What I don't know is how CPU-bound file-serving I/O is on a Windows Enterprise Server; that's the one area that could keep NetWare 6.5 (32-bit) on top. I already know that OES2-Linux out-performs NetWare for NCP traffic, so long as you stay within CPU bounds.

For high-concurrency applications, as far as I know NetWare still wins.