December 2005 Archives

Future of NetWare

I've talked about this one a lot, but more data keeps coming out. Novell held their Advanced Technical Training two weeks ago, and folk got a better look at the future of NetWare. David Kearns got wind of some trends and wrote about it.
That's not as troubling as the report I got from longtime reader Lewis Rosenthal about his trip to Provo for a training session, which included a discussion of the Open Enterprise Server (OES) roadmap. Rosenthal wrote: "I just thought I'd share with you a little bit I picked up while attending the OES Roadmap today, at ATT Live, here in Provo." Sounds good. But he gets right into the troubling news: "I can say without much doubt that NetWare - as we know it - will be vanishing in the next few years. Sometime after Cypress ships (the next version of Open Enterprise Server), Novell will be rolling out NetWare 'viX'. This will be a specially optimized version of NetWare to run in a virtual session on the Linux kernel, allowing 'legacy' NetWare NLMs to run in such an environment until such time as these applications can be migrated to OES Linux natively."
While I had heard rumors that the next version of NetWare would run inside a VM, this is the first solid word I've gotten. It also tells me that the next version of OES, presumably 2.0, will not include a new NetWare kernel like I had predicted. Darn those rose-colored glasses anyway. Confirmation of this will most assuredly be given at BrainShare this year.

In other news, iFolder 3.0 will never be ported to NetWare. This is a product that began life on NetWare and has now left it. Signs of the times.

Time to start really boning up on bash scripts and the Linux driver model.

New things in CPU-land

AnandTech had a very interesting article about a new Sun CPU coming out.

Read it here.

To summarize, Sun is putting out an 8-core CPU that is focused on multiple threads running in order, and doing it with minimal memory latency, custom built for 'server loads'. Loads such as running a lot of SSL threads, or simple DB queries. According to the Sun-provided benchmarks, the winners in the desktop number crunching market (Opteron) don't perform nearly as well in a typical 'enterprise server' environment. Which I believe, since there is a big difference between the load required to drive a rendering engine and the load required to drive a website running Apache+Tomcat+MySQL.

One of the really neat things Sun did was to increase register count. Each individual core has enough register space to store 4 complete sets of registers, rather than performing a context-switch out to the L1 cache like AMD and Intel chips both do. This means that each core can switch its context in one cycle, rather than 3-4, and that provides speed increases.

The CPU then cycles through the four thread contexts, one per cycle, so each thread gets executed every 4th cycle. That may sound like a performance decrease, but when you factor in fetches from memory it really isn't. The larger register file is what lets the core hide fetches from L1 cache. If a thread on an AMD/Intel chip fetches from L1, the CPU sits idle for 3 cycles before getting the data. On this Sun CPU, the thread ALREADY was going to sit idle, and in the cycles between the fetch and the arrival of the data the CPU can perform operations on the other three threads held in the register file. Those same threads may issue their own L1-gets, which has the happy side-effect of pipelining the memory channel in a way that AMD/Intel aren't doing.
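The latency-hiding argument can be illustrated with a toy simulation. This is a minimal sketch and not a model of the actual Sun design: it assumes every instruction is an L1 fetch whose result the same thread needs before it can issue again, a fixed 3-cycle fetch latency, and one issue slot per cycle. The function name `simulate` and its parameters are made up for illustration.

```python
def simulate(num_threads, instrs_per_thread, load_latency=3):
    """Toy in-order core: at most one instruction issues per cycle,
    rotating round-robin over the hardware thread contexts.

    Assumption: every instruction is an L1 fetch, and a thread cannot
    issue again until its fetch returns `load_latency` cycles later.
    Returns (total_cycles, busy_cycles).
    """
    remaining = [instrs_per_thread] * num_threads
    ready_at = [0] * num_threads  # first cycle each thread may issue again
    cycle = 0
    busy = 0
    next_thread = 0
    while any(remaining):
        # Pick the next thread, in round-robin order, that is not stalled.
        for offset in range(num_threads):
            t = (next_thread + offset) % num_threads
            if remaining[t] and ready_at[t] <= cycle:
                remaining[t] -= 1
                ready_at[t] = cycle + 1 + load_latency
                busy += 1
                next_thread = (t + 1) % num_threads
                break
        # If no thread was ready, the issue slot simply goes idle.
        cycle += 1
    return cycle, busy


if __name__ == "__main__":
    for threads in (1, 4):
        total, used = simulate(threads, instrs_per_thread=4)
        print(f"{threads} context(s): {used} instructions in {total} cycles")
```

Under these assumptions a single context issues 4 instructions in 13 cycles, because the core idles for 3 cycles after every fetch, while four contexts issue 16 instructions in 16 cycles with no idle slots. Real workloads mix fetches with other instructions, so the gap would be smaller in practice.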

We're not doing enough data-pushing of this type to really benefit from the new CPU. Our Solaris systems might be getting to the point of needing it, but we JUST finished a hardware refresh on that side of the house so the point is moot at the moment. Also, that kind of processing on the Intel side of the house isn't prevalent enough nor critical enough to warrant a change in platform. So I'm hoping this new architecture inspires Intel/AMD to put out CPUs that do a lot of the same things. Once that happens, I'm all over it.

NSS on Linux

CoolSolutions posted an article recently that has me wondering.

Novell Connections Tech Talk (November)

Scroll down to Revving up Novell Storage Services on Linux. In that section is a description of how NSS on Linux compares with NSS on NetWare. This is a very key thing since file-server performance is one of the prime road-blocks to OES-Linux deployment here at WWU. At BrainShare 2005 the NSS-on-Linux support was 12% slower than NSS-on-NW. According to the above article, Novell has fixed that and the two platforms are now at parity.

There is nothing about how they ran the benchmarks there, but that is a very interesting finding. If their network file-share performance is at parity as well, that'll be another roadblock removed.

Cluster realignments

One of my projects for the break is to realign which services run on which nodes in the cluster. We have six nodes, and previously there was a none-shall-pass division between the three Student nodes, and the three Faculty/Staff nodes. The division is gone in the drive for more reliable file-serving.

Without running the numbers, I'm guessing that 90-95% of the unexpected volume failovers are due to an application crashing and taking the node down with it. Last year and early this year we had a lot of problems with NDPS. Recently it has been SSHD, and NetStorage. I've recently reminded myself that anytime you run a "sshd reload" from the console you run a very real risk of a crash in the next few minutes.

While our overall downtime from pre-cluster is w-a-y better, the frequency of multi-second outages unfortunately has gone w-a-y up. Before, it could take 12 minutes for a crashed server to get to the point where it could serve files again. Now it's 12-45 seconds, but we get them a lot more often. We're trying to reduce even these small downtimes.

To do that we're dividing the cluster into two halves. The file-server side, and the application side. Due to technical reasons, printing overlaps a little. This is made possible thanks to improvements in LibC that permit things like MyWeb and SFTP to reliably work from servers that aren't also hosting the files. I couldn't have done this 4 months ago.

One of the side-effects of this is figuring out how to get myweb.students.wwu.edu and myweb.facstaff.wwu.edu to share a web-server and still be able to get separate logging from each side. On the surface this is trivial. Unfortunately, the NetWare application environment once again makes things more difficult.

Unlike on Linux, if you remove the access.log file as part of the rotation process, it won't be re-created the next time someone hits the web server. All transactions after the access.log file is removed/rotated will not get logged. Getting it back requires a web-server restart. This behaves like Apache1.3 did, only with Apache2 you can read the access.log file while apache is running.

Novell includes a ROTLOGS.NLM that allows you to pipe input through it to allow rotations.

CustomLog "|sys:/apache2/bin/rotlogs.nlm sys:/apache2/logs/myfiles_access_log 5M" common

Which works great for one logfile for an apache instance. Unfortunately, I need to run it three times in the same instance to provide for different log-files. Rotlogs doesn't like loading multiple times like that, so it has a tendency to crash out the memory space after a couple of hours of normal load. Hardly sporting. Clearly this would present issues in the OS memory space, so I haven't tried doing it over there even for testing.

Since I'm running these web-servers in a cluster, any one of three nodes could be running any one of three services. I can't just create a script that unloads then reloads Apache, since apache will bomb out unless it can bind a listener to every address it's configured to run on. Too tricky.

The solution is to modify the log format and then perform post-processing to split out the separate logs:

LogFormat "%A %h %l %u %t \"%r\" %>s %b \"%{Referer}i\" \"%{User-Agent}i\"" comb-vhost

By prefixing the "combined" pre-built LogFormat directive with the %A format string, each log-line starts with the IP of the VirtualServer that serviced the request. Some scripting trickery later, I have three split log-files that look just like the standard "combined" format! So far, it is working well. We'll see if things hold up.
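The post-processing step could look something like the following. This is a minimal Python sketch, not the script actually in use here: the virtual-server IP addresses, the output log names, and the function names are all hypothetical placeholders.

```python
from collections import defaultdict

# Hypothetical mapping from virtual-server IP (the leading %A field)
# to the per-site log it should land in. Real addresses will differ.
VHOST_LOGS = {
    "192.0.2.10": "students_access_log",
    "192.0.2.11": "facstaff_access_log",
    "192.0.2.12": "myfiles_access_log",
}


def split_line(line):
    """Split off the leading %A field.

    Returns (vhost_ip, rest), where rest is the line in plain
    'combined' format.
    """
    vhost_ip, _, rest = line.partition(" ")
    return vhost_ip, rest


def split_log(lines, vhost_logs=VHOST_LOGS):
    """Group log lines by virtual server, dropping the %A prefix."""
    buckets = defaultdict(list)
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        vhost_ip, rest = split_line(line)
        # Lines from an unmapped address go to a catch-all log.
        name = vhost_logs.get(vhost_ip, "unknown_access_log")
        buckets[name].append(rest)
    return dict(buckets)


def write_logs(buckets, mode="a"):
    """Append each bucket to a file named after its key."""
    for name, entries in buckets.items():
        with open(name, mode) as out:
            out.writelines(entries)
```

Feeding the rotated comb-vhost log through `split_log` and then `write_logs` leaves one file per site, each readable by any tool that expects the standard "combined" format.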

Patching success

SP4a got rolled out to the cluster last night. No significant issues. There was one moment when it was doing the iManager update and it just stopped doing anything for about 15 minutes. I had the 'down server upgrade' TID looked up just in case, but I left it to contemplate its navel for a while. Happily it continued the SP without me having to kick things.

Patience. Patience.

And from the looks of things, nothing is broken this morning. All for the good. AND we may end up with slightly faster backups due to better TSAs, new LAN drivers, and suchlike. We'll see if we get any improvements.

Work tonight

SP4A gets rolled out to the cluster. This is 1am kinda stuff, so I'll be up late. I hope I'm not up TOO late.

Not renewing Information Security

I've been a subscriber to Information Security for five years, possibly six. This magazine, like ComputerWorld, is one of those publications that is 'free' for people in the industry. The 'free' comes from answering a yearly survey and agreeing to have targeted ads in with the magazine content. The usual deal.

Five years ago I wasn't as with-it with information security as I am now. Five years ago I worked for a company that occasionally bought security widgets. Five years ago I was hoping to break into the burgeoning InfoSec industry.

Then I came here. Since our networking security model is a cross between an ISP and a government agency, we have different challenges. Security widgets aren't really in the picture, and security software only barely, and it is typically handled at levels above me (much to their detriment). Security procedure I've gleaned through experience over the last five years. We haven't had any prosecutable events on my watch, but then neither did I while at my old job.

Information Security is all about the following topics, in rough order of emphasis:
  1. Selling Widgets. Reviewing new widgets, sharing ideas on how to use certain classes of widgets, and who is doing a good job in widgetdom.
  2. Regulatory Compliance. Things like HIPAA and SOX weigh heavily on corporations. The ads are all about preying on concerns relating to regulatory compliance.
  3. Business Cases. People who have done Security right tell the tale of how it worked for them.
Widgets:
Our information security budget is teensy. So teensy, it isn't even broken out as a separate line item. We deal with it when we have an identified problem, and upper management has signaled that they're willing to finance the handling of it. As a rule, we don't put much stock into widgets. The few we have are a Bluesocket for wireless access, which arrived before I got here, a PIX around that one critical subnet, and that's about it. I don't count AV software in this category.

Regulatory Compliance:
Until the Feds pass some form of Higher Education Finance and Reporting Act, I'm largely safe here. There is some HIPAA stuff we handle, but that's really minor compared to other things like patching schedules. Regulatory compliance weighs very lightly (heck, not at all!) on my mind, which puts me even further from the clutches of the worry-wart advertisers.

Business Cases:
While a good idea in theory, in the last year I've yet to read even one business case that applied to us. Our unusual network security model is not near what Information Security is selling. It was at my old job, but it is not here.

Since the top three topics of Information Security are not applicable to me, I've decided to attempt to discontinue my subscription. We'll see if it takes.

Network outage

Something caused our router core to start dropping packets like it was going out of style, and that had side-effects. One of the first ways it manifested was as a DNS outage, but once that got reported, poking around turned up traceroutes failing with "host unreachable" inside our router core.

I'm just happy this happened during break. If it had happened during session there would have been screaming and Very Concerned deans-n-things asking for updates every few minutes. I still haven't heard of the exact cause, but know it was some strange traffic coming from multiple segments. Once those segments were cut off, the packet drops went away.

Of note to NetWare is how Timesync handled the fault. We have a Reference server and three Primaries supplying time to everything. The Reference server gets its time by way of NTP from Titan, the designated time-host on the Solaris side of the house. Because of the router problems, Titan went out-of-sync since it couldn't contact any of its sources. This caused our Reference server to take that time anyway, but report as 'out of sync'. Somewhere along the line, Titan demoted itself to a lower stratum (probably st 16) and our Reference server marked the time from there as insane and just plain quit. Once THAT happened, the three remaining Primary servers negotiated between themselves and picked a time.

Unfortunately, that took 15-20 minutes. The three main NDS servers went out-of-sync pretty quickly, so for a while there we weren't accepting any NDS changes. Again, during session there would be screaming. Happily, once the Primaries had agreed on a time, things fell back into Sync again and NDS deltas started flowing.

Other servers have been impacted. Something went screwy with our main MS-SQL server, which caused things like the Western Channels, the portal, and other such things to stop working. E-mail went available/unavailable depending on traffic in the core, but Outlook is robust that way so most folk didn't really notice; not being able to surf the web was much more concerning than not getting e-mail on time.

Plausible deniability and Firefox

I was doing some googling in Firefox a bit ago, when something happened that has happened before. I hit "search", and I got a flood of cookies from the first entry on the search-list. Firefox was pre-caching the first hit in the eventuality I wanted to go there. Fine. Good.

Disclaimer: WWU doesn't operate a firewall or web-proxy servers. ResTek does, but that's a separate network. This is a theoretical exercise.

Then it hit me. That's activity that I don't control. That's activity that could, conceivably, cause Firewall/Proxy logs to show me visiting sites I never actually clicked on. While it has gotten better, a badly phrased query can cause certain sites that the corporate masters would frown on to show up as accessed by me. A skillful defense attorney could probably use this behavior to introduce reasonable doubt in a wrongful-termination suit that used firewall/proxy logs as evidence.

I don't know if such information is cached out to RAM or Disk browser-cache, but I suspect not. If it does land in the disk-cache, it could be pulled out by a forensic analysis and misinterpreted as an 'active hit' rather than the 'passive hit' it really was. The tell there would probably be the history file. Hmm.

It also struck me as a possible vector for malware. Happily, Firefox isn't rendering the retrieved information so that reduces the areas of bad-ness that malware authors can use. Not that we've heard of any bugs relating to the pre-cache feature.

Yummy stats

The quarter is coming to a close, so now is the time we look at disk-space. I've had these charts for some time, but I'm learning new tricks. Such as the Grouping function in Excel. I was a Crystal Reports person before coming here, so I'm slowly learning the Microsoft way of doing things.

Anyway, take a look at this:

That's a nice growth rate, that is. Between mid-june and late-august we had double-duty shared volumes as we migrated to a unified FacShare volume, which explains the big hump in the FacStaff line. We also do the big student delete in the mid-October to mid-November timeframe, which explains the dips in the Students line. Data collection artifacts are responsible for the spurious dips and peaks.

The rate of increase is our big bugaboo. If you look at the rate for last year's student line, it blew the socks off of the FacStaff growth rate. The rate for this year seems to be pretty close on the student side, but the FacStaff rate appears to have matched that of the Students. This is probably due to a major, high usage, department moving to the WUF servers over the summer as part of that FacShare unification. As expected, the rate of growth of disk usage is increasing.

More and more students are getting into heavy multimedia or GIS, which requires more disk quota than normal. Also, quota usage is going up. On 5/27/05 the average percentage of in-use disk-quota for the student side was 6.11%. On 12/5/05 that percentage was 6.90%. The numbers on the FacStaff side are even prettier, with a 5/27 average of 7.29% and a 12/05 average of 9.91%! This does not represent a decrease in assigned quota by any means, it's just more assigned quota being used. The assigned quota for Students comes to about 10TB, which represents a huge oversubscription of space.
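To put those averages on a common footing, the relative change between the two sample dates can be worked out directly. This is just arithmetic on the four percentages quoted above; the helper name `relative_growth` is made up for illustration.

```python
def relative_growth(before_pct, after_pct):
    """Relative change, in percent, between two average quota-usage figures."""
    return (after_pct - before_pct) / before_pct * 100.0


# Average in-use quota on 5/27/05 and 12/5/05, as quoted above.
students = relative_growth(6.11, 6.90)
facstaff = relative_growth(7.29, 9.91)

print(f"Students: {students:.1f}% relative increase")  # 12.9%
print(f"FacStaff: {facstaff:.1f}% relative increase")  # 35.9%
```

So average quota utilization grew roughly 13% on the Student side and roughly 36% on the FacStaff side over the same six months or so.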

A much larger piece of the storage pie on the FacStaff side is taken up by shared storage. The opposite is true on the Student side. It is the rate of growth on the FacShare volume that is driving the increased rate of disk burn, not increased user-directory utilization. In fact, the "Class" volume on the Student side shows very little growth when compared directly against "FacShare".

BrainShare 2006 sessions

Novell has done a new thing this year. They're allowing us to vote on the session catalog. It isn't clear yet what effect this will ultimately have, but they have well over 100 sessions up for voting right now. And you can only vote if you're registered.

Happily, I am.

It would seem that Novell is continuing and expanding on the theme of BS2005:
  • Identity Management (specifically, NAM3)
  • Linux
  • Open Source
I won't be going to any of the Identity Management sessions. WWU saw the need for an IM solution before Novell offered one. And if you remember the state of the industry at the time, there was zilch out there that did it before DirXML came onto the scene. Therefore, we have a self-built IM solution.

The cost of maintaining the self-built version vs the cost of obtaining and then supporting NAM3 is a no brainer. In no way is NAM3 worth it. This puts us in something of an unusual position, but there we are.

Linux is another story. We're getting some of it already, but the politics over who manages it are unclear. If this goes on, I may recommend one of our Unix guys follow me to BS2007 since he would get as much out of BrainShare as I do. I'll be going to a few sessions, but not as many as I would had NetWare been the OS of focus.

Open Source... that is very conflicted. Most of the Open Source sessions are for developing and deploying on SLES/OES, which we're not really doing at the moment. On the other hand, there are WWU divisions that would really like to see some of that. The Hula project has been mentioned to me by more than one group. While the chances of a centralized Hula (or something) are slim to none, deploying something like that at a department level may happen.

The only sessions I saw that impacted NetWare as we all know and love it are for migration from it, running clusters, and gee-whiz new stuff. Not a lot.

Toolbox!

I was bummed to note the toolbox.nlm was no longer shipping in NW6.5 when I did the upgrades last year. But it is too useful, so I made a point of copying it from the NW6 servers.

Today I learned there is a forge project for it, and there is a newer release! April, but still, newer than was shipping before.

This thing is very useful. It is my tool of choice for handling name-space corruption problems with files. It can peer into SYS:_NETWARE\. Great stuff!

On OES-Linux

OES-Linux, which is separate from SLES in our collective mind-set, will not be seeing production deployment any time soon. SLES is getting some attention from our developers as a platform for Oracle bits that is more developer-friendly than Windows. OES-Linux has nothing special about it that would urge that instead of a similar OES-NW server. Really, the only way we're getting OES-Linux is if it can provide something that either SLES, NetWare, or Windows can't.

OES-Linux represents a brand new operating system to the Windows/NetWare engineers here. We have a few folk managing Solaris and the two (?) SLES boxes we have out there (neither of which is in full production; both are in test mode, if I remember right). OES-Linux presents a union of the two worlds.

The reasons why OES-Linux will be a while in coming:
  1. It represents a brand new operating system. We have had Linux before, but not in our group. We operate under a 'best of breed' methodology, which is why we still have NetWare managing 3.6TB of file-serving storage. We use Windows and Solaris for our application serving, and are assessing Linux for that role as well. A move to OES-Linux has to be assessed against business need and the operating system's strength. So long as OES-NW still has hardware support, we'll be going with that for our file-server. At least until it can be definitively proven that OES-Linux spanks NetWare for file-serving speed, at which point the business-case will have been made.
  2. Adopting it will introduce two authentication domains into the mix. We have an existing Solaris NIS infrastructure. This is synchronized by way of in-house automation processes (which pre-date DirXML, by the way) with eDirectory and Active Directory, so the usual 'multi-domain' penalties don't apply. The decision to have the Unix-people manage the OES-Linux machines or the NW/Win people do it has yet to be made. If the NW/Win people pick it up, it'll mean a third OS to support in a domain where expertise already exists elsewhere. If the Unix people pick it up, it means having them learn all of the Novell widgets and rich rights-management of NSS. If some form of joint committee is formed, there will be a "who gets root" discussion that'll need to happen.
  3. OES-Linux will have to beat out SLES. Before we can start on OES-Linux, we'll have to provide a reason for using that instead of SLES. Again, the business case will have to be proven.
Of course, if Novell out and out states, clearly and distinctly, that NetWare as we know it will be going away, that'll prompt more urgent decision making. As things stand right now, Novell is implicitly giving the impression that NetWare is a dead end. From everything I've heard from actual Novell employees this impression is completely unintentional, so I don't count it as clear. That may change at future Brainshares, we'll see. If it does happen at a keynote, expect booing.

That said, there are a couple of areas that OES-Linux does have potential to take on roles.
  1. Built in eDirectory integration. This can be shimmed into Solaris, SLES, and other Unixes, but it comes stock in OES. This is useful for things like web-serving, or other web-based applications that use the local account domain for authentications.
    1. How the integration works needs to be very well understood by the Unix people before they'll agree to use it
    2. Certain security implications (who can set UID 0 in eDir?) need to be clarified and resolved to satisfaction
  2. NSS-Style permissions. The ACL structure of the Novell file-system is a very rich one. It is granular, transitive, and has decades of history. Something like this would be very nice for multi-access systems.
    1. UNIX-style permissions are equally well understood by the Unix people, and have worked for them for more decades than Novell has been around
    2. Applications on the OES-Linux box running NSS have to be able to work inside such permissions, and the methods and issues surrounding that have to be very well understood
    3. The very few multi-access systems we have running on Unix systems are all mission-critical, so something as relatively new to Unixland as NSS has a very slim chance of getting in
Even though Novell hasn't intentionally given the message that NetWare is dead, tea-leaf reading by people is giving another message. The amount of hardware that has drivers for NetWare will decrease over the next several years, and that'll have a big effect on how long NetWare can survive before being exiled to Virtual Machine Dreamland. So sometime in the next 5-7 years, our group will be faced with the decision of which OS to move our main file-serving cluster to. By then I strongly hope Novell has figured out how to make OES-Linux kick Windows butt in benchmarks. But only time will tell.