July 2007 Archives

Novell news

Two Cool Blogs posts in the past few days have held some nice tidbits.

Jason Williams says that the Novell Client for Vista is due out in mid-August, so long as a key defect registered with Microsoft gets fixed.

Jaimon Jose says that eDir 8.8 SP2 is also due out real soon. SP2 apparently involves some serious performance enhancements.

Both of these are technologies associated with the elusive OES2. We need the Client for Vista as soon as they can get it to us, so I'm not surprised they're considering releasing that independently of OES2. SP2 for eDir 8.8 is something I figure will be included in OES2 by default, but since eDirectory is an independent product as well, having SP2 release on its own is nice. This means that two technologies that could be blockers for OES2 are finally being kicked into the real world.

In news unrelated to WWU at all, Bonsai, the next GroupWise version, seems to be getting closer to deployment. They're nearing 'code complete' and will soon start the Authorized Beta phase.

Virus Alert!

The Internet Storm Center had a nice post this morning.

It seems malware authors have taken to dynamically generating binaries for each and every client that contacts them. This understandably makes traditional virus-checkers useless, which is the whole point. This is another step in the game of cat-and-mouse that has been going on in the Microsoft OS virus game since it started w-a-y back in the MS-DOS days. I had a computer infected with "ping-pong" once.

In the beginning, a virus checker was simply a regex engine capable of efficiently parsing binary files, wrapped around a static list of regexes associated with known viruses. Cleaning files involved a bit more logic, but for non-destructive bugs it wasn't too hard. This worked for some time, as all known viruses had to infect files to propagate themselves.
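As an illustration of the idea (a toy sketch only, not any vendor's actual engine; both signatures here are a test string and an invented pattern), the whole approach boils down to a regex pass over raw bytes:

import re

# Toy signature database: name -> byte pattern. Real signature databases
# are far larger and more involved; these two are only for the example.
SIGNATURES = {
    "EICAR-Test-String": re.compile(re.escape(rb"X5O!P%@AP[4\PZX54(P^)7CC)7}$EICAR")),
    "Fake-Virus-1999": re.compile(rb"\xde\xad\xbe\xef.{0,16}\xca\xfe\xba\xbe", re.DOTALL),
}

def scan_file(path):
    """Return the names of any signatures found in a file's raw bytes."""
    with open(path, "rb") as handle:
        data = handle.read()
    return [name for name, pattern in SIGNATURES.items() if pattern.search(data)]

if __name__ == "__main__":
    import sys
    for target in sys.argv[1:]:
        hits = scan_file(target)
        print(f"{target}: {', '.join(hits) if hits else 'clean'}")

Everything that follows in this post is, one way or another, the virus writers finding ways to make that simple pass insufficient.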

Then the hackers figured out how to infect boot sectors, and the boot-sector virus was born. Anti-virus engines had to evolve to provide the same regex support on the boot sector. So now AV engines had to check both files being launched and an actual disk structure. As before, this worked for some time.

In these early days of viruses, it could take months for a new virus to spread widely enough to get noticed and detected. Anti-virus updates were sent out as needed, and eventually came monthly. For quite some time, McAfee sent out a static .exe file (Scan.exe) with the detection database hard-coded into it. So you saw .ZIP archives of the shareware scanner on BBS download sites: mcscan60.zip, mcscan72.zip, etc.

Then the virus writers figured out how to do "polymorphic" code. This was an advance, as the regex-based AV engines couldn't detect code that CHANGED from infection to infection. In response, AV engines had to have much greater intelligence built into them. There was some back-and-forth for a while with code that really wasn't as regex-proof as the virus writers thought; for instance, certain virus-writing toolkits left their own signatures that could be detected.

Then the virus writers figured out how to infect the BIOS of computers. CIH is the classic example. This is an approach that didn't get widespread use, since attacking the BIOS is a hardware-dependent activity. Nevertheless, it caused antivirus vendors to modify their products to be able to clean the damage caused by this kind of bug.

Then came the era of application viruses. MS Office came with a very full-featured macro language that was exploited to send bad code around the world. Yet another AV update was needed to handle this kind of problem.

Extending from the macro viruses came the first mass-mailers. While AV on mail gateways had been around for a while thanks to existing macro viruses, the mass-mailers made it amply clear to everyone that such protection was a requirement. They also provided a nice scale test for those solutions.

And that's about when things stopped being just-for-the-mayhem and moved to profit-driven. You don't see truly destructive viruses anymore. If you've subverted a workstation, you may as well do things to it that'll earn you money, like install a bot-client.

And now we have bots being installed with custom-compiled code! We no longer have "viruses"; we now have malware and spyware. The old regex-engine-based AV methods are still somewhat viable for older threats, but the future is solidly in behavior-based detection. Spam spewers can come in many, many shapes, sizes, and colors. This sort of heuristic detection is a lot harder to code than fancy regex, and it's also a lot harder to make "false-positive proof".

Case in point: you purchase a software package to help you put together newsletters for your quilt shop. Once a month you send out 430 identical emails to your list. A heuristic scanner can see this behavior and start throwing alerts.
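As a toy illustration of why that's a hard call to automate (the thresholds and the rule here are entirely invented, not any product's actual heuristic):

import time
from collections import defaultdict, deque

# Invented rule: flag anything that sends more than 100 near-identical
# messages within a 10-minute window.
WINDOW_SECONDS = 600
MAX_SIMILAR_SENDS = 100

send_history = defaultdict(deque)  # message fingerprint -> send timestamps

def record_send(fingerprint, now=None):
    """Record an outbound message; return True if it looks like a mass-mailer."""
    now = time.time() if now is None else now
    history = send_history[fingerprint]
    history.append(now)
    # Forget sends that have fallen out of the window.
    while history and now - history[0] > WINDOW_SECONDS:
        history.popleft()
    return len(history) > MAX_SIMILAR_SENDS

# The quilt-shop newsletter: 430 identical messages in a few minutes trips
# exactly the same rule a mass-mailing worm would. Hence the false positive.
for second in range(430):
    flagged = record_send("newsletter-july", now=second)
print("flagged as mass-mailer:", flagged)

The rule correctly catches a worm, and just as correctly annoys the quilt shop.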

End users HATE false positives and unwarranted fear. This provides a disincentive to build aggressive heuristic scanners, and encourages vendors to rely on detection databases instead.

One false positive that has annoyed NetWare engineers for years is McAfee's hatred of SERVER.EXE, which causes service-pack installs to bomb. Scanning the file itself, rather than marking it bad based on name and location, would show that this is a file that can't run in Win32 protected mode and is thus no threat. This is a case of an incomplete heuristic. Not scanning the file, though, does provide some speed bonuses.
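For what it's worth, the check itself is cheap. Here's a minimal sketch (my own illustration, not how McAfee's engine actually works; the path is just an example) of peeking at the headers to see whether a file is even a Win32 PE executable:

import struct

def is_win32_executable(path):
    """Rough check: an MZ DOS header whose 0x3C pointer lands on a PE signature."""
    with open(path, "rb") as handle:
        header = handle.read(0x40)
        if len(header) < 0x40 or header[:2] != b"MZ":
            return False
        pe_offset = struct.unpack_from("<I", header, 0x3C)[0]
        handle.seek(pe_offset)
        return handle.read(4) == b"PE\x00\x00"

# NetWare's SERVER.EXE is a real-mode DOS loader rather than a Win32 PE
# image, so a check like this should come back False and the file could
# be skipped rather than condemned on name and location alone.
print(is_win32_executable(r"C:\NWSERVER\SERVER.EXE"))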

Gateway scanners like those for email are largely stuck with database-driven detection methods. Once processing power increases faster than email volume (not likely), you may be able to see in-depth analysis of what a file COULD do once it gets where it is going, but I don't expect to see systems like that for a number of years. End-point scanners like those on workstations have a much richer feature set to play with, and heuristic scanning will work there much better. We just need product.

SUSE driver pack for Windows

Novell released the "SUSE Linux Enterprise Virtual Machine Driver Pack" today. You can find it on the downloads site. A word of warning, though, from the Documentation:

1.9 Avoiding Problems with the Drivers

[...]
  • Upgrading the Linux* kernel of the virtual machine host without upgrading the driver pack at the same time.

So, you can't run it on openSUSE (different kernel), and since SLES10 SP1 has already had a kernel update, you can't use it THERE without a subscription. So no freebie.

But, the fact that they've released it is great. Also, they list Windows XP drivers as part of the download. Yay!

That darned MSA again

I'm not sure where this problem sits, but I'm having trouble with this MSA1500cs and my NetWare servers. I've found a failure case that is a bit unusual, but things shouldn't fail this way.

The setup:
  • NetWare 6.5, SP5 plus patches
  • EVA3000 visible
  • MSA1500cs visible
  • Pool in question hosted on the MSA
  • Pool in question has snapshots
  • Do a nss /poolrebuild on the pool
Do that, and at some point you'll get an error like this one:
 7-19-2007   9:48:22 am:    COMN-3.24-1092  [nmID=A0025]
NSS-3.00-5001: Pool FACSRV2/USER2DR is being deactivated.
An I/O error (20204(zio.c[2260])) at block 36640253(file block
-36640253)(ZID 1) has compromised pool integrity.

The block number changes every time, and the point at which it decides to crap out of the rebuild also changes every time. No consistency. The I/O error (20204) decodes to:

zERR_WRITE_FAILURE 20204 /* the low level async block WRITE failed*/

Which, you know, shouldn't happen. And this error is consistent across the following changes:
  • Updating the HAM driver (QL2300.HAM) from version 6.90.08 (a.k.a 6.90h) to 6.90.13 (6.90m).
  • Updating the firmware on the card from 1.43 to 1.45 (I needed to do this anyway for the EVA3000 VCS upgrade next month)
  • Applying the N65NSS5B patch, I had N65NSS5A on there before
PoolVerifies, a pure Read operation, do not throw this error.

I haven't thrown SP6 on there yet, as this is a WUF cluster node and this isn't intersession ;). This is one of those areas where I'm not sure who to call. Novell or HP? This is a critical error to get fixed as it impacts how we'll be replicating the EVA. It was errors similar to this, and activities similar to this, that caused all that EXCITEMENT about noon last Wednesday. That was not fun to live through, and we really really don't want to have that happen again.

Call Novell

Good:
  • Their storage geeks know NetWare a lot better.
  • Much more likely to know about Fibre Channel problems on NetWare.

Bad:
  • Not likely to know HP-specific problems.
  • More likely to recommend, "Well, then don't move your arm like that," as a solution.

The next step here is to delete these pools and volumes, recreate them, and see if things go Poink in quite the same way. I'm not convinced that'll fix the problem, as the errors being reported are Write errors, not Read errors, and the faulting blocks are different every time. I'm suspecting instability in the Write channel somewhere that is unique to a nss /poolrebuild, as I didn't get these errors when FILLING these volumes. The Write channel in this case has a lot of Fibre Channel in it.

With the release of OES2 pushed to Christmas, or possibly BrainShare 2008, I'm in a hard spot. The magnitude of this migration means that I have one period a year in which I can pull it off, and that is the last week of August and the first two weeks of September. If I don't have code in that period, I can't migrate. Period.

As I learned at BrainShare this year, the Apple Filing Protocol stack on OES2-Linux is not eDirectory integrated. This is a project stopper for us, so we need that to be in place before we migrate. They quoted us, "Possibly SP1 timeframe, definitely not first-customer-ship, but don't hold us to it." They learned of the AFP problem at BrainShare and said they'd get right on it to try and get that in. That told me that summer 2008 would be the earliest I could expect to have the eDir integrated AFP stack.

Since I don't think Novell is planning on pushing the OES2 ship date out to summer 2008, I suspect the AFP stack will come in with SP1. I consider it likely that OES2 SP1 will ship about the same time as SLE10 SP2. Which means I have real strong doubts that I'll be doing an OES2-Linux migration during next year's intersession. So we'll probably end up staying on NetWare for file-serving at least until 2009. In 2009 those NetWare servers may very well be in either an ESX or Xen virtual container, but it'll still be the 32-bit NetWare code doing the serving. That said, the web and print services (MyFiles, MyWeb, iPrint) may move earlier, as they do not have the same AFP dependency.

Our storage needs on the WUF cluster are already pushing the boundaries of the 32-bit memory space. I'd be a lot happier if I could throw another 2 gigs of RAM at the file-servers in order to keep their cache levels at a good spot. Can't do that on 32-bit NetWare, at least not while expecting improved performance. In 2009 we'll be managing anywhere from 12 to 18 terabytes of data on WUF, with a good chunk of it active. That is a situation that screams for the memory limits of a 64-bit address space in order to provide zippy performance.
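Some back-of-the-envelope numbers make the point (the cache sizes below are assumptions for illustration, not measurements off our servers):

# Rough cache-coverage arithmetic; all figures are illustrative assumptions.
data_tb = 15                 # middle of the 12-18 TB estimate
data_gb = data_tb * 1024

cache_sizes_gb = {
    "32-bit NetWare": 2.5,   # roughly what a 32-bit box can devote to file cache
    "64-bit server": 16,     # what a modest 64-bit box could offer instead
}

for label, cache_gb in cache_sizes_gb.items():
    coverage = cache_gb / data_gb * 100
    print(f"{label}: {cache_gb} GB of cache covers about {coverage:.2f}% of {data_tb} TB")

Either way the cache is a sliver of the total data, but the active working set is far smaller than the full 15 TB, so every extra gigabyte of cache buys real hit-rate.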

Thus, I am worried. Please, Novell. Ship at Christmas. It'll make my schedules look a LOT less grim.

A digression on compensation

Several of the people I work with are getting hefty raises this year. This isn't part of any regular increase, it is something of a catch-up for years of being under-paid. I understand how this goes, I had the same thing happen to me at my old job when I got an effective 17% raise thanks to a salary study. I think some of the raises being passed out are around the same levels.

How did this come about? To answer that, I have to take a tour through Labor Relations, a topic I've never touched on before on this blog. But it touches on technician pay, so I figure it's relevant!

Note: I'm not classified, so this doesn't apply to me. I'm exempt.

Note: WWU is a state funded institution, just like the University of Washington. We're all State employees here, even if we're 100% grant funded.

The ball got rolling in 1999 when a class-action suit was filed, Shroll vs. State of Washington.
There are State employees in two parallel personnel systems, one for general government agencies and one for higher education institutions. This case is a class action on behalf of all state workers at general government agencies or at higher education institutions who perform the same duties but are paid a lower basic salary range than their counterparts in corresponding job classes in other personnel system (i.e., either general government or higher education). Thus, the case challenged the fact that state workers in these "common classes" were performing "equal work" but were not being paid "equal pay." The case sought back pay and a change in the compensation system.
To summarize, an INFORMATION TECHNOLOGY SPECIALIST 3 at the Washington State Department of Transportation (WS/DOT) and one at the University of Washington were paid differently for exactly the same job duties. The suit alleged that this was unfair, and ultimately the state Supreme Court agreed. This year the pay scales were unified. Areas where general government were paid less were brought up to the higher-ed levels, and areas where higher-ed was paid less than general government (most tech positions) were brought up to general government levels.

But that isn't all! Throw into the mix that the Democrats now hold both houses and the governor's mansion. Democrats, being hip to organized labor, have taken their own steps to kick the Civil Service structure into shape. One way they're doing that is by running a salary survey and reorganizing pay scales and pay grades for job classes to reflect what's out there in the private sector.

Washington has been doing salary surveys on job classes since the 1970s. According to the HR person who spoke to us yesterday, until a few years ago they had only DONE SOMETHING with that data twice in all that time. And "done something" means reorganizing pay scales and grades based on the salary data. One of the changes being introduced is that this reorganization will now happen every two years, so job classes experiencing wage inflation in the private sector (some tech classes) will by proxy drag up the compensation of civil servants such as the ones I work with.

There is a law on the books now that says job classes being paid less than the 25th percentile of the wage in the salary survey will be brought up to the 25th percentile. This is the OTHER thing causing raises among the people I work with, as tech pay here at WWU has lagged significantly according to the statewide salary survey (and 40% of the state's population lives in highly urban, tech-heavy King County, not half-rural Whatcom County). I don't know where they ranked according to the survey, but I know it was below the 25th percentile.
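To make the mechanics concrete (the salary figures below are invented, not from the actual survey), a quick sketch of the rule:

# Invented survey sample for one job class; the rule raises anyone paid
# below the 25th percentile up to that mark.
survey_salaries = sorted([48_000, 52_000, 55_000, 61_000, 66_000, 72_000, 80_000, 95_000])

def nearest_rank_percentile(sorted_values, pct):
    """Simple nearest-rank percentile of an already-sorted list."""
    index = max(0, int(round(pct / 100 * len(sorted_values))) - 1)
    return sorted_values[index]

floor = nearest_rank_percentile(survey_salaries, 25)
current_pay = 45_000
adjusted_pay = max(current_pay, floor)
print(f"25th percentile: {floor:,}; adjusted pay: {adjusted_pay:,}")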

With the salary survey being applied every two years, it is possible that certain tech classes will be getting raises above and beyond the cost-of-living raises we all get. This is a good thing in the end, because it's a way to limit turnover. Having myself lived through a "drastically underpaid" situation in the 1996-99 .com boom, and the resulting reclassification and reorganization, I really understand how this feels. And it is a good feeling to feel appreciated again.

Mmm. Needed reviews.

AnandTech is going to be reviewing power-supplies!

This is nifty because:
  • Power supplies are HARD to test right. Much harder than measuring frame-rates in an FPS for vid-card performance.
  • Getting details about what the efficiency label on the side of the box really means is very good
  • Getting an idea as to what power-supply manufacturers build good supplies, and which are fly-by-nights with lots of bling is very good
While it won't affect my job a whole lot (probably very little in fact), having these out there will be good all-around knowledge. Google was famous a few months ago for their push for higher efficiencies in server-class power-supplies. I don't think AnandTech will be testing server-class, OEM supplies but I may be surprised. And who knows, perhaps having this testing out there will spur a bit of good innovation in the industry as a whole.
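For a sense of why those efficiency numbers matter, a quick bit of made-up arithmetic (the wattages are illustrative, not from any particular supply):

# What an efficiency rating means for a box drawing 300 W of DC load.
dc_load_watts = 300

for label, efficiency in [("cheap 70%-efficient supply", 0.70), ("80%-efficient supply", 0.80)]:
    ac_draw = dc_load_watts / efficiency
    waste_heat = ac_draw - dc_load_watts
    print(f"{label}: pulls {ac_draw:.0f} W from the wall, {waste_heat:.0f} W of that as heat")

Multiply that 50-odd watt difference by a rack of servers running around the clock, and Google's interest in the topic makes plenty of sense.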

I have managed to get the new beta of the NCL 2.0 working on my openSUSE 10.2 workstation. This is very nice. Some nice details can be found here in the Novell support groups. My steps were rather similar.
  1. Install the referenced RPM. I did it with an "rpm -i [rpm-name]". Use the RPM for your processor type. I'm using 64-bit and it worked just fine for me.
  2. Run the "ncl-install" from the beta download.
That was pretty much it. It isn't perfect, but it is w-a-y better than using NetStorage and WebDav for this stuff. One area of imperfection is that the tray icon gets smooshed.
[Screenshot: NWTRAY getting smooshed]
See that little sliver of an icon between the magnifying glass and the vertical bar? That's the nwtray icon. It's about 2 pixels wide. If I can click on it I get the full NWTRAY menu just fine, but it's hard to hit.

The other problem is the "Novell Services" button in Nautilus. When I click that button, it looks like GNOME crashes. I haven't been able to find out where the dump traces are going, so I don't know what's up with that. If I access the services from the 2-pixel-wide NWTRAY, things work just fine, though.

Throughput still sucks, though. The Windows client is still better for that. But... throughput is w-a-y better than using a WebDav connection! Progress!

More fun OES2 tricks

I had an idea while I was googling around a bit ago. This may not work the way I expect as I'm not 100% on the technologies involved. But it sounds feasible.

Let's say you want to create a cluster mirror of a 2-node cluster for disaster recovery purposes. This will need at least four servers to set up. You have shared storage for both cluster pairs. So far so good.

Create the four servers as OES2-Linux servers. Set up the shared storage as needed so everything can see what it should in each site. Use DRBD to create new block devices that'll be mirrored between the cluster pairs. Then set up NetWare-in-VM on each server, using the DRBD block devices as the cluster disk devices. You could even put SYS: on the DRBD block devices if you want a true cluster clone. That way, when disk I/O happens on the clustered resources it gets replicated asynchronously to the DR site; unlike software RAID1, the I/O is considered committed when it hits local storage, whereas SW RAID1 only considers writes committed when all mirrored LUNs report the commit.
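A toy model of why that commit behavior matters for write latency (the millisecond figures are invented for illustration; the point is only the shape of the difference):

# Invented latencies: fast local array, slow WAN link to the DR site.
LOCAL_WRITE_MS = 0.5
REMOTE_WRITE_MS = 0.5
WAN_ROUND_TRIP_MS = 20.0

def sync_mirror_commit_ms():
    """RAID1-style mirroring: the write isn't done until every mirror acks."""
    return max(LOCAL_WRITE_MS, WAN_ROUND_TRIP_MS + REMOTE_WRITE_MS)

def async_replica_commit_ms():
    """Asynchronous replication: the local ack completes the write, the copy trails behind."""
    return LOCAL_WRITE_MS

print("sync mirror commit:   %.1f ms" % sync_mirror_commit_ms())
print("async replica commit: %.1f ms" % async_replica_commit_ms())

The trade, of course, is that an asynchronous DR copy can be a little behind the primary when the primary site dies.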

Then, if the primary site ever dies, you can bring up an exact replica of the primary cluster, only on the secondary cluster pair. Key details like how to get the same network in both locations I leave as an exercise for the Cisco engineers. But still, an interesting idea.

Getting creative with Blackboard

I had me an idea yesterday. One of those ideas that I'm not sure is a good one, but wow does it make a certain kind of sense.

We, like all too many schools, run Blackboard as the groupware product supporting our classrooms. There is an open-source product out there that can also do this, but we're not running it. That's not what this post is about.

First, a wee bit of architecture. Roughly speaking, Blackboard is separated into three bits: the web server, the content server, and the database. The web server is the classic Application Server that students and teachers interface with. The web server then talks with both the content server and the database server. The content server is the ultimate home of things like turned-in homework. The database server glues this all together.

Due to policies, we have to keep courses in Blackboard for a certain number of quarters, just in case a student challenges a grade. They may not be available to everyone, but those courses are still in the system, and so is all of the homework and the assorted files associated with each class. Because of this, it is not unusual for us to have two years (6-7 quarters) of classes living on the content server, of which all but one quarter is essentially dead storage.

One of the problems we've had is that when it comes time to actually delete a course, Blackboard doesn't always clean up the content associated with that course. Quite annoying.

This is a case where Dynamic Storage Technology would be great. Right now our Blackboard content servers are a pair of Windows servers in a Windows cluster. It struck me yesterday that this function could be fulfilled by a pair of OES2 servers in a Novell Cluster Services setup (or Heartbeat, but I don't know how to set THAT up), using Samba and DST to manage the storage. That way, stuff that has been accessed in the past, oh, 3 months would be on the fast EVA storage, and stuff older than 3 months would be exiled to the slow MSA storage. As the file-serving is done by way of web servers rather than direct access, the performance hit from using Samba won't be noticeable, since the concurrency is well below the level where that becomes a problem. Additionally, since all the files are owned by the same user, I could use a non-NSS filesystem for even faster performance.
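The age-based split itself is simple enough to sketch (this is only an illustration of the policy, not how DST is actually configured; the mount points are made up):

import os
import shutil
import time

# Made-up mount points for the two tiers of content storage.
FAST_TIER = "/srv/bbcontent/fast"    # EVA-backed
SLOW_TIER = "/srv/bbcontent/slow"    # MSA-backed
AGE_LIMIT_SECONDS = 90 * 24 * 3600   # roughly three months

def demote_stale_files():
    """Move anything not accessed in ~3 months from the fast tier to the slow tier."""
    now = time.time()
    for dirpath, _dirnames, filenames in os.walk(FAST_TIER):
        for name in filenames:
            source = os.path.join(dirpath, name)
            if now - os.stat(source).st_atime > AGE_LIMIT_SECONDS:
                destination = os.path.join(SLOW_TIER, os.path.relpath(source, FAST_TIER))
                os.makedirs(os.path.dirname(destination), exist_ok=True)
                shutil.move(source, destination)

if __name__ == "__main__":
    demote_stale_files()

DST's whole pitch is that it does the demotion and presents the merged view for you; the point here is just how cheap the "not touched in 90 days" rule is to express.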

Hmmmm......

The problem here is that OES2 isn't out yet. Such a fantastical idea may be doable in the 2008 intersession window, but we may have other upgrades to handle there. But still, it IS an interesting idea.

Dynamic Storage Technology

Novell Connection Magazine has an article up right now that describes DST, formerly known as Shadow Volumes. I've talked about them before, both last year around this time (6/15/07, and 6/26/07) and back at BrainShare (TUT205). So, I've been following this.

As said previously, this'll not work for NetWare, just OES-Linux. From what I understand, you can host migration volumes on NetWare, but the server presenting the unified view of the storage has to be OES-Linux.

Anyway, on with the article.

OES2: pushed several months

A new post up on Cool Blogs shows where OES2 is sitting:

http://www.novell.com/coolblogs/?p=921

To quote from one of the comments by the author:
There will be a public beta. It might take couple of months more for a public beta.
This blows my schedule. From the sounds of it, they're looking at a Christmas or possibly BrainShare 2008 release. We'll have to put NetWare inside an ESX server instead of a Xen paravirtualized machine. Due to this delay, and the presumed SP1 schedule, the chances of Novell making our summer 2008 intersession migration window are now much worse.

Crap.

Disk failure rates

My boss forwarded us this article:

Opinion: Real-world disk failure rates offer surprises.

Apparently a pair of studies of large populations of disks have been released. Both cover over 100,000 disk drives, and both looked at real-world failure rates. What they show is that the MTBF reported by drive manufacturers is incorrect. They show several other things as well.

Remember the long-standing SysAdmin wisdom that you get a few drive failures within the first few months, then not many, then more as the drives age? Bunk. The failure curve over time doesn't look like that at all.

The study also shows that the real-world MTBF for SATA is no different than SCSI. And the real-world MTBF of SCSI drives is no different than Fibre Channel. They also see failure rates increasing significantly after 3 years of age, not the 5 years of age that the MTBF numbers would suggest.

Another thing indicated is that S.M.A.R.T. errors do correlate with a much greater chance of failure in the near term, but such drives have a solid chance of running for another year without a hitch. That said, many failures are not presaged by SMART errors at all. Customers with massive RAID systems (think Raid 6) may not care about SMART failures as internal redundancy renders such predictive failures moot. On the other hand, home users really should replace drives after the first SMART error.

Another interesting item they found is that environmental temperature does not affect drive failure rates, up to a point. Get too cold, under 17C (63F), and failure rates increase. Get too hot, and "hot" here is really hot, and failure rates go up as well. Other systems go kablam at high temperatures, so disk failures are not the top thing to worry about if you have a "heat event" in your datacenter.

As for failure rates, the study uses what they call the Annualized Replacement Rate (ARR). This is the likelihood of any given disk being replaced in a given year during its 5-year lifespan. The observed ARR came to about 3%, where the ARR based on datasheet information puts it under 1%. The observed ARR varied markedly from site to site, and the study did not theorize as to why that would be. As an anecdote, one dataset had drives that were 7 years old at the end of the study, and that population had an ARR of 24%.
Observation 1: Variance between datasheet MTTF and disk replacement rates in the field was larger than we expected. The weighted average ARR was 3.4 times larger than 0.88%, corresponding to a datasheet MTTF of 1,000,000 hours.

Observation 2: For older systems (5-8 years of age), data sheet MTTFs underestimated replacement rates by as much as a factor of 30.

Observation 3: Even during the first few years of a system's lifetime [...]

Observation 4: In our data sets, the replacement rates of SATA disks are not worse than the replacement rates of SCSI or FC disks. This may indicate that disk-independent factors, such as operating conditions, usage and environmental factors, affect replacement rates more than component specific factors. However, the only evidence we have of a bad batch of disks was found in a collection of SATA disks experiencing high media error rates. We have too little data on bad batches to estimate the relative frequency of bad batches by type of disk, although there is plenty of anecdotal evidence that bad batches are not unique to SATA disks.
Very interesting stuff! Studies like these can lead to new methods of labeling for drives.
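The datasheet figure quoted in Observation 1 is easy to reproduce with a little arithmetic (assuming the usual 24x7 power-on hours):

# Convert a datasheet MTTF into the annual replacement rate it implies.
HOURS_PER_YEAR = 24 * 365            # 8,760 power-on hours for a 24x7 disk

datasheet_mttf_hours = 1_000_000
datasheet_arr = HOURS_PER_YEAR / datasheet_mttf_hours
print(f"datasheet ARR: {datasheet_arr:.2%}")          # about 0.88%

observed_arr = 0.03                  # the roughly 3% seen in the field
print(f"observed is about {observed_arr / datasheet_arr:.1f}x the datasheet figure")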

The joys of working REALLY late

I'm doing some FibreOS updates to a few SAN switches, and this of course requires me to take down large chunks of things. While I'm watching progress bars, Blackboard and Exchange are both down. WUF isn't down for the sole reason that it is our only SAN-based service that has multiple hosts on different SAN switches. That'll be changing tonight, though.

But, as it is Saturday night, radio on the drive in is interesting. Flipping around, I found a station in Vancouver playing what sounds like Indian pop. Very interesting! I don't get much exposure to "World" music, and I'd like more. Over the past few years I've heard some very interesting things: techno/dance, which I didn't expect to hear broadcast; Indian; lots of blues.

Plus, thanks to the hour of night I can park right at the back door and not have to worry about getting a ticket from the Campus Police. :)