April 2007 Archives

Probably the biggest news: Ted Hagger is leaving Novell. Though the message says "March 24th", he said in a session on Saturday, "Well, as of Wednesday I'm no longer with Novell," which would imply that Tuesday, April 24th was his last day. March 24th was the Saturday after BrainShare.

Whoa.

He's moving to a company that's doing Web 2.0 work. As that's soooo not my field, Ted is likely dropping off my radar. He will not be doing any more Novell Open Audio.

Also, I was inspired by a session on OpenID to do a few things differently around here. I'm not sure we'll become an OpenID provider, but it is within the realm of possibility.

I learned more about Xen virtualization, which is nifty as I need to know that stuff.

http://yro.slashdot.org/yro/07/04/25/219257.shtml

Yep, they've gone and blocked all P2P sharing.

Is this something we do? For that, I refer you to the ResTek group since they're the ones handling that end of it.

From what I understand they're using quality-of-service methods to provide a disincentive for P2P. Regular traffic is set to a fairly low priority. Known-good traffic is bumped up, and known-good is fairly permissive. I know they regularly bump up game servers in priority. I have no idea what kinds of throughput bittorrent gets from their networks. They also run a caching proxy for HTTP traffic that is set to a very high priority in order to make normal web traffic run at a good speed (the downside of that is, of course, logging, which I know some students don't like).
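Just to make the effect concrete, here's a toy strict-priority model in Python. This is emphatically not ResTek's actual configuration; the traffic classes and numbers are made up. It only illustrates why traffic at the bottom of the priority list gets whatever the link has left over once things are congested:

    # Toy strict-priority scheduler -- illustration only, not ResTek's config.
    import heapq
    from collections import defaultdict

    # Lower number = served first. These classes and priorities are invented.
    CLASSES = {
        "http-proxy": 0,   # caching proxy traffic, very high priority
        "game":       1,   # known-good traffic, bumped up
        "default":    5,   # regular traffic, fairly low priority
        "p2p":        9,   # known P2P, bottom of the heap
    }

    def drain(queue, link_capacity_pkts):
        """Serve up to link_capacity_pkts packets in one tick, strictly by priority."""
        served = defaultdict(int)
        for _ in range(link_capacity_pkts):
            if not queue:
                break
            _, cls = heapq.heappop(queue)
            served[cls] += 1
        return dict(served)

    # One congested tick: 150 packets offered, room for only 100 on the link.
    offered = {"http-proxy": 30, "game": 20, "default": 40, "p2p": 60}
    queue = []
    for cls, pkts in offered.items():
        for _ in range(pkts):
            heapq.heappush(queue, (CLASSES[cls], cls))

    print(drain(queue, link_capacity_pkts=100))
    # -> {'http-proxy': 30, 'game': 20, 'default': 40, 'p2p': 10}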

And most importantly, ResTek's network is physically separate from WWU Campus. This was done before I got here, and is something I've come to appreciate. I have friends who work on a campus of about our size that has the res-net on the main campus link. Their bandwidth bill is much higher than ours thanks to that.

Migrating a tricky application

We have a line-of-business application that is used by a small department. Like many financial applications, this one is consistently about 4-6 years behind current trends in application development. Version 9, which was released just a few months ago, is finally web-based. The version we're running, 8.something, is still built on the same model Access databases use: file-based databasing.

This particular application is excruciatingly sensitive to oplocks. We've fought this application for years as a result of that. Why is it so sensitive?

Any long-time NetWare admin will tell you about PDOXUSRS.NET files. This particular application uses the same kind of access model: one file mediates who is authorized to access the application as a whole. While users are in the application they keep that file open, and update it with their application-level lock. Okay so far.

The problem comes with oplocks. How it is supposed to work:
  1. Station 100 opens LICENSE.LOG and requests an oplock on it
  2. Server, seeing no other stations with that file open, grants the oplock.
  3. Station 100 copies LICENSE.LOG to memory, thus improving access times to it.
  4. Station 105 opens LICENSE.LOG and requests an oplock on it.
  5. Server, seeing Station 100 has an oplock on it, tells Station 100 to release its oplock.
  6. Station 100 writes LICENSE.LOG to the server, and releases its oplock.
  7. Server tells Station 105 it can open the file, but can't have an oplock.
  8. Station 105 accesses the file without an oplock.
The problem comes when things break:
  1. Station 100 opens LICENSE.LOG and requests an oplock on it.
  2. Server, seeing no other stations with that file open, grants the oplock.
  3. Station 100 copies LICENSE.LOG to memory, thus improving access times to it.
  4. Station 100 crashes hard. It does not reset its connection to the server, and the Watchdog doesn't scavenge it.
  5. Station 105 opens LICENSE.LOG and requests an oplock on it.
  6. Server, seeing Station 100 has an oplock on it, tells Station 100 to release its oplock.
  7. Station 100 is no longer there.
  8. Server waits for Station 100 to release its oplock, which will never happen.
  9. Station 105 doesn't get its lock. Application will not load for anyone else.
  10. Server Admin hard clears Station 100.
  11. Application Admin goes into LICENSE.LOG and cleans up bad entries.
  12. Application Admin tells all stations running Application to log out and log in again.
Not the most robust process, there.
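Since I end up explaining this sequence a lot, here's a minimal sketch of it in Python. None of these names come from the CIFS spec or from the application; it just shows where everything hangs once the oplock holder is gone:

    # Minimal model of the oplock-break hang described above (illustration only).
    class Station:
        def __init__(self, name, alive=True):
            self.name, self.alive = name, alive

        def break_oplock(self):
            # A live station flushes its cached copy and acknowledges the break.
            # A crashed station never answers.
            return self.alive

    class Server:
        def __init__(self):
            self.oplock_holder = None   # station currently holding the oplock
            self.open_handles = set()

        def open_file(self, station):
            if self.oplock_holder and self.oplock_holder is not station:
                # Someone else holds the oplock: tell them to release it first.
                if not self.oplock_holder.break_oplock():
                    # Steps 7-9: the holder crashed, the Watchdog didn't scavenge
                    # it, and nobody else gets into LICENSE.LOG until an admin
                    # hard-clears the connection.
                    raise TimeoutError("waiting on an oplock break that will never come")
                self.oplock_holder = None
            self.open_handles.add(station.name)
            if len(self.open_handles) == 1:
                self.oplock_holder = station          # first opener gets the oplock
                return "opened with oplock"
            return "opened without oplock"

    server = Server()
    s100, s105 = Station("100"), Station("105")
    print(server.open_file(s100))    # "opened with oplock"
    s100.alive = False               # Station 100 crashes hard
    try:
        print(server.open_file(s105))
    except TimeoutError as err:
        print("Station 105:", err)   # application will not load for anyone else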

One of the things this vendor has done is decertify NetWare as a valid file server for storing this stuff, which is why I migrated the directory to a Windows 2003 server last night. And even there, they had us do a reg-hack to turn off oplock support in Windows. They REALLY do not like oplocks.
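I won't swear this is the exact value the vendor's install doc had us set (their doc is the authority, not me), but the commonly documented server-side switch on Windows Server 2003 is EnableOplocks under LanmanServer\Parameters. A sketch of that change using Python's winreg module, run as an administrator on the file server; the Server service needs a restart (or a reboot) before it takes effect:

    # Disable oplocks on the SMB server side -- assumed to be the same knob the
    # vendor's doc pointed at; verify against their instructions before using.
    import winreg

    path = r"SYSTEM\CurrentControlSet\Services\LanmanServer\Parameters"
    with winreg.OpenKey(winreg.HKEY_LOCAL_MACHINE, path, 0, winreg.KEY_SET_VALUE) as key:
        winreg.SetValueEx(key, "EnableOplocks", 0, winreg.REG_DWORD, 0)  # 0 = oplocks off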

Once this app goes web-based, it should help reduce the problems we have with that license file. I hope.

Joke from a friend

Notes posted on a malfunctioning coffee machine:
Post-it 1: Do not use. Leaks.
(handwriting changes)
Post-it 2: Denial of Service attack, pri 1, sev 2.

Post-it 3: Mitigation: use other machine.
(handwriting changes)
Post-it 4: Repairman patch expected OOB on Friday
I love geek humor!

Recent events in Virginia sparked discussions today about what we would do if something like that happened to us. All that national attention is akin to getting www.wwu.edu slashdotted, especially any emergency page we might think to prepare. This is why, even in this day and age, old-fashioned media is the best way to get a specific message to LOTS of people. The WWU front page as it exists RIGHT NOW would melt the web server if something drawing that kind of national attention occurred.

That said, given warning we could put together a server that can handle slashdotted loads. We know how. A static page works best, and we have enough web servers scattered about that running the page through the BigIP to spread the load over 12 servers would let us keep up. Heck, I still maintain that the MyWeb servers could handle those loads on their own if given the go-ahead.

Running a server with a database of all the students, staff, campus visitors, and Bellingham residents who are confirmed to be Not Dead (the sort of information most in demand by people worried about them) is a lot more work and a lot more resource-intensive. Anything database-driven requires orders of magnitude more resources to support that level of load.

This isn't something we've felt the need to prepare for, though. We do have an emergency page that can be hosted off-site, somewhere, but it isn't designed for this type of disaster. It was designed for a Katrina-level event (or, more likely in our case, a &*!$ big earthquake in the area), where the school is closed and the whole region is suffering. Something like the previous paragraph could be hosted in town, even. Heck, even Mt. Baker popping wouldn't do us in because:
  1. We're up wind, so the ashfall wouldn't hit us.
  2. WWU is not in any of the historic lahar paths.
  3. Baker has no history of 'catastrophic flank collapse' eruptions (like Mt. St. Helens in 1980).
Who knows. These are the sorts of events that change disaster planning nationwide.

OES2, not until 2008

The revelation about AFP in OES2 (how did I miss that?) is the last nail. OES2 will not be rolled out to the WUF cluster until August/September 2008 at the earliest. We'll be staying on NetWare until then. We have a couple of Mac labs and at least one class track that depend on AFP support. CIFS is not an option, for many reasons.

So we will be waiting until Novell catches up. In the meantime our 'utility' servers could possibly move, but there aren't many of them: the other two NDS servers, plus the server ATUS hosts their Ghost images on. We're already running OES on one of the NDS servers. The other two are the SLP DAs for our environment, and they also house the DFS databases.

OES2 and AFP

If you're an institution of education like us, chances are real good you have PowerBooks and other Mac hardware desiring access to your NetWare/OES servers. It turns out I missed something while at BrainShare: OES2-Linux does NOT have an eDir-integrated AFP stack like NetWare does. Whoa.

Details here: http://www.novell.com/coolblogs/?p=836

That's Jason Williams posting, and he is the Project Manager for OES. I spoke with him for a while during Meet the Experts regarding the concurrency concerns we have with OES in general. He has been on Novell Open Audio several times, so I know his voice. He was run downright ragged during BrainShare, which is not at all surprising given his level of oversight of a major product.

He's asking people who need AFP to talk to them about it. The details of what he's looking for are in the posting I linked above. I've sent in my own impressions, and I've forwarded it to internal people who are Very Concerned about how Macs interact with our NetWare servers.

Concurrency, again

I performed another test on Friday for concurrency. I had 9 workstations performing an iozone throughput test. Each machine ran 20 threads, each working against a 15MB file, for a total working set of 2.7GB, which fits into the server's RAM. The results from the workstations were pretty consistent. The workstations had all of 384MB of RAM in them, and the number of IOZone threads running caused significant page-faulting, which has the side effect of minimizing client-side caching. The workstations were connected to the core by way of 100Mb Ethernet, so the maximum theoretical speed is 12.5MB/s per station.

Some typical results (units are KB/s):

  Initial write:   11058.47
  Rewrite:         11457.83
  Read:             5896.23
  Re-read:          5844.52
  Reverse read:     6395.33
  Stride read:      5988.33
  Random read:      6761.84
  Mixed workload:   8713.86
  Random write:     7279.35

Consistently, write performance is better than read performance. On the tests that benefit greatly from caching, reverse read and stride read, performance was quite acceptable. All nine machines wrote at near flank speed for 100Mb Ethernet, which means the 1Gb link the server was plugged into was doing quite a bit of work during the initial-write stage.
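A quick back-of-the-envelope check on those numbers (the figures come from the test above; the decimal-versus-binary unit handling is deliberately sloppy, so treat the results as rough):

    # Rough sanity check of the iozone run: working-set size, per-station wire
    # ceiling, and how much of the server's gigabit uplink the writes used.
    stations = 9
    threads_per_station = 20
    file_mb = 15

    working_set_gb = stations * threads_per_station * file_mb / 1000
    print(f"working set: {working_set_gb:.1f} GB")       # 2.7 GB, fits in server RAM

    fast_ethernet_mbs = 100 / 8                           # 100Mb link -> 12.5 MB/s per station
    initial_write_mbs = 11058.47 / 1024                   # KB/s -> MB/s, about 10.8
    print(f"per-station write: {initial_write_mbs:.1f} of {fast_ethernet_mbs:.1f} MB/s ceiling")

    gigabit_mbs = 1000 / 8                                # server's 1Gb uplink, 125 MB/s
    aggregate_mbs = stations * initial_write_mbs
    print(f"aggregate write: {aggregate_mbs:.0f} MB/s on a {gigabit_mbs:.0f} MB/s uplink")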

What is perhaps most encouraging is that CPU loading on the server itself stayed below saturation. Having spoken with some of the engineers who write this stuff, this is not surprising: they've spent a lot of effort making sure incoming requests can be fulfilled from cache rather than going to disk. Going to disk is more expensive on Linux than on NetWare for architectural reasons. Had the working set been 4GB or larger, I strongly suspect CPU loading would have been significantly higher. Unfortunately, as school is back in session I can't 'borrow' that lab right now, since the tests consume 100% of the workstations' resources. Students would notice that.

The next step for me is to figure out how large the 'working set' of open files on FacShare is. If it's much bigger than, say, 3.2GB, we're going to need new hardware to make OES work for us. This won't be easy. The majority of the open-file volume is Outlook archives (.PST files) for Facilities Management. PST files are low-performance critters, so I don't care if they're slow. I do care about things like Access databases, though, so figuring out what my 'active set' actually is will take some work.
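I haven't settled on how to actually measure it yet. One rough first pass, assuming recently-accessed files are a decent proxy for the active set (the share path and the seven-day window here are placeholders, not anything we've decided on):

    # Rough active-set estimate: total up recently-accessed files on the share,
    # split into PST archives (don't care) and everything else (do care).
    import os
    import time

    VOLUME = r"\\server\FacShare"    # placeholder path
    RECENT_DAYS = 7                  # placeholder window

    cutoff = time.time() - RECENT_DAYS * 86400
    active_bytes, pst_bytes = 0, 0
    for root, _, files in os.walk(VOLUME):
        for name in files:
            try:
                st = os.stat(os.path.join(root, name))
            except OSError:
                continue                        # skip files we can't stat
            if st.st_atime < cutoff:
                continue                        # not touched recently
            if name.lower().endswith(".pst"):
                pst_bytes += st.st_size
            else:
                active_bytes += st.st_size

    print(f"recently touched, non-PST: {active_bytes / 2**30:.1f} GB")
    print(f"recently touched, PST:     {pst_bytes / 2**30:.1f} GB")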

Long story short: With OES2 and 64 bit hardware, I bet I could actually use a machine with 18GB of RAM!