Cluster realignments

One of my projects for the break is to realign which services run on which nodes in the cluster. We have six nodes, and previously there was a none-shall-pass division between the three Student nodes and the three Faculty/Staff nodes. That division is gone in the drive for more reliable file-serving.

Without running the numbers, I'm guessing that 90-95% of the unexpected volume failovers are due to an application crashing and taking the node down with it. Last year and early this year we had a lot of problems with NDPS. Lately it has been SSHD and NetStorage. I've recently been reminded that any time you run a "sshd reload" from the console, you run a very real risk of a crash in the next few minutes.

While our overall downtime is w-a-y better than it was pre-cluster, the frequency of multi-second outages has unfortunately gone w-a-y up. Before, it could be 12 minutes before a crashed server got to the point where it could serve files again. Now it's 12-45 seconds, but we get them a lot more often. We're trying to reduce even these small downtimes.

To do that we're dividing the cluster into two halves: the file-server side and the application side. For technical reasons, printing overlaps a little. This is made possible by improvements in LibC that permit things like MyWeb and SFTP to work reliably from servers that aren't also hosting the files. I couldn't have done this 4 months ago.

One of the side-effects of this is figuring out how to get myweb.students.wwu.edu and myweb.facstaff.wwu.edu to share a web-server and still be able to get separate logging from each side. On the surface this is trivial. Unfortunately, the NetWare application environment once again makes things more difficult.

Unlike on Linux, if you remove the access.log file as part of the rotation process, it isn't re-created the next time someone hits the web server. No transactions after the access.log file is removed/rotated get logged, and getting logging back requires a web-server restart. This is how Apache1.3 behaved; the only difference is that with Apache2 you can read the access.log file while Apache is running.

Novell includes a ROTLOGS.NLM that you can pipe log output through to get rotations.

CustomLog "|sys:/apache2/bin/rotlogs.nlm sys:/apache2/logs/myfiles_access_log 5M" common

This works great for one logfile per Apache instance. Unfortunately, I need to run it three times in the same instance to provide for different log-files. Rotlogs doesn't like loading multiple times like that, so it has a tendency to crash out the memory space after a couple of hours of normal load. Hardly sporting. Clearly this would present issues in the OS memory space, so I haven't tried doing it over there even for testing.

Since I'm running these web-servers in a cluster, any one of three nodes could be running any one of three services. I can't just create a script that unloads and then reloads Apache, since Apache will bomb out unless it can bind a listener to every address it's configured to run on. Too tricky.

The solution is to modify the log format and then perform post-processing to split out the separate logs:

LogFormat "%A %h %l %u %t \"%r\" %>s %b \"%{Referer}i\" \"%{User-Agent}i\"" comb-vhost

By prefixing the pre-built "combined" LogFormat definition with the %A format string, each log-line starts with the IP of the VirtualServer that serviced the request. Then some scripting trickery later and I have three split log-files that look just like the standard "combined" format! So far, it is working well. We'll see if things hold up.
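
For the curious, here is a rough sketch in Python of what that kind of post-processing could look like. It is not the script actually in use. The IP addresses, the output file names, and the function names are all made up for illustration. The only thing it takes from the format above is that the first space-separated field on each line is the %A address, and everything after it is an ordinary "combined" line.

```python
import os

# Hypothetical mapping from the %A address (first field of each line) to the
# log file that line belongs in. Placeholder addresses and names only.
VHOST_LOGS = {
    "192.0.2.10": "students_access_log",
    "192.0.2.11": "facstaff_access_log",
    "192.0.2.12": "myfiles_access_log",
}


def split_line(line):
    """Split a comb-vhost line into (vhost_ip, remainder).

    Returns (None, line) if the line has no space to split on.
    """
    ip, sep, rest = line.partition(" ")
    if not sep:
        return None, line
    return ip, rest


def split_log(lines, mapping=VHOST_LOGS, default="unknown_access_log"):
    """Group lines by target log name, dropping the leading %A field.

    Lines whose first field is not a known address are kept unmodified
    under the default name so nothing is silently lost.
    """
    grouped = {}
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        ip, rest = split_line(line)
        if ip in mapping:
            grouped.setdefault(mapping[ip], []).append(rest)
        else:
            grouped.setdefault(default, []).append(line)
    return grouped


def write_logs(grouped, directory="."):
    """Append each group of lines to its own file under directory."""
    for name, lines in grouped.items():
        with open(os.path.join(directory, name), "a") as handle:
            handle.writelines(lines)
```

Something like write_logs(split_log(open("combined_vhost_log")), "split") would then leave one file per virtual server, each in plain "combined" format, ready for whatever log analyzer you already use. Splitting on the first space is safe here because an IP address never contains one.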