Recently in clustering Category

The things you learn

We had cause to learn this one the hard way this past week. We didn't know that Windows Server 2008 (64-bit) and Symantec Endpoint Protection just don't mix well. The problem affected SMBv1 clients; SMBv2 clients (Vista, Win7) were unaffected.

The presentation of it at the packet-level was pretty specific, though. XP clients (and Samba clients) would get to the second step of the connection setup process for mapping a drive and time out.

  1. -> Syn
  2. <- Syn/Ack
  3. -> NBSS, Session Request, to $Server<20> from $Client<00>
  4. <- NBSS, Positive Session Response
  5. -> SMB, Negotiate Protocol Request
  6. <- Ack
  7. [70+ seconds pass]
  8. -> FIN
  9. <- FIN/Ack
Repeat two more times, and 160+ seconds later the client times out. The timeouts between the retries are not consistent so the time it takes varies. Also sometimes the server issues the correct "Protocol Request Reply" packet and the connection continues just fine. There was no sign in any of the SEP logs that it was dropping these connections, and the Windows Firewall was quiet as well.
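If you want to spot this signature in a capture without eyeballing it, something along these lines could work. This is only a rough sketch in Python: the `Packet` record, the `negotiate_stalled` helper, the 60-second threshold, and the timestamps are all made up for illustration, and the summaries are assumed to be whatever your packet decoder prints.

```python
from dataclasses import dataclass


@dataclass
class Packet:
    t: float           # seconds since the start of the capture (illustrative values)
    from_client: bool  # True for "->", False for "<-"
    summary: str       # decoder summary line, assumed format


# Assumed threshold; the trace above showed 70+ seconds of silence.
STALL_SECONDS = 60.0


def negotiate_stalled(packets):
    """Return True if the client sent an SMB Negotiate Protocol Request,
    never got an SMB-level answer, and gave up with a FIN after a long wait."""
    request_time = None
    for p in packets:
        if p.from_client and "Negotiate Protocol Request" in p.summary:
            request_time = p.t
        elif request_time is not None:
            if not p.from_client and p.summary.startswith("SMB"):
                return False  # the server did answer at the SMB layer
            if p.from_client and p.summary == "FIN":
                return p.t - request_time >= STALL_SECONDS
    return False


# The failing conversation from the list above, with invented timestamps.
trace = [
    Packet(0.00, True, "SYN"),
    Packet(0.01, False, "SYN/ACK"),
    Packet(0.02, True, "NBSS Session Request"),
    Packet(0.03, False, "NBSS Positive Session Response"),
    Packet(0.04, True, "SMB Negotiate Protocol Request"),
    Packet(0.05, False, "ACK"),
    Packet(72.00, True, "FIN"),
    Packet(72.01, False, "FIN/ACK"),
]

print(negotiate_stalled(trace))
```

A bare TCP Ack with no SMB reply behind it is the tell. The server's network stack saw the request, but nothing above it ever answered.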

In the end it took a call to Microsoft. Once we got to the right network person, they knew immediately what the problem was.

ForeFront is now going on those servers. It really should have been on a month ago, but because these cluster nodes were supposed to go live for fall quarter they were fully staged up in August, before we even had the ForeFront clients. We never remembered to replace SEP with ForeFront.

I have a degree in this stuff

I have a CompSci degree. This qualified me for two things:
  • A career in academics
  • A career in programming
You'll note that Systems Administration is not on that list. My degree has helped my career by getting me past the "4 year degree in a related field" requirement of jobs like mine. An MIS degree would be more appropriate, but there were very few of those back when I graduated. It has indirectly helped me in troubleshooting, as I have a much better foundation in how the internals work than your average computer mechanic.

Anyway. Every so often I stumble across something that causes me to go Ooo! ooo! over the sheer computer science of it. Yesterday I stumbled across Barrelfish, and this paper. If I weren't sick today I'd have finished it, but even as far as I've gotten into it I can see the implications of what they're trying to do.

The core concept behind the Barrelfish operating system is to assume that each computing core does not share memory and has access to some kind of message passing architecture. This has the side effect of having each computing core running its own kernel, which is why they're calling Barrelfish a 'multikernel operating system'. In essence, they're treating the insides of your computer like the distributed network that it is, and using already existing distributed computing methods to improve it. The type of multi-core we're doing now, SMP, ccNUMA, uses shared memory techniques rather than message passing, and it seems that this doesn't scale as far as message passing does once core counts go higher.

They go into a lot more detail in the paper about why this is. A big one is the heterogeneity of CPU architectures out there in the marketplace, and they're not just talking AMD vs Intel vs CUDA, this is also Core vs Core2 vs Nehalem. This heterogeneity in the marketplace makes it very hard for a traditional Operating System to be optimized for a specific platform.

A multikernel OS would use a discrete kernel for each microarchitecture. These kernels would communicate with each other using OS-standardized message passing protocols. On top of these microkernels would be created the abstraction called an Operating System upon which applications would run. Due to the modularity at the base of it, it would take much less effort to provide an optimized microkernel for a new microarchitecture.

The use of message passing is very interesting to me. Back in college, parallel computing was my main focus. I ended up not pursuing that area of study in large part because I was a strictly C student in math, parallel computing was a largely academic endeavor when I graduated, and you needed to be at least a B student in math to hack it in grad school. It still fired my imagination, and there was squee when the Pentium Pro was released and you could do 2 CPU multiprocessing.

In my Databases class, we were tasked with creating a database-like thingy in code and to write a paper on it. It was up to us what we did with it. Having just finished my Parallel Computing class, I decided to investigate distributed databases. So I exercised the PVM extensions we had on our compilers thanks to that class. I then used the six Unix machines I had access to at the time to create a 6-node distributed database. I used statically defined tables and queries since I didn't have time to build a table parser or query processor and needed to get it working so I could do some tests on how optimization of table positioning impacted performance.

Looking back on it 14 years later (eek) I can see some serious faults in my implementation. But then, I've spent the last... 12 years working with a distributed database in the form of Novell's NDS and later eDirectory. At the time I was doing this project, Novell was actively developing the first version of NDS. They had some problems with their implementation too.

My results were decidedly inconclusive. There was a noise factor in my data that I was not able to isolate, and it managed to drown out what differences there were between my optimized and non-optimized runs (in hindsight I needed larger tables by an order of magnitude or more). My analysis paper was largely an admission of failure. So when I got an A on the project I was confused enough that I went to the professor and asked how this was possible. His response?
"Once I realized you got it working at all, that's when you earned the A. At that point the paper didn't matter."
Dude. PVM is a message passing architecture, like most distributed systems. So yes, distributed systems are my thing. And they're talking about doing this on the motherboard! How cool is that?

Both Linux and Windows are adopting more message-passing architectures in their internal structures, as they scale better on highly parallel systems. In Linux this involved reducing the use of the Big Kernel Lock in anything possible, as invoking the BKL forces the kernel into single-threaded mode and that's not a good thing with, say, 16 cores. Windows 7 involves similar improvements. As more and more cores sneak into everyday computers, this becomes more of a problem.

An operating system working without the assumption of shared memory is a very different critter. Operating state has to be replicated to each core to facilitate correct functioning; you can't rely on a common memory address to handle this. It seems that the form of this state is key to performance, and is very sensitive to microarchitecture changes. What was good on a P4 may suck a lot on a Phenom II. The use of a per-core kernel allows the optimal structure to be used on each core, with changes replicated rather than shared, which improves performance. More importantly, it'll still be performant 5 years after release assuming regular per-core kernel updates.
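To make the "replicated rather than shared" idea concrete, here is a toy sketch in Python. It is not how Barrelfish does it; the `CoreKernel` class, the `broadcast` helper, and the state keys are all invented for illustration. Python threads do share memory under the hood, so the point is only the discipline: each "core" keeps a private copy of the state and changes it solely in response to messages in its own inbox.

```python
import queue
import threading


class CoreKernel(threading.Thread):
    """A stand-in for one per-core kernel holding a private replica of OS state."""

    def __init__(self, core_id):
        super().__init__()
        self.core_id = core_id
        self.inbox = queue.Queue()  # the only way other cores talk to this one
        self.state = {}             # private replica, updated only by run()

    def run(self):
        while True:
            msg = self.inbox.get()
            if msg is None:         # shutdown message
                break
            key, value = msg
            self.state[key] = value  # apply the replicated change locally


def broadcast(kernels, key, value):
    """Send the same state change to every core instead of writing shared memory."""
    for k in kernels:
        k.inbox.put((key, value))


kernels = [CoreKernel(i) for i in range(4)]
for k in kernels:
    k.start()

broadcast(kernels, "runqueue_len", 3)
broadcast(kernels, "mounted", "/home")

for k in kernels:
    k.inbox.put(None)
for k in kernels:
    k.join()

# Every replica ends up with the same contents, and no core ever took a lock on
# another core's copy.
print([k.state for k in kernels])
```

Since each replica is private, each per-core kernel is free to store it in whatever layout suits its own microarchitecture, as long as it understands the messages.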

You'd also be able to use the 1.75GB of GDDR3 on your GeForce 295 as part of the operating system if you really wanted to! And some might.

I'd burble further, but I'm sick so not thinking straight. Definitely food for thought!
It seems that all Mac OSX versions except for 10.4 (yes, including 10.6) don't like to talk to Windows Server 2008 Failover clusters without special syntax. The reason for this boils down to two technology disagreements.

  1. OS X (except for 10.4) attempts to make smb/cifs connections by the resolved IP address of given names. So a connection string like smb://clu-share1.winclu.wwu.edu/share1/ will be translated into \\140.160.12.34\share1 when it attempts to talk to the server.
  2. Windows failover clustering requires the server name when connecting. Otherwise it tells you no-can-do. You can't use \\140.160.12.34\share1\ syntax, you MUST use a name.
For instance, the string "smb://msfs-class1.univ.dir.wwu.edu/class1" will cause the following packets to occur:
Packets showing fail
However, if you attempt to connect to a non-clustered share, perhaps a share on one of the cluster nodes rather than a cluster service, it works just fine.
Packets showing success
Funny, eh?

So what's a mac-owner, of which we have quite a lot, to do? The fix is pretty simple: append ":139" to the end of the server part of the connection string. In the above example, "smb://msfs-class1.univ.dir.wwu.edu:139/class1". For some reason, this forces the mac to use a name when connecting to the remote system.
Packets showing success
Apparently, OS X 10.4 (Tiger) did this normally, but Apple changed it back to the non-working version with 10.5 (Leopard). And we've tested, 10.6 (Snow Leopard) is broken the same way.
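If the helpdesk ends up handing out a lot of these connection strings, the ":139" workaround is easy to script. Here is a small Python sketch; `force_port_139` is a name I made up, and it does nothing cleverer than append the port to the server part of an smb:// URL that doesn't already carry one.

```python
from urllib.parse import urlsplit


def force_port_139(smb_url):
    """Return smb_url with ':139' added to the server part, if no port is set."""
    parts = urlsplit(smb_url)
    if parts.scheme != "smb" or not parts.netloc:
        raise ValueError(f"not an smb:// URL: {smb_url!r}")
    if parts.port is not None:
        return smb_url  # someone already chose a port; leave it alone
    return parts._replace(netloc=f"{parts.netloc}:139").geturl()


print(force_port_139("smb://msfs-class1.univ.dir.wwu.edu/class1"))
```

It works on the whole server part rather than just the hostname, so a string with a username in front of the server keeps it.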

Why this is so is up for debate. I'm personally fond of the idea that the Windows SMB stack isn't detailed enough to tell what IP address an incoming connection came in on and virtualize answers accordingly. For stand-alone servers this is a simple thing; if you can talk to me at all, here are all of my shares. For conditional sharing like with clusters, where you can only get these shares on these IPs, the SMB stack apparently lacks a way to discriminate appropriately. Clearly name-based is in there, but not IP.

No word on if 2008 R2 behaves this way. Microsoft dropped R2 about... three weeks too late for us to go with it for this cluster.

This is going to be one of those FAQs the helpdesks are going to get real used to answering.
Yesterday I ran into this:

http://blogs.msdn.com/clustering/archive/2009/03/02/9453288.aspx

On the surface it looks like NTFS behaving like OCFS. But Microsoft has a warning on this page:
In Windows Server® 2008 R2, the Cluster Shared Volumes feature included in failover clustering is only supported for use with the Hyper-V server role. The creation, reproduction, and storage of files on Cluster Shared Volumes that were not created for the Hyper-V role, including any user or application data stored under the ClusterStorage folder of the system drive on every node, are not supported and may result in unpredictable behavior, including data corruption or data loss on these shared volumes. Only files that are created for the Hyper-V role can be stored on Cluster Shared Volumes. An example of a file type that is created for the Hyper-V role is a Virtual Hard Disk (VHD) file.

Before installing any software utility that might access files stored on Cluster Shared Volumes (for example, an antivirus or backup solution), review the documentation or check with the vendor to verify that the application or utility is compatible with Cluster Shared Volumes.
So unlike OCFS2, this multi-mount NTFS is only for VMs and not for general file-serving. In theory you could use this in combination with Network Load Balancing to create a high-availability cluster with even higher uptime than failover clusters already provide. The devil is in the details though, and Microsoft alludes to them.

A file system being used for Hyper-V isn't a complex locking environment. You'll have as many locks as there are VHD files, and they won't change often. Contrast this with a file-server where you can have thousands of locks that change by the second. Additionally, unless you disable Opportunistic Locking you are at grave risk of corrupting files used by more than one user (Access databases!) if you are using the multi-mount NTFS.

Microsoft will have to promote awareness of this type of file-system into the SMB layer before this can be used for file-sharing. SMB has its own lock layer, and this will have to coordinate the SMB layers in the other nodes for it to work right. This may never happen, we'll see.

A new version of BIND

I saw on the SANS log today that the ISC is starting work on BIND10. A list of the new stuff can be found here. A couple of those items are very interesting to me. Specifically the Modularity and Clustering items.

Modularity:
...the selection of a variety of back-ends for data storage, be it the current in-memory database, a traditional SQL-based server, an embedded database engine or back-ends for specific applications such as a high performance, pre-compiled answer database.
Which makes me think of eDirectory backed DNS. Novell has had this for ages with NetWare, and from what I recall it was based on BIND. But... BIND8. BIND10 would formalize this in the Linux base, which would further allow Novell to publish a more 'pure' eDir-integrated BIND.

Clustering:
run on multiple but related systems simultaneously, using a pluggable, open-source architecture to enable backbone communications between individual members of the cluster. These coordination services would enable a server farm to maintain consistency and coherence.
This is exactly what AD-integrated DNS and the DNS on NetWare have been doing for over 8 years now. Glad to see BIND catch up.

The big thing about using a database of some kind as the back-end for DNS is that you no longer have to create Secondary servers and muck about with Zone Transfers. For domains that change on a second by second basis, such as an AD DNS domain with dynamic updates enabled and thousands of computers during morning power-on, it is entirely possible for a BIND secondary-server to be missing many, many DNS updates. Microsoft has known about this issue, which is why they have their own directory-integrated DNS service.

This also shows just how creaky the NetWare DNS service really is. That's based on BIND8 code, which is now over 10 years old. Very creaky.

I'm looking forward to BIND10. It is a needed update that addresses DNS as it is done today, and would better enable BIND to handle large Active Directory domains.
I've said before that you'll have to pry the login-script out of our cold dead hands. The simple Novell login-script is the single most pervasive workstation management tool we have, since EVERYONE needs the Novell Client to talk to their file servers. It's one reason we have computer labs when others are paring down or getting rid of theirs. People can live without the Zen agents if they work at it, but they can't live without the Novell Client. Therefore, we do a lot of our workstation management through the login-script.

The Vista client has been vexing in this regard since it is so painfully slow in our clustered environment. The reason it is slow is the same reason the first WinXP clients were slow: the Microsoft and Novell name-resolution processes compete in bad ways. As each drive letter we map is its own virtual-server, every time you attempt to display a Save/Open box or open Windows Explorer it has to resolve-timeout-resolve each and every drive letter. This means that opening a Save/Open box on a Vista machine running the Novell client can take upwards of 5 minutes to display thanks to the timeouts. Novell knows about this issue, and has reported it to Microsoft. This is something Microsoft has to fix, and they haven't yet.

This is vexing enough that certain highly influential managers want to make sure that the same thing doesn't happen again for Windows 7. As anyone who follows any piece of the tech media knows, Windows 7 has been deemed, "Vista done right," and we expect a lot faster uptake of Win7 than WinVista. So we need to make sure our network can accommodate that on release-day. Make it so, said the highly placed manager. Yessir, we said.

So last night I turned CIFS on for all the file services on the cluster. It was that or migrate our entire file-serving function to Windows. The choice, as you can expect, was an easy one.

This morning our Mac users have been decidedly gleeful, as CIFS has long password support where AFP didn't. The one sysadmin here in techservices running Vista as his primary desktop has uninstalled the Novell Client and is also cheerful. Happily for us, the directive from said highly placed manager was accompanied by a strong suggestion to all departments that domaining PCs into the AD domain would be a Really Good Idea. This allows us to use the AD login-script, as well as group-policies, for those Windows machines that lack a Novell Client.

Ultimately, I expect the Novell Client to slowly fade away as a mandatory install. So that clientless-future I said we couldn't take part in? Microsoft managed to push us there.
Last night I turned on multi-path support for the main NetWare file cluster. This has been a long time coming. When we upgraded the EVA3000 to an EVA6100 it gained the ability to do active/active IO on the controllers, something that the new EVA4400 can also do.

What's more, the two Windows backup-to-disk servers we've attached to the EVA4400 (and the MSA1500 for that matter) have the HP MPIO drivers installed, which are extensions of the Microsoft MPIO stack. Looking at the bandwidth chart on the fiber-channel fabric I see that these Windows servers are also doing load balancing over both of the paths. This is nifty! Also, when I last updated the XCS code on the EVA4400 both of those servers didn't even notice the controller reboots. EVEN NICER!

I want to do the same thing with NetWare. On the surface, turning on MPIO support is dead easy:

Startup.ncf file:
SET MULTI-PATH SUPPORT = ON
LOAD QL2X00.HAM SLOT=10001 /LUNS /ALLPATHS /PORTNAMES


Tada. Reboot, present both paths in your zoning, and issue the "list failover devices" command on the console, and you'll get a list. In theory, should one path go away, it'll seamlessly move over to the other.

But what it won't do is load-balance. Unfortunately, the documentation on NetWare's multi-path support is rather scanty, focusing more on configuring path failover priority. The fact that the QL2X00.HAM driver itself can do it all on its own without letting NetWare know (the "allpaths" and "portnames" options tell it to not do that and let NetWare do the work) is a strong hint that MPIO is a fairly lightweight protocol.

On the support forums you'll get several references to the LSIMPE.CDM file. With interesting phrases like, "that's the multipath driver", and, "Yeah, it isn't well documented." The data on the file itself is scanty, but suggestive:
LSIMPE.CDM
Loaded from [C:\NWSERVER\DRIVERS\] on Feb 4, 2009 3:32:13 am
(Address Space = OS)
LSI Multipath Enhancer
Version 1.02.02 September 5, 2006
Copyright 1989-2006 Novell, Inc.
But the exact details of what it does remain unclear. One thing I do know: it doesn't do the load-balancing trick.

Dorm printing

On my post about finally running Vista, patrickbuller asked:
So you have printers that students in the dorms can print to? Wow. Do you audit all those and charge the numbers of pages against the student?
The answer to that is that we make big use of AND Technology's PCounter product. When paired with their PrintStations, it makes a very nice way to put a lid on unrestricted 'free' printing in the dorms. The PrintStations also make sure that only jobs people want to pick up get printed, which saves a serious amount of paper.

PCounter is core to our student printing. We'll only move our NDPS/iPrint infrastructure over to OES2-Linux when PCounter is supported on that platform, not before. We'll keep a 2 node NetWare cluster around just for printing if we have to. Since accounting support is one of the features that's supposed to be in OES2-SP1, it is my hope that PCounter will support OES2-Linux within a year after SP1's release. But I haven't heard any specifics.

Moving storage around

The EVA6100 went in just fine with that one hitch I mentioned, and now comes all the work we need to do now that we have actual space again. We're still arguing over how much space to add to which volumes, but once we decide all but Blackboard will be very easy to add.

Blackboard needs more space on both the SQL server and the Content server, and as the Content server is clustered it'll require an outage to manage the increase. And it'll be a long outage, as 300GB of weensy files takes a LONG time to copy. The SQL server uses plain old Basic partitions, so I don't think we can expand that partition, so we may have to do another full LUN copy which will require an outage. That has yet to be scheduled, but needs to happen before we get through much of the quarter.

Over on the EVA4400 side, I'm evacuating data off of the MSA1500cs onto the 4400. Once I'm done with that, I'm going to be:
  1. Rebuilding all of the Disk Arrays.
  2. Creating LUNs expressly for Backup-to-Disk functionality.
  3. Flashing the Active/Active firmware on to it, the 7.00 firmware rev.
  4. Getting the two Backup servers installed with the right MPIO widgetry to take advantage of active/active on the MSA.
But first we need the DataProtector licensing updates to beat their way through the forest of paperwork and get ordered. Otherwise, we can't use more than 5TB of disk, and that's WAY wimpy. I need at LEAST 20, and preferably 40TB. Once that licensing is in place, we can finally decommission the out-of-license BackupExec server and use the 6 slot tape library with DataProtector instead. This should significantly increase how much data we can throw at backup devices during our backup window.

What has yet to be fully determined is exactly how we're going to use the 4400 in this scheme. I expect to get between 15 and 20TB of space out of the MSA once I'm done with it, and we have around 20TB on the 4400 for backup. Which is why I'd really like that 40TB license please.
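The arithmetic behind the license ask, spelled out as a few lines of Python. The figures are the rough estimates from this post; the variable names are just mine.

```python
msa_tb = (15, 20)        # expected usable space on the rebuilt MSA (low, high estimate)
eva4400_tb = 20          # roughly what the 4400 has set aside for backup
current_license_tb = 5   # what we can use without the licensing update

low = msa_tb[0] + eva4400_tb
high = msa_tb[1] + eva4400_tb

print(f"Backup-to-disk space available: {low}-{high} TB")
print(f"Usable under the current license: {current_license_tb} TB")

# How much of the best-case space each license tier would let us use.
for license_tb in (20, 40):
    usable = min(license_tb, high)
    print(f"{license_tb} TB license: {usable} of {high} TB usable ({usable / high:.0%})")
```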

Going Active/Active should do really good things for how fast the MSA can throw data at disk. As I've proven before the MSA is significantly CPU bound for I/O to parity LUNs (Raid5 and Raid6), so having another CPU in the loop should increase write throughput significantly. We couldn't do Active/Active before since you can only do Active/Active in a homogeneous OS environment, and we had Windows and NetWare pointed at the MSA (plus one non-production Linux box).

In the mean time, I watch progress bars. TB of data takes a long time to copy if you're not doing it at the block level. Which I can't.

EVA6100 upgrade a success

Friday night four HP techs arrived to put together the EVA6100 from a pile of parts and the existing EVA3000. It took them 5 hours to get it to the point where we could power-on and see if all of our data was still there (it was, yay), and a few hours after that on our part to put everything back together.

There was only one major hitch for the night, which meant I got to bed around 6am Saturday morning instead of 4am.

For EVA, and probably all storage systems, you present hosts to them and selectively present LUNs to those hosts. These host-settings need to have an OS configured for them, since each operating system has its own quirks for how it likes to see its storage. While the EVA6100 has a setting for 'vmware', the EVA3000 did not. Therefore, we had to use a 'custom' OS setting and a 16 digit hex string we copied off of some HP knowledge-base article. When we migrated to the EVA6100 it kept these custom settings.

Which, it would seem, don't work for the EVA6100. It caused ESX to whine in such a way that no VMs would load. It got very worrying for a while there, but thanks to an article on vmware's support site and some intuition we got it all back without data loss. I'll probably post what happened and what we did to fix it in another blog post.

The only service that didn't come up right was secure IMAP for Exchange. I don't know why it decided to not load. My only theory is that our startup sequence wasn't right. Rebooting the HubCA servers got it back.

About this Archive

This page is an archive of entries from June 2010 listed from newest to oldest.