Recently in clustering Category

As I look around the industry with an eye towards further employment, I've noticed a difference of philosophy between startups and the more established players. One easy way to see this difference is on their job postings.

  • If it says RHEL and VMWare on it, they believe in support contracts.
  • If it says CentOS and OpenStack on it, they believe in community support.

For the same reason that tech startups almost never use Windows if they can get away with it, they steer clear of other technologies that come with license costs or mandatory support contracts. Why pay the extra support cost when you can get the same service by hiring extremely smart people and use products with a large peer support community? Startups run lean, and all that extra cost is... cost.

And yet some companies find that they prefer to run with that extra cost. Some, like StackExchange, don't mind the extra licensing costs of their platform (Windows) because they're experts in it and can make it do exactly what they want it to do with a minimum of friction, which means the Minimum Viable Product gets kicked out the door sooner. A quicker MVP means quicker profitability, and that can pay for the added base-cost right there.

Other companies treat support contracts like insurance: something you carry just in case, as a hedge against disaster. Once you grow to a certain size, business continuity insurance investments start making a lot more sense. Running for the brass ring of market dominance without a net makes sense, but once you've grabbed it keeping it needs investment. Backup vendors love to quote statistics on the percentage of business that fail after a major data-loss incident (it's a high percentage), and once you have a business worth protecting it's good to start protecting it.

This is part of why I'm finding that the long established companies tend to use technologies that come with support. Once you've dominated your sector, keeping that dominance means a contract to have technology experts on call 24/7 from the people who wrote it.

"We may not have to call RedHat very often, but when we do they know it'll be a weird one."

So what happens when startups turn into market dominators? All that no-support Open Source stuff is still there...

They start investing in business continuity, just the form may be different from company to company.

  • Some may make the leap from CentOS to RHEL.
  • Some may contract for 3rd party support for their OSS technologies (such as with 10gen for MongoDB).
  • Some may implement more robust backup solutions.
  • Some may extend their existing high-availability systems to handle large-scale local failures (like datacenter or availability-zone outages).
  • Some may acquire actual Business Continuity Insurance.

Investors may drive adoption of some BC investment, or may actively discourage it. I don't know, I haven't been in those board meetings and can argue both ways on it.

Which one do I prefer?

Honestly, I can work for either style. Lean OSS means a steep learning curve and a strong incentive to become a deep-dive troubleshooter of the platform, which I like to be. Insured means someone has my back if I can't figure it out myself, and I'll learn from watching them solve the problem. I'm easy that way.

Parallel vs Distributed Computing

Inspired from a closed ServerFault post:

I was just wondering why there is a need to go through all the trouble of creating distributed systems for massive parallel processing when, we could just create individual machines that support hundreds or thousands of cores/CPUs (or even GPGPUs) per machine?

So basically, why should you do parallel processing over a network of machines when it can rather be done at much lower cost and much more reliably on 1 machine that supports numerous cores?
profile for Henry Nash at Server Fault, Q&A for system administrators and desktop support professionals


The most common answer to this is really, Because it's generally far cheaper to assemble 25 4-core boxes into a single distributed system than a single 100-core box. Also, some problems don't parallelize well on a single system no matter how large it gets.

A Distributed Computing system is a loosely coupled parallel system. It generally does not present as a single operating system (a "single system image"), though that is an optional type. Latencies between processing nodes are very high, measurable in milliseconds (10-5 to 10-2) or more. Inter-node communication is typically done at the Application layer.

A Parallel Computing system is a tightly coupled parallel system. It presents as a single operating system. Processing nodes are very close together; on the same processing die or a very short distance away. Latencies between processing nodes are slight, measurable in nanoseconds (10-9 to 10-7). Inter-node communication is typically done at the Operating System layer, though Applications do have the ability to manage this if designed to.

Looking at those two definitions, you'd think the parallel computing system would win every time. But...

There are some workloads that don't behave nicely in a single system image, no matter how many cores are present. They tie up some key resource that blocks everything else. Each process consumes huge amounts of RAM. Multiple instances running in parallel don't behave nicely. Any number of things can prevent a massively parallel system from doing it's thing. In such cases it makes very clear sense to isolate these processes on smaller systems, and feed them work through a distributed work-management framework.

What kinds of things can "block key resources"?

  • The application requires a GUI element to be presented. This is not an application designed for highly parallel systems, but people do weird things all the time. Such as an Email-to-PDF gateway that utilizes MS Office for the rendering.
  • The application is attached to specialized hardware that is single-access only (an ancient ISA card, or a serial-connection to scientific gear).
  • The application uses hardware that generally only handles one or two devices per install, at least without extremely expensive engineering (GPU computing).
  • Each running instance of the application requires a hardware dongle due to hard-coded licensing requirements.

Also, distributed systems have a few advantages over their closely coupled brethren. First and foremost, it's generally a lot easier to scale them out after the initial build. This is very important if the workload is expected to grow over time.

Secondly, they tend to be more failure-resistant. If a single processing node fails, it does not always cause everything else to come to a halt. In a massive parallel computing system, a failure of a CPU or RAM can cause the whole thing to stop and that can be a lot of lost work.

Third, getting 1000 cores of processing power is generally a lot cheaper if that's done in 125 discrete 8-core servers than if it is done in a single box. If the problem being solved can tolerate a loosely coupled system, there are a lot of economics that make the distributed route very attractive. As a rule, the less custom engineering you need for a system, the cheaper it'll be.

And that is why we build large distributed systems instead of supercomputers.

One of the tasks I have is to build some separation between the production environment and test/integration environments. As we approach that Very Big Deadline I've talked about, this is a pretty significant task. Dev processes that we've been using for a really long time now need to change in order to provide the separation we need, and that's some tricky stuff.

One of the ways I plan on accomplishing this is through the use of Environments in Puppet. This looks like a great way to allow automation while delivering different code to different environments. Nifty keen stuff.

But it requires that I re-engineer our current Puppet environment, which is a single monolithic all-or-nothing repo to accommodate this. This is fairly straight-forward, as suggested by the Puppet Docs:

If the values of any settings in puppet.conf reference the $environment variable (like modulepath = $confdir/environments/$environment/modules:$confdir/modules, for example), the agent's environment will be interpolated into them.

So if I need to deliver a different set of files to production, such as a different authorized_keys file, I can drop that into $confdir/environments/production/modules and all will be swell.

So I tried to set that up. I made the environment directory, changed the puppet.conf file to reflect that, made an empty module directory for the module that'd get the changed files (but not put the files in there yet), and ran a client against it.

It didn't work so well.

The error I was getting on the puppetmaster was err: Could not find class custommod for clunod-wk0130.sub.example.local at /etc/puppet/manifests/site.pp:84 on node clunod-wk0130.sub.example.local

I checked eight ways from Sunday for spelling mistakes, bracket errors, and other such syntax problems but could not find out why it was having trouble locating the custommod directory. I posted on ServerFault for some added help, but didn't get much beyond, "Huh, it SHOULD work that way," which wasn't that helpful.

I decided that I needed to debug where it was searching for modules since it manifestly wasn't seeing the one staring it in the face. For this I used strace, which produced an immense logfile. Grepping out the 'stat' calls as it checked the import statements, I noticed something distinctly odd.


It was attempting to read in the init.pp file for the custommod, even though there wasn't a file there. And what's more, it wasn't then re-trying under /etc/puppet/modules/custommod/manifests/init.pp. Clearly, the modulepath statement does not work like the $PATH variable in bash and every other shell I've used. In those, if it fails to find a command in (for example) /usr/local/bin, it'll then try /usr/bin, and then ~/bin until it finds the file.

The modulepath statement in puppet.conf is focused on modules, not individual files.

This is a false cognate. By having that empty module in the production environment, I was in effect telling Puppet that the Production environment doesn't have that module. Also, my grand plan to provide a different authorized_keys file in the Production environment requires me to have a complete module for the Production environment directory, not just the one or two files I want changed.

Good to know.

When there is no clean back-out

| 1 Comment
Today I ended up banging my head on a particularly truculent wall. A certain clustering solution just wasn't doing the deed. No matter how I cut, sliced, sawed or hacked it wouldn't come to life. I wanted my two headed beast! Most vexing.

Then I hit that point where I said unto myself (I hope), "!#^#^ it, I'm reformatting these two and starting from scratch."

So I did.

It worked first time.


Ok, I'll take it.

Mind, all the cutting and slicing and sawing and hacking was with me knowing full well that iterative changes like that can stack unknowingly, and I was backing out the changes when I determined that they weren't working. Clearly, my back-out methods weren't clean enough. Okay, food for thought; dirty uninstallers are hardly an unusual thing.

Of course I kicked myself for not trying that earlier, but during the course of aforementioned reanimatory activity I did learn a fair amount about this particular clustering technology. Not a complete loss. I do wish I had those 4 hours back, since we're at a part of the product deployment schedule where hours count.

Deep computing

It's no secret that weather forecasting is a problem that can keep many a super-computer up all night. Cliff Mass of the U of WA recently posted a nice writeup of what they're doing:

That sort of parallel computing problem is the kind of thing that really fired me up back in college. So. Cool.

Is network now faster than disk?

Way back in college, when I was earning my Computer Science degree, the latencies of computer storage were taught like so:

  1. On CPU register
  2. CPU L1/L2 cache (this was before L3 existed)
  3. Main Memory
  4. Disk
  5. Network
This question came up today, so I thought I'd explore it.

The answer is complicated. The advent of Storage Area Networking was made possible because a mass of shared disk is faster, even over a network, than a few local disks. Nearly all of our I/O operations here at WWU are over a fibre-channel fabric, which is disk-over-the-network no matter how you dice it. With iSCSI and FC over Ethernet this domain is getting even busier.

That said, there are some constraints. "Network" in this case is still subject to distance limitations. A storage array 40km from the processing node will still see more storage latencies than the same type of over-the-network I/O 100m away. Our accesses are fast enough these days that the speed-of-light round-trip time for 40km is measurable versus 100m.

A very key difference here is that the 'network' component is handled by the operating system and not application code. For SAN an application requests certain portions of a file, the OS translates that into block requests, which are then translated into storage bus requests; the application doesn't know that the request was served over a network.

For application development the above tiers of storage are generally well represented.

  1. Registers, unless the programming is in assembly, most programmers just trust the compiler and OS to handles these right.
  2. L1/2/3 Cache, as above, although well tuned code can maximize the benefit this storage tier can provide.
  3. Main memory, this is directly handled through code. One might argue that at a low level memory handling constitutes a majority of what code does.
  4. Disk, This is represented by file-access or sometimes file-as-memory API calls. These tend to be discrete calls from main memory.
  5. Network, This is yet another completely separate call structure, which means using it requires explicit programming.
Storage Area Networking is parked in step 4 up there. Network can include things like making NFS connections and then using file-level calls to access data, or actual Layer 7 stuff like passing SQL over the network.

For massively scaled out applications, the network has even crept into step 3 thanks to things like memcached and single-system-image frameworks.

Network is now competitive with disk, though so far the best use-cases let the OS handle the network part instead of the application doing it.

The things you learn

We had cause to learn this one the hard way this past week. We didn't know that Windows Server 2008 (64-bit) and Symantec Endpoint Protection just don't mix well. It affected SMBv1 clients, SMBv2 clients (Vista, Win7) were unaffected.

The presentation of it at the packet-level was pretty specific, though. XP clients (and Samba clients) would get to the second step of the connection setup process for mapping a drive and time out.

  1. -> Syn
  2. <- Syn/Ack
  3. -> NBSS, Session Request, to $Server<20> from $Client<00>
  4. <- NBSS, Positive Session Response
  5. -> SMB, Negotiate Protocol Request
  6. <- Ack
  7. [70+ seconds pass]
  8. -> FIN
  9. <- FIN/Ack
Repeat two more times, and 160+ seconds later the client times out. The timeouts between the retries are not consistent so the time it takes varies. Also sometimes the server issues the correct "Protocol Request Reply" packet and the connection continues just fine. There was no sign in any of the SEP logs that it was dropping these connections, and the Windows Firewall was quiet as well.

In the end it took a call to Microsoft. Once we got to the right network person, they knew immediately what the problem was.

ForeFront is now going on those servers. It really should have been on a month ago, but because these cluster nodes were supposed to go live for fall quarter they were fully staged up in August, before we even had the ForeFront clients. We never remembered to replaced SEP with ForeFront.

I have a degree in this stuff

| 1 Comment
I have a CompSci degree. This qualified me for two things:
  • A career in academics
  • A career in programming
You'll note that Systems Administration is not on that list. My degree has helped my career by getting me past the "4 year degree in a related field" requirement of jobs like mine. An MIS degree would be more appropriate, but there were very few of those back when I graduated. It has indirectly helped me in troubleshooting, as I have a much better foundation about how the internals work than your average computer mechanic.

Anyway. Every so often I stumble across something that causes me to go Ooo! ooo! over the sheer computer science of it. Yesterday I stumbled across Barrelfish, and this paper. If I weren't sick today I'd have finished it, but even as far as I've gotten into it I can see the implications of what they're trying to do.

The core concept behind the Barrelfish operating system is to assume that each computing core does not share memory and has access to some kind of message passing architecture. This has the side effect of having each computing core running its own kernel, which is why they're calling Barrelfish a 'multikernel operating system'. In essence, they're treating the insides of your computer like the distributed network that it is, and using already existing distributed computing methods to improve it. The type of multi-core we're doing now, SMP, ccNUMA, uses shared memory techniques rather than message passing, and it seems that this doesn't scale as far as message passing does once core counts go higher.

They go into a lot more detail in the paper about why this is. A big one is hetergenaity of CPU architectures out there in the marketplace, and they're not just talking just AMD vs Intel vs CUDA, this is also Core vs Core2 vs Nehalem. This heterogenaity in the marketplace makes it very hard for a traditional Operating System to be optimized for a specific platform.

A multikernel OS would use a discrete kernel for each microarcitecture. These kernels would communicate with each other using OS-standardized message passing protocols. On top of these microkernels would be created the abstraction called an Operating System upon which applications would run. Due to the modularity at the base of it, it would take much less effort to provide an optimized microkernel for a new microarcitecture.

The use of message passing is very interesting to me. Back in college, parallel computing was my main focus. I ended up not pursuing that area of study in large part because I was a strictly C student in math, parallel computing was a largely academic endeavor when I graduated, and you needed to be at least a B student in math to hack it in grad school. It still fired my imagination, and there was squee when the Pentium Pro was released and you could do 2 CPU multiprocessing.

In my Databases class, we were tasked with creating a database-like thingy in code and to write a paper on it. It was up to us what we did with it. Having just finished my Parallel Computing class, I decided to investigate distributed databases. So I exercised the PVM extensions we had on our compilers thanks to that class. I then used the six Unix machines I had access to at the time to create a 6-node distributed database. I used statically defined tables and queries since I didn't have time to build a table parser or query processor and needed to get it working so I could do some tests on how optimization of table positioning impacted performance.

Looking back on it 14 years later (eek) I can see some serious faults about my implementation. But then, I've spent the last... 12 years working with a distributed database in the form of Novell's NDS and later eDirectory. At the time I was doing this project, Novell was actively developing the first version of NDS. They had some problems with their implementation too.

My results were decidedly inconclusive. There was a noise factor in my data that I was not able to isolate and managed to drown out what differences there were between my optimized and non-optimized runs (in hindsight I needed larger tables by an order of magnitude or more). My analysis paper was largely an admission of failure. So when I got an A on the project I was confused enough I went to the professor and asked how this was possible. His response?
"Once I realized you got it working at all, that's when you earned the A. At that point the paper didn't matter."
Dude. PVM is a message passing architecture, like most distributed systems. So yes, distributed systems are my thing. And they're talking about doing this on the motherboard! How cool is that?

Both Linux and Windows are adopting more message-passing architectures in their internal structures, as they scale better on highly parallel systems. In Linux this involved reducing the use of the Big Kernel Lock in anything possible, as invoking the BKL forces the kernel into single-threaded mode and that's not a good thing with, say, 16 cores. Windows 7 involves similar improvements. As more and more cores sneak into everyday computers, this becomes more of a problem.

An operating system working without the assumption of shared memory is a very different critter. Operating state has to be replicated to each core to facilitate correct functioning, you can't rely on a common memory address to handle this. It seems that the form of this state is key to performance, and is very sensitive to microarchitecture changes. What was good on a P4, may suck a lot on a Phenom II. The use of a per-core kernel allows the optimal structure to be used on each core, with changes replicated rather than shared which improves performance. More importantly, it'll still be performant 5 years after release assuming regular per-core kernel updates.

You'd also be able to use the 1.75GB of GDDR3 on your GeForce 295 as part of the operating system if you really wanted to! And some might.

I'd burble further, but I'm sick so not thinking straight. Definitely food for thought!

Mac OS X and Windows 2008 clusters

It seems that all Mac OSX versions except for 10.4 (yes, including 10.6) don't like to talk to Window Server 2008 Failover clusters without special syntax. The reason for this boils down to two technology disagreements.

  1. OS X (except for 10.4) attempts to make smb/cifs connections by the resolved IP address of given names. So a connection string like smb:// will be translated into \\\share1 when it attempts to talk to the server.
  2. Windows failover clustering requires the server name when connecting. Otherwise it tells you no-can-do. You can't use \\\share1\ syntax, you MUST use a name.
For instance, the string "smb://" will cause the following packets to occur:
Packets showing fail
However, if you attempt to connect to a non-clustered share, perhaps a share on one of the cluster nodes rather than a cluster service, it works just fine.
Packets showing success
Funny, eh?

So what's a mac-owner, of which we have quite a lot, to do? The fix is pretty simple, append ":139" to the end of the server part of the connection string. In the above example, "smb://". For some reason, this forces the mac to use a name when connecting to the remote system.
Packets showing success
Apparently, OS X 10.4 (Tiger) did this normally, but Apple changed it back to the non-working version with 10.5 (Leopard). And we've tested, 10.6 (Snow Leopard) is broken the same way.

Why this is so is up for debate. I'm personally fond of the idea that the Windows SMB stack isn't detailed enough to tell what IP address an incoming connection came in on and virtualize answers accordingly. For stand-alone servers this is a simple thing; if you can talk to me at all, here are all of my shares. For conditional sharing like with clusters, you can only get these shares on these IP's, the SMB stack apparently lacks a way to discriminate appropriately. Clearly name-based is in there, but not IP.

No word on if 2008 R2 behaves this way. Microsoft dropped R2 about... three weeks too late for us to go with it for this cluster.

This is going to be one of those FAQs the helpdesks are going to get real used to answering.
Yesterday I ran into this:

On the surface it looks like NTFS behaving like OCFS. But Microsoft has a warning on this page:
In Windows Server® 2008 R2, the Cluster Shared Volumes feature included in failover clustering is only supported for use with the Hyper-V server role. The creation, reproduction, and storage of files on Cluster Shared Volumes that were not created for the Hyper-V role, including any user or application data stored under the ClusterStorage folder of the system drive on every node, are not supported and may result in unpredictable behavior, including data corruption or data loss on these shared volumes. Only files that are created for the Hyper-V role can be stored on Cluster Shared Volumes. An example of a file type that is created for the Hyper-V role is a Virtual Hard Disk (VHD) file.

Before installing any software utility that might access files stored on Cluster Shared Volumes (for example, an antivirus or backup solution), review the documentation or check with the vendor to verify that the application or utility is compatible with Cluster Shared Volumes.
So unlike OCFS2, this multi-mount NTFS is only for VM's and not for general file-serving. In theory you could use this in combination with Network Load Balancing to create a high-availability cluster with even higher uptime than failover clusters already provide. The devil is in the details though, and Microsoft aludes to them.

A file system being used for Hyper-V isn't a complex locking environment. You'll have as many locks as there are VHD files, and they won't change often. Contrast this with a file-server where you can have thousands of locks that change by the second. Additionally, unless you disable Opportunistic Locking you are at grave risk of corrupting files used by more than one user (Acess databases!) if you are using the multi-mount NTFS.

Microsoft will have to promote awareness of this type of file-system into the SMB layer before this can be used for file-sharing. SMB has its own lock layer, and this will have to coordinate the SMB layers in the other nodes for it to work right. This may never happen, we'll see.