April 2012 Archives

Yes, I've been light on the blogging this month. I've been helping to move an office, and it takes quite a lot of effort. Chief among the work contributers being that we have two (2) support-type people in the DC office, and I'm one of 'em. Where in this case 'support' means 'supports the office's primary function', which in larger offices would include:

  • The Facilities team who keep everyone in heat, cooling, and power, as well as put a damper on structural decay
  • The network engineers who keep the phones and data going
  • The office-admins who make sure all the various supplies are kept up to stock, usually invisibly.
  • The people-facing IT team who fix broken computers

Our DC office has at most 12 people in it during a normal day, only two of whom fit into the above. Me and the person who fills the third bullet point up there. She was doing 9/10ths of the job of getting the new office built, ordering movers, coordinating new furniture, talking to the contractors whenever they hit a snag, and all that stuff. Me? I'm the Speaker to Networks. This is a new role for me, but I've got enough depth in the field I can fake it reasonably.

I'm also better-than-average at the whole HVAC thing, what with HVAC being a primary concern for server health. However, that only came up during the 5 times I got asked "you want HOW much cooling in that closed room?" (4KVA if you must know). They kept not believing me, even after I pointed to the pair of L5-30 outlets they were going to install. Where's that power going, huh? I'd rather over-cool than under-cool, you know.

But... we're in. Even though this move was a true tragi-comedy of errors, the old hands assure me that the previous one was worse. I'll have to get THAT story once the raw wounds from this one heal over a bit more.

I'll probably post pictures once I've had more sleep. 19 hours of work since close-of-business Friday.

Kind of an obvious statement, but an object lesson has been provided. In the linked case, the discovery request included the phrase "unallocated space" and included keywords with general meanings.

The result? An unmanageable wad of data only scant slivers of which were 'responsive', and would cost well over a million dollars to find it.

"Unallocated space" is what gives me the shivers. That would require sorting through the empty spots of partitions looking for complete or partial files and producing those files and fragments. I know WWU wasn't equipped for that kind of discovery request, and we'd be knock-kneed about how to handle "unallocated blocks" on the SAN arrays themselves. It would suck a lot.

And in this case it cost quite a bit to produce in the first place.

But this also shows the other side of the discovery request, the expense of sorting through the discovered data. My current employer does just that; pare down discovered data into the responsive parts (or make the responsive parts much easier to find during manual review). And yes, it costs a lot.

Pricing is somewhat complicated, but the dominant model is based on price-per-GB with modifiers for what exactly you want done with the data. OCR costs extra. Transformation into various industry-common formats costs extra. That kind of thing. The price has been dropping a lot lately, but it's still quite common to find prices over $200/GB, and very recently prices were hovering around $1,000/GB.

Many sysadmins I know pride themselves in their ability to phrase search queries into Google to get what they're looking for. It doesn't take long to locate exactly what we're looking for, or some hint on where to look next.

Lawyers have to get the search query right on the first try. Laziness (being overly broad) costs everyone.
Inspired from a closed ServerFault post:

I was just wondering why there is a need to go through all the trouble of creating distributed systems for massive parallel processing when, we could just create individual machines that support hundreds or thousands of cores/CPUs (or even GPGPUs) per machine?

So basically, why should you do parallel processing over a network of machines when it can rather be done at much lower cost and much more reliably on 1 machine that supports numerous cores?
profile for Henry Nash at Server Fault, Q&A for system administrators and desktop support professionals


The most common answer to this is really, Because it's generally far cheaper to assemble 25 4-core boxes into a single distributed system than a single 100-core box. Also, some problems don't parallelize well on a single system no matter how large it gets.

A Distributed Computing system is a loosely coupled parallel system. It generally does not present as a single operating system (a "single system image"), though that is an optional type. Latencies between processing nodes are very high, measurable in milliseconds (10-5 to 10-2) or more. Inter-node communication is typically done at the Application layer.

A Parallel Computing system is a tightly coupled parallel system. It presents as a single operating system. Processing nodes are very close together; on the same processing die or a very short distance away. Latencies between processing nodes are slight, measurable in nanoseconds (10-9 to 10-7). Inter-node communication is typically done at the Operating System layer, though Applications do have the ability to manage this if designed to.

Looking at those two definitions, you'd think the parallel computing system would win every time. But...

There are some workloads that don't behave nicely in a single system image, no matter how many cores are present. They tie up some key resource that blocks everything else. Each process consumes huge amounts of RAM. Multiple instances running in parallel don't behave nicely. Any number of things can prevent a massively parallel system from doing it's thing. In such cases it makes very clear sense to isolate these processes on smaller systems, and feed them work through a distributed work-management framework.

What kinds of things can "block key resources"?

  • The application requires a GUI element to be presented. This is not an application designed for highly parallel systems, but people do weird things all the time. Such as an Email-to-PDF gateway that utilizes MS Office for the rendering.
  • The application is attached to specialized hardware that is single-access only (an ancient ISA card, or a serial-connection to scientific gear).
  • The application uses hardware that generally only handles one or two devices per install, at least without extremely expensive engineering (GPU computing).
  • Each running instance of the application requires a hardware dongle due to hard-coded licensing requirements.

Also, distributed systems have a few advantages over their closely coupled brethren. First and foremost, it's generally a lot easier to scale them out after the initial build. This is very important if the workload is expected to grow over time.

Secondly, they tend to be more failure-resistant. If a single processing node fails, it does not always cause everything else to come to a halt. In a massive parallel computing system, a failure of a CPU or RAM can cause the whole thing to stop and that can be a lot of lost work.

Third, getting 1000 cores of processing power is generally a lot cheaper if that's done in 125 discrete 8-core servers than if it is done in a single box. If the problem being solved can tolerate a loosely coupled system, there are a lot of economics that make the distributed route very attractive. As a rule, the less custom engineering you need for a system, the cheaper it'll be.

And that is why we build large distributed systems instead of supercomputers.

Other Blogs

My Other Stuff

Monthly Archives