Parallel vs Distributed Computing

Inspired by a closed ServerFault post:

I was just wondering why there is a need to go through all the trouble of creating distributed systems for massive parallel processing when we could just create individual machines that support hundreds or thousands of cores/CPUs (or even GPGPUs) per machine?

So basically, why should you do parallel processing over a network of machines when it could instead be done at much lower cost and much more reliably on one machine that supports numerous cores?
(asked by Henry Nash on Server Fault)

Well.

The most common answer to this is really: because it's generally far cheaper to assemble 25 four-core boxes into a single distributed system than to build a single 100-core box. Also, some problems don't parallelize well on a single system, no matter how large it gets.

A Distributed Computing system is a loosely coupled parallel system. It generally does not present as a single operating system (a "single system image"), though such clusters do exist as an option. Latencies between processing nodes are very high, measurable in tens of microseconds to milliseconds (10⁻⁵ to 10⁻² seconds) or more. Inter-node communication is typically done at the Application layer.
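
To make "inter-node communication at the Application layer" concrete, here is a minimal sketch of my own (not from the original question or answer): a coordinator serializes a task as JSON and ships it over TCP to a worker, paying a network round trip for every exchange. The address, port, task format, and the `send_task` helper are all illustrative assumptions; the worker runs in a local thread only so the example is self-contained.

```python
import json
import socket
import threading

ADDR = ("127.0.0.1", 9000)   # illustrative; a real cluster would target remote hosts
ready = threading.Event()

def worker() -> None:
    """Stand-in for a remote node: accept one task over TCP and send back a result."""
    with socket.create_server(ADDR) as srv:
        ready.set()
        conn, _ = srv.accept()
        with conn:
            task = json.loads(conn.makefile().readline())
            reply = {"result": sum(task["args"])}
            conn.sendall(json.dumps(reply).encode() + b"\n")

def send_task(task: dict) -> dict:
    """Coordinator side: serialize the task, cross the network, wait for the answer."""
    with socket.create_connection(ADDR, timeout=5.0) as conn:
        conn.sendall(json.dumps(task).encode() + b"\n")
        return json.loads(conn.makefile().readline())

if __name__ == "__main__":
    threading.Thread(target=worker, daemon=True).start()
    ready.wait()
    print(send_task({"op": "sum", "args": [1, 2, 3]}))   # {'result': 6}
```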

A Parallel Computing system is a tightly coupled parallel system. It presents as a single operating system. Processing nodes are very close together, on the same processing die or a very short distance away. Latencies between processing nodes are slight, measurable in nanoseconds (10⁻⁹ to 10⁻⁷ seconds). Inter-node communication is typically done at the Operating System layer, though applications can manage it themselves if designed to.
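
By contrast, a tightly coupled version of the same toy job (again my own sketch, not the author's code) never leaves one system image: the operating system schedules the workers across local cores and moves the data between them, and the application code never touches the network.

```python
from concurrent.futures import ProcessPoolExecutor

def partial_sum(chunk: range) -> int:
    """One unit of work, executed on one local core."""
    return sum(chunk)

if __name__ == "__main__":
    chunks = [range(i, i + 250_000) for i in range(0, 1_000_000, 250_000)]
    # All workers live on one machine under one OS; no application-level networking.
    with ProcessPoolExecutor() as pool:
        total = sum(pool.map(partial_sum, chunks))
    print(total)   # 499999500000
```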

Looking at those two definitions, you'd think the parallel computing system would win every time. But...

There are some workloads that don't behave nicely in a single system image, no matter how many cores are present. They tie up some key resource that blocks everything else, each process consumes huge amounts of RAM, or multiple instances running in parallel simply don't coexist well. Any number of things can prevent a massively parallel system from doing its thing. In such cases it makes very clear sense to isolate these processes on smaller systems and feed them work through a distributed work-management framework (sketched after the list below).

What kinds of things can "block key resources"?

  • The application requires a GUI element to be presented. This is not an application designed for highly parallel systems, but people do weird things all the time, such as an email-to-PDF gateway that uses MS Office for the rendering.
  • The application is attached to specialized hardware that is single-access only (an ancient ISA card, or a serial-connection to scientific gear).
  • The application uses hardware that generally only handles one or two devices per install, at least without extremely expensive engineering (GPU computing).
  • Each running instance of the application requires a hardware dongle due to hard-coded licensing requirements.

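Here is a hedged sketch of that "feed them work" pattern, with plain Python queues and threads standing in for a real broker and separate machines; the task shape and the squaring workload are placeholders. Each badly behaved job gets a worker to itself and is handed one task at a time.

```python
import queue
import threading

tasks: "queue.Queue" = queue.Queue()
results: "queue.Queue" = queue.Queue()

def isolated_worker(worker_id: int) -> None:
    """Stands in for one small, isolated machine pulling work off a shared queue."""
    while True:
        job = tasks.get()
        if job is None:                       # sentinel: no more work for this worker
            break
        # Imagine this step hogs a GUI session, a dongle, or a GPU; it is the only
        # thing running on its node, so it cannot block anyone else's work.
        results.put((worker_id, job * job))

if __name__ == "__main__":
    workers = [threading.Thread(target=isolated_worker, args=(i,)) for i in range(4)]
    for w in workers:
        w.start()
    for job in range(10):                     # the work-management side: feed tasks in
        tasks.put(job)
    for _ in workers:
        tasks.put(None)                       # one stop sentinel per worker
    for w in workers:
        w.join()
    done = [results.get() for _ in range(results.qsize())]
    print(sorted(square for _, square in done))   # [0, 1, 4, 9, ..., 81]
```
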
Also, distributed systems have a few advantages over their tightly coupled brethren. First and foremost, it's generally a lot easier to scale them out after the initial build. This is very important if the workload is expected to grow over time.

Secondly, they tend to be more failure-resistant. If a single processing node fails, it does not always cause everything else to come to a halt. In a massively parallel computing system, a failure of a CPU or of RAM can cause the whole thing to stop, and that can mean a lot of lost work.
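
A hedged sketch of that failure-resistance point, with the node names, the simulated outage, and the `dispatch` helper all invented for illustration: when one node stops answering, only the affected task is retried elsewhere, and the rest of the run keeps its work.

```python
DOWN = {"node-b"}   # pretend this node has lost a CPU or its RAM

def run_on_node(node: str, task: int) -> int:
    """Pretend to run one task on one remote node."""
    if node in DOWN:
        raise ConnectionError(f"{node} did not respond")
    return task * 2

def dispatch(tasks: list, nodes: list, max_attempts: int = 3) -> list:
    """Re-queue a task on another node when one fails, instead of aborting the whole run."""
    results = []
    for task in tasks:
        for attempt in range(max_attempts):
            node = nodes[(task + attempt) % len(nodes)]
            try:
                results.append(run_on_node(node, task))
                break                         # this task is done; move on
            except ConnectionError:
                continue                      # only this task is retried; other work proceeds
        else:
            raise RuntimeError(f"task {task} failed on {max_attempts} nodes")
    return results

if __name__ == "__main__":
    # Every task still completes even though node-b is down.
    print(dispatch(list(range(6)), ["node-a", "node-b", "node-c"]))   # [0, 2, 4, 6, 8, 10]
```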

Third, getting 1000 cores of processing power is generally a lot cheaper if it's done with 125 discrete 8-core servers than in a single box. If the problem being solved can tolerate a loosely coupled system, the economics make the distributed route very attractive. As a rule, the less custom engineering you need for a system, the cheaper it'll be.

And that is why we build large distributed systems instead of supercomputers.