
Parallel Processing refers to the concept of speeding-up the execution of a
program by dividing the program into multiple fragments that can execute
simultaneously, each on its own processor. A program being executed across n
processors might execute n times faster than it would using a single processor.
Traditionally, multiple processors were provided within a specially designed
"parallel computer"; along these lines, Linux now supports SMP systems (often
sold as "servers") in which multiple processors share a single memory and bus
interface within a single computer. It is also possible for a group of computers
(for example, a group of PCs each running Linux) to be interconnected by a
network to form a parallel-processing cluster. The third alternative for
parallel computing using Linux is to use the multimedia instruction extensions
(i.e., MMX) to operate in parallel on vectors of integer data. Finally, it is
also possible to use a Linux system as a "host" for a specialized attached
parallel processing compute engine.
Although use of multiple processors can speed-up many operations, most applications cannot yet benefit from parallel processing. Basically, parallel processing is appropriate only if: Your application has enough parallelism to make good use of multiple processors. In part, this is a matter of identifying portions of the program that can execute independently and simultaneously on separate processors, but you will also find that some things that could execute in parallel might actually slow execution if executed in parallel using a particular system. For example, a program that takes four seconds to execute within a single machine might be able to execute in only one second of processor time on each of four machines, but no speedup would be achieved if it took three seconds or more for these machines to coordinate their actions. Either the particular application program you are interested in already has been parallelized (rewritten to take advantage of parallel processing) or you are willing to do at least some new coding to take advantage of parallel processing. The good news is that if all the above are true, you'll find that parallel processing using Linux can yield supercomputer performance for some programs that perform complex computations or operate on large data sets. What's more, it can do that using cheap hardware... which you might already own. As an added bonus, it is also easy to use a parallel Linux system for other things when it is not busy executing a parallel job. If parallel processing is not what you want, but you would like to achieve at least a modest improvement in performance, there are still things you can do. For example, you can improve performance of sequential programs by moving to a faster processor, adding memory, replacing an IDE disk with fast wide SCSI, etc.
Although parallel processing has been used for many years in many systems, it is still somewhat unfamiliar to most computer users. Thus, before discussing the various alternatives, it is important to become familiar with a few commonly used terms.
SIMD: SIMD (Single Instruction stream, Multiple Data stream) refers to a parallel execution model in which all processors execute the same operation at the same time, but each processor is allowed to operate upon its own data. This model naturally fits the concept of performing the same operation on every element of an array, and is thus often associated with vector or array manipulation. Because all operations are inherently synchronized, interactions among SIMD processors tend to be easily and efficiently implemented.
MIMD:
MIMD (Multiple Instruction stream, Multiple Data stream) refers to a parallel
execution model in which each processor is essentially acting independently.
This model most naturally fits the concept of decomposing a program for
parallel execution on a functional basis; for example, one processor might
update a database file while another processor generates a graphic display of
the new entry. This is a more flexible model than SIMD execution, but it is
achieved at the risk of debugging nightmares called race conditions, in which
a program may intermittently fail due to timing variations reordering the
operations of one processor relative to those of another.
SPMD:
Communication Bandwidth: The bandwidth of a communication system is the maximum amount of data that can be transmitted in a unit of time... once data transmission has begun. Bandwidth for serial connections is often measured in baud or bits/second (b/s), which generally correspond to 1/10 to 1/8 that many Bytes/second (B/s). For example, a 1,200 baud modem transfers about 120 B/s, whereas a 155 Mb/s ATM network connection is nearly 130,000 times faster, transferring about about 17 MB/s. High bandwidth allows large blocks of data to be transferred efficiently between processors.
Communication Latency: The latency of a communication system is the minimum time taken to transmit one object, including any send and receive software overhead. Latency is very important in parallel processing because it determines the minimum useful grain size, the minimum run time for a segment of code to yield speed-up through parallel execution. Basically, if a segment of code runs for less time than it takes to transmit its result value (i.e., latency), executing that code segment serially on the processor that needed the result value would be faster than parallel execution; serial execution would avoid the communication overhead.
Message Passing: Message passing is a model for interactions between processors within a parallel system. In general, a message is constructed by software on one processor and is sent through an interconnection network to another processor, which then must accept and act upon the message contents. Although the overhead in handling each message (latency) may be high, there are typically few restrictions on how much information each message may contain. Thus, message passing can yield high bandwidth making it a very effective way to transmit a large block of data from one processor to another. However, to minimize the need for expensive message passing operations, data structures within a parallel program must be spread across the processors so that most data referenced by each processor is in its local memory... this task is known as data layout.
Shared Memory: Shared memory is a model for interactions between processors within a parallel system. Systems like the multi-processor Pentium machines running Linux physically share a single memory among their processors, so that a value written to shared memory by one processor can be directly accessed by any processor. Alternatively, logically shared memory can be implemented for systems in which each processor has it own memory by converting each non-local memory reference into an appropriate inter-processor communication. Either implementation of shared memory is generally considered easier to use than message passing. Physically shared memory can have both high bandwidth and low latency, but only when multiple processors do not try to access the bus simultaneously; thus, data layout still can seriously impact performance, and cache effects, etc., can make it difficult to determine what the best layout is.
Aggregate Functions: In both the message passing and shared memory models, a communication is initiated by a single processor; in contrast, aggregate function communication is an inherently parallel communication model in which an entire group of processors act together. The simplest such action is a barrier synchronization, in which each individual processor waits until every processor in the group has arrived at the barrier. By having each processor output a datum as a side-effect of reaching a barrier, it is possible to have the communication hardware return a value to each processor which is an arbitrary function of the values collected from all processors. For example, the return value might be the answer to the question "did any processor find a solution?" or it might be the sum of one value from each processor. Latency can be very low, but bandwidth per processor also tends to be low. Traditionally, this model is used primarily to control parallel execution rather than to distribute data values.
Collective Communication: This is another name for aggregate functions, most often used when referring to aggregate functions that are constructed using multiple message-passing operations.
SMP: SMP (Symmetric Multi-Processor) refers to the operating system concept of a group of processors working together as peers, so that any piece of work could be done equally well by any processor. Typically, SMP implies the combination of MIMD and shared memory. In the IA32 world, SMP generally means compliant with MPS (the Intel MultiProcessor Specification.
SWAR: SWAR (SIMD Within A Register) is a generic term for the concept of partitioning a register into multiple integer fields and using register-width operations to perform SIMD-parallel computations across those fields. Given a machine with k-bit registers, data paths, and function units, it has long been known that ordinary register operations can function as SIMD parallel operations on as many as n, k/n-bit, field values. Although this type of parallelism can be implemented using ordinary integer registers and instructions, many high-end microprocessors have recently added specialized instructions to enhance the performance of this technique for multimedia-oriented tasks.
In addition to the Intel/AMD/Cyrix MMX (MultiMedia eXtensions), there are: Digital Alpha MAX (MultimediA eXtensions), Hewlett-Packard PA-RISC MAX (Multimedia Acceleration eXtensions), MIPS MDMX (Digital Media eXtension, pronounced "Mad Max"), and Sun SPARC V9 VIS (Visual Instruction Set). Aside from the three vendors who have agreed on MMX, all of these instruction set extensions are roughly comparable, but mutually incompatible.


Cluster parallel processing offers several important advantages: Each of the machines in a cluster can be a complete system, usable for a wide range of other computing applications. This leads many people to suggest that cluster parallel computing can simply claim all the "wasted cycles" of workstations sitting idle on people's desks. It is not really so easy to salvage those cycles, and it will probably slow your co-worker's screen saver, but it can be done.
The current explosion in networked systems means that most of the hardware for building a cluster is being sold in high volume, with correspondingly low "commodity" prices as the result. Further savings come from the fact that only one video card, monitor, and keyboard are needed for each cluster (although you will have to swap these to each machine to perform the initial installation of Linux, once running, a typical Linux PC does not need a "console"). In comparison, SMP and attached processors are much smaller markets, tending toward somewhat higher price per unit performance. Cluster computing can scale to very large systems. While it is currently hard to find a Linux-compatible SMP with many more than four processors, most commonly available network hardware easily builds a cluster with up to 16 machines. With a little work, hundreds or even thousands of machines can be networked. In fact, the entire Internet can be viewed as one truly huge cluster.

The fact that replacing a "bad machine" within a cluster is trivial compared to fixing a partly faulty SMP yields much higher availability for carefully designed cluster configurations. This becomes important not only for particular applications that cannot tolerate significant service interruptions, but also for general use of systems containing enough processors so that single-machine failures are fairly common. (For example, even though the average time to failure of a PC might be two years, in a cluster with 32 machines, the probability that at least one will fail within 6 months is quite high.) OK, so clusters are free or cheap and can be very large and highly available... why doesn't everyone use a cluster? Well, there are problems too:
With a few exceptions, network hardware is not designed for parallel processing. Typically latency is very high and bandwidth relatively low compared to SMP and attached processors. For example, SMP latency is generally no more than a few microseconds, but is commonly hundreds or thousands of microseconds for a cluster. SMP communication bandwidth is often more than 100 MBytes/second, whereas even the fastest ATM network connections are more than five times slower. There is very little software support for treating a cluster as a single system.

Thus, the basic story is that clusters offer great potential, but that potential may be very difficult to achieve for most applications. The good news is that there is quite a lot of software support that will help you achieve good performance for programs that are well suited to this environment, and there are also networks designed specifically to widen the range of programs that can achieve good performance.

