Of the decompositions driven by the properties of the data (symmetry, regularity), we will look at partitioning, and mapping. Both are closely related. The former is involved with the way the data structures are broken down into individual modules to be distributed to the processors, the latter with the specific assignment of these modules to specific processors.
Assume that we have an application to make parallel, and that this application can be decomposed into a total of seven tasks, as shown in Figure 9-1.
T0 is the first task to start the application, and T1, T2, T3, and T4 can run in parallel once T0 is finished. After T1, T2, and T3 have finished, T5 can run, and finally T6, given that T4 has also terminated. The numbers in brackets represent some measure of the computation. Since we are assuming a network of identical transputers, this number can simply be an estimate of the execution time for each task. In case the network contains transputers running at different clock rates, this number could represent the number of CPU cycles required by each task. In this case, multiplying this number by the cycle time of the processor the task gets assigned to yield the execution time of that task.
Figure 9-a shows the Gantt chart for our example application when run on one processor. The tasks are shown here executed in their natural order, T0, T1,T2, T3, T4, T5, and T6, although T0, T3,T2, T1, T4, T5, and T6 could have been just as acceptable. With a network of two transputers, and using a greedy algorithm that schedules tasks so that their starting time is as early as possible, we get the Gantt chart of Figure 9-2b.
Let's analyze it in detail. Task T0 is started first on Processor 1, followed directly by T1. We assume that the two data items that T0 and T1 must exchange[2] do not cost anything if the two tasks run on the same transputer. This is a simplifying assumption here. If the two tasks had communicated via a soft channel, the two data items would have required a memory transfer which would have being indicated on the Gantt chart by a slight delay between the start of T1 and the end of T0. Because T1, T2, and T3 can be run in parallel, we schedule T2 on the second processor as soon as T0 has ended. But in this situation, the two data items that must be exchanged by T0 and T2 force T2 to start later than T1. We have assumed here that each data item requires 1 unit of time for communication. That unit of time is the same unit used to represent the execution times of the tasks. Therefore T2 starts two units later than T1, or at Time 5. At Time 8 T1 is done and T3 can now start. At Time 12 T2 finishes on Processor 2, and since Processor 1 is busy with T3, T4 can be scheduled. But we must make sure that a delay of 5 time units exist between the end of T0 and the beginning of T4 as indicated by the task dependency graph. Indeed T2 finishes at Time 12, which is long enough for T0 to have sent the five data items[3]. You will note that we have a one unit of idle time between T3 and T5. Why is that? Since T1 and T2 have already terminated when T3 finishes, why can't T5 start right away on Processor 1? The reason, as you may have guessed, is that T5 needs two data items from T2, which is run on the other processors, and therefore T5 cannot start until two units of time have elapsed after the end of T2. Finally T6 follows T5 and ends the computation after a total of 23 units of time. The goal of our scheduler is to minimize this number, and, in this case, the greedy algorithm provides the best possible solution.
Under such drastic restrictions, we see that scheduling tasks on a processor network is going to be a hard endeavor. One should remember, however, that NP-completeness does not mean that applications with a small pool of parallel tasks cannot be scheduled optimally. It simply means that large applications with a abundant pool of parallel tasks will require computation times beyond any reasonable bounds. But smaller applications with a small number of task can be solved.
Adam et al. [ADAM74] propose several list-scheduling heuristics, and in particular one which is easy to implement and which yields close to optimal solutions. They compute the priority of a task as its level in the dependency graph. The level of a task is defined as the maximum number of tasks (including itself) in the graph on any path from that task to a terminal task. In the graph of Figure 9-1, for example, Task T6 has level 1, T5 and T4 have level 2, T1, T2, and T3 have level 3, and T0 has level 4. Their scheduling algorithm operates as follows:
Scheduling on transputers is thus easier to implement off-line, as a static operation, in a succession of several steps. The first one is to write the application in which tasks are well defined, and where they exchange information with each other at the beginning and end of their execution. Once the application is written, the next step is to profile it on a single transputer. The profiling will report the execution time of the different tasks, along with the amount of data exchanged. Using this information, a scheduler can process this information and generate a Gantt chart of the execution times of the different tasks. The output is a schedule of the tasks on the transputer network.
Compiling can fit this model as well, with a four-stage or five-stage pipeline, depending on whether optimization is added after the code generation or not.
Fast Fourier Transforms (FFT) are also frequently implemented on pipelined networks [WILS, HUAN91] in the context of digital signal processing (DSP), image analysis, or in finding the solution of differential equations. Typically, a set of N complex points fi must be processed to get a set of N new complex values gk, where N is assumed to be a power of 2 (N=2n) to simplify the computation.
(9.1)
where j=sqrt(-1). The presence of the minus sign in the exponential characterizes a forward transformation, while the reverse transformation exhibits a plus sign and a division of the sum by N. Since each of the N resulting points are computed from a sum of the original N points, the complexity of the transformation is clearly (N2) serial steps. However, in 1965, Cooley and Tukey [COOL65] found a simplification using partial results which requires only (Nlog2N). Their discovery is that the final points can be computed as the combination of two sums, one containing all the original points with even indexes, the other sum containing those with odd indexes. When N is a power of 2, these two sums can be considered as the result of an FFT carried out on half the original points, and a recursive process can be implemented. Starting with the original N points, the first recursive steps combines them in pairs, taking points with indexes differing by N/2. This yields N new points, which can be combined in a second recursive step which pairs them by selecting points with indexes N/4 away, and so on, until each resulting point contains the influence of the first N points.
Figure
9-
3:
How wormhole routing reduces communication delays seen by receiving nodes. (a)
The message is sent by T0 for T3 as a unit, and relay nodes must buffer the
whole message before retransmitting it. (b) With wormhole routing, the packet
is divided into flits that are sent individually. Here the relay nodes buffer
only single flits, and the network of transputers acts as a pipeline for the
flits.
In a parallel implementation the solution is thus to create a network where the N data points march through a pipeline of log2N stages. When a small number of transputers is available, the stages can be mapped directly to single transputers. In systems where the number of transputers is sufficiently large, each stage can be implemented by a column of transputers, each processing one or several pairs of points.
This example of a specialized FFT network of transputers relies on the network to compute the partial results of the FFT transformation. Each transputer is computing a small part of the transformation, processing and creating intermediary results until the final stages outputs the transformed vector. In some cases, however, the functional decomposition can use a much coarser approach. Imagine for example an application where some audio signal must be filtered to remove its continuous component. A logical decomposition for this application would be to process the digitized signal as follows:
This time the forward and reverse FFT transformations could be done by a single transputer, using Logical Systems C's FFT library functions.
In cases where just a few processors are available, one could "fold" the pipeline so that the transputers support two logical stages, as shown in the following diagram.
This highlights the flexibility and ease with which an application can be made parallel when its functional description lends itself to a simple and efficient decomposition. This ease of programming, however, is often traded off for lower performance. In the case of the pipeline for example, the performance is dependent on the sustained rate at which data can be fed to the first stage, and also on the processing speed of each stage. Even with buffers between stages, the processing rate of the pipeline is only as good as that of its slowest stage, and a functional decomposition may not recognize discrepancies between the processing load of the different modules. When performance is at a premium, tuning of the system is often required. When tuning does not provide a sufficient level of performance, another approach might be worth implementing.
We now move to data-driven approaches to decompose applications on multiprocessor networks.