Parallel Programming in C for the Transputer
© D. Thiébaut, 1995



9
Problem Decomposition on a Multiprocessor Network

9-1 Introduction

In this chapter we look at the problem of decomposing an application into several modules to be run on separate processors. This decomposition is usually influenced by the specific properties of the application, and can affect the code, the data structures, or both. Scheduling and functional decomposition are two examples of code-driven division, where the code is broken up into modules that are assigned to specific processors. The scheduling problem is concerned mostly with an efficient dispersion of the code modules over the processors so as to minimize the total execution time, while functional decomposition is more concerned with preserving some important logical property of the application. A decomposition that follows the logic of the application makes for straightforward programming and usually results in good performance. Pipeline decompositions are typical examples of functional decomposition.

Of the decompositions driven by the properties of the data (symmetry, regularity), we will look at partitioning and mapping, two closely related operations. The former deals with the way the data structures are broken down into individual modules to be distributed to the processors; the latter with the assignment of these modules to specific processors.

9-2 Code-Driven Decomposition

Scheduling

Scheduling applies to many areas of computer science, and a large number of classes of scheduling methods exist [REWI92]. When applied to multiprocessor systems, however, scheduling takes the form of a resource-allocation problem: a set of tasks requiring different amounts of computation must be allocated to a network of processors. The tasks are linked by a dependency graph indicating the order in which they must execute relative to one another, with weights attached to the nodes and edges of the graph. The weight attached to a node represents the amount of computation required by the task, while the weight attached to a directed edge represents the amount of data exchanged by the two tasks it connects. An example will illustrate these definitions.

Assume that we have an application to make parallel, and that this application can be decomposed into a total of seven tasks, as shown in Figure 9-1.


Figure 9-1: A sample task dependency graph.

T0 is the first task to start the application; T1, T2, T3, and T4 can run in parallel once T0 is finished. After T1, T2, and T3 have finished, T5 can run, and finally T6, once T4 has also terminated. The numbers in brackets represent some measure of the computation. Since we are assuming a network of identical transputers, this number can simply be an estimate of the execution time of each task. If the network contains transputers running at different clock rates, this number could instead represent the number of CPU cycles required by each task; multiplying it by the cycle time of the processor the task is assigned to then yields the execution time of that task.
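As a concrete illustration, a task graph of this kind can be captured by a small C data structure. The sketch below is only one possible representation (the type and field names are ours, not part of any transputer library): each vertex stores its estimated execution time, and each outgoing edge stores the number of data items sent to the corresponding successor.

    #define MAXTASKS 16
    #define MAXSUCC   8

    /* One vertex of the task dependency graph.  The vertex weight is
     * the estimated execution time of the task; each outgoing edge
     * carries the number of data items sent to a successor task.     */
    typedef struct {
        int time;              /* estimated execution time            */
        int nsucc;             /* number of successor tasks           */
        int succ[MAXSUCC];     /* indexes of the successor tasks      */
        int items[MAXSUCC];    /* data items sent to each successor   */
    } Task;

    Task graph[MAXTASKS];      /* graph[0] holds T0, graph[1] T1, ... */

The graph of Figure 9-1 would then be entered by filling in this array, for example giving T0 a successor list containing T1, T2, T3, and T4, with two data items on the edges to T1 and T2, and five on the edge to T4.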


Gantt chart

task dependency graph

The goal of the scheduling problem is to "schedule" the tasks on the available processors so as to minimize the longest execution time[1]. This can be done using a Gantt chart representing the execution of the tasks on the processor network. The Gantt chart is a graphic tool showing a mapping of the task dependency graph onto a set of linear axes depicting time. There are as many axes as available processors, and tasks are assigned to processors in such a way that the order imposed by the dependency graph is respected, and that communication delays are taken into account.

Figure 9-2a shows the Gantt chart for our example application when run on one processor. The tasks are shown here executed in their natural order, T0, T1, T2, T3, T4, T5, and T6, although T0, T3, T2, T1, T4, T5, and T6 would have been just as acceptable. With a network of two transputers, and using a greedy algorithm that schedules tasks so that their starting time is as early as possible, we get the Gantt chart of Figure 9-2b.

Let's analyze it in detail. Task T0 is started first on Processor 1, followed directly by T1. We assume that the two data items that T0 and T1 must exchange[2] do not cost anything when the two tasks run on the same transputer. This is a simplifying assumption: had the two tasks communicated via a soft channel, the two data items would have required a memory transfer, which would have appeared on the Gantt chart as a slight delay between the end of T0 and the start of T1. Because T1, T2, and T3 can run in parallel, we schedule T2 on the second processor as soon as T0 has ended. But in this situation the two data items that must be exchanged by T0 and T2 force T2 to start later than T1. We have assumed here that each data item requires 1 unit of communication time, the same unit used to represent the execution times of the tasks. Therefore T2 starts two units later than T1, at Time 5. At Time 8, T1 is done and T3 can start. At Time 12, T2 finishes on Processor 2, and since Processor 1 is busy with T3, T4 can be scheduled there. But we must make sure that a delay of 5 time units exists between the end of T0 and the beginning of T4, as indicated by the task dependency graph. Indeed, T2 finishes at Time 12, which is late enough for T0 to have sent the five data items[3].

You will note that there is one unit of idle time between T3 and T5. Why is that? Since T1 and T2 have already terminated when T3 finishes, why can't T5 start right away on Processor 1? The reason, as you may have guessed, is that T5 needs two data items from T2, which ran on the other processor, and therefore T5 cannot start until two units of time have elapsed after the end of T2. Finally, T6 follows T5 and ends the computation after a total of 23 units of time. The goal of our scheduler is to minimize this number, and in this case the greedy algorithm provides the best possible solution.
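The greedy strategy just described is simple enough to sketch in C. The fragment below, which reuses the hypothetical Task structure sketched earlier, is a minimal illustration rather than a production scheduler: at each step it picks the ready task and processor pair giving the earliest start time, charging one unit of time per data item whenever the two communicating tasks sit on different transputers.

    #define NTASKS 7           /* T0 .. T6 of Figure 9-1              */
    #define NPROCS 2
    #define INFTY  1000000

    extern Task graph[];       /* the task graph sketched above       */

    int finish[NTASKS];        /* finish time of each scheduled task  */
    int procOf[NTASKS];        /* processor assigned to each task     */
    int freeAt[NPROCS];        /* time at which each processor frees  */
    int done[NTASKS];          /* 1 once a task has been scheduled    */

    /* data items sent from task q to task t, or 0 if no edge q->t    */
    int itemsBetween(int q, int t)
    {
        int i;
        for (i = 0; i < graph[q].nsucc; i++)
            if (graph[q].succ[i] == t)
                return graph[q].items[i];
        return 0;
    }

    /* earliest time task t could start on processor p, or -1 if one
     * of its predecessors has not been scheduled yet                 */
    int earliestStart(int t, int p)
    {
        int q, w, r, s = freeAt[p];
        for (q = 0; q < NTASKS; q++) {
            w = itemsBetween(q, t);
            if (w == 0) continue;           /* q not a predecessor    */
            if (!done[q]) return -1;        /* t not ready yet        */
            r = finish[q] + (procOf[q] == p ? 0 : w);
            if (r > s) s = r;               /* wait for q's data      */
        }
        return s;
    }

    void schedule(void)
    {
        int n, t, p, s, bestT = 0, bestP = 0, bestS;
        for (n = 0; n < NTASKS; n++) {
            bestS = INFTY;
            for (t = 0; t < NTASKS; t++) {  /* scan the ready tasks   */
                if (done[t]) continue;
                for (p = 0; p < NPROCS; p++) {
                    s = earliestStart(t, p);
                    if (s >= 0 && s < bestS) {
                        bestS = s; bestT = t; bestP = p;
                    }
                }
            }
            done[bestT]   = 1;              /* schedule the winner    */
            procOf[bestT] = bestP;
            finish[bestT] = bestS + graph[bestT].time;
            freeAt[bestP] = finish[bestT];
        }
    }

After schedule() returns, procOf[] and finish[] together describe the Gantt chart: the task assignments and their completion times on each processor.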




NP-complete problem

The fact that the greedy algorithm gave the best solution on this simple example shouldn't lead you to believe that it is the best algorithm to use in practice. It is certainly one candidate, but will not in general yield the optimal solution. In fact, the problem of scheduling tasks on multiple processors has been shown to be NP-complete for all but a few special cases [PRAS87]. These cases assume that only two processors are available, that all tasks have unit execution time, and that the task dependency graph is a tree. This indeed applies to very few real-life cases.

Since optimal solutions are tractable only under such drastic restrictions, scheduling tasks on a processor network is going to be a hard endeavor in general. One should remember, however, that NP-completeness does not mean that applications with a small pool of parallel tasks cannot be scheduled optimally. It simply means that large applications with an abundant pool of parallel tasks will require computation times beyond any reasonable bounds; smaller applications with a small number of tasks can still be solved exactly.


Figure 9-2: Gantt charts of the sample application on two configurations: (a) a one-transputer network, where the tasks are executed in their natural order; (b) a two-transputer network, where a greedy algorithm is used to schedule each task so that its starting time is as early as possible. Note that T6 cannot start until T4 is finished.

Scheduling heuristics

In general, though, scheduling must resort to approximate solutions through the use of heuristics. One such family of heuristics is referred to as list scheduling; it works by assigning priorities to the tasks and by scheduling the tasks in order of those priorities. The heuristics differ in the way the priorities are computed. The example above followed a list-scheduling heuristic in which the priority was derived from the dependencies and the earliest possible execution time.

Adam et al. [ADAM74] propose several list-scheduling heuristics, and in particular one which is easy to implement and yields close-to-optimal solutions. They compute the priority of a task as its level in the dependency graph. The level of a task is defined as the maximum number of tasks (including itself) on any path from that task to a terminal task of the graph. In the graph of Figure 9-1, for example, Task T6 has level 1; T5 and T4 have level 2; T1, T2, and T3 have level 3; and T0 has level 4. Their scheduling algorithm then repeatedly picks, among the tasks whose predecessors have all completed, the one with the highest level, and assigns it to the processor on which it can start earliest.
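Computing the levels is straightforward; the recursive function below, again written against the hypothetical Task structure used earlier, returns the level of a task as defined above.

    /* level of task t: the maximum number of tasks, t included, on
     * any path from t to a terminal task of the dependency graph     */
    int level(int t)
    {
        int i, l, max = 0;
        for (i = 0; i < graph[t].nsucc; i++) {
            l = level(graph[t].succ[i]);
            if (l > max)
                max = l;
        }
        return max + 1;        /* terminal tasks return 1             */
    }

A list scheduler in the spirit of [ADAM74] can then simply replace the "earliest start time" criterion of the previous sketch by "highest level," breaking ties arbitrarily.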

Scheduling on transputer networks

The above description implies a dynamic allocation of tasks to processors as the application is being run. Although this is technically feasible, it implies substantial overhead for a transputer network. Migrating tasks from one transputer to another is not an easy process, especially if one is trying to implement it in C. It is much easier to migrate data structures, and we will see in Chapter 10 how load-balancing offers a dual to scheduling by dynamically allocating data to processors.

Scheduling on transputers is thus easier to implement off-line, as a static operation, in a succession of steps. The first is to write the application so that tasks are well defined and exchange information with each other only at the beginning and end of their execution. Once the application is written, the next step is to profile it on a single transputer. The profiling reports the execution time of the different tasks, along with the amount of data exchanged. A scheduler can then process this information and generate a Gantt chart of the execution of the different tasks; its output is a schedule of the tasks on the transputer network.
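The profiling step itself requires nothing transputer-specific. The minimal sketch below uses the standard C clock() function (on a transputer one would more likely read the on-chip timer); the task entry points are placeholders standing in for the application's real task functions.

    #include <stdio.h>
    #include <time.h>

    extern void task0(void);   /* placeholder task entry points       */
    extern void task1(void);

    /* run one task and report its execution time in clock ticks      */
    long profileTask(void (*task)(void), const char *name)
    {
        clock_t start, ticks;
        start = clock();
        task();                /* execute the task once               */
        ticks = clock() - start;
        printf("%s: %ld ticks\n", name, (long)ticks);
        return (long)ticks;
    }

    int main(void)
    {
        profileTask(task0, "T0");
        profileTask(task1, "T1");
        return 0;
    }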



Simplifying assumptions

The only non-trivial part of this process is accounting for the time spent in communication, since the transputer's DMA transfers can be overlapped with the execution of other tasks; here the off-line scheduler can resort to some simplifying approximations.

Functional Decomposition


Intuitive and logical partition

The next example of a decomposition driven by the code of the application rather than by its data is the functional decomposition. While scheduling attempts to distribute independent tasks to processors in an effort to yield the best possible performance, functional decomposition attempts to find a logical and intuitive partition of the application into independent tasks. A good example is the pipeline decomposition. Many applications naturally fit a pipeline in which input data is filtered by a succession of stages, the final stage outputting the result. Image processing, for example, can fit a four-stage pipeline, with stages such as image acquisition, filtering, feature extraction, and display.


Compiling can fit this model as well, with a four- or five-stage pipeline, depending on whether an optimization stage follows code generation.


Fast Fourier Transforms (FFT) are also frequently implemented on pipelined networks [WILS, HUAN91] in the context of digital signal processing (DSP), image analysis, or the solution of differential equations. Typically, a set of N complex points f_i must be processed to get a set of N new complex values g_k, where N is assumed to be a power of 2 (N = 2^n) to simplify the computation:

    g_k = \sum_{i=0}^{N-1} f_i e^{-2\pi jik/N},   k = 0, 1, ..., N-1          (9.1)

where j = sqrt(-1). The presence of the minus sign in the exponential characterizes a forward transformation, while the reverse transformation exhibits a plus sign and a division of the sum by N. Since each of the N resulting points is computed from a sum over the original N points, the complexity of the transformation is clearly O(N^2) serial steps. However, in 1965, Cooley and Tukey [COOL65] found a simplification using partial results which requires only O(N log_2 N) steps. Their discovery is that each final point can be computed as the combination of two sums, one containing the original points with even indexes, the other containing those with odd indexes. When N is a power of 2, these two sums can each be considered the result of an FFT carried out on half the original points, and a recursive process can be implemented. Starting with the original N points, the first recursive step combines them in pairs, taking points with indexes differing by N/2. This yields N new points, which can be combined in a second recursive step pairing points with indexes N/4 apart, and so on, until each resulting point contains the influence of all N original points.
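The recursion translates directly into C. The sketch below is a sequential version of the forward transform of Equation 9.1, written with an explicit Complex structure since we assume no complex type is available; it splits the input into its even- and odd-indexed halves by doubling the stride at each recursive call, then combines the two half-transforms with the usual "butterfly" step.

    #include <math.h>

    #define PI 3.14159265358979

    typedef struct { double re, im; } Complex;

    /* Recursive forward FFT of n points (n a power of 2).
     * in:  input samples, read every `stride' elements
     * out: receives the n transformed points, contiguously           */
    void fft(const Complex *in, Complex *out, int n, int stride)
    {
        int k;
        double a;
        Complex e, o, w;

        if (n == 1) {          /* a 1-point transform is the point    */
            out[0] = in[0];
            return;
        }

        fft(in,          out,       n/2, 2*stride);  /* even points   */
        fft(in + stride, out + n/2, n/2, 2*stride);  /* odd points    */

        for (k = 0; k < n/2; k++) {
            e = out[k];
            o = out[k + n/2];
            a = -2.0 * PI * k / n;    /* minus sign: forward FFT      */
            w.re = cos(a);
            w.im = sin(a);
            /* butterfly: out[k] = e + w*o,  out[k+n/2] = e - w*o     */
            out[k].re       = e.re + w.re * o.re - w.im * o.im;
            out[k].im       = e.im + w.re * o.im + w.im * o.re;
            out[k + n/2].re = e.re - (w.re * o.re - w.im * o.im);
            out[k + n/2].im = e.im - (w.re * o.im + w.im * o.re);
        }
    }

A pipelined version distributes the log_2 N combination steps over the stages of the network, each stage performing one round of butterflies.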


Figure 9-3: How wormhole routing reduces communication delays seen by receiving nodes. (a) The message is sent by T0 for T3 as a unit, and relay nodes must buffer the whole message before retransmitting it. (b) With wormhole routing, the packet is divided into flits that are sent individually. Here the relay nodes buffer only single flits, and the network of transputers acts as a pipeline for the flits.

In a parallel implementation, the solution is thus to create a network where the N data points march through a pipeline of log_2 N stages. When only a small number of transputers is available, the stages can be mapped directly to single transputers. In systems where the number of transputers is sufficiently large, each stage can be implemented by a column of transputers, each processing one or several pairs of points.
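Whatever the mapping, each transputer in the pipeline runs the same simple loop: read a block of points from the previous stage, apply its share of the work, and pass the block on. The sketch below assumes the Channel type and ChanIn()/ChanOut() primitives declared in the compiler's channel header, and a process() function standing in for the stage's actual computation; these names are placeholders to be adapted to your library.

    #define NPOINTS 256        /* points per block, an assumption     */

    extern void process(Complex *buf, int n);   /* this stage's work  */

    /* generic pipeline stage: runs forever, one block at a time      */
    void stage(Channel *in, Channel *out)
    {
        Complex buf[NPOINTS];

        for (;;) {
            ChanIn(in, buf, sizeof(buf));    /* block from upstream   */
            process(buf, NPOINTS);           /* e.g. one butterfly    */
            ChanOut(out, buf, sizeof(buf));  /* pass it downstream    */
        }
    }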

This example of a specialized FFT network of transputers relies on the network to compute the partial results of the transformation: each transputer computes a small part of it, processing and creating intermediate results until the final stage outputs the transformed vector. In some cases, however, the functional decomposition can use a much coarser approach. Imagine for example an application where some audio signal must be filtered to remove its continuous component. A logical decomposition for this application would be to process the digitized signal in three steps: apply a forward FFT, remove the zero-frequency term from the transformed signal, and apply a reverse FFT to recover the filtered signal.

This time the forward and reverse FFT transformations could be done by a single transputer, using Logical Systems C's FFT library functions.

In cases where just a few processors are available, one could "fold" the pipeline so that each transputer supports two logical stages.

This highlights the flexibility and ease with which an application can be made parallel when its functional description lends itself to a simple and efficient decomposition. This ease of programming, however, is often traded for lower performance. In the case of the pipeline, for example, the performance depends on the sustained rate at which data can be fed to the first stage, and on the processing speed of each stage. Even with buffers between stages, the throughput of the pipeline is only as good as that of its slowest stage, and a functional decomposition may not recognize discrepancies between the processing loads of the different modules. When performance is at a premium, tuning of the system is often required; when tuning does not provide a sufficient level of performance, another approach might be worth implementing.
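For instance, if the four stages of a pipeline require 3, 5, 2, and 4 time units per block of data, the pipeline delivers one block every 5 units once it is full: the 5-unit stage sets the pace, and the other three stages sit idle part of the time.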

We now move to data-driven approaches for decomposing applications on multiprocessor networks.
