Parallel Programming in C for the Transputer
© D. Thiébaut, 1995



9-3 Data-driven decomposition

Just as sometimes the algorithmic description of an application will lend itself naturally to a efficient and/or natural decomposition on a processor network, the data structures that support the application may themselves have good symmetrical or regular properties that can make a particular decomposition more appealing. Vectors, two-dimensional and three-dimensional matrices are examples where the regularity of the data lends itself well to parallelism. In other cases the data may lack any regularity in its structure, but it might be partitioned easily, making it appealing to replicate tasks in the processors and have each one work on different members of the partition. Search trees and linked structures are typical examples of this later type of data structures.

Partitioning, mapping and load balancing

In this section we investigate partitioning, and mapping. Partitioning refers to the way a data structure can be cut into logical blocks that are independent enough of each other that they can be processed individually by separate processors. Once the data structures are partitioned, the next task is to map them to the processors. The goal here is to find, again, a logical assignment function that provides efficient communication between processors.

Partitioning

Partitioning involves using the regularity of a data structure or of the problem formulation and using it to find a decomposition where subsets of the data or of the problem are assigned to individual processors.

SPMD

In the Mandelbrot Set example we explored in the previous chapter, the data structure is a static two-dimensional array of HxV pixels where the computation of the color of each pixel represents the work to be performed. This a rare case of an application with an extremely regular data domain, where pixels do not depend on their neighbors, and hence the data traffic is reduced to that of processors sending their result to the host for display. In cases presenting such level of symmetry and of very low communication requirements, the easiest solution is to implement a Single Program Multiple Data (SPMD) decomposition. Each processor essentially runs the same program, which in most cases is a serial program, on a data set unique to each processor.

In Chapter 8, we analyzed a decomposition where each processor was assigned a column of pixels, corresponding to a vector of complex points with identical real components. Because of the lack of dependencies, we could have chosen many other alternative partitions to distribute the HxV pixels to the N processors: a row of pixels per processor, a horizontal band of height V/N, a vertical band of width H/N, or a rectangular area of hxv pixels, among others.

The choice may or may not be arbitrary, depending on several factors. For example, if you looked at the Mandelbrot code closely, you may have wondered why the transputers did the computation twice, one for a complex point of coordinate (a,b) and one for the point of coordinate (a,-b). This is a good observation, and in a performance-minded environment, we should take advantage of the symmetry in the final tuning stage. Certainly the vertical-vector partition, or the vertical band partition would allow the computation to go almost twice as fast since the color of two pixels could be found at the same time (assuming that the picture displayed on the screen is symmetrical around the real axis). Hence a partition favoring vertical sections of the screen would have provided a parallel program with twice the speed of one where the partition is based on horizontal sections of the screen.

Applications like the Mandelbrot set that yield very simple and efficient partitioning are often referred to as perfectly decomposable applications [HWAN93].

Conway's game of life presents a situation similar to that of the Mandelbrot set in terms of the static data structure which is extremely regular and two-dimensional. With the game of life, however, each cell depends on its neighbors and the partition must take into account this interaction between the cells. Since the problem is essentially static in nature (the array of the old generation is used to compute the array of new generation), we can partition the data into regular subsets and assign them to individual processors.

When the number of processor is a perfect square, the array of cell can be partitioned as shown in Figure 9-4. The solution shown in Figure 9-4a is appealing for its regularity. Each processor covers a square area of the board. However, this regularity is deceiving. If the board is assumed to have edges, that is a cell on an edge of the board has fewer neighbors than one inside the board.. The processors in the corners, for example, have three neighbors, while processors "in the middle" of the board have eight, and processors on an edge have five. Moreover, the communication of one transputer with eight neighbors requires some algorithmic control of the data exchange since the transputers have only four links[1].

A simpler solution is to adopt the partition shown in Figure 9-4b. It is interesting for two separate reasons. The first one is that all but two processors must exchange information with only two other processors. The exceptions are the first and last processors. The second advantage is that this partition can fit a wide variety of processor networks, including chains.


Figure 9- 4: Two examples of a partition of the array of cells in a game of life. (a) the number of processor is an exact square and each processor covers a square area of cells. (b) the array is divided in bands, here vertical, of equal width.

We see, then, that as the communication structure connecting the individual elements of the data domain becomes more complex, our partitioning must be done carefully to minimize the number of special cases.

Other applications sharing partitioning properties similar to those of the game of life are many-body problems, where a fixed number of objects move around a fixed size domain. Usually each processor keeps track of a subset of the objects and shares information with other processors to account for the objects' interaction.

The N-body problem

Let's take a look at a solution for partitioning an N-body problem in astrophysics [LEST93]. In this problem N bodies circulate in a 2- or 3-dimensional space and exert attractive forces on each other. This force is proportional to the mass of the bodies and the distance separating them. For example, assume two bodies B1 and B2, of mass m1 and m2, respectively, separated by the distance d12. The attractive force they exert on each other is given by:

    (9.2)

where g is the universal gravitational constant. Typically, the goal of a simulation of an N-body problem in astrophysics is to simulate the movement of several planets over some period of time T by dividing the time into small increments. Then the incremental change in force, velocity and position relative to each planet over each time interval is computed. For N bodies, clearly N2 operations are required during each interval, just to compute the gravitational influence of the planets on each other. Let's examine how we can partition the problem on a ring of transputers. In this approach the N bodies are divided equally (or close to equally) among P transputers. The N bodies form the data domain, and each transputer manages N/P bodies. Since each body acts on every other body in the system, each transputer must exchange information with all the others, but this interaction can be done in "shifts."


Figure 9- 5: Example of three transputers computing the interactions between N bodies divided into three partition slices.

The idea is for each transputer to keep two slices of the domain. One that is private and contains that transputer's "own" bodies, and one other slice, of equal size, containing the bodies of other transputers in the ring. Figure 9-5 illustrates how three transputers would compute the interaction between N bodies during one simulation step. The domain containing the N bodies is originally divided into three slices containing an equal amount of bodies. They are shown as rectangles of different colors. In the first step of the simulation (Figure 9-5a) each slice is assigned to an individual transputer which saves it as its "own" slice. Then (Figure 9-5b), the transputers shift their slice to their right neighbors, in a ring fashion, and receive a new slice from their left neighbor. The incoming slice is then stored in a temporary buffer. Each transputer then starts computing the force caused by the bodies of the incoming slice on its own bodies (Figure 9-5c). Once all the force have been computed, the transputers shift the temporary slice out to their right neighbor and receives a new slice from their left neighbor (Figure 9-5d). The new slice is stored in the same temporary buffer, and the transputer computes a new set of forces (Figure 9-5e). Because we have only three transputers, only two shift and compute steps are necessary for each transputer to compute the forces on its private bodies.

The coding of this N-body problem in 2-dimensional space is shown in the next page.




[Previous] [HOME] [NEXT]