/* =======================================================================
   matmult.c

   D. Thiebaut
   Department of Computer Science
   thiebaut@sophia.smith.edu
   Smith College
   Northampton, MA 01063

   DESCRIPTION:
   This program implements the matrix multiplication algorithm on a
   network of PxP transputers, when the matrices to multiply are of
   size NxN (we assume N a multiple of P).

   This program works in two phases.  During the first phase, the Host
   Interface transputer, which is not part of the toroidal mesh, reads
   the A and B matrices and sends them, using the indexing technique
   described in Chapter 9, to each transputer in the mesh.  A star
   network of virtual channels is used for this purpose.

   During the second phase, the network goes through the Compute/Rotate
   cycles and computes the elements of the C matrix.

   This program does not regroup the data in the host interface, but
   simply prints the contents of the C subblocks directly to the screen.

   EXAMPLE:
   Compute C = AxB, where A, B, and C are 4x4 matrices, on a toroidal
   mesh of 2x2 transputers (5 transputers total).

   PHYSICAL NETWORK: (diagram omitted)
   (Note that because CIO requires the host interface to have node
   Id 1, we remap the physical node numbers to logical node numbers,
   shown in parentheses below the physical numbers.)

   VIRTUAL STAR NETWORK (used to distribute the subblocks originally):

        5 [6]  ---> [10] 1
        5 [7]  ---> [11] 2
        5 [8]  ---> [12] 3
        5 [9]  ---> [13] 4

   VIRTUAL TOROIDAL MESH NETWORK (logical node number in the middle,
   surrounded with virtual channel numbers; diagram omitted).

   NIF FILE: (virtual channels only)

        1[6],101[10];    1[7],102[11];    1[8],103[12];    1[9],104[13];
        101[14],102[20]; 101[15],103[25]; 101[16],102[18]; 101[17],103[23];
        102[19],104[29]; 102[21],104[27]; 103[22],104[28]; 103[24],104[26];

   SYNTAX:
        ldnet matmult C_matrix_name A_matrix_name B_matrix_name

   where the last three parameters are the names of the text files
   containing the matrices, one row per line.
   ====================================================================== */

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include "conc.h"

/* =========================== DEFINITIONS ============================ */

#define N 8                  /* Dimension of matrices (NxN) */
#define P 2                  /* Dimension of transputer mesh (PxP) */
#define n (N/P)              /* Dimension of subblocks */

#define Myi (((_logical_node_number-1)/P)+1)   /* row index of this node */
#define Myj (((_logical_node_number-1)%P)+1)   /* column index of this node */

#define HOST_INTERFACE (P*P+1)   /* id of host interface */
#define ItemFormat "%d"          /* format used to read or print items */
#define index(i, j) (((i+j-2)%P)+1)   /* indexing function for distributing */
                                      /* A and B subblocks to transputers   */
typedef int ItemType;        /* type of matrix elements */

/* =========================== PROTOTYPES ============================= */

int  Adjust(int node_number);
void InitStar(Channel *StarChannels[P][P]);
void GetMatrixFromFile(char *filename, ItemType Matrix[N][N]);
void MatrixMult(void);
void PrintResults(ItemType Matrix[n][n]);
void GetMatrices(char *filename2, char *filename3);
void Show(char *msg, ItemType Matrix[N][N]);

/* ============================= GLOBALS ============================== */

int _logical_node_number;

ItemType A[n][n],            /* the A subblock */
         B[n][n],            /* the B subblock */
         C[n][n];            /* the C subblock */

/* ------------------------------------------------------------------ */
/*                               MAIN                                 */
/* ------------------------------------------------------------------ */
int main(int argc, char *argv[])
{
    _logical_node_number = Adjust(_node_number);
    GetMatrices(argv[2], argv[3]);

    if (_logical_node_number != HOST_INTERFACE) {
        MatrixMult();
        PrintResults(C);
    }
    printf("_Node_number %d exiting\n", _node_number);
    exit(0);
}

/* ------------------------------------------------------------------ */
/* ADJUST                                                             */
/* Change the physical node number into a logical node number.        */
/* The requirement that CIO interacts with a Root transputer          */
/* with node Id equal to 1 requires the use of another mapping.       */
/* ------------------------------------------------------------------ */
int Adjust(int node_number)
{
    if (node_number > 100)
        return _node_number - 100;
    else
        return HOST_INTERFACE;
}

/* ------------------------------------------------------------------ */
/* MATRIXMULT                                                         */
/* Multiplies the subblocks of the A and B matrices together and      */
/* adds the result to the C subblock.                                 */
/* When the update is done, the A subblock is rotated to the          */
/* right neighbor, and the B subblock to the lower neighbor.
*/
/* ------------------------------------------------------------------ */
void MatrixMult()
{
    int i, j, k, NoSteps;
    Channel *ToroidalChannels[4];             /* virtual channels to neighbors */
    enum Directions {Left, Up, Right, Down};  /* directions for channels */
    ItemType Temp[n][n];                      /* temporary subblock */

    /* initialize toroidal mesh with virtual channels */
    for (i = 0; i < 4; i++)
        ToroidalChannels[i] = VChan(6 + 2*P*P + (_logical_node_number-1)*4 + i);

    /* clear C subblock */
    for (i = 0; i < n; i++)
        for (j = 0; j < n; j++)
            C[i][j] = 0;

    /* start the Compute/Rotate process */
    for (NoSteps = 0; NoSteps < P; NoSteps++) {

        /* Compute step */
        for (i = 0; i < n; i++)
            for (j = 0; j < n; j++)
                for (k = 0; k < n; k++)
                    C[i][j] += A[i][k] * B[k][j];

        /* if last step, avoid unnecessary rotation */
        if (NoSteps == P-1)
            break;

        /* rotate B down */
        if (Myi % 2) {
            VChanOut(ToroidalChannels[Down], B, (int) sizeof(B));
            VChanIn(ToroidalChannels[Up], B, (int) sizeof(B));
        }
        else {
            VChanIn(ToroidalChannels[Up], Temp, (int) sizeof(B));
            VChanOut(ToroidalChannels[Down], B, (int) sizeof(B));
            memcpy(B, Temp, n*n*sizeof(ItemType));
        }

        /* rotate A right */
        if (Myj % 2) {
            VChanOut(ToroidalChannels[Right], A, (int) sizeof(A));
            VChanIn(ToroidalChannels[Left], A, (int) sizeof(A));
        }
        else {
            VChanIn(ToroidalChannels[Left], Temp, (int) sizeof(A));
            VChanOut(ToroidalChannels[Right], A, (int) sizeof(A));
            memcpy(A, Temp, n*n*sizeof(ItemType));
        }
    }
}

/* ------------------------------------------------------------------ */
/* PRINTRESULTS                                                       */
/* Prints the contents of a subblock.                                 */
/* The information is first printed to a character string             */
/* to prevent lines from being broken when printed concurrently       */
/* with other processors' output.
*/
/* ------------------------------------------------------------------ */
void PrintResults(ItemType Matrix[n][n])
{
    int i, j;
    char whole[200], line[80];

    ProcWait(_logical_node_number * 1000000 / 64);
    sprintf(whole, "Node %d: C subblock\n", _logical_node_number);
    for (i = 0; i < n; i++) {
        for (j = 0; j < n; j++) {
            sprintf(line, "%4d ", Matrix[i][j]);
            strcat(whole, line);
        }
        strcat(whole, "\n");
    }
    printf("%s", whole);
}

/* ------------------------------------------------------------------ */
/* GETMATRICES                                                        */
/* The host interface reads the A and B matrices from a text          */
/* file and sends them to the transputers in the toroidal             */
/* mesh.  The virtual channels creating the star are used             */
/* only in this module.                                               */
/* ------------------------------------------------------------------ */
void GetMatrices(char *filename2, char *filename3)
{
    /* ============================================================== */
    /* HOST_INTERFACE                                                 */
    /* ============================================================== */
    if (_logical_node_number == HOST_INTERFACE) {
        ItemType Matrix[N][N];
        ItemType BlockMatrix[n][n];
        Channel *StarChannels[P][P];
        int i, j, blocki, blockj;

        /* initialize star channels, from Host Interface to each
           of the transputers */
        InitStar(StarChannels);

        /* get A matrix */
        GetMatrixFromFile(filename2, Matrix);

        /* Cut A matrix into subblocks */
        for (blocki = 0; blocki < P; blocki++)
            for (blockj = 0; blockj < P; blockj++) {

                /* copy a section of A into a subblock */
                for (i = 0; i < n; i++)
                    for (j = 0; j < n; j++)
                        BlockMatrix[i][j] = Matrix[blocki*n + i][blockj*n + j];

                /* send that subblock to T[blocki+1][k] */
                /* where k is given by the indexing function */
                VChanOut(StarChannels[blocki][index(blocki+1, blockj+1)-1],
                         BlockMatrix, (int) sizeof(BlockMatrix));
            }

        /* get B matrix */
        GetMatrixFromFile(filename3, Matrix);

        /* Cut B matrix into subblocks */
        for (blocki = 0; blocki < P; blocki++)
            for (blockj = 0; blockj < P; blockj++) {

                /* copy a section of B into a subblock */
                for (i = 0; i < n; i++)
                    for (j = 0; j < n; j++)
                        BlockMatrix[i][j] = Matrix[blocki*n + i][blockj*n + j];

                /* send it to T[k][blockj+1] */
                /* where k is given by the indexing function */
                VChanOut(StarChannels[index(blocki+1, blockj+1)-1][blockj],
                         BlockMatrix, (int) sizeof(BlockMatrix));
            }
    }
    else {
        /* ============================================================== */
        /* TOROIDAL TRANSPUTER                                            */
        /* ============================================================== */
        Channel *FromStarCenter;

        /* Initialize virtual channel */
        FromStarCenter = VChan(6 + P*P + _logical_node_number - 1);

        /* get A and B subblocks from the Host Interface */
        VChanIn(FromStarCenter, A, (int) sizeof(A));
        VChanIn(FromStarCenter, B, (int) sizeof(B));
    }
}

/* ------------------------------------------------------------------ */
/* INITSTAR                                                           */
/* Initialize the star of virtual channels going from the Host I/F    */
/* to all other transputers in the mesh.                              */
/* ------------------------------------------------------------------ */
void InitStar(Channel *StarChannels[P][P])
{
    int i, j;
    int ChannelCount = 6;

    for (i = 0; i < P; i++)
        for (j = 0; j < P; j++, ChannelCount++)
            StarChannels[i][j] = VChan(ChannelCount);
}

/* ------------------------------------------------------------------ */
/* GETMATRIXFROMFILE                                                  */
/* ------------------------------------------------------------------ */
void GetMatrixFromFile(char *filename, ItemType Matrix[N][N])
{
    FILE *DataFile;
    int i, j;

    DataFile = fopen(filename, "rt");
    if (DataFile == NULL) {
        printf("Cannot open file %s\n", filename);
        exit(1);
    }
    for (i = 0; i < N; i++)
        for (j = 0; j < N; j++)
            fscanf(DataFile, ItemFormat, &(Matrix[i][j]));
    fclose(DataFile);
}
Note the indexing function used by the Host Interface transputer to send the subblocks to the individual transputers in the mesh:
#define index(i, j) (((i+j-2)%P)+1)
When the Host Interface sends a subblock Aij of the A matrix to Tik, where k is given by the indexing function, it uses the virtual channel to Tik which is stored in StarChannels[i][k]:
/* send that subblock to T[blocki+1][k] */
/* where k is given by the indexing function */
VChanOut(StarChannels[blocki][index(blocki+1, blockj+1)-1],
         BlockMatrix, (int) sizeof(BlockMatrix));
where i and j represent the indexes of the transputer in the mesh, the upper left transputer being labeled T11 and the one at the bottom right corner TPP. Because the matrix indexes start at 1, while the storage indexes start at 0, as is conventional in C, the storage indexes blocki and blockj must be adjusted (incremented by 1) before being passed to the index function. In turn, the result of the indexing function must be converted back to storage indexing (decremented by 1).
We should note here that the problem of finding a good domain decomposition and a schedule of shift/compute cycles to process the individual items efficiently is typical of systolic parallel processors, where the processors operate in a synchronized fashion (KUNG88). Although the conceptual difference between indexing in multiprocessors and in systolic arrays is minimal, the high cost of data communication in message-passing systems makes a larger block size per processor a must if performance is to be maintained; this way the computation effort can best match the communication delays. In systolic arrays, however, the power comes from the closeness of the processors and their tremendous data bandwidth, and it is then more interesting to keep the data slices very small.
9.4
Using the matrix multiplication algorithm as an example, write a parallel program for the transputer that will find the transitive closure of a graph. The transitive closure of a graph G=(V,E) is defined as the graph G'=(V,E') where there is an edge in E' linking two vertices v1 and v2 of V if there is a path from v1 to v2 in G. When the graph is represented by an adjacency matrix A of dimension N (N is also the number of vertices in the graph), Warshall's serial algorithm computes the adjacency matrix T of the transitive closure as follows:

/* A and T are NxN matrices.  A contains the adjacency matrix  */
/* of the graph G, and T will contain the transitive closure   */
/* on completion of the algorithm.                             */

/* copy A to T */
for (i = 0; i < N; i++)
    for (j = 0; j < N; j++)
        T[i][j] = A[i][j];

/* compute T */
for (i = 0; i < N; i++)
    for (j = 0; j < N; j++)
        if (T[j][i])
            for (k = 0; k < N; k++)
                if (T[i][k])
                    T[j][k] = 1;
9.5
The matrix-multiplication program we just explored does not gather the subblocks of the C matrix once the computation is over. However, it would be interesting to be able to save the result to a file. Write a program that saves the C matrix to a text file, and make the Host Interface responsible for this action.
9.6
Investigate Gustafson's observation that the speedup is a function of
the size of the problem just as much as it is a function of the number of
processors. For a toroidal mesh of fixed size P by P, measure
the execution of the matrix multiplication algorithm for matrices of increasing
sizes: P by P, 2P by 2P, 3P by 3P,
4P by 4P, 5P by 5P, etc. Verify that for a fixed-size mesh, the speedup varies with the size of the problem. Is there an optimal matrix size that yields the fastest execution time? What elements of the program influence the variation that you observe?
9.7
Virtual channels provide great flexibility. Some parallel algorithms call for communication patterns that multiprocessor networks sometimes cannot provide, but virtual channels allow an arbitrary logical network to be mapped onto a particular physical interconnect. If you have access to a reconfigurable transputer machine where different networks can be implemented, study the performance of the matrix multiplication algorithm on the following networks:

The algorithm we select to solve the problem may have some natural decomposition which can lead to a functional decomposition of the application on a particular network topology. At other times, the algorithm may result in a collection of concurrent tasks which does not yield any intuitive functional decomposition. In this case one can build the tasks' dependence graph and use a heuristic to schedule them in such a way that the total execution time is (close to) minimal.
Most often, however, it is the symmetry or regularity we find in the data structure implementing the domain of the problem which will lead our design. Here too, the decomposition of the domain and its distribution on the network of processors must be done carefully, so as to minimize storage requirements at the processor nodes and reduce communication. In some mathematical applications, a detailed analysis of the requirements and dynamics of the computation at each node may lead to an optimal assignment of the data to the nodes, such that the application can be carried out as a series of compute/communication phases. In such cases buffering is essential to increase concurrency between the compute and communication phases.
Efficient programs will be the result of a careful balance of programming and design decisions based on these issues. Choosing the right network configuration is another decision that bears strongly on the overall performance of the program.
In some cases, though, the problem might be simple to formulate but this formulation might be associated with a dynamic data domain lacking any symmetry or shape that we can utilize efficiently, preventing us from implementing any of the approaches developed in this chapter. In such cases it is essential to maintain an even balance of work among the processors. This is the subject of the next chapter: load balancing.