Parallel Programming in C for the Transputer
© D. Thiébaut, 1995



/* =======================================================================
   matmult.c

   D. Thiebaut                              Department of Computer Science
   thiebaut@sophia.smith.edu                Smith College
                                            Northampton, MA 01063


   DESCRIPTION:

        This program implements the matrix multiplication algorithm on
        a network of PxP transputers, when the matrices to
        multiply are of size NxN (we assume N a multiple of P).

        This program works in two phases.  During the first phase, the
        Host interface transputer, which is not part of the toroidal mesh,
        reads the A and B matrices and sends them, using the indexing
        technique described in Chapter 9, to each transputer in the mesh.
        A star network of virtual channels is used for this purpose.
        During the second phase, the network goes through the Compute/
        Rotate cycles and computes the elements of the C matrix.

        This program does not regroup the data in the host interface, but
        simply prints the contents of the C sub-blocks directly to the
        screen.

   EXAMPLE:  Compute C = AxB, where A, B, and C are 4x4 matrices
                  on a toroidal mesh of 2x2 transputers (5 transputers
                  total).

   PHYSICAL NETWORK:

         (Note that because CIO requires the host interface to have node
         Id 1, we remap the physical node numbers to logical node numbers
         (shown in parentheses BELOW the physical numbers).




 


   VIRTUAL STAR NETWORK (used to distribute sub-blocks originally):

            5 [6] ----------> [10] 1
            5 [7] ----------> [11] 2
            5 [8] ----------> [12] 3
            5 [9] ----------> [13] 4


   VIRTUAL TOROIDAL MESH NETWORK (Logical node number in the middle
                                 surrounded with virtual channel
                                      number).









   NIF FILE: (virtual channels only)

        1[6],101[10];
        1[7],102[11];
        1[8],103[12];
        1[9],104[13];
        101[14],102[20];
        101[15],103[25];
        101[16],102[18];
        101[17],103[23];
        102[19],104[29];
        102[21],104[27];
        103[22],104[28];
        103[24],104[26];

   SYNTAX:

        ld-net matmult C_matrix_name A_matrix_name B_matrix_name

               where the last three parameters are the names of the text
               files containing the matrices, one row per line.

 ====================================================================== */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include "conc.h"

/* =========================== DEFINITIONS ============================ */
#define  N              8                /* Dimension of matrices (NxN) */
#define  P              2         /* Dimension of transputer mesh (PxP) */
#define  n              (N/P)                /* Dimension of sub-blocks */
#define  Myi            (((_logical_node_number-1)/P)+1)   /* row index */
#define  Myj            (P-(_logical_node_number%P))    /* column index */
#define  HOST_INTERFACE (P*P+1)                 /* id of host interface */
#define  ItemFormat     "%d"      /* type to use to read or print items */
#define  index(i, j)    (((i+j-2)%P)+1)        /* indexing function for */
                      /* distributing A and B sub-blocks to transputers */
typedef  int            ItemType;            /* type of matrix elements */


/* =========================== PROTOTYPES ============================= */
int  Adjust(int node_number);
void InitStar(Channel *StarChannels[P][P]);
void GetMatrixFromFile(char *filename, ItemType Matrix[N][N]);
void MatrixMult(void);
void PrintResults(ItemType Matrix[n][n]);
void GetMatrices(char *filename2, char *filename3);
void Show(char *msg, ItemType Matrix[N][N]);


/* ============================= GLOBALS ============================== */
int      _logical_node_number;
ItemType A[n][n],                                   /* the A sub-block  */
         B[n][n],                                   /* the B sub-block  */
         C[n][n];                                   /* the C sub-block  */


/* -------------------------------------------------------------------- */
/*                                 MAIN                                 */
/* -------------------------------------------------------------------- */
main(int argc, char *argv[])
{
    _logical_node_number = Adjust(_node_number);

    GetMatrices(argv[2], argv[3]);
    if (_logical_node_number != HOST_INTERFACE)
    {
        MatrixMult();
        PrintResults(C);
    }
    printf("_Node_number %d exiting\n",_node_number);
    exit(0);
}

/* -------------------------------------------------------------------- */
/* ADJUST                                                               */
/* Change the physical node number into a logical node number.          */
/* The requirement that CIO interacts with a Root transputer            */
/* with node Id equal to 1 requires the use of another mapping.         */
/* -------------------------------------------------------------------- */
int Adjust(int node_number)
{
    if (node_number > 100)
        return _node_number - 100;
    else
        return HOST_INTERFACE;
}

/* -------------------------------------------------------------------- */
/* MATRIXMULT                                                           */
/* Multiplies the sub-blocks of the A and B matrices together           */
/* adds the result to the C sub-block.                                  */
/* When the update is done, the A sub-block is rotated to the           */
/* right neighbor, and the B sub-block to the lower neighbor.           */
/* -------------------------------------------------------------------- */
void MatrixMult()
{
    int      i, j, k, NoSteps;
    Channel  *ToroidalChannels[4];      /* virtual channels to neighbors */
    enum     Directions
             {Left, Up, Right, Down};         /* directions for channels */
    ItemType Temp[n][n];                          /* temporary sub-block */

    /*--- initialize toroidal mesh with virtual channels ---*/
    for (i=0; i<4; i++)
        ToroidalChannels[i] = VChan(6+2*P*P+(_logical_node_number-1)*4+i);

    /*--- clear C sub-block ---*/
    for (i=0; i<n; i++)
        for (j=0; j<n; j++)
            C[i][j] = 0;

    /*--- start the Compute/Rotate process ---*/
    for (NoSteps = 0; NoSteps <P; NoSteps++)
    {
        /*--- Compute step ---*/
        for (i=0; i<n; i++)
            for (j=0; j<n; j++)
                for (k=0; k<n; k++)
                    C[i][j] += A[i][k]*B[k][j];

        /*--- if last step, avoid unnecessary rotation ---*/
        if (NoSteps==P-1)
            break;

        /*--- rotate B down ---*/
        if (Myi%2)
        {
            VChanOut(ToroidalChannels[Down], B, (int) sizeof(B));
            VChanIn(ToroidalChannels[Up], B, (int) sizeof(B));
        }
        else
        {
            VChanIn(ToroidalChannels[Up], Temp, (int) sizeof(B));
            VChanOut(ToroidalChannels[Down], B, (int) sizeof(B));
            memcpy(B, Temp, n*n*sizeof(ItemType));
        }

        /*--- rotate A right ---*/
        if (Myj%2)
        {
            VChanOut(ToroidalChannels[Right],A, (int) sizeof(A));
            VChanIn(ToroidalChannels[Left], A, (int) sizeof(A));
        }
        else
        {
            VChanIn(ToroidalChannels[Left], Temp, (int) sizeof(A));
            VChanOut(ToroidalChannels[Right], A, (int) sizeof(A));
            memcpy(A, Temp, n*n*sizeof(ItemType));
        }
    }
}

/* -------------------------------------------------------------------- */
/* PRINTRESULTS                                                         */
/* Prints the contents of a sub-block.                                  */
/* The information is first printed to a character string               */
/* to prevent lines to be broken when printed concurrently              */
/* with other processors'output.                                        */
/* -------------------------------------------------------------------- */
void PrintResults(ItemType Matrix[n][n])
{
    int i, j;
    char whole[200], line[80];

    ProcWait(_logical_node_number*1000000/64);

    sprintf(whole,"Node %d: C sub-block\n", _logical_node_number);
    for (i = 0; i < n; i++)
    {
        for (j = 0; j < n; j++)
        {
            sprintf(line,"%4d ", Matrix[i][j]);
            strcat(whole,line);
        }
        sprintf(line,"\n");
        strcat(whole,line);
    }
    printf("%s",whole);
}

/* -------------------------------------------------------------------- */
/* GETMATRICES                                                          */
/* The host interface reads the A and B matrices from a text            */
/* file and sends them to the transputers in the toroidal               */
/* mesh.  The virtual channels creating the stars are used              */
/* only in this module.                                                 */
/* -------------------------------------------------------------------- */
void GetMatrices(char *filename2, char *filename3)
{
    /*=============================================================== */
    /*                        HOST_INTERFACE                          */
    /*=============================================================== */
    if (_logical_node_number == HOST_INTERFACE)
    {
        ItemType Matrix[N][N];
        ItemType BlockMatrix[n][n];
        Channel *StarChannels[P][P];
        int i, j, blocki, blockj;

        /*--- initialize star channels, from Host Interface to
              each of the transputers ---*/
        InitStar(StarChannels);

        /*--- get A matrix ---*/
        GetMatrixFromFile(filename2, Matrix);

        /*--- Cut A matrix into sub-blocks ---*/
        for (blocki = 0; blocki < P; blocki++)
            for (blockj = 0; blockj < P; blockj++)
            {
                /*--- copy a section of A into a sub-block ---*/
                for (i = 0; i < n; i++)
                    for (j = 0; j < n; j++)
                        BlockMatrix[i][j] =
                             Matrix[blocki*n + i][blockj*n + j];

                /*--- send that sub-block to T[Blocki+1][k] ---*/
                /*--- where k is given by the indexing function ---*/
                VChanOut(StarChannels[blocki][index(blocki+1, blockj+1)-1],
                     BlockMatrix, (int) sizeof(BlockMatrix));
            }

        /*--- get B matrix ---*/
        GetMatrixFromFile(filename3, Matrix);

        /*--- Cut B matrix into sub-blocks ---*/
        for (blocki = 0; blocki < P; blocki++)
            for (blockj = 0; blockj < P; blockj++)
            {
                /*--- copy section of B into a sub-block ---*/
                for (i = 0; i < n; i++)
                    for (j = 0; j < n; j++)
                        BlockMatrix[i][j] =
                             Matrix[blocki*n + i][blockj*n + j];

                /*--- send it to T[k][Blockj+1] ---*/
                /*--- where k is given by indexing function ---*/
                VChanOut(StarChannels[index(blocki+1, blockj+1)-1][blockj],
                     BlockMatrix, (int) sizeof(BlockMatrix));
            }
    }
    else
    /* =============================================================== */
    /*                       TOROIDAL TRANSPUTER                       */
    /* =============================================================== */
    {
        Channel *FromStarCenter;

        /*--- Initialize virtual channels ---*/
        FromStarCenter = VChan(6 + P*P + _logical_node_number - 1);

        /*--- get A and B matrix from Host Interface ---*/
        VChanIn(FromStarCenter, A, (int) sizeof(A));
        VChanIn(FromStarCenter, B, (int) sizeof(B));
    }
}

/* -------------------------------------------------------------------- */
/* INITSTAR                                                             */
/* Initialize star of virtual channels going from Host I/F              */
/* to all other transputers in the mesh                                 */
/* -------------------------------------------------------------------- */
void InitStar(Channel *StarChannels[P][P])
{
    int i, j;
    int ChannelCount = 6;

    for (i = 0; i < P; i++)
        for (j = 0; j < P; j++, ChannelCount++)
            StarChannels[i][j] = VChan(ChannelCount);
}

/* -------------------------------------------------------------------- */
/* GETMATRIXFROMFILE                                                    */
/* -------------------------------------------------------------------- */
void GetMatrixFromFile(char *filename, ItemType Matrix[N][N])
{
    FILE *DataFile;
    int        i, j;

    DataFile = fopen(filename, "rt");
    if (DataFile == NULL)
    {
        printf("Cannot open file %s\n", filename);
        exit(1);
    }
    for (i = 0; i < N; i++)
    {
        for (j = 0; j < N; j++)
            fscanf(DataFile, ItemFormat, &(Matrix[i][j]));
    }
    fclose(DataFile);
}

Listing 9-1: Code of matrix multiplication program.

Note the indexing function used by the Host Interface transputer to send the sub-blocks to the individual transputers in the mesh:

#define  index(i, j)    (((i+j-2)%P)+1)

When the Host Interface sends a sub-block Aij of the A matrix to Tik, where k is given by the indexing function, it uses the virtual channel to Tik which is stored in StarChannels[i][k]:

/*--- send that sub-block to T[Blocki+1][k]     ---*/
/*--- where k is given by the indexing function ---*/
VChanOut(StarChannels[blocki][index(blocki+1, blockj+1)-1],
         BlockMatrix, (int) sizeof(BlockMatrix));

where i and j represent the indexes of the transputer in the mesh, the upper left transputer being labeled as T11 and the one at the bottom right corner as TNN. Because the matrix indexes start at 1, while the storage indexes start as 0, as is conventional in C, the storage indexes blocki and blockj must be adjusted (incremented by 1) before being used in the index function. In turn, the result from the indexing function must be converted to storage indexing (decremented by 1).

We should note here that the problem of finding good domain decomposition and a schedule of shift/compute cycles to process the individual items in an efficient way is typical of systolic parallel processors, where the processors operate in a synchronized fashion (KUNG88). Although the conceptual difference is minimal between indexing in multiprocessors and in systolic arrays, the high cost of data communication in message-passing systems makes larger block size per processor a must if performance needs to be maintained. This way the computation effort can best match the communication delays. In systolic arrays, however, the power comes from the closeness of the processors, and their tremendous data bandwidth. In this case it is more interesting to keep the data slices very small.


EXERCISES


 
9-4

Using the matrix multiplication algorithm as an example, write a parallel program for the transputer that will find the transitive closure of a graph. The transitive closure of a graph G=(V,E) is defined as the graph G'=(V,E') where there is an edge in E' linking two vertices v1 and v2 of V if there is a path from v1 to v2 in G. When the graph is represented by an adjacency matrix A of dimension N (N is also the number of vertices in the graph), Warshall's serial algorithm computes the adjacency matrix T of the transitive closure as follows:

/*--- A and T are NxN matrices.  A contains the adjacency matrix   ---*/
/*--- of the graph G, and T will contain the transitive closure on ---*/
/*--- completion of the algorithm.                                 ---*/
/*--- copy A to T ---*/
for (i=0; i<N; i++)
        for (j=0; j<N; j++)
                T[i][j] = A[i][j]

/*--- compute T ---*/
for (i=0; i<N; i++)
        for (j=0; j<N; j++)
                if (T[j][i])
                        for (k=0; k<N; k++)
                                if (T[i][k])
                                        T[j][k] = 1;
.

 

 
9-5
The matrix-multiplication program we just explored does not gather the sub-blocks of the C matrix once the computation is over. However, it would be interesting to be able to save the result to a file. Write a program that saves the C matrix to a text file, and make the Host Interface responsible for this action.

 

 
9-6
Investigate Gustafson's observation that the speedup is a function of the size of the problem just as much as it is a function of the number of processors. For a toroidal mesh of fixed size P by P, measure the execution of the matrix multiplication algorithm for matrices of increasing sizes: P by P, 2P by 2P, 3P by 3P, 4P by 4P, 5P by 5P, etc. Verify that for a fixed size mesh, the speedup varies with the size of the problem.

Is there an optimal size matrix that yields a fastest execution time? What elements of the program influence the variation that you observe?


 

 
9-7
Virtual channels provide great flexibility. Some parallel algorithms call for communication patterns that multiprocessor networks sometimes cannot provide. But virtual channels allow any arbitrary logical network to be mapped to a particular physical interconnect. If you have access to a re-configurable transputer machine where different networks can be implemented, study the performance of the matrix multiplication algorithm on the following networks:

  • A chain of transputers,
  • A ring of transputers,
  • A hypercube of transputers (select a dimension higher than 2!), and
  • A square, non toroidal mesh.
For each network, keep the number of processor the same. Rate the performance degradation introduced by the different networks, compared to the performance obtained on a toroidal mesh.

 


9-4 Concluding Remarks

This chapter introduced new dimensions to parallel programming. Selecting a problem and thinking up of a way to solve it in parallel has much more to do than just writing a parallel program. Many issues dealing with the problem itself and the way we represent it will influence our design effort.

The algorithm we select to solve the problem may have some natural decomposition which can lead to a functional decomposition of the application on a particular network topology. At other times, the algorithm may result in a collection of concurrent tasks which may not yield any intuitive functional decomposition. In this case one can use their dependence graph and use a heuristic to schedule them in such a way that the total execution time is (close to) minimal.

Most often, however, it is the symmetry or regularity we find in the data structure implementing the domain of the problem which will lead our design. Here too, the decomposition of the domain and its distribution on the network of processor must be done carefully, so as to minimize storage requirements at the processor nodes and reduce communication. In some mathematical applications, a detailed analysis of the requirements and dynamics of the computation at each node may lead to an optimal assignment of the data to the nodes such that the application can be done as a series of compute/communication phases. In such cases buffering is essential to increase concurrency between the compute and communication phase.

Efficient programs will be the result of a careful balance of programming and design decisions based on these issues. Choosing the right network configuration is also a dimension that will bear a strong influence on the overall performance of the program.

In some cases, though, the problem might be simple to formulate but this formulation might be associated with a dynamic data domain lacking any symmetry or shape that we can utilize efficiently, preventing us from implementing any of the approaches developed in this chapter. In such cases it is essential to maintain an even balance of work among the processors. This is the subject of the next chapter: load balancing.




[Previous] [HOME] [NEXT]