Parallel Programming in C for the Transputer
© D. Thiébaut, 1995



4-4 Adding a relay node

We continue this chapter with a closing example that introduces an important function that is omnipresent in any message-passing multiprocessor system: That of relaying information, or messages, from neighbors on one side of the network to neighbors on another side. Relaying information is important in a message-passing environment. In all networks only one transputer in the network is attached to the host, and the initial distribution of data or the collection of results will require nodes deep in the network to exchange information with the root transputer attached to the host. There is no choice but for this information to be relayed by other nodes. To illustrate this concept, we use our same prime-finding algorithm, but this time Node 1 and 3 are computing the primes, while Node 2 acts as a relay between them.


main function

Using our previous top-down approach, we can organize main as a switch starting one of three possible tasks.

main()
{
switch (_node_number)
{
case 1: Task1();
exit(0);
case 2: Task2() ;
break;
case 3: Task3();
}
}
The top-down approach here will allow us to modularize our program nicely. We are still using the farm programming approach, and all three transputers receive the same copy of the program. Different sections of the program are dedicated to different transputers.

We start our analysis of the program with the farthest node, Node 3, which executes Task 3.

Task3

Node 3 does exactly what Node 2 was doing in our previous examples. Task3 is hence a simple encapsulation of the code we saw before:

void Task3(void)
{
     int x, j;

     /*--- scan interval ---*/
     for (x = INTERVAL1; x<INTERVAL2; x++)
     {
          /*--- if prime then send number to Node 1 ---*/
          if (IsPrime(x))
               ChanOut(LINK0OUT, &x, INTSIZE);
     }

     /*--- signal Node 2 and Node 1 that we are done ---*/
     x = SENTINEL;
     ChanOut(LINK0OUT, &x, INTSIZE);
}

Task3 does not store the primes internally but sends them as soon as they are found. It thus blocks every times it has found a new prime.

Task3 sends its prime to Task2 running on Node 2 over LINK0OUT. Task2 receives the numbers on its LINK1IN channel.

Task2

The job of Task2 is to get integer primes sent by Node 3 and to relay them to Node 1. Hence, each ChanIn operation is associated to a ChanOut operation. A first (hasty) solution would be to understand this as "each ChanIn operation is followed by a ChanOut operation", and code the task as follows:

/*--- hasty implementation of Task2's inner loop ---*/
do
{
ChanIn(LINK1IN, (char *) &x, INTSIZE);
ChanOut(LINK0OUT, (char *) &x, INTSIZE);
} while (x != SENTINEL);
This implementation of Task2 would work, but it would be inefficient. The reason is that Node 3 could (and does) generate primes faster than Node 1 can accept them, and because Node 2 uses a loop containing two blocking functions for receiving and sending information between the two remote tasks, it runs only as fast as the slowest of its two counterparts, which in our case is Task1. So this type of relay is inefficient in that it does not decouple the input from the output. The slow task (Task1) slows down the faster one (Task3) through the relaying implemented by Task2.

Using a buffer

The solution is to introduce a buffer in Node 2 to decouple the two channels. A queue is an ideal candidate to implement the buffer. The algorithm is simple

while (not done)
{
wait for either Node 1 or Node 3 to request a transfer;
if (Node 1 is ready)
{
Dequeue a prime;
Send it to Node 1;
If the integer sent was the sentinel we're done;
}
if (Node 3 is ready)
{
Get a prime;
Enqueue it;
}
}
The outer loop repeats until the SENTINEL is received from Node 3 and passed on to Node 1. Task2 can either send a prime to Task1, or receive a prime from Task3. It must thus monitor both, so that it can respond to the first one requesting a transfer. This is the function of ProcAlt which is given a list of channels to monitor. It would be tempting to have Task2 do this:

ProcAlt(LINK0OUT, LINK1IN, NULL);

But this won't work. Can you see why?

The reason is that ProcAlt can only monitor input channels[1]. Therefore, we must use a trick here, and make Task1 send Task2 a short message whenever it is ready to receive a prime from Task2. Because we do not want Task1 to have to wait on Task2 if its queue is empty, Task2 will send "pad" integers (fake primes) back if it cannot send a real prime. This way Task1 will always be guarantied to receive something within a short period of time, so that it is not slowed down too much.

void Task2(void)
{
     QueueType Q
     int x;
     int Done = 0;
     char dummy;              /* used to get "ready" signal from Node 1 */

     while (!Done)
     {
          /*--- find out which node is ready first (give priority to
          Node 3, as Node 1 will send many signals ---*/
          index = ProcAlt(LINK1IN, LINK0IN, 0);
          switch (index)
          {
               case  0 : /* LINK1IN from Node 3 */
                         ChanIn(LINK1IN, &x, INTSIZE);
                         Enqueue(&Q,x);
                         break;
               case  1 : /* LINK0IN from Node 1 */
                         ChanIn(LINK0IN, &dummy, 1);
                         if (Empty(&Q))
                              x = PAD;  /* tell Node 1 we have nothing */
                         else
                              x = Dequeue(&Q);
                         /*--- send integer or pad ---*/
                         ChanOut(LINK0OUT, &x, INTSIZE);
                         Done = (x==SENTINEL);
                         break;
          }
     }
}

QueueType, Enqueue(), Dequeue(), and Empty() define a queue data-type and some of its standard operators. The program prime_v3.c on the distribution disk contains an implementation with a circular buffer implemented with an integer array. An alternate implementation with linked lists could have been possible [SEDG83].

Task1()

The code executed by Node 1 is almost the same as in prime_v1.c, with one exception. In the previous 2-node version of this program, Node 2 was blocking on a ChanOut every time it found a prime. It was easy for Node 1 to detect such a condition with a ProcSkipAlt. In our new 3-node version, however, Node 2 is not blocking anymore when it has a prime in its queue. Instead, it waits for either Task1 or Task2 to show some activity on the channels. Therefore, we cannot keep the ProcSkipAlt statement in Task1, otherwise Task1 and Task2 would never talk to each other! Take a second to see why. If none of them commit to block on the channel connecting them, then no communication can occur. The situation is similar to two people anxious to talk to each other sitting patiently by their telephone and waiting for it to ring. The solution in this case is for Task1 to assert its readiness to accept an integer by sending a dummy byte. This way Task2 can safely use a ProcAlt statement to monitor Nodes 1 and 3.

void Task1(void)
{
     int x, j, y = 0, NoPrimes = 0;
     char dummy;           /* used to tell Node 2 "ready to receive" */

     /*--- Compute, print, and relay primes ---*/
     for (x = 1; x<INTERVAL1; x++)
     {
          /*--- if x is prime then print it ---*/
          if (IsPrime(x))
          {
               printf("%8d", x);
               NoPrimes++;
          }
          /*--- if Node 2 already done, continue ---*/
          if (y==SENTINEL) continue;

          /*--- otherwise tell it we're ready to receive ---*/

          ChanOut(LINK1OUT, &dummy, 1);/* send a dummy byte */

          ChanIn(LINK1IN, &y, INTSIZE);
          if ((y!=SENTINEL) && (y!=PAD))
          {
               printf("%8d", y);
               NoPrimes++;
          }
     }

     /*--- Keep relaying primes if Node 2 isn't done yet ---*/
     while ((y!=SENTINEL))
     {

          ChanOut(LINK1OUT, &dummy, 1);
          ChanIn(LINK1IN, &y, INTSIZE);
          if ((y!=SENTINEL) && (y!=PAD))
          {
               printf("%8d", y);
               NoPrimes++;
          }
     }
     printf("\nReceived %d primes\n", NoPrimes);
}

Notice that every time Node 1 is ready to receive a prime from Node 2, it sends the char variable dummy to indicate its readiness. Because Node 2 may not have a prime in its queue, it may return a pad integer, and Task1 must test for this condition, and discard them if necessary.

The second point of interest is that in the final while loop, Task1 must still use the sending of dummy since Node 2 can only send a prime when it receives this signal from Node 1. Because Task1 in that case would start sending a lot of dummy bytes, Task2 uses LINK1IN first in its list of arguments in ProcAlt, so that it will choose to receive a prime from Node 3 before receiving a dummy signal from Node 1.

The full code of prime_v3.c is now shown below.

/* =======================================================================
   prime_v3.c

   DESCRIPTION:
   Compute primes on 3 nodes.

   Node 3 computes the primes and sends them to Node 2.
   Node 2 enqueues the primes received from Node 2 in a queue and releases
        them to Node 1 when ready.
   Node 1 (Root) computes primes, prints them, and prints any numbers
        received from Node 2.

   TO COMPILE AND RUN:

   make -f prime_v3
   chainnif -# 3 -1 prime_v3 -nif prime_v3.nif
   ld-net prime_v3

   ASSOCIATED NIF-FILE

   buffer_size     200;
   host_server     CIO.EXE;
   level_timeout   400;
   decode_timeout  2000;
     1, test, R0,  0     ,   2[0] ,          ,         ;
     2, test, R1,  1[1]  ,   3[0] ,          ,         ;
     3, test, R2,  2[1]  ,        ,          ,         ;

 ====================================================================== */
#include <stdio.h>
#include <stdlib.h>
#include <conc.h>       /* transputer library */

/* =========================== DEFINITIONS ============================ */
#define INTSIZE   ((int) sizeof(int))
#define INTERVAL1 1000                   /* bound for primes for Node 1 */
#define INTERVAL2 2000                   /* bound for primes for Node 3 */
#define QUEUESIZE 1000                 /* maximum queue size for Node 2 */
#define SENTINEL  -1                 /* sent by Node 3 to indicate done */
#define PAD       -2                 /* sent by Node 2 when queue empty */

/* =========================== PROTOTYPES ============================= */
void Task1(void);
void Task2(void);
void Task3(void);
int IsPrime(int x);

/* -------------------------------------------------------------------- */
/*                                 MAIN                                 */
/* -------------------------------------------------------------------- */
main(void)
{
     switch (_node_number)
     {
          case 1: Task1();
                  exit(0);
                  break;
          case 2: Task2();
                  break;
          case 3: Task3();
     }
}

/* -------------------------------------------------------------------- */
/* TASK1                                                                */
/* Runs on Root transputer.  Computes primes between 1 and              */
/* INTERVAL1 and prints them.  Prints primes received from Node 2       */
/* -------------------------------------------------------------------- */
void Task1(void)
{
     int x, j, y = 0, NoPrimes = 0;
     char dummy;              /* used to tell Node 2 "ready to receive" */

     /*--- Compute, print, and relay primes ---*/
     for (x = 1; x<INTERVAL1; x++)
     {
          /*--- if x is prime then print it ---*/
          if (IsPrime(x))
          {
               printf("%8d", x);
               NoPrimes++;
          }
          /*--- if Node 2 already done, continue ---*/
          if (y==SENTINEL) continue;

          /*--- otherwise tell it we're ready to receive ---*/
          ChanOut(LINK1OUT, &dummy, 1);/* send a dummy byte */
          ChanIn(LINK1IN, &y, INTSIZE);
          if ((y!=SENTINEL) && (y!=PAD))
          {
               printf("%8d", y);
               NoPrimes++;
          }
     }

     /*--- Keep relaying primes if Node 2 isn't done yet ---*/
     while ((y!=SENTINEL))
     {
          ChanOut(LINK1OUT, &dummy, 1);
          ChanIn(LINK1IN, &y, INTSIZE);
          if ((y!=SENTINEL) && (y!=PAD))
          {
               printf("%8d", y);
               NoPrimes++;
          }
     }
     printf("\nReceived %d primes\n", NoPrimes);
}

/* -------------------------------------------------------------------- */
/* TASK 2                                                               */
/* Relay integers from Node 3 to Node 2.  Stops when the SENTINEL       */
/* is received.                                                         */
/* -------------------------------------------------------------------- */
void Task2(void)
{
     int Q[QUEUESIZE], x, index, Front = 0;
     int Rear = 0, Empty = 1, Done = 0;
     char dummy;              /* used to get "ready" signal from Node 1 */

     while (!Done)
     {
          /*--- find out which node is ready first (give priority to
          Node 3, as Node 1 will send many signals ---*/
          index = ProcAlt(LINK1IN, LINK0IN, 0);
          switch (index)
          {
               case  0 : /* LINK1IN from Node 3 */
                         ChanIn(LINK1IN, &x, INTSIZE);
                         Q[Front] = x;
                         Front = (Front+1)%QUEUESIZE;
                         Empty = 0;
                         break;
               case  1 : /* LINK0IN from Node 1 */
                         ChanIn(LINK0IN, &dummy, 1);
                         if (Empty)
                              x = PAD; /* tell Node 1 we have nothing */
                         else
                         {
                              /*--- dequeue an integer from Node 3 ---*/
                              x = Q[Rear];
                              Rear = (Rear+1)%QUEUESIZE;
                              if (Rear==Front)
                              {
                                   Rear = 0;
                                   Front = 0;
                                   Empty = 1;
                              }
                         }
                         /*--- send integer or pad ---*/
                         ChanOut(LINK0OUT, &x, INTSIZE);
                         Done = (x==SENTINEL);
                         break;
          }
     }
}

/* -------------------------------------------------------------------- */
/* TASK 3                                                               */
/* Runs on Node 3.  Generates all the primes between INTERVAL1 and      */
/* INTERVAL2 and sends them to Node 2.                                  */
/* -------------------------------------------------------------------- */
void Task3(void)
{
     int x, j;

     /*--- scan interval ---*/
     for (x = INTERVAL1; x<INTERVAL2; x++)
     {
          /*--- if prime then send number to Node 1 ---*/
          if (IsPrime(x))
               ChanOut(LINK0OUT, &x, INTSIZE);
     }

     /*--- signal Node 2 and Node 1 that we are done ---*/
     x = SENTINEL;
     ChanOut(LINK0OUT, &x, INTSIZE);
}


/* -------------------------------------------------------------------- */
/* ISPRIME                                                              */
/* Semi-efficient prime finding function which, given x, returns 1 if   */
/* it is prime, and 0 otherwise                                         */
/* -------------------------------------------------------------------- */
int IsPrime(int x)
{
     int i;                    /*  0 1 2 3 4 5 6 7 8 9 */
     static int SmallPrimes[10] = {0,0,1,1,0,1,0,1,0,0};
     if (x<10) return SmallPrimes[x];
     if (x%2==0) return 0;
     if (x%3==0) return 0;
     if (x%5==0) return 0;
     for (i = 2; i*i<=x; i++)
          if (x%i==0) return 0;
     return 1;
}

Listing 4-3: Listing of prime_v3.c

The output of the program is shown below. Note that it is slightly different from the output of the previous program running on two transputers, even though we didn't modify the code running on the two "outside" tasks.

     101       2     103       3     107     109       5     113     127
       7     131     137     139     149      11     151     157      13
     163     167     173     179      17     181     191      19     193
     197     199      23      29      31      37      41      43      47
      53      59      61      67      71      73      79      83      89
      97
Received 46 primes

This output is the same as the one generated by prime_v2.c, even though the buffer should be decoupling the two nodes. But the advantage of the buffering here is negated by the need to have Task1 send a dummy byte to Task2 every time it is ready. There are more efficient solutions, but they involve implementing concurrent tasks in charge of receiving primes and enqueueing them, and in charge of dequeueing them and sending them out. The material covered in the next chapter will help us do this.


EXERCISES


 
4-6

A note on granularity. If you analyze the program prime_v2.c (as well as prime_v3.c) with a critical eye for efficiency and performance, you may question whether we made the right choice in interrupting the prime-generating loops every time a piece of information was ready to be transferred. After all, why not have the tasks on the two extreme nodes generate the primes in an uninterrupted loop that stores them in an array, and have the communication and printing take place after the loops are done? This is an idea worth considering for several reasons. The first advantage is that it results in much simpler code for Task1 and Task2. The algorithms for Task1 and Task2 become:

void Task1(void)
{
generate primes less than 100 and store them in an array;
print primes found;
get primes from Node 2 in the form of an array;
prin t contents of array;
}

void Task2(void)
{
generate primes in [100,200] and store them in an array;
send the array to Task1;
}

The second advantage is that both prime-generating loops run faster since we have removed the communication from the loop. Also, transferring the primes as an array of integers rather than one integer at a time removes some of the overhead involved in establishing the communication, and therefore the total time spent in the communication is lower. But do these two improvements result in a global performance increase? Not necessarily. By exchanging huge chunks of information, we have serialized part of the computation that executed in parallel before. What we effectively accomplish is a modification of the granularity of the communication component of our algorithm. Transferring one integer at a time corresponds to an extremely fine granularity, while transferring all the integers encapsulated in an array results in a coarse granularity of communication. In the context of our example, where the two computing tasks find the primes in an interval of 100 integers, both granularity may yield comparable performance. But increasing the size of the intervals to, say, 10000, or 100000, may show a dramatic difference between the two. Most likely, a granularity level between the two extremes will yield the best performance. Understanding the effect of such an issue is paramount, and we will devote a full chapter, Chapter 8, later on to performance issues.
For right now, let's conclude by stating that although the coarser granularity approach would have yielded a simpler program, the approach presented here is more general. In practice, we'll find that the best performance is achieved when the whole data set is transferred in blocks, rather than as a whole, allowing increased parallelism between the tasks.

Implement the two-transputer program described above, where each node first computes the primes, stores them in an array, and then disposes of the array. Make sure the communication between the two nodes is efficient. For example, do not force Node 2 to send an array of 100 integers if it has only found 17 primes!

Time the execution of your program and compare it to that of prime_v2.c or prime_v3.c.


 

 
4-7
In a manner inspired by the ideas exposed in Exercise 4-6, modify prime_v2.c so that Task2 waits until it has two primes before it sends them to Task1. Time the execution of your program and compare it to the execution time of the original prime_v2.c program. (You may want to increase the size of the intervals tested by the two transputers.)

Same question for packets of 4 primes, 8 primes, 16 primes. Compare graphically the execution times of all 5 programs.

Hints: At the end of the computation, Task2 may not have exactly the right number of primes to fill a packet. Although it is less efficient a solution, you can still make Task2 send a whole packet at the end, by padding the empty part of the packet with appropriate integers.


 

 
4-8
Write a 3-transputer program where each transputer computes primes. Node 1 will be given the interval [1,99], Node 2 the interval [100,199], and Node 3 the interval [200,299]. You are free to implement the computation at the highest or coarsest level of granularity. Time the execution of your program against that of prime_v2.c or prime_v3.c when they are given the same range of integers to test. Try to make your code efficient so that you get a faster execution time for your program!


 

 
4-9
Write the code of a snooping node. A snooping node is a node that acts as a relay between two other nodes, relaying messages back and forth, without modifying them or altering them in any way, but that records the whole conversation in an internal buffer. This type of node is useful for debugging applications for which the code is not available. For example, one could use a snooping node to find out the protocol used by the root and the host machine during the activation of Input/Output functions by the root transputer. In this situation, one would make Node 1 the snooping node, and would port the code normally running on the root to Node 2. Node 2 would then issue printf and scanf statements to what it think is the host, but they would in fact be directed to the snooping node. This one would pass on the information to the host, while making a copy of the message. The host, responding to what it thinks is the root requesting an I/O operation, would send information back, which, you will have guessed, will be relayed and recorded by the snooping node.

The snooping node can be programmed to output its buffer after a predetermined amount of time has elapsed, or after some predetermined number of packets have been exchanged between Node 2 and the host.


 


4-5 Concluding Remarks

This chapter introduced us to the communication primitives of Logical Systems' parallel library. From now on, they will probably be the functions appearing most often in our programs. Although we used ChanIn and ChanOut exclusively, ChanInInt, ChanOutInt, ChanInChar and ChanOutChar could have been used just as easily, providing an even cleaner and leaner code (you may give as an exercise to yourself the task of modifying the three programs we implemented in this chapter by using the most compact form of ChanIn and Chanout).

Because ChanIn and ChanOut are blocking, once a task commits itself to either one of these function, a transfer must occur at some time, otherwise the task will never be able to resume its execution. In cases when several links might bring in information in a node, an alternation function can be used to find out the first link showing information ready to arrive. ProcAlt is an alternation function which monitors input channels. It blocks the task calls it, and returns only when incoming data have been detected. For cases where we simply need to monitor the status of input channels without block, ProcSkipAlt can be used.

The last example we covered in the chapter shows an important feature that we will often find in parallel programs. This feature is the required relaying of information from one transputer to another via their common neighbors. Relays will in fact be such common modules that they will be one of the first tasks we will create in most of our parallel programs. The example we saw here shows only a crude implementation of a relay. When we learn how to implement multitasking in the next two chapters, we will have the tools to implement simple yet sophisticated relay tasks.




[Previous] [HOME] [NEXT]