The top-down approach here will allow us to modularize our program nicely. We are still using the farm programming approach, and all three transputers receive the same copy of the program. Different sections of the program are dedicated to different transputers.
main()
{
switch (_node_number)
{
case 1: Task1();
exit(0);
case 2: Task2() ;
break;
case 3: Task3();
}
}
We start our analysis of the program with the farthest node, Node 3, which executes Task 3.
void Task3(void)
{
int x, j;
/*--- scan interval ---*/
for (x = INTERVAL1; x<INTERVAL2; x++)
{
/*--- if prime then send number to Node 1 ---*/
if (IsPrime(x))
ChanOut(LINK0OUT, &x, INTSIZE);
}
/*--- signal Node 2 and Node 1 that we are done ---*/
x = SENTINEL;
ChanOut(LINK0OUT, &x, INTSIZE);
}
Task3 does not store the primes internally but sends them as soon as they are found. It thus blocks every times it has found a new prime.
Task3 sends its prime to Task2 running on Node 2 over LINK0OUT. Task2 receives the numbers on its LINK1IN channel.
/*--- hasty implementation of Task2's inner loop ---*/This implementation of Task2 would work, but it would be inefficient. The reason is that Node 3 could (and does) generate primes faster than Node 1 can accept them, and because Node 2 uses a loop containing two blocking functions for receiving and sending information between the two remote tasks, it runs only as fast as the slowest of its two counterparts, which in our case is Task1. So this type of relay is inefficient in that it does not decouple the input from the output. The slow task (Task1) slows down the faster one (Task3) through the relaying implemented by Task2.
do
{
ChanIn(LINK1IN, (char *) &x, INTSIZE);
ChanOut(LINK0OUT, (char *) &x, INTSIZE);
} while (x != SENTINEL);
while (not done)
{
wait for either Node 1 or Node 3 to request a transfer;
if (Node 1 is ready)
{
Dequeue a prime;
Send it to Node 1;
If the integer sent was the sentinel we're done;
}
if (Node 3 is ready)
{
Get a prime;
Enqueue it;
}
}
The
outer loop repeats until the SENTINEL is received from Node 3 and passed on to
Node 1. Task2 can either send a prime to Task1, or receive a prime from Task3.
It must thus monitor both, so that it can respond to the first one requesting a
transfer. This is the function of ProcAlt which is given a list of channels to
monitor. It would be tempting to have Task2 do this:
ProcAlt(LINK0OUT, LINK1IN, NULL);
But this won't work. Can you see why?
The reason is that ProcAlt can only monitor input channels[1]. Therefore, we must use a trick here, and make Task1 send Task2 a short message whenever it is ready to receive a prime from Task2. Because we do not want Task1 to have to wait on Task2 if its queue is empty, Task2 will send "pad" integers (fake primes) back if it cannot send a real prime. This way Task1 will always be guarantied to receive something within a short period of time, so that it is not slowed down too much.
void Task2(void)
{
QueueType Q
int x;
int Done = 0;
char dummy; /* used to get "ready" signal from Node 1 */
while (!Done)
{
/*--- find out which node is ready first (give priority to
Node 3, as Node 1 will send many signals ---*/
index = ProcAlt(LINK1IN, LINK0IN, 0);
switch (index)
{
case 0 : /* LINK1IN from Node 3 */
ChanIn(LINK1IN, &x, INTSIZE);
Enqueue(&Q,x);
break;
case 1 : /* LINK0IN from Node 1 */
ChanIn(LINK0IN, &dummy, 1);
if (Empty(&Q))
x = PAD; /* tell Node 1 we have nothing */
else
x = Dequeue(&Q);
/*--- send integer or pad ---*/
ChanOut(LINK0OUT, &x, INTSIZE);
Done = (x==SENTINEL);
break;
}
}
}
QueueType, Enqueue(), Dequeue(), and Empty() define a queue data-type and some of its standard operators. The program prime_v3.c on the distribution disk contains an implementation with a circular buffer implemented with an integer array. An alternate implementation with linked lists could have been possible [SEDG83].
void Task1(void)
{
int x, j, y = 0, NoPrimes = 0;
char dummy; /* used to tell Node 2 "ready to receive" */
/*--- Compute, print, and relay primes ---*/
for (x = 1; x<INTERVAL1; x++)
{
/*--- if x is prime then print it ---*/
if (IsPrime(x))
{
printf("%8d", x);
NoPrimes++;
}
/*--- if Node 2 already done, continue ---*/
if (y==SENTINEL) continue;
/*--- otherwise tell it we're ready to receive ---*/
ChanOut(LINK1OUT, &dummy, 1);/* send a dummy byte */
ChanIn(LINK1IN, &y, INTSIZE);
if ((y!=SENTINEL) && (y!=PAD))
{
printf("%8d", y);
NoPrimes++;
}
}
/*--- Keep relaying primes if Node 2 isn't done yet ---*/
while ((y!=SENTINEL))
{
ChanOut(LINK1OUT, &dummy, 1);
ChanIn(LINK1IN, &y, INTSIZE);
if ((y!=SENTINEL) && (y!=PAD))
{
printf("%8d", y);
NoPrimes++;
}
}
printf("\nReceived %d primes\n", NoPrimes);
}
Notice that every time Node 1 is ready to receive a prime from Node 2, it sends the char variable dummy to indicate its readiness. Because Node 2 may not have a prime in its queue, it may return a pad integer, and Task1 must test for this condition, and discard them if necessary.
The second point of interest is that in the final while loop, Task1 must still use the sending of dummy since Node 2 can only send a prime when it receives this signal from Node 1. Because Task1 in that case would start sending a lot of dummy bytes, Task2 uses LINK1IN first in its list of arguments in ProcAlt, so that it will choose to receive a prime from Node 3 before receiving a dummy signal from Node 1.
The full code of prime_v3.c is now shown below.
/* =======================================================================
prime_v3.c
DESCRIPTION:
Compute primes on 3 nodes.
Node 3 computes the primes and sends them to Node 2.
Node 2 enqueues the primes received from Node 2 in a queue and releases
them to Node 1 when ready.
Node 1 (Root) computes primes, prints them, and prints any numbers
received from Node 2.
TO COMPILE AND RUN:
make -f prime_v3
chainnif -# 3 -1 prime_v3 -nif prime_v3.nif
ld-net prime_v3
ASSOCIATED NIF-FILE
buffer_size 200;
host_server CIO.EXE;
level_timeout 400;
decode_timeout 2000;
1, test, R0, 0 , 2[0] , , ;
2, test, R1, 1[1] , 3[0] , , ;
3, test, R2, 2[1] , , , ;
====================================================================== */
#include <stdio.h>
#include <stdlib.h>
#include <conc.h> /* transputer library */
/* =========================== DEFINITIONS ============================ */
#define INTSIZE ((int) sizeof(int))
#define INTERVAL1 1000 /* bound for primes for Node 1 */
#define INTERVAL2 2000 /* bound for primes for Node 3 */
#define QUEUESIZE 1000 /* maximum queue size for Node 2 */
#define SENTINEL -1 /* sent by Node 3 to indicate done */
#define PAD -2 /* sent by Node 2 when queue empty */
/* =========================== PROTOTYPES ============================= */
void Task1(void);
void Task2(void);
void Task3(void);
int IsPrime(int x);
/* -------------------------------------------------------------------- */
/* MAIN */
/* -------------------------------------------------------------------- */
main(void)
{
switch (_node_number)
{
case 1: Task1();
exit(0);
break;
case 2: Task2();
break;
case 3: Task3();
}
}
/* -------------------------------------------------------------------- */
/* TASK1 */
/* Runs on Root transputer. Computes primes between 1 and */
/* INTERVAL1 and prints them. Prints primes received from Node 2 */
/* -------------------------------------------------------------------- */
void Task1(void)
{
int x, j, y = 0, NoPrimes = 0;
char dummy; /* used to tell Node 2 "ready to receive" */
/*--- Compute, print, and relay primes ---*/
for (x = 1; x<INTERVAL1; x++)
{
/*--- if x is prime then print it ---*/
if (IsPrime(x))
{
printf("%8d", x);
NoPrimes++;
}
/*--- if Node 2 already done, continue ---*/
if (y==SENTINEL) continue;
/*--- otherwise tell it we're ready to receive ---*/
ChanOut(LINK1OUT, &dummy, 1);/* send a dummy byte */
ChanIn(LINK1IN, &y, INTSIZE);
if ((y!=SENTINEL) && (y!=PAD))
{
printf("%8d", y);
NoPrimes++;
}
}
/*--- Keep relaying primes if Node 2 isn't done yet ---*/
while ((y!=SENTINEL))
{
ChanOut(LINK1OUT, &dummy, 1);
ChanIn(LINK1IN, &y, INTSIZE);
if ((y!=SENTINEL) && (y!=PAD))
{
printf("%8d", y);
NoPrimes++;
}
}
printf("\nReceived %d primes\n", NoPrimes);
}
/* -------------------------------------------------------------------- */
/* TASK 2 */
/* Relay integers from Node 3 to Node 2. Stops when the SENTINEL */
/* is received. */
/* -------------------------------------------------------------------- */
void Task2(void)
{
int Q[QUEUESIZE], x, index, Front = 0;
int Rear = 0, Empty = 1, Done = 0;
char dummy; /* used to get "ready" signal from Node 1 */
while (!Done)
{
/*--- find out which node is ready first (give priority to
Node 3, as Node 1 will send many signals ---*/
index = ProcAlt(LINK1IN, LINK0IN, 0);
switch (index)
{
case 0 : /* LINK1IN from Node 3 */
ChanIn(LINK1IN, &x, INTSIZE);
Q[Front] = x;
Front = (Front+1)%QUEUESIZE;
Empty = 0;
break;
case 1 : /* LINK0IN from Node 1 */
ChanIn(LINK0IN, &dummy, 1);
if (Empty)
x = PAD; /* tell Node 1 we have nothing */
else
{
/*--- dequeue an integer from Node 3 ---*/
x = Q[Rear];
Rear = (Rear+1)%QUEUESIZE;
if (Rear==Front)
{
Rear = 0;
Front = 0;
Empty = 1;
}
}
/*--- send integer or pad ---*/
ChanOut(LINK0OUT, &x, INTSIZE);
Done = (x==SENTINEL);
break;
}
}
}
/* -------------------------------------------------------------------- */
/* TASK 3 */
/* Runs on Node 3. Generates all the primes between INTERVAL1 and */
/* INTERVAL2 and sends them to Node 2. */
/* -------------------------------------------------------------------- */
void Task3(void)
{
int x, j;
/*--- scan interval ---*/
for (x = INTERVAL1; x<INTERVAL2; x++)
{
/*--- if prime then send number to Node 1 ---*/
if (IsPrime(x))
ChanOut(LINK0OUT, &x, INTSIZE);
}
/*--- signal Node 2 and Node 1 that we are done ---*/
x = SENTINEL;
ChanOut(LINK0OUT, &x, INTSIZE);
}
/* -------------------------------------------------------------------- */
/* ISPRIME */
/* Semi-efficient prime finding function which, given x, returns 1 if */
/* it is prime, and 0 otherwise */
/* -------------------------------------------------------------------- */
int IsPrime(int x)
{
int i; /* 0 1 2 3 4 5 6 7 8 9 */
static int SmallPrimes[10] = {0,0,1,1,0,1,0,1,0,0};
if (x<10) return SmallPrimes[x];
if (x%2==0) return 0;
if (x%3==0) return 0;
if (x%5==0) return 0;
for (i = 2; i*i<=x; i++)
if (x%i==0) return 0;
return 1;
}
The output of the program is shown below. Note that it is slightly different from the output of the previous program running on two transputers, even though we didn't modify the code running on the two "outside" tasks.
101 2 103 3 107 109 5 113 127
7 131 137 139 149 11 151 157 13
163 167 173 179 17 181 191 19 193
197 199 23 29 31 37 41 43 47
53 59 61 67 71 73 79 83 89
97
Received 46 primes
This output is the same as the one generated by prime_v2.c, even though the buffer should be decoupling the two nodes. But the advantage of the buffering here is negated by the need to have Task1 send a dummy byte to Task2 every time it is ready. There are more efficient solutions, but they involve implementing concurrent tasks in charge of receiving primes and enqueueing them, and in charge of dequeueing them and sending them out. The material covered in the next chapter will help us do this.
| 4-6 |
A note on granularity. If you analyze the program prime_v2.c (as well as prime_v3.c) with a critical eye for efficiency and performance, you may question whether we made the right choice in interrupting the prime-generating loops every time a piece of information was ready to be transferred. After all, why not have the tasks on the two extreme nodes generate the primes in an uninterrupted loop that stores them in an array, and have the communication and printing take place after the loops are done? This is an idea worth considering for several reasons. The first advantage is that it results in much simpler code for Task1 and Task2. The algorithms for Task1 and Task2 become:
void Task1(void)
The second advantage is that both prime-generating loops run faster since we
have removed the communication from the loop. Also, transferring the primes as
an array of integers rather than one integer at a time removes some of the
overhead involved in establishing the communication, and therefore the total
time spent in the communication is lower. But do these two improvements result
in a global performance increase? Not necessarily. By exchanging huge chunks
of information, we have serialized part of the computation that executed in
parallel before. What we effectively accomplish is a modification of the
granularity of the communication component of our algorithm. Transferring one
integer at a time corresponds to an extremely fine granularity, while
transferring all the integers encapsulated in an array results in a coarse
granularity of communication. In the context of our example, where the two
computing tasks find the primes in an interval of 100 integers, both
granularity may yield comparable performance. But increasing the size of the
intervals to, say, 10000, or 100000, may show a dramatic difference between the
two. Most likely, a granularity level between the two extremes will yield the
best performance. Understanding the effect of such an issue is paramount, and
we will devote a full chapter, Chapter 8, later on to performance issues.
Implement the two-transputer program described above, where each node first computes the primes, stores them in an array, and then disposes of the array. Make sure the communication between the two nodes is efficient. For example, do not force Node 2 to send an array of 100 integers if it has only found 17 primes! Time the execution of your program and compare it to that of prime_v2.c or prime_v3.c.
|
| 4-7 |
In a manner inspired by the ideas exposed in Exercise 4-6, modify
prime_v2.c so that Task2 waits until it has two primes before it sends them to
Task1. Time the execution of your program and compare it to the execution time
of the original prime_v2.c program. (You may want to increase the size of the
intervals tested by the two transputers.) Same question for packets of 4 primes, 8 primes, 16 primes. Compare graphically the execution times of all 5 programs. Hints: At the end of the computation, Task2 may not have exactly the right number of primes to fill a packet. Although it is less efficient a solution, you can still make Task2 send a whole packet at the end, by padding the empty part of the packet with appropriate integers.
|
| 4-8 |
Write a 3-transputer program where each transputer computes primes.
Node 1 will be given the interval [1,99], Node 2 the interval [100,199], and
Node 3 the interval [200,299]. You are free to implement the computation at
the highest or coarsest level of granularity. Time the execution of your
program against that of prime_v2.c or prime_v3.c when they are given the same
range of integers to test. Try to make your code efficient so that you get a
faster execution time for your program!
|
| 4-9 |
Write the code of a snooping node. A snooping node is a node
that acts as a relay between two other nodes, relaying messages back and forth,
without modifying them or altering them in any way, but that records the whole
conversation in an internal buffer. This type of node is useful for debugging
applications for which the code is not available. For example, one could use a
snooping node to find out the protocol used by the root and the host machine
during the activation of Input/Output functions by the root transputer. In
this situation, one would make Node 1 the snooping node, and would port the
code normally running on the root to Node 2. Node 2 would then issue printf
and scanf statements to what it think is the host, but they would in fact be
directed to the snooping node. This one would pass on the information to the
host, while making a copy of the message. The host, responding to what it
thinks is the root requesting an I/O operation, would send information back,
which, you will have guessed, will be relayed and recorded by the snooping
node. The snooping node can be programmed to output its buffer after a predetermined amount of time has elapsed, or after some predetermined number of packets have been exchanged between Node 2 and the host.
|
Because ChanIn and ChanOut are blocking, once a task commits itself to either one of these function, a transfer must occur at some time, otherwise the task will never be able to resume its execution. In cases when several links might bring in information in a node, an alternation function can be used to find out the first link showing information ready to arrive. ProcAlt is an alternation function which monitors input channels. It blocks the task calls it, and returns only when incoming data have been detected. For cases where we simply need to monitor the status of input channels without block, ProcSkipAlt can be used.
The last example we covered in the chapter shows an important feature that we will often find in parallel programs. This feature is the required relaying of information from one transputer to another via their common neighbors. Relays will in fact be such common modules that they will be one of the first tasks we will create in most of our parallel programs. The example we saw here shows only a crude implementation of a relay. When we learn how to implement multitasking in the next two chapters, we will have the tools to implement simple yet sophisticated relay tasks.