
          The Active Message Interface for Shared Memory Machines

                       Copyright (C) 1995 ICSI 

**************************************************************************
*                  THIS IS A PRELIMINARY BETA VERSION                    *
**************************************************************************

The Swiss National Science Foundation (SNSF) has supported the development
of this software. Their support is gratefully acknowledged.

This library implements the Active Message Interface between different
processes with shared memory, like the IPC interface offered by most Unix
versions. It has been tested on SunOS 4, Solaris 2.4 and Linux 1.1.59 and
1.2.4, but it will probably run on other systems as well.  If the thread
support option is used to compile the library, it will also include
functions to start threads on other processes, use locks and semaphores and
send signals to threads on other processes.

A description of the active message interface can be found on
HTTP://now.cs.berkeley.edu

Please send questions and bug reports to fleiner@icsi.berkeley.edu

-------------------------------------------------------------------------

TABLE OF CONTENTS

1. LICENSE AND COPYRIGHT
2. THREAD SUPPORT
3. LIBRARY API	
4. SPEED
5. IMPLEMENTATION DETAILS
6. EXAMPLE
7. INSTALLATION
8. RUNNING AN ACTIVE MESSAGE PROGRAM
9. BUGS AND LIMITATIONS

-------------------------------------------------------------------------

1. LICENSE AND COPYRIGHT

This package is covered by the LGPL contained in the Sather distribution
(file: Doc/LGPL); see there for details.

-------------------------------------------------------------------------

2. THREAD SUPPORT

The active message interface supports two different thread packages.

	LWP

	The first, written by Steven Crane, works on SunOS4 and Linux.
	It should also work on other architectures. The complete package
	has been included in this distribution.  Please note that the
	version you have here has been modified. You can get the original
	package at ftp://gummo.doc.ic.ac.uk:/rex/lwp.tar.gz, but this 
	package does not work with the active messages.

	These threads are non-preemptive, and should therefore work with
	any library. The drawback is that any system call which blocks
	the running thread, blocks all other threads too.

	If you have threads which frequently call yieldp(), you don't
	need to poll for messages, as they are received automatically.
	If your program does not switch from one thread to another often
	enough, you should not insert am_poll() to read messages, but
	rather yieldp() to switch more often.

	Please note that LWP has nothing to do with Sun's Lightweight
	Processes.


	SOLARIS 2.4

	The am-library can also be used together with Solaris threads.
	In this case you never need to poll for messages, as they are
	received anyway. 


-------------------------------------------------------------------------

3. LIBRARY API

This library simulates nodes of a parallel computer with processes on
one computer. Nevertheless the description uses the term node instead of
process.


3.1 Active Messages

As the generic interface for active messages is available on the net, only the 
peculiarities of this implementation are described here. To get the complete 
description, connect to HTTP://now.cs.berkeley.edu/active_messages.html

types:
handler_N_t
handler_df_t
	Type declaration for function pointers passed to am_reply_N,
	am_reply_df, ...
handler_mem_t
	Type declarations for function pointers used in am_get, am_store, ...


am_enable(int procs)
	This call creates procs nodes which can communicate through
	active messages. You may not use exec(2) to execute another
	program, as this would break the implementation.  procs must be
	less than or equal to 32. To raise this limit, you need a
	computer whose maximum shared memory segment size is larger
	than (1024 * procs^2) bytes, and you must also change line 234
	of am.c.
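
The size rule above can be checked with a few lines of C. This is only a
sketch of the arithmetic; the helper name am_segment_bytes is hypothetical
and not part of the library.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper, not part of the library: shared memory needed
 * for `procs` nodes according to the 1024 * procs^2 rule above. */
size_t am_segment_bytes(unsigned int procs)
{
	assert(procs > 0);
	return (size_t)1024 * procs * procs;
}
```

Under this rule 32 nodes need 1048576 bytes, and doubling the number of
nodes quadruples the size of the segment.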

am_disable()
	Must be called before a node terminates. If a node stops
	without calling am_disable(), all other nodes are stopped
	immediately. A node may only call am_disable if all threads
	on all nodes have finished their work.

am_poll()
	If you are not using a thread package, you must poll for messages
	regularly with am_poll().

am_max_size()
	A maximum of 436 bytes can be transferred per am_store() call.

am_wait_for(cond)
	This macro blocks the node until cond!=0. It will poll and
	read messages during its wait. Only the Solaris 2.4 Version with
	thread support will not use a busy-wait, but rather wait on a
	semaphore until a message arrives. (If thread support has been
	used, am_wait_for() blocks only the current thread).

am_sleep(int time)
	Puts a node (thread) to sleep for the specified amount of time, but
	continues to poll for messages. The header file am.h defines
	sleep(n) to be am_sleep(n).

am_dummy()
	am_store(), am_store_async() and am_get() need a handler each
	time they are used. NULL is not an acceptable handler, you may
	however use this dummy handler in cases where you don't need
	a handler at all.

AM_PROCS
	A macro implementation of am_procs()

AM_MY_PROC
	A macro implementation of am_my_proc()


3.2 Synchronization

The Active Message interface supports two synchronization primitives
which can be used to synchronize nodes. Some others, which are only
supported if threads are used, are described below in 3.3.2.

3.2.1 Wait Counters

Wait counters are similar to semaphores, but a node does not block
until the value is greater than zero, but until it is zero. They are
primarily used after a series of requests has been made to wait for the
replies. Before sending out the requests, the counter would be
incremented by one for each request, and the node then waits for the
counter to become zero again. Each reply is supposed to decrement the
counter.
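
The protocol described above can be modelled in a single process. The
following is a toy sketch, not the library's implementation: the type
model_cntr and all model_* functions are hypothetical stand-ins, and the
"requests" and "replies" are plain loops instead of real messages.

```c
#include <assert.h>

/* Toy single-process model of a wait counter. All names here are
 * hypothetical; the library's own type is cntr_t. */
typedef struct { long pending; } model_cntr;

void model_incr(model_cntr *c) { c->pending++; }
void model_decr(model_cntr *c) { assert(c->pending > 0); c->pending--; }
int  model_is_zero(const model_cntr *c) { return c->pending == 0; }

/* Issue `requests` requests, then consume `replies` replies.
 * Returns 1 if the waiting node could continue, 0 if it would
 * still be blocked. */
int model_round(int requests, int replies)
{
	model_cntr c = { 0 };
	int i;

	assert(replies <= requests);
	for (i = 0; i < requests; i++)
		model_incr(&c);		/* one increment per request sent */
	for (i = 0; i < replies; i++)
		model_decr(&c);		/* each reply handler decrements */
	return model_is_zero(&c);
}
```

With three requests and three replies the counter is back at zero and
the node continues; with only two replies it would keep waiting.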

Waitcounters are declared as
	WAITCNTR(cnt);
and are of type 
	cntr_t;

A counter can only be used as long as the function in which it was
declared does not return. It is possible to pass counters to other
threads and nodes by using am_request or am_reply, but you can only 
manipulate them on the node where they have been created.

All functions to manipulate counters exist both as a macro and as a
function. Macros are written in upper case letters, functions in lower
case letters.

cntr_incr(cntr_t cnt)
CNTR_INCR(cnt)		increment counter

cntr_decr(cntr_t cnt)	
CNTR_DECR(cnt)		decrement counter

cntr_incr_by(cntr_t cnt,long v)
CNTR_INCR_BY(cnt,v)	increment counter by v

cntr_decr_by(cntr_t cnt,long v)	
CNTR_DECR_BY(cnt,v)	decrement counter by v

cntr_wait_for_zero(cnt)
CNTR_WAIT_FOR_ZERO(cnt)	wait until cnt reaches zero. This function may
			NOT be used in a reply or request handler

cntr_is_zero(cnt)
CNTR_IS_ZERO(cnt)	return 0 if the counter has not yet reached zero.

As the decr and incr functions are really handy when used with replies,
requests and remote memory functions, the following functions exist too:

r_cntr_incr(vnn_t from,cntr_t cnt)
r_cntr_decr(vnn_t from,cntr_t cnt)
	These functions should be used in reply_1 or request_1. They
	increment/decrement counter cnt on the remote node. 
	Example:
		am_reply_1(remote_node,r_cntr_decr,wait_counter)

r_cntr_incr_mem(vnn_t from,void*lva,int size,cntr_t cnt)
r_cntr_decr_mem(vnn_t from,void*lva,int size,cntr_t cnt)
	These two functions are meant for am_store, am_store_async
	and am_get. They work exactly as r_cntr_incr/decr, but
	have a different prototype. Note that when used with
	am_store and am_store_async they increment the counter
	on a remote node, while when used with am_get or as end_handler
	in am_store_async they increment the counter on the local
	(calling) node.
	Example: decrement counter cnt when the memory has arrived:
		am_get(remote,rva,size,lva,r_cntr_decr_mem,cnt);

r_cntr_incr_reply_mem(vnn_t from,void*lva,int size,cntr_t cnt)
r_cntr_decr_reply_mem(vnn_t from,void*lva,int size,cntr_t cnt)
	These two functions are meant to be used with am_store
	or am_store_async. They reply to the calling node
	with r_cntr_incr/decr, and hence increment/decrement
	the counter on the local node.


3.2.2 Throw Away Semaphores

These semaphores are similar to standard semaphores, but you can
signal and wait for them just once.

To declare a TA-Semaphore, use
	TA_SEMAPHORE(tas)
The scope and usage limitations are the same as for wait counters.
The type of tas is
	ta_sema_t

ta_sema_signal(ta_sema_t sem)
TA_SEMA_SIGNAL(ta_sema_t sem)	signal sem

ta_sema_wait(ta_sema_t sem)
TA_SEMA_WAIT(ta_sema_t sem)	wait on sem. This routine cannot be used
				in any reply or request handler

r_ta_sema_signal(vnn_t from,long cnt)
r_ta_sema_signal_mem(vnn_t from,void *a,int size,void *cnt)
r_ta_sema_signal_reply_mem(vnn_t from,void *a,int size,void *cnt)
	These routines work exactly as the r_cntr_* routines described above.


3.3 Threads

This part of the interface is only available if the am library has
been compiled with thread support.

3.3.1 Working with Threads

int thr_create_N(vnn_t where,handler_0_t start_routine,long arg0,... argN-1)
int thr_create_df(vnn_t where,handler_0_t start_routine,long arg0,long arg1,double arg2)
	0 <= N <= 4
	These calls create a new thread on node "where". The thread starts 
	by executing start_routine() with the given parameters. The thread
	ends when it returns from start_routine()

thread_t thr_id()
THR_ID
	Returns the ID of the thread. thread_t is not necessarily a long
	or a pointer, but may be a struct as well. It is therefore not
	possible to pass an id as an argument in any am_request(), or
	am_reply() call. If you need to pass an id from one thread
	to the next, you have to do it with either an am_get() or am_store().

int thr_same_id(thread_t id1,thread_t id2)
THR_SAME_ID(id1,id2)
	returns 1 if both ids are equal, 0 otherwise

char *thr_print_id(thread_t id,char *p)
	dumps a human-readable form of the thread id into the character
	array pointed to by p. The caller must ensure that there is enough
	free space in p. The format of the string is implementation
	dependent.

thread_t thr_no_thread()
THR_NO_THREAD
	returns a value of type thread_t which can never be a thread id.


3.3.2 Synchronization

The two synchronization primitives described here are primarily meant to
synchronize threads on a per node basis. Locks and Semaphores are local 
to a node and it is therefore not possible to create a lock on one
node and have a thread on another node lock it.
Unlike wait counters and throw away semaphores, locks and semaphores are
not created on the stack, but rather on the heap. This means that you
have to delete locks and semaphores to reclaim the memory they use.

All operations on locks and semaphores come in two versions, one as
macro (in upper case letters) and one as function (written in 
lower case letters).


3.3.2.1 Locks

typedef void *lock_t;
lock_t lck_create()
LCK_CREATE()			creates a new "unlocked" lock

void lck_lock(lock_t lck)
LCK_LOCK(lck)			locks lck. The use of this function
				in reply and request handlers is very
				restrictive, see below.

void lck_unlock(lock_t lck)
LCK_UNLOCK(lck)			Unlock lck.

int lck_try(lock_t lck)
LCK_TRY(lck)			Tries to lock lck, returns 1 on success, 0
				otherwise

void lck_delete(lock_t lck)
LCK_DELETE(lck)			Deletes lock lck.
				
r_lck_unlock(...)
r_lck_unlock_mem(...)
r_lck_unlock_reply_mem(...)	These three functions work the same way as 
				r_cntr_*.


3.3.2.2 Semaphores

typedef void *sema_t;
sema_t sema_create(unsigned int c)
SEMA_CREATE(c)			Creates a new semaphore, whose initial count
				is c.

void sema_signal(sema_t sem)	
SEMA_SIGNAL(sem)		Signals (increments the counter of) semaphore
				sem

void sema_wait(sema_t sem)
SEMA_WAIT(sem)			Blocks the thread until the counter of sem
				is larger than zero. The use of this function
				in reply and request handlers is very
				restrictive, see below. Returns 0 if the 
				semaphore has been decremented, -1 
				if the thread has been interrupted.

void sema_delete(sema_t sem)
SEMA_DELETE(sem)		Deletes semaphore sem.
				
r_sema_signal(...)
r_sema_signal_mem(...)
r_sema_signal_reply_mem(...)	These three functions work the same way as 
				r_cntr_*.


3.3.2.3 Using lck_lock and sema_wait in reply and request handlers

You can use either of these functions in a reply or request handler if
you make sure that no thread ever calls am_request or am_reply while it
holds either a semaphore or a lock which could be locked in a request 
or reply handler.


3.3.3 Local Memory

If you need global variables which are local to the thread, you can
store one pointer and retrieve it later. This pointer is private to
the thread that stored it, and no other thread can retrieve it
with these calls. (It is of course possible to send the pointer 
to another thread on the same system).

void thr_set_local(void *ptr) 	
THR_SET_LOCAL(ptr)		set the local memory of a thread

void *thr_local()
THR_LOCAL			return the local memory of a thread.


3.3.4 Signals

Each thread can send a signal to any other thread on the system, if 
it knows its ID. A signal has only an effect if
- the signal is not ignored and
- either the signal is blocked (in this case its arrival is stored)
- or the thread defined a signal handler, in which case the signal 
  handler will be called.

The following functions exist:
void thr_signal(thread_t tid)
	send thread tid a signal. Note that there is just one
	signal. If the thread has already received a signal and has
	neither reset it nor returned from its signal handling function,
	the new signal is ignored.

void thr_set_signal_handler(signal_handler_t signal_handler)
	Installs a signal handler for the current thread.

void thr_ignore_signal()
	All incoming signals are ignored

void thr_block_signal();
	Blocks incoming signals. However, the arrival of a signal is stored
	(the signal is said to be pending) and the signal is delivered as
	soon as signals are reenabled.  thr_got_signal() will return 1 in
	this case. At most one signal can be pending.

void thr_enable_signal()
	Enables signals. If a signal was pending, it
	is immediately delivered.

void thr_reset_signal()
	If a signal is pending, it is removed.
	
int thr_got_signal()
	Returns 1 if a signal is pending, 0 otherwise.
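
The rules for blocked, ignored and pending signals can be summarized as
a small state machine. The sketch below is only one reading of the
descriptions above, written as a single-thread model; every model_* name
is hypothetical, and it does not cover the case of a signal arriving
while the handler is still running.

```c
#include <assert.h>
#include <stddef.h>

typedef void (*model_handler_t)(void);

enum model_mode { MODEL_ENABLED, MODEL_BLOCKED, MODEL_IGNORED };

/* Toy model of the signal state of one thread (hypothetical names). */
typedef struct {
	enum model_mode mode;
	int pending;			/* at most one signal is stored */
	model_handler_t handler;	/* NULL if none is installed */
} model_thread;

/* Sample handler that counts how often a signal was delivered. */
int model_delivered = 0;
void model_count_delivery(void) { model_delivered++; }

void model_signal(model_thread *t)
{
	assert(t != NULL);
	switch (t->mode) {
	case MODEL_IGNORED:
		break;			/* signal is dropped */
	case MODEL_BLOCKED:
		t->pending = 1;		/* a second signal changes nothing */
		break;
	case MODEL_ENABLED:
		if (t->handler != NULL)
			t->handler();	/* no handler: no effect */
		break;
	}
}

void model_block(model_thread *t) { t->mode = MODEL_BLOCKED; }
void model_reset(model_thread *t) { t->pending = 0; }
int  model_got_signal(const model_thread *t) { return t->pending; }

void model_enable(model_thread *t)
{
	t->mode = MODEL_ENABLED;
	if (t->pending) {		/* deliver the stored signal now */
		t->pending = 0;
		if (t->handler != NULL)
			t->handler();
	}
}
```

In this model two signals sent while blocked lead to exactly one handler
call after model_enable(), and a pending signal removed with
model_reset() is never delivered.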
	

3.3.5 Delays

The library offers two functions to send a signal at a later time
or have a function executed at a later time. Such a function will
not necessarily be executed by the thread that asked for its 
execution.

typedef struct { long sec; long nsec; } delay_t;
thr_delay_signal(delay_t n)
	Sends a signal to the current thread after a delay of at least
	n.sec seconds and n.nsec nanoseconds.
thr_delay_function(delay_t n,handler_delay_t function,void *arg)
	The function is called after at least n.sec seconds and n.nsec 
	nanoseconds, with arg as argument.
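
Filling in a delay_t by hand is error prone, so a small conversion
helper may be useful. The sketch below repeats the two-field layout of
delay_t so that it stands alone; the helper delay_from_ms is
hypothetical, and keeping nsec below one second is an assumption about
what the library expects.

```c
#include <assert.h>

/* Same two fields as the delay_t described above, repeated here so
 * that the sketch compiles without am.h. */
typedef struct { long sec; long nsec; } delay_t;

/* Hypothetical helper: build a delay_t from milliseconds, with the
 * nsec part always smaller than one second. */
delay_t delay_from_ms(long ms)
{
	delay_t d;

	assert(ms >= 0);
	d.sec  = ms / 1000;
	d.nsec = (ms % 1000) * 1000000L;
	return d;
}
```

A value built this way could then be passed to thr_delay_signal() or
thr_delay_function().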

-------------------------------------------------------------------------

4. SPEED

On a multiprocessor computer like a Sparc 20, roundtrip time can be as low
as 6usec, but if the number of nodes used is greater than the number
of processors available, it can be as bad as 6ms. This is due to the 
time lost when switching from one node to the next.

-------------------------------------------------------------------------

5. IMPLEMENTATION DETAILS

MESSAGE EXCHANGE

The library uses a large (1MB) shared memory segment to exchange
messages. This memory is organized like a matrix. Each process owns one
row and one column. It reads messages from its row, and writes messages
to other processes in its column. This organization ensures that exactly
one process writes in and one process reads from each cell in the matrix:

Example:    1   2   3   4   5
          +---+---+---+---+---+
        1 |   |   |   | # |   | 
          +---+---+---+---+---+
        2 |   |   |   |   |   | 
          +---+---+---+---+---+
        3 |   |   |   |   |   | 
          +---+---+---+---+---+
        4 | @ |   |   |   |   | 
          +---+---+---+---+---+
        5 |   |   |   |   |   | 
          +---+---+---+---+---+

If processor 1 sends a message to processor 4, it uses cell @, while
the reply will be written to #.
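
The choice of cell can be written as a one-line index computation. This
is a sketch only: the helper cell_index is hypothetical, nodes are
numbered from 0 here (the picture counts from 1), and storing the matrix
row by row is an assumption.

```c
#include <assert.h>

/* Hypothetical helper: index of the cell used when `sender` writes a
 * message for `receiver`. The row belongs to the reader and the
 * column to the writer, as in the picture above. */
unsigned int cell_index(unsigned int sender, unsigned int receiver,
			unsigned int procs)
{
	assert(sender < procs && receiver < procs);
	return receiver * procs + sender;
}
```

With 5 nodes, a message from the first node to the fourth uses cell 15
(cell @ in the picture) and the reply uses cell 3 (cell #), so the two
directions never share a cell.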

Each cell has 30 slots for messages and two bit fields named r and w.
Each bit in r and w represents one message slot.  A message slot is said
to be free if the bits in r and w are equal.

So ~(r^w) returns a bit field with all free slots, while (r^w) returns a
bit field with all filled slots. Whenever a process has read a message, it
changes the corresponding bit in r, while a process which writes a
message changes the corresponding bit in w. As there is no memory
position where two processes are writing, we don't need to synchronize
the processes with locks. On the other hand we need quite a lot of shared
memory, namely 1MB for 32 processes.  Note that the above algorithm works
only if reads from a shared memory position are guaranteed to return
either an old value or a new value on a per bit basis (it still works if
some of the bits represent an old value, while some represent a new
value, but not if some bits return a value which has not been written to
them in the last write).
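
The bookkeeping with r and w can be sketched as follows. Only the names
r and w and the number of slots come from the description above; the
struct, the function names and the "lowest free slot first" policy are
hypothetical. The sketch runs in one process and ignores the copying of
message bodies and all memory ordering questions.

```c
#include <assert.h>
#include <stdint.h>

#define SLOTS 30			/* message slots per cell */
#define SLOT_MASK ((UINT32_C(1) << SLOTS) - 1)

/* Bookkeeping of one cell (hypothetical layout). */
typedef struct {
	uint32_t r;	/* toggled only by the reader */
	uint32_t w;	/* toggled only by the writer */
} cell_bits;

uint32_t free_slots(const cell_bits *c)
{
	return ~(c->r ^ c->w) & SLOT_MASK;
}

uint32_t filled_slots(const cell_bits *c)
{
	return (c->r ^ c->w) & SLOT_MASK;
}

/* Writer side: claim the lowest free slot, or return -1 if the
 * cell is full. */
int write_slot(cell_bits *c)
{
	uint32_t avail = free_slots(c);
	int i;

	for (i = 0; i < SLOTS; i++) {
		if (avail & (UINT32_C(1) << i)) {
			/* the message body would be copied in here */
			c->w ^= UINT32_C(1) << i;
			return i;
		}
	}
	return -1;
}

/* Reader side: mark filled slot i as consumed. */
void read_slot(cell_bits *c, int i)
{
	assert(i >= 0 && i < SLOTS);
	assert(filled_slots(c) & (UINT32_C(1) << i));
	/* the message body would be copied out here */
	c->r ^= UINT32_C(1) << i;
}
```

Because the writer only toggles bits in w and the reader only toggles
bits in r, a slot becomes free again as soon as both bits agree, no
matter whether they are both 0 or both 1.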


PROCESSES

Message handlers are sent from one process to the next as function
pointers. This implies that each process executes EXACTLY THE SAME binary
program image. This policy is enforced by the am_enable() call, which
forks as many processes as requested. If you try to exec another image,
the behavior is undefined and will most probably result in a crash. If
you really need different programs to run together, you must
rename their main() procedures and create a new main() as follows:

int main()
{	
	am_enable(2);
	switch(am_my_proc()) {
	case 0: main_0();am_disable();break;
	case 1: main_1();am_disable();break;
	}
	return 0;
}

If the two programs cannot be linked because of name conflicts, you
are out of luck.

-------------------------------------------------------------------------

6. EXAMPLE

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/time.h>
#include "am.h"

long reply_counter=0;

/*
 * count how many replies have been received
 * If we used a preemptive thread package like SOLARIS threads,
 * we would need a lock around the reply_counter++ statement.
 */
void reply(int from,long arg)
{
	reply_counter++;
}

/*
 * receive a message and reply immediately
 */
void got_msg(int from,long arg)
{
	am_reply_1(from,reply,arg);
}

void bye(int from)
{
	am_disable();
	exit(0);
}

int main()
{
	int i;

	am_enable(4);

	switch(am_my_proc()) {
	case 0:
		printf("process 0 now sends a request to all other processes\n");
		for(i=1;i<am_procs();i++) am_request_1(i,got_msg,i);

		/* wait for all replies */
		am_wait_for(reply_counter==(am_procs()-1));

		/* stop all processes */
		for(i=1;i<am_procs();i++) am_request_0(i,bye);
		am_disable();
		break;

	default:
		/* all other processes just sleep and wait for messages */
		while(1) sleep(10000);
	}
	return 0;
}

-------------------------------------------------------------------------

7. INSTALLATION

7.1 The Shared Memory Version

To compile and use this library, you need
- an ANSI compliant C compiler, like gcc
- a computer where shared memory segments can be as big as 1MB

First edit the Makefile, and select the package you would like
and define the place where the header and the library should go.
(Sorry, there is no configure script)

type    
	make

WARNING: during the make a test program is started which starts 32 nodes!
         This means that you have 32 running processes, which may slow
	 down a computer without enough memory/CPU considerably. 
	 If you prefer not to run such a heavy test, change the line 
	 	./mesg 32 100
	 to
	 	./mesg 4 10
	 in the Makefile (100 specifies the number of messages sent around 
	 during various tests).

If all went well, type
	make install

You can install different versions of the library if you use different
library names (note that all versions may use the same header file).

Warning: if you use the lwp-package, it will be compiled, BUT NOT 
         INSTALLED. You have to install it manually! If you already have
	 the lwp package installed, you must either replace it or
	 install this one under a different name. Note that all programs
	 that run with the 0.1.2 version of the lwp package should
	 also work with the new version. The new version fixes some bugs,
	 and adds locks and local memory to the package. These changes
	 are currently not documented, but this should change in the
	 future.


7.2 The TCP/IP Version

To compile and use this library, you need
- an ANSI compliant C compiler, like gcc
- a network of computers connected by TCP/IP

First edit the Makefile, and select the package you would like
and define the place where the header and the library should go.
(Sorry, there is no configure script)

type    
	make

WARNING: You will have to execute the test program yourself,
         as there is no support from the library. See below
	 on how to start a program on a network of computers.
	 The test program is named 'mesg' and should be started
	 with two arguments on each computer. The first argument
	 is the number of nodes while the second defines how
	 many test messages are sent between the different nodes.

If all went well, type
	make install

You can install different versions of the library if you use different
library names (note that all versions may use the same header file).


-------------------------------------------------------------------------

8. RUNNING AN ACTIVE MESSAGE PROGRAM

8.1 The Shared Memory Version

Just start your program as you would any normal program. It will
start up as many nodes as you specify in the am_enable() call.
If one node crashes due to an error, all others are immediately
stopped.

8.2 The TCP/IP Version

Here is a copy of the original version of the README file for the 
TCP/IP version of the active messages:

------------------------BEGIN README FILE -----------------------------------
The library in bin-tcp is compiled for Solaris. The code can be easily ported
to other platforms like HPUX, AIX and Linux. You may need to make some
minor modifications to the header files and #define's.

To run a program written in TCPAM without GLUnix:
1. You need to add this to your .cshrc:

   setenv TCPAM_CONFIG <full path name of configuration file>

2. configuration file format:

<number of nodes>
<server port number> /* pick some number > 5000 */
<fast network hostname 1> <slow network hostname 1>
<fast network hostname 2> <slow network hostname 2>
 .
 .
 .

Messages will go through the fast network. The slow network is for
spawning the program. If you have only one network, e.g. Ethernet, use the
same name for the fast and slow network.

3. You need to link your program with the BSD libraries.  A sample
   program for Solaris can be found in ~/example.

4. To start a job, you can use the prun script:
     prun <number of nodes> <program> <arguments> 
   This script will read the configuration file and start the program with rsh.
   You need to add the hostnames in your .rhosts file for rsh to work.
------------------------END README FILE -----------------------------------


-------------------------------------------------------------------------

9. BUGS AND LIMITATIONS

- no man pages and nearly no documentation

