Procset should define the left and right nodes so that the most distant
node is listed first.

For a mesh-connected machine, the children should be defined to minimize
potential contention on the mesh.  The general version used tends to
concentrate nodes (left and right children are adjacent).  One could
argue that one child should receive its neighbor's message and send the
result to the parent, particularly in the case of reductions.  The
disadvantage is that the startup times are not overlapped.  An advantage
is that each node receives only ONE message, not two (could use
forcetypes after a sync for large amounts of data).
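
A minimal sketch of the ordering idea, to make the "most distant first"
rule concrete.  Everything here is an assumption for illustration: the
helper names are hypothetical (not Procset routines), the tree uses
heap-style numbering (children of r are 2r+1 and 2r+2, hence adjacent),
and the mesh is assumed to be numbered row-major with width nx.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical helpers; none of these names come from Procset. */

/* Heap-style children of node r among n nodes; -1 means "no child". */
static void tree_children(int r, int n, int *left, int *right)
{
    assert(r >= 0 && r < n);
    *left  = (2 * r + 1 < n) ? 2 * r + 1 : -1;
    *right = (2 * r + 2 < n) ? 2 * r + 2 : -1;
}

/* Hop count between ranks a and b on a mesh nx wide
   (row-major numbering is an assumption of this sketch). */
static int mesh_distance(int a, int b, int nx)
{
    assert(nx > 0);
    return abs(a % nx - b % nx) + abs(a / nx - b / nx);
}

/* Reorder the two children of r so that *first is the more distant
   one on the mesh; a missing child (-1) always goes second. */
static void order_children(int r, int nx, int *first, int *second)
{
    int swap = 0;

    if (*second < 0) return;            /* nothing to reorder */
    if (*first < 0)
        swap = 1;
    else if (mesh_distance(r, *second, nx) > mesh_distance(r, *first, nx))
        swap = 1;

    if (swap) {
        int t = *first;
        *first  = *second;
        *second = t;
    }
}
```

On a 4-wide mesh, node 3 has children 7 and 8; node 8 is five hops away
and node 7 only one, so order_children puts 8 first.  This only orders
the existing children; it does not yet choose children to reduce
contention.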

I really should find some papers on this.

Also see mesh/compile about having a scheduling algorithm with a generic
interface.

Define a DistributedPointer structure:
typedef struct {
    int  p;         /* Processor */
    int  nbytes;    /* Size of object */
    void *obj;      /* pointer on owning Processor */
    /* In addition, we may want things like is_local (rather than
       testing on the processor number), some sort of id field,
       and routines to move data to/from a buffer (this allows the
       handling of data-structures that reference other items).
       */
    } DistPtr;
To implement this, a WRITE to the item sends the appropriate data to
the owning processor, with a header/trailer that contains (obj,nbytes);
on receipt, the owning processor does a memcpy.  In a more general
case, the object pointer must carry copy and free routines that
will handle copying the object in.
A READ is just the inverse, except that a request is sent and a reply
is waited for.  Naturally, there needs to be an asynchronous READ as well.
Finally, a MOVE takes advantage of the copy routines to move the object
somewhere else (processors need to be ready to accept it).
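
A rough sketch of the simple (memcpy) WRITE path described above.  The
WriteHdr layout and the dp_pack_write/dp_handle_write names are
hypothetical, and the actual message send/receive is left out: the
requester builds one buffer holding the header followed by the data, and
the owner unpacks it.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

typedef struct {
    int  p;         /* Processor */
    int  nbytes;    /* Size of object */
    void *obj;      /* pointer on owning Processor */
} DistPtr;

/* Hypothetical wire header: where to store the data and how much. */
typedef struct {
    void *obj;
    int  nbytes;
} WriteHdr;

/* Requester side: pack header + data into a single message buffer.
   Returns a malloc'ed buffer (caller frees) or NULL on failure;
   *msglen receives the total message length. */
static char *dp_pack_write(const DistPtr *dp, const void *data,
                           size_t *msglen)
{
    WriteHdr hdr;
    char     *msg;

    assert(dp->nbytes >= 0);
    hdr.obj    = dp->obj;
    hdr.nbytes = dp->nbytes;

    *msglen = sizeof hdr + (size_t)dp->nbytes;
    msg = malloc(*msglen);
    if (!msg) return NULL;

    memcpy(msg, &hdr, sizeof hdr);
    memcpy(msg + sizeof hdr, data, (size_t)dp->nbytes);
    return msg;
}

/* Owner side: on receipt, read the header and memcpy the payload
   into the object.  hdr.obj is only meaningful on the owner. */
static void dp_handle_write(const char *msg)
{
    WriteHdr hdr;

    memcpy(&hdr, msg, sizeof hdr);
    memcpy(hdr.obj, msg + sizeof hdr, (size_t)hdr.nbytes);
}
```

A READ would presumably send only the header as the request and get the
nbytes of data back as the reply.  The more general case would replace
the second memcpy in dp_handle_write with the object's copy routine.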

Difficulties:  Since the actual pointer is used REMOTELY, once that
pointer becomes invalid, we can have problems.  Some possibilities
include leaving the header on the owner and having it forward the
response (as well as sending an updated location back to the requester).
The object pointer could simply point into a list of once-owned
objects that could be flushed when safe (for example, by some sort of
reference counting).  This is not a completely general solution, but
it might provide a better way to implement some algorithms, such as the
FMM, since the objects there are well-defined and relatively large (at
least 160 bytes for a 20-term multipole expansion).
A DeferredWRITE could also be defined that attempts to pack several
messages into a single operation; we'd also need a FLUSH then.
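
One possible shape for the deferred WRITE and FLUSH, as a sketch only.
All names (DeferredBuf, dw_write, dw_flush, dw_apply) and the fixed
buffer size are made up for illustration.  The send is simulated:
dw_flush calls the owner-side routine directly where a real version
would send the packed buffer as one message.

```c
#include <assert.h>
#include <string.h>

#define DW_BUFSIZE 1024     /* arbitrary capacity for this sketch */

/* Hypothetical per-record header: destination and size. */
typedef struct {
    void *obj;
    int  nbytes;
} WriteHdr;

/* Pending writes for one destination processor. */
typedef struct {
    char   buf[DW_BUFSIZE];
    size_t used;            /* bytes of buf currently filled */
    int    nwrites;         /* records currently packed */
} DeferredBuf;

/* Owner side: apply every (header, data) record in a packed
   message.  Returns the number of records applied. */
static int dw_apply(const char *msg, size_t len)
{
    size_t off = 0;
    int    n   = 0;

    while (off + sizeof(WriteHdr) <= len) {
        WriteHdr hdr;
        memcpy(&hdr, msg + off, sizeof hdr);
        off += sizeof hdr;
        assert(off + (size_t)hdr.nbytes <= len);
        memcpy(hdr.obj, msg + off, (size_t)hdr.nbytes);
        off += (size_t)hdr.nbytes;
        n++;
    }
    return n;
}

/* FLUSH: deliver everything packed so far as a single operation
   (simulated here by a direct call), then reset the buffer. */
static int dw_flush(DeferredBuf *db)
{
    int n = dw_apply(db->buf, db->used);
    db->used    = 0;
    db->nwrites = 0;
    return n;
}

/* Deferred WRITE: append one record, flushing first if it would
   not fit.  Returns 0 on success, -1 if the record is too large
   for the buffer even when empty. */
static int dw_write(DeferredBuf *db, void *obj, const void *data,
                    int nbytes)
{
    WriteHdr hdr;
    size_t   need;

    if (nbytes < 0) return -1;
    need = sizeof hdr + (size_t)nbytes;
    if (need > DW_BUFSIZE) return -1;
    if (db->used + need > DW_BUFSIZE) dw_flush(db);

    hdr.obj    = obj;
    hdr.nbytes = nbytes;
    memcpy(db->buf + db->used, &hdr, sizeof hdr);
    memcpy(db->buf + db->used + sizeof hdr, data, (size_t)nbytes);
    db->used += need;
    db->nwrites++;
    return 0;
}
```

Note that the data is copied at dw_write time, so the caller may reuse
its source buffer immediately, but nothing reaches the owner until a
FLUSH (explicit, or forced by a full buffer).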

