N-2-2-030.50.2

Parallel Processors and Gigabit Networks
by Craig Partridge
<craig@aland.bbn.com>

    Computers that use multiple processors are becoming an increasingly
important part of the computing milieu.  Most supercomputers today use
multiple processors in parallel, and the technology is rapidly moving into
workstations and PCs.  As a result, it seems inevitable that attaching
parallel processors to gigabit networks is a problem we will need to
solve.  This quarter's column takes a look at some of what is currently
known about parallel processing of network protocols.

    It is important to recognize that parallel processing of network
protocols is a hard problem.  An example illustrates the point.  A major
supercomputing company took some care about a year ago to develop what they
thought would be a high quality TCP/IP implementation for their multiprocessor.
But when they tested its performance they discovered that it could handle only
a few thousand TCP/IP datagrams per second! (That's about what a good PC can
handle.)  There are a lot of potential bottlenecks in parallel processing
of protocols, and this implementation had hit a couple of them.  In the
next few paragraphs we'll examine some of the major problems and their
solutions (if known).

    The most important bottleneck can be described as the
serial-to-parallel problem.  In brief, the problem is that there's a single
network interface into the parallel computer and that something in the system
has to coordinate moving data from a single interface to multiple processors
(or from multiple processors to a single interface).  The coordinating
something is typically a single processor, which inevitably means that
the multiprocessor system's packet processing rate is bounded by the
maximum rate that a single one of its processors can move packets on
and off the network.  Complicating this problem is the fact that on many
parallel processing machines, a single application may be running in parallel
on several processors.  So a single TCP segment may contain data destined
for several different processors.  In some implementations, the poor
processor managing the interface often has to parse the TCP segment
to figure out which processors get which pieces of the data.  There
are at least two ways people are trying to handle this problem.
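
    Before looking at those, the bound itself can be put in a few lines of
Python.  This is only a back-of-the-envelope model, not a measurement: the
function name and all of the rates below are made up for illustration.

```python
def system_packet_rate(dispatcher_pps, worker_pps, num_workers):
    """Upper bound on packets/second behind a single interface processor.

    All three parameters are hypothetical inputs, not measured values.
    """
    # Every packet passes through the processor managing the interface,
    # so that processor caps the system no matter how much aggregate
    # capacity the other processors have.
    return min(dispatcher_pps, worker_pps * num_workers)


# Illustrative numbers only: an interface processor that can move 5,000
# packets/s in front of processors that can each handle 2,000 packets/s.
for n in (1, 2, 4, 64):
    print(n, system_packet_rate(5000, 2000, n))
```

    Under these assumed numbers, adding processors stops helping once their
combined capacity exceeds what the interface processor can move.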

    One approach is to put multiple network interfaces into the parallel
processor.  Data for a particular application, or piece of an application, is
sent to the network interface closest to the processors on which that
application is running.  By properly balancing networked applications across
the processors and interfaces, there's a good chance of achieving high
performance.  Experiments using this approach (as part of a broader study
of networking with parallel processors) are currently underway using
ATM network interfaces and Thinking Machines computers at the US Naval
Research Laboratory.

    Another approach is to try to minimize the work that the processor
managing the interface must do.  A popular idea is to have that processor
simply move packets on and off the network, and move all the protocol
processing to a group of processors inside the system.  (Note that this
approach is complementary to the multiple interface approach, so both can
be used together.)  The major puzzle in this approach is figuring out how best
to process packets in parallel within a group of processors.

    There are at least four known methods for parallel packet processing.
The first method is to assign a processor to each logical connection and have
it handle all packets for that connection.  The major disadvantage of this
method is that even a connection that needs high throughput is limited
by the capacity of a single processor.
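
    One plausible way to implement this first method (an assumption of
mine, not a description of any particular system) is to hash the
connection's addresses and ports to a processor number.  The function name
and the choice of CRC-32 as the hash are illustrative only.

```python
import zlib


def processor_for_connection(src_ip, src_port, dst_ip, dst_port,
                             num_processors):
    """Pick a processor index from a connection's address/port 4-tuple."""
    key = f"{src_ip}:{src_port}-{dst_ip}:{dst_port}".encode("ascii")
    # crc32 is used only as a cheap, deterministic hash (an assumption of
    # this sketch); any stable hash of the 4-tuple would do.
    return zlib.crc32(key) % num_processors


# Three packets from the same (made-up) connection.
packets = [("10.0.0.1", 1025, "10.0.0.2", 23)] * 3
print([processor_for_connection(*p, 8) for p in packets])
```

    Every packet of a given connection maps to the same index, which is
exactly why a single busy connection cannot use more than one processor.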

    The second method is to assign a processor to each network protocol (e.g.,
one processor for IP, one for TCP, one for UDP).  Packets move from
processor to processor for protocol processing.  Some of the problems with
this approach are that throughput is limited by the slowest protocol in
the pipeline, and that hopping from one processor to another is expensive
in some multiprocessors.
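
    A rough model of the pipeline shows both problems at once.  The sketch
below assumes each stage is busy for its own processing time plus a fixed
handoff cost; the function name and the microsecond figures are hypothetical
and chosen only to make the arithmetic easy to follow.

```python
def pipeline_packets_per_second(stage_times_us, hop_time_us=0):
    """Steady-state rate of a one-processor-per-protocol pipeline.

    stage_times_us: microseconds each protocol's processor spends per packet.
    hop_time_us: microseconds spent handing a packet to the next processor
                 (charged to every stage, a simplification of this sketch).
    """
    # The busiest stage sets the pace for the whole pipeline.
    busiest = max(t + hop_time_us for t in stage_times_us)
    return 1_000_000 / busiest


# Made-up costs: 10 us for the IP stage, 40 us for the TCP stage.
stages = [10, 40]
print(pipeline_packets_per_second(stages))
print(pipeline_packets_per_second(stages, hop_time_us=10))
```

    With these assumed costs the slower TCP stage alone determines the rate,
and any handoff cost lowers it further.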

    The third method is to assign a processor to individual protocol
subfunctions.  So one processor might compute the TCP checksum, while another
does TCP sequencing and a third does round-trip time estimation.  The major
problem with this approach is that the cost of splitting pieces of the packet
(and related state like TCP protocol control blocks) across multiple processors
is typically far more expensive than the cost of processing the TCP segment itself
(which is only about 100 instructions).

    The fourth (and to my mind, most promising) method is to assign a different
processor to each packet that comes in, and have the processor do all the
protocol processing on that packet.  This method largely addresses the
problems in the other three (a single connection can use multiple processors,
slow protocols don't affect other packets as much, and we don't have to split
packets across processors).  The main performance barrier is the occasional
need to share state.  For example, if two TCP segments for the same connection
are being handled on two different processors, the processors need to
coordinate their access to the TCP protocol control block to avoid corrupting
the TCP state information.  Work on this problem is being done at the
Swedish Institute of Computer Science.
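
    The coordination problem can be sketched with threads standing in for
processors and a lock guarding a drastically simplified control block.  The
class, its fields, and the use of a single lock are my assumptions for
illustration; this is not a description of the SICS work or of any real TCP
implementation.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


class ControlBlock:
    """Hypothetical, heavily simplified stand-in for a TCP control block."""

    def __init__(self):
        self.lock = threading.Lock()
        self.bytes_received = 0
        self.segments_seen = 0


def process_segment(pcb, payload):
    # Per-packet work that touches no shared state runs without the lock.
    # (len() stands in for checksum verification, header parsing, etc.)
    length = len(payload)
    # Only the update of shared connection state is serialized.
    with pcb.lock:
        pcb.bytes_received += length
        pcb.segments_seen += 1


pcb = ControlBlock()
segments = [b"x" * 512 for _ in range(1000)]
# Four worker threads play the role of four processors, each taking
# whole segments of the same connection.
with ThreadPoolExecutor(max_workers=4) as pool:
    list(pool.map(lambda seg: process_segment(pcb, seg), segments))
print(pcb.segments_seen, pcb.bytes_received)
```

    The shorter the locked region is relative to the rest of the per-packet
work, the less the processors get in each other's way.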

    Finally, one of the key problems in multiprocessors is that most
multiprocessors are built using regular processor chips like those in
workstations and PCs.  And as earlier columns have explained, the networking
community is still working hard on making uniprocessor implementations on those
processors run fast.  The people working on parallel processing therefore
have a dual problem: first they have to make sure that their implementation
is well-tuned for their processors, and then they have to figure out how
to make the implementation run well in parallel.  So there are lots of
challenges in this area.

References: For a good survey of the work on parallel packet processing, see
the paper by Bjorkman and Gunningberg in the Proceedings of ACM SIGCOMM '94.