N-2-2-030.50.2

Parallel Processors and Gigabit Networks
by Craig Partridge

Computers that use multiple processors are becoming an increasingly important part of the computing milieu. Most supercomputers today use multiple processors in parallel, and the technology is rapidly moving into workstations and PCs. As a result, it seems inevitable that attaching parallel processors to gigabit networks is a problem we will need to solve. This quarter's column takes a look at some of what is currently known about parallel processing of network protocols.

It is important to recognize that parallel processing of network protocols is a hard problem. An example illustrates the point. A major supercomputing company took some care about a year ago to develop what they thought would be a high quality TCP/IP implementation for their multiprocessor. But when they tested its performance they discovered that it could handle only a few thousand TCP/IP datagrams per second! (That's about what a good PC can handle.) There are a lot of potential bottlenecks in parallel processing of protocols, and this implementation had hit a couple of them. In the next few paragraphs we'll examine some of the major problems and their solutions (if known).

The most important bottleneck can be described as the serial-to-parallel problem. In brief, the problem is that there's a single network interface into the parallel computer and that something in the system has to coordinate moving data from a single interface to multiple processors (or from multiple processors to a single interface). The coordinating something is typically a single processor, which inevitably means that the multiprocessor system's packet processing rate is bounded by the maximum rate at which a single one of its processors can move packets on and off the network. Complicating this problem is the fact that on many parallel processing machines, a single application may be running in parallel on several processors.
So a single TCP segment may contain data destined for several different processors. In some implementations, the poor processor managing the interface often has to parse the TCP segment to figure out which processors get which pieces of the data.

There are at least two ways people are trying to handle this problem. One approach is to put multiple network interfaces into the parallel processor. Data for a particular application, or piece of an application, is sent to the network interface closest to the processors on which that application is running. By properly balancing networked applications across the processors and interfaces, there's a good chance of achieving high performance. Experiments using this approach (as part of a broader study of networking with parallel processors) are currently underway using ATM network interfaces and Thinking Machines computers at the US Naval Research Laboratory.

Another approach is to try to minimize the work that the processor managing the interface must do. A popular idea is to have that processor simply move packets on and off the network, and to move all the protocol processing to a group of processors inside the system. (Note that this approach is complementary to the multiple interface approach, so the two can be used together.) The major puzzle in this approach is figuring out how best to process packets in parallel within a group of processors.

There are at least four known methods for parallel packet processing.

The first method is to assign a processor to each logical connection; that processor handles all packets for the connection. The major disadvantage of this method is that even a connection that needs high throughput is limited by the capacity of a single processor.

The second method is to assign a processor to each network protocol (e.g., one processor for IP, one for TCP, one for UDP). Packets move from processor to processor for protocol processing.
Some of the problems with this approach are that throughput is limited by the slowest protocol in the pipeline, and that hopping from one processor to another is expensive in some multiprocessors.

The third method is to assign a processor to individual protocol subfunctions. So one processor might compute the TCP checksum, while another does TCP sequencing and a third does round-trip time estimation. The major problem with this approach is that the cost of splitting pieces of the packet (and related state like TCP protocol control blocks) across multiple processors is typically far more expensive than the cost of processing the TCP segment itself (which is only about 100 instructions).

The fourth (and to my mind, most promising) method is to assign a different processor to each packet that comes in, and have that processor do all the protocol processing on that packet. This method largely addresses the problems in the other three (a single connection can use multiple processors, slow protocols don't affect other packets as much, and we don't have to split packets across processors). The main performance barrier is the occasional need to share state. For example, if two TCP segments for the same connection are being handled on two different processors, the processors need to coordinate their access to the TCP protocol control block to avoid corrupting the TCP state information. Work on this problem is being done at the Swedish Institute of Computer Science.

Finally, one of the key problems in multiprocessors is that most multiprocessors are built using regular processor chips like those in workstations and PCs. And as earlier columns have explained, the networking community is still working hard on making uniprocessor implementations on those processors run fast.
So the people working on parallel processing actually have a dual problem: first they have to make sure that their implementation is well tuned for their processors, and then they have to figure out how to make the implementation run well in parallel. In short, there are lots of challenges in this area.

References:

For a good survey of the work on parallel packet processing, see the paper by Bjorkman and Gunningberg in the Proceedings of ACM SIGCOMM '94.
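
As a rough illustration of the fourth method described in the column (one processor per packet, with coordinated access to shared connection state), here is a minimal sketch in Python. Everything in it is an assumption made for illustration: the `ControlBlock` class is a heavily simplified, hypothetical stand-in for a TCP protocol control block, `checksum` is a toy additive sum rather than the real Internet checksum, and worker threads stand in for processors (in standard CPython, threads do not actually run Python code in parallel, so this shows only the locking structure, not the performance). It is not taken from any of the implementations mentioned above.

```python
import threading
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass, field


@dataclass
class ControlBlock:
    """Hypothetical, simplified stand-in for a TCP protocol control block."""
    bytes_received: int = 0
    segments_seen: int = 0
    lock: threading.Lock = field(default_factory=threading.Lock)


def checksum(payload: bytes) -> int:
    """Toy 16-bit additive checksum (NOT the real Internet checksum)."""
    return sum(payload) & 0xFFFF


def process_segment(pcb: ControlBlock, payload: bytes, expected_sum: int) -> bool:
    """Do all protocol processing for one segment on one worker."""
    # Work that touches no shared state runs without holding the lock.
    if checksum(payload) != expected_sum:
        return False
    # Only the update of shared connection state is serialized.
    with pcb.lock:
        pcb.bytes_received += len(payload)
        pcb.segments_seen += 1
    return True


def run_demo(num_segments: int = 1000, workers: int = 4):
    """Feed segments for a single connection to a pool of workers."""
    pcb = ControlBlock()
    payloads = [bytes([i % 256]) * 10 for i in range(num_segments)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = list(
            pool.map(lambda p: process_segment(pcb, p, checksum(p)), payloads)
        )
    return pcb, results


if __name__ == "__main__":
    pcb, results = run_demo()
    print(pcb.segments_seen, pcb.bytes_received, all(results))
```

The design point the sketch tries to capture is that per-packet work (here, the checksum) needs no coordination, while the lock is held only for the brief update of per-connection state. How much that shared-state section limits throughput on a real multiprocessor is exactly the open question the column raises.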