








                 NNeettwwoorrkkiinngg IImmpplleemmeennttaattiioonn NNootteess
                         44..44BBSSDD EEddiittiioonn


_S_a_m_u_e_l _J_. _L_e_f_f_l_e_r_, _W_i_l_l_i_a_m _N_. _J_o_y_, _R_o_b_e_r_t _S_. _F_a_b_r_y_, _a_n_d _M_i_c_h_a_e_l _J_. _K_a_r_e_l_s
                 Computer Systems Research Group
                    Computer Science Division
    Department of Electrical Engineering and Computer Science
               University of California, Berkeley
                       Berkeley, CA  94720


                            _A_B_S_T_R_A_C_T

          This  report  describes  the internal structure of
     the networking facilities developed for the 4.4BSD ver-
     sion  of  the  UNIX*  operating  system for the VAX[+].
     These  facilities are based on several central abstrac-
     tions which structure the external (user) view of  net-
     work  communication  as  well  as the internal (system)
     implementation.

          The report documents the internal structure of the
     networking  system.   The ``Berkeley Software Architec-
     ture  Manual,  4.4BSD  Edition''  (PSD:5)  provides   a
     description  of  the  user  interface to the networking
     facilities.


     Revised June 10, 1993

















-----------
* UNIX is a trademark of Bell Laboratories.
[+] DEC, VAX, DECnet, and  UNIBUS  are  trademarks  of
Digital Equipment Corporation.









SMM:18-2                          Networking Implementation Notes


                        TTAABBLLEE OOFF CCOONNTTEENNTTSS


11..  IInnttrroodduuccttiioonn

22..  OOvveerrvviieeww

33..  GGooaallss

44..  IInntteerrnnaall aaddddrreessss rreepprreesseennttaattiioonn

55..  MMeemmoorryy mmaannaaggeemmeenntt

66..  IInntteerrnnaall llaayyeerriinngg
6.1.    Socket layer
6.1.1.    Socket state
6.1.2.    Socket data queues
6.1.3.    Socket connection queuing
6.2.    Protocol layer(s)
6.3.    Network-interface layer
6.3.1.    UNIBUS interfaces

77..  SSoocckkeett//pprroottooccooll iinntteerrffaaccee

88..  PPrroottooccooll//pprroottooccooll iinntteerrffaaccee
8.1.     pr_output
8.2.     pr_input
8.3.     pr_ctlinput
8.4.     pr_ctloutput

99..  PPrroottooccooll//nneettwwoorrkk--iinntteerrffaaccee iinntteerrffaaccee
9.1.     Packet transmission
9.2.     Packet reception

1100.. GGaatteewwaayyss aanndd rroouuttiinngg iissssuueess
10.1.     Routing tables
10.2.     Routing table interface
10.3.     User level routing policies

1111.. RRaaww ssoocckkeettss
11.1.     Control blocks
11.2.     Input processing
11.3.     Output processing

1122.. BBuuffffeerriinngg aanndd ccoonnggeessttiioonn ccoonnttrrooll
12.1.     Memory management
12.2.     Protocol buffering policies
12.3.     Queue limiting
12.4.     Packet forwarding

1133.. OOuutt ooff bbaanndd ddaattaa

1144.. TTrraaiilleerr pprroottooccoollss










Networking Implementation Notes                          SMM:18-3


AAcckknnoowwlleeddggeemmeennttss

RReeffeerreenncceess




























































SMM:18-4                          Networking Implementation Notes


11..  IInnttrroodduuccttiioonn

     This report describes the internal structure  of  facilities
added  to the 4.2BSD version of the UNIX operating system for the
VAX, as modified in the 4.4BSD release.   The  system  facilities
provide  a  uniform user interface to networking within UNIX.  In
addition, the implementation introduces a structure  for  network
communications which may be used by system implementors in adding
new networking facilities.  The internal structure is not visible
to  the user, rather it is intended to aid implementors of commu-
nication protocols and network services by providing a  framework
which  promotes code sharing and minimizes implementation effort.

     The reader is expected to be familiar with the C programming
language and system interface, as described in the _B_e_r_k_e_l_e_y _S_o_f_t_-
_w_a_r_e _A_r_c_h_i_t_e_c_t_u_r_e _M_a_n_u_a_l_, _4_._4_B_S_D _E_d_i_t_i_o_n [Joy86].   Basic  under-
standing  of  network  communication  concepts  is assumed; where
required any additional ideas are introduced.

     The remainder of this document provides a description of the
system  internals,  avoiding, when possible, those portions which
are used only by the interprocess communication facilities.

22..  OOvveerrvviieeww

     If we consider the  International  Standards  Organization's
(ISO) Open System Interconnection (OSI) model of network communi-
cation  [ISO81]   [Zimmermann80],   the   networking   facilities
described  here  correspond  to  a  portion  of the session layer
(layer 3) and all of the transport and network layers  (layers  2
and 1, respectively).

     The network layer provides possibly imperfect data transport
services with minimal addressing structure.  Addressing  at  this
level is normally host to host, with implicit or explicit routing
optionally supported by the communicating agents.

     At the transport layer the  notions  of  reliable  transfer,
data  sequencing,  flow  control, and service addressing are nor-
mally included.   Reliability  is  usually  managed  by  explicit
acknowledgement  of  data  delivered.   Failure  to acknowledge a
transfer results in retransmission of the data.   Sequencing  may
be handled by tagging each message handed to the network layer by
a _s_e_q_u_e_n_c_e _n_u_m_b_e_r and maintaining state at the endpoints of  com-
munication  to  use  received sequence numbers in reordering data
which arrives out of order.

     The session layer facilities may provide forms of addressing
which  are  mapped  into formats required by the transport layer,
service authentication and client authentication,  etc.   Various
systems also provide services such as data encryption and address
and protocol translation.











Networking Implementation Notes                          SMM:18-5


     The following sections begin by describing some of the  com-
mon data structures and utility routines, then examine the inter-
nal layering.  The contents of each layer and its  interface  are
considered.   Certain  of the interfaces are protocol implementa-
tion specific.  For these cases examples have been drawn from the
Internet  [Cerf78] protocol family.  Later sections cover routing
issues, the design of the raw socket interface and other  miscel-
laneous topics.

33..  GGooaallss

     The networking system was designed with the goal of support-
ing multiple  _p_r_o_t_o_c_o_l  _f_a_m_i_l_i_e_s  and  addressing  styles.   This
required  information  to be ``hidden'' in common data structures
which could be manipulated by all the pieces of the  system,  but
which  required interpretation only by the protocols which ``con-
trolled'' it.  The system described here attempts to minimize the
use  of shared data structures to those kept by a suite of proto-
cols (a _p_r_o_t_o_c_o_l _f_a_m_i_l_y), and those used for  rendezvous  between
``synchronous'' and ``asynchronous'' portions of the system (e.g.
queues of data packets are filled at interrupt time  and  emptied
based on user requests).

     A major goal of the system was to provide a framework within
which new protocols and hardware could be  easily  be  supported.
To  this  end, a great deal of effort has been extended to create
utility routines which hide many of the more complex and/or hard-
ware dependent chores of networking.  Later sections describe the
utility routines and the underlying data structures they  manipu-
late.

44..  IInntteerrnnaall aaddddrreessss rreepprreesseennttaattiioonn

     Common  to  all  portions  of the system are two data struc-
tures.  These structures are used to represent addresses and var-
ious  data  objects.   Addresses, internally are described by the
_s_o_c_k_a_d_d_r structure,

     struct sockaddr {
            short     sa_family;           /* data format identifier */
            char      sa_data[14];         /* address */
     };

All addresses belong to one or more _a_d_d_r_e_s_s _f_a_m_i_l_i_e_s which define
their  format  and interpretation.  The _s_a___f_a_m_i_l_y field indicates
the address family to which the address belongs, and the  _s_a___d_a_t_a
field  contains  the  actual  data  value.   The size of the data
field, 14 bytes, was selected based on a study of current address
formats.*  Specific address formats use private structure defini-
tions that define the format  of  the  data  field.   The  system
interface  supports  larger address structures, although address-
family-independent support facilities, for  example  routing  and
raw socket interfaces, provide only 14 bytes for address storage.
Protocols that do not use those facilities (e.g, the current Unix









SMM:18-6                          Networking Implementation Notes


domain) may use larger data areas.

55..  MMeemmoorryy mmaannaaggeemmeenntt

     A single mechanism is used for data storage: memory buffers,
or _m_b_u_f's.  An mbuf is a structure of the form:

     struct mbuf {
            struct    mbuf *m_next;        /* next buffer in chain */
            u_long    m_off;               /* offset of data */
            short     m_len;               /* amount of data in this mbuf */
            short     m_type;              /* mbuf type (accounting) */
            u_char    m_dat[MLEN];         /* data storage */
            struct    mbuf *m_act;         /* link in higher-level mbuf list */
     };

The _m___n_e_x_t field is used to chain mbufs together on linked lists,
while the _m___a_c_t field allows lists of mbuf chains to  be  accumu-
lated.   By  convention, the mbufs common to a single object (for
example, a packet) are chained together with  the  _m___n_e_x_t  field,
while  groups of objects are linked via the _m___a_c_t field (possibly
when in a queue).

     Each mbuf has a small data  area  for  storing  information,
_m___d_a_t.   The  _m___l_e_n field indicates the amount of data, while the
_m___o_f_f field is an offset to the beginning of the  data  from  the
base  of the mbuf.  Thus, for example, the macro _m_t_o_d, which con-
verts a pointer to an mbuf to a pointer to the data stored in the
mbuf, has the form

     #define mtod(_x,_t)         ((_t)((int)(_x) + (_x)->m_off))

(note  the  _t parameter, a C type cast, which is used to cast the
resultant pointer for proper assignment).

     In addition to storing data  directly  in  the  mbuf's  data
area,  data of page size may be also be stored in a separate area
of memory.  The mbuf utility routines maintain a  pool  of  pages
for  this  purpose  and  manipulate  a  private page map for such
pages.  An mbuf with an external data area may be  recognized  by
the  larger  offset  to  the data area; this is formalized by the
macro M_HASCL(_m), which is true if the mbuf whose  address  is  _m
has  an  external  page cluster.  An array of reference counts on
pages is also maintained so that copies  of  pages  may  be  made
without  core  to  core  copying   (copies  are created simply by
duplicating the reference to the data and incrementing the  asso-
ciated  reference counts for the pages).  Separate data pages are
currently used only when copying data from a  user  process  into
the  kernel,  and  when  bringing  data in at the hardware level.
Routines which manipulate mbufs are not  normally  aware  whether
data  is stored directly in the mbuf data array, or if it is kept
-----------
*  Later  versions  of the system may support variable
length addresses.









Networking Implementation Notes                          SMM:18-7


in separate pages.

     The following may be used to allocate and free mbufs:

m = m_get(wait, type);
MGET(m, wait, type);

     The subroutine _m___g_e_t and the macro  _M_G_E_T  each  allocate  an
     mbuf, placing its address in _m.  The argument _w_a_i_t is either
     M_WAIT or M_DONTWAIT according to whether allocation  should
     block  or  fail if no mbuf is available.  The _t_y_p_e is one of
     the predefined mbuf types for  use  in  accounting  of  mbuf
     allocation.

MCLGET(m);
     This  macro attempts to allocate an mbuf page cluster to as-
     sociate with the mbuf _m.  If successful, the length  of  the
     mbuf is set to CLSIZE, the size of the page cluster.

n = m_free(m);
MFREE(m,n);

     The  routine  _m___f_r_e_e  and the macro _M_F_R_E_E each free a single
     mbuf, _m, and any associated external storage area, placing a
     pointer  to  its successor in the chain it heads, if any, in
     _n.

m_freem(m);
     This routine frees an mbuf chain headed by _m.

     The following utility routines are available for  manipulat-
ing mbuf chains:

m = m_copy(m0, off, len);
     The  _m___c_o_p_y routine create a copy of all, or part, of a list
     of the mbufs in _m_0.  _L_e_n bytes of data, starting  _o_f_f  bytes
     from  the  front  of the chain, are copied.  Where possible,
     reference counts on pages are used instead of core  to  core
     copies.   The  original  mbuf chain must have at least _o_f_f +
     _l_e_n bytes of data.  If _l_e_n is specified  as  M_COPYALL,  all
     the data present, offset as before, is copied.

m_cat(m, n);
     The  mbuf chain, _n, is appended to the end of _m.  Where pos-
     sible, compaction is performed.

m_adj(m, diff);
     The mbuf chain, _m is adjusted in size  by  _d_i_f_f  bytes.   If
     _d_i_f_f is non-negative, _d_i_f_f bytes are shaved off the front of
     the mbuf chain.  If _d_i_f_f is negative, the alteration is per-
     formed  from  back  to front.  No space is reclaimed in this
     operation; alterations  are  accomplished  by  changing  the
     _m___l_e_n and _m___o_f_f fields of mbufs.










SMM:18-8                          Networking Implementation Notes


m = m_pullup(m0, size);
     After a successful call to _m___p_u_l_l_u_p, the mbuf at the head of
     the returned list, _m, is guaranteed to have  at  least  _s_i_z_e
     bytes  of  data in contiguous memory within the data area of
     the mbuf (allowing access via a pointer, obtained using  the
     _m_t_o_d  macro,  and  allowing  the  mbuf  to be located from a
     pointer to the data area using _d_t_o_m, defined below).  If the
     original data was less than _s_i_z_e bytes long, _l_e_n was greater
     than the size of an mbuf data area (112 bytes), or  required
     resources  were  unavailable,  _m  is 0 and the original mbuf
     chain is deallocated.

     This routine is particularly useful  when  verifying  packet
     header  lengths  on  reception.  For example, if a packet is
     received and only 8 of the necessary 16 bytes required for a
     valid  packet  header are present at the head of the list of
     mbufs representing the packet, the remaining 8 bytes may  be
     ``pulled  up''  with  a  single  _m___p_u_l_l_u_p call.  If the call
     fails the invalid packet will have been discarded.

     By insuring that mbufs always reside on 128 byte boundaries,
it  is  always possible to locate the mbuf associated with a data
area by masking off the low bits of the  virtual  address.   This
allows  modules  to  store data structures in mbufs and pass them
around without concern for locating the  original  mbuf  when  it
comes time to free the structure.  Note that this works only with
objects stored in the internal data buffer of the mbuf.  The _d_t_o_m
macro  is used to convert a pointer into an mbuf's data area to a
pointer to the mbuf,

     #define   dtom(x)   ((struct mbuf *)((int)x & ~(MSIZE-1)))


     Mbufs are used for  dynamically  allocated  data  structures
such as sockets as well as memory allocated for packets and head-
ers.  Statistics are maintained on mbuf usage and can  be  viewed
by users using the _n_e_t_s_t_a_t(1) program.

66..  IInntteerrnnaall llaayyeerriinngg

     The internal structure of the network system is divided into
three layers.  These layers correspond to the  services  provided
by  the  socket  abstraction, those provided by the communication
protocols, and those provided by the  hardware  interfaces.   The
communication  protocols  are  normally  layered into two or more
individual  cooperating  layers,  though  they  are  collectively
viewed  in  the system as one layer providing services supportive
of the appropriate socket abstraction.

     The following sections describe the properties of each layer
in the system and the interfaces to which each must conform.












Networking Implementation Notes                          SMM:18-9


66..11..  SSoocckkeett llaayyeerr

     The  socket  layer deals with the interprocess communication
facilities provided by the system.  A socket is  a  bidirectional
endpoint  of communication which is ``typed'' by the semantics of
communication it supports.  The system  calls  described  in  the
_B_e_r_k_e_l_e_y _S_o_f_t_w_a_r_e _A_r_c_h_i_t_e_c_t_u_r_e _M_a_n_u_a_l [Joy86] are used to manipu-
late sockets.

     A socket consists of the following data structure:

     struct socket {
            short     so_type;             /* generic type */
            short     so_options;          /* from socket call */
            short     so_linger;           /* time to linger while closing */
            short     so_state;            /* internal state flags */
            caddr_t   so_pcb;              /* protocol control block */
            struct    protosw *so_proto;   /* protocol handle */
            struct    socket *so_head;     /* back pointer to accept socket */
            struct    socket *so_q0;       /* queue of partial connections */
            short     so_q0len;            /* partials on so_q0 */
            struct    socket *so_q;        /* queue of incoming connections */
            short     so_qlen;             /* number of connections on so_q */
            short     so_qlimit;           /* max number queued connections */
            struct    sockbuf so_rcv;      /* receive queue */
            struct    sockbuf so_snd;      /* send queue */
            short     so_timeo;            /* connection timeout */
            u_short   so_error;            /* error affecting connection */
            u_short   so_oobmark;          /* chars to oob mark */
            short     so_pgrp;             /* pgrp for signals */
     };


     Each socket contains two data queues, _s_o___r_c_v and _s_o___s_n_d, and
a  pointer  to  routines  which provide supporting services.  The
type of the socket, _s_o___t_y_p_e is defined at  socket  creation  time
and  used  in  selecting  those services which are appropriate to
support it.  The supporting protocol is selected at  socket  cre-
ation  time  and  recorded in the socket data structure for later
use.  Protocols are defined by a table of procedures, the _p_r_o_t_o_s_w
structure, which will be described in detail later.  A pointer to
a  protocol-specific  data  structure,  the  ``protocol   control
block,'' is also present in the socket structure.  Protocols con-
trol this data structure, which normally includes a back  pointer
to  the parent socket structure to allow easy lookup when return-
ing information to a user (for example, placing an  error  number
in  the  _s_o___e_r_r_o_r field).  The other entries in the socket struc-
ture are used in queuing  connection  requests,  validating  user
requests,  storing socket characteristics (e.g.  options supplied
at the time a socket is  created),  and  maintaining  a  socket's
state.

     Processes ``rendezvous at a socket'' in many instances.  For
instance, when a process wishes to extract data from  a  socket's









SMM:18-10                         Networking Implementation Notes


receive  queue  and it is empty, or lacks sufficient data to sat-
isfy the request, the process blocks, supplying  the  address  of
the  receive  queue  as a ``wait channel' to be used in notifica-
tion.  When data arrives for the process and  is  placed  in  the
socket's  queue, the blocked process is identified by the fact it
is waiting ``on the queue.''

66..11..11..  SSoocckkeett ssttaattee

     A socket's state is defined from the following:

     #define SS_NOFDREF            0x001     /* no file table ref any more */
     #define SS_ISCONNECTED        0x002     /* socket connected to a peer */
     #define SS_ISCONNECTING       0x004     /* in process of connecting to peer */
     #define SS_ISDISCONNECTING    0x008     /* in process of disconnecting */
     #define SS_CANTSENDMORE       0x010     /* can't send more data to peer */
     #define SS_CANTRCVMORE        0x020     /* can't receive more data from peer */
     #define SS_RCVATMARK          0x040     /* at mark on input */

     #define SS_PRIV               0x080     /* privileged */
     #define SS_NBIO               0x100     /* non-blocking ops */
     #define SS_ASYNC              0x200     /* async i/o notify */


     The state of a socket is manipulated both by  the  protocols
and  the  user (through system calls).  When a socket is created,
the state is defined based on the type of socket.  It may  change
as  control  actions are performed, for example connection estab-
lishment.  It may also change according to the type of input/out-
put  the user wishes to perform, as indicated by options set with
_f_c_n_t_l.  ``Non-blocking'' I/O  implies that a process should never
be  blocked  to  await  resources.  Instead, any call which would
block returns prematurely with the error EWOULDBLOCK, or the ser-
vice  request may be partially fulfilled, e.g. a request for more
data than is present.

     If a  process  requested  ``asynchronous''  notification  of
events  related  to the socket, the SIGIO signal is posted to the
process when such events occur.  An event  is  a  change  in  the
socket's  state; examples of such occurrences are: space becoming
available in the send queue, new data available  in  the  receive
queue, connection establishment or disestablishment, etc.

     A  socket  may be marked ``privileged'' if it was created by
the super-user.  Only privileged sockets may  bind  addresses  in
privileged portions of an address space or use ``raw'' sockets to
access lower levels of the network.

66..11..22..  SSoocckkeett ddaattaa qquueeuueess

     A socket's data queue contains a pointer to the data  stored
in  the  queue and other entries related to the management of the
data.  The following structure defines a data queue:










Networking Implementation Notes                         SMM:18-11


     struct sockbuf {
            u_short   sb_cc;               /* actual chars in buffer */
            u_short   sb_hiwat;            /* max actual char count */
            u_short   sb_mbcnt;            /* chars of mbufs used */
            u_short   sb_mbmax;            /* max chars of mbufs to use */
            u_short   sb_lowat;            /* low water mark */
            short     sb_timeo;            /* timeout */
            struct    mbuf *sb_mb;         /* the mbuf chain */
            struct    proc *sb_sel;        /* process selecting read/write */
            short     sb_flags;            /* flags, see below */
     };


     Data is stored in a queue as a chain of mbufs.   The  actual
count  of data characters as well as high and low water marks are
used by the protocols in  controlling  the  flow  of  data.   The
amount  of  buffer space (characters of mbufs and associated data
pages) is also recorded along with the limit  on  buffer  alloca-
tion.   The  socket  routines  cooperate in implementing the flow
control policy by blocking a process when  it  requests  to  send
data  and  the  high  water  mark  has  been  reached, or when it
requests to receive data and less than  the  low  water  mark  is
present (assuming non-blocking I/O has not been specified).*

     When  a  socket  is   created,   the   supporting   protocol
``reserves'' space for the send and receive queues of the socket.
The limit on buffer allocation is set somewhat  higher  than  the
limit on data characters to account for the granularity of buffer
allocation.  The actual storage associated with  a  socket  queue
may  fluctuate during a socket's lifetime, but it is assumed that
this reservation will always allow a protocol to  acquire  enough
memory to satisfy the high water marks.

     The  timeout and select values are manipulated by the socket
routines in implementing various  portions  of  the  interprocess
communications facilities and will not be described here.

     Data  queued  at  a  socket  is stored in one of two styles.
Stream-oriented sockets queue data with no addresses, headers  or
record  boundaries.   The  data  are  in mbufs linked through the
_m___n_e_x_t field.  Buffers containing access rights  may  be  present
within  the  chain if the underlying protocol supports passage of
access rights.  Record-oriented sockets, including datagram sock-
ets, queue data as a list of packets; the sections of packets are
distinguished by the types of the  mbufs  containing  them.   The
mbufs  which  comprise  a  record  are  linked through the _m___n_e_x_t
field; records are linked from the _m___a_c_t field of the first  mbuf
of  one packet to the first mbuf of the next.  Each packet begins
with an mbuf containing the ``from'' address if the protocol pro-
vides  it, then any buffers containing access rights, and finally
any buffers containing data.  If a record contains  no  data,  no
-----------
* The low-water mark is always presumed to be 0 in the
current implementation.









SMM:18-12                         Networking Implementation Notes


data  buffers  are  required  unless  neither  address nor access
rights are present.

     A socket queue has a number of flags used  in  synchronizing
access to the data and in acquiring resources:

     #define SB_LOCK           0x01   /* lock on data queue (so_rcv only) */
     #define SB_WANT           0x02   /* someone is waiting to lock */
     #define SB_WAIT           0x04   /* someone is waiting for data/space */
     #define SB_SEL            0x08   /* buffer is selected */
     #define SB_COLL           0x10   /* collision selecting */

The  last two flags are manipulated by the system in implementing
the select mechanism.

66..11..33..  SSoocckkeett ccoonnnneeccttiioonn qquueeuuiinngg

     In  dealing   with   connection   oriented   sockets   (e.g.
SOCK_STREAM)  the  two  ends are considered distinct.  One end is
termed _a_c_t_i_v_e, and generates connection requests.  The other  end
is called _p_a_s_s_i_v_e and accepts connection requests.

     From the passive side, a socket is marked with SO_ACCEPTCONN
when a _l_i_s_t_e_n call is made, creating two queues of sockets: _s_o___q_0
for connections in progress and _s_o___q for connections already made
and awaiting user acceptance.  As a protocol is preparing  incom-
ing connections, it creates a socket structure queued on _s_o___q_0 by
calling the routine _s_o_n_e_w_c_o_n_n().  When the connection  is  estab-
lished,  the socket structure is then transferred to _s_o___q, making
it available for an _a_c_c_e_p_t.

     If an SO_ACCEPTCONN socket is closed with sockets on  either
_s_o___q_0  or  _s_o___q,  these sockets are dropped, with notification to
the peers as appropriate.

66..22..  PPrroottooccooll llaayyeerr((ss))

     Each socket is created in  a  communications  domain,  which
usually implies both an addressing structure (address family) and
a set of protocols which implement various  socket  types  within
the domain (protocol family).  Each domain is defined by the fol-
lowing structure:

     struct       domain {
          int     dom_family;             /* PF_xxx */
          char    *dom_name;
          int     (*dom_init)();          /* initialize domain data structures */
          int     (*dom_externalize)();   /* externalize access rights */
          int     (*dom_dispose)();       /* dispose of internalized rights */
          struct  protosw *dom_protosw, *dom_protoswNPROTOSW;
          struct  domain *dom_next;
     };











Networking Implementation Notes                         SMM:18-13


     At boot time, each domain  configured  into  the  kernel  is
added  to  a linked list of domain.  The initialization procedure
of each domain is then  called.   After  that  time,  the  domain
structure is used to locate protocols within the protocol family.
It may also contain procedure references for  externalization  of
access  rights at the receiving socket and the disposal of access
rights that are not received.

     Protocols are described by a set of entry points and certain
socket-visible  characteristics, some of which are used in decid-
ing which socket type(s) they may support.

     An entry in the ``protocol switch'' table  exists  for  each
protocol module configured into the system.  It has the following
form:

     struct protosw {
          short   pr_type;              /* socket type used for */
          struct  domain *pr_domain;    /* domain protocol a member of */
          short   pr_protocol;          /* protocol number */
          short   pr_flags;             /* socket visible attributes */
     /* protocol-protocol hooks */
          int     (*pr_input)();        /* input to protocol (from below) */
          int     (*pr_output)();       /* output to protocol (from above) */
          int     (*pr_ctlinput)();     /* control input (from below) */
          int     (*pr_ctloutput)();    /* control output (from above) */
     /* user-protocol hook */
          int     (*pr_usrreq)();       /* user request */
     /* utility hooks */
          int     (*pr_init)();         /* initialization routine */
          int     (*pr_fasttimo)();     /* fast timeout (200ms) */
          int     (*pr_slowtimo)();     /* slow timeout (500ms) */
          int     (*pr_drain)();        /* flush any excess space possible */
     };


     A protocol is called through the _p_r___i_n_i_t  entry  before  any
other.   Thereafter  it  is called every 200 milliseconds through
the _p_r___f_a_s_t_t_i_m_o entry and  every  500  milliseconds  through  the
_p_r___s_l_o_w_t_i_m_o  for  timer  based actions.  The system will call the
_p_r___d_r_a_i_n entry if it is low on space and this should  throw  away
any non-critical data.

     Protocols  pass  data  between themselves as chains of mbufs
using the _p_r___i_n_p_u_t and _p_r___o_u_t_p_u_t routines.  _P_r___i_n_p_u_t passes  data
up  (towards  the user) and _p_r___o_u_t_p_u_t passes it down (towards the
network); control information passes up and down  on  _p_r___c_t_l_i_n_p_u_t
and  _p_r___c_t_l_o_u_t_p_u_t.   The  protocol  is  responsible for the space
occupied by any of the arguments to these entries and must either
pass  it  onward  or dispose of it.  (On output, the lowest level
reached must free buffers storing the arguments;  on  input,  the
highest level is responsible for freeing buffers.)











SMM:18-14                         Networking Implementation Notes


     The  _p_r___u_s_r_r_e_q  routine  interfaces  protocols to the socket
code and is described below.

     The _p_r___f_l_a_g_s field is constructed from the following values:

     #define PR_ATOMIC         0x01    /* exchange atomic messages only */
     #define PR_ADDR           0x02    /* addresses given with messages */
     #define PR_CONNREQUIRED   0x04    /* connection required by protocol */
     #define PR_WANTRCVD       0x08    /* want PRU_RCVD calls */
     #define PR_RIGHTS         0x10    /* passes capabilities */

Protocols  which are connection-based specify the PR_CONNREQUIRED
flag so that the socket routines will never attempt to send  data
before  a  connection  has  been established.  If the PR_WANTRCVD
flag is set, the socket routines will notify  the  protocol  when
the  user has removed data from the socket's receive queue.  This
allows the protocol to implement acknowledgement on user receipt,
and  also  update  windowing  information  based on the amount of
space available in the receive queue.  The  PR_ADDR  field  indi-
cates  that any data placed in the socket's receive queue will be
preceded by the address of the sender.  The PR_ATOMIC flag speci-
fies  that  each _u_s_e_r request to send data must be performed in a
single _p_r_o_t_o_c_o_l send request; it is the protocol's responsibility
to  maintain record boundaries on data to be sent.  The PR_RIGHTS
flag indicates that the protocol supports the passing of capabil-
ities;   this is currently used only by the protocols in the UNIX
protocol family.

     When a socket is created, the socket routines scan the  pro-
tocol table for the domain looking for an appropriate protocol to
support the type of socket being created.  The _p_r___t_y_p_e field con-
tains  one of the possible socket types (e.g. SOCK_STREAM), while
the _p_r___d_o_m_a_i_n is a back pointer to  the  domain  structure.   The
_p_r___p_r_o_t_o_c_o_l  field  contains the protocol number of the protocol,
normally a well-known value.

66..33..  NNeettwwoorrkk--iinntteerrffaaccee llaayyeerr

     Each network-interface configured into a  system  defines  a
path  through which packets may be sent and received.  Normally a
hardware device is associated with this interface,  though  there
is no requirement for this (for example, all systems have a soft-
ware ``loopback'' interface used for  debugging  and  performance
analysis).   In  addition to manipulating the hardware device, an
interface module is responsible for encapsulation and  decapsula-
tion  of  any link-layer header information required to deliver a
message to its destination.  The selection of which interface  to
use  in delivering packets is a routing decision carried out at a
higher level than the network-interface layer.  An interface  may
have  addresses  in one or more address families.  The address is
set at boot time using an _i_o_c_t_l on a socket  in  the  appropriate
domain;  this  operation  is  implemented by the protocol family,
after verifying the operation through the device _i_o_c_t_l entry.










Networking Implementation Notes                         SMM:18-15


     An interface is defined by the following structure,

     struct ifnet {
          char     *if_name;              /* name, e.g. ``en'' or ``lo'' */
          short    if_unit;               /* sub-unit for lower level driver */
          short    if_mtu;                /* maximum transmission unit */
          short    if_flags;              /* up/down, broadcast, etc. */
          short    if_timer;              /* time 'til if_watchdog called */
          struct   ifaddr *if_addrlist;   /* list of addresses of interface */
          struct   ifqueue if_snd;        /* output queue */
          int      (*if_init)();          /* init routine */
          int      (*if_output)();        /* output routine */
          int      (*if_ioctl)();         /* ioctl routine */
          int      (*if_reset)();         /* bus reset routine */
          int      (*if_watchdog)();      /* timer routine */
          int      if_ipackets;           /* packets received on interface */
          int      if_ierrors;            /* input errors on interface */
          int      if_opackets;           /* packets sent on interface */
          int      if_oerrors;            /* output errors on interface */
          int      if_collisions;         /* collisions on csma interfaces */
          struct   ifnet *if_next;
     };

Each interface address has the following form:

     struct ifaddr {
             struct   sockaddr ifa_addr;   /* address of interface */
             union {
                      struct   sockaddr ifu_broadaddr;
                      struct   sockaddr ifu_dstaddr;
             } ifa_ifu;
             struct   ifnet *ifa_ifp;      /* back-pointer to interface */
             struct   ifaddr *ifa_next;    /* next address for interface */
     };
     #define ifa_broadaddr   ifa_ifu.ifu_broadaddr        /* broadcast address */
     #define ifa_dstaddr     ifa_ifu.ifu_dstaddr          /* other end of p-to-p link */

The protocol generally maintains this  structure  as  part  of  a
larger structure containing additional information concerning the
address.

     Each interface has a send queue and routines used  for  ini-
tialization,  _i_f___i_n_i_t,  and  output, _i_f___o_u_t_p_u_t.  If the interface
resides on a system bus, the  routine  _i_f___r_e_s_e_t  will  be  called
after  a  bus  reset  has  been performed.  An interface may also
specify a timer routine, _i_f___w_a_t_c_h_d_o_g; if _i_f___t_i_m_e_r is non-zero, it
is  decremented  once  per second until it reaches zero, at which
time the watchdog routine is called.

     The state of an interface and  certain  characteristics  are
stored in the _i_f___f_l_a_g_s field.  The following values are possible:












SMM:18-16                         Networking Implementation Notes


     #define IFF_UP                   0x0001/* interface is up */
     #define IFF_BROADCAST     0x0002 /* broadcast address valid */
     #define IFF_DEBUG         0x0004 /* turn on debugging */
     #define IFF_LOOPBACK      0x0008 /* is a loopback net */
     #define IFF_POINTOPOINT   0x0010 /* interface is point-to-point link */
     #define IFF_RUNNING       0x0040 /* resources allocated */
     #define IFF_NOARP         0x0080 /* no address resolution protocol */
     #define IFF_PROMISC       0x0100 /* receive all packets */
     #define IFF_ALLMULTI      0x0200 /* receive all multicast packets */
     #define IFF_OACTIVE       0x0400 /* transmission in progress */
     #define IFF_SIMPLEX       0x0800 /* can't hear own transmissions */
     #define IFF_LINK0         0x1000 /* per link layer defined bit */
     #define IFF_LINK1         0x2000 /* per link layer defined bit */
     #define IFF_LINK2         0x4000 /* per link layer defined bit */
     #define IFF_MULTICAST     0x8000 /* supports multicast */

If the interface is connected to a network which supports  trans-
mission  of _b_r_o_a_d_c_a_s_t packets, the IFF_BROADCAST flag will be set
and the _i_f_a___b_r_o_a_d_a_d_d_r field will contain the address to  be  used
in  sending or accepting a broadcast packet.  If the interface is
associated with a point-to-point hardware link  (for  example,  a
DEC DMR-11), the IFF_POINTOPOINT flag will be set and _i_f_a___d_s_t_a_d_d_r
will contain the address of the host on the  other  side  of  the
connection.   These addresses and the local address of the inter-
face, _i_f___a_d_d_r, are  used  in  filtering  incoming  packets.   The
interface   sets   IFF_RUNNING  after  it  has  allocated  system
resources and posted an initial read on the  device  it  manages.
This state bit is used to avoid multiple allocation requests when
an interface's address is changed.  The IFF_NOARP flag  indicates
the  interface  should not use an ``address resolution protocol''
in mapping internetwork addresses to local network addresses.

     Various statistics are also stored in the  interface  struc-
ture.  These may be viewed by users using the _n_e_t_s_t_a_t(1) program.

     The interface address and flags may be set with the  SIOCSI-
FADDR  and SIOCSIFFLAGS _i_o_c_t_ls.  SIOCSIFADDR is used initially to
define each interface's address; SIOGSIFFLAGS can be used to mark
an  interface  down and perform site-specific configuration.  The
destination address of a point-to-point link is  set  with  SIOC-
SIFDSTADDR.   Corresponding  operations exist to read each value.
Protocol families may also support operations to set and read the
broadcast  address.  In addition, the SIOCGIFCONF _i_o_c_t_l retrieves
a list of interface names and addresses for  all  interfaces  and
protocols on the host.

66..33..11..  UUNNIIBBUUSS iinntteerrffaacceess

     All  hardware  related  interfaces  currently  reside on the
UNIBUS.  Consequently a common set of utility routines for  deal-
ing  with  the  UNIBUS has been developed.  Each UNIBUS interface
uses a structure of the following form:











Networking Implementation Notes                         SMM:18-17


     struct  ifubinfo {
             short       iff_uban;                      /* uba number */
             short       iff_hlen;                      /* local net header length */
             struct      uba_regs *iff_uba;             /* uba regs, in vm */
             short       iff_flags;                     /* used during uballoc's */
     };

Additional structures are associated with each receive and trans-
mit buffer, normally one each per interface; for read,

     struct  ifrw {
             caddr_t     ifrw_addr;                     /* virt addr of header */
             short       ifrw_bdp;                      /* unibus bdp */
             short       ifrw_flags;                    /* type, etc. */
     #define IFRW_W      0x01                           /* is a transmit buffer */
             int         ifrw_info;                     /* value from ubaalloc */
             int         ifrw_proto;                    /* map register prototype */
             struct      pte *ifrw_mr;                  /* base of map registers */
     };

and for write,

     struct  ifxmt {
             struct      ifrw ifrw;
             caddr_t     ifw_base;                      /* virt addr of buffer */
             struct      pte ifw_wmap[IF_MAXNUBAMR];    /* base pages for output */
             struct      mbuf *ifw_xtofree;             /* pages being DMA'd out */
             short       ifw_xswapd;                    /* mask of clusters swapped */
             short       ifw_nmr;                       /* number of entries in wmap */
     };
     #define ifw_addr    ifrw.ifrw_addr
     #define ifw_bdp     ifrw.ifrw_bdp
     #define ifw_flags   ifrw.ifrw_flags
     #define ifw_info    ifrw.ifrw_info
     #define ifw_proto   ifrw.ifrw_proto
     #define ifw_mr      ifrw.ifrw_mr

One  of  each  of  these  structures is conveniently packaged for
interfaces with single buffers for each direction, as follows:

     struct  ifuba {
             struct      ifubinfo ifu_info;
             struct      ifrw ifu_r;
             struct      ifxmt ifu_xmt;
     };
     #define ifu_uban    ifu_info.iff_uban
     #define ifu_hlen    ifu_info.iff_hlen
     #define ifu_uba     ifu_info.iff_uba
     #define ifu_flags   ifu_info.iff_flags
     #define ifu_w       ifu_xmt.ifrw
     #define ifu_xtofree ifu_xmt.ifw_xtofree












SMM:18-18                         Networking Implementation Notes


     The _i_f___u_b_i_n_f_o structure  contains  the  general  information
needed to characterize the I/O-mapped buffers for the device.  In
addition, there is a structure describing each buffer,  including
UNIBUS  resources held by the interface.  Sufficient memory pages
and bus map registers are allocated to each buffer upon  initial-
ization  according  to the maximum packet size and header length.
The kernel virtual address of the buffer is  held  in  _i_f_r_w___a_d_d_r,
and  the  map  registers  begin  at _i_f_r_w___m_r.  UNIBUS map register
_i_f_r_w___m_r[-1] maps the local network header ending on a page bound-
ary.   UNIBUS  data  paths  are  reserved for read and for write,
given by _i_f_r_w___b_d_p.  The prototype of the map registers  for  read
and for write is saved in _i_f_r_w___p_r_o_t_o.

     When  write  transfers  are  not at least half-full pages on
page boundaries, the data are just copied into the  pages  mapped
on  the  UNIBUS and the transfer is started.  If a write transfer
is at least half a page long and on a page boundary, UNIBUS  page
table  entries  are  swapped to reference the pages, and then the
initial pages are remapped from _i_f_w___w_m_a_p when the  transfer  com-
pletes.   The mbufs containing the mapped pages are placed on the
_i_f_w___x_t_o_f_r_e_e queue to be freed after transmission.

     When read transfers give at least half a page of data to  be
input,  page  frames  are  allocated from a network page list and
traded with the pages already containing the  data,  mapping  the
allocated  pages  to  replace the input pages for the next UNIBUS
data input.

     The following utility routines  are  available  for  use  in
writing   network  interface  drivers;  all  use  the  structures
described above.

if_ubaminit(ifubinfo, uban, hlen, nmr, ifr, nr, ifx, nx);
if_ubainit(ifuba, uban, hlen, nmr);

     _i_f___u_b_a_m_i_n_i_t allocates  resources  on  UNIBUS  adapter  _u_b_a_n,
     storing  the  information  in  the  _i_f_u_b_i_n_f_o, _i_f_r_w and _i_f_x_m_t
     structures referenced.   The  _i_f_r  and  _i_f_x  parameters  are
     pointers to arrays of _i_f_r_w and _i_f_x_m_t structures whose dimen-
     sions are _n_r and _n_x, respectively.  _i_f___u_b_a_i_n_i_t is a simpler,
     backwards-compatible interface used for hardware with single
     buffers of each type.  They are called only at boot time  or
     after   a   UNIBUS   reset.   One  data  path  (buffered  or
     unbuffered, depending on the _i_f_u___f_l_a_g_s field)  is  allocated
     for  each buffer.  The _n_m_r parameter indicates the number of
     UNIBUS mapping registers required to  map  a  maximal  sized
     packet  onto  the UNIBUS, while _h_l_e_n specifies the size of a
     local network header, if any, which should be  mapped  sepa-
     rately  from the data (see the description of trailer proto-
     cols in chapter 14).  Sufficient  UNIBUS  mapping  registers
     and  pages  of  memory are allocated to initialize the input
     data path for an initial read.  For the  output  data  path,
     mapping registers and pages of memory are also allocated and
     mapped onto the  UNIBUS.   The  pages  associated  with  the









Networking Implementation Notes                         SMM:18-19


     output  data  path  are held in reserve in the event a write
     requires  copying  non-page-aligned  data  (see   _i_f___w_u_b_a_p_u_t
     below).   If  _i_f___u_b_a_i_n_i_t is called with memory pages already
     allocated, they will be used instead of allocating new  ones
     (this  normally  occurs  after  a  UNIBUS  reset).   A  1 is
     returned when allocation and initialization are  successful,
     0 otherwise.

m = if_ubaget(ifubinfo, ifr, totlen, off0, ifp);
m = if_rubaget(ifuba, totlen, off0, ifp);

     _i_f___u_b_a_g_e_t and _i_f___r_u_b_a_g_e_t pull input data out of an interface
     receive buffer and into an mbuf chain.  The first  interface
     passes  pointers to the _i_f_u_b_i_n_f_o structure for the interface
     and the _i_f_r_w structure for the receive  buffer;  the  second
     call may be used for single-buffered devices.  _t_o_t_l_e_n speci-
     fies the length of data to be  obtained,  not  counting  the
     local  network  header.  If _o_f_f_0 is non-zero, it indicates a
     byte offset to a trailing local network header which  should
     be copied into a separate mbuf and prepended to the front of
     the resultant mbuf chain.  When the data amount to at  least
     a half a page, the previously mapped data pages are remapped
     into the mbufs and swapped with fresh pages,  thus  avoiding
     any  copy.   The  receiving  interface is recorded as _i_f_p, a
     pointer to an _i_f_n_e_t structure, for the use of the  receiving
     network  protocol.   A 0 return value indicates a failure to
     allocate resources.

if_wubaput(ifubinfo, ifx, m);
if_wubaput(ifuba, m);

     _i_f___u_b_a_p_u_t and _i_f___w_u_b_a_p_u_t map a chain of mbufs onto a network
     interface in preparation for output.  The first interface is
     used by devices with multiple transmit buffers.   The  chain
     includes  any  local network header, which is copied so that
     it resides in the  mapped  and  aligned  I/O  space.   Page-
     aligned  data that are page-aligned in the output buffer are
     mapped to the UNIBUS in place of the normal buffer page, and
     the  corresponding  mbuf  is  placed  on a queue to be freed
     after transmission.  Any other mbufs  which  contained  non-
     page-sized  data  portions  are  copied to the I/O space and
     then freed.  Pages mapped from a previous  output  operation
     (no longer needed) are unmapped.




















SMM:18-20                         Networking Implementation Notes


77..  SSoocckkeett//pprroottooccooll iinntteerrffaaccee

     The interface between the socket routines and the communica-
tion protocols is through the _p_r___u_s_r_r_e_q routine  defined  in  the
protocol switch table.  The following requests to a protocol mod-
ule are possible:

     #define PRU_ATTACH        0      /* attach protocol */
     #define PRU_DETACH        1      /* detach protocol */
     #define PRU_BIND          2      /* bind socket to address */
     #define PRU_LISTEN        3      /* listen for connection */
     #define PRU_CONNECT       4      /* establish connection to peer */
     #define PRU_ACCEPT        5      /* accept connection from peer */
     #define PRU_DISCONNECT    6      /* disconnect from peer */
     #define PRU_SHUTDOWN      7      /* won't send any more data */
     #define PRU_RCVD          8      /* have taken data; more room now */
     #define PRU_SEND          9      /* send this data */
     #define PRU_ABORT         10     /* abort (fast DISCONNECT, DETATCH) */
     #define PRU_CONTROL       11     /* control operations on protocol */
     #define PRU_SENSE         12     /* return status into m */
     #define PRU_RCVOOB        13     /* retrieve out of band data */
     #define PRU_SENDOOB       14     /* send out of band data */
     #define PRU_SOCKADDR      15     /* fetch socket's address */
     #define PRU_PEERADDR      16     /* fetch peer's address */
     #define PRU_CONNECT2      17     /* connect two sockets */
     /* begin for protocols internal use */
     #define PRU_FASTTIMO      18     /* 200ms timeout */
     #define PRU_SLOWTIMO      19     /* 500ms timeout */
     #define PRU_PROTORCV      20     /* receive from below */
     #define PRU_PROTOSEND     21     /* send to below */

A call on the user request routine is of the form,

     error = (*protosw[].pr_usrreq)(so, req, m, addr, rights);
     int error; struct socket *so; int req; struct mbuf *m, *addr, *rights;

The mbuf data chain _m is supplied for output operations  and  for
certain  other  operations  where it is to receive a result.  The
address _a_d_d_r is supplied for address-oriented  requests  such  as
PRU_BIND  and  PRU_CONNECT.   The _r_i_g_h_t_s parameter is an optional
pointer to an mbuf chain containing  user-specified  capabilities
(see  the  _s_e_n_d_m_s_g  and  _r_e_c_v_m_s_g  system calls).  The protocol is
responsible for disposal of the data mbuf chains on output opera-
tions.   A  non-zero return value gives a UNIX error number which
should be passed to higher level software.  The  following  para-
graphs describe each of the requests possible.

PRU_ATTACH
     When a protocol is bound to a socket (with the _s_o_c_k_e_t system
     call) the protocol module is called with this  request.   It
     is the responsibility of the protocol module to allocate any
     resources necessary.  The  ``attach''  request  will  always
     precede any of the other requests, and should not occur more
     than once.









Networking Implementation Notes                         SMM:18-21


PRU_DETACH
     This is the antithesis of the attach request, and is used at
     the time a socket is deleted.  The protocol module may deal-
     locate any resources assigned to the socket.

PRU_BIND
     When a socket is initially created it has no  address  bound
     to  it.   This  request  indicates that an address should be
     bound to an existing socket.  The protocol module must  ver-
     ify  that  the  requested address is valid and available for
     use.

PRU_LISTEN
     The ``listen'' request indicates the user wishes  to  listen
     for  incoming  connection requests on the associated socket.
     The protocol module should perform any state changes  needed
     to  carry  out  this  request  (if  possible).  A ``listen''
     request always precedes a request to accept a connection.

PRU_CONNECT
     The ``connect'' request indicates the user wants to a estab-
     lish  an association.  The _a_d_d_r parameter supplied describes
     the peer to be  connected  to.   The  effect  of  a  connect
     request may vary depending on the protocol.  Virtual circuit
     protocols, such as TCP [Postel81b], use this request to ini-
     tiate  establishment  of  a TCP connection.  Datagram proto-
     cols, such as  UDP  [Postel80],  simply  record  the  peer's
     address  in  a  private data structure and use it to tag all
     outgoing packets.  There are no  restrictions  on  how  many
     times  a  connect request may be used after an attach.  If a
     protocol supports the notion of _m_u_l_t_i_-_c_a_s_t_i_n_g, it is  possi-
     ble  to  use  multiple  connects  to  establish a multi-cast
     group.  Alternatively, an association may  be  broken  by  a
     PRU_DISCONNECT request, and a new association created with a
     subsequent connect request; all without destroying and  cre-
     ating a new socket.

PRU_ACCEPT
     Following a successful PRU_LISTEN request and the arrival of
     one or more connections, this request is  made  to  indicate
     the  user  has accepted the first connection on the queue of
     pending connections.  The protocol module should fill in the
     supplied  address  buffer  with the address of the connected
     party.

PRU_DISCONNECT
     Eliminate an association created with a PRU_CONNECT request.

PRU_SHUTDOWN
     This  call  is  used  to  indicate no more data will be sent
     and/or received (the _a_d_d_r parameter indicates the  direction
     of  the shutdown, as encoded in the _s_o_s_h_u_t_d_o_w_n system call).
     The protocol may, at its  discretion,  deallocate  any  data
     structures related to the shutdown and/or notify a connected









SMM:18-22                         Networking Implementation Notes


     peer of the shutdown.

PRU_RCVD
     This request is made only if the protocol entry in the  pro-
     tocol  switch  table  includes the PR_WANTRCVD flag.  When a
     user removes data from the receive queue this  request  will
     be  sent  to the protocol module.  It may be used to trigger
     acknowledgements, refresh  windowing  information,  initiate
     data transfer, etc.

PRU_SEND
     Each  user  request  to  send data is translated into one or
     more PRU_SEND requests (a protocol may indicate that a  sin-
     gle  user  send  request  must  be  translated into a single
     PRU_SEND request by specifying the  PR_ATOMIC  flag  in  its
     protocol  description).  The data to be sent is presented to
     the protocol as a list of mbufs and an address  is,  option-
     ally,  supplied  in  the  _a_d_d_r  parameter.   The protocol is
     responsible for preserving the data  in  the  socket's  send
     queue if it is not able to send it immediately, or if it may
     need it at some later time (e.g. for retransmission).

PRU_ABORT
     This request indicates an abnormal termination  of  service.
     The protocol should delete any existing association(s).

PRU_CONTROL
     The  ``control'' request is generated when a user performs a
     UNIX _i_o_c_t_l system call on a socket (and  the  ioctl  is  not
     intercepted  by  the  socket routines).  It allows protocol-
     specific operations to be provided outside the scope of  the
     common  socket  interface.   The  _a_d_d_r  parameter contains a
     pointer to a static kernel data area where relevant informa-
     tion  may be obtained or returned.  The _m parameter contains
     the actual _i_o_c_t_l request code (note the non-standard calling
     convention).   The _r_i_g_h_t_s parameter contains a pointer to an
     _i_f_n_e_t structure if the _i_o_c_t_l operation pertains to a partic-
     ular network interface.

PRU_SENSE
     The  ``sense''  request  is generated when the user makes an
     _f_s_t_a_t system call on a socket; it  requests  status  of  the
     associated  socket.   This currently returns a standard _s_t_a_t
     structure.  It typically contains only the optimal  transfer
     size  for  the  connection  (based on buffer size, windowing
     information and maximum packet size).  The _m parameter  con-
     tains  a pointer to a static kernel data area where the sta-
     tus buffer should be placed.

PRU_RCVOOB
     Any  ``out-of-band''  data  presently  available  is  to  be
     returned.  An mbuf is passed to the protocol module, and the
     protocol should either place data in the mbuf or attach  new
     mbufs  to the one supplied if there is insufficient space in









Networking Implementation Notes                         SMM:18-23


     the single mbuf.  An error may be  returned  if  out-of-band
     data  is  not  (yet) available or has already been consumed.
     The _a_d_d_r parameter contains any options such as MSG_PEEK  to
     examine data without consuming it.

PRU_SENDOOB
     Like PRU_SEND, but for out-of-band data.

PRU_SOCKADDR
     The  local address of the socket is returned, if any is cur-
     rently bound to it.  The  address  (with  protocol  specific
     format) is returned in the _a_d_d_r parameter.

PRU_PEERADDR
     The  address of the peer to which the socket is connected is
     returned.  The socket must be in a SS_ISCONNECTED state  for
     this request to be made to the protocol.  The address format
     (protocol specific) is returned in the _a_d_d_r parameter.

PRU_CONNECT2
     The protocol module is supplied two sockets and requested to
     establish  a  connection between the two without binding any
     addresses, if possible.  This call is used  in  implementing
     the _s_o_c_k_e_t_p_a_i_r(2) system call.

     The  following  requests are used internally by the protocol
modules and are never generated by the socket routines.  In  cer-
tain  instances,  they are handed to the _p_r___u_s_r_r_e_q routine solely
for convenience in tracing a protocol's operation (e.g. PRU_SLOW-
TIMO).

PRU_FASTTIMO
     A  ``fast timeout'' has occurred.  This request is made when
     a timeout occurs in the protocol's _p_r___f_a_s_t_i_m_o routine.   The
     _a_d_d_r parameter indicates which timer expired.

PRU_SLOWTIMO
     A  ``slow timeout'' has occurred.  This request is made when
     a timeout occurs in the protocol's _p_r___s_l_o_w_t_i_m_o routine.  The
     _a_d_d_r parameter indicates which timer expired.

PRU_PROTORCV
     This request is used in the protocol-protocol interface, not
     by the routines.  It requests reception of data destined for
     the  protocol  and not the user.  No protocols currently use
     this facility.

PRU_PROTOSEND
     This request allows a protocol to  send  data  destined  for
     another  protocol  module,  not  a user.  The details of how
     data  is  marked  ``addressed  to  protocol''   instead   of
     ``addressed  to user'' are left to the protocol modules.  No
     protocols currently use this facility.










SMM:18-24                         Networking Implementation Notes


88..  PPrroottooccooll//pprroottooccooll iinntteerrffaaccee

     The  interface  between  protocol  modules  is  through  the
_p_r___u_s_r_r_e_q,  _p_r___i_n_p_u_t,  _p_r___o_u_t_p_u_t,  _p_r___c_t_l_i_n_p_u_t,  and _p_r___c_t_l_o_u_t_p_u_t
routines.  The calling conventions for all but the _p_r___u_s_r_r_e_q rou-
tine  are expected to be specific to the protocol modules and are
not guaranteed to be consistent  across  protocol  families.   We
will examine the conventions used for some of the Internet proto-
cols in this section as an example.

88..11..  pprr__oouuttppuutt

     The Internet protocol UDP uses the convention,

     error = udp_output(inp, m);
     int error; struct inpcb *inp; struct mbuf *m;

where  the  _i_n_p,  ``_i_nternet  _protocol  _control  _block'',  passed
between modules conveys per connection state information, and the
mbuf chain contains the data to be sent.   UDP  performs  consis-
tency  checks,  appends  its  header, calculates a checksum, etc.
before passing the packet on.  UDP is based on the Internet  Pro-
tocol,  IP [Postel81a], as its transport.  UDP passes a packet to
the IP module for output as follows:

     error = ip_output(m, opt, ro, flags);
     int error; struct mbuf *m, *opt; struct route *ro; int flags;


     The call to IP's output routine  is  more  complicated  than
that  for  UDP,  as befits the additional work the IP module must
do.  The _m parameter is the data to be sent, and the _o_p_t  parame-
ter  is  an optional list of IP options which should be placed in
the IP packet header.  The _r_o parameter  is  is  used  in  making
routing decisions (and passing them back to the caller for use in
subsequent calls).  The final  parameter,  _f_l_a_g_s  contains  flags
indicating  whether  the  user is allowed to transmit a broadcast
packet and if routing is to be performed.  The broadcast flag may
be  inconsequential  if  the underlying hardware does not support
the notion of broadcasting.

     All output routines return 0 on success  and  a  UNIX  error
number  if a failure occurred which could be detected immediately
(no buffer space available, no route to destination, etc.).

88..22..  pprr__iinnppuutt

     Both UDP and TCP use the following calling convention,

     (void) (*protosw[].pr_input)(m, ifp);
     struct mbuf *m; struct ifnet *ifp;

Each mbuf list passed is a single packet to be processed  by  the
protocol  module.   The  interface  from  which  the  packet  was









Networking Implementation Notes                         SMM:18-25


received is passed as the second parameter.

     The IP input routine is a VAX software interrupt level  rou-
tine,  and so is not called with any parameters.  It instead com-
municates with network interfaces through a queue, _i_p_i_n_t_r_q, which
is  identical  in  structure  to  the  queues used by the network
interfaces for storing packets awaiting transmission.  The  soft-
ware  interrupt  is  enabled  by the network interfaces when they
place input data on the input queue.

88..33..  pprr__ccttlliinnppuutt

     This routine is used to convey ``control'' information to  a
protocol  module  (i.e.  information which might be passed to the
user, but is not data).

     The common calling convention for this routine is,

     (void) (*protosw[].pr_ctlinput)(req, addr);
     int req; struct sockaddr *addr;

The _r_e_q parameter is one of the following,

     #define  PRC_IFDOWN             0       /* interface transition */
     #define  PRC_ROUTEDEAD          1       /* select new route if possible */
     #define  PRC_QUENCH             4       /* some said to slow down */
     #define  PRC_MSGSIZE            5       /* message size forced drop */
     #define  PRC_HOSTDEAD           6       /* normally from IMP */
     #define  PRC_HOSTUNREACH        7       /* ditto */
     #define  PRC_UNREACH_NET        8       /* no route to network */
     #define  PRC_UNREACH_HOST       9       /* no route to host */
     #define  PRC_UNREACH_PROTOCOL   10      /* dst says bad protocol */
     #define  PRC_UNREACH_PORT       11      /* bad port # */
     #define  PRC_UNREACH_NEEDFRAG   12      /* IP_DF caused drop */
     #define  PRC_UNREACH_SRCFAIL    13      /* source route failed */
     #define  PRC_REDIRECT_NET       14      /* net routing redirect */
     #define  PRC_REDIRECT_HOST      15      /* host routing redirect */
     #define  PRC_REDIRECT_TOSNET    14      /* redirect for type of service & net */
     #define  PRC_REDIRECT_TOSHOST   15      /* redirect for tos & host */
     #define  PRC_TIMXCEED_INTRANS   18      /* packet lifetime expired in transit */
     #define  PRC_TIMXCEED_REASS     19      /* lifetime expired on reass q */
     #define  PRC_PARAMPROB          20      /* header incorrect */

while the _a_d_d_r parameter is the address to  which  the  condition
applies.   Many  of the requests have obviously been derived from
ICMP (the Internet Control  Message  Protocol  [Postel81c]),  and
from  error  messages  defined  in  the  1822 host/IMP convention
[BBN78].  Mapping tables exist to  convert  control  requests  to
UNIX error codes which are delivered to a user.

88..44..  pprr__ccttlloouuttppuutt

     This  is  the  routine that implements per-socket options at
the protocol level for _g_e_t_s_o_c_k_o_p_t and  _s_e_t_s_o_c_k_o_p_t.   The  calling









SMM:18-26                         Networking Implementation Notes


convention is,

     error = (*protosw[].pr_ctloutput)(op, so, level, optname, mp);
     int op; struct socket *so; int level, optname; struct mbuf **mp;

where  _o_p  is one of PRCO_SETOPT or PRCO_GETOPT, _s_o is the socket
from whence the call originated, and _l_e_v_e_l and  _o_p_t_n_a_m_e  are  the
protocol level and option name supplied by the user.  The results
of a PRCO_GETOPT call are returned in an mbuf  whose  address  is
placed  in  _m_p before return.  On a PRCO_SETOPT call, _m_p contains
the address of an mbuf  containing  the  option  data;  the  mbuf
should be freed before return.

99..  PPrroottooccooll//nneettwwoorrkk--iinntteerrffaaccee iinntteerrffaaccee

     The  lowest  layer  in the set of protocols which comprise a
protocol family must interface itself  to  one  or  more  network
interfaces  in  order  to  transmit  and  receive packets.  It is
assumed that any routing decisions have been made before  handing
a  packet to a network interface, in fact this is absolutely nec-
essary in order to  locate  any  interface  at  all  (unless,  of
course,  one  uses  a single ``hardwired'' interface).  There are
two cases with which to be concerned, transmission  of  a  packet
and receipt of a packet; each will be considered separately.

99..11..  PPaacckkeett ttrraannssmmiissssiioonn

     Assuming  a  protocol  has  a handle on an interface, _i_f_p, a
(struct ifnet *), it transmits a fully formatted packet with  the
following call,

     error = (*ifp->if_output)(ifp, m, dst)
     int error; struct ifnet *ifp; struct mbuf *m; struct sockaddr *dst;

The output routine for the network interface transmits the packet
_m to the _d_s_t address, or returns  an  error  indication  (a  UNIX
error  number).   In reality transmission may not be immediate or
successful; normally the output routine simply queues the  packet
on its send queue and primes an interrupt driven routine to actu-
ally transmit the packet.  For unreliable media, such as the Eth-
ernet,  ``successful''  transmission simply means that the packet
has been placed on the cable without a collision.  On  the  other
hand,  an  1822  interface guarantees proper delivery or an error
indication for each message transmitted.  The model  employed  in
the  networking  system  attaches  no promises of delivery to the
packets handed to a network interface, and thus corresponds  more
closely  to  the Ethernet.  Errors returned by the output routine
are only those that can be detected immediately, and are normally
trivial  in  nature (no buffer space, address format not handled,
etc.).  No indication is received if errors  are  detected  after
the call has returned.












Networking Implementation Notes                         SMM:18-27


99..22..  PPaacckkeett rreecceeppttiioonn

     Each  protocol family must have one or more ``lowest level''
protocols.  These protocols deal with internetwork addressing and
are  responsible  for  the  delivery  of  incoming packets to the
proper protocol processing modules.  In the PUP  model  [Boggs78]
these  protocols  are termed Level 1 protocols, in the ISO model,
network layer protocols.  In this system each such protocol  mod-
ule  has  an input packet queue assigned to it.  Incoming packets
received by a network interface are queued for the protocol  mod-
ule,  and a VAX software interrupt is posted to initiate process-
ing.

     Three macros are available for queuing and  dequeuing  pack-
ets:

IF_ENQUEUE(ifq, m)
     This places the packet _m at the tail of the queue _i_f_q.

IF_DEQUEUE(ifq, m)
     This places a pointer to the packet at the head of queue _i_f_q
     in _m and removes the packet from the queue.   A  zero  value
     will be returned in _m if the queue is empty.

IF_DEQUEUEIF(ifq, m, ifp)
     Like  IF_DEQUEUE, this removes the next packet from the head
     of a queue and returns it in _m.  A pointer to the  interface
     on which the packet was received is placed in _i_f_p, a (struct
     ifnet *).

IF_PREPEND(ifq, m)
     This places the packet _m at the head of the queue _i_f_q.

     Each queue has a maximum length associated with it as a sim-
ple  form of congestion control.  The macro IF_QFULL(ifq) returns
1 if the queue is filled, in which case  the  macro  IF_DROP(ifq)
should  be  used  to increment the count of the number of packets
dropped, and the offending packet is dropped.  For  example,  the
following  code  fragment  is  commonly found in a network inter-
face's input routine,

     if (IF_QFULL(inq)) {
            IF_DROP(inq);
            m_freem(m);
     } else
            IF_ENQUEUE(inq, m);

















SMM:18-28                         Networking Implementation Notes


1100..  GGaatteewwaayyss aanndd rroouuttiinngg iissssuueess

     The system has been designed with the  expectation  that  it
will  be  used in an internetwork environment.  The ``canonical''
environment was envisioned to be a collection of local area  net-
works connected at one or more points through hosts with multiple
network interfaces (one on each local area network), and possibly
a  connection  to a long haul network (for example, the ARPANET).
In such an environment, issues of gatewaying and  packet  routing
become  very important.  Certain of these issues, such as conges-
tion control, have been handled in a simplistic manner or specif-
ically  not addressed.  Instead, where possible, the network sys-
tem  attempts  to  provide  simple  mechanisms  upon  which  more
involved  policies may be implemented.  As some of these problems
become better understood, the solutions developed will be  incor-
porated into the system.

     This  section  will  describe  the  facilities  provided for
packet routing.  The simplistic mechanisms provided  for  conges-
tion control are described in chapter 12.

1100..11..  RRoouuttiinngg ttaabblleess

     The  network  system  maintains  a set of routing tables for
selecting a network interface to use in delivering  a  packet  to
its destination.  These tables are of the form:

     struct rtentry {
              u_long   rt_hash;                /* hash key for lookups */
              struct   sockaddr rt_dst;        /* destination net or host */
              struct   sockaddr rt_gateway;    /* forwarding agent */
              short    rt_flags;               /* see below */
              short    rt_refcnt;              /* no. of references to structure */
              u_long   rt_use;                 /* packets sent using route */
              struct   ifnet *rt_ifp;          /* interface to give packet to */
     };


     The routing information is organized in two separate tables,
one for routes to a host and one for routes to  a  network.   The
distinction  between  hosts  and  networks is necessary so that a
single mechanism may be used for both  broadcast  and  multi-drop
type  networks,  and  also for networks built from point-to-point
links (e.g DECnet [DEC80]).

     Each table is organized as a hashed  set  of  linked  lists.
Two  32-bit  hash  values  are calculated by routines defined for
each address family; one based on the destination being  a  host,
and  one  assuming  the  target  is  the  network  portion of the
address.  Each hash value is used  to  locate  a  hash  chain  to
search  (by  taking the value modulo the hash table size) and the
entire 32-bit value is then used as a key in scanning the list of
routes.   Lookups  are  applied  first  to  the routing table for
hosts, then to the routing table for networks.  If  both  lookups









Networking Implementation Notes                         SMM:18-29


fail, a final lookup is made for a ``wildcard'' route (by conven-
tion, network 0).  The  first  appropriate  route  discovered  is
used.   By doing this, routes to a specific host on a network may
be present as well as routes to the network.  This also allows  a
``fall  back'' network route to be defined to a ``smart'' gateway
which may then perform more intelligent routing.

     Each routing table entry contains a destination (the desired
final  destination),  a  gateway to which to send the packet, and
various flags which indicate the route's status and type (host or
network).   A count of the number of packets sent using the route
is kept, along with a count of ``held references'' to the dynami-
cally  allocated  structure  to  insure  that  memory reclamation
occurs only when the route is not in use.  Finally, a pointer  to
the  a  network  interface  is kept; packets sent using the route
should be handed to this interface.

     Routes are typed in two ways: either as host or network, and
as  ``direct''  or  ``indirect''.   The  host/network distinction
determines how to compare the _r_t___d_s_t field during lookup.  If the
route  is  to  a  network, only a packet's destination network is
compared to the _r_t___d_s_t entry stored in the table.  If  the  route
is to a host, the addresses must match bit for bit.

     The  distinction  between ``direct'' and ``indirect'' routes
indicates whether the destination is directly  connected  to  the
source.   This is needed when performing local network encapsula-
tion.  If a packet is destined for a peer at a  host  or  network
which  is  not directly connected to the source, the internetwork
packet header will contain the address of the  eventual  destina-
tion, while the local network header will address the intervening
gateway.  Should the destination  be  directly  connected,  these
addresses  are  likely  to be identical, or a mapping between the
two exists.  The RTF_GATEWAY flag indicates that the route is  to
an  ``indirect'' gateway agent, and that the local network header
should be filled in from the _r_t___g_a_t_e_w_a_y field instead of from the
final internetwork destination address.

     It  is  assumed that multiple routes to the same destination
will not be present; only  one  of  multiple  routes,  that  most
recently installed, will be used.

     Routing  redirect  control  messages are used to dynamically
modify existing routing table entries as well as dynamically cre-
ate new routing table entries.  On hosts where exhaustive routing
information is too expensive to maintain  (e.g.  work  stations),
the  combination of wildcard routing entries and routing redirect
messages can be used  to  provide  a  simple  routing  management
scheme without the use of a higher level policy process.  Current
connections may be rerouted after notification of  the  protocols
by  means  of  their _p_r___c_t_l_i_n_p_u_t entries.  Statistics are kept by
the routing table routines on the use of  routing  redirect  mes-
sages  and  their affect on the routing tables.  These statistics
may be viewed using _n_e_t_s_t_a_t(1).









SMM:18-30                         Networking Implementation Notes


     Status information other than routing redirect control  mes-
sages may be used in the future, but at present they are ignored.
Likewise, more intelligent ``metrics'' may be  used  to  describe
routes  in  the  future, possibly based on bandwidth and monetary
costs.

1100..22..  RRoouuttiinngg ttaabbllee iinntteerrffaaccee

     A protocol accesses the routing tables  through  three  rou-
tines,  one  to allocate a route, one to free a route, and one to
process a routing redirect control message.  The routine  _r_t_a_l_l_o_c
performs  route  allocation;  it  is called with a pointer to the
following structure containing the desired destination:

     struct route {
            struct    rtentry *ro_rt;
            struct    sockaddr ro_dst;
     };

The route returned  is  assumed  ``held''  by  the  caller  until
released  with an _r_t_f_r_e_e call.  Protocols which implement virtual
circuits, such as TCP, hold onto routes for the duration  of  the
circuit's lifetime, while connection-less protocols, such as UDP,
allocate and  free  routes  whenever  their  destination  address
changes.

     The  routine _r_t_r_e_d_i_r_e_c_t is called to process a routing redi-
rect control message.  It is called with a  destination  address,
the  new gateway to that destination, and the source of the redi-
rect.  Redirects are accepted only from the  current  router  for
the  destination.  If a non-wildcard route exists to the destina-
tion, the gateway entry in the route is modified to point at  the
new  gateway  supplied.   Otherwise, a new routing table entry is
inserted reflecting the information supplied.  Routes  to  inter-
faces  and  routes  to gateways which are not directly accessible
from the host are ignored.

1100..33..  UUsseerr lleevveell rroouuttiinngg ppoolliicciieess

     Routing policies implemented in  user  processes  manipulate
the  kernel routing tables through two _i_o_c_t_l calls.  The commands
SIOCADDRT and SIOCDELRT add and delete routing  entries,  respec-
tively;  the  tables  are read through the /dev/kmem device.  The
decision to place policy decisions in a user process implies that
routing  table updates may lag a bit behind the identification of
new routes, or the failure of existing routes, but this period of
instability  is normally very small with proper implementation of
the routing process.  Advisory information, such  as  ICMP  error
messages  and IMP diagnostic messages, may be read from raw sock-
ets (described in the next section).

     Several routing policy processes have  already  been  imple-
mented.  The system standard ``routing daemon'' uses a variant of
the Xerox NS Routing Information Protocol [Xerox82]  to  maintain









Networking Implementation Notes                         SMM:18-31


up-to-date  routing tables in our local environment.  Interaction
with other existing routing protocols, such as the  Internet  EGP
(Exterior  Gateway Protocol), has been accomplished using a simi-
lar process.

1111..  RRaaww ssoocckkeettss

     A raw socket is an object which allows users  direct  access
to  a  lower-level protocol.  Raw sockets are intended for knowl-
edgeable processes which wish to take advantage of some  protocol
feature  not directly accessible through the normal interface, or
for the development of new protocols built  atop  existing  lower
level  protocols.   For  example,  a  new version of TCP might be
developed at the user level by using a raw IP socket for delivery
of  packets.   The raw IP socket interface attempts to provide an
identical interface to the one a protocol would have if  it  were
resident in the kernel.

     The  raw socket support is built around a generic raw socket
interface, (possibly) augmented by  protocol-specific  processing
routines.   This section will describe the core of the raw socket
interface.

1111..11..  CCoonnttrrooll bblloocckkss

     Every raw socket has a protocol control block of the follow-
ing form:

     struct rawcb {
             struct   rawcb *rcb_next;        /* doubly linked list */
             struct   rawcb *rcb_prev;
             struct   socket *rcb_socket;     /* back pointer to socket */
             struct   sockaddr rcb_faddr;     /* destination address */
             struct   sockaddr rcb_laddr;     /* socket's address */
             struct   sockproto rcb_proto;    /* protocol family, protocol */
             caddr_t  rcb_pcb;                /* protocol specific stuff */
             struct   mbuf *rcb_options;      /* protocol specific options */
             struct   route rcb_route;        /* routing information */
             short    rcb_flags;
     };

All  the control blocks are kept on a doubly linked list for per-
forming lookups during  packet  dispatch.   Associations  may  be
recorded  in  the control block and used by the output routine in
preparing packets for transmission.  The _r_c_b___p_r_o_t_o structure con-
tains  the protocol family and protocol number with which the raw
socket is associated.  The protocol,  family  and  addresses  are
used  to  filter packets on input; this will be described in more
detail  shortly.   If  any   protocol-specific   information   is
required,  it  may  be  attached  to  the control block using the
_r_c_b___p_c_b field.  Protocol-specific  options  for  transmission  in
outgoing packets may be stored in _r_c_b___o_p_t_i_o_n_s.











SMM:18-32                         Networking Implementation Notes


     A  raw socket interface is datagram oriented.  That is, each
send or receive on the socket  requires  a  destination  address.
This address may be supplied by the user or stored in the control
block and automatically installed in the outgoing packet  by  the
output routine.  Since it is not possible to determine whether an
address is present or  not  in  the  control  block,  two  flags,
RAW_LADDR  and RAW_FADDR, indicate if a local and foreign address
are present.  Routing is expected to be performed by the underly-
ing protocol if necessary.

1111..22..  IInnppuutt pprroocceessssiinngg

     Input  packets  are  ``assigned''  to raw sockets based on a
simple pattern matching scheme.  Each network interface or proto-
col  gives  unassigned  packets to the raw input routine with the
call:

     raw_input(m, proto, src, dst)
     struct mbuf *m; struct sockproto *proto, struct sockaddr *src, *dst;

The data packet then has a generic header prepended to it of  the
form

     struct raw_header {
            struct    sockproto raw_proto;
            struct    sockaddr raw_dst;
            struct    sockaddr raw_src;
     };

and it is placed in a packet queue for the ``raw input protocol''
module.  Packets taken from this queue are copied  into  any  raw
sockets that match the header according to the following rules,

1)   The protocol family of the socket and header agree.

2)   If  the  protocol  number in the socket is non-zero, then it
     agrees with that found in the packet header.

3)   If a local address is defined for the  socket,  the  address
     format  of  the local address is the same as the destination
     address's and the two addresses agree bit for bit.

4)   The rules of 3) are applied to the socket's foreign  address
     and the packet's source address.

A basic assumption is that addresses present in the control block
and packet header (as constructed by the  network  interface  and
any  raw input protocol module) are in a canonical form which may
be ``block compared''.

1111..33..  OOuuttppuutt pprroocceessssiinngg

     On output the raw _p_r___u_s_r_r_e_q routine passes the packet and  a
pointer  to  the  raw  control  block  to the raw protocol output









Networking Implementation Notes                         SMM:18-33


routine for any processing required before it is delivered to the
appropriate  network  interface.   The output routine is normally
the only code required to implement a raw socket interface.

1122..  BBuuffffeerriinngg aanndd ccoonnggeessttiioonn ccoonnttrrooll

     One of the major factors in the performance of a protocol is
the buffering policy used.  Lack of a proper buffering policy can
force packets to be dropped, cause falsified  windowing  informa-
tion  to  be  emitted by protocols, fragment host memory, degrade
the overall host performance,  etc.   Due  to  problems  such  as
these,  most  systems allocate a fixed pool of memory to the net-
working system and impose a policy optimized for ``normal''  net-
work operation.

     The networking system developed for UNIX is little different
in this respect.  At boot time a fixed amount of memory is  allo-
cated  by the networking system.  At later times more system mem-
ory may be requested as the need arises, but at no time is memory
ever  returned  to the system.  It is possible to garbage collect
memory from the network, but difficult.  In order to perform this
garbage  collection  properly,  some  portion of the network will
have to be ``turned off'' as data structures  are  updated.   The
interval  over  which this occurs must kept small compared to the
average inter-packet arrival time, or too  much  traffic  may  be
lost, impacting other hosts on the network, as well as increasing
load on the interconnecting mediums.  In our environment we  have
not  experienced  a  need for such compaction, and thus have left
the problem unresolved.

     The mbuf structure was introduced in  chapter  5.   In  this
section a brief description will be given of the allocation mech-
anisms, and policies used by the protocols in performing  connec-
tion level buffering.

1122..11..  MMeemmoorryy mmaannaaggeemmeenntt

     The  basic  memory allocation routines manage a private page
map, the size of which determines the maximum  amount  of  memory
that  may  be allocated by the network.  A small amount of memory
is allocated at boot time to initialize the mbuf  and  mbuf  page
cluster free lists.  When the free lists are exhausted, more mem-
ory is requested  from  the  system  memory  allocator  if  space
remains  in  the map.  If memory cannot be allocated, callers may
block awaiting free memory, or the failure may  be  reflected  to
the  caller  immediately.   The allocator will not block awaiting
free map entries, however, as exhaustion of the page map  usually
indicates  that  buffers  have  been lost due to a ``leak.''  The
private page table is used by the network buffer management  rou-
tines  in  remapping pages to be logically contiguous as the need
arises.  In addition, an array of reference counts parallels  the
page  table  and  is  used when multiple references to a page are
present.










SMM:18-34                         Networking Implementation Notes


     Mbufs are 128 byte structures, 8 fitting in a 1Kbyte page of
memory.   When  data is placed in mbufs, it is copied or remapped
into logically contiguous pages of memory from the  network  page
pool  if  possible.  Data smaller than half of the size of a page
is copied into one or more 112 byte mbuf data areas.

1122..22..  PPrroottooccooll bbuuffffeerriinngg ppoolliicciieess

     Protocols reserve fixed amounts of buffering  for  send  and
receive queues at socket creation time.  These amounts define the
high and low water marks used by the socket routines in  deciding
when  to  block  and unblock a process.  The reservation of space
does not currently result in any action by the memory  management
routines.

     Protocols  which  provide  connection  level flow control do
this based on the  amount  of  space  in  the  associated  socket
queues.  That is, send windows are calculated based on the amount
of free space in the socket's receive queue, while  receive  win-
dows  are adjusted based on the amount of data awaiting transmis-
sion in the send queue.  Care has been taken to avoid the ``silly
window  syndrome'' described in [Clark82] at both the sending and
receiving ends.

1122..33..  QQuueeuuee lliimmiittiinngg

     Incoming packets from the network are always received unless
memory  allocation  fails.   However, each Level 1 protocol input
queue has an upper bound on the queue's length, and  any  packets
exceeding that bound are discarded.  It is possible for a host to
be overwhelmed by excessive network traffic (for instance a  host
acting  as a gateway from a high bandwidth network to a low band-
width network).  As a ``defensive'' mechanism  the  queue  limits
may be adjusted to throttle network traffic load on a host.  Con-
sider a host willing to devote some percentage of its machine  to
handling  network  traffic.   If the cost of handling an incoming
packet can be calculated so that an acceptable ``packet  handling
rate'' can be determined, then input queue lengths may be dynami-
cally adjusted based on a host's network load and the  number  of
packets  awaiting  processing.   Obviously, discarding packets is
not a satisfactory solution to a problem  such  as  this  (simply
dropping  packets  is  likely to increase the load on a network);
the queue lengths were incorporated mainly as a safeguard  mecha-
nism.

1122..44..  PPaacckkeett ffoorrwwaarrddiinngg

     When  packets can not be forwarded because of memory limita-
tions, the system attempts to generate a ``source  quench''  mes-
sage.   In addition, any other problems encountered during packet
forwarding are also reflected back to the sender in the  form  of
ICMP packets.  This helps hosts avoid unneeded retransmissions.











Networking Implementation Notes                         SMM:18-35


     Broadcast  packets  are never forwarded due to possible dire
consequences.  In an early stage of network  development,  broad-
cast  packets  were  forwarded and a ``routing loop'' resulted in
network saturation and every host on the network crashing.

1133..  OOuutt ooff bbaanndd ddaattaa

     Out of band data is a facility peculiar to the stream socket
abstraction  defined.   Little  agreement  appears to exist as to
what its semantics should be.  TCP defines the notion of ``urgent
data'' as in-line, while the NBS protocols [Burruss81] and numer-
ous others provide a fully independent logical transmission chan-
nel along which out of band data is to be sent.  In addition, the
amount of the data which may be sent as an out  of  band  message
varies  from  protocol  to  protocol; everything from 1 bit to 16
bytes or more.

     A stream socket's notion  of  out  of  band  data  has  been
defined  as  the  lowest  reasonable common denominator (at least
reasonable in our minds); clearly this is subject to debate.  Out
of  band  data  is  expected  to be transmitted out of the normal
sequencing and flow control constraints of the  data  stream.   A
minimum  of 1 byte of out of band data and one outstanding out of
band message are expected to be supported by  the  protocol  sup-
porting  a stream socket.  It is a protocol's prerogative to sup-
port larger-sized messages, or more than one outstanding  out  of
band message at a time.

     Out  of  band data is maintained by the protocol and is usu-
ally not stored in the socket's receive  queue.   A  socket-level
option, SO_OOBINLINE, is provided to force out-of-band data to be
placed in the normal receive queue when urgent data is  received;
this  sometimes  amelioriates  problems  due to loss of data when
multiple out-of-band segments are received before the  first  has
been passed to the user.  The PRU_SENDOOB and PRU_RCVOOB requests
to the _p_r___u_s_r_r_e_q routine are used in sending and receiving  data.

1144..  TTrraaiilleerr pprroottooccoollss

     Core to core copies can be expensive.  Consequently, a great
deal of effort was spent in minimizing such operations.  The  VAX
architecture  provides  virtual memory hardware organized in page
units.  To cut down on copy operations, data  is  kept  in  page-
sized  units  on page-aligned boundaries whenever possible.  This
allows data to be moved in memory simply by  remapping  the  page
instead of copying.  The mbuf and network interface routines per-
form page table manipulations where needed, hiding the  complexi-
ties of the VAX virtual memory hardware from higher level code.

     Data  enters  the system in two ways: from the user, or from
the network (hardware interface).  When data is copied  from  the
user's address space into the system it is deposited in pages (if
sufficient data is present).  This encourages the user to  trans-
mit  information  in  messages which are a multiple of the system









SMM:18-36                         Networking Implementation Notes


page size.

     Unfortunately, performing a similar  operation  when  taking
data  from the network is very difficult.  Consider the format of
an incoming packet.  A packet usually contains  a  local  network
header  followed  by  one  or more headers used by the high level
protocols.  Finally, the data, if  any,  follows  these  headers.
Since  the header information may be variable length, DMA'ing the
eventual data for the user into a page aligned area of memory  is
impossible  without  _a  _p_r_i_o_r_i  knowledge of the format (e.g., by
supporting only a single protocol header format).

     To allow variable length header information  to  be  present
and  still ensure page alignment of data, a special local network
encapsulation may be used.  This encapsulation, termed a  _t_r_a_i_l_e_r
_p_r_o_t_o_c_o_l  [Leffler84], places the variable length header informa-
tion after the data.  A fixed size local network header  is  then
prepended to the resultant packet.  The local network header con-
tains the size of the data portion (in units of 512 bytes), and a
new  _t_r_a_i_l_e_r _p_r_o_t_o_c_o_l _h_e_a_d_e_r, inserted before the variable length
information, contains the size  of  the  variable  length  header
information.   The  following  trailer protocol header is used to
store information regarding the variable length protocol header:

     struct {
            short     protocol;            /* original protocol no. */
            short     length;              /* length of trailer */
     };


     The processing of the trailer protocol is very  simple.   On
output,  the local network header indicates that a trailer encap-
sulation is being used.  The header also includes  an  indication
of  the  number of data pages present before the trailer protocol
header.  The trailer protocol header is  initialized  to  contain
the  actual  protocol  identifier  and the variable length header
size, and is appended to the data along with the variable  length
header information.

     On input, the interface routines identify the trailer encap-
sulation by the protocol type stored in the local network header,
then  calculate the number of pages of data to find the beginning
of the trailer.  The trailing information is copied into a  sepa-
rate mbuf and linked to the front of the resultant packet.

     Clearly,   trailer  protocols  require  cooperation  between
source and destination.  In  addition,  they  are  normally  cost
effective only when sizable packets are used.  The current scheme
works because the local network encapsulation header is  a  fixed
size,  allowing  DMA operations to be performed at a known offset
from the first data page being received.  Should the  local  net-
work header be variable length this scheme fails.











Networking Implementation Notes                         SMM:18-37


     Statistics collected indicate that as much as 200Kb/s can be
gained by using a trailer  protocol  with  1Kbyte  packets.   The
average size of the variable length header was 40 bytes (the size
of a minimal TCP/IP packet header).  If hardware supports  larger
sized packets, even greater gains may be realized.

AAcckknnoowwlleeddggeemmeennttss

     The  internal structure of the system is patterned after the
Xerox PUP architecture [Boggs79], while  in  certain  places  the
Internet protocol family has had a great deal of influence in the
design.  The use of software interrupts for process invocation is
based  on  similar  facilities found in the VMS operating system.
Many of the ideas related to protocol modularity, memory  manage-
ment,  and  network  interfaces are based on Rob Gurwitz's TCP/IP
implementation for the 4.1BSD version of UNIX on  the  VAX  [Gur-
witz81].   Greg  Chesson  explained his use of trailer encapsula-
tions in Datakit, instigating their use in our system.



RReeffeerreenncceess


[Boggs79]           Boggs, D. R., J. F. Shoch, E. A. Taft, and R.
                    M.  Metcalfe;  _P_U_P_: _A_n _I_n_t_e_r_n_e_t_w_o_r_k _A_r_c_h_i_t_e_c_-
                    _t_u_r_e.  Report  CSL-79-10.   XEROX  Palo  Alto
                    Research Center, July 1979.

[BBN78]             Bolt  Beranek  and  Newman; Specification for
                    the Interconnection of  Host  and  IMP.   BBN
                    Technical Report 1822.  May 1978.

[Cerf78]            Cerf, V. G.;  The Catenet Model for Internet-
                    working.  Internet  Working  Group,  IEN  48.
                    July 1978.

[Clark82]           Clark,  D.  D.;   Window  and Acknowledgement
                    Strategy in TCP, RFC-813.   Network  Informa-
                    tion Center, SRI International.  July 1982.

[DEC80]             Digital  Equipment Corporation;  _D_E_C_n_e_t _D_I_G_I_-
                    _T_A_L _N_e_t_w_o_r_k _A_r_c_h_i_t_e_c_t_u_r_e _-  _G_e_n_e_r_a_l  _D_e_s_c_r_i_p_-
                    _t_i_o_n.  Order No.  AA-K179A-TK.  October 1980.

[Gurwitz81]         Gurwitz, R. F.;  VAX-UNIX Networking  Support
                    Project - Implementation Description.  Inter-
                    network  Working  Group,  IEN  168.   January
                    1981.

[ISO81]             International  Organization  for Standardiza-
                    tion.  _I_S_O  _O_p_e_n  _S_y_s_t_e_m_s  _I_n_t_e_r_c_o_n_n_e_c_t_i_o_n  _-
                    _B_a_s_i_c  _R_e_f_e_r_e_n_c_e  _M_o_d_e_l.   ISO/TC  97/SC 16 N
                    719.  August 1981.









SMM:18-38                         Networking Implementation Notes


[Joy86]             Joy, W.; Fabry, R.;  Leffler,  S.;  McKusick,
                    M.;  and Karels, M.; Berkeley Software Archi-
                    tecture Manual, 4.4BSD  Edition.   _U_N_I_X  _P_r_o_-
                    _g_r_a_m_m_e_r_'_s  _S_u_p_p_l_e_m_e_n_t_a_r_y  _D_o_c_u_m_e_n_t_s,  Vol.  1
                    (PSD:5).  Computer  Systems  Research  Group,
                    University  of  California,  Berkeley.   May,
                    1986.

[Leffler84]         Leffler,  S.J.  and  Karels,  M.J.;   Trailer
                    Encapsulations, RFC-893.  Network Information
                    Center, SRI International.  April 1984.

[Postel80]          Postel, J.  User Datagram Protocol,  RFC-768.
                    Network   Information  Center,  SRI  Interna-
                    tional.  May 1980.

[Postel81a]         Postel, J., ed.  Internet Protocol,  RFC-791.
                    Network   Information  Center,  SRI  Interna-
                    tional.  September 1981.

[Postel81b]         Postel, J., ed.  Transmission Control  Proto-
                    col,  RFC-793.   Network  Information Center,
                    SRI International.  September 1981.

[Postel81c]         Postel, J.  Internet Control  Message  Proto-
                    col,  RFC-792.   Network  Information Center,
                    SRI International.  September 1981.

[Xerox81]           Xerox Corporation.  _I_n_t_e_r_n_e_t _T_r_a_n_s_p_o_r_t _P_r_o_t_o_-
                    _c_o_l_s.    Xerox  System  Integration  Standard
                    028112.  December 1981.

[Zimmermann80]      Zimmermann, H.  OSI Reference Model - The ISO
                    Model of Architecture for Open Systems Inter-
                    connection.  _I_E_E_E _T_r_a_n_s_a_c_t_i_o_n_s _o_n  _C_o_m_m_u_n_i_c_a_-
                    _t_i_o_n_s.  Com-28(4); 425-432.  April 1980.
























