.nr PS 12
.nr VS 15
.RP
.TL
Providing support for non-blocking I/O in a thread package

.AU
Panos Tsirigotis
800-00-2940

.sp 3
Project Report
for
CSCI 5551

.NH
The problem
.PP
The thread package is Awe2, written in C++ and supporting a variety of
architectures: 68000, SPARC, MIPS, etc.
The current implementation of threads under Awe2 blocks the entire
UNIX process when a thread attempts to do I/O.
This is obviously wasteful if there are other threads that
could do useful work while the original thread waits for its I/O
request to complete (which may take an arbitrarily long time if
the request is something like terminal input).
To solve this problem, I created non-blocking versions of the system
calls that have unbounded (or overly long) completion times. These are:
.RS
.ft I
read, write, send, sendto, sendmsg, recv, recvfrom, recvmsg, accept,
connect
.ft R
.RE
.PP
The UNIX kernel
does not allow a process to continue running while an I/O request
is in progress (with the
single exception of the \fIconnect(2)\fR system call). Instead, when
a file descriptor is set for non-blocking I/O
(more precisely, the underlying object
is set for non-blocking I/O, not the file descriptor), the
system returns immediately from the I/O call (e.g. \fIread(2)\fR)
and informs the process that its request cannot be satisfied
at this time. The process may elect to a) register a signal handler
for asynchronous notification from the system when I/O
becomes possible, and/or b) poll the relevant file descriptors using
the \fIselect(2)\fR system call to determine if I/O is possible.
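.PP
The mechanism above can be sketched as follows (a minimal illustration
using a pipe as the descriptor; the function names are mine, not part
of Awe2):

```cpp
// Sketch: marking a descriptor non-blocking and polling it with select(2).
// All calls are standard BSD/POSIX; the helper names are illustrative.
#include <fcntl.h>
#include <unistd.h>
#include <errno.h>
#include <sys/select.h>

// Set the underlying object referenced by fd to non-blocking mode.
int set_nonblocking(int fd)
{
    int flags = fcntl(fd, F_GETFL, 0);
    if (flags == -1)
        return -1;
    return fcntl(fd, F_SETFL, flags | O_NONBLOCK);
}

// Attempt a read; if no data is ready, the call fails with
// EWOULDBLOCK/EAGAIN instead of blocking the whole process.
int try_read(int fd, char *buf, int len)
{
    int n = read(fd, buf, len);
    if (n == -1 && (errno == EWOULDBLOCK || errno == EAGAIN))
        return -1;              /* request cannot be satisfied now */
    return n;
}

// Poll with select(2) to see whether fd is readable, without blocking.
int is_readable(int fd)
{
    fd_set readfds;
    struct timeval tv = { 0, 0 };   /* zero timeout: pure poll */
    FD_ZERO(&readfds);
    FD_SET(fd, &readfds);
    return select(fd + 1, &readfds, 0, 0, &tv) == 1;
}
```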

.NH
Design decisions
.PP
The non-blocking I/O facility is based on the following design decisions,
whose goal was to make the facility easy to design, use, maintain, and
integrate with the rest of the Awe2 code.
.RS
.IP 1)
The facility as implemented allows a thread to block while
waiting for I/O, but the UNIX process keeps running and the
CPU multiplexor may execute other threads.
Truly asynchronous I/O could be implemented, but the Awe2 package
lacks a way of notifying a thread that
an event has happened.
What is required is something
analogous to signals, but on a per-thread basis.
This approach was rejected because it 
would require changes to the existing \fIThread\fR class and probably
to other parts of the Awe2 code.
Also, such a model of computation
makes programming hard because of the need to register signal
handlers and deal with race conditions. The blocking-thread,
non-blocking-UNIX-process design is a much cleaner model.
Finally, asynchronous I/O would be only an illusion, as the UNIX
kernel does not support it (a process can have only one I/O
operation outstanding).
.IP 2)
The non-blocking I/O facility is optional. This allows a program to be
tested with blocking I/O calls, something that, presumably, makes
debugging easier.
It may also help identify problems with the non-blocking I/O facility
itself.
.IP 3)
Modifications to the existing Awe2 code should be minimal with no
recompilation being necessary (if possible).
.RE

.NH
About Awe2 and non-blocking I/O
.PP
Awe2 provides two types of CPU multiplexors:
a \fISingleCpuMux\fR for uniprocessor
systems and a \fIMultiCpuMux\fR for multiprocessor systems.
There is a 1-1 correspondence between CPU multiplexors and UNIX processes:
in effect a multiplexor manages the single thread of control available
to a UNIX process.
In the following
discussion the terms CPU and process will be used interchangeably.
.PP
There are two versions of the non-blocking I/O facility,
one for the \fISingleCpuMux\fR class
and one for the \fIMultiCpuMux\fR class. The \fIMultiCpuMux\fR version uses the
\fISingleCpuMux\fR version.

.NH
The \fISingleCpuMux\fR version of the non-blocking I/O facility
.PP
The non-blocking I/O is implemented by the \fIAsyncIO\fR class.
For each file descriptor there is an \fIAsyncIO\fR object (there is an
array of \fIAsyncIO\fR objects). Each object has a member function for
each of the intercepted system calls:
.RS
.ft I
read, write, send, sendto, sendmsg, recv, recvfrom, recvmsg, accept,
connect
.ft R
.RE
.LP
There is a function with C linkage for each call, which invokes
the appropriate member function.
A thread that tries to execute one of these calls is placed in a
\fIThreadContainer\fR associated with the file descriptor that the
thread tried to use, and the descriptor is added to the set of
descriptors on which I/O is expected.
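.PP
The dispatch structure can be sketched as follows (a structural
illustration only, not the Awe2 source; the wrapper and table names are
mine, and the member function shown omits the thread-blocking logic):

```cpp
// Structural sketch: one AsyncIO object per descriptor, indexed by
// descriptor number, with a C-linkage wrapper that dispatches to the
// member function.  The member shown here just performs the plain
// system call; in Awe2 it would instead place the calling thread in
// the descriptor's ThreadContainer whenever the call would block.
#include <unistd.h>

class AsyncIO {
public:
    // In Awe2 this blocks only the calling thread, not the process.
    int read(int fd, char *buf, int len) { return ::read(fd, buf, len); }
};

// Hypothetical table name; the real code sizes it by getdtablesize().
static AsyncIO io_table[256];

// Hypothetical name for the C-linkage wrapper around read(2).
extern "C" int awe_read(int fd, char *buf, int len)
{
    return io_table[fd].read(fd, buf, len);
}
```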
.PP
There are two ways of waking a thread when I/O becomes possible on
its file descriptor:
.RS
.IP 1)
the system sends the process a \fBSIGIO\fR signal.
The signal handler then reschedules all threads for which
I/O is possible (possibly more than one).
.IP 2)
the other method uses an \fIIOThread\fR. An \fIIOThread\fR is a
subclass of \fIThread\fR; there is only one object of this class.
The function of the \fIIOThread\fR is to check whether
I/O is possible on any file descriptor. A secondary, but also important,
function of the \fIIOThread\fR is to keep the \fISingleCpuMux\fR running, because the
\fISingleCpuMux\fR terminates if it runs out of threads to
execute.
The \fIIOThread\fR terminates when
the event queue of the multiplexor is empty and no thread is waiting
for I/O. When the \fIIOThread\fR terminates, the \fISingleCpuMux\fR terminates also.
.RE
.PP
The \fBSIGIO\fR method is not necessary in the \fISingleCpuMux\fR case and is used
only if the symbol USE_SIGIO is defined.
The difference between the two methods is that with \fBSIGIO\fR a thread
is rescheduled as soon as it can complete its I/O request.
However, since the \fISingleCpuMux\fR uses a first-in, first-out event queue,
the rescheduled thread is placed at the end of the queue, so
there is no gain from immediate rescheduling.
.PP
A problem with \fBSIGIO\fR is that it introduces a race condition:
the \fBSIGIO\fR handler moves threads from the \fIThreadContainer\fR of the
file descriptor to the event queue of the multiplexor and it
may be invoked while the multiplexor
is extracting a thread from its event queue.
The following solutions are possible:
.RS
.IP 1)
Disable \fBSIGIO\fR when adding/removing threads to/from the event queue.
This is very easy to implement but
has the disadvantage of requiring two system calls (one to block the
signal and one to restore the mask).
Depending on how often the scheduler code is called, this may impose
a significant performance penalty.
.IP 2)
Create an I/O queue in the multiplexor. The \fBSIGIO\fR handler
places rescheduled threads in the I/O queue, and the multiplexor
moves that queue to its event queue when the event queue becomes
empty (\fBSIGIO\fR must be blocked during the move); the result
is less overhead. However, this solution is unfair to threads
that do I/O, because they cannot compete for the CPU until all
other threads have either terminated or also tried to do I/O.
A possible refinement (due to Dirk Grunwald) is to use the
size of the I/O queue as a hint in the dispatcher code to
decide whether it is worth moving the I/O queue to the event queue
(the size does not need to be accurate, so there is
no need to block \fBSIGIO\fR to read it).
.RE
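.PP
Solution 1 can be sketched as follows; the two \fIsigprocmask(2)\fR calls
are the two system calls whose cost is mentioned above (the handler and
function names are mine):

```cpp
// Sketch of solution 1: mask SIGIO around the critical section that
// manipulates the event queue.  Any SIGIO raised while the mask is in
// effect is delivered as soon as the mask is restored.
#include <signal.h>

static volatile sig_atomic_t sigio_seen = 0;
static int seen_during_critical = -1;

// Stands in for the real SIGIO handler, which would reschedule threads.
static void on_sigio(int) { sigio_seen = 1; }

void install_handler()
{
    struct sigaction sa;
    sa.sa_handler = on_sigio;
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = 0;
    sigaction(SIGIO, &sa, 0);
}

// Run `critical' with SIGIO held off.
void with_sigio_blocked(void (*critical)())
{
    sigset_t block, old;
    sigemptyset(&block);
    sigaddset(&block, SIGIO);
    sigprocmask(SIG_BLOCK, &block, &old);    /* system call 1 */
    critical();
    sigprocmask(SIG_SETMASK, &old, 0);       /* system call 2 */
}

// Demonstration critical section: raises SIGIO while it is blocked and
// records whether the handler has run yet (it should not have).
static void raise_sigio()
{
    raise(SIGIO);
    seen_during_critical = sigio_seen;
}
```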
.LP
The current version of the non-blocking I/O facility
implements neither of these solutions; the
\fIIOThread\fR method is used instead.

.NH
The \fIMultiCpuMux\fR version of the non-blocking I/O facility
.PP
The \fIMultiCpuMux\fR provides more threads of control by forking
multiple UNIX processes
(it can also be used on a uniprocessor, but there is
no significant performance advantage in doing so).
On a multiprocessor, the desired effect is for these processes
to run on different processors, speeding up the whole job.
All processes share the same data segments. This facilitates
thread migration because data is always accessible regardless of
which process the thread is running in.
.NH 2
Thread migration
.PP
Since thread migration is possible, a situation may arise where
a thread opens a file while on CPU 1, then migrates to CPU 2 and
tries to access the file from there. However, the file descriptor
for the file belongs to CPU/process 1, not to CPU/process 2.
The following are possible solutions to this problem:
.RS
.IP 1)
take the thread to the CPU that owns the file descriptor
.IP 2)
send the file descriptor to the CPU where the thread is (the original
CPU may close the file descriptor)
.IP 3)
have another thread, resident on the CPU where the descriptor
is, process the I/O request and send the result back.
.IP 4)
have all processes share file descriptors.
.RE
.PP
Solution 1 was chosen. When a thread finds that the file descriptor
it wants to use is not on the CPU it is running on, it relocates
itself to the CPU where the descriptor is; when the I/O request
is completed, it moves back to the original CPU. Note that the second
relocation happens when the thread is rescheduled: the thread is
rescheduled to run on the original CPU.
This solution required modifying the \fIThread\fR class to
include an extra field indicating the original CPU.
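.PP
The relocation logic can be sketched with stand-in types (everything
below is illustrative; only the extra original-CPU field corresponds
directly to the change described):

```cpp
// Sketch of solution 1 with stand-in types.  The real Thread class and
// relocation calls are Awe2's; this only shows the role of the added
// field that records the thread's original CPU.

struct Thread {
    int currentCpu;
    int originalCpu;     /* the field added to the Thread class */
};

// Move `t' to the CPU that owns `fd' before the I/O, remembering where
// it came from.  `owner_of' stands in for the lookup that finds the
// CPU/process owning the descriptor.
void relocate_for_io(Thread &t, int fd, int (*owner_of)(int))
{
    t.originalCpu = t.currentCpu;
    t.currentCpu  = owner_of(fd);    /* run on the descriptor's CPU */
}

// When the I/O request completes, the thread is rescheduled on its
// original CPU.
void reschedule_after_io(Thread &t)
{
    t.currentCpu = t.originalCpu;
}

// Stub lookup for the usage example: every descriptor lives on CPU 1.
static int demo_owner(int) { return 1; }
```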
.PP
Solution 3 was rejected because it
requires a message-passing facility, which was not available.
Solutions 2 and 4 will not work if the threads do any buffering
in their stacks. They are also more complex to implement because
of the additional mechanism needed to send and receive file descriptors.
.PP
From a conceptual point of view, solution 4 is the best one.
To implement it, each CPU must create a UNIX-domain socket on which it
receives messages. There are two types of messages: OPEN_FD and CLOSE_FD.
An OPEN_FD message contains both the file descriptor number and
the file descriptor itself (passed in the access-rights field).
Normally these should be the same (i.e. the system should open
a new file descriptor at the receiving process with a number
equal to the number sent).
A CLOSE_FD message contains just a file descriptor number; when a
CPU receives such a message, it closes the specified file descriptor.
When a CPU opens or closes a file, it sends a
signal to the other CPUs and then sends them
an OPEN_FD or CLOSE_FD message.
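.PP
Passing a descriptor in the access-rights field can be sketched as
follows (the sketch uses the control-message form of the BSD
access-rights mechanism; the OPEN_FD message layout is an assumption):

```cpp
// Sketch of the OPEN_FD message of solution 4: a descriptor is passed
// to another process through a UNIX-domain socket as ancillary data
// (SCM_RIGHTS), together with the number the sender advertises for it.
#include <sys/socket.h>
#include <string.h>
#include <unistd.h>

// Send fd (and the number the receiver should install it as) over sock.
int send_open_fd(int sock, int fd, int fd_number)
{
    char cbuf[CMSG_SPACE(sizeof(int))];
    struct iovec iov = { &fd_number, sizeof fd_number };
    struct msghdr msg;
    memset(&msg, 0, sizeof msg);
    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;
    msg.msg_control = cbuf;
    msg.msg_controllen = sizeof cbuf;
    struct cmsghdr *cm = CMSG_FIRSTHDR(&msg);
    cm->cmsg_level = SOL_SOCKET;
    cm->cmsg_type = SCM_RIGHTS;
    cm->cmsg_len = CMSG_LEN(sizeof(int));
    memcpy(CMSG_DATA(cm), &fd, sizeof fd);
    return sendmsg(sock, &msg, 0);
}

// Receive the descriptor; returns it and stores the advertised number.
int recv_open_fd(int sock, int *fd_number)
{
    char cbuf[CMSG_SPACE(sizeof(int))];
    struct iovec iov = { fd_number, sizeof *fd_number };
    struct msghdr msg;
    memset(&msg, 0, sizeof msg);
    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;
    msg.msg_control = cbuf;
    msg.msg_controllen = sizeof cbuf;
    if (recvmsg(sock, &msg, 0) == -1)
        return -1;
    int fd;
    memcpy(&fd, CMSG_DATA(CMSG_FIRSTHDR(&msg)), sizeof fd);
    return fd;
}
```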
.NH 2
Number of file descriptors
.PP
Another issue was the number of file descriptors available. In principle
this can be as many as \fBMaxCpuMultiplexors\fR*\fIgetdtablesize()\fR.
I opted instead to make only \fIgetdtablesize()\fR descriptors available.
This means that not all processes use all their available
file descriptors. Furthermore, a file descriptor number can belong
to at most one process.
.PP
There is a problem with this approach: two processes opening a file may
get back the same file descriptor number. The solution is to intercept
all calls that create file descriptors. When a descriptor
is created, the code checks whether some other process is already using
that number; if so, the descriptor is duplicated to the next
unused number and the original descriptor is closed.
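.PP
The duplication step can be sketched with \fIfcntl(2)\fR's \fBF_DUPFD\fR
request, which returns the lowest free descriptor at or above its
argument (the table-lookup function below is a stub, and the sketch
ignores the possibility that the duplicated number collides again):

```cpp
// Sketch of the descriptor-collision fix: when a newly created
// descriptor's number is already owned by another process, duplicate
// it to the next unused number and close the original.
#include <fcntl.h>
#include <unistd.h>

// `owned_elsewhere' stands in for the global table lookup that says
// whether a descriptor number belongs to another process.
int reassign_fd(int fd, int (*owned_elsewhere)(int))
{
    int candidate = fd;
    while (owned_elsewhere(candidate))   /* find the next free number */
        candidate++;
    if (candidate == fd)
        return fd;                       /* no collision */
    int newfd = fcntl(fd, F_DUPFD, candidate);
    if (newfd == -1)
        return -1;
    close(fd);                           /* give up the colliding number */
    return newfd;
}

// Stub lookup for the usage example below: pretend descriptor numbers
// 0-9 belong to other processes.
static int demo_owned(int fd) { return fd < 10; }
```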
.PP
The code uses an array of \fIMultiAsyncIO\fR objects 
of size \fIgetdtablesize()\fR.
The \fIMultiAsyncIO\fR object that corresponds to a file descriptor
indicates at which process the descriptor can be found (the descriptor
number is used as an index into the table).
.PP
The possibility of allowing more than \fIgetdtablesize()\fR
file descriptors was considered and rejected because of the
following disadvantage:
a system call that creates a file descriptor
(like \fIopen(2)\fR) might fail or succeed depending on the CPU that the
thread is using. This could be fixed by moving the thread to a CPU
with free file descriptors, but the code becomes more complex.
.NH 2
Handling non-blocking I/O
.PP
Unlike \fISingleCpuMux\fR, \fIMultiCpuMux\fR does not need an \fIIOThread\fR.
Furthermore, the current version of that code discourages the use
of an \fIIOThread\fR because it relies on its event queue being empty
to carry out some extra computations. Therefore, it would seem
that the \fBSIGIO\fR method of rescheduling threads is the appropriate one.
This method also has the advantage of immediately rescheduling
threads that run on other CPUs.
However, the race condition must be fixed by employing one of the
solutions proposed above.
.PP
It is also possible to use one \fIIOThread\fR per CPU, but the job dispatcher in
\fIMultiCpuMux\fR would need to be modified to check whether the \fIIOThread\fR is
the only thread running and no I/O request is pending.
.NH 2
Unresolved issues
.PP
Since all processes share data segments, they share all variables that
have external linkage.
Unfortunately, many C library functions are not reentrant because
they use static data (e.g. \fIgetpwent(3)\fR). This
means that locking code must be used around calls to such functions.
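.PP
Such locking can be sketched as follows, using a POSIX mutex as a
stand-in for a blocking lock such as Awe2's \fISemaphore\fR (the
function name is mine):

```cpp
// Sketch of locking around a non-reentrant C library function.
// getpwuid(3), like getpwent(3), returns a pointer to static data,
// so the result must be copied out before the lock is released.
#include <pwd.h>
#include <string.h>
#include <unistd.h>
#include <pthread.h>

static pthread_mutex_t libc_lock = PTHREAD_MUTEX_INITIALIZER;

// Look up a user name under the lock; returns 1 on success, 0 on failure.
int lookup_user_name(uid_t uid, char *buf, int len)
{
    pthread_mutex_lock(&libc_lock);
    struct passwd *pw = getpwuid(uid);
    int ok = (pw != 0);
    if (ok) {
        strncpy(buf, pw->pw_name, len - 1);
        buf[len - 1] = '\0';
    }
    pthread_mutex_unlock(&libc_lock);
    return ok;
}
```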
.PP
A more difficult problem is that all processes share \fIerrno\fR.
It is possible for two processes executing system calls at the
same time to both modify \fIerrno\fR; the final
value of \fIerrno\fR is then undefined.
In general, what is needed is a way to separate the data segment
into a shared and a private part; the linker should provide such
support.

.NH
Using the non-blocking I/O facility with \fISingleCpuMux\fR
.PP
The user must create an \fIIOThread\fR before starting the \fISingleCpuMux\fR.
Other than that, use of the facility is transparent.
Note, however, that some form of locking is required
around code that uses the facility but
is not reentrant (an example is the stream code in the G++ library).
\fISpinLock\fRs should not be used because they will cause deadlock;
blocking locks are needed.
The \fISemaphore\fR class is one possible solution.


.NH
Current status and future work
.PP
The uniprocessor version of the non-blocking I/O facility works reliably.
The multiprocessor code has not been tested yet because the SunOS
version of the multiplexor was not ready. The dispatcher for \fIMultiCpuMux\fR
will need to be changed to deal with \fBSIGIO\fR or with the \fIIOThread\fR.
The following modifications are also recommended:
.RS
.IP -
Move the event and I/O queues to the \fICpuMultiplexor\fR class
.IP -
Allow a thread to define its CPU affinity
.IP -
Include functions to lock/unlock the event and I/O queues in the 
\fICpuMultiplexor\fR.
The functions should be virtual members of the \fICpuMultiplexor\fR.
They are obviously unnecessary in the current version of \fISingleCpuMux\fR
but if preemption is allowed, they will be needed.
.RE

