.\" mm, I guess
.ds d "\&\s-1\f(CW
.ds p "\&\s+1\fP
.de CC
.DS I
.ft CW
.S -1
.nr == \w'~~~~~~~~'u
.ta \\n(==u +\\n(==u +\\n(==u +\\n(==u +\\n(==u +\\n(==u +\\n(==u +\\n(==u +\\n(==u +\\n(==u
..
.de EE
.ft P
.S +1
.DE \\$1
..
.nr Hb 10
.nr Hs 10
.TL
\fB Userfs\fP \- Filesystems Implemented as User Processes
.AU "Jeremy Fitzhardinge <jeremy@sw.oz.au>"
.AF "Softway Pty. Ltd."
.MT 4
.H 1 "Introduction"
Userfs is a mechanism by which normal user processes can
act as a Linux filesystem.
There are many uses for this, including:
.BL
.LI "Prototype filesystems"
.P
Prototype new block allocation algorithms in a user process and
debug with gdb
before going into the compile-crash-reboot cycle of kernel
development.
.LI "Infrequent use filesystems"
.P
You want to mount "FooBaz 0X" filesystems under Linux, but
you don't need to that often, and you don't need
maximum speed.  Rather than trying to get the kernel itself
to understand them, or writing specialised tools, write a filesystem
program.
.LI "Add capabilities to existing filesystems"
.P
Want compression, encryption, ACLs?  Have a process to mirror
an existing file tree, but with your own extensions and semantics.
.LI "Completely virtual filesystems and new interfaces"
.P
Add a filesystem-type interface to an existing mechanism, or
a filesystem interface as a new way of representing data.
Sick of FTP?  How about
.CC
$ mkdir /ftp/tsx-11.mit.edu
$ cd /ftp/tsx-11.mit.edu/pub/Linux
$ cp README $HOME
.EE
Or mail?
.CC
$ cd /mail
$ ls
001.sbg@socs.uts.edu.au
002.Leroy
003.tlukka@vinkku.hut.fi
004.Davor_Jadrijevic
$ cat */From
From: sbg@socs.uts.edu.au
From: leroy@socs.uts.edu.au (Leroy)
From: tlukka@vinkku.hut.fi
From: davor%emard.uucp@ds5000.irb.hr (Davor Jadrijevic)
$ cat */Subject
Subject: More things
Subject: (none)
Subject: That userfs thing
Subject: mailfs again
$ 
.EE
.LE
You get the idea.
.H 1 "Installation"
.H 2 "Kernel"
First of all, remove traces of previous versions of userfs: make sure
there are no userfs header files in
.I "linux/include/linux"
and no userfs patches to any of the kernel source.
.P
Apply the patch "userfs.diff" in the normal way (it's against a 1.1.17
kernel).  Do a "make config; make depend; make", and install.  It is
not necessary to copy any files into the linux source tree; only the
patch is required.
.P
Since this is an 
.I alpha
release, you should know what you're doing, and
know how to fix up simple compilation problems with the kernel.  I'd
be surprised if it just worked.  One thing you may have problems with
is config.in: some of my private changes have leaked in there, and
you're almost certainly going to have local changes, so it's unlikely
the patch will be clean.  The only important change is adding the
CONFIG_USERFS_FS line to the end of the filesystem section.
.P
.H 2 "Non-kernel Code"
.P
Building the rest of the code should be a matter of typing "make" at
the top userfs directory. This will generate dependencies and build
the utilities needed (genser), the library, the clients using the
library and the kernel module.
.P
If there are problems linking genser, add "-lfl" to the
LIBS list in the makefile.
.P
This version is a loadable kernel module.  When you specify "yes" to
the userfs question, it doesn't put the filesystem code into the
kernel; it only puts some symbols into the module symbol table needed
at module load time.  I hope to eliminate the need for any special
changes to the kernel soon.
.P
To install the module you need the
.B modutils
package, which should be available from your local
Linux ftp archive.  It should be clear from its
documentation what you need to do with
.I "userfs.o"
to get it into the kernel.  If you get some warnings about multiply
defined symbols, ignore them.  Only undefined symbols are a problem.
.P
.H 2 "Mailing list"
There is a USERFS channel on the Linux Activists list server.  To
subscribe, send mail with 
.CC
X-Mn-Admin: join USERFS
.EE
as the first line to
linux-activists-request@niksula.hut.fi.  This channel is for general
discussion of userfs development and application.
.H 2 "Bugs, comments, etc"
When you find a bug, tell me.  Please send me the code you're using,
the kernel version, whatever changes you've made to userfs kernel
code, and instructions or a script to reproduce the bug.  Don't just
tell me "it broke."
.P
If you've made changes to the kernel code, please send it to me rather
than sending it out to the world.  Please send me comments, ideas for
new kernel features, or things that you think would make good
filesystems but you can't do right now.  Also feel free to ask
questions about either the implementation of my code or how to write
your own userfs clients.
.P
Send all mail to Jeremy Fitzhardinge <jeremy@sw.oz.au>.
.H 1 "Using clients"
Clients are generally mounted with the
.B muserfs
command.  It's quite simple \- it's a program which makes sure the
mount point is legal for the user to mount on, and mounts the given
process with the user's permissions.  Note that any user can mount a
process, so more checking is done on the mount point than for a normal
mount.  Unless the user is root, the mount point must be owned by the
user and writable.
.B muserfs
has a man page, which is even up to date.
.P
There are a few useful or semi-useful clients:
.B homer ", " ftpfs ", " mailfs " and " arcfs.
.BL
.LI Homer
is written in C++, and uses the C++ library in the lib directory to do
most of its work.  All it does is set up a single directory under its
mount point which contains symbolic links named after each user name
in the password file, each of which points to the associated home directory.
Mounted on /u it makes a passable replacement for ~ expansion in a
shell except it works for any program.
.LI Ftpfs 
is an experimental filesystem which allows read-only access to FTP
sites, maintaining a long-term disk cache.  It's intended primarily for
anonymous FTP, but can also be used for authenticated FTP sessions.
.LI Mailfs
is by Davor Jadrijevic.  It is for reading mail.  Currently it's
read-only and does not track mailbox changes, but it is being actively
developed.
.LI Arcfs
was written by David Gymer.  It allows you to mount
a compressed tar file as a read-only filesystem, and inspect
it with normal tools.  It's pretty neat.
.H 1 "Theory of operation"
The kernel patch and module add a new filesystem type to the
kernel ("userfs").  The filesystem itself is very simple; all it does
is take the normal kernel filesystem requests, wrap them up into
well-defined packets, squirt them down a file descriptor
(presumably connected to a process) and wait for the reply on
another file descriptor.
.P
If the filesystem process is on the same machine, then the file
descriptors are probably ordinary pipes.  However, userfs just reads
and writes on the file descriptors, so they could be anything; files,
sockets, devices and so on \- userfs doesn't care.
.P
The following is not a comprehensive tutorial on writing filesystems,
a detailed "how it works" or specification of the existing code.  It
is intended to give some idea of what I was thinking, and basic
concepts to bear in mind while poking about in my kernel or user code.
.H 2 "Priorities"
I had a number of goals which I wanted satisfied by this thing (from
most to least important):
.BL
.LI Flexibility
.P
I want the process to have as much of the power of a kernel-resident
filesystem as possible.  I wanted to keep the interfaces as generic
and flexible as possible.  This has been mostly achieved.
.LI Robustness
.P
Since I see prototyping and development as a major use for userfs, it
seems important to make sure that the kernel code can't (at worst)
crash or lock up if the user code fails.  As it stands, it should be
impossible for a user process to crash the kernel, but it is possible
for a bad user process to lock up processes trying to use the
filesystem, and it is possible for the kernel to muck up reference
counts and make the filesystem un-umountable.
.P
It is also possible for a process to go strange while it is being
mounted, leaving a half-mounted filesystem.  The mountpoint becomes a
nulled out inode, but the kernel refuses to unmount it (because it
isn't mounted), and refuses to mount on it (because it's busy).
.LI "Availability to users and Security"
.P
I'd like any user to be able to write a filesystem process.
Traditionally, filesystems are things that embody the security of
Unix, and are therefore very much superuser-only things.  However,
there are only a couple of really sensitive features that shouldn't be
able to be controlled by any user: suid executables and device nodes.
Since a trusted superuser process is still required to call the mount
system call, and that process can set the no-suid and no-device flags,
the filesystem code can't use these as security holes.  I can't think
of anything else that needs special care from a security point of
view.  However, since the filesystem is completely under the control
of the process, one can make no assumptions about its contents.  For
example "." and ".." may not do expected things, symlinks may point to
places other than what readlink returns.  This makes navigating such
filesystems a new and interesting experience.
.LI "Efficiency"
.P
Efficiency is my lowest priority, but it is still important.
Unfortunately the other requirements (as usual) make things less
efficient.  The most significant inefficiency is the context switches
between the kernel and the process.  I think the most benefits can be
gained by reducing the number of these.  There are several approaches
to this:
.BL
.LI
If the process wants a well-defined behaviour for an operation, then
it should be done in the kernel.  The best example of this is
permission checking - if the process wants normal unix permission
checking then it doesn't need to do it itself.  Otherwise it can take
all the permission requests from the kernel, and implement other
permission policies.  This is currently implemented.  When the
filesystem is first mounted, the kernel asks the process what requests
it will accept.  From that point the kernel will do sensible default
actions for requests that the process doesn't want to handle rather
than sending them down the connection.
.LI
Group requests commonly issued together into one.  This is hard, since
the main kernel tells the filesystem code very little about what it is
doing, so it is hard to know what to do next.  However, there are a
couple of single kernel requests that are implemented in the protocol
as two or more transactions.  This could be fixed in future.
.LI
Data can be cached in the kernel.  This is the most tricky, since
kernel caching or read-ahead limits the amount of control the process
can have over the data once read.  I think this could be optionally
implemented, depending on whether the process says it is OK to do
caching, and if so what kinds.
.P
Currently directory readahead is implemented with the
.B upp_multireaddir
operation.  This allows the filesystem process to return as many
directory entries as it likes.  These entries are saved attached to
the directory inode in the kernel.  Future readdir requests look in
the readahead buffer before sending an operation to the filesystem.
If it fails to find the required entry then it dumps the readahead
buffer and asks the filesystem process again.  This is a win if there
are lots of linear directory searches (such as shell globbing, ls or
pwd).
.LI
A larger than 4k maximum packet size can be used, now that the kernel
memory allocator allows larger than 4k memory allocations.  However,
since pipes are the most common connection between filesystems and
kernels, and pipes can hold at most 4k of data, there would still be a
context switch between filesystem code and kernel every 4k, so there
wouldn't be much gain.
.LE
A number of people have suggested adding shared memory between the
kernel and the filesystem process.  This would be quite limiting, and
is the option least likely to improve things.  At the moment, the filesystem
makes no assumptions about the nature of the file descriptors for
talking to the process.  To implement shared memory between the kernel
and the process would require some way of finding the process on the
other end of the file descriptors (if any), and playing around with
memory maps.  This still wouldn't cut down on the number of context
switches at all.
.LE
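.P
As a rough illustration of the directory readahead described above, the
following C++ sketch models the buffer kept with a directory inode.  It
is not the kernel code: the names DirEntry and ReadaheadDir and the
fetch callback are hypothetical, and the callback stands in for one
multireaddir round trip to the filesystem process.
.CC
```cpp
#include <cassert>
#include <cstdint>
#include <functional>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Hypothetical entry layout; the real one is defined by the protocol.
struct DirEntry {
    std::uint32_t offset;   // position of this entry in the directory
    std::uint32_t handle;   // handle of the file the name refers to
    std::string name;
};

class ReadaheadDir {
public:
    // Stands in for a round trip to the filesystem process; it may
    // return any number of entries starting at the given offset.
    using Fetch = std::function<std::vector<DirEntry>(std::uint32_t)>;

    explicit ReadaheadDir(Fetch fetch) : fetch_(std::move(fetch)) {}

    // Returns false when the end of the directory has been reached.
    bool readdir(std::uint32_t offset, DirEntry &out) {
        auto it = buffer_.find(offset);
        if (it == buffer_.end()) {
            buffer_.clear();            // miss: dump the readahead buffer
            ++round_trips_;             // and ask the process again
            for (const DirEntry &e : fetch_(offset))
                buffer_[e.offset] = e;
            it = buffer_.find(offset);
            if (it == buffer_.end())
                return false;           // no entries came back
        }
        out = it->second;
        return true;
    }

    int round_trips() const { return round_trips_; }

private:
    Fetch fetch_;
    std::map<std::uint32_t, DirEntry> buffer_;
    int round_trips_ = 0;
};
```
.EE
With a process that returns three entries per reply, a linear scan of
five entries costs two round trips in this model rather than five.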
.H 2 "Protocol"
The protocol used is machine independent, using network
byte order and defined type sizes.  The code to do the
packetisation and depacketisation is generated automatically
by a program, given the description of each packet.
This is not fully portable, but it avoids byte order and
structure alignment problems.
.P
A packet to or from the kernel has two parts.  The first is a header
that contains a sequence number, an operation type, a packet type,
size of the following data, and a protocol version number.  The packet
type can either be a request, a reply or an enquiry.  Requests and
enquiries are always from the kernel to the process, and the process
only ever sends replies to the kernel.  A reply's header has one extra
field - an error field, containing an error number.  Replies always
have the same sequence number as their corresponding request or
enquiry.  If there was an error performing the operation the error
field is set to the error number and there is no additional data
returned.  If there is no error the error field is set to 0.
.P
Following the header of a request or reply packet is the optional operation-specific
data.  This is passed through the protocol for interpretation by the
operation routines at each end.
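.P
To make the header description concrete, here is a minimal C++ sketch
of packing and unpacking such a header in network byte order.  The
field widths (32 bits each), their order, and the names Header, encode
and decode are assumptions made for illustration; the real layout is
whatever the generated packetisation code defines.
.CC
```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Assumed layout: five 32-bit fields in the order given in the text.
struct Header {
    std::uint32_t seq;      // sequence number
    std::uint32_t op;       // operation type
    std::uint32_t type;     // request, reply or enquiry
    std::uint32_t size;     // size of the data that follows
    std::uint32_t version;  // protocol version
};

// Append a 32-bit value, most significant byte first.
inline void put32(std::vector<std::uint8_t> &buf, std::uint32_t v) {
    for (int shift = 24; shift >= 0; shift -= 8)
        buf.push_back(static_cast<std::uint8_t>(v >> shift));
}

// Read a 32-bit value stored most significant byte first.
inline std::uint32_t get32(const std::uint8_t *p) {
    return (std::uint32_t(p[0]) << 24) | (std::uint32_t(p[1]) << 16) |
           (std::uint32_t(p[2]) << 8) | std::uint32_t(p[3]);
}

inline std::vector<std::uint8_t> encode(const Header &h) {
    std::vector<std::uint8_t> buf;
    put32(buf, h.seq);
    put32(buf, h.op);
    put32(buf, h.type);
    put32(buf, h.size);
    put32(buf, h.version);
    return buf;
}

// Returns false if the buffer is too short to hold a header.
inline bool decode(const std::vector<std::uint8_t> &buf, Header &h) {
    if (buf.size() < 20)
        return false;
    const std::uint8_t *p = buf.data();
    h.seq = get32(p);
    h.op = get32(p + 4);
    h.type = get32(p + 8);
    h.size = get32(p + 12);
    h.version = get32(p + 16);
    return true;
}
```
.EE
Because every field is written byte by byte, the result does not
depend on the byte order or structure padding of the host machine.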
.P
The kernel may have multiple outstanding requests.  In other words,
the kernel may send a new request before receiving a reply to a
previous one.  This allows the filesystem to block one process for a
slow operation while other processes can use the filesystem for
shorter operations.  This improves performance on, for example, an ftp
filesystem, where one process may be using a fast local link, and
another may be using a slow international one, and each has to wait
for its own requests to be satisfied.  Of course this requires the
filesystem process to be written with some form of multi-threading.
If the process just reads requests, acts on them and replies then it
can do so and ignore any kernel requests until it is ready to deal
with them.
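.P
One way a filesystem process might keep track of requests it has
accepted but not yet answered is a small table keyed by sequence
number, as in the sketch below.  The names Outstanding and Reply are
hypothetical; the only fact taken from the protocol is that a reply
carries the sequence number of its request, so replies may be sent in
any order.
.CC
```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <set>

// Just the parts of a reply header that matter for matching.
struct Reply {
    std::uint32_t seq;
    int error;
};

class Outstanding {
public:
    // Remember a request that will be answered later.
    void accept(std::uint32_t seq) { pending_.insert(seq); }

    // Build the reply for a finished request.  Returns false if the
    // sequence number is not outstanding.
    bool complete(std::uint32_t seq, int error, Reply &out) {
        if (pending_.erase(seq) == 0)
            return false;
        out.seq = seq;
        out.error = error;
        return true;
    }

    std::size_t waiting() const { return pending_.size(); }

private:
    std::set<std::uint32_t> pending_;
};
```
.EE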
.H 2 "Handles"
The base element of a filesystem is an 
.I inode .
The kernel needs to be able to uniquely identify an inode.  Internally,
inodes are uniquely numbered within a filesystem, but each mounted
filesystem has its own numbering.  Therefore an inode is completely
identified by an inode number and a filesystem identifier (or
.I device ,
even though it doesn't mean much for a filesystem which is not on
a disk).
.P
A device is what distinguishes mounted filesystems from one another,
and an inode is what distinguishes files within a filesystem from each
other.  Inode numbers are generated by each filesystem, and are used
by the kernel to refer to specific files to the filesystem specific
code.  User process filesystems are no exception.  Between the kernel
and the filesystem process, files are referred to by using 
.I handles ,
which are essentially 32-bit unsigned numbers.  When a process first
mentions a file to the kernel, it gives it a handle, which the kernel
uses for all later operations on the file.  It is the handle which
identifies the file, rather than the name, so it is important to use
distinct handles for distinct files, and never change the handle of a
file once it has been given to the kernel.
.H 2 "Random operation specific advice and blurb"
This may eventually accurately describe the whole protocol, but for
now it's a list of interesting points and things that have bitten me.
.P
Normally when writing a filesystem you should use the library
.I libuserfs
(see below), and use the advice in this section as a guide on what
kind of things should be put in your userfs operation functions,
or for idle curiosity.
.H 3 "Mounting"
The mount is initiated by a user process calling the mount system
call, with the "userfs" filesystem type.  In the filesystem specific
data, the process passes two file descriptor numbers for the kernel to
read and write to.  These can be any kind of file descriptor at all.
Most commonly they would be pipes or sockets, but there is no
restriction.  All the kernel requires is that the one it talks to the
process with is writable, and the one it gets replies from is
readable.
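.P
For the common case of pipes, the two descriptors could be prepared
as in the following sketch.  It only creates the pipes; how the two
kernel-side descriptor numbers are placed in the mount data is
deliberately left out, since that structure is defined by the userfs
headers.  The Channel structure and make_channel are hypothetical
names.
.CC
```cpp
#include <cassert>
#include <unistd.h>

struct Channel {
    int kernel_writes;   // given to mount: the kernel sends requests here
    int process_reads;   // the filesystem process reads requests here
    int process_writes;  // the filesystem process writes replies here
    int kernel_reads;    // given to mount: the kernel reads replies here
};

// Create one pipe per direction.  Returns false if pipe(2) fails.
inline bool make_channel(Channel &c) {
    int req[2];
    int rep[2];
    if (pipe(req) != 0)
        return false;
    if (pipe(rep) != 0) {
        close(req[0]);
        close(req[1]);
        return false;
    }
    c.process_reads = req[0];   // read end of the request pipe
    c.kernel_writes = req[1];   // write end of the request pipe
    c.kernel_reads = rep[0];    // read end of the reply pipe
    c.process_writes = rep[1];  // write end of the reply pipe
    return true;
}
```
.EE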
.P
The most important request is mounting.  Most important, because it is
the only request that the process has to implement (of course, not
implementing anything else would be completely useless).  The request
itself is not that complex.  All it does is return a handle of the
inode at the root of the filesystem.  Most commonly, this will be a
directory.  Userfs does not enforce this, but the kernel itself may.
.P
After the process returns the root handle, the kernel will probe the
process to see what operations it is willing to support.  This is done
by sending a series of enquire packets to the process.  The process
should reply with normal reply packets, with the errno field either
set to 0 if it is supported or ENOSYS if it isn't.  No real operation
should be done, and no additional information should be sent in the
reply.  If the process replies ENOSYS to an operation, it will never
receive it again, and the kernel will use a sensible default for it
(typically what the kernel would normally do for an in-kernel
filesystem if it doesn't support the operation).  The filesystem
process should send 0 for the operations it explicitly supports, and
ENOSYS for everything else, so the protocol can be extended without
having to modify existing clients.
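.P
The rule for answering enquiries can be captured in a few lines.  The
sketch below is an illustration rather than library code: the Op
values and the Capabilities class are hypothetical, and only the
convention of replying 0 or ENOSYS comes from the protocol.
.CC
```cpp
#include <cassert>
#include <cerrno>
#include <initializer_list>
#include <set>

// Hypothetical operation codes; the real ones come from the protocol.
enum Op { OP_IREAD, OP_LOOKUP, OP_READDIR, OP_READ, OP_WRITE };

class Capabilities {
public:
    Capabilities(std::initializer_list<int> ops) : supported_(ops) {}

    // Error value for the reply to an enquiry: 0 for operations that
    // are explicitly supported, ENOSYS for everything else, including
    // operations added to the protocol after this client was written.
    int enquire(int op) const {
        return supported_.count(op) ? 0 : ENOSYS;
    }

private:
    std::set<int> supported_;
};
```
.EE
Listing the supported operations, rather than the unsupported ones,
is what lets an old client keep working against a newer kernel.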

.H 3 "Reading Inodes"
A pretty common (perhaps most common) thing for the filesystem to be
doing is reading inodes.  For the process, this involves filling out a
structure much like the kernel's inode structure and the stat
structure.  For this version, the important thing is to make sure the
nlinks field is non-zero.  If it is 0 then the kernel will never "put"
the inode, and it will make the filesystem un-umountable.
.P
When the kernel wants an inode from the filesystem, it uses the
.B upp_iread
protocol request to fetch it.  This happens if something in the kernel
asks for the inode, but it isn't already in the kernel inode table.
Therefore, once the kernel has asked the filesystem for an inode, it
will not ask for it again while anything in the kernel is using it.
.P
Once nothing in the kernel is using the inode, the kernel will
issue an
.B upp_iput
operation, which may be preceded by an
.B upp_iwrite
if the inode was modified in use.  A filesystem need not implement
these operations if there is no need to do so.
.H 3 "Open and Close"
Reading and putting inodes are the basic operations: regardless of
what an inode is being used for it will be read and put.  The
.B upp_open " and " upp_close
operations specifically correspond to the
.BR open (2)
and
.BR close (2)
system calls.  Normally a filesystem doesn't need to perform any
special handling for these operations, and would not normally
implement them, except if it wants to know the identity of the process
doing the operations.  When a program issues an open system call for a
file on the user filesystem, the kernel will send a
.I upp_open
operation for the file, which includes complete identification for the
process which issued the open.  When the filesystem replies it returns
a
.I "credentials token."
From then on, that credentials token is sent to the filesystem in all
operations which correspond to a system call which takes a file
descriptor as an argument, such as
.BR read ", " write ", " readdir ", " lseek
and so on.
.P
This may seem a bit complex: why not just send the uid of the process
with the operations?  Well, the credentials of a process are quite
complex, since they include the real, saved and effective uids and
gids of the process, and all the auxiliary groups.  Sending this with
each request would be quite an overhead.  The idea is that all the
info is sent on an open, and the filesystem process can associate it
with a token internally, and only use the token in correspondence with
the kernel.
.P
Also note that the credentials are associated with an open file
descriptor, not the process performing the operation.  Mostly a
process will deal with file descriptors it has created itself, but it's
quite possible that it can inherit file descriptors from another
process with a different set of credentials.  In this case the
filesystem knows the original process's credentials, but not those of the
process which is performing the operation.
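.P
Inside the filesystem process, the association between tokens and
credentials can be a simple table.  The following sketch assumes a
Credentials layout and a CredTable class that are purely
illustrative, and it assumes the token can be dropped when the
matching close arrives.
.CC
```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <vector>

// Hypothetical layout of what the kernel sends with an open.
struct Credentials {
    std::uint32_t uid, euid, suid;
    std::uint32_t gid, egid, sgid;
    std::vector<std::uint32_t> groups;
};

class CredTable {
public:
    // Called for an open: store the credentials, hand back a token.
    std::uint32_t on_open(const Credentials &c) {
        std::uint32_t token = next_++;
        table_[token] = c;
        return token;
    }

    // Called for later operations that carry only the token.
    // Returns a null pointer for an unknown token.
    const Credentials *find(std::uint32_t token) const {
        auto it = table_.find(token);
        return it == table_.end() ? nullptr : &it->second;
    }

    // Called for a close (assumed to end the token's life).
    void on_close(std::uint32_t token) { table_.erase(token); }

private:
    std::map<std::uint32_t, Credentials> table_;
    std::uint32_t next_ = 1;
};
```
.EE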
.H 3 "Handle Management"
The handle of an inode is the only way the kernel and the filesystem can
talk about a file.  An inode may have more than one name, or no names
at all, so file names are not a good way of keeping track of a file.
Use inodes in your filesystem code to keep track of files, even if you
have a simple 1:1 name to file mapping.
.P
Handles must also be consistent.  Of course you must always keep the
handles of files currently in use consistent, but you must also keep
them consistent between uses.  If a process opens a file once, closes
it and then reopens it, then it will expect it to have the same inode
number if it is supposed to be the same file (which is how processes
using a user filesystem will see the file handles).
.P 
Also, if you ever refer to a handle in communication with the kernel,
you must be prepared for the kernel to ask about it.  For example, if
the kernel reads a directory with the
.B upp_readdir " or " upp_multireaddir
operations, each entry in the reply will have a name and a handle.
Each of those handles must be the handle of the file if the kernel
looks at the file more closely.  If you make them all the same, for
example, then a program would be entitled to believe that all the
names in the directory refer to one actual file.
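.P
A simple way to satisfy these rules is to derive each handle from
some stable identity of the underlying object, and to remember the
assignment for as long as the filesystem is mounted.  The sketch below
assumes such an identity can be expressed as a string key (for
example a device and inode number pair for a mirrored tree); the
HandleTable class and the key format are hypothetical.
.CC
```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <string>

class HandleTable {
public:
    // Returns the handle for an object, allocating one the first time
    // the object is seen.  The same key always gives the same handle,
    // and different keys never share one.
    std::uint32_t handle_for(const std::string &key) {
        auto it = by_key_.find(key);
        if (it != by_key_.end())
            return it->second;
        std::uint32_t h = next_++;
        by_key_[key] = h;
        return h;
    }

private:
    std::map<std::string, std::uint32_t> by_key_;
    std::uint32_t next_ = 1;
};
```
.EE
Note that the key identifies the file, not one of its names: two
names for the same file must map to the same key, and so to the same
handle.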
.H 3 "Dealing with muserfs"
Writing a client which can be handled by muserfs is very easy.  The
important thing to remember is that the filesystem process can
basically ignore muserfs, and ignore issues like how to quit and so
on.
.P
A userfs filesystem process should only terminate under one condition:
it gets an EOF (a read of 0 bytes) from the kernel on the file
descriptor it's reading operation requests from.  Muserfs will execute
it so that most signals are ignored, so it can handle them itself.
When the muserfs process is sent a SIGINT or SIGTERM it unmounts the
filesystem mount point with the
.BR umount (8)
command (used so that /etc/mtab is updated properly).  This causes the
kernel to send the filesystem process a
.B upp_umount
operation.  The kernel will close its end of the file descriptors, and
the process is expected to do the same, if only by exiting.
Therefore, when trying to unmount a userfs filesystem, do not kill the
filesystem process directly, and do not kill muserfs with SIGKILL.
Either way you should be able to unmount with
.B umount
as root.
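.P
The termination rule amounts to a main loop that keeps reading until
a read returns 0 bytes.  The sketch below shows only that skeleton,
using plain read(2); serve_until_eof is a hypothetical name, and a
real client would decode and dispatch each request where the comment
indicates.
.CC
```cpp
#include <cassert>
#include <cerrno>
#include <unistd.h>

// Reads from the request descriptor until the other end is closed.
// Returns the number of bytes seen, or -1 on a read error.
inline long serve_until_eof(int fd) {
    char buf[4096];
    long total = 0;
    for (;;) {
        ssize_t n = read(fd, buf, sizeof buf);
        if (n == 0)
            return total;       // EOF: the only reason to stop
        if (n < 0) {
            if (errno == EINTR)
                continue;       // interrupted by a signal, try again
            return -1;
        }
        total += n;             // decode and dispatch requests here
    }
}
```
.EE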
.H 1 "Using libuserfs"
.I libuserfs
is a C++ library designed to make writing filesystem clients easier.
It is designed so all the work common to almost all filesystems is
encapsulated into a few generic classes, which can be used as base
classes for specific filesystem functions.
.H 2 "Basic Classes"
The most basic classes,
.B Comm ", " Filesystem " and " Inode
implement the basic communication with the kernel and stub methods for
each operation.
.P
The Comm class reads from the kernel and decodes the headers of the
operation packets, and passes the remainder to the Filesystem class.
The Filesystem performs the operation and returns an unencoded return
header and the encoded body of the reply, if any.  All this is not
exposed to the code above the library.
.P
Filesystem takes each operation and dispatches it to the appropriate
place.  The Filesystem class directly handles the operations which are
global to the whole filesystem, such as mounting or unmounting.
For operations which pertain to a particular Inode (such as reading,
or looking up a name in a directory), Filesystem looks
up the Inode in its table and dispatches the operation to it.
.P
The Inode class has all its methods implemented as stubs which fail
with the "not implemented" error code.  It also has members for the
standard inode properties of mode, type, size, ownership, links,
timestamps and so on.
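.P
The division of labour between the two classes can be pictured with
the following stripped-down sketch.  It does not reproduce the
libuserfs declarations: SketchInode, SketchFilesystem and FileInode
are hypothetical, the method signatures are reduced to nothing, and
returning ENOENT for an unknown handle is an assumption.
.CC
```cpp
#include <cassert>
#include <cerrno>
#include <cstdint>
#include <map>

class SketchInode {
public:
    virtual ~SketchInode() {}
    // Stubs: each operation fails unless a derived class overrides it.
    virtual int do_read() { return ENOSYS; }
    virtual int do_lookup() { return ENOSYS; }
};

// A derived inode that implements reading only.
class FileInode : public SketchInode {
public:
    int do_read() override { return 0; }
};

class SketchFilesystem {
public:
    void add(std::uint32_t handle, SketchInode *ino) {
        table_[handle] = ino;
    }

    // Per-inode operations: find the inode by handle and pass it on.
    int dispatch_read(std::uint32_t handle) {
        SketchInode *ino = find(handle);
        return ino ? ino->do_read() : ENOENT;
    }
    int dispatch_lookup(std::uint32_t handle) {
        SketchInode *ino = find(handle);
        return ino ? ino->do_lookup() : ENOENT;
    }

private:
    SketchInode *find(std::uint32_t handle) const {
        auto it = table_.find(handle);
        return it == table_.end() ? nullptr : it->second;
    }
    std::map<std::uint32_t, SketchInode *> table_;
};
```
.EE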
.P
These classes are completely useless on their own, so they must be
used as base classes for other classes which actually do something.
.I libuserfs 
has more specific, but still generally useful classes.
.P
.B SimpleInode
implements a simple inode with some normally expected behaviour.
It has a constructor which initializes the inode properties to
sensible values, and methods which implement simple defaults for
the open, close and permissions check operations.
.P
.B DirInode ,
derived from SimpleInode, implements all the operations needed for
a directory, including linking and unlinking inodes to/from names,
rename, and directory scanning and lookup.  It takes very little
extra code to implement simple directory behaviour.
.H 2 "Writing your own filesystem classes"
A complete filesystem has two parts: a collection of inodes, one
for each file, and the filesystem structure itself, which holds all
the inodes together.  Each inode represents a file in the filesystem,
regardless of type.  There is only one inode per file in the filesystem,
even if the file appears multiple times under different names.
.H 3 "Arguments and return values of operation methods"
Each method with the name 
.B do_something
in the Filesystem and Inode classes corresponds to an operation
in the userfs protocol.  As a result, they all have similar
argument structures.  All such methods have
.B "const up_preamble &pre" " and " "upp_repl &repl"
which are references to the operation request and reply packet
headers.  Mostly there is no reason for operation methods to
use them, because their contents are dealt with in lower
levels of the library, but they are there if you want them.
.P
Each userfs protocol operation may have arguments, return values,
both or neither, and the method for that operation will have
corresponding arguments.  For an operation named
.I x
the method argument with the operation arguments will have the
type \fBconst upp_\fIx\fP_s\fR,
and the return values argument will have the type
\fBupp_\fIx\fP_r\fR.
For example, the up_read operation will correspond to the Inode
method
.CC
int Inode::do_read(const up_preamble &pre, upp_repl &repl,
                   const upp_read_s &args, upp_read_r &ret);
.EE
.P
The contents of the structures, along with encoding and decoding
functions, are machine generated, and therefore have a consistent
set of rules.  Mostly it's quite simple, with normal base types
directly corresponding to C and C++ types.  However, variable
sized types need to have both a pointer to the data and the
size of the data encoded into them.  Memory for the data is managed
with the C++ new and delete operators, and is allocated with the
.B alloc
method of a variable sized object.  The memory is automatically freed
by the method's caller.  For example, if a return value of a method
contains a member called
.B name
representing a filename, it would be set with the following sequence
(assuming
.B ourname
is a normal 0 terminated string):
.CC
int namelen = strlen(ourname);
ret.name.alloc(namelen);                  // Allocate memory
ret.name.nelem = namelen;                 // Set name length
memcpy(ret.name.elems, ourname, namelen); // Set name contents
// ...
.EE
Note that strings are never 0 terminated; the length of the returned
string is exactly the number of characters in the string.
.P
If the operation the method is performing fails, it should return the
appropriate error code, or 0 if it succeeds.  Don't return -1 unless
you mean to \- it has special meaning (see below, in "Deferring
Replies").
.H 3 "Deriving from Filesystem"
A class derived from Filesystem must implement a number of methods to make the
filesystem viable:
.BL
.LI "\fIEnquire\fP"
is called when the kernel wants to find what operations your
filesystem supports.  For all the operations that any inode will
implement, return 0 and return ENOSYS for the rest.
.LI "\fIdo_mount\fP"
takes no arguments and returns the handle for the inode for the
root directory (that is, the top directory of your filesystem).
The kernel immediately does a 
.B do_iread
operation using this handle.
.LE
You can also implement
.I do_statfs
which allows the kernel to get space and inode usage statistics,
such as when "df" is executed, and
.I do_umount
so the filesystem is formally informed when it is unmounted (normally
it just gets an EOF from the kernel, and Comm::Run returns).
.H 3 "Deriving from Inode"
Most of the work of the filesystem is done in the inodes.
All inode classes must be derived from Inode, and generally
there will be a number of different Inode based classes.
.P
It is probably better to use SimpleInode as a base rather than
plain Inode, because it implements simple defaults for some
methods, which would otherwise fail.  If Filesystem::Enquire
says that the filesystem supports a particular operation, then
any inode should be prepared to get that operation from the
kernel.
.P
Similarly, unless you are doing something special, deriving
directories from DirInode saves a lot of work.
.P
Only 
.I do_iread
need be implemented, but obviously the filesystem will do nothing
interesting unless other operations are implemented.  do_iread returns
the details of the inode.  Note that the Filesystem class calls the
do_iread of the Inode when the operation comes from the kernel, so the
inode must exist by the time the kernel asks for it.  The constructor
for Inode automatically registers the inode in the Filesystem's inode
table; conversely, the destructor removes it.
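.P
The registration behaviour described here follows a familiar C++
pattern, sketched below with hypothetical classes RegInode and
RegFilesystem.  The real constructor arguments and table type may
well differ; the point is only that an inode is findable by handle
for exactly as long as the object exists.
.CC
```cpp
#include <cassert>
#include <cstdint>
#include <map>

class RegFilesystem;

class RegInode {
public:
    RegInode(RegFilesystem &fs, std::uint32_t handle);
    virtual ~RegInode();
    std::uint32_t handle() const { return handle_; }

private:
    RegFilesystem &fs_;
    std::uint32_t handle_;
};

class RegFilesystem {
public:
    void add(RegInode *ino) { table_[ino->handle()] = ino; }
    void remove(RegInode *ino) { table_.erase(ino->handle()); }

    // Returns a null pointer if no inode has this handle.
    RegInode *find(std::uint32_t handle) const {
        auto it = table_.find(handle);
        return it == table_.end() ? nullptr : it->second;
    }

private:
    std::map<std::uint32_t, RegInode *> table_;
};

// The constructor registers the inode; the destructor removes it.
inline RegInode::RegInode(RegFilesystem &fs, std::uint32_t handle)
    : fs_(fs), handle_(handle) {
    fs_.add(this);
}

inline RegInode::~RegInode() { fs_.remove(this); }
```
.EE
In practice this means the root inode has to be constructed before
its handle is returned from the mount operation.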
.P
Here are some other useful methods for an Inode; the descriptions
are brief and general, and don't necessarily refer to all the arguments
and return values; those not mentioned can generally be ignored.
.BL
.LI "\fIdo_iwrite\fP"
is, obviously, the opposite of do_iread.  It simply sets the various
Inode values.
.LI "\fIdo_iput\fP"
is called when the kernel is no longer using the inode.  That is,
the inode is no longer open, the current or root directory of a
process, being executed from or being mapped from.  If an inode
is iput and has no names (has no name to inode mapping in any
directory) it can be destroyed.
.LI "\fIdo_read\fP"
allows data to be read from the file.  The arguments are the offset in
the file to start reading from, and the number of bytes desired.  The
method may return as many bytes up to that number as it likes,
including 0, which means EOF.
.LI "\fIdo_write\fP"
does the converse; a block of data and an offset is passed in, and
the method returns the number of bytes actually written.
.LI "\fIdo_lookup\fP"
translates a name into an inode reference.  This is typically
implemented for directories; if the name exists in the directory
the method should return the handle of the inode, or fail with
ENOENT.
.LI "\fIdo_readdir\fP"
returns the next directory entry at the passed offset.  It returns
the name and inode of the next file in the directory, and the
size of the entry returned. This is added by the kernel to the
current offset in the directory to form the offset of the next
directory entry for the next call.  Since the directory entries
don't correspond to real file storage as in other, more conventional
filesystems, a directory entry can be regarded as having a
size of 1.
.P
If the end of the directory has been reached, it should return a new
offset of 0.
.LI "\fIdo_multireaddir\fP"
is similar to do_readdir, but can return any number of directory
entries, which are cached in a readahead buffer in the kernel.  If a
program asks for a directory entry for an inode which has a cached
directory entry then the entry will come from within the kernel rather
than asking the filesystem process.  This operation can return as few
as 1 entry (and so is like do_readdir), or as many as will fit in a
return packet (up to 4k or so of entries).  Returning no entries means
the end of the directory has been reached.  Returning multiple entries
improves the performance
of directory scans, most frequently done by ls and pwd.
.P
Look at the implementation of DirInode::do_multireaddir for details
of how this should be dealt with.
.LI "\fIdo_create\fP"
does all file creation, whether it be a normal file, a directory,
a fifo file or a device node.  The mode contains the type of the
file in the same way as the stat structure member
.BR st_mode .
.LI "\fIdo_unlink\fP"
is the opposite, and is used for unlinking files and directories
(removing a name to inode mapping).  If an inode is not in use and
has no links then it can be destroyed and its handle can be reused.
.LI "\fIdo_symlink\fP"
is used to create new symlink inodes.  It returns the handle
of the new inode.
.LI "\fIdo_readlink\fP"
returns the pathname which a symbolic link is pointing to.
.LI "\fIdo_followlink\fP"
returns the pathname of the file a symbolic link is really referring
to.  If Filesystem::Enquire says the filesystem does not support
this operation, the readlink operation is used instead.
.LI "\fIdo_open\fP"
is called when a file is actually opened.  It is only necessary
to implement this if it is important to know whether a file is
being opened as opposed to being used in any other way.  This
operation passes the filesystem the complete authentication
credentials of the process doing the open, so that the filesystem
can do extended security checking or change the behaviour of the
file depending on the user.
.P
This method can return a credential token, which is a magic number
used by the filesystem process to refer to the set of credentials
passed by the kernel.  The kernel attaches this credentials token to
each operation generated by system calls on the file descriptor
generated by the open (read(), write(), readdir() and close()).  The
credentials token is part of the file descriptor, so is inherited
unchanged if the descriptor is passed to another process, even if it
has different credentials.
.P
When a file is opened, a new file table entry for the inode is
created.  That file table entry has a single file descriptor
referring to it.  More file descriptors can be made to refer to
the file table entry with the
.BR dup (2)
system call, and can be removed with
.BR close (2).
.LI "\fIdo_close\fP"
is called when the last file descriptor for a file table entry
is closed.  The only argument for this is the credentials token
for that file table entry, so that the filesystem can free all
references to it.
.LI "\fIdo_permission\fP"
is called when the filesystem says it wants to do permissions
checking.  This is called a lot, and can cause many more operations
to pass between the kernel and filesystem process.  If the filesystem
does not implement it the normal unix user/group/others checking
is performed.
.LI "\fIdo_rename\fP"
moves a file from one directory to a new one (though it may be the
same).
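.LE
.P
The offset and return-value conventions of do_read and do_write can be
illustrated without the library.  The following is only a sketch:
MemFile, its member functions and read_all are hypothetical names
invented for this example, and the signatures are not those of the
userfs operation methods.
.CC
#include <algorithm>
#include <cstddef>
#include <string>

// Hypothetical in-memory file; not a userfs class.
struct MemFile {
    std::string data;

    // Copy up to count bytes starting at off into buf.
    // Returns the number of bytes copied; 0 means EOF.
    std::size_t read(std::size_t off, char *buf,
                     std::size_t count) const {
        if (off >= data.size())
            return 0;
        std::size_t n = std::min(count, data.size() - off);
        data.copy(buf, n, off);
        return n;
    }

    // Store count bytes at off, growing the file if needed.
    // Returns the number of bytes actually written.
    std::size_t write(std::size_t off, const char *buf,
                      std::size_t count) {
        if (off + count > data.size())
            data.resize(off + count);   // any gap is zero-filled
        data.replace(off, count, buf, count);
        return count;
    }
};

// What a caller does: keep reading until a read returns 0 (EOF),
// accepting short reads along the way.
std::string read_all(const MemFile &f) {
    std::string out;
    char buf[4];
    std::size_t off = 0, n;
    while ((n = f.read(off, buf, sizeof buf)) > 0) {
        out.append(buf, n);
        off += n;
    }
    return out;
}
.EE
.P
The caller never assumes that a read returns everything it asked for;
it advances the offset by the count returned and stops only when that
count is 0.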
.H 3 "Deriving from DirInode"
DirInode implements a number of userfs operation methods for
directories, such as readdir, multireaddir and lookup.  It
also automatically constructs directories with "." and ".."
entries pointing to the appropriate places.
.P
DirInode deals with strings a lot, and rather than using the
normal
.B "char *"
it uses the libg++
.B String
class for all string arguments to its own methods (but not,
of course, for the userfs protocol operation methods).
.P
DirInode expects a pointer to the parent directory,
which is also a class derived from DirInode.  If the directory
is at the top of the filesystem's tree, it should be a NULL
pointer.  The protected member
.B parent
points to the parent inode, or
.B this
for the top one.  It should never be NULL.
.P
DirInode keeps a list of files in the directory, but does not
allow that list to be directly visible.  The only operations
for manipulating the directory contents for a derived class
are:
.BL
.LI "\fBint link(const String name, Inode *)\fP"
which links a new name into the directory, updating all the reference
and link counts;
.LI "\fBint unlink(const String name)\fP"
which does the opposite;
.LI "\fBDirEntry *lookup(const String name)\fP"
which returns a directory entry if it finds the file, or NULL
otherwise; and
.LI "\fBDirEntry *scan(Pix &pos)\fP"
which returns the directory entry at
.IR pos ,
updating it in the process, or NULL if there are no more entries.
.LI "\fBDirEntry *scan(int &pos)\fP"
is the same, except it uses an integer offset, which is less
efficient.
.LE
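.P
The shape of these operations can be modelled with standard C++ types.
This is a sketch under stated assumptions, not the library code: Node,
Entry and Dir are hypothetical stand-ins for Inode, DirEntry and
DirInode, std::string replaces the libg++ String class, and the 0 and
-1 return values are assumed for illustration.
.CC
#include <cstddef>
#include <string>
#include <vector>

struct Node { int nlink = 0; };      // stand-in for Inode

struct Entry {                       // stand-in for DirEntry
    std::string name;
    Node *node;
};

class Dir {                          // stand-in for DirInode
    std::vector<Entry> entries;      // not directly visible
public:
    // Find an entry by name, or return a null pointer.
    Entry *lookup(const std::string &name) {
        for (Entry &e : entries)
            if (e.name == name)
                return &e;
        return nullptr;
    }
    // Add a name and bump the link count.
    // Returns 0, or -1 if the name already exists (assumed).
    int link(const std::string &name, Node *n) {
        if (lookup(name))
            return -1;
        entries.push_back(Entry{name, n});
        n->nlink++;
        return 0;
    }
    // Remove a name and drop the link count.
    // Returns 0, or -1 if the name is not present (assumed).
    int unlink(const std::string &name) {
        for (std::size_t i = 0; i < entries.size(); i++) {
            if (entries[i].name == name) {
                entries[i].node->nlink--;
                entries.erase(entries.begin() + i);
                return 0;
            }
        }
        return -1;
    }
    // Return the entry at pos and advance pos,
    // or a null pointer when there are no more entries.
    Entry *scan(int &pos) {
        if (pos < 0 ||
            static_cast<std::size_t>(pos) >= entries.size())
            return nullptr;
        return &entries[pos++];
    }
};
.EE
.P
A derived class would typically walk a directory by starting with a
position of 0 and calling scan until it returns a null pointer.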
.H 2 "Communications classes"
.P
There are a number of communications classes in the library,
which provide different ways of multiplexing replies.
.P
The simplest is the Comm class, which just takes each request,
passes it to the filesystem and sends back the reply.  There are more
complex comms classes though.
.H 3 "File Descriptor Dispatcher"
.P
The
.B CommBase
class (base of all comms classes) provides a dispatcher which allows
classes to register interest in activity on file descriptors.
This is used internally to get input from the kernel, but can be used
by a filesystem to monitor any file descriptor for any reason.
To do it, simply derive a dispatcher class from
.B DispatchFD
and call
.B "struct disp_fd addDispatch(int fd, DispatchFD *, int what)" ,
where what can be one or more of
.I DISP_R ", " DISP_W " or " DISP_E ,
for interest in read ready, write ready or exceptions.
When an event occurs, the
.B "DispatchFD::dispatch(int fd, int what)"
method of the registered class is called.  If it returns 0 then
it is removed from the dispatch list.  If it returns -1 it indicates
an error; it is removed, and
.B "CommBase::Run()"
returns.  Returning 1 is a normal return.
.P
.B "CommBase::Run()"
returns normally when there are no more entries on the dispatch list.
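.P
The return-value protocol described above can be modelled in a few
lines using poll(2).  This is a stand-alone sketch, not the CommBase
implementation: DispatchFD and the DISP_ flags are re-declared locally
with assumed values, MiniDispatcher and ByteCounter are hypothetical,
and addDispatch is simplified to return nothing.
.CC
#include <poll.h>
#include <unistd.h>
#include <cstddef>
#include <vector>

// Local stand-ins; the numeric values are assumptions.
enum { DISP_R = 1, DISP_W = 2, DISP_E = 4 };

struct DispatchFD {
    virtual ~DispatchFD() {}
    // 1: keep watching, 0: remove this entry, -1: error
    virtual int dispatch(int fd, int what) = 0;
};

class MiniDispatcher {
    struct Slot { int fd; DispatchFD *handler; int what; };
    std::vector<Slot> slots;
public:
    void addDispatch(int fd, DispatchFD *h, int what) {
        slots.push_back(Slot{fd, h, what});
    }
    // Returns 0 once the dispatch list is empty,
    // or -1 if poll fails or a handler reports an error.
    int Run() {
        while (!slots.empty()) {
            std::vector<pollfd> pfds;
            for (const Slot &s : slots) {
                pollfd p;
                p.fd = s.fd;
                p.events = 0;
                p.revents = 0;
                if (s.what & DISP_R) p.events |= POLLIN;
                if (s.what & DISP_W) p.events |= POLLOUT;
                if (s.what & DISP_E) p.events |= POLLPRI;
                pfds.push_back(p);
            }
            if (poll(pfds.data(), pfds.size(), -1) < 0)
                return -1;
            // Walk backwards so erasing does not shift
            // entries that have not been visited yet.
            for (std::size_t i = pfds.size(); i-- > 0; ) {
                short re = pfds[i].revents;
                int what = 0;
                if (re & (POLLIN | POLLHUP)) what |= DISP_R;
                if (re & POLLOUT) what |= DISP_W;
                if (re & (POLLPRI | POLLERR | POLLNVAL))
                    what |= DISP_E;
                if (what == 0)
                    continue;
                int r = slots[i].handler->dispatch(slots[i].fd,
                                                   what);
                if (r <= 0)
                    slots.erase(slots.begin() + i);
                if (r < 0)
                    return -1;
            }
        }
        return 0;
    }
};

// Example handler: counts bytes and asks to be removed at EOF.
struct ByteCounter : DispatchFD {
    long total = 0;
    int dispatch(int fd, int) override {
        char buf[256];
        ssize_t n = read(fd, buf, sizeof buf);
        if (n < 0) return -1;
        if (n == 0) return 0;
        total += n;
        return 1;
    }
};
.EE
.P
Registering a ByteCounter for DISP_R on the read end of a pipe and
calling Run drains the pipe; Run returns once the handler has seen end
of file and removed itself by returning 0.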
.H 3 "Deferring Replies"
.P
In normal operation, the filesystem processes one request at a time,
so each operation is replied to before the next is looked at.  This is
a convention of the way the user code works, and not something the
kernel enforces.  It just sends requests as processes using the
filesystem need them, and they block until their
particular request is replied to.  Therefore, it is possible for
multiple processes to use the filesystem at once.
.P
The 
.B DeferComm " and " DeferFilesys
classes have a method called
.B DeferReply
(the DeferFilesys one just calls the Comm one to make it accessible
to things within the filesystem).  DeferReply forks the filesystem; on
the child side it returns 0 and in the parent it returns the pid of
the child.  If the operation method returns -1 then the Filesystem
just goes on to processing the next request from the kernel.  When the
child is ready to reply, it can just return in the normal way.  The
call to DeferReply sets up the DeferComm class in the child process to
reply through the parent rather than going straight to the kernel, in
order to make sure the replies from multiple processes don't get
jumbled up.  When the reply has been sent back, the child process just
exits.
.P
Because the child is really a child process, you have to do all the
changes in filesystem state before calling DeferReply, or arrange for
some other mechanism for the parent and children to talk.
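.P
The fork-and-reply-through-the-parent pattern, and the reason state
changes must come first, can be shown with plain fork(2) and pipe(2).
This is a model only: defer, slow_operation and requests_seen are
hypothetical names, and the pipe stands in for whatever channel
DeferComm really uses between child and parent.
.CC
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>
#include <cstddef>
#include <string>

int requests_seen = 0;   // filesystem state, copied by fork

// Stand-in for DeferReply: returns 0 in the child and the
// child's pid in the parent (-1 on failure is an assumption).
// reply_fd is the write end in the child, the read end in
// the parent.
pid_t defer(int &reply_fd) {
    int p[2];
    if (pipe(p) < 0)
        return -1;
    pid_t pid = fork();
    if (pid < 0) {
        close(p[0]);
        close(p[1]);
        return -1;
    }
    if (pid == 0) {
        close(p[0]);
        reply_fd = p[1];
    } else {
        close(p[1]);
        reply_fd = p[0];
    }
    return pid;
}

std::string slow_operation() {
    int fd = -1;
    pid_t pid = defer(fd);
    if (pid < 0)
        return std::string();
    if (pid == 0) {
        requests_seen++;     // changes only the child's copy
        const char reply[] = "done";
        ssize_t w = write(fd, reply, sizeof reply - 1);
        _exit(w < 0 ? 1 : 0);
    }
    // A real filesystem would go on to the next request here;
    // this model just waits for the deferred reply.
    char buf[16];
    ssize_t n = read(fd, buf, sizeof buf);
    close(fd);
    waitpid(pid, nullptr, 0);
    if (n <= 0)
        return std::string();
    return std::string(buf, static_cast<std::size_t>(n));
}
.EE
.P
After slow_operation returns in the parent, requests_seen is still 0:
the increment happened in the child's copy of the address space and was
lost when the child exited.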
.H 3 "Multi-threaded filesystems"
.P
The
.B ThreadComm
class creates a new lightweight thread for each request, using the Rex
lwp library (in the lwp directory).  This allows multiple requests to
be handled within the one process, so long as one thread does not
block the whole process in a system call.  The file descriptor
dispatcher in CommBase is useful for preventing this.
