This file is very out of date, and very incomplete, but it still might be
useful.

Copyright (C) International Computer Science Institute, 1990

The Sather Implementation

Stephen M. Omohundro

International Computer Science Institute
1947 Center Street, Suite 600
Berkeley, California 94704
Phone: 415-643-9153
Internet: om@icsi.berkeley.edu
May 14, 1990


Abstract

This document describes and motivates the implementation decisions
made in the construction of the object-oriented language Sather. The
language itself is described in "The Sather Language".


Introduction

The Sather compiler is itself written in Sather. It was bootstrapped
through a less efficient version written in C. It is run with the
command "cs classname". It uses a control file named ".sather" to
locate the source code and guide the compilation process. It creates
or uses a directory named "classname.cs" in which it puts the C files
generated for each Sather source file, the compiled object binaries
and the executable program. Because memory is inexpensive and getting
more so, the compiler maintains structures for all source files in
memory and lets the virtual memory system do any swapping to disk that
may be required.  There are fifteen phases involved in compilation
which are briefly described here. The rest of this document gives the
details of these phases.


The Fifteen Phases

1. Find and read the ".sather" file as well as any "(include)" files
it specifies. Build up tables of Sather source files, C files, debug
keys, C macros to be inserted at the head of generated files, C names
to be generated for Sather objects, and other information needed
during the compilation.

2. Read in the specified Sather source files. Each file is passed
through a handwritten lexical analyzer and a YACC parser that builds
up a code tree for each Sather class. Identifier names are given
sequential integer indices and are inserted into an extendible hash
table. Internally names are all stored as integers and integer
equality is equivalent to name equality.

3. The definition of the C class is allowed to be distributed across
several files. This phase concatenates the feature lists for these C
classes. It also checks for other classes with multiple definitions
and returns an error if any are found.

4. At this point, many of the current class trees may describe
parameterized classes.  This phase determines what instantiations of
those parameters are actually necessary and assigns a unique integer
to each instantiated class. It works by starting at the root class
(which cannot be parameterized) and recording each instantiated class
reference uniquely. Any new classes found during this procedure are
recursively followed. When this phase ends there is an array of
objects, each representing either a simple class or an instantiated
parameterized class.

5. The original code trees are copied over to these instantiated class
objects, filling in any type parameter instantiations. When this phase
is finished, there will be no type parameters remaining.

6. The feature lists of inherited classes are recursively expanded
into the feature lists of their descendents.  Any feature with the
same name as a later feature is eliminated as is any feature declared
"UNDEFINED". 

7. Inherited routines are expanded. This has to happen after the
name duplications and class inheritance have been resolved.

8. Each class collects information about whether it is an array class
or not.

9. Compiler constants are evaluated. This is done recursively in a way
that detects looped dependencies.

10. Attribute offsets are computed.

11. The ancestor and descendent relations between classes are determined.

12. A code walk is performed which determines the referent of each
name and checks the types for consistency.

13. A code walk is performed, outputting C code for each instantiated class.

14. The system hash tables and macro files are output. Any included C
files are copied into the proper directory.

15. The makefile is generated and the C compiler is run on the
generated code. 
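
The looped-dependency check in phase 9 can be illustrated with a
three-state recursive evaluation. This is a minimal sketch under
stated assumptions, not the compiler's code: the "konst" structure,
the state names, and the rule that a constant is a literal plus at
most one other constant are all invented for the example.

```c
#include <assert.h>

#define MAX_CONSTS 8

enum state { UNSEEN, IN_PROGRESS, DONE };

/* Toy constant: a literal plus, optionally, the value of one other constant. */
struct konst {
    int literal;
    int depends_on;          /* index of another constant, or -1 */
    enum state state;
    int value;
};

/* Evaluate constant i; returns 0 on success, -1 if a dependency loop is found. */
int eval_const(struct konst *tab, int i)
{
    assert(tab != 0 && i >= 0 && i < MAX_CONSTS);
    if (tab[i].state == DONE)
        return 0;
    if (tab[i].state == IN_PROGRESS)
        return -1;                   /* reached again while still being evaluated */
    tab[i].state = IN_PROGRESS;
    tab[i].value = tab[i].literal;
    if (tab[i].depends_on >= 0) {
        if (eval_const(tab, tab[i].depends_on) != 0)
            return -1;
        tab[i].value += tab[tab[i].depends_on].value;
    }
    tab[i].state = DONE;
    return 0;
}
```

A constant that is reached again while it is still marked IN_PROGRESS
must depend on itself, so the evaluation reports an error instead of
recursing forever.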


The representation of objects

This section describes how Sather objects are represented in C.  The C
code generated by the Sather compiler does not use any of C's complex
type system (structures, typedefs, unions, etc.). All type checking and
layout is done in the Sather phase of the compilation. Sather type
BOOL is represented as a C "char" which has value either 0 or 1. CHAR
is represented by the C type "char", INT by the C "int", REAL by the C
"float" and DOUBLE by the C "double". All objects are represented by C
variables of type "ptr" which is typedefed to be "char *". Chars are
assumed to be one byte, ints, reals, and pointers 4 bytes, and
doubles 8 bytes. All objects are declared to be word aligned
arrays of chars. The first four bytes of each array hold an integer tag
which specifies the object's class. The BOOL and CHAR attributes come
next (to save space because of word alignment requirements). These are
followed by INT's, REAL's, DOUBLE's, and object pointers. Each
instantiated class is assigned a unique integer index which is the tag
for objects of that type. The class indices are assigned sequentially
and are all greater than 0. The first few indices are assigned to the
artificial indices ALL (1), SELF_TYPE (2), UNDEFINED (3), and VOID (4).
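
The layout can be sketched as follows, assuming 4-byte ints. Only the
macro names TYPE_, CATT_, and IATT_ come from this document; the macro
bodies, the example class index 5, and its attribute offsets are
assumptions made for illustration.

```c
#include <assert.h>
#include <stdlib.h>

typedef char *ptr;

/* Hypothetical accessor bodies; only the macro names appear in the text. */
#define TYPE_(ob)      (*(int *)(ob))
#define CATT_(ob, off) (*(char *)((ob) + (off)))
#define IATT_(ob, off) (*(int *)((ob) + (off)))

/* Assumed example class with index 5: tag, one BOOL, one CHAR, one INT.
   The BOOL and CHAR share a word; the INT starts on the next word. */
enum { EX_CLASS = 5, EX_FLAG_OFF = 4, EX_LETTER_OFF = 5, EX_COUNT_OFF = 8, EX_SIZE = 12 };

/* Allocate a zero-initialized object and stamp its class tag. */
ptr make_example(void)
{
    ptr ob = calloc(1, EX_SIZE);
    assert(ob != NULL);
    TYPE_(ob) = EX_CLASS;
    return ob;
}
```

Placing the one-byte attributes directly after the tag lets them share
a single word, which is the space saving mentioned above.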

The end of the object is used to hold a dynamically variable array for
classes of array type. One-dimensional arrays start with an int giving
the current number of entries which is followed by the array contents.
Two-dimensional arrays have an int with the current size of the first
dimension followed by the current size of the second dimension, an
array of offsets from the head of the object to each array row (to
avoid multiplication on element access) and finally the array itself.
Three and four dimensional arrays are similar.  Sather strings are
just arrays of CHAR's which are 0 terminated. Thus a Sather string is
an ordinary C string preceded by an int specifying the tag and an int
specifying the allocated space.
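
The two-dimensional case can be sketched like this, assuming 4-byte
ints and a class with no attributes of its own. The function names
new_int2d and elt2d, and the exact field offsets, are hypothetical;
the real runtime entry point is new2_.

```c
#include <assert.h>
#include <stdlib.h>

typedef char *ptr;

/* Assumed layout: tag, rows, cols, one byte offset per row, then the rows. */
enum { ROWS_OFF = 4, COLS_OFF = 8, ROWTAB_OFF = 12 };

#define IATT_(ob, off) (*(int *)((ob) + (off)))

/* Build a rows x cols array of ints for class index cls. */
ptr new_int2d(int cls, int rows, int cols)
{
    int header = ROWTAB_OFF + rows * (int)sizeof(int);
    ptr ob;
    int r;
    assert(rows > 0 && cols > 0);
    ob = calloc(1, (size_t)header + (size_t)rows * (size_t)cols * sizeof(int));
    assert(ob != NULL);
    IATT_(ob, 0) = cls;
    IATT_(ob, ROWS_OFF) = rows;
    IATT_(ob, COLS_OFF) = cols;
    /* Each table entry is the offset from the head of the object to a row. */
    for (r = 0; r < rows; r++)
        IATT_(ob, ROWTAB_OFF + r * (int)sizeof(int)) = header + r * cols * (int)sizeof(int);
    return ob;
}

/* Element access: look up the row offset, then index within the row.
   No multiplication by the row length is needed. */
int *elt2d(ptr ob, int r, int c)
{
    int row_off = IATT_(ob, ROWTAB_OFF + r * (int)sizeof(int));
    return (int *)(ob + row_off) + c;
}
```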


Runtime environment

Sather creates (or uses if one already exists) a directory with a name
of the form "classname.cs". All of the C and object files needed to
compile a class are contained in this directory. There is a C file
associated with each instantiation of each class. It has a name which
is formed by concatenating the first three letters of the class with
the class number and "_.c" (eg. "CLA124_.c"). In addition, there will
be copies of any externally provided C files and a set of C files
which are generated as a part of every Sather system. The only builtin
header file is called "all_.h" and this is included in each Sather C
file. It defines a few common macros and declares the systemwide
tables. It also defines the debug keys specified in the compiler
directive file. The file "runtime_.c" defines the functions for
creating, copying, and extending objects. The file "gc_.c" defines the
garbage collection code. The file "main_.c" contains the C "main"
routine. It initializes the shared variables, sets up the system
tables, and starts the code execution. The file "makefile" is
generated to make the directory a self-contained portable unit. The
file "sather_macros.h" is generated if any Sather functions or shared
attributes must be called directly from user C code. This "#defines" a
C name to be the generated Sather name for the routine or attribute.


Tables

There are two runtime tables in Sather. The first is attr_table_. It
is an array of arrays, one for each class index. The entries in a
class's array are: the base size of the class in bytes not including
any array parts, the dimension of the array part, the C type of array
entries, the number of attributes, the C types of each attribute, and
the offset in bytes of each attribute. These are accessed by
ob_base_size_(i), ob_arr_dim_(i), ob_arr_ctype_(i), ob_attr_num_(i),
ob_attr_ctype_(i,j), and ob_attr_offset_(i,j). Each class defines its
own row of this table as "attr_ent_FOO12_". Main collects these
together into attr_table_. 
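
One possible shape for a row and its accessors is sketched below. The
order of the fields within a row follows the description above, but
the C type codes, the example values, and the macro bodies are
assumptions; only the names come from this document.

```c
#include <assert.h>

enum { CT_CHAR = 1, CT_INT = 2, CT_PTR = 5 };   /* hypothetical C type codes */
enum { NUM_ROWS = 13 };                         /* room for class index 12 */

/* Row for class 12: base size 16, no array part, three attributes,
   then the C type of each attribute, then the offset of each attribute. */
static int attr_ent_FOO12_[] = { 16, 0, 0, 3, CT_CHAR, CT_INT, CT_PTR, 4, 8, 12 };

/* Rows are indexed by class number; unused rows are left null here. */
static int *attr_table_[NUM_ROWS] = { [12] = attr_ent_FOO12_ };

#define ob_base_size_(i)      (attr_table_[i][0])
#define ob_arr_dim_(i)        (attr_table_[i][1])
#define ob_arr_ctype_(i)      (attr_table_[i][2])
#define ob_attr_num_(i)       (attr_table_[i][3])
#define ob_attr_ctype_(i, j)  (attr_table_[i][4 + (j)])
#define ob_attr_offset_(i, j) (attr_table_[i][4 + ob_attr_num_(i) + (j)])

/* Bounds-checked lookup of an attribute offset (a helper for this sketch). */
int attr_offset(int cls, int j)
{
    assert(cls > 0 && cls < NUM_ROWS && attr_table_[cls] != 0);
    assert(j >= 0 && j < ob_attr_num_(cls));
    return ob_attr_offset_(cls, j);
}
```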

The other table is dispatch_table_. It is a hash table with the keys
in even entries and the contents in the odd ones. It contains pointers
to routines, pointers to shareds, object attribute offsets, and
integer constants. As discussed above, each class has a unique index.
Attributes are also given a unique index which is the index of the
name as a string in the string table (in the compiler). The key for
the dispatch table is made by combining the class index and the
feature name index. The first 14 bits hold the class and the last 18
the name.  The length of this table is in dispatch_table_size_.
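
A sketch of the key construction and the even/odd table layout
follows. It assumes that "first" means the high-order bits, that a key
of 0 marks an empty slot (no class has index 0), and that collisions
are resolved by linear probing; none of these details is stated above.
The function names and the toy table size are hypothetical.

```c
#include <assert.h>

#define CLASS_BITS 14
#define NAME_BITS  18

/* Assumed packing: class index in the high 14 bits, name index in the low 18. */
unsigned make_key(unsigned cls, unsigned nm)
{
    assert(cls > 0 && cls < (1u << CLASS_BITS));
    assert(nm < (1u << NAME_BITS));
    return (cls << NAME_BITS) | nm;
}

/* Toy table: keys in even entries, contents in odd entries. */
#define TABLE_SLOTS 16
static unsigned table[2 * TABLE_SLOTS];

/* Insert with linear probing (the probing scheme is an assumption). */
void put_entry(unsigned key, unsigned value)
{
    unsigned i = key % TABLE_SLOTS;
    unsigned probes = 0;
    while (table[2 * i] != 0 && table[2 * i] != key) {
        i = (i + 1) % TABLE_SLOTS;
        probes++;
        assert(probes < TABLE_SLOTS);    /* the table must not be full */
    }
    table[2 * i] = key;
    table[2 * i + 1] = value;
}

/* Returns 1 and stores the contents in *value if key is present, else 0. */
int get_entry(unsigned key, unsigned *value)
{
    unsigned i = key % TABLE_SLOTS;
    unsigned probes;
    for (probes = 0; probes < TABLE_SLOTS && table[2 * i] != 0; probes++) {
        if (table[2 * i] == key) {
            *value = table[2 * i + 1];
            return 1;
        }
        i = (i + 1) % TABLE_SLOTS;
    }
    return 0;
}
```

With 14 bits for the class and 18 for the name, a key fits in one
32-bit word, so a key comparison is a single integer comparison.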


Naming conventions

This section describes the naming convention for the C names generated
by the Sather compiler.  In generating C code from another language
there is always the problem of avoiding the possibility of name
conflicts between the generated names and existing C names. In Sather
the rule is quite simple: as long as no C name or Sather name ends
with an underscore, there will be no conflict.

The C version of routines and shared variables are globally visible
and so their names must be made unique. In essence, we would like to
concatenate the class name to the beginning of the feature name, but
this would be too long. We could just use the class number but this
would make the generated code obscure and complicate the use of a
symbolic C debugger. Our compromise solution is to construct these
names by concatenating the first three letters of the class name, the
class number, an underscore, the feature name, and a final underscore.
Thus in the class "FOO_CLASS", the feature "bar" might receive the C
name "FOO123_bar_". The names of Sather constants never make it to C
because their value is determined by the Sather compiler. Similarly,
attribute access is done directly in terms of offsets and so they do
not receive C names.
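
The rule for feature names can be sketched with a single formatted
print. The helper name c_feature_name and its signature are
hypothetical; it only illustrates the concatenation described above.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Writes e.g. "FOO123_bar_" into buf.
   Returns 0 on success, -1 if buf is too small. */
int c_feature_name(char *buf, size_t size, const char *cls_name,
                   int cls_index, const char *feature)
{
    int n;
    assert(buf != NULL && cls_name != NULL && feature != NULL);
    /* "%.3s" copies at most the first three letters of the class name. */
    n = snprintf(buf, size, "%.3s%d_%s_", cls_name, cls_index, feature);
    return (n >= 0 && (size_t)n < size) ? 0 : -1;
}
```

For class "FOO_CLASS" with index 123 and feature "bar", the helper
writes "FOO123_bar_" into the buffer.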

The system uses other names ending in a single underscore which have a
form that cannot conflict with these class relative feature names.
Local variable names are followed by two underscores (because we
cannot ensure that they are not of the form of a routine name).  Thus
the local variable "i" becomes "i__". If the same local variable is
used in two places, we must add a numerical index. This is placed
between the two underscores: "i_12_". This form cannot conflict with
the other single underscore names. A similar convention is used for
function arguments and the two built-in locals: "res" and "self".
Generated locals have the form "gl13_".  System macros all end in an
underscore. Debug keys have a triple underscore: "keyname___". Here is
a list of the generated names:

Shared and routine features in the C class:   foo  ->  foo
Shared and routine features in class FOO:     feat ->  FOO123_feat_
Local variables, routine arguments:           var  ->  var__ or var_12_
Builtin locals res and self:                       ->  res__, self__
Generated locals:                                  ->  gl15_
Debug and assert keys:                        key  ->  key___
Row in attribute table:                            ->  attr_ent_FOO123_	
Target for goto (end of loop)                      ->  goto_tag_12_

The builtin C functions and macros are:

ptr new_(int cls): A 0 initialized object of given class type.
ptr new1_(int cls,as1), new2_(int cls,as1,as2), 
	new3_(int cls,as1,as2,as3), new4_(int cls,as1,as2,as3,as4): 
	An initialized array object of given class and dimensions.
ptr copy_(ptr p): A copy of object p. 
ptr extend1_(ptr p, int ns), extend2_(ptr p, int ns1,ns2),
	extend3_(ptr p, int ns1,ns2,ns3), 
	extend4_(ptr p, int ns1,ns2,ns3,ns4): An extended version of p.
ptr deep_copy_(ptr p): A copy of entire structure reachable from p.
int ob_size_(ptr p): The size of object p in bytes.
int *(attr_table_[]): table of information about classes.
int num_classes_: Number of classes in the system.
ob_base_size_(i): Macro base size of obs of class i.
ob_arr_dim_(i): Macro array dimensions of class i.
ob_arr_ctype_(i): Macro C type of array entries of class i.
ob_attr_num_(i): Macro number of attributes of class i.
ob_attr_ctype_(i,j): Macro C type of j'th attribute of class i.
ob_attr_offset_(i,j): Macro offset in bytes of j'th attribute of class i.

TYPE_(ob): Integer type of object ob. (Mustn't be void.)
ctype_size_
CATT_(ob,off): char in ob at offset off.
IATT_(ob,off): int in ob at offset off.
FATT_(ob,off): float in ob at offset off.
DATT_(ob,off): double in ob at offset off.
PATT_(ob,off): pointer in ob at offset off.


The Compiler Objects

In this section we describe the structure of the compiler classes for
representing code. We view the compilation process as first creating
for each class a list of tree structures which represent its features
and then performing successive transformations on these trees. Such
representations and transformations are ideal for object oriented
representation. At each level of component (eg. class, feature,
statement, expression) we define a generic class along with stubs of
operations we want to perform on components of that level. Each
actual component (eg. particular kinds of statements) inherits from this
generic class and defines appropriate versions of the operations for
themselves. We use the suffix "ob" (as in classob, etc.) to keep the
names of classes from being confused with source code. 

The topmost class is "CODEOB" and is inherited by all classes
representing code. It doesn't define any attributes, but defines
several operations needed on all code objects.

There are two classes for representing whole Sather classes: "CLASSOB"
and "ICLASSOB". CLASSOB's are directly created by the parser and
directly correspond to source text. The representation of
parameterized classes includes the type parameters as variables.
ICLASSOB's represent classes with all type parameters instantiated.
The ICLASSOB's are associated with the unique integer indices that
form the tags on objects. 

Type specifications are represented by descendents of TYPEOB.
S_TYPEOB's describe simple types, P_TYPEOB's describe parameterized
types, D_TYPEOB's describe dispatched types, and I_TYPEOB's describe
instantiated types. The parser produces structures built from S, P,
and D typeob's. At instantiation, these are all converted to
I_TYPEOB's. 

The features of a class are all represented by descendents of FEATOB.
A LST of FEATOB's is a FEATOB_LST. CINH_FEATOB's describe class
inheritance, FINH_FEATOB's describe feature inheritance,
SHARED_FEATOB's describe shared attributes, ATTR_FEATOB's describe
object attributes, ROUT_FEATOB's describe routines, and CONST_FEATOB's
describe constants.

The statements of a routine are all represented by descendents of
STMTOB. LDEC_STMTOB's describe local variable declarations,
ASSIGN_STMTOB's describe assignment statements, COND_STMTOB's describe
conditionals, LOOP_STMTOB's describe loops, SWITCH_STMTOB's describe
multi-way branches, BREAK_STMTOB's describe break statements,
RETURN_STMTOB's describe return statements, CALL_STMTOB's describe
function calls, ASSERT_STMTOB's describe assert statements,
DEBUG_STMTOB's describe debug statements, and LDECL_STMTOB's describe
lists of local declarations.

Expressions are all represented by descendents of EXPROB. ID_EXPROB's
describe identifiers, ID_ARG_EXPROB's describe identifiers with
arguments, CONST_EXPROB's describe constants, DOT_EXPROB's describe
dotted expressions, AREF_EXPROB's describe array references,
OP_EXPROB's describe operator expressions, CREF_EXPROB's describe
references to other class attributes.


The Lexical Analyzer

The lexer is handwritten in C for efficiency. Its basic structure is a
large switch statement to dispatch on the first character in a token.
Global variables hold the current character, the token buffer, and the
current line number. The token buffer doubles in size if its end is
reached. In this way the only limitation on the size of tokens is
available memory.
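
The doubling strategy can be sketched as follows. The global names
tok_buf, tok_len, and tok_cap, the helper functions, and the initial
size of 8 are assumptions; the text does not give the lexer's actual
names.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical lexer globals. */
static char  *tok_buf = NULL;
static size_t tok_len = 0;   /* characters currently in the token */
static size_t tok_cap = 0;   /* allocated size of tok_buf */

/* Append one character, doubling the buffer when its end is reached. */
void tok_push(char c)
{
    if (tok_len + 1 >= tok_cap) {        /* keep room for the terminating 0 */
        size_t new_cap = tok_cap ? 2 * tok_cap : 8;
        char *p = realloc(tok_buf, new_cap);
        assert(p != NULL);
        tok_buf = p;
        tok_cap = new_cap;
    }
    tok_buf[tok_len++] = c;
    tok_buf[tok_len] = '\0';
}

/* Start a new token, keeping the allocated buffer for reuse. */
void tok_reset(void)
{
    tok_len = 0;
    if (tok_buf != NULL)
        tok_buf[0] = '\0';
}
```

Because the capacity doubles each time, reading a token of n
characters costs a number of reallocations proportional to log n.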


The Identifier Table

Every alphanumeric token is assigned an integer index and is stored in
a string table. The index is stored in a hash table at a location
hashed on the string. In the data structure representing the source
code, this integer index is always used. This makes all references
take the same space, saving on allocation, reduces the space required,
makes it possible to efficiently switch on identifier names, and makes
the test for name equality an integer equality test. Literal
constants are kept as strings because there is never any need to test
whether two are the same and they can get large. In incremental
compilation the identifier table is preserved so that the same indices
are assigned in different compilations.
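
A minimal sketch of such a table is shown below. It uses fixed-size
arrays and linear probing to stay short, whereas the compiler's table
is extendible; the names intern, names, and slots and the hash
function are assumptions.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define MAX_NAMES  1024              /* fixed capacity keeps the sketch short */
#define HASH_SLOTS 2048              /* power of two, larger than MAX_NAMES */

static char *names[MAX_NAMES];       /* string table: index -> name */
static int   num_names = 0;
static int   slots[HASH_SLOTS];      /* hash table: holds index + 1, 0 = empty */

static unsigned hash_name(const char *s)
{
    unsigned h = 5381u;
    while (*s != '\0')
        h = h * 33u + (unsigned char)*s++;
    return h;
}

/* Return the index of name, assigning the next sequential index if it is new. */
int intern(const char *name)
{
    unsigned i = hash_name(name) & (HASH_SLOTS - 1);
    size_t len;
    while (slots[i] != 0) {
        if (strcmp(names[slots[i] - 1], name) == 0)
            return slots[i] - 1;
        i = (i + 1) & (HASH_SLOTS - 1);
    }
    assert(num_names < MAX_NAMES);
    len = strlen(name) + 1;
    names[num_names] = malloc(len);
    assert(names[num_names] != NULL);
    memcpy(names[num_names], name, len);
    slots[i] = num_names + 1;
    return num_names++;
}
```

After interning, two identifiers are the same name exactly when their
indices are equal, so later phases never compare strings.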


The Parser

Yacc was used to generate a parser. The grammar is fairly
straightforwardly LALR(1). One complexity is that the declaration of a
function with no arguments (eg. "f:INT is ...") looks like the
declaration of an attribute (eg. "f:INT;") until the "is" is reached.
It is therefore necessary to use a single syntactic class for the
initial part of both of these declarations. A second complexity arises
in the syntax of expressions. It was important to separate expressions
into those which could support a further attribute call or array index
and those which could not (eg. a+b).

The parser expands the declaration constructs which allow multiple
variables to be declared together into separate declarations. The
initializations are put in as assignments. Initialization of multiple
variables by an expression is converted into a single initialization
via the expression and the rest via that variable.


The generated C files

The generated C files consist of a series of sections. First is a
header comment which gives the name of the file, the Sather class
which it represents and the file in which it resides, the time and
date of generation and the name of the person who generated it.  Next
the "all_.h" file is included. Then comes a list of any strings
required by "(c_macro)" specifications for calls to C class routines
made in this file. These will typically be "#include" or "#define"
directives. Next come the shared attribute declarations. These have
the form: "int FOO12_var_;".  Next come the declarations of the shared
attributes and routines used from elsewhere: "extern int BAR45_foo_;"
and "extern ptr BAZ23_run_();". Next comes the init routine. This must
initialize the string constants if any, as well as all the shareds
(with 0 if not given a value).  Finally, the rest of the routines are
defined.


Object-oriented dispatch

For object oriented dispatch we need to determine at runtime the
referent of a name relative to an object. The tag on the object gives
the index of the class and it together with the feature name index are
used to index into dispatch_table_ to dynamically find the referent.
Because of the inheritance rules, the compiler knows whether the
referent will be a routine or a shared, constant, or object attribute.
For shared attributes, the table holds a pointer to the static storage
location. For routines, the table holds a pointer to the code. For
constants, the table holds the integer value. For object attributes,
the table holds the integer offset. The compiler further knows whether
shared or object attributes refer to base classes or are pointers.
These restrictions were made to eliminate runtime tests. An
alternative would have been to make all dispatch be through functions,
but this introduces many small functions and the cost of a function
call for each feature access. It eliminates the feature of Eiffel that
a function may be redefined as an attribute (even Eiffel doesn't allow
an object attribute to be redefined as a function). This was deemed
insufficiently important to increase the cost of every dispatch. A
programmer can always define a function to access a variable.

We clearly want to avoid the cost of a hash table lookup on every
dispatch. There are several approaches to avoiding this. The simplest
is to use an array or some kind of compacted table to hold the
information instead. The size requirements of these approaches appear
too expensive for large systems. The other kind of approach is to
provide a software cache for recently looked up entries. Such a cache
might be placed either in the calling function, in a stub associated
with the function name, or in a general dispatch function. We have
chosen to cache lookups inside the calling function. It seems more
likely that a given function will make repeated calls to objects of
the same type. It also allows us to avoid the cost of a function call
in the case of a cache hit. We introduce two static variables for each
dispatched call in the function definition. One of these is an integer
and holds the integer tag of the last call. Since no class has index
0, initializing this to 0 guarantees that the first dispatch will go
through the hash table. The other variable holds the value stored in
the table for that class. If we make a dispatch on an object of the same
type as the previous dispatch, we need only notice that the stored
class index is correct and directly use the stored value. The extra
cost in this case is the time to extract the object's tag and compare
it with the stored value. No function call is required. Notice that if
the static variables associated with a function are stored next to one
another in memory, then the physical cache performance should be
better with this method than with one which keeps all entries in
tables. 

The dispatch table is held in the global: "int *dispatch_table_". Its
size is in the global "int dispatch_table_size_". The function
"get_dispatch_(cls,nm)" returns the value stored under the given class
and name. It causes an error if nothing is found. In theory, the
generated code will never ask for non-existent entries. Caching is
implemented by a simple macro. Unfortunately, this macro is a
statement and this causes complexity in translating nested
expressions. After executing the line
"cache_dispatch_(ob,nm,glt,glv)", glt (generated local holding the
type) will hold the type of ob and glv (generated local holding the
value) will hold the value for this class associated with nm. If the
correct type is already stored in glt, it doesn't do anything,
otherwise it calls get_dispatch_.  Similarly
"array_dispatch_(ob,glt,glv)" may be called to ensure that glv holds
the offset in bytes of the first "asize" variable in the object ob.
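
The caching scheme can be sketched as follows. The macro body is an
assumption written to match the behaviour described above, the
get_dispatch_ shown here is a stand-in that returns a made-up value
instead of searching the hash table, and the name index 42 is
arbitrary.

```c
#include <assert.h>

typedef char *ptr;
#define TYPE_(ob) (*(int *)(ob))

static int lookups = 0;              /* counts slow-path calls, for illustration */

/* Stand-in for the hash lookup: a made-up value per (class, name) pair. */
int get_dispatch_(int cls, int nm)
{
    assert(cls > 0);                 /* no class has index 0 */
    lookups++;
    return cls * 1000 + nm;
}

/* Assumed macro body: refill the cache only when the object's tag differs. */
#define cache_dispatch_(ob, nm, glt, glv)        \
    do {                                         \
        if (TYPE_(ob) != (glt)) {                \
            (glt) = TYPE_(ob);                   \
            (glv) = get_dispatch_((glt), (nm));  \
        }                                        \
    } while (0)

/* A call site with its two static cache variables. */
int call_site(ptr ob)
{
    static int gl1_ = 0;             /* tag of the last receiver, 0 = none yet */
    static int gl2_ = 0;             /* table contents for that class */
    cache_dispatch_(ob, 42, gl1_, gl2_);
    return gl2_;
}
```

Repeated calls on objects of the same class cost one tag comparison;
only a change of class reaches get_dispatch_.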


Routine representation

Sather routines are converted to C routines with an extra argument
"self__".  This holds a pointer to the object which the routine is
called on. Every routine with a return argument has a predefined local
variable "res__". The C statement "return(res__);" is generated at the
end of the code and at any explicit Sather "return" statements.

Expression translation

As mentioned above, the dispatching makes generating expressions
harder than it might look at first. The problem is that evaluating an
expression may require executing some statements (eg. if a dispatch is
necessary). If that expression is nested deeply then it is difficult
to generate valid expressions.  Our solution is for each expression
node to potentially generate both preliminary and actual C code. The
preliminary code is a statement while the actual code may be nested in
an expression. The actual code should be an expression whose value is
the value of the Sather expression. An expression tree is recursively
converted. Typically, the preliminary code for outer constructs may
call the actual code for inner constructs (after first outputting
their preliminary code).
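
The split into preliminary and actual code can be sketched with a tiny
generator. Everything here is hypothetical: the node kinds, the gen
function, the fixed buffer sizes, and the placeholder "lookup_" that
stands in for the real dispatch statement.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define EXPR_SIZE   256          /* size of every "actual" buffer */
#define PRELIM_SIZE 1024         /* size of the "preliminary" buffer */

enum kind { K_ID, K_OP, K_DISPATCH };

struct expr {
    enum kind kind;
    const char *text;            /* identifier, operator, or feature name */
    struct expr *left, *right;   /* operands; K_DISPATCH uses only left */
};

static int next_gl = 1;          /* counter for generated locals */

/* Append the preliminary statements for e to prelim and write the
   actual expression for e into actual. */
void gen(const struct expr *e, char *prelim, char *actual)
{
    char l[EXPR_SIZE], r[EXPR_SIZE];
    size_t used;
    int n;
    assert(e != NULL && prelim != NULL && actual != NULL);
    switch (e->kind) {
    case K_ID:
        snprintf(actual, EXPR_SIZE, "%s__", e->text);
        break;
    case K_OP:
        /* Inner preliminary code is emitted before the outer expression. */
        gen(e->left, prelim, l);
        gen(e->right, prelim, r);
        snprintf(actual, EXPR_SIZE, "(%s %s %s)", l, e->text, r);
        break;
    case K_DISPATCH:
        gen(e->left, prelim, l);
        n = next_gl++;
        used = strlen(prelim);
        /* The dispatch is a statement, so it goes into the preliminary code. */
        snprintf(prelim + used, PRELIM_SIZE - used,
                 "gl%d_ = lookup_(%s, \"%s\");\n", n, l, e->text);
        snprintf(actual, EXPR_SIZE, "gl%d_", n);
        break;
    }
}
```

For a Sather expression such as "a + b.f" where "b.f" needs a
dispatch, the generator emits one preliminary statement that fills a
generated local and an actual expression that refers to that local.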


Identifier expressions

An expression consisting of just an isolated identifier may refer to a
local variable, a routine argument, an attribute of "self__", or a shared
variable, constant, or routine in the class of "self__". It may also
be one of the special attributes: "self", "res", "type", "copy", "new"
(for non-array classes), "deep_copy", or "void". 


Operator expressions

The preliminary version of an operator expression consists of the
preliminary parts of its arguments if any. The actual part is just the
operator applied to the actual parts of the argument expressions.


Conditional translation

Similar problems occur in expressing conditionals because the inner
elsif's may require that some preliminary statements be executed, but
only if the first test fails. We convert if-then statements into nested
if's rather than using else if because to evaluate the later tests we
need room to put the preliminary expression statements. The form
produced looks like:

prelim_expr1
if (actual_expr1)
  {
    statements
  }
else
  {
    prelim_expr2
    if (actual_expr2)
      {
        statements
      }
    else
      {
        statements
      }
  }


Loop translation

The problem of expression translation makes the conversion of loops a
little tricky as well because we might have to execute statements on
each iteration. The Sather loop:

until
   expr
loop
   stmts 
end;

is converted to the C:

while(1)
  {
    prelim_expr
    if (actual_expr) break;
    stmts
  }

and 

loop
   stmts
end

is converted to:

while(1)
  {
    stmts
  }

Because break statements appearing in a switch are used to leave the
switch in C, we must define a goto tag at the end of loops which
contain Sather switch statements with "break" in them. Such a "break"
is converted to a C "goto" to the end of the loop.
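
The following sketch shows the shape of such generated code. The
routine itself is invented for illustration, and the tag number 12 is
arbitrary; only the naming pattern "goto_tag_12_" and the use of
"res__" follow the conventions described in this document.

```c
#include <assert.h>

/* Sum values until a negative one is seen. The Sather "break" inside
   the switch must leave the loop, so it becomes a goto to the tag
   placed at the end of the loop. */
int sum_until_negative(const int *vals, int n)
{
    int i__ = 0;
    int res__ = 0;
    assert(vals != 0 || n == 0);
    while (1) {
        if (i__ >= n) break;          /* outside a switch, a C break is enough */
        switch (vals[i__] < 0) {
        case 1:
            goto goto_tag_12_;        /* a C break here would only leave the switch */
        default:
            res__ = res__ + vals[i__];
        }
        i__ = i__ + 1;
    }
goto_tag_12_:
    return (res__);
}
```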


Switch translation

Switches are complicated by the fact that Sather switches can include
non-constant expressions in the tests. These are put at the bottom in
the "default" section in a conditional.
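
A sketch of the resulting C shape is given below. The routine and its
values are invented, and the sketch assumes that the run-time value
never equals one of the constant labels, so the order in which tests
are tried does not matter.

```c
#include <assert.h>

/* Classify x. "limit" is a run-time value, so it cannot be a C case
   label and is tested inside the default section instead. */
int classify(int x, int limit)
{
    int res__ = 0;
    /* Assumption of this sketch: limit does not duplicate a constant label. */
    assert(limit != 1 && limit != 2);
    switch (x) {
    case 1:
        res__ = 10;
        break;
    case 2:
        res__ = 20;
        break;
    default:
        /* Non-constant tests are tried here, before the else part. */
        if (x == limit) {
            res__ = 30;
        } else {
            res__ = -1;               /* the Sather "else" branch */
        }
    }
    return (res__);
}
```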

Identifier translation


Array access translation


Dotted access translation


Class access translation


Runtime type checking

There is conditionally compiled code for doing type checking at
runtime. Each function checks the types of its arguments and each
assignment checks for conformance on the left hand side. The tables
necessary for such checks are also only conditionally included. The
conformance table could be either a bitmap (with, say, 320 classes,
each class needs 10 ints, or about 12K of extra space) or a hash table
keyed on the conformance pairs (if each class conforms to about 10
others, the table would need 2*3200 entries, or about 24K bytes of
extra space). The bitmap looks best.
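
The bitmap alternative can be sketched as follows, assuming 32-bit
unsigned ints. The names conforms, set_conforms, and does_conform are
hypothetical.

```c
#include <assert.h>

#define NUM_CLASSES 320
#define WORD_BITS   32
#define WORDS_PER_CLASS ((NUM_CLASSES + WORD_BITS - 1) / WORD_BITS)   /* 10 */

/* One row of bits per class: 320 rows * 10 words * 4 bytes = 12800 bytes. */
static unsigned int conforms[NUM_CLASSES][WORDS_PER_CLASS];

/* Record that class sub conforms to class sup. */
void set_conforms(int sub, int sup)
{
    assert(sub >= 0 && sub < NUM_CLASSES && sup >= 0 && sup < NUM_CLASSES);
    conforms[sub][sup / WORD_BITS] |= 1u << (sup % WORD_BITS);
}

/* Run-time check: one memory read, a shift, and a mask. */
int does_conform(int sub, int sup)
{
    assert(sub >= 0 && sub < NUM_CLASSES && sup >= 0 && sup < NUM_CLASSES);
    return (int)((conforms[sub][sup / WORD_BITS] >> (sup % WORD_BITS)) & 1u);
}
```

The table size grows with the square of the number of classes, but
each check needs no probing, which is part of why the bitmap is
attractive at this scale.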


Array bounds checking

