      Data structures and related abstract data types in Sather 1.0


			Christian Schwarz


Abstract.
--------
This note discusses the implementation of some data structures and related
abstract data types in Sather 1.0.


The relation of data types and data structures.
-----------------------------------------------
The abstract data types that I want to support are ``dictionary'' and
``sorted sequence'', similar to the way they are described in the LEDA
manual [2]. One important goal is to incorporate the ``item concept'', or
``access by position'', which is basically the abstraction of a
pointer to an element in the data structure.

First, I have implemented concrete data structures that contain the 
functionality needed by both data types. As a simple example, I 
used a doubly linked list. More advanced data structures are 
the skiplist [3] and the red-black tree [4]. For the latter
I could use Sather 0.2 code fragments written by Stephan Murer.
(Both the skiplist and the red-black tree are implemented in LEDA [1],
among a number of alternative data structures; LEDA uses C++ as its
implementation language.)

Here is a rough interface of a dictionary with the most important operations.
A dictionary associates with each stored key, an element of a set K, some
information, an element of a set I. DIC_ITEM denotes an object that holds
all the necessary information for one dictionary association.

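dictionary{K,I} is

        key(it:DIC_ITEM):K;           -- key stored in the item

        inf(it:DIC_ITEM):I;           -- information stored in the item

        change_inf(it:DIC_ITEM,i:I);  -- replace the information by i

        lookup(k:K):DIC_ITEM;         -- item with key k, or nil

        insert(k:K,i:I); -- may also return DIC_ITEM, but this needs another
                         -- name since we cannot overload the return type

        del_item(it:DIC_ITEM);        -- remove the item
end;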

With these operations, delete(k:K) can be obtained as a combination of
lookup and del_item.
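
The interface above, including delete as a combination of lookup and
del_item, can be sketched as follows. This is a minimal illustration in
Python rather than Sather, so that it can be run as is; it is not the code
from list.sa. The names ListItem and ListDict are hypothetical, None stands
in for the nil item, and insert is assumed to replace the information of an
existing key.

```python
class ListItem:
    """One association; also serves as the 'item' handed back to callers."""

    def __init__(self, key, inf):
        self.key, self.inf = key, inf
        self.prev = self.next = None


class ListDict:
    """Unsorted doubly linked list offering the dictionary operations."""

    def __init__(self):
        self.head = None

    def key(self, it):
        return it.key

    def inf(self, it):
        return it.inf

    def change_inf(self, it, i):
        it.inf = i

    def lookup(self, k):
        it = self.head
        while it is not None and it.key != k:
            it = it.next
        return it  # None plays the role of the nil item

    def insert(self, k, i):
        it = self.lookup(k)
        if it is not None:  # assumption: replacing insertion
            it.inf = i
            return
        it = ListItem(k, i)
        it.next = self.head
        if self.head is not None:
            self.head.prev = it
        self.head = it

    def del_item(self, it):
        # Unlink the item; no search is needed because we hold its position.
        if it.prev is not None:
            it.prev.next = it.next
        else:
            self.head = it.next
        if it.next is not None:
            it.next.prev = it.prev

    def delete(self, k):
        # delete = lookup followed by del_item
        it = self.lookup(k)
        if it is not None:
            self.del_item(it)
```

Note that del_item runs in constant time here, whereas delete pays for the
search in lookup; this is the practical benefit of access by position.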

A concrete data structure, say SKIPLIST{K,I}, would replace DIC_ITEM
by the object that holds k:K,i:I and some pointers to connect to other
elements, or in case of a red-black tree, DIC_ITEM is replaced by
RB_NODE{K,I}, holding pointers to children and parent and the color
information.

Consider simply subtyping the concrete data structure under a type
$DICT and the concrete pointer type under $ITEM.
This does not work. lookup returns an item, so the concrete item type
has to be a subtype of $ITEM. But del_item, key and inf take an item as
argument, so the contravariance rule requires the concrete item type to
be a supertype of $ITEM. Together this forces the abstract type $ITEM and
the concrete item type to be equal, which is not possible.

Another way of implementing the item concept would be to require that
all concrete item types are the same, e.g. by building a class containing
the union of the fields of each individual item type. This would be
inelegant and a waste of storage.

Before we continue the discussion, here is a typical example of item usage:

-- count the number of occurrences of each string in a text
-- text is given as an FLIST{STR}

   process is
      s:STR;
      #OUT + a.size + " elements\n";
      loop
         s:=a.elt!;
         it::=D.lookup(s);
         if (SYS::ob_eq(it,D.nil)) then
            D.insert(s,1);
         else
            D.change_inf(it,D.inf(it)+1);
         end;
      end;
   end;
   
  
Here, D is some concrete data structure with the appropriate item type,
such as RB_NODE{K,I} for D=RB_TREE{K,I}. Using the implicit declaration
mechanism it::=, however, we can avoid mentioning this item type at all,
which already provides an inexpensive kind of abstraction.
A programmer has to declare the right D, and if D is changed,
every such declaration has to be changed, but the rest of the program stays
the same.

We can further restrict the places where changes need to be made in case
we want another implementation: we build a special class for the
application we have in mind, and parameterize it with the type
of the dictionary. Thus, in the above example, the code would be
encapsulated in a class DICTEST{IMPL}, and somewhere else in the 
program we instantiate the class with DICTEST{RB_TREE{K,I}}.
This method is used in the current implementation.
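
The idea of parameterizing the application by the implementation type can
be sketched as follows. This is a Python stand-in, not the Sather class from
dic.sa: DictImpl and DicTest are hypothetical names, and items are
represented as small mutable lists purely for illustration. One difference
should be kept in mind: Python only detects a missing operation at run time,
whereas the Sather compiler checks it when the parameter is instantiated.

```python
class DictImpl:
    """Stand-in implementation; an item is a mutable [key, inf] list."""

    def __init__(self):
        self._items = {}

    def lookup(self, k):
        return self._items.get(k)  # None plays the role of the nil item

    def insert(self, k, i):
        self._items[k] = [k, i]

    def key(self, it):
        return it[0]

    def inf(self, it):
        return it[1]

    def change_inf(self, it, i):
        it[1] = i


class DicTest:
    """Word-count application, parameterized by the dictionary type."""

    def __init__(self, impl):
        self.D = impl()  # the only place that depends on the implementation

    def process(self, words):
        for s in words:
            it = self.D.lookup(s)
            if it is None:
                self.D.insert(s, 1)
            else:
                self.D.change_inf(it, self.D.inf(it) + 1)

    def count(self, s):
        it = self.D.lookup(s)
        return 0 if it is None else self.D.inf(it)
```

Switching to another data structure then means changing one argument, as in
DicTest(DictImpl), while process itself never names an item type.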


Concerning the requirements that the key set K must meet, the 
dictionary only needs to decide equality on the elements.
All the concrete data structures described in this paper actually work
on key sets with the stronger property of being linearly ordered.
Let us continue with the data type sorted sequence, which needs
the ordering. Such a data type can contain all the operations listed
for the dictionary above, and is mainly characterized by the following
additional operation:

	locate(k:K):SEQ_ITEM

which returns the item with the smallest key that is at least as large as k.
Additionally, we would like to insert an element in the sequence even if its
key is already present. In the program given below, ``minsert'' is used
for this purpose. It has a separate name because we want to support the
replacing insertion as well, and because each concrete structure provides the
operations needed by both abstract types.
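
The semantics of locate and minsert can be illustrated with a small Python
sketch built on the standard bisect module. This is an assumption-laden
stand-in, not one of the three Sather implementations: SortedSeq is a
hypothetical name, None stands in for the nil item, and an item is simply an
index into a sorted array, which (unlike a real item) becomes stale after
later insertions.

```python
import bisect


class SortedSeq:
    """Sorted sequence kept in a Python list; an item is an index."""

    def __init__(self):
        self._keys = []
        self._infs = []

    def minsert(self, k, i):
        # Insert even if k is already present (after existing equal keys).
        pos = bisect.bisect_right(self._keys, k)
        self._keys.insert(pos, k)
        self._infs.insert(pos, i)

    def locate(self, k):
        # First position whose key is at least as large as k, or None.
        pos = bisect.bisect_left(self._keys, k)
        return pos if pos < len(self._keys) else None

    def key(self, it):
        return self._keys[it]

    def succ(self, it):
        # Next item in key order, or None at the end of the sequence.
        return it + 1 if it + 1 < len(self._keys) else None
```

For example, after inserting "pear", "apple", "fig" and "apple" again,
locate("banana") yields the item with key "fig", and locate("zebra") yields
the nil item.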

NOTE: In this document I often use ``abstract type'' not in the sense of a
Sather type interface $XXX but in a conceptual sense.

An example program for the sorted sequence follows.

-- in a lexicographically ordered sequence of strings (given as a:FLIST{STR}),
-- process a number of queries (given as sea:FLIST{TUP{STR,STR}}) which consist
-- of a pair of strings. The answer to such a query is to list all elements
-- in the sequence between the query items.

   process is
      s:STR;
      #OUT + a.size + " elements\n";
      #OUT + sea.size + " queries\n";
      loop
         s:=a.elt!;
         D.minsert(s,0);
      end;
      loop
         t::=sea.elt!;
         t1::=t.t1;
         t2::=t.t2;
         #OUT + "elements from " + t1 + " to " + t2 + ":\n";
         if (~t2.is_lt(t1)) then
            it::=D.locate(t1);
            stop::=D.locate(t2);
            loop
               until!(D.is_nil(it) or SYS::ob_eq(it,stop));
               #OUT + D.key(it) + "\n";
               it:=D.succ(it);
            end;
         end;
      end;
   end;
   
  
Concerning abstraction, what was said for the dictionary applies here
as well. A pragmatic compromise under the current circumstances is to
code the application program as a class which is parameterized by the
implementation type.

To sum up, we do not use a Sather type interface to realize an abstract
data type, but ``agree'' on a convention about which operations the
type needs; the concrete data structure, which instantiates the
implementation parameter, has to provide them. However, the compiler is
still able to check whether this requirement is met when it performs the
instantiation.


Programs.
--------
All files can be found in ~cschwarz/sa/1.0/lib.
There are three implementations. The file list.sa contains
a simple list LISTASSOC{K,I} with item type LISTEL{K,I}, skip.sa 
contains a skiplist SKIPLIST{K,I} with item type SKIPEL{K,I}, and 
rb_tree.sa has a red-black tree RB_TREE{K,I} with node type RB_NODE{K,I}.
The example programs are classes DICTEST{IMPL} in dic.sa and
SEQTEST{IMPL} in seq.sa. In the main program, the parameter IMPL
is instantiated by RB_TREE{STR,INT} and the other corresponding classes.


The code is fairly restrictive concerning the language features used.
No bound routines and no dispatching are used.
However, iters are used to walk through the elements of a data structure in
a canonical order (in our examples determined by the linear ordering on the
keys of the elements). Partial (short-circuit) evaluation of Boolean
expressions, as known from C, was used to shorten the red-black tree code
considerably, at the same time improving readability.
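
Reading ``partial evaluation'' as the short-circuit evaluation of Boolean
operators known from C (an interpretation, not a quote from rb_tree.sa), the
following Python sketch shows the kind of simplification this allows in
red-black tree code. Node, is_red and has_red_child are hypothetical names
chosen for illustration.

```python
class Node:
    """Minimal tree node with a color flag; None represents a missing child."""

    def __init__(self, key, red=False, left=None, right=None):
        self.key, self.red = key, red
        self.left, self.right = left, right


def is_red(n):
    # n.red is only evaluated when n is not None, so no nested 'if' is needed.
    return n is not None and n.red


def has_red_child(n):
    # The right-hand side is skipped entirely for a missing node.
    return n is not None and (is_red(n.left) or is_red(n.right))
```

Without short-circuit evaluation, each of these tests would need an explicit
nil check before the field access, which adds up quickly in the rebalancing
cases.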


Efficiency.
----------
I ran the example applications described above for dictionary and
sorted sequence with key type STR and information type INT, and used
single Sather files, or combinations of them, from the same directory as
input. Measuring the whole application with a time command was the easiest
way to get the results. (I tried importing an external C function, which
caused problems in the compiler.) If a text contains n words, m of which are
unique, then the dictionary application performs n lookups and m inserts.
The sorted sequence application performs n inserts and 2q locate operations,
where q is the number of queries. (I have used q=10 in all test examples.)
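
The operation counts stated above can be checked with a short Python sketch
that mirrors the control flow of the two example programs. The function
names are hypothetical, and a built-in dict stands in for the data structure
under test.

```python
def dictionary_operations(words):
    """Return (lookups, inserts) performed by the word-count application."""
    seen = {}
    lookups = inserts = 0
    for w in words:
        lookups += 1          # one lookup per word: n in total
        if w in seen:
            seen[w] += 1      # change_inf, no insert
        else:
            seen[w] = 1
            inserts += 1      # one insert per distinct word: m in total
    return lookups, inserts


def sequence_operations(n, q):
    """Return (inserts, locates): one minsert per word, two locates per query."""
    return n, 2 * q
```

For the six-word text "a b a c b a" with three distinct words,
dictionary_operations reports 6 lookups and 3 inserts; with q=10 the sorted
sequence application performs 20 locate operations regardless of n.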

Not surprisingly, the simple list had the worst performance among the three
data structures. However, the red-black tree outperformed the
skiplist with a tangible edge of ~35%. This is contrary to the
experimental results obtained with the LEDA implementations. (The test
programs are in the /prog section of the LEDA directory, e.g. in
~cschwarz/lib/LEDA-3.0.) In those tests, other randomized data structures
also have advantages over the deterministic ones.

The timings for the dictionary application in ms:

 data str. \ n |   109    471   1422   2845
-----------------------------------------------
  red-black    |   0.4    1.4    3.4    8.2
  skiplist     |   0.5    2.1    4.8   11.7
  list         |   0.5    2.4   13.9   51.4


The timings for the sorted sequence application in ms, 10 queries:

 data str. \ n |   109    471   1422   2845
-----------------------------------------------
  red-black    |   0.3    0.8    1.8    4.4
  skiplist     |   0.4    1.1    2.9    6.7
  list         |   0.7    3.7   31.9  152.6
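
As a worked check of the comparison between red-black tree and skiplist, the
relative time saving can be recomputed from the two tables. The Python
sketch below only uses the numbers printed above; the function name saving
is a hypothetical helper.

```python
# Timings copied from the two tables above, for n = 109, 471, 1422, 2845.
dictionary = {
    "red-black": [0.4, 1.4, 3.4, 8.2],
    "skiplist": [0.5, 2.1, 4.8, 11.7],
}
sequence = {
    "red-black": [0.3, 0.8, 1.8, 4.4],
    "skiplist": [0.4, 1.1, 2.9, 6.7],
}


def saving(table):
    """Percentage by which the red-black tree time is below the skiplist time."""
    return [
        round(100 * (1 - rb / sk))
        for rb, sk in zip(table["red-black"], table["skiplist"])
    ]
```

This gives savings of 20, 33, 29 and 30 percent for the dictionary and of
25, 27, 38 and 34 percent for the sorted sequence, so the edge grows
somewhat with the input size.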



Thanks.
-------
Robert Griesemer, Steve Omohundro and David Stoutamire helped 
me with a lot of information, and I could use some Sather 0.2 code 
written by Stephan Murer.


References.
----------
[1]	K. Mehlhorn, S. Naeher. LEDA -- A Library of Efficient Data types and
Algorithms. ICALP'90 LNCS vol. 443, pp. 1-5, 1990, also appeared in CACM.

[2]	S. Naeher. LEDA-3.0 Manual. Technical Report Max-Planck-Institut fuer
Informatik, Saarbruecken, Germany, 1993.

[3]	W. Pugh. A Skip List Cookbook. CS-TR-2286.1, University of Maryland,
College Park, 1990.

[4]	T. Cormen, C. Leiserson, R. Rivest. Introduction to Algorithms,
MIT Press, 1990.