Background
==========
Every SGML document has a _doctype_ (document type), which should be
declared at the top of the file.  For example:

    <!DOCTYPE HTML SYSTEM>
    <HTML>
    ...
    </HTML>

The SGML summarizer
===================
A Perl script named SGML.sum is used to parse generic SGML files.  SGML.sum
reads the output of the 'sgmls' program by James Clark <jjc@jclark.com>.

The 'sgmls' program needs a few files to do its work:

    The source SGML document
    A Document Type Definition (DTD)
    An SGML declaration (.decl) file 

The 'sgmls' program takes three command line arguments:

    The name of a Catalog file
    The declaration file
    The source document file

The Catalog is used to map doctypes to DTD files on disk.  For Harvest
the Catalog is $HARVEST_HOME/lib/gatherer/sgmls-lib/catalog.

The SGML.sum script takes two command line arguments:

    The doctype
    The source file to summarize

By default it looks for the following support files for $doctype:

    $HARVEST_HOME/lib/gatherer/sgmls-lib/$doctype/$doctype.decl
    $HARVEST_HOME/lib/gatherer/sgmls-lib/$doctype/$doctype.sum.tbl

The file $doctype.dtd should also be kept here, but its location is
specified in the Catalog.  The default 'decl' and 'tbl' pathnames can be
overridden with the -d and -t options to SGML.sum.  The 'tbl' file is
discussed later.
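
The default lookup described above can be sketched in Python.  This is
only an illustration, not the code in SGML.sum (which is a Perl script);
the function name support_files and its keyword arguments are
hypothetical stand-ins for the -d and -t options.

```python
import posixpath

def support_files(doctype, harvest_home, decl=None, tbl=None):
    """Return (decl, tbl) pathnames, falling back to the documented defaults.

    `decl` and `tbl` play the role of the -d and -t overrides.
    """
    base = posixpath.join(harvest_home, "lib", "gatherer", "sgmls-lib", doctype)
    if decl is None:
        decl = posixpath.join(base, doctype + ".decl")
    if tbl is None:
        tbl = posixpath.join(base, doctype + ".sum.tbl")
    return decl, tbl
```

Calling support_files("FOO", home) with no overrides yields the FOO.decl
and FOO.sum.tbl paths under sgmls-lib/FOO; passing decl= or tbl= replaces
only that one path.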


Many files to be used with the SGML summarizer may not have a <!DOCTYPE
declaration on the first line.  This is especially true of HTML.  For this
reason, SGML.sum writes the input to a temporary file and looks for the
<!DOCTYPE string.  If it is not found, SGML.sum inserts

    <!DOCTYPE $doctype SYSTEM>

as the first line of the source document before feeding it to 'sgmls'.
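
A minimal sketch of this check, in Python rather than the Perl of
SGML.sum.  The function name ensure_doctype is hypothetical, and matching
the string case-insensitively is an assumption; the text above only says
that the <!DOCTYPE string is searched for.

```python
def ensure_doctype(text, doctype):
    """Prepend a DOCTYPE line unless the document already contains one."""
    # Assumption: the search ignores case.
    if "<!DOCTYPE" in text.upper():
        return text
    return "<!DOCTYPE %s SYSTEM>\n" % doctype + text
```

A document that already declares its doctype is returned unchanged, so
running the check twice does not add a second declaration.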



Creating a summarizer for a new doctype
=======================================
Assume you have a new doctype named FOO.

   *) As outlined in the Harvest User's Manual, edit the Essence
      configuration files (e.g. lib/byurl.cf) so that your FOO documents
      are typed as FOO.

           FOO		^http://.*\.foo$

   *) Create these files

          $HARVEST_HOME/lib/gatherer/sgmls-lib/FOO/FOO.dtd
          $HARVEST_HOME/lib/gatherer/sgmls-lib/FOO/FOO.decl
          $HARVEST_HOME/lib/gatherer/sgmls-lib/FOO/FOO.sum.tbl

      The 'decl' and 'tbl' files could instead live in the gatherer's
      lib directory; whether the DTD can as well is not yet settled.
      Edit the Catalog file to reflect the pathname of the DTD.

   *) Write a shell script named FOO.sum.  The simplest way is:

           #!/bin/sh
           exec SGML.sum FOO "$@"

      Or possibly

           #!/bin/sh
           dcl="$HARVEST_HOME/gatherers/foo/lib/foo.decl"
           tbl="$HARVEST_HOME/gatherers/foo/lib/foo.tbl"
           exec SGML.sum -d "$dcl" -t "$tbl" FOO "$@"



The SGML to SOIF translation
============================

There are two types of SGML data that can be extracted by the SGML
summarizer.  The first is "content", which appears between a start tag
and an end tag.  (Note that SGML allows some end tags to be implied.)
Examples:

    <B>This phrase is in bold</B>
    <PARA TYPE="title">The title of this paper is....</PARA>

The second type is data that appears in SGML attributes, inside the 
tag delimiters.  Examples:

    <A HREF="http://harvest.cs.colorado.edu/">
    <META NAME="author"  CONTENT="Duane Wessels">

The SGML summarizer uses a translation table to know which SGML data
goes into which SOIF attributes.  For the examples above we might use:


	# SGML-to-SOIF mappings
	#
	<B>			keywords,parent
	<PARA,TYPE=title>	title
	<PARA>			body
	<A:HREF>		url-references
	<META:CONTENT>		$NAME
	<PRE>			ignore

The first field is the SGML tag, enclosed in angle brackets.  The second
field is a comma-separated list of SOIF attributes.  

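There are no special or reserved SGML tag values.  There are two special
characters, comma and colon, which are assumed not to appear in any
valid tag name.  In the examples above, a comma introduces an attribute
condition (TYPE=title) and a colon names the SGML attribute whose value
is extracted (HREF, CONTENT).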
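
To make the table layout concrete, here is a small Python sketch that
reads such a table into an ordered list of rules.  It is not the parser
used by SGML.sum; the function name parse_table is hypothetical, and
treating lines that start with '#' as comments is inferred from the
example table above.

```python
def parse_table(text):
    """Parse translation-table text into ordered (tag, soif_attributes) pairs."""
    rules = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue                      # skip blank lines and comments
        fields = line.split(None, 1)      # tag field, then the attribute list
        if len(fields) != 2:
            continue                      # malformed line: no second field
        tag = fields[0].strip("<>")       # drop the enclosing angle brackets
        targets = [name.strip() for name in fields[1].split(",")]
        rules.append((tag, targets))
    return rules
```

The rules are kept in a list rather than a dictionary so that the order
of the lines in the file is preserved.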

If content appears for a tag not listed in the table, that content is
passed up to the parent and becomes a part of the parent's content.  
This continues until a tag is found with an output SOIF attribute.

There are two special SOIF attributes: 'parent' and 'ignore'.  The 'parent'
attribute means to pass the content for this tag up to the parent tag.
This is only needed when you want the content to appear in an attribute
of its own in addition to the parent's.  You would never need to list
'parent' as the only attribute for a tag, since content for unlisted tags
is already passed up.  The 'ignore' attribute means that the content for
this tag should be discarded.
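
The propagation rules can be sketched as a short recursive function in
Python.  This is an illustration under stated assumptions, not the logic
of SGML.sum itself: an element is modelled as a (tag, children) pair, the
table as a plain dictionary, and pieces of content are joined with a
single space.  The name summarize is hypothetical.

```python
def summarize(node, table, soif):
    """Record content for `node` in `soif`; return the text passed to its parent."""
    tag, children = node
    parts = []
    for child in children:
        if isinstance(child, str):
            parts.append(child)                       # literal text
        else:
            parts.append(summarize(child, table, soif))
    content = " ".join(p for p in parts if p)
    targets = table.get(tag)
    if targets is None:
        return content            # unlisted tag: content goes to the parent
    if "ignore" in targets:
        return ""                 # content is discarded
    for name in targets:
        if name != "parent":
            soif.setdefault(name, []).append(content)
    return content if "parent" in targets else ""

table = {"B": ["keywords", "parent"], "PARA": ["body"], "PRE": ["ignore"]}
doc = ("PARA", ["A", ("B", ["bold"]), ("I", ["italic"]), ("PRE", ["x"])])
soif = {}
summarize(doc, table, soif)
```

Here the <B> content is recorded as a keyword and also passed up, the
unlisted <I> content is simply passed up, and the <PRE> content is
dropped before it reaches the body.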

Another special case is the example '$NAME' above.  This means that the
value of the CONTENT attribute in the META tag should be output in
the SOIF attribute given by the value of NAME in the META tag.  Example:

     <META CONTENT="Dirk Niblick" NAME="owner">

results in 

     owner{12}:	Dirk Niblick

(There is nothing special about the word 'NAME'; it can be any valid
attribute for the tag.)
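
The '$NAME' substitution and the resulting SOIF line can be sketched in
Python as follows.  Both function names are hypothetical, and taking the
number in braces to be the byte length of the value is an assumption
(it agrees with the 12-character "Dirk Niblick" example above).

```python
def resolve_target(target, sgml_attrs):
    """Replace a '$ATTR' target with the value of that SGML attribute."""
    if target.startswith("$"):
        return sgml_attrs[target[1:]]
    return target

def soif_line(name, value):
    """Format one SOIF attribute as name{size}:<TAB>value."""
    size = len(value.encode("utf-8"))   # assumption: size counts bytes
    return "%s{%d}:\t%s" % (name, size, value)

meta = {"CONTENT": "Dirk Niblick", "NAME": "owner"}
line = soif_line(resolve_target("$NAME", meta), meta["CONTENT"])
```

An ordinary target such as 'title' passes through resolve_target
unchanged, so the same code path serves both kinds of table entry.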


Note that order in the translation table is important.  For a given tag,
the first match is taken.  For example, this is NOT what you want:

	<PARA>			body
	<PARA,TYPE=title>	title

The second line would never be checked because the first line would
match before it.
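
The effect of ordering can be demonstrated with a small first-match
lookup in Python.  This is a sketch, not the matching code in SGML.sum:
the names matches and lookup are hypothetical, and the 'TAG,ATTR=value'
pattern syntax is taken from the example table.

```python
def matches(pattern, tag, attrs):
    """True if a 'TAG' or 'TAG,ATTR=value' pattern matches the element."""
    name, _, condition = pattern.partition(",")
    if name != tag:
        return False
    if not condition:
        return True                       # bare tag matches any attributes
    key, _, value = condition.partition("=")
    return attrs.get(key) == value

def lookup(rules, tag, attrs):
    """Return the SOIF targets of the first matching rule, or None."""
    for pattern, targets in rules:
        if matches(pattern, tag, attrs):
            return targets
    return None

good = [("PARA,TYPE=title", ["title"]), ("PARA", ["body"])]
bad = [("PARA", ["body"]), ("PARA,TYPE=title", ["title"])]
```

With the rules in the order of good, a <PARA TYPE="title"> element maps
to 'title'; with the order of bad, the bare PARA rule matches first and
the same element ends up in 'body'.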
