


The SGML summarizer uses an SGML-to-SOIF translation table.  Each line
contains two fields: the SGML tag plus modifiers, and a list of SOIF
attributes.  This one-line table is enough for simple summarizing of
HTML documents:

	<BODY>		body

The effect of this is that all text which appears between the BODY
tags is placed into the SOIF attribute named 'body'.  This works because
data for any tag not listed in the table is passed up to its parent
tag.

If we know that data appearing between PRE tags is useless to index,
we can leave it out by specifying the SOIF attribute as 'ignore'.

	<BODY>		body
	<PRE>		ignore

The title is not included in the body, so we can add a special
attribute for that:

	<BODY>		body
	<PRE>		ignore
	<TITLE>		title

This will get us pretty far.  Now maybe we want to use words in
bold or other special fonts as keywords in the summary:

	<BODY>		body
	<PRE>		ignore
	<TITLE>		title
	<B>		keywords
	<EM>		keywords

One bad effect of this is that words in bold are taken out of 'body' and 
put into 'keywords'.  This makes the body text somewhat unreadable and 
might cause phrase searches on the body to miss.  We can have the
data for these bolded words passed up to the parent tag by listing
'parent' as a SOIF attribute:

	<BODY>		body
	<PRE>		ignore
	<TITLE>		title
	<B>		keywords,parent
	<EM>		keywords,parent
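
The pass-up behaviour can be sketched in a few lines of Python.  This is
only an illustration of the rules described above, not the summarizer's
actual code: the Summarizer class and the TABLE dictionary are made up
for this example, tag names are lowercase because Python's HTMLParser
lowercases them, and the input is assumed to be well nested.

```python
from html.parser import HTMLParser

# Hypothetical in-memory form of the translation table shown above.
TABLE = {
    "body": ["body"],
    "pre": ["ignore"],
    "title": ["title"],
    "b": ["keywords", "parent"],
    "em": ["keywords", "parent"],
}

class Summarizer(HTMLParser):
    """Toy model of the table rules; assumes well-nested tags."""

    def __init__(self, table):
        super().__init__()
        self.table = table
        self.stack = []   # one (tag, text chunks) pair per open tag
        self.soif = {}    # SOIF attribute name -> list of text pieces

    def handle_starttag(self, tag, attrs):
        self.stack.append((tag, []))

    def handle_data(self, data):
        if self.stack:
            self.stack[-1][1].append(data)

    def handle_endtag(self, tag):
        if not self.stack or self.stack[-1][0] != tag:
            return  # mismatched close tag: outside the scope of this sketch
        _, chunks = self.stack.pop()
        text = "".join(chunks)
        # Tags not listed in the table behave as if they said 'parent'.
        for target in self.table.get(tag, ["parent"]):
            if target == "ignore":
                continue
            if target == "parent":
                if self.stack:
                    self.stack[-1][1].append(text)
            else:
                self.soif.setdefault(target, []).append(text)

doc = ("<html><head><title>Ski</title></head>"
       "<body>Go <b>skiing</b> now<pre>junk</pre></body></html>")
summarizer = Summarizer(TABLE)
summarizer.feed(doc)
summarizer.close()
print(summarizer.soif)
```

With 'parent' listed for B, the word "skiing" lands in 'keywords' and
also stays in 'body', while the PRE text is dropped.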

So far we have only summarized the content _between_ tags.  Perhaps
we would also like to include the URLs of hypertext links in the
SOIF data.  These appear as attribute values _within_ an SGML tag.
They are specified in the translation table as <TAG:ATTR>:

	<BODY>		body
	<PRE>		ignore
	<TITLE>		title
	<B>		keywords,parent
	<EM>		keywords,parent
	<A:HREF>	url-references

HTML has a very flexible META tag with which we can write things like

	<META NAME="Author" CONTENT="Duane Wessels">
	<META NAME="Data-Source" CONTENT="Dept of Records">
	<META NAME="Data-Quality" CONTENT="5">
	<META NAME="keywords"  CONTENT="skiing winter recreation expensive">

Rather than having all of these appear in a 'meta' SOIF attribute, it
would be nice to have them each appear in their own attribute.  This
can be done by using a variable-like notation to specify the SOIF
attribute as the value of one of the SGML attributes, in this case
NAME.

	<BODY>		body
	<PRE>		ignore
	<TITLE>		title
	<B>		keywords,parent
	<EM>		keywords,parent
	<A:HREF>	url-references
	<META:CONTENT>	$NAME

The resulting SOIF would look like:

	author{13}:	Duane Wessels
	data-source{15}:	Dept of Records
	data-quality{1}:	5
	keywords{34}:	expensive
	recreation
	skiing
	winter
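
To make the $NAME substitution concrete, here is a small Python sketch.
It is an illustration only: resolve_target and soif_line are
hypothetical helpers, the lowercasing of the attribute name is inferred
from the sample output above, and the number in braces is taken to be
the size of the value in bytes.

```python
def resolve_target(target, attributes):
    """Return the SOIF attribute name for one table entry.

    A target such as '$NAME' is replaced by the value of that SGML
    attribute; lowercasing it is an assumption based on the sample output.
    """
    if target.startswith("$"):
        return attributes[target[1:]].lower()
    return target

def soif_line(name, value):
    """Format one SOIF attribute as name{size}:<TAB>value."""
    size = len(value.encode("utf-8"))  # size of the value in bytes
    return f"{name}{{{size}}}:\t{value}"

# One parsed META tag, written as a plain dictionary for this example.
meta = {"NAME": "Data-Source", "CONTENT": "Dept of Records"}
line = soif_line(resolve_target("$NAME", meta), meta["CONTENT"])
print(line)
```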



The 'Rainbow' project translates MIF/RTF/Interleaf into SGML.  Rather
than having "fixed" tag names such as <PRE> and <STRONG>, it has more
generic tags and uses attributes to specify more about the data.
For example, a bold phrase might appear as

    ...the <CLF FONT="bold">Hounds of the Baskervilles</CLF> was...

Similarly, paragraphs appear as

    <PARA PARATYPE="title">How I spent my summer vacation</PARA>

This is accommodated in the SGML summarizer by giving attribute values
in the mapping table:

	<PARA,PARATYPE=title>		title
	<PARA,PARATYPE=heading 1>	headings
	<PARA,PARATYPE=heading 2>	headings
	<PARA>				body

Note that order is important here.  The first match found is accepted.
Less-specific specifications should be listed later.
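
The first-match rule is easy to see in a short Python sketch.  This is
not the summarizer's parser: parse_table and lookup are hypothetical
helpers that handle only the <TAG,ATTR=value> form shown above (not
<TAG:ATTR> entries), and attribute values are compared exactly.

```python
def parse_table(text):
    """Parse table lines into (tag, required attributes, SOIF names)."""
    rules = []
    for line in text.splitlines():
        line = line.strip()
        if not line.startswith("<"):
            continue  # skip blank or unrecognized lines
        spec, _, targets = line.partition(">")
        tag, *conditions = spec[1:].split(",")
        required = dict(c.split("=", 1) for c in conditions)
        rules.append((tag.upper(), required,
                      [t.strip() for t in targets.split(",")]))
    return rules

def lookup(rules, tag, attributes):
    """Return the SOIF names of the first rule that matches, else None."""
    for rule_tag, required, targets in rules:
        if rule_tag == tag.upper() and all(
                attributes.get(key) == value
                for key, value in required.items()):
            return targets
    return None

TABLE_TEXT = """
<PARA,PARATYPE=title>        title
<PARA,PARATYPE=heading 1>    headings
<PARA,PARATYPE=heading 2>    headings
<PARA>                       body
"""

rules = parse_table(TABLE_TEXT)
print(lookup(rules, "PARA", {"PARATYPE": "heading 1"}))
print(lookup(rules, "PARA", {"PARATYPE": "caption"}))

# With the generic <PARA> rule listed first, it shadows all the others.
shadowed = [rules[-1]] + rules[:-1]
print(lookup(shadowed, "PARA", {"PARATYPE": "title"}))
```

In the last call the generic <PARA> entry matches first, so even a
title paragraph ends up in 'body'.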

The bad news here is that it is unclear how the magic words such as
'title' and 'heading 1' are chosen.  I suspect that they are
hard-coded into most word processors, but differ across versions and
platforms.  FrameMaker probably allows the user to create and name a
custom paragraph type.  So in order to use the SGML summarizer
effectively here, the Harvest admin will need to know something about
the documents being summarized.

