
                                     NAME
                                       
   DBIx::KwIndex - create and maintain keyword indices in DBI tables
     _________________________________________________________________
   
                                   SYNOPSIS
                                       
 package MyKwIndex;
 use DBIx::KwIndex;

 sub document_sub { ... }

 package main;
 $kw = DBIx::KwIndex->new({dbh => $dbh, index_name => 'myindex'})
   or die "can't create index";

 $kw->add_document   ([1,2,3,...]) or die $kw->{ERROR};
 $kw->remove_document([1,2,3,...]) or die $kw->{ERROR};
 $kw->update_document([1,2,3,...]) or die $kw->{ERROR};

 $docs = $kw->search({ words=>'upset stomach' });
 $docs = $kw->search({ words=>'upset stomach', boolean=>'AND' });
 $docs = $kw->search({ words=>'upset stomach', start=>11, num=>10 });
 $docs = $kw->search({ words=>'upset (bite|stomach)', re=>1 });

 $kw->add_stop_word(['the','an','am','is','are']) or die $kw->{ERROR};
 $words = $kw->common_word(85);
 $kw->remove_word(['gingko', 'bibola']) or die $kw->{ERROR};

 $ndocs  = $kw->document_count();
 $nwords = $kw->word_count();

 $kw->remove_index or die $kw->{ERROR};
 $kw->empty_index  or die $kw->{ERROR};
     _________________________________________________________________
   
                                  DESCRIPTION
                                       
   DBIx::KwIndex is a keyword indexer. It indexes documents and stores
   the index data in database tables. You can tell DBIx::KwIndex to index
   lots of documents and later on show you which ones contain a certain
   word. The typical application of DBIx::KwIndex is in a search engine.
   
   How to use this module:
    1. Provide a database handle.
 use DBI;
 my $dbh = DBI->connect(...) or die $DBI::errstr;
    2. Subclass DBIx::KwIndex and provide a `document_sub' method to
       retrieve documents referred to by an integer id. The method should
       accept a list of document ids in an array reference and return the
       documents in an array reference. In this way, you can index any
       kind of document that you want: text files, HTML files, BLOB
       columns, etc., as long as you provide a suitable document_sub()
       to retrieve the documents. The one thing to remember is that each
       document must be referred to by a unique integer. Below is a
       sample document_sub() that retrieves documents from the
       'content' field of a database table.
 package MyKwIndex;
 require DBIx::KwIndex;
 use base 'DBIx::KwIndex';
 sub document_sub {
    my ($self, $ary_ref) = @_;
    my $dbh = $self->{dbh};
    my $result = $dbh->selectall_arrayref(
        'SELECT id,content FROM documents
         WHERE id IN ('. join(',',@$ary_ref). ')');

    # if retrieval fails, you should return undef
    defined($result) or return undef;

    # now return the content field in the order of the ids
    # requested. remember to return the documents exactly
    # in the order requested!
    my %tmp = map { $_->[0] => $_->[1] } @$result;
    return [ @tmp{ @$ary_ref } ];
 }
    3. Create the indexer object.
 my $kw = MyKwIndex->new({
          dbh => $dbh,
          index_name => 'article_index',
          # other options...
          });
       dbh is the database handle. index_name is the name of the index;
       DBIx::KwIndex will create several tables, all of them prefixed
       with the index_name. The default index_name is 'kwindex'. Other
       options include: max_word_length (default 32).
    4. Index some documents. You can index one document at a time, e.g.
 $kw->add_document([1]) or die $kw->{ERROR};
 $kw->add_document([2]) or die $kw->{ERROR};
       or small batches of documents at a time:
 $kw->add_document([1..10])  or die $kw->{ERROR};
 $kw->add_document([11..20]) or die $kw->{ERROR};
       or large batches of documents at a time:
 $kw->add_document([1..300])   or die $kw->{ERROR};
 $kw->add_document([301..600]) or die $kw->{ERROR};
       Which one to choose is a matter of memory-speed trade-off. Larger
       batches will increase the speed of indexing, but with increased
       memory usage.
       Note: DBIx::KwIndex ignores single-character words, numbers, and
       words longer than 'max_word_length'.
    5. If you want to search the index, use the search() method.
 $docs = $kw->search({ words => 'upset stomach' });
 die "can't search" if !defined($docs);
       The search() method returns an ARRAY ref containing the ids of
       the documents that match the criteria. Other parameters include:
       num => maximum number of results to retrieve; start => starting
       position (1 = from the beginning); boolean => 'AND' or 'OR'
       (default is 'OR'); re => use regular expression, 1 or 0.
       Note: num and start use the LIMIT clause (which is quite unique
       to MySQL). re uses the REGEXP clause. Do not use these options if
       your database server does not support them.
       Also note: Searching is entirely done from the index. No documents
       will be retrieved while searching. A simple 'relevancy' ranking is
       used. Search is case-insensitive and there is no phrase-search
       support yet.
       Some examples:
 # retrieve only the 11th-20th result.
 $docs = $kw->search({ words=>'upset stomach', start=>11, num=>10 });
 die "can't search" if !defined($docs);
 # find documents which contain all the words.
 $docs = $kw->search({ words=>'upset stomach', boolean=>'AND' });
 die "can't search" if !defined($docs);
    6. Now suppose some documents change, and you need to update the
       index to reflect that. Just use the methods below.
 # if you want to remove documents from the index
 $kw->remove_document([90..100]) or die $kw->{ERROR};
 # if you want to update the index
 $kw->update_document([90..100]) or die $kw->{ERROR};
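
   The document_sub() contract from step 2 (ids in, documents out in the
   same order, undef on failure) does not depend on a database table. As
   a rough sketch only, the package below reads each document from a
   text file. The package name MyFileKwIndex, the doc_dir attribute, the
   "<id>.txt" file layout, and the stand-in new() are all assumptions
   made for illustration; a real subclass would say
   use base 'DBIx::KwIndex' and inherit new() instead.

```perl
package MyFileKwIndex;
use strict;
use warnings;

# Stand-in constructor so the sketch runs on its own (assumption).
# A real subclass would inherit new() from DBIx::KwIndex.
sub new {
    my ($class, $args) = @_;
    return bless { doc_dir => $args->{doc_dir} }, $class;
}

# Hypothetical layout: document N is stored in "<doc_dir>/N.txt".
sub document_sub {
    my ($self, $ids) = @_;
    my @docs;
    for my $id (@$ids) {
        my $path = "$self->{doc_dir}/$id.txt";
        # if any retrieval fails, return undef for the whole batch
        open(my $fh, '<', $path) or return undef;
        local $/;              # slurp mode: read the whole file at once
        my $text = <$fh>;
        close($fh);
        push @docs, $text;     # pushed in the order the ids were requested
    }
    return \@docs;
}

package main;
use File::Temp qw(tempdir);

# Create three small sample documents in a temporary directory.
my $dir = tempdir(CLEANUP => 1);
for my $id (1 .. 3) {
    open(my $out, '>', "$dir/$id.txt") or die "can't write: $!";
    print {$out} "text of document $id";
    close($out);
}

my $idx  = MyFileKwIndex->new({ doc_dir => $dir });
my $docs = $idx->document_sub([3, 1]);
print "$_\n" for @$docs;
```

   Because the loop pushes one document per requested id, the result
   follows the order of the ids without the hash-slice step that the
   database version needs.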
     _________________________________________________________________
   
                             SOME UTILITY METHODS
                                       
   If you want to exclude some words (usually very common words, or
   ``stop words'') from being indexed, do this before you index any
   document:
   
 $kw->add_stop_word(['the','an','am','is','are'])
   or die "can't add stop words";

   Adding stop words is a good thing to do, as stop words are not very
   useful for your index. They occur in a large proportion of documents
   (they do not help searches differentiate documents) and they increase
   the size of your index (slowing the searches).
   
   But which words are common in your collection? You can use the
   common_word method:
   
 $words = $kw->common_word(85);

   This will return an array reference containing all the words that
   occur in at least 85% of all documents (default is 80%).
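
   To make the percentage threshold concrete, here is a minimal sketch
   of the same idea applied to documents held in memory. It is not how
   DBIx::KwIndex computes the list (the module works from its index
   tables); the function name common_words_in and the simple tokenizer
   (lowercase, split on anything that is not a letter) are assumptions
   made for illustration.

```perl
use strict;
use warnings;

# Return the words that occur in at least $percent percent of @$docs.
# Illustration only: the tokenizer is an assumption, not the module's own.
sub common_words_in {
    my ($docs, $percent) = @_;
    $percent = 80 unless defined $percent;   # same default as common_word

    my %doc_freq;                            # word => number of documents
    for my $doc (@$docs) {
        my %seen;
        $seen{$_} = 1 for grep { length } split /[^a-z]+/, lc $doc;
        $doc_freq{$_}++ for keys %seen;      # count each word once per document
    }

    # A word qualifies if it appears in at least this many documents.
    my $min_docs = @$docs * $percent / 100;
    my @common = grep { $doc_freq{$_} >= $min_docs } keys %doc_freq;
    return [ sort @common ];
}

my @docs = (
    'The cat sat on the mat',
    'The dog ate the bone',
    'A bird is in the tree',
    'The fish swam away',
);
my $common = common_words_in(\@docs, 75);
print "@$common\n";    # prints "the"
```

   A word is counted once per document no matter how often it repeats
   inside that document, which is what "occurs in N% of all documents"
   asks for.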
   
   If you want to delete some words from the index:
   
 $kw->remove_word(['common','cold'])
   or die "can't remove words";

   To get some statistics about your index:
   
 # the number of documents
 $ndocs = $kw->document_count();
 # the number of words
 $nwords = $kw->word_count();

   Last, if you got bored with the index and want to delete it:
   
 $kw->remove_index or die $kw->{ERROR};

   This will delete the database tables. Or, if you just want to empty
   the index and start all over:
   
 $kw->empty_index or die $kw->{ERROR};
     _________________________________________________________________
   
                                    AUTHOR
                                       
   Steven Haryanto <steven@haryan.to>
     _________________________________________________________________
   
                                   COPYRIGHT
                                       
   Copyright (c) 1995-1999 Steven Haryanto. All rights reserved.
   
   You may distribute under the terms of either the GNU General Public
   License or the Artistic License, as specified in the Perl README file.
     _________________________________________________________________
   
                              BUGS/CAVEATS/TODOS
                                       
   Test the module under other database servers (besides MySQL).
   
   Use a more correct search sorting (the current one is kinda bogus :).
   
   Probably implement phrase-searching (but this will require a larger
   vectorlist).
   
   Probably, maybe, implement English/Indonesian stemming.
   
   Any safer, non database-specific way to test existence of tables other
   than $dbh->tables?
     _________________________________________________________________
   
                                     NOTES
                                       
   At least two other Perl extensions exist for creating keyword indices
   and storing them in a database: DBIx::TextIndex and MyConText. As of
   this writing, only DBIx::TextIndex features phrase-searching and
   boolean NOT, and only DBIx::KwIndex offers a way to delete documents
   from the index (but please see the updated versions and documentation
   of those modules for details). I personally find DBIx::KwIndex more
   convenient when I need to index documents that change often, because
   one can add or remove documents without rebuilding the entire index.
   
   Advice, comments, and patches are welcome.
     _________________________________________________________________
   
                                    HISTORY
                                       
   0001xx=first draft,satunet.com. 000320=words->scalar.
   000412=0.01/documentation/cpan.
