NAME
    DBIx::TextIndex - Perl extension for full-text searching in SQL
    databases

SYNOPSIS
    use DBIx::TextIndex;

    my $index = DBIx::TextIndex->new({
        document_dbh      => $document_dbh,
        document_table    => 'document_table',
        document_fields   => ['column_1', 'column_2'],
        document_id_field => 'primary_key',
        index_dbh         => $index_dbh,
        collection        => 'collection_1',
    });

    $index->initialize;

    $index->add_document(\@document_ids);

    my $results = $index->search({
        column_1 => '"a phrase" +and -not or',
        column_2 => 'more words',
    });

    foreach my $document_id
        (sort { $$results{$b} <=> $$results{$a} } keys %$results)
    {
        print "DocumentID: $document_id Score: $$results{$document_id} \n";
    }

    $index->delete;

DESCRIPTION
    DBIx::TextIndex was developed for doing full-text searches on BLOB
    columns stored in a MySQL database. Almost any database with BLOB and
    DBI support should work with minor adjustments to SQL statements in the
    module.

    The module implements a crude parser that tokenizes a user's query
    string into phrases, can-include words, must-include words, and
    must-not-include words.
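
    As a rough illustration of that token classification, the sketch below
    splits a query string into the four groups. The helper name
    tokenize_query() and its return structure are hypothetical and are not
    taken from the module; the real parser may behave differently.

```perl
use strict;
use warnings;

# Hypothetical helper, not the module's actual parser: split a query
# string into phrases, must-include (+), must-not-include (-), and
# can-include (bare) words.
sub tokenize_query {
    my ($query) = @_;
    my %tokens = (phrases => [], must => [], must_not => [], can => []);

    # Pull out double-quoted phrases first, leaving a space in their place.
    while ($query =~ s/"([^"]*)"/ /) {
        push @{ $tokens{phrases} }, $1 if length $1;
    }

    # Classify the remaining whitespace-separated words by their prefix.
    foreach my $word (split ' ', $query) {
        if    ($word =~ /^\+(.+)$/) { push @{ $tokens{must} },     $1 }
        elsif ($word =~ /^-(.+)$/)  { push @{ $tokens{must_not} }, $1 }
        else                        { push @{ $tokens{can} },      $word }
    }
    return \%tokens;
}

my $tokens = tokenize_query('"a phrase" +and -not or');
print "phrases:  @{ $tokens->{phrases} }\n";
print "must:     @{ $tokens->{must} }\n";
print "must_not: @{ $tokens->{must_not} }\n";
print "can:      @{ $tokens->{can} }\n";
```

    Extracting the quoted phrases before splitting on whitespace keeps the
    words inside a phrase from being classified individually.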

    The following methods are available:

  $index = DBIx::TextIndex->new(\%args)

    Constructor method. The first time an index is created, the following
    arguments must be passed to new():

    my $index = DBIx::TextIndex->new({
        document_dbh      => $document_dbh,
        document_table    => 'document_table',
        document_fields   => ['column_1', 'column_2'],
        document_id_field => 'primary_key',
        index_dbh         => $index_dbh,
        collection        => 'collection_1',
    });

    document_dbh
        DBI connection handle to database containing text documents

    document_table
        Name of database table containing text documents

    document_fields
        Reference to a list of column names to be indexed from
        document_table

    document_id_field
        Name of a unique integer key column in document_table

    index_dbh
        DBI connection handle to database containing TextIndex tables. I
        recommend using a separate database for your TextIndex, because the
        module creates and drops tables without warning.

    collection
        A name for the index. Should contain only alphanumeric characters
        or underscores: [A-Za-z0-9_].

    After creating a new TextIndex for the first time, and after calling
    initialize(), only the index_dbh, document_dbh, and collection arguments
    are needed to create subsequent instances of a TextIndex.
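
    A minimal sketch of reopening an existing index follows. The helpers
    is_valid_collection_name() and reopen_args() are hypothetical
    conveniences, not part of the module, and the placeholder strings stand
    in for real DBI handles.

```perl
use strict;
use warnings;

# Hypothetical helper: check a collection name against the documented
# character set [A-Za-z0-9_] before handing it to the constructor.
sub is_valid_collection_name {
    my ($name) = @_;
    return (defined($name) && $name =~ /\A[A-Za-z0-9_]+\z/) ? 1 : 0;
}

# Hypothetical helper: build the reduced argument hash for reopening an
# index that has already been initialized.
sub reopen_args {
    my ($document_dbh, $index_dbh, $collection) = @_;
    die "Invalid collection name\n"
        unless is_valid_collection_name($collection);
    return {
        document_dbh => $document_dbh,
        index_dbh    => $index_dbh,
        collection   => $collection,
    };
}

# With real DBI handles, the result would be passed to the constructor:
#   my $index = DBIx::TextIndex->new(
#       reopen_args($document_dbh, $index_dbh, 'collection_1'));
my $args = reopen_args('DOC_DBH_PLACEHOLDER', 'IDX_DBH_PLACEHOLDER',
                       'collection_1');
print join(', ', sort keys %$args), "\n";
```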

  $index->initialize

    This method creates all the inverted tables for the TextIndex in the
    database specified by index_dbh. This method should be called only
    once when creating a new index! It drops all the inverted tables before
    creating new ones.

    initialize() also stores the document_table, document_fields, and
    document_id_field attributes in a special table called "collection", so
    subsequent calls to new() for a given collection do not need those
    arguments.

  $index->add_document(\@document_ids)

    Add all the @document_ids from document_id_field to the TextIndex.
    @document_ids must be sorted from lowest to highest. All further calls
    to add_document() must use @document_ids higher than those previously
    added to the index. Reindexing previously indexed documents will yield
    unpredictable results!
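
    One way to guard against those ordering mistakes is to normalize each
    batch before indexing it. In the sketch below, prepare_document_ids()
    is a hypothetical helper, and $max_indexed_id is a value the caller is
    assumed to track; the module is not documented to expose it.

```perl
use strict;
use warnings;

# Hypothetical helper: sort a batch of ids numerically, drop duplicates,
# and refuse any batch whose lowest id is not higher than the highest id
# already indexed.
sub prepare_document_ids {
    my ($ids, $max_indexed_id) = @_;
    my %seen;
    my @sorted = sort { $a <=> $b } grep { !$seen{$_}++ } @$ids;
    if (@sorted && defined $max_indexed_id && $sorted[0] <= $max_indexed_id) {
        die "Document id $sorted[0] is not higher than $max_indexed_id\n";
    }
    return \@sorted;
}

my $batch = prepare_document_ids([12, 10, 11, 10], 9);
print "@$batch\n";
# With a real index object: $index->add_document($batch);
```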

  $index->search(\%search_args)

    search() returns $results, a reference to a hash. The keys of the hash
    are document ids, and the values are the relative scores of the
    documents. If an error occurred while searching, $results will instead
    be a plain scalar (not a reference) containing an error message.

    $results = $index->search({
        first_field  => '+andword -notword orword "phrase words"',
        second_field => ...
        ...
    });

    if (ref $results) {
        print "The score for $document_id is $results->{$document_id}\n";
    } else {
        print "Error: $results\n";
    }
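
    Because the return value is either a hash reference or an error
    string, callers usually check it before ranking. The sketch below uses
    made-up scores in place of a real search() call, and top_results() is a
    hypothetical helper, not a method of the module.

```perl
use strict;
use warnings;

# Hypothetical helper: return up to $limit [id, score] pairs, best score
# first, with ties broken by ascending document id. Dies if $results is
# an error string instead of a hash reference.
sub top_results {
    my ($results, $limit) = @_;
    die "Search error: $results\n" unless ref($results) eq 'HASH';
    my @ids = sort { $results->{$b} <=> $results->{$a} || $a <=> $b }
              keys %$results;
    splice @ids, $limit if @ids > $limit;
    return map { [ $_, $results->{$_} ] } @ids;
}

# Made-up scores standing in for the return value of $index->search(...).
my $results = { 101 => 0.42, 205 => 1.75, 317 => 0.42, 422 => 0.90 };
foreach my $hit (top_results($results, 3)) {
    printf "DocumentID: %d Score: %.2f\n", @$hit;
}
```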

  $index->unscored_search(\%search_args)

    unscored_search() returns $document_ids, a reference to an array. Since
    the scoring algorithm is skipped, this method is much faster than
    search().

    $document_ids = $index->unscored_search({
        first_field  => '+andword -notword orword "phrase words"',
        second_field => ...
    });

    if (ref $document_ids) {
        print "Here's all the document ids:\n";
        map { print "$_\n" } @$document_ids;
    } else {
        print "Error: $document_ids\n";
    }

  $index->delete

    delete() removes the tables associated with a TextIndex from index_dbh.

CHANGES
    0.05
        Added unscored_search() which returns a reference to an array of
        document_ids, without scores. Should be much faster than scored
        search.

        Added error handling in case _occurence() doesn't return a number.

    0.04
        Bug fix: add_document() will return if passed empty array ref
        instead of producing error.

        Changed _boolean_compare() and _phrase_search() so and_words and
        phrases behave better in multiple-field searches. Result set for
        each field is calculated first, then union of all fields is taken
        for final result set.

        Scores are scaled lower in _search().

    0.03
        Added example scripts in examples/.

    0.02
        Added or_mask_set.

    0.01
        Initial public release. Should be considered beta, and methods may
        be added or changed until the first stable release.

AUTHOR
    Daniel Koch, dkoch@bizjournals.com

COPYRIGHT
    Copyright 1997, 1998, 1999, 2000, 2001 by Daniel Koch. All rights
    reserved.

LICENSE
    This package is free software; you can redistribute it and/or modify it
    under the same terms as Perl itself, i.e., under the terms of the
    "Artistic License" or the "GNU General Public License".

DISCLAIMER
    This package is distributed in the hope that it will be useful, but
    WITHOUT ANY WARRANTY; without even the implied warranty of
    MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

    See the "GNU General Public License" for more details.

ACKNOWLEDGEMENTS
    Thanks to Ulrich Pfeifer for ideas and code from the Man::Index module
    in the article "Information Retrieval, and What pack 'w' Is For" from
    The Perl Journal vol. 2 no. 2.

    Thanks to Steffen Beyer for the Bit::Vector module, which enables fast
    set operations in this module. Version 5.3 or greater of Bit::Vector is
    required by DBIx::TextIndex.

BUGS
    Uses quite a bit of memory.

    MySQL-specific SQL is used.

    Parser is not very good.

    Documentation is not complete.

    Phrase searching relies on a full-table scan. Any suggestions for
    adding word-proximity information to the index would be much
    appreciated.

    No facility for deleting documents from an index. Work-around: create a
    new index.

    Please feel free to email me (dkoch@bizjournals.com) with any questions
    or suggestions.

SEE ALSO
    perl(1).

