NAME
    Text::Scan - Fast search for very large numbers of keys in a body of
    text.

SYNOPSIS
            use Text::Scan;

            $dict = Text::Scan->new;

            %terms = ( dog  => 'canine',
                       bear => 'ursine',
                       pig  => 'porcine' );

            # load the dictionary with keys and values
            # (values can be any scalar, keys must be strings)
            while( ($key, $val) = each %terms ){
                    $dict->insert( $key, $val );
            }

            # Scan a document for matches
            %found = $dict->scan( $document );

            # Or, if you need to count the number of occurrences of a given
            # key, assign the result to an array. A hash keeps only one entry
            # per key, whereas the array is a countable flat list of
            # key => value pairs.
            @found = $dict->scan( $document );

            # Check for membership ($val is true)
            $val = $dict->has('pig');

            # Retrieve all keys
            @keys = $dict->keys();

            # Like perl's index() but with multiple patterns (new in v0.07)
            # Scan for the starting positions of terms.
            @indices = $dict->mindex( $document );

            # The hash version of mindex() records the position of the first
            # occurrence of each key
            %indices = $dict->mindex( $document );

            # Turn on wildcard scanning (new in v0.09).
            # This can be done at any time. Works for scan() and mindex()
            $dict->usewild();
                
DESCRIPTION
    This module provides facilities for fast searching on arbitrarily long
    texts with arbitrarily many search keys. The basic object behaves
    somewhat like a perl hash, except that you can retrieve based on a
    superstring of any keys stored. Simply scan a string as shown above and
    you will get back a perl hash (or list) of all keys found in the string,
    along with their associated values (or their positions, if you use
    mindex() instead of scan()). Matching observes longest/first order: each
    subsequent match begins at the end of the last successful match, and
    matches are "greedy", as in perl regular expressions.
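
    The longest/first behaviour described above can be illustrated with a
    plain regular expression. This is only a sketch of the matching
    semantics, not the module's implementation; scan_longest_first() is a
    hypothetical helper written for this example and is not part of
    Text::Scan.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of Text::Scan): emulate longest/first
# matching of space-delimited keys with a plain regex.
sub scan_longest_first {
    my ($text, %dict) = @_;
    return unless %dict;

    # Try longer keys first so that 'black bear' wins over 'black'.
    my $alt = join '|', map { quotemeta($_) }
              sort { length($b) <=> length($a) } keys %dict;

    my @found;
    # With /g, each match resumes where the previous one ended, so
    # matches never overlap. The lookarounds require a space (or the
    # end of the string) on both sides of a key.
    while ( $text =~ /(?<!\S)($alt)(?!\S)/g ) {
        push @found, $1 => $dict{$1};
    }
    return @found;
}

my @pairs = scan_longest_first(
    'a black bear ate my dog',
    'black'      => 'colour',
    'black bear' => 'ursine',
    'bear'       => 'ursine',
    'dog'        => 'canine',
);
print "@pairs\n";
```

    Here 'black bear' is reported once, and neither 'black' nor 'bear' is
    reported separately, because the longer key consumed that part of the
    text.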

    IMPORTANT: As of this version, a single space is used as a delimiter for
    purposes of recognizing key boundaries. That's right, there is a bias in
    favor of processing natural language! In other words, if 'my dog' is a
    key and 'my dogs bite' is the text, 'my dog' will not be recognized. I
    plan to make this more configurable in the future, to have a different
    delimiter or none at all. For now, recognize that the key 'drunk' will
    not be found in the text 'gedrunk' or 'drunken' (or 'drunk.' for that
    matter). Properly tokenizing your corpus is essential. I know there is
    probably a better solution to the problem of substrings, and if anyone
    has suggestions, by all means contact me.

    To be honest, what I am leaning toward is simply having no implicit
    delimiter at all, and relying on the programmer to use a chosen
    delimiter when inserting keys, then tokenizing the target text properly
    so that the delimiter is present at boundaries as defined by your
    application. This would leave you free to have no delimiter if you
    really want "drunk" to match "gedrunk", "drunken", "drunk." etc. The
    chore of tokenizing the target would be mitigated by pattern matching
    capabilities (hmm..)
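
    In the meantime, one way to prepare a text for the single-space
    delimiter is to normalize it before scanning. The following is a minimal
    sketch under stated assumptions: tokenize() is a hypothetical helper (not
    part of Text::Scan), the keys are assumed to be lowercase ASCII, and
    apostrophes are assumed to belong to words. Adjust the character class
    to suit your own corpus.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of Text::Scan): normalize text so that
# every word is bounded by exactly one space.
sub tokenize {
    my ($text) = @_;
    $text = lc $text;               # assumption: keys are stored in lowercase
    $text =~ s/[^a-z0-9']+/ /g;     # punctuation and whitespace runs become one space
    $text =~ s/^ | $//g;            # trim the ends
    return $text;
}

print tokenize("He was drunk. My dog's bowl, too!"), "\n";
```

    After this step the key 'drunk' is followed by a space rather than a
    period, so it falls on a key boundary as described above.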

    NEW in v0.09: Wildcards! Limited wildcard functionality is available.
    Call usewild() to turn it on. Thereafter any asterisk (*) will be
    treated as "one or more non-space characters". Once this function is
    turned on, the scan will be approximately 50% slower than with literal
    strings. If you include '*' in any key without calling usewild(), the
    '*' will be treated literally.
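
    To make the meaning of "one or more non-space characters" concrete, the
    sketch below translates a wildcard key into an equivalent regular
    expression. It illustrates the stated semantics only and says nothing
    about how the module implements wildcards; wild_to_regex() is a
    hypothetical helper and is not part of Text::Scan.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of Text::Scan): turn a key containing '*'
# into a regex where '*' means "one or more non-space characters".
sub wild_to_regex {
    my ($key) = @_;
    # Split on '*', escape the literal pieces, and rejoin them with \S+.
    # The -1 limit keeps a trailing empty piece, so 'drunk*' ends in \S+.
    my $pat = join '\S+', map { quotemeta($_) } split /\*/, $key, -1;
    return qr/(?<!\S)${pat}(?!\S)/;
}

my $re = wild_to_regex('drunk*');
for my $text ('a drunken sailor', 'a drunk sailor') {
    printf "%-18s %s\n", $text, $text =~ $re ? 'match' : 'no match';
}
```

    Under this reading, 'drunk*' matches 'drunken' but not 'drunk' itself,
    since the asterisk must consume at least one character.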

TO DO
    Some obvious things have not been implemented: deletion of key/value
    pairs, patterns as keys (kind of a big one), the above-mentioned
    elimination of the default boundary marker ' ', and the possibility of
    calling scan() with a filehandle instead of a string scalar.

CREDITS
    This code is heavily borrowed from both Bentley & Sedgewick, and Leon
    Brocard's additions to it for "Tree::Ternary_XS". The differences are in
    the modified search algorithm to allow for scanning, the storage of
    keys/values, and an extra node-rotation for gradual self-adjusting
    optimization to the statistical characteristics of the target text.

    Many test scripts come directly from Rogaski's "Tree::Ternary" module.

    The C code interface was created using Ingerson's "Inline".

SEE ALSO
    Bentley & Sedgewick, "Fast Algorithms for Sorting and Searching Strings",
    Proceedings ACM-SIAM (1997)

    Bentley & Sedgewick, "Ternary Search Trees", Dr. Dobb's Journal (1998)

    Sleator & Tarjan, "Self-Adjusting Binary Search Trees", Journal of the
    ACM (1985)

    "Tree::Ternary"

    "Tree::Ternary_XS"

    "Inline"

COPYRIGHT
    Copyright 2001 Ira Woodhead, H5 Technologies. All rights reserved.

    This library is free software; you can redistribute it and/or modify it
    under the same terms as Perl itself.

AUTHOR
    Ira Woodhead, ira@h5technologies.com

