NAME
    Text::SpeedyFx - tokenize/hash large amount of strings efficiently

VERSION
    version 0.006

SYNOPSIS
        use Data::Dumper;
        use Text::SpeedyFx;

        my $sfx = Text::SpeedyFx->new;

        my $words_bag = $sfx->hash('To be or not to be?');
        print Dumper $words_bag;
        #$VAR1 = {
        #          '1422534433' => '1',
        #          '4120516737' => '2',
        #          '1439817409' => '2',
        #          '3087870273' => '1'
        #        };

        my $feature_vector = $sfx->hash_fv("thats the question", 8);
        print unpack('b*', $feature_vector);
        # 01001000

DESCRIPTION
    XS implementation of a very fast combined parser/hasher which works well
    on a variety of *bag-of-word* problems.

    Original implementation
    <http://www.hpl.hp.com/techreports/2008/HPL-2008-91R1.pdf> is in Java
    and was adapted for a better Unicode compliance.

METHODS
  new([$seed, $bits])
    Initialize parser/hasher, can be customized with the options:

    $seed
        Hash seed (default: 1).

    $bits
        How many bits do represent one character. The default value, 8,
        sacrifices Unicode handling but is fast and low on memory footprint.
        The value of 18 encompasses *Basic Multilingual*, *Supplementary
        Multilingual* and *Supplementary Ideographic* planes. See also
        "UNICODE SUPPORT"

  hash($octets)
    Parses $octets and returns a hash reference where keys are the hashed
    tokens and values are their respective count. $octets are assumed to
    represent UTF-8 string unless Text::SpeedyFx is instantiated with
    "$bits" == 8 (which forces Latin-1 mode, see "UNICODE SUPPORT"). Note
    that this is the slowest form due to the (computational) complexity of
    the Perl hash structure itself: "hash_fv()" is 350% faster, while
    "hash_min()" is up to 400% faster.

  hash_fv($octets, $n)
    Parses $octets and returns a feature vector (string of bits) with length
    $n. $n is supposed to be a multiplier of 8, as the length of the
    resulting feature vector is "ceil($n / 8)". See the included utilities
    cosine_sim and uniq_wc.

  hash_min($octets)
    Parses $octets and returns the hash with the lowest value. Useful in
    MinHash <http://en.wikipedia.org/wiki/MinHash> implementation. See also
    the included minhash_cmp utility.

UNICODE SUPPORT
    Due to the nature of Perl, Unicode support is handled differently from
    the original implementation. By default, Text::SpeedyFx recognizes UTF-8
    encoded code points in the range *00000-2FFFF*:

    *   Plane 0, the Basic Multilingual Plane (BMP, *0000–FFFF*)

    *   Plane 1, the Supplementary Multilingual Plane (SMP, *10000–1FFFF*)

    *   Plane 2, the Supplementary Ideographic Plane (SIP, *20000–2FFFF*)

    *   There are planes up to 16; however, as in Perl v5.16.2, there are no
        code points matching "isALNUM_utf8()" there (so it's irrelevant for
        proper algorithm operation).

    Although, there is a major drawback: in this mode, each instance
    allocates up to 1 MB of memory.

    If the application doesn't need to support code points beyond the Plane
    0 (like the original SpeedyFx implementation) it is possible to
    constraint the address space to 16 bits, which lowers memory allocation
    to up to 256 KB. In fact, Text::SpeedyFx constructor accepts bit range
    between 8 and 18 to address code points.

  LATIN-1 SUPPORT
    8 bit address space has one special meaning: it completely disables
    multibyte support. In 8 bit mode, each instance will only allocate 256
    bytes and hashing will be run up to 66% faster. Tokenization will
    fallback to *ISO 8859-1 West European languages (Latin-1)* character
    definitions.

BENCHMARK
    The test platform configuration:

    *   Intel® Core™ i7-2600 CPU @ 3.40GHz with 8 GB RAM;

    *   Ubuntu 11.10 (64-bit);

    *   Perl v5.16.2 (installed via perlbrew);

    *   enwik8 from the Large Text Compression Benchmark
        <https://cs.fit.edu/~mmahoney/compression/text.html>.

                         Rate     hash  hash_min_utf8  hash_fv  hash_min
        hash           19.5 MB/s    --           -68%     -78%      -80%
        hash_min_utf8  60.2 MB/s  209%             --     -34%      -40%
        hash_fv        90.5 MB/s  364%            50%       --       -9%
        hash_min       99.6 MB/s  411%            66%      10%        --

    All the tests except "hash_min_utf8" were made in Latin-1 mode.

REFERENCES
    *   Extremely Fast Text Feature Extraction for Classification and
        Indexing <http://www.hpl.hp.com/techreports/2008/HPL-2008-91R1.pdf>
        by George Forman <http://www.hpl.hp.com/personal/George_Forman/> and
        Evan Kirshenbaum <http://www.kirshenbaum.net/evan/index.htm>

    *   MinHash — выявляем похожие множества
        <http://habrahabr.ru/post/115147/>

    *   Фильтр Блума <http://habrahabr.ru/post/112069/>

    *   cosine_sim, minhash_cmp and uniq_wc utilities from this distribution

AUTHOR
    Stanislaw Pusep <stas@sysd.org>

COPYRIGHT AND LICENSE
    This software is copyright (c) 2012 by Stanislaw Pusep.

    This is free software; you can redistribute it and/or modify it under
    the same terms as the Perl 5 programming language system itself.

