NAME

    Lingua::StopWords - Stop words for several languages.

SYNOPSIS

        use Lingua::StopWords qw( getStopWords );
        my $stopwords = getStopWords('en');
    
        my @words = qw( i am the walrus goo goo g'joob );
    
        # prints "walrus goo goo g'joob"
        print join ' ', grep { !$stopwords->{$_} } @words;

DESCRIPTION

    In keyword search, it is common practice to suppress a collection of
    "stopwords": words such as "the", "and", "maybe", etc. which occur in
    a large number of documents and tell you nothing important about any
    document that contains them. This module provides such "stoplists" in
    several languages.
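    As a rough illustration of suppressing stopwords before indexing, the
    sketch below counts word frequencies in a short text and skips anything
    found in the English stoplist. The sample text, the ASCII-only
    tokenizing regex, and the %count hash are assumptions made for this
    example; they are not part of the module.

```perl
use strict;
use warnings;
use Lingua::StopWords qw( getStopWords );

my $stopwords = getStopWords('en');

# Hypothetical sample text; any plain ASCII English text would do.
my $text = 'The walrus and the carpenter were walking close at hand';

# Tokenize on runs of letters and apostrophes, lowercase each token,
# and count only the tokens that are not stopwords.
my %count;
for my $word ( map { lc($_) } $text =~ /([A-Za-z']+)/g ) {
    next if $stopwords->{$word};
    $count{$word}++;
}

print "$_: $count{$_}\n" for sort keys %count;
```

    Lowercasing before the lookup matters here: the stoplist keys shown
    in this document are lowercase, so "The" would otherwise slip through.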

 Supported Languages

        |-----------------------------------------------------------|
        | Language   | ISO code | default encoding | also available |
        |-----------------------------------------------------------|
        | Danish     | da       | ISO-8859-1       | UTF-8          |
        | Dutch      | nl       | ISO-8859-1       | UTF-8          |
        | English    | en       | ISO-8859-1       | UTF-8          |
        | Finnish    | fi       | ISO-8859-1       | UTF-8          |
        | French     | fr       | ISO-8859-1       | UTF-8          |
        | German     | de       | ISO-8859-1       | UTF-8          |
        | Hungarian  | hu       | ISO-8859-2       | UTF-8          |
        | Indonesian | id       | ISO-8859-1       | UTF-8          |
        | Italian    | it       | ISO-8859-1       | UTF-8          |
        | Norwegian  | no       | ISO-8859-1       | UTF-8          |
        | Portuguese | pt       | ISO-8859-1       | UTF-8          |
        | Romanian   | ro       | ISO-8859-2       | UTF-8          |
        | Spanish    | es       | ISO-8859-1       | UTF-8          |
        | Swedish    | sv       | ISO-8859-1       | UTF-8          |
        | Russian    | ru       | KOI8-R           | UTF-8          |
        |-----------------------------------------------------------|

FUNCTIONS

 getStopWords

        my $stoplist      = getStopWords('en');
        my $utf8_stoplist = getStopWords('en', 'UTF-8');

    Retrieve a stoplist in the form of a hashref where the keys are all
    stopwords and the values are all 1.

        $stoplist = {
            and => 1,
            if  => 1,
            # ...
        };
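    Because the stoplist is an ordinary hashref with every value set to 1,
    it can be copied and extended like any other hash. The sketch below
    adds two domain-specific words to a copy of the English list; the
    extra words and the %custom name are hypothetical choices for this
    example.

```perl
use strict;
use warnings;
use Lingua::StopWords qw( getStopWords );

my $base = getStopWords('en');

# Copy the stock list, then add extra words (hypothetical choices).
my %custom = ( %$base, map { $_ => 1 } qw( goo walrus ) );

my @words = qw( i am the walrus goo goo g'joob );

# prints "g'joob"
print join( ' ', grep { !$custom{$_} } @words ), "\n";
```

    Working on a copy leaves the hashref returned by getStopWords()
    untouched.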

    getStopWords() takes one or two arguments. The first, which is
    required, is an ISO code representing a supported language. If the ISO
    code cannot be found, getStopWords() returns undef.
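    Since an unsupported code yields undef, callers may want to guard
    against dereferencing it. The sketch below is one possible way to do
    that; stoplist_or_empty() is a hypothetical helper, not part of the
    module, and 'xx' stands in for any code that is not listed under
    Supported Languages.

```perl
use strict;
use warnings;
use Lingua::StopWords qw( getStopWords );

# Hypothetical helper: fall back to an empty stoplist when the language
# code is not supported, so later lookups never dereference undef.
sub stoplist_or_empty {
    my ($lang) = @_;
    my $stoplist = getStopWords($lang);
    return defined $stoplist ? $stoplist : {};
}

my $en      = stoplist_or_empty('en');
my $unknown = stoplist_or_empty('xx');    # assumed to be unsupported

printf "en: %d stopwords, xx: %d stopwords\n",
    scalar( keys %$en ), scalar( keys %$unknown );
```

    Whether an empty list or a fatal error is the better response to an
    unknown language depends on the application.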

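    The second argument is optional. Pass 'UTF-8' if you want the
    stopwords encoded in UTF-8 rather than in the language's default
    encoding (see Supported Languages). The UTF-8 flag will be turned on
    for the returned strings, so make sure you understand all the
    implications of that.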
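    A minimal sketch of working with a UTF-8 stoplist follows. It assumes
    that the keys come back as Perl character strings (which is what the
    UTF-8 flag suggests), so the words being tested must be character
    strings too and output has to be encoded explicitly. The German sample
    words are hypothetical.

```perl
use strict;
use warnings;
use utf8;                                   # this source file is UTF-8
use Lingua::StopWords qw( getStopWords );

binmode STDOUT, ':encoding(UTF-8)';         # encode characters on output

my $stopwords = getStopWords( 'de', 'UTF-8' );

# Hypothetical input. Text read from files or sockets would need to be
# decoded first (for example with Encode::decode) before the lookup.
my @words = qw( ein Haus für alle );

print join( ' ', grep { !$stopwords->{ lc $_ } } @words ), "\n";
```

    Mixing undecoded byte strings with such a stoplist is a likely source
    of silent lookup misses for words containing non-ASCII characters.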

SEE ALSO

    The stoplists supplied by this module were created as part of the
    Snowball project (see http://snowball.tartarus.org,
    Lingua::Stem::Snowball).

    Lingua::EN::StopWords provides a different stoplist for English.

SOURCE REPOSITORY

    https://github.com/wollmers/Lingua-StopWords

AUTHOR

    Maintained by Helmut Wollmersdorfer <helmut@wollmersdorfer.at> and
    Marvin Humphrey <marvin at rectangular dot com>. Original author Fabien
    Potencier, <fabpot at cpan dot org>.

COPYRIGHT

    Copyright 2021 Helmut Wollmersdorfer
    Copyright 2004-2008 Fabien Potencier, Marvin Humphrey

LICENSE

    This library is free software; you can redistribute it and/or modify it
    under the same terms as Perl itself, either Perl version 5.8.3 or, at
    your option, any later version of Perl 5 you may have available.

