NAME
    "Text::Categorize::Textrank" - Computes textrank of text.

SYNOPSIS
      use Text::Categorize::Textrank;
      use Data::Dump qw(dump);
      my $textranker = Text::Categorize::Textrank->new;

DESCRIPTION
    "Text::Categorize::Textrank" uses the modules Text::StemTagPOS and
    Graph::Centrality::Pagerank to compute the textrank of words given the
    text. Encoding of all text should be in Perl's internal format; see the
    module Text::Iconv for converting text from various encodings.

CONSTRUCTOR
  "new"
    The method "new" creates an instance of the "Text::Categorize::Textrank"
    class with the following parameters:

    "isoLangCode"
         isoLangCode => 'en'

        "isoLangCode" is the ISO language code of the language that will be
        tagged and stemmed by the object. It must be 'en', which is also the
        default. In the future 'de' may be added.

    "endingSentenceTag"
         endingSentenceTag => 'PP'

        "endingSentenceTag" is the part-of-speech tag that should be used it
        indicate the end of a sentence. The default is 'PP'. The value of
        this tag must be a tag generated by the module Lingua::EN::Tagger.

    "listOfPOSTypesToKeep"
         listOfPOSTypesToKeep => [...]

        The Textrank method preprocesses the text so that only specific
        parts of speech are retained and used to build the graph
        representing the text. The module Lingua::EN::Tagger is used to tag
        the parts-of-speech of the text. The parts-of-speech retained can be
        specified by word types, where the type is a combination of 'ALL',
        'ADJECTIVES', 'ADVERBS', 'CONTENT_WORDS', 'NOUNS', 'PUNCTUATION',
        'TEXTRANK_WORDS', or 'VERBS'. The default is "[qw(TEXTRANK_WORDS)]",
        which equates to "[qw(ADJECTIVES NOUNS)]".

    "listOfPOSTagsToKeep"
         listOfPOSTagsToKeep => [...]

        "listOfPOSTagsToKeep" provides finer control over the
        parts-of-speech to be retained when filtering previously tagged
        text. For a list of all the possible tags call
        "getListOfPartOfSpeechTags".

    "stdevFactor"
         stdevFactor => 1

        If "avgPageRank" is the average pagerank score and "stdevPageRank"
        is their standard deviation, then a words page rank must be greater
        than or equal to the "avgPageRank + stdevFactor * stdevPageRank" to
        be selected as keyword. The default is one.

    "computeStatsIgnoringRepeatedWords"
         computeStatsIgnoringRepeatedWords => 0

        When computing the average and standard deviation of the pageranks
        of the words this can be done by treating the text as a sequence of
        words or a set of words; in the latter case repeating words are
        ignored. For example, suppose the only words in the text were "very"
        and "good" with pageranks of .3 and .7 respectively. If the original
        text is "Very very good.", then the average pagerank of the text is
        "(.3 + .3 + .7) / 3". If the pagerank is computed ignoring repeated
        words it is "(.3 + .7) / 2".

    "getTopKeyPhrases"
         getTopKeyPhrases => 10

        Instead of using "stdevFactor" to chose the top key phrases to
        extract from the text, "getTopKeyPhrases" can be used to return the
        top scoring words or phrases. This is the default method, with the
        top 10 words or phrases being returned. Note, if the text has ten
        words or less, the entire text will be returned as one phrase.

METHODS
  "getTextrankScores"
     getTextrankScores (...)

    The method "getTextrankScores" returns a hash-reference of all the
    stemmed words used in computing the textrank. The value of each hash is
    the textrank of the word. The sum of all the textranks is one.

    "listOfText"
         listOfText => [...]

        "listOfText" is the array reference of the strings of text that is
        to be categorized.

    "tokenEdgeSpan"
         tokenEdgeSpan => 3

        When building the textrank graph, "tokenEdgeSpan" tokenEdgeSpan
        holds the number of successive words linked by an edge. For example,
        given the tokens "apple orange pear", the edges "[apple, orange]"
        and "[apple, pear]" will be created for the token "apple" if
        $tokenEdgeSpan is three. The default is two. Need to change this!

    "undirectedGraph"
         directedGraph => 0

        Computing the textrank of the works involves building a graph from
        the words. If "directedGraph" evaluates to true, then the graph will
        be directed. The default is false, that is, the graph is undirected.

    "pageRankDampeningFactor"
         pageRankDampeningFactor => 0.85

        When computing the pagerank of the textrank graph, the dampening
        factor specified by "pageRankDampeningFactor" will be used; it
        should range from zero to one. The default is 0.85.

    "useEdgeWeights"
         useEdgeWeights => 1

        If "useEdgeWeights" is true, then when computing the pagerank of the
        textrank graph the edge weights will be included in the calculation.
        The default is true.

    "addEdgesSpanningSentences"
         addEdgesSpanningSentences => 1

        If "addEdgesSpanningSentences" is true, then when build the textrank
        graph, links between the words at the end of sentence and the
        beginning of the next sentence will be made. The default is true.

INSTALLATION
    To install the module run the following commands:

      perl Makefile.PL
      make
      make test
      make install

    If you are on a windows box you should use 'nmake' rather than 'make'.

AUTHOR
     Jeff Kubina<jeff.kubina@gmail.com>

COPYRIGHT
    Copyright (c) 2009 Jeff Kubina. All rights reserved. This program is
    free software; you can redistribute it and/or modify it under the same
    terms as Perl itself.

    The full text of the license can be found in the LICENSE file included
    with this module.

SEE ALSO
    Lingua::Stem::Snowball, Lingua::EN::Tagger, and Lingua::DE::Tagger

