NAME
    WordNet-SenseRelate-AllWords version 0.07

OVERVIEW
    This module carries out word sense disambiguation (WSD), which is the
    process of selecting the correct sense for a word in a given context. The
    correct sense is selected from a sense inventory which lists the
    possible meanings of a word. This module uses the WordNet lexical
    database as its sense inventory.

SYNOPSIS
        use WordNet::SenseRelate::AllWords;
        use WordNet::QueryData;

        my $qd = WordNet::QueryData->new;
    
        my %options = (wordnet => $qd,
                       measure => 'WordNet::Similarity::lesk'
                       );

        my $wsd = WordNet::SenseRelate::AllWords->new (%options);

        my @words = qw/when in the course of human events/;

        my @res = $wsd->disambiguate (window => 2, 
                                      tagged => 0, 
                                      scheme => 'normal',
                                      context => [@words],
                                      );
                                    
        print join (' ', @res), "\n";
   
CONTENTS
    When the distribution is unpacked, several subdirectories are created:

    /lib
        This directory contains the Perl modules that do the actual work of
        disambiguation. By default, these files are installed into
        /usr/local/lib/perl5/site_perl/PERL_VERSION (where PERL_VERSION is
        the version of Perl you are using). See the INSTALL file for more
        information.

    /utils
        This directory contains a number of scripts that let you run word
        sense disambiguation experiments and reformat data.

        These scripts will be installed when 'make install' is run. By
        default, these files are installed into your /usr/local/bin
        directory. See the INSTALL file for more information. The scripts in
        this directory are:

        wsd.pl
            This very useful script can be used to disambiguate a file of
            words. It is discussed in greater detail later in this document.

        semcor-reformat.pl
            This script will reformat a Semcor file so that it can be used
            as input to wsd.pl.

        scorer2-format.pl
            This script will reformat the output of wsd.pl so that it can be
            used as input to the Senseval scorer2 program.

        Each of these scripts has detailed documentation. Run perldoc on a
        file to see the detailed documentation; for example, 'perldoc
        wsd.pl' shows the documentation for wsd.pl.

    /doc
        This directory contains all of the *.pod files used to document the
        system. These are processed via pod2text and the output is placed
        in the top level directory. The top level text files are generated
        and should be considered read only.

    /samples
        This directory contains examples of the different formats of data
        that are supported by this package. It also contains a sample
        stoplist. There is a README file in the directory that describes the
        contents in more detail.

    /t  This directory contains test scripts. These scripts are run when you
        execute 'make test'.

    /web
        This directory contains the allwords web server and interface. There
        are detailed README and INSTALL instructions within this directory.
        Installing the web interface is optional, and is separate from
        installing the main package.

DESCRIPTION
    Words can have multiple meanings or senses. For example, the word
    *glass* in WordNet [1] has seven senses as a noun and five senses as a
    verb. Glass can mean a clear solid, a container for drinking, the
    quantity a drinking container will hold, etc. WSD is the process of
    selecting the correct sense of a word when that word occurs in a
    specific context. For example, in the sentence, "the window is made of
    glass", the correct sense of glass is the first sense, a clear solid.

    WordNet::SenseRelate::AllWords extends a word sense disambiguation
    algorithm described by Pedersen, Banerjee, and Patwardhan [2] by making
    it disambiguate all words in text. The previous version of the algorithm
    was intended for lexical sample data, which means that a single word in
    a context is designated as the target word and is the only word to be
    disambiguated. By contrast, WordNet::SenseRelate::AllWords will assign a
    sense to every word known to WordNet that appears in a context.

    Prior to execution of the algorithm, we remove any word that is not
    known to WordNet, and any word that appears in a stoplist. The input to
    the algorithm is presumed to be a single sentence where non-WordNet
    words and stoplisted words have been removed.
    WordNet::SenseRelate::AllWords does not cross sentence boundaries when
    carrying out disambiguation.

  Algorithm
      for each word w in sentence
        disambiguate-single-word (w)

      disambiguate-single-word (t)
        for each sense s_ti of target word t, where i=0..N
            let score_i = 0

            for each word w_j in context_window
                next if w_j = t

                for each sense s_jk of w_j
                    temp-score_k = relatedness (s_ti, s_jk)
                best-score = max temp-score
                if best-score > pairScore
                    score_i = score_i + best-score

        return s_ti s.t. score_i > score_m for all m != i in 0..N and score_i > contextScore
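
    The pseudocode above can be sketched in Perl as follows. This is a
    minimal illustration, not the module's implementation: the sense
    inventory, the relatedness values, and the function names are all
    made up for the example, and the context list is assumed to exclude
    the target word already.

```perl
use strict;
use warnings;

# Toy sense inventory and relatedness table (illustrative values only).
my %senses = (
    bank  => ['bank#n#1', 'bank#n#2'],
    money => ['money#n#1'],
    river => ['river#n#1'],
);
my %rel = (
    'bank#n#1|money#n#1' => 0.1,
    'bank#n#2|money#n#1' => 0.9,
    'bank#n#1|river#n#1' => 0.8,
    'bank#n#2|river#n#1' => 0.2,
);

# Hypothetical stand-in for a WordNet::Similarity measure.
sub relatedness {
    my ($s1, $s2) = @_;
    return $rel{"$s1|$s2"} // $rel{"$s2|$s1"} // 0;
}

# Score every sense of the target against the best-matching sense of
# each context word, then return the highest-scoring sense.
sub disambiguate_single_word {
    my ($target, $context, $pair_score, $context_score) = @_;
    my ($best_sense, $best_total) = (undef, $context_score);

    for my $sense (@{ $senses{$target} // [] }) {
        my $total = 0;
        for my $word (@$context) {
            my $best = 0;
            for my $other (@{ $senses{$word} // [] }) {
                my $r = relatedness($sense, $other);
                $best = $r if $r > $best;
            }
            # Only pairs that beat pairScore contribute to the total.
            $total += $best if $best > $pair_score;
        }
        ($best_sense, $best_total) = ($sense, $total)
            if $total > $best_total;
    }
    return $best_sense;    # undef when no sense beats contextScore
}

print disambiguate_single_word('bank', ['money'], 0, 0), "\n";
```

    With the toy values above, the context word 'money' selects
    bank#n#2, while the context word 'river' would select bank#n#1.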

  The Context Window
    The size of the context window can be specified by the user. A context
    window of size 3 means that the context window will consist of three
    words, including the target word. Thus, the three words would be the
    word to the left of the target word, the target word itself, and the
    word to the right of the target word. The algorithm will expand the
    context window so that the three words will be words known to WordNet
    (the algorithm is unable to disambiguate words unknown to WordNet). For
    example, if the word 'the' occurs in the context window to the left of
    the target word, then the window will be expanded by one word to the
    left.

    If the window size is an even number, then there will be one more word
    to the left of the target word than to the right. For example, if the
    window size is 4, there will be two words to the left of the target word
    and one word to the right.

    Note that the context window will only include words in the same
    sentence as the target word. If, for example, the target word is the
    first word in the sentence, then there will be no words to the left of
    the target word in the context window regardless of the specified
    window size.

    The minimum window size is 2 because a smaller window means that there
    are no context words in the window. When the window size is 2, there is
    no context to use for disambiguating the first word in a sentence. To
    assign a sense number to that first word, the first sense of the word is
    chosen (i.e., sense number 1). Sense number 1 is usually the most
    frequent sense of a word.
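
    The window rules described in this section can be sketched as a small
    Perl function. This is an illustration under stated assumptions, not
    the module's code: the function name is hypothetical, and every word
    in the sentence is assumed to be known to WordNet already, so the
    expansion step is left out.

```perl
use strict;
use warnings;

# Return the indices of the context words around $target for a window of
# $window words (target included). Assumes all words are known to
# WordNet, so no window expansion is performed.
sub context_indices {
    my ($n_words, $target, $window) = @_;
    die "window must be at least 2\n" if $window < 2;

    my $left  = int($window / 2);      # even sizes get one more on the left
    my $right = $window - 1 - $left;

    my $start = $target - $left;
    $start = 0 if $start < 0;                      # clip at sentence start
    my $end = $target + $right;
    $end = $n_words - 1 if $end > $n_words - 1;    # clip at sentence end

    return grep { $_ != $target } $start .. $end;
}

my @words = qw/red car be faster white car/;
my @ctx   = context_indices(scalar @words, 3, 4);
print join(' ', map { $words[$_] } @ctx), "\n";    # prints "car be white"
```

    For the first word of a sentence and a window size of 2, the function
    returns an empty list, which corresponds to the case where the first
    sense is chosen because no context is available.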

  Part of Speech Coercion
    Certain measures of semantic similarity only work on noun-noun or
    verb-verb pairs; therefore, the usefulness of these measures for WSD is
    somewhat limited. As a way of coping with this problem,
    WordNet::SenseRelate::AllWords provides an option to "coerce" words in
    the context window to be of the same part of speech as the target word.

    When POS coercion is in effect, if the target word is a noun, then
    WordNet::SenseRelate::AllWords will attempt to convert non-nouns in the
    context window to noun forms of the same word. For example, if the
    target word is a noun and the verb *love* occurs in the window, the
    module might convert that word to the noun *love*.

    WordNet::SenseRelate::AllWords first uses the validForms method from
    WordNet::QueryData to find any valid forms of the word being coerced
    that are of the desired part of speech. In the case of part of speech
    tagged text, the POS tags are discarded. If validForms did not return
    any forms of the desired part of speech, then the derived forms relation
    in WordNet is used to find possible forms of the word. If neither of
    these methods returned usable forms, then no further attempt is made to
    coerce the word to be the desired part of speech.
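
    The two-stage fallback described above can be sketched as follows.
    The lookups are passed in as code references so that the example runs
    without WordNet; the function name, the lookup tables, and their
    contents are hypothetical stand-ins for validForms and the derived
    forms relation, not the module's own code.

```perl
use strict;
use warnings;

# Try each lookup in order and return the first non-empty list of forms.
sub coerce {
    my ($word, $pos, @lookups) = @_;
    for my $lookup (@lookups) {
        my @forms = $lookup->($word, $pos);
        return @forms if @forms;
    }
    return;    # no usable form: the word is left uncoerced
}

# Toy data: "love" has a noun form directly, while "happy" reaches a
# noun only through a derived form.
my %valid   = ('love#n'  => ['love#n']);
my %derived = ('happy#n' => ['happiness#n']);

my $valid_forms   = sub { my ($w, $p) = @_; @{ $valid{"$w#$p"}   // [] } };
my $derived_forms = sub { my ($w, $p) = @_; @{ $derived{"$w#$p"} // [] } };

print join(' ', coerce('love',  'n', $valid_forms, $derived_forms)), "\n";
print join(' ', coerce('happy', 'n', $valid_forms, $derived_forms)), "\n";
```

    Ordering the lookups in a list keeps the fallback explicit: the
    second lookup is consulted only when the first returns nothing.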

  Tracing/Debugging
    Several different levels of trace output are available. The trace level
    can be specified as a command-line option to wsd.pl or as a parameter to
    the WordNet::SenseRelate::AllWords module.

   Trace Levels
    The trace levels are:

      1 Show the context window for each pass through the algorithm.

      2 Display winning score for each pass (i.e., for each target word).

      4 Display the non-zero scores for each sense of each target
        word (overrides 2).

      8 Display the non-zero values from the semantic relatedness measures.

     16 Show the zero values as well when combined with either 4 or 8.
        When not used with 4 or 8, this has no effect.

     32 Display traces from the semantic relatedness module.

    Different trace levels can be combined to achieve the desired behavior.
    For example, by specifying a trace level of 3, both level 1 and level 2
    traces are generated (i.e., the context window will be shown along with
    the winning score for each pass).
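
    Because each level is a distinct power of two, combining levels
    amounts to adding (or bitwise OR-ing) their values. The sketch below
    illustrates this; the symbolic names and the helper function are
    hypothetical labels for the documented numeric levels and are not
    part of the module.

```perl
use strict;
use warnings;

# Hypothetical names for the documented numeric trace levels.
my %TRACE = (
    window      => 1,     # context window for each pass
    winner      => 2,     # winning score for each pass
    all_senses  => 4,     # non-zero scores for each sense
    measure     => 8,     # non-zero relatedness values
    zeros       => 16,    # include zero values (with 4 or 8)
    measure_mod => 32,    # traces from the relatedness module
);

# True if the given flag is set in the combined trace level.
sub trace_enabled {
    my ($level, $flag) = @_;
    return ($level & $TRACE{$flag}) ? 1 : 0;
}

my $level = $TRACE{window} | $TRACE{winner};    # 3
print "trace level: $level\n";
print "context window shown\n" if trace_enabled($level, 'window');
print "measure values shown\n" if trace_enabled($level, 'measure');
```

    The resulting number is what would be passed as the trace level, for
    example 'wsd.pl --trace 3'.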

  Using wsd.pl
    The wsd.pl script provides an easy method of performing disambiguation
    from the command line. The text to be disambiguated is read from a file
    provided by the user on the command line.

   Output
    The output of wsd.pl is simply the disambiguated words. The output will
    be in the form word#part_of_speech#sense_number. The part of speech will
    be one of 'n' for noun, 'v' for verb, 'a' for adjective, or 'r' for
    adverb. Words from other parts of speech are not found in WordNet and so
    are not disambiguated. The sense number will be a WordNet sense number.
    WordNet sense numbers are assigned by frequency, so sense 1 of a word is
    more common than sense 2, etc.

    Sometimes when a word is disambiguated, a "different" but synonymous
    word will be found in the output. This is not a bug but is a consequence
    of how WordNet works. The word sense returned will always be the first
    word sense in a synset (synonym set) to which the original word belongs.
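
    A program that consumes this output can split each token on the '#'
    character. The following sketch shows one way to do so; the function
    name and the sample tokens are made up for the illustration and are
    not actual wsd.pl output.

```perl
use strict;
use warnings;

my %pos_name = (n => 'noun', v => 'verb', a => 'adjective', r => 'adverb');

# Split one word#part_of_speech#sense_number token into its fields.
# Returns nothing for tokens that do not have that shape.
sub parse_tagged_sense {
    my ($token) = @_;
    my ($word, $pos, $sense) = $token =~ /^(.+)#([nvar])#(\d+)$/
        or return;
    return { word => $word, pos => $pos_name{$pos}, sense => $sense };
}

for my $token (split ' ', 'window#n#1 glass#n#1 the') {
    my $parsed = parse_tagged_sense($token);
    if ($parsed) {
        print "$parsed->{word}: $parsed->{pos}, sense $parsed->{sense}\n";
    }
    else {
        print "$token: no sense assigned\n";
    }
}
```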

   Usage
    wsd.pl --context FILE --format FORMAT [--scheme SCHEME] [--type MEASURE]
    [--config FILE] [--compounds FILE] [--stoplist FILE] [--window INT]
    [--contextScore NUM] [--pairScore NUM] [--outfile FILE] [--trace INT]
    [--silent] | --help | --version

    The format option specifies one of the four different formats supported
    by wsd.pl. The four formats are:

    raw Raw text that is not part of speech tagged and needs to undergo
        sentence boundary detection. Example:

           Red cars are faster than white cars.  However, white cars
           are less expensive.

    parsed
        Parsed text is untagged text that has had all unwanted punctuation
        removed and has exactly one sentence per line. Example:

         Red cars are faster than white cars
         However white cars are less expensive

    tagged
        Tagged text is part of speech tagged text that has no unwanted
        punctuation and has exactly one sentence per line. Example:

         Red/JJ cars/NNS are/VBP faster/RBR than/IN white/JJ cars/NNS
         However/RB white/JJ cars/NNS are/VBP less/RBR expensive/JJ 

    wntagged
        Similar to tagged, except that the input should only contain words
        known to WordNet, and each word should have a letter indicating the
        part of speech ('n', 'v', 'a', or 'r' for nouns, verbs, adjectives,
        and adverbs). For example:

         red#a car#n be#v faster#r white#a car#n
         white#a car#n be#v less#r expensive#a

        Additionally, no attempt will be made to search for other valid
        forms of the words in the input. For example, if 'dogs#n' is in the
        input, the program will not attempt to use other forms such as
        'dog#n'.
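
    To illustrate how the tags in the tagged format relate to the four
    part of speech letters used in the wntagged format, the sketch below
    maps Penn Treebank style tags to WordNet letters by prefix. The
    mapping and the function name are assumptions made for this example;
    they are not taken from wsd.pl. The sketch also does not reduce words
    to their base forms, which real wntagged input requires.

```perl
use strict;
use warnings;

# Map a Penn Treebank tag to a WordNet part of speech letter by prefix.
# Returns undef for tags with no WordNet counterpart (e.g. IN, DT).
sub wordnet_pos {
    my ($tag) = @_;
    return 'n' if $tag =~ /^NN/;
    return 'v' if $tag =~ /^VB/;
    return 'a' if $tag =~ /^JJ/;
    return 'r' if $tag =~ /^RB/;
    return;
}

my $line = 'Red/JJ cars/NNS are/VBP faster/RBR than/IN white/JJ cars/NNS';
for my $token (split ' ', $line) {
    my ($word, $tag) = $token =~ m{^(.+)/([^/]+)$} or next;
    my $pos = wordnet_pos($tag);
    printf "%-8s %-4s %s\n", $word, $tag, defined $pos ? $pos : '-';
}
```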

    The different options and parameters for wsd.pl are discussed in detail
    in the documentation for wsd.pl. Run 'perldoc wsd.pl' to view the
    documentation.

   Usage Examples
    1.  wsd.pl --context input.txt --format raw

    2.  wsd.pl --trace 3 --context input.txt --format raw

    3.  wsd.pl --trace 3 --context input.txt --window 4 --format raw

  Using the Disambiguation Module
    The WordNet::SenseRelate::AllWords Perl module can be used in other Perl
    programs to perform word sense disambiguation.

   Example
      use WordNet::SenseRelate::AllWords;
      use WordNet::QueryData;
      my $qd = WordNet::QueryData->new;
      my $wsd = WordNet::SenseRelate::AllWords->new (wordnet => $qd,
                                           measure => 'WordNet::Similarity::lesk');
      my @words = qw/this is a test/;
      my @results = $wsd->disambiguate (context => [@words]);
      print join (' ', @results), "\n";

    The context parameter to disambiguate() specifies a set of words to
    disambiguate. The function treats the context as one sentence. To
    disambiguate multiple sentences, make a call to disambiguate() for each
    sentence.
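
    A small helper along the following lines makes one call to
    disambiguate() per sentence. The helper name is hypothetical; it
    accepts any object that provides a disambiguate() method taking a
    context parameter, such as a WordNet::SenseRelate::AllWords instance
    created as in the example above.

```perl
use strict;
use warnings;

# Disambiguate each sentence separately so that no context window
# crosses a sentence boundary. Returns one array reference of results
# per non-empty sentence.
sub disambiguate_sentences {
    my ($wsd, @sentences) = @_;
    my @results;
    for my $sentence (@sentences) {
        my @words = split ' ', $sentence;
        next unless @words;    # skip empty lines
        push @results, [ $wsd->disambiguate(context => [@words]) ];
    }
    return @results;
}
```

    Given a $wsd object, a call such as disambiguate_sentences($wsd,
    'red cars are fast', 'white cars are cheap') would return two array
    references, one for each sentence.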

    The usage of the disambiguation module is discussed in detail in the
    documentation for the module. Run 'perldoc
    WordNet::SenseRelate::AllWords' or 'man WordNet::SenseRelate::AllWords'
    (after installing the module) to view the documentation. To view the
    documentation before installing the module, run 'perldoc
    lib/WordNet/SenseRelate/AllWords.pm'.

SEE ALSO
    WordNet::SenseRelate::AllWords(3) wsd.pl(1)

    The main web page for SenseRelate is

    <http://senserelate.sourceforge.net/>

    There are several mailing lists for SenseRelate:

    <http://lists.sourceforge.net/lists/listinfo/senserelate-users/>

    <http://lists.sourceforge.net/lists/listinfo/senserelate-news/>

    <http://lists.sourceforge.net/lists/listinfo/senserelate-developers/>

AUTHORS
    Ted Pedersen <tpederse at d.umn.edu>

    Varada Kolhatkar <kolha002 at d.umn.edu>

    Jason Michelizzi <jmichelizzi at users.sourceforge.net>

COPYRIGHT AND LICENSE
    Copyright (C) 2004-2005 by Jason Michelizzi and Ted Pedersen

    Copyright (C) 2005-2008 by Varada Kolhatkar and Ted Pedersen

    This program is free software; you can redistribute it and/or modify it
    under the terms of the GNU General Public License as published by the
    Free Software Foundation; either version 2 of the License, or (at your
    option) any later version.

    This program is distributed in the hope that it will be useful, but
    WITHOUT ANY WARRANTY; without even the implied warranty of
    MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General
    Public License for more details.

REFERENCES
    1.  Christiane Fellbaum. 1998. WordNet: an Electronic Lexical Database.
        MIT Press.

    2.  Ted Pedersen, Satanjeev Banerjee, and Siddharth Patwardhan. 2005.
        Maximizing Semantic Relatedness to Perform Word Sense
        Disambiguation. University of Minnesota Supercomputing Institute
        Research Report UMSI 2005/25, March.
        <http://www.msi.umn.edu/general/Reports/rptfiles/2005-25.pdf>

