NAME
    AI::Categorizer - Automatic Text Categorization

SYNOPSIS
     use AI::Categorizer;
     my $c = new AI::Categorizer(...parameters...);
 
     # Run a complete experiment - training on a corpus, testing on a test
     # set, printing a summary of results to STDOUT
     $c->run_experiment;
 
     # Or, run the parts of $c->run_experiment separately
     $c->scan_features;
     $c->read_training_set;
     $c->train;
     $c->evaluate_test_set;
     print $c->stats_table;
 
     # After training, use the Learner for categorization
     my $l = $c->learner;
     while (...) {
       my $d = ...create a document...
       my $hypothesis = $l->categorize($d);  # An AI::Categorizer::Hypothesis object
       print "Assigned categories: ", join ', ', $hypothesis->categories, "\n";
       print "Best category: ", $hypothesis->best_category, "\n";
     }
 
DESCRIPTION
    "AI::Categorizer" is a framework for automatic text categorization. It
    consists of a collection of Perl modules that implement common
    categorization tasks, and a set of defined relationships among those
    modules. The various details are flexible - for example, you can choose what
    categorization algorithm to use, what features (words or otherwise) of the
    documents should be used (or how to automatically choose these features),
    what format the documents are in, and so on.

    The basic process of using this module will typically involve obtaining a
    collection of pre-categorized documents, creating a knowledge set
    representation of those documents, training a categorizer on that knowledge
    set, and saving the trained categorizer for later use. There are several
    ways to carry out this process. The top-level "AI::Categorizer" module
    provides an umbrella class for high-level operations, or you may use the
    interfaces of the individual classes in the framework.

    Disclaimer: the results of any of the machine learning algorithms are far
    from infallible (close to fallible?). Categorization of documents is often a
    difficult task even for humans well-trained in the particular domain of
    knowledge, and there are many things a human would consider that none of
    these algorithms consider. These are only statistical tests - at best they
    are neat tricks or helpful assistants, and at worst they are totally
    unreliable. If you plan to use this module for anything important, human
    supervision is essential, both of the categorization process and the final
    results.

    For the usage details, please see the documentation of each individual
    module.

FRAMEWORK COMPONENTS
    This section explains the major pieces of the "AI::Categorizer" object
    framework. This section gives a conceptual overview, but does not get into
    any of the details about interfaces or usage. See the documentation for the
    individual classes for more details.

    A diagram of the various classes in the framework can be seen in
    "doc/classes.png".

  Knowledge Sets

    A "knowledge set" is defined as a collection of documents, stored in a
    particular format, together with some information on the categories each
    document belongs to. Note that this term is somewhat unique to this project
    - other sources may call it a "training corpus", or "prior knowledge". A
    knowledge set also contains some information on how documents will be parsed
    and how their features (words) will be extracted and culled. In this sense,
    a knowledge set represents not only a collection of data, but a particular
    view on that data.

    A knowledge set is encapsulated by the "AI::Categorizer::KnowledgeSet"
    class. Before you can start playing with categorizers, you will have to
    start playing with knowledge sets, so that the categorizers have some data
    to train on. See the documentation for the "AI::Categorizer::KnowledgeSet"
    module for information on its interface.

   Feature selection

    Deciding which features are the most important is a very large part of the
    categorization task - you cannot simply consider all the words in all the
    documents when training, and all the words in the document being
    categorized. There are two main reasons for this - first, it would mean that
    your training and categorizing processes would take forever and use tons of
    memory, and second, the significant bits of the documents would get lost in
    the "noise" of the insignificant bits.

    The process of selecting the most important features in the training set is
    called "feature selection". It is managed by the
    "AI::Categorizer::KnowledgeSet" class, and you will find the details of
    feature selection processes in that class's documentation.

  Collections

    Because documents may be stored in lots of different formats, a *Collection*
    class has been created as an abstraction of a stored set of documents,
    together with a way to iterate through the set and return Document objects.
    A "KnowledgeSet" contains a single collection object. A "Categorizer"
    generally contains two collections, one for training and one for testing. A
    "Learner" can mass-categorize a collection.

  Categorization Algorithms

    Each categorization algorithm is a subclass of "AI::Categorizer::Learner".
    Currently the framework only includes one categorizer in its default
    distribution, "AI::Categorizer::Learner::NaiveBayes".

    There will soon be a Neural Network categorizer. Next on the agenda will/may
    be a k-Nearest-Neighbor algorithm, a decision tree algorithm, a
    mixture-of-experts combiner, and/or a general interface to the "Weka"
    machine learning system. No timetable for their creation has yet been set.

    Please see the documentation of these individual modules for more details on
    their guts and quirks. See the "AI::Categorizer::Learner" documentation for
    a description of the general categorizer interface.

  Feature Vectors

    Most categorization algorithms don't deal directly with a document's data,
    they instead deal with a *vector representation* of a document's *features*.
    The features may be any properties of the document that seem indicative of
    its category, but they are usually some version of the "most important"
    words in the document. A list of features and their weights in each document
    is encapsulated by the "AI::Categorizer::FeatureVector" class. You may think
    of this class as roughly analogous to a Perl hash, where the keys are the
    names of features and the values are their weights.

  Hypotheses

    The result of asking a categorizer to categorize a previously unseen
    document is called a hypothesis, because it is some kind of "statistical
    guess" of what categories this document should be assigned to. Since you may
    be interested in any of several pieces of information about the hypothesis
    (for instance, which categories were assigned, which category was the single
    most likely category, the scores assigned to each category, etc.), the
    hypothesis is returned as an object of the "AI::Categorizer::Hypothesis"
    class, and you can use its object methods to get information about the
    hypothesis. See its class documentation for the details.

  Experiments

    The "AI::Categorizer::Experiment" class helps you organize the results of
    categorization experiments. As you get lots of categorization results
    (Hypotheses) back from the Learner, you can feed these results to the
    Experiment class, along with the correct answers. When all results have been
    collected, you can get a report on accuracy, precision, recall, F1, and so
    on, with both micro-averaging and macro-averaging over categories. See the
    docs for "AI::Categorizer::Experiment" for more details.

METHODS
    new()
        Creates a new Categorizer object and returns it. Accepts lots of
        parameters controlling behavior. In addition to the parameters listed
        here, you may pass any parameter accepted by any class that we create
        internally (the KnowledgeSet, Learner, Experiment, or Collection
        classes). This is managed by the "Class::Container" module, so see its
        documentation for the details of how this works.

        The specific parameters accepted here are:

        progress_file
            A string that indicates a place where objects will be saved during
            several of the methods of this class. The default value is the
            string "save", which means files like "save-01-knowledge_set" will
            get created. The exact names of these files may change in future
            releases, since they're just used internally to resume where we last
            left off.

        verbose
            If true, a few status messages will be printed during execution.

        data_root
            A shortcut for setting the "training_set", "test_set", and
            "category_file" parameters separately. Sets "training_set" to
            "$data_root/training", "test_set" to "$data_root/test", and
            "category_file" (used by some of the Collection classes) to
            "$data_root/cats.txt".

        training_set
            Specifies the "path" parameter that will be fed to the
            KnowledgeSet's "scan_features()" and "read()" methods during our
            "scan_features()" and "read_training_set()" methods.

        test_set
            Specifies the "path" parameter that will be used when creating a
            Collection during the "evaluate_test_set()" method.

        stopword_file
            Specifies a file containing a list of "stopwords", which are words
            that should automatically be disregarded when scanning/reading
            documents. The file should contain one word per line. The file will
            be parsed and then fed as the "stopwords" parameter to the
            KnowledgeSet "new()" method.

    learner()
        Returns the Learner object associated with this Categorizer. If
        "learner()" is called before "train()", the Learner will of course not
        be trained yet.

    knowledge_set()
        Returns the KnowledgeSet object associated with this Categorizer. If
        "read_training_set()" has not yet been called, the KnowledgeSet will not
        yet be populated with any training data.

    run_experiment()
        Runs a complete experiment on the training and testing data, reporting
        the results on "STDOUT". Internally, this is just a shortcut for calling
        the "scan_features()", "read_training_set()", "train()", and
        "evaluate_test_set()" methods, then printing the value of the
        "stats_table()" method.

    scan_features()
        Scans the Collection specified in the "test_set" parameter to determine
        the set of features (words) that will be considered when training the
        Learner. Internally, this calls the "scan_features()" method of the
        KnowledgeSet, then saves the KnowledgeSet for later use.

        This step is not strictly necessary, but it can dramatically reduce
        memory requirements if you scan for features before reading the entire
        corpus into memory.

    read_training_set()
        Populates the KnowledgeSet with the data specified in the "test_set"
        parameter. Internally, this call the "read()" method of the
        KnowledgeSet. Returns the KnowledgeSet. Also saves the KnowledgeSet
        object for later use.

    train()
        Calls the Learner's "train()" method, passing it the KnowledgeSet
        populated during "read_training_set()". Returns the Learner object. Also
        save the Learner object for later use.

    evaluate_test_set()
        Creates a Collection based on the value of the "test_set" parameter, and
        calls the Learner's "categorize_collection()" method using this
        Collection. Returns the resultant Experiment object. Also saves the
        Experiment object for later use in the "stats_table()" method.

    stats_table()
        Returns the value of the Experiment's (as created by
        "evaluate_test_set()") "stats_table()" method. This is a string that
        shows various statistics about the accuracy/precision/recall/F1/etc. of
        the assignments made during testing.

HISTORY
    This module is a revised and redesigned version of the previous
    "AI::Categorize" module by the same author. Note the added 'r' in the new
    name. The older module has a different interface, and no attempt at backward
    compatibility has been made - that's why I changed the name.

    You can have both "AI::Categorize" and "AI::Categorizer" installed at the
    same time on the same machine, if you want. They don't know about each other
    or use conflicting namespaces.

AUTHOR
    Ken Williams <kenw@ee.usyd.edu.au>

REFERENCES
    http://www.d.umn.edu/~tpederse/nsp.html (could be used later for feature
    selection)

COPYRIGHT
    This distribution is free software; you can redistribute it and/or modify it
    under the same terms as Perl itself. These terms apply to every file in the
    distribution - if you have questions, please contact the author.

