NAME
    TODO - List of things TODO for SenseClusters

SYNOPSIS
    Plans for future versions of SenseClusters

DESCRIPTION
  Version 1.03 Todo
    * Consider the use of SVDLIBC rather than SVDPACKC for SVD, however,
    have been having problems compiling SVDLIBC on 64 bit platforms.
    * Testing of SVDPACK remains problematic, since results can vary from
    platform to platform. At present we are simply testing to see if output
    is created.
    * Introduction of CPAN style 'make test' option. Can be coded even for
    command line interfaces, but can be a little messy, especially when a
    program is not producing a single value but rather tables of values or
    formatted text (which is often what we do). Must also do this in such a
    way that it handles different file system (via File::Spec probably).
    * Improve order 1 efficiency. Rather than matching each context against
    every feature (regular expression), match each possible feature in
    context with the features. The simple approach for invoking count.pl for
    each context by creating a temporary file for each context suffers from
    file and process creation overhead. Need to investigate NSP APIs and use
    them so that features can be identified from contexts in memory instead
    of having to create temporary files.
    * Provide simple tools that allow a user to visualize more complicated
    data structures like the 2nd order context vectors. For example, right
    now a user can see the word vectors associated that will be averaged
    together (in the .wordvec file), but they are purely numeric. It would
    be nice to see what words are associated with these values.
    * Make discriminate.pl more modular in its organization, possibly
    through the use of subroutines. Reduce reliance on system calls which
    can lead to portability problems.
    * Fix --showargs option on discriminate.pl. Has not been working for
    many versions now.
    * Check to see if checking return values from vcluster and scluster in
    discriminate.pl is really accomplishing anything. Do they return error
    codes and success codes reliably? Any chance for false positive or false
    negative?
    * Replace default stoplist with a program that automatically generates
    stoplists from a given corpus.
    * Add version information to SenseClusters web interface, including
    version of SenseClusters and modules it is using.
    * There are various small utility programs whose return codes are
    checked by discriminate. These programs seem to always return via exit,
    suggesting that their return codes are always for success. May want to
    return a different value for failures so the discriminate.pl checks are
    meaningful.
    * Reduce the number of regular expressions used in the regex file
    provided to order2vec.pl for feature identification during LSA style
    context clustering. This is required if we adhere to the nsp2regex.pl
    approach for feature identification. Right now regexes are generated
    from training data based on all the features found in the training data.
    These are given as input to order1vec.pl to identify features from test
    data. The number of features identified form test data can be less than
    the number of regexes created from training data (i.e. some regexes may
    not match anything in the test data). Currently this same regex file is
    given as input to order2vec.pl in LSA context clustering mode. So we
    also need to additionally provide a feature file to order2vec.pl
    specifying what features were actually found in test data
    ($PREFIX.features_in_testdata file created by discriminate.pl). If we
    create a regex file corresponding to just the regexes that matched at
    least once in the test data, then just this new regex file can be
    provided as input to order2vec.pl --featregex option. This change needs
    to be done in order1vec.pl (just as currently it prints clabels
    selectively for only those features that were found in the test data, it
    can create a new regex file containing just the regexes that matched at
    least once in the test data). An additional FEATURE file will then no
    more be required by order2vec.pl in LSA context clustering mode (the
    FEATURE file will still be required in SC native word or context
    clustering).
    * Wherever possible and appropriate, add the error checks from
    discriminate.pl to the actual programs that require that error check.
    For example: Check for 0 zero features in order1vec.pl, wordvec.pl and
    order2vec.pl
    * Check why the criterion function values as different across platform
    (Linux vs. Solaris). Currently the test cases for clusterstopping.pl
    used platform dependent checking - can it be made platform independent?
    * The idea of global training data is to have one large file of plain
    text that is used as a source of training data for multiple target word
    discrimination problems. maketarget.pl used to produce regexes of the
    form /(line)/, presumably to be used to identify target words in plain
    corpora (where no head tags have been inserted). Does SenseClusters
    support the identification of target word features under these
    circumstances (where there is no head tag)? If so, we should adjust
    maketarget.pl so that it continues to produce target regexes of this
    form. As of 0.95 it only produces them with the head tag. Right now
    discriminate.pl insists that training data have a head tag in it to find
    tco. It seems like you should still be able to try and find tco features
    if you have specified a plain text regex such as we have above.
    * The gap statistic generates expected values for randomly created
    matrices that have the same marginal totals as the observed data. In
    some cases the expected values (for a criterion function) are actually
    greater than the observed, which suggests that the random data is in
    fact benefiting more from the clustering than the observed data. It
    isn't clear why the expected values aren't always less than the
    observed, since random data when clustered should not really get better
    and better criterion function scores as the number of clusters
    increases, or at least these should not be greater or better than the
    observed values.

  Version 0.98 and 1.00
    Reorganize package, move to CPAN, make installation easier via a Bundle,
    general clean up releases. (good start made)

  Version 0.95 ToDo
    1. Introduce LSA support for context discrimination. (Done)

  Version 0.93 ToDo
    1. Introducing feature_by_context/LSA support for word clustering (done)

  Version 0.83 ToDo
    1. Integrating Cluster Stopping (done)

  Version 0.67 ToDo
    1. Integrating gcluto (not done, on hold)

  Version 0.65 ToDo
    1. Taint flag (done)
        Implement the taint mode.

  Version 0.63 ToDo
    * Web Interface (done)

        1. Time outs (done)
            Currently if the given data or combinations of options leads to
            longer than usual processing time then the web-interface just
            hangs and does not give back any links to the results even if
            the process has finished. Investigate if the problem is with
            request time-out and find the solution for this problem.

        2. Scope Options (done)
            The current version of web-interface does not support the
            scope_train and scope_test options. Implement the same.

        3. Format option (done)
            Also the format option is not available. Implement the same.

        4. Taint flag (done)
            Understand and implement the taint mode.

  Version 0.61 Todo
    * Labeling Clusters (done)
        Discovered clusters will be labeled with their most discriminating
        features or with the actual dictionary glosses. This will indicate
        which clusters represent which word senses.

    * svdpackout (continuing issue)
        There remain problems with svdpackout test cases A1g, A1h, and A2 on
        Solaris. These are due to precision issues with 64 bit CPUs, and
        related issues, I think.

    * SVD on Order1 Type vectors (done)
        Omit null columns from the order1 context vectors as created by
        order1vec. This is causing a problem for mat2harbo.pl. Hence, SVD
        fails on order1 type vectors at present.

    * Order1 Efficiency (continuing issue)
        Order1 in its current form is way too slow even for few thousand
        contexts and features. It needs to be improved speed wise.

    * Warnings in Demo (done)
        Demo script makedata.sh shows Malformed UTF-8 warnings. Senseval-2
        data needs to be cleaned or programs preprocess.pl and
        prepare_sval2.pl need to be modified to avoid these encoding
        warnings.

    * Multi-Lexelt (on hold)
        preprocess.pl (and setup.pl that uses preprocess.pl) currently
        splits given DATA on lexelts. Allow multiple lexelt functionality by
        modifying preprocess.pl or using some other techniques ...

        Amruta has some mixml scripts that can be made distributable to
        handle this multi-lexelt issue. These scripts concatenate two xml
        files and create a single xml file. These are useful to combine the
        split lexelt results from setup/preprocess into a single
        multi-lexelt file and also for supporting experiments on
        multiple-lexelts as done in CONLL 2004.

    * Installation (on hold)
        Update Makefile to check if SVDPACK, CLUTO are installed. If not,
        warn users that options like --svd or cluto can't be used.
        Auto-download and install CPAN modules like PDL, Bit::Vector.

        One idea would be to bundle and redistribute all the required
        modules and programs from other packages into SenseClusters' tarball
        and install them like regular SenseClusters' code.

    * FAQs (ongoing)
        Include questions that would be interesting to general user
        community.

  For later versions
    * Fuzzy Feature matching (still a good idea)
        Perl supports fuzzy pattern matching ... might be useful in our
        feature regex matching !

    *   Technical report for SenseClusters - once the sparse matrix support
        is available. (great idea)

AUTHORS
     Ted Pedersen, University of Minnesota, Duluth
     tpederse at d.umn.edu

     Amruta Purandare 
     University of Pittsburgh

     Anagha Kulkarni 
     Carnegie-Mellon University

COPYRIGHT
    Copyright (C) 2005-2008 Ted Pedersen, Amruta Purandare, Anagha Kulkarni

    Permission is granted to copy, distribute and/or modify this document
    under the terms of the GNU Free Documentation License, Version 1.2 or
    any later version published by the Free Software Foundation; with no
    Invariant Sections, no Front-Cover Texts, and no Back-Cover Texts.

    Note: a copy of the GNU Free Documentation License is available on the
    web at <http://www.gnu.org/copyleft/fdl.html> and is included in this
    distribution as FDL.txt.

