Lingua::Align - a toolbox for Tree Alignment
    Lingua::Align is a collection of command-line tools for automatic tree
    and word alignment of parallel corpora. Its main purpose is to provide
    a toolbox for experimenting with various feature sets and alignment
    strategies. Alignment is based on local classification and
    alignment inference. The local classifier is typically trained on
    available aligned training data. We use a log-linear model for
    discriminative binary classification using the maximum entropy learning
    package megam (Hal Daume III).

  Download
    Lingua::Align is available from here:
    <https://bitbucket.org/tiedemann/lingua-align>

  Installation
    You can either install the perl modules and binaries as usual:

       perl Makefile.PL
       make
       make install

    Or you can simply run the treealign script (and the other tools) in the
    "bin/" directory without changing anything. The only requirement is a
    recent version of Perl and "XML::Parser" installed on your system (the
    Perl wrapper for the Expat XML parser).

    The Tree Aligner calls an external tool (megam) which is provided as a
    pre-compiled binary in the "bin/" directory. The default version is an
    i686 binary for Linux-based systems. The package also includes a binary
    for Intel-based Mac OS X. If you want to use this version, please change
    the link "bin/megam" to point to "bin/megam.osx". For all other
    platforms please download the source from
    <http://www.cs.utah.edu/~hal/megam/> and compile it on your platform.
    Make sure that the binary works and link it to "bin/megam".

    For some features you will need word alignment information. To produce
    these features you need to run tools such as Giza++
    <http://code.google.com/p/giza-pp/> and Moses
    <http://statmt.org/moses/>.

  Quickstart Tutorial
    The easiest way to use the Tree Aligner is to run the frontend script
    treealign in the bin directory. There are many options and command-line
    arguments that can be used to adjust the behaviour of the alignment
    tools. Have a look at Lingua::treealign for more information.

   Run tests on existing data sets
    For a simple test: go to the directory "europarl" and run "make test".

      cd europarl
      make test

    This will run a simple test with only a few training sentences from the
    Europarl corpus and simple features for classification. The test
    consists of two calls to tree aligner scripts: "treealign" is used to
    train a classifier and to align unseen sentences from the given data
    set. "treealigneval" is used to compute scores of the alignment
    performed with that model and the alignment strategy that is chosen.

    The example training data is stored in "europarl/nl-en-weak_125.xml"
    which has been produced by manual alignment (thanks to Gideon Kotzé)
    using the Stockholm Tree Aligner
    <http://kitt.cl.uzh.ch/kitt/treealigner>. The format looks like this:

     <?xml version="1.0" encoding="UTF-8"?>
     <treealign subversion="3" version="2">
     <head>
     ...
       <treebanks>
         <treebank id="en" filename="ep-00-12-15.125.en.tiger"/>
         <treebank id="nl" filename="ep-00-12-15.125.nl.tiger"/>
       </treebanks>
     ...
     </head>
     <alignments>
       <align type="good" last_change="2010-03-29" author="Gideon">
         <node treebank_id="en" node_id="s5_501"/>
         <node treebank_id="nl" node_id="s10_0"/>
       </align>
       <align type="good" last_change="2010-03-29" author="Gideon">
         <node treebank_id="en" node_id="s5_502"/>
         <node treebank_id="nl" node_id="s10_1"/>
       </align>
     ...

    The actual treebank data is stored in TigerXML (in this case) and links
    are pointers to these documents using the unique node IDs. This should
    be quite straightforward (looking at the example above). Other formats
    are also supported, for example, Penn Treebank format and AlpinoXML for
    storing treebanks. There is also support for other alignment formats
    like the tree alignment format used by the Dublin Subtree Aligner and
    word alignment formats used by Giza++, Moses and shared tasks on word
    alignment (WPT2003).
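
    If you want to inspect such an alignment file outside of Lingua::Align,
    the links can be read with any XML library. The following is only a
    minimal Python sketch (not part of the package); the function name
    read_links and the shortened sample document are made up for
    illustration.

```python
import xml.etree.ElementTree as ET

# Shortened sample in the alignment format shown above.
SAMPLE = """<?xml version="1.0" encoding="UTF-8"?>
<treealign subversion="3" version="2">
<head>
  <treebanks>
    <treebank id="en" filename="ep-00-12-15.125.en.tiger"/>
    <treebank id="nl" filename="ep-00-12-15.125.nl.tiger"/>
  </treebanks>
</head>
<alignments>
  <align type="good" last_change="2010-03-29" author="Gideon">
    <node treebank_id="en" node_id="s5_501"/>
    <node treebank_id="nl" node_id="s10_0"/>
  </align>
  <align type="good" last_change="2010-03-29" author="Gideon">
    <node treebank_id="en" node_id="s5_502"/>
    <node treebank_id="nl" node_id="s10_1"/>
  </align>
</alignments>
</treealign>
"""


def read_links(xml_bytes):
    """Return ({treebank id: filename}, [(link type, {treebank id: node id})])."""
    root = ET.fromstring(xml_bytes)
    treebanks = {tb.get("id"): tb.get("filename") for tb in root.iter("treebank")}
    links = []
    for align in root.iter("align"):
        # Each <align> element holds one <node> per treebank.
        nodes = {n.get("treebank_id"): n.get("node_id") for n in align.findall("node")}
        links.append((align.get("type"), nodes))
    return treebanks, links


treebanks, links = read_links(SAMPLE.encode("utf-8"))
for link_type, nodes in links:
    print(link_type, nodes["en"], nodes["nl"])
```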

   Run with your own settings
    Basically you can call the tree-aligner frontend with your own data and
    settings like this:

      treealign -a <ALIGNFILE> -f <FEATURES> -n <NR_TRAIN_SENT> -e <NR_TEST_SENT>

    The alignment file ALIGNFILE has to contain the tree alignments that
    will be used for training the classifier. The default format is the one
    explained above (similar to the one used by the Stockholm Tree Aligner).
    FEATURES is a string specifying the features to be used in
    classification. NR_TRAIN_SENT is the number of sentences to be used for
    training and NR_TEST_SENT is the number of test sentences. There are
    many more options that can be set on the command line. Please look at
    Lingua::treealign for more information.

    Of course it is also possible to align treebanks using an existing
    alignment model. All you need are the treebank files in both
    languages, which have to be sentence aligned. Assuming that the alignment
    model is stored in the default file 'treealign.megam' and the two
    treebanks ("ep-00-12-15.125.en.penn", "ep-00-12-15.125.nl.penn" from the
    sample files in "europarl/") are stored in bracketed Penn Treebank
    format you can call the aligner like this:

      treealign -s ep-00-12-15.125.en.penn -S penn \
                -t ep-00-12-15.125.nl.penn -T penn \
                -m treealign.megam > alignments

    This will assume that trees from both treebanks are aligned with each
    other in the same order as they appear in the given files (corresponding
    lines in this case). Features to be used for classification have to be
    stored in "treealign.megam.feat" (they should be if "treealign.megam"
    has been produced by Lingua-Align). Tree alignments will be stored in
    "alignments" in STA format.

    Here are some more details about the things you need for running your
    own experiments:

   Training data
    You need your own tree-aligned training data. The easiest way to create
    it is to use the Stockholm Tree Aligner, whose output format can directly be
    used by Lingua::Align. You need at least 100 pairs of parse trees in
    order to obtain reasonable results. More is better of course. The corpus
    has to be parsed on both sides. You need to use TigerXML for the
    Stockholm Tree Aligner and this is also most convenient for the tree
    aligner later on (also for visualizing automatic alignment).

    There is a tool to convert treebanks using the formats supported by
    Lingua::Align: "bin/convert_treebank". For example, if your parse trees
    are stored in Penn Treebank format ("treebank") you might try to use the
    following command:

      convert_treebank treebank penn tiger > treebank.tiger

    Hopefully this will work to create a corpus that can be loaded into the
    Stockholm Tree Aligner. You might have to check the specifications in
    the XML header and adjust some (meta) information. You can, for example,
    validate your Tiger-XML against the schema:

      xmllint --schema http://www.cl.uzh.ch/kitt/treealigner/data/schema/TigerXML.xsd --noout <your-tiger-file>

    You can also use another format, similar to the one used by the Dublin
    Tree Aligner, which uses bracketed tree structures and expresses links
    as references to the nodes in these trees. Here is an example (from
    "europarl/nl-en_125.dublin"):

     (ROOT-1 (S-2 (VP-3 (VBP-4 Are)(RB-5 there)(NP-6 (DT-7 any)(NNS-8 comments)))(.-9 ?)))
     (top-1 (np-2 (det-3 Geen)(noun-4 bezwaren)(punct-5 ?)))
     1 1 6 2 7 3 8 4 9 5

    The first row is the source language tree, the second one is the target
    language tree and the third one contains the links between source and
    target nodes. This format is not entirely compatible with the Dublin
    Tree Aligner format as it does not support conflated unary productions.
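
    To make the three-line layout concrete, here is a small Python sketch
    (not part of Lingua::Align) that reads the example above. It assumes
    that the link row alternates source and target node numbers and that
    every node is written as "(LABEL-NUMBER"; the helper names are
    hypothetical.

```python
import re

SAMPLE = """(ROOT-1 (S-2 (VP-3 (VBP-4 Are)(RB-5 there)(NP-6 (DT-7 any)(NNS-8 comments)))(.-9 ?)))
(top-1 (np-2 (det-3 Geen)(noun-4 bezwaren)(punct-5 ?)))
1 1 6 2 7 3 8 4 9 5"""


def node_labels(tree):
    """Map node number -> label for every '(LABEL-N' in a bracketed tree."""
    return {int(num): label
            for label, num in re.findall(r"\(([^\s()]+)-(\d+)", tree)}


def read_links(link_line):
    """Split '1 1 6 2 ...' into (source node, target node) pairs."""
    ids = [int(x) for x in link_line.split()]
    return list(zip(ids[0::2], ids[1::2]))


src_tree, trg_tree, link_line = SAMPLE.splitlines()
src, trg = node_labels(src_tree), node_labels(trg_tree)
links = read_links(link_line)
for s, t in links:
    print(f"{src[s]}-{s} <-> {trg[t]}-{t}")
```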

    If you would like to use this format for your training data, you can
    call the tree-aligner script with an extra parameter (-A) specifying
    the alignment format, for example:

       treealign -a nl-en_125.dublin -A dublin -f catpos -n 10 -e 10 > align

   Features & parameters
    First of all, you need to decide what kind of features should be used in
    the classifier model. Quite a lot of features are supported by
    Lingua::Align and you can easily add new ones. Look at
    Lingua::Align::Features for more information on classification features.
    Features are given as a list of feature type names separated by ':'
    using the command-line flag '-f'. Here are some simple examples assuming
    that the training corpus is stored in Stockholm Tree Aligner format in a
    file called "nl-en_125.xml" (from "europarl/"):

      treealign -a nl-en_125.xml -f catpos -n 10 -e 10 > aligned.xml

    This will train a model on the first 10 tree pairs using the "catpos"
    feature (pairs of category or POS labels) and then align the following
    10 tree pairs. The classification model is stored in the default file
    "treealign.megam". Have a look at the parameters if you like (it's just
    a plain text file). For other training options default settings are used
    (check the Lingua::treealign script for more details). Also for
    alignment standard settings are applied. The tree aligner will perform a
    two-step strategy using local classification and a greedy alignment
    inference.

    The result is printed to STDOUT and piped to "aligned.xml" in the
    example above. Use the following command to evaluate the alignment just
    done:

      treealigneval nl-en_125.xml aligned.xml

    This should give you very low precision and recall values (around 25%).
    Note that individual recall values for terminal nodes and non-terminal
    nodes are not correct because node type information is not available in
    the gold standard and the node IDs do not follow the convention that
    would allow a clear distinction.
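
    For reference, precision, recall and F-score over sets of links can be
    sketched as follows. This uses the standard definitions only;
    "treealigneval" may treat different link types differently, and the
    node IDs "s5_503", "s5_504", "s10_2" and "s10_3" are made up for the
    example.

```python
def prf(gold, proposed):
    """Precision, recall and balanced F-score over two sets of links."""
    gold, proposed = set(gold), set(proposed)
    correct = len(gold & proposed)
    precision = correct / len(proposed) if proposed else 0.0
    recall = correct / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


gold = {("s5_501", "s10_0"), ("s5_502", "s10_1"),
        ("s5_503", "s10_2"), ("s5_504", "s10_3")}
proposed = {("s5_501", "s10_0"), ("s5_502", "s10_2")}

precision, recall, f_score = prf(gold, proposed)
print(f"precision={precision:.2f} recall={recall:.2f} F={f_score:.2f}")
```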

    Another simple example using two feature types ("tree level similarity"
    and "tree span similarity") is the following:

      treealign -a nl-en_125.xml -f treelevelsim:treespansim -n 10 -e 10 > aligned.xml
      treealigneval nl-en_125.xml aligned.xml

    The classification model will look something like this:

      **BIAS** -8.23925781250000000000
      treespansim 3.54736995697021484375
      treelevelsim 3.76471185684204101562
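
    A minimal sketch of how such a model could be turned into a link score,
    assuming that the weights are applied as in standard binary logistic
    regression and that both similarity features take values between 0 and
    1. The feature values below are made up; only the weights come from the
    model shown above.

```python
import math

# Weights copied from the example model above.
WEIGHTS = {
    "**BIAS**": -8.23925781250000000000,
    "treespansim": 3.54736995697021484375,
    "treelevelsim": 3.76471185684204101562,
}


def link_probability(features, weights=WEIGHTS):
    """Logistic score for one node pair given {feature name: value}."""
    z = weights["**BIAS**"]
    for name, value in features.items():
        z += weights.get(name, 0.0) * value
    return 1.0 / (1.0 + math.exp(-z))


# Hypothetical feature values for two node pairs.
p_high = link_probability({"treespansim": 1.0, "treelevelsim": 1.0})
p_low = link_probability({"treespansim": 0.5, "treelevelsim": 0.5})
print(p_high, p_low)
```

    Under these assumptions even a node pair with both similarities at 1.0
    stays below a score of 0.5.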

    These features are still not very informative and the scores will
    remain very low. Now try a combination of the three feature types
    mentioned above:

      treealign -a nl-en_125.xml -f treelevelsim:treespansim:catpos \
           -n 10 -e 10 > aligned.xml
      treealigneval nl-en_125.xml aligned.xml

    This will give you much better results already (around 50% F-scores).

    Now you can start to experiment with contextual features, for example,
    "catpos" features of parent and children nodes:

      treealign -a nl-en_125.xml \
         -f treelevelsim:treespansim:catpos:parent_catpos:children_catpos \
         -n 10 -e 10 > aligned.xml
      treealigneval nl-en_125.xml aligned.xml

    And, surprise, this gives you another improvement (around 60% F-score).
    You can also get features from neighboring nodes using 'sister_' and
    'neighborXY_' as prefix. With 'sister_' features will be extracted from
    ALL sister nodes, i.e. nodes that have the same parent. In case of
    real-valued features the average (arithmetic mean) of the feature values
    of these sister nodes will be used. For binary feature templates (for
    example 'catpos') all of them will be included. (The 'children_' prefix
    behaves in exactly the same way.)

    The 'neighborXY_' prefix is more flexible. You can specify neighbors
    using X as the distance in the source language tree and Y as the
    distance in the target language. Negative values are interpreted as
    left neighbors and positive values (don't use '+'!) as neighbors to the
    right. For terminal nodes, all surface words will be considered for
    retrieving neighbors. For nonterminals, only neighboring nodes with the
    same parent as the current node will be considered! Observe that the
    distances have to be less than 10 because the pattern only allows single
    digits! Here is an example of the use of neighbor features:

      treealign -a nl-en_125.xml \
         -f catpos:neighbor-10_catpos:neighbor-11_catpos \
         -n 10 -e 10 > aligned.xml
      treealigneval nl-en_125.xml aligned.xml

    This retrieves the 'catpos' feature from the current node pair, from the
    left source tree neighbor together with the current target node, and
    from the left source tree neighbor together with the right target node
    neighbor.
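
    The naming scheme can be illustrated with a small Python sketch. The
    regular expression below is an assumption based on the description
    above (an optional minus sign and a single digit per side); it is not
    the pattern used inside Lingua::Align.

```python
import re

# Assumed shape: neighbor<X><Y>_<feature>, X and Y being single digits
# with an optional leading minus sign.
NEIGHBOR = re.compile(r"^neighbor(-?\d)(-?\d)_(.+)$")


def parse_neighbor(name):
    """Return (source distance, target distance, feature) or None."""
    match = NEIGHBOR.match(name)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2)), match.group(3)


for feature in "catpos:neighbor-10_catpos:neighbor-11_catpos".split(":"):
    print(feature, parse_neighbor(feature))
```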

    Note that these models so far do not use any other information than the
    features directly extracted from the parse trees and the alignment
    information available in the training data. There are also features that
    need external resources. For example, you may include word alignment
    information for the tree alignment. For this you need to run automatic
    word alignment first (on the treebank sentences you're using in your
    experiments) and you need to store the information in the format
    supported by Lingua::Align. You may use the Viterbi alignment produced
    by Giza++ ("moses/giza.src-trg/src-trg.A3.final.gz" and
    "moses/giza.trg-src/trg-src.A3.final.gz"):

      treealign -a nl-en_125.xml \
         -g moses/giza.src-trg/src-trg.A3.final.gz \
         -G moses/giza.trg-src/trg-src.A3.final.gz \
         -f gizae2f:gizaf2e \
         -n 10 -e 10 > aligned.xml
      treealigneval nl-en_125.xml aligned.xml

    You can see how effective these features are for tree alignment (well,
    at least they give you already around 55% F-scores with the tiny
    training data we are using in our examples). Of course, you can use word
    alignment features from context nodes as well (giving you around 65%
    F-scores):

      treealign -a nl-en_125.xml \
         -g moses/giza.src-trg/src-trg.A3.final.gz \
         -G moses/giza.trg-src/trg-src.A3.final.gz \
         -f gizae2f:gizaf2e:parent_giza:children_giza \
         -n 10 -e 10 > aligned.xml
      treealigneval nl-en_125.xml aligned.xml

    Note that we use a combination of "gizae2f" and "gizaf2e" for the
    context nodes. Now try a combination of all features we mentioned so
    far. You should get a decent score of around 74% F-score. Nice, isn't
    it?

    Another word alignment feature is based on the symmetrized alignments
    produced by Moses. Use them in the following way:

      treealign -a nl-en_125.xml \
         -y moses/model/aligned.intersect \
         -f moses:parent_moses:children_moses \
         -n 10 -e 10 > aligned.xml
      treealigneval nl-en_125.xml aligned.xml

    Don't ask me why the parameter is '-y' for the Moses alignment file.
    (It's basically because treealign only uses short command-line options
    and I was running out of letters ....)

    Actually, you could leave out the file specifications in the examples
    above because we were just using the default names and paths. You can
    use the flag ("-M moses-dir") if the file-names and sub-directories are
    the same but the main Moses work-directory is different (for example
    "my-moses-dir"):

      treealign -a nl-en_125.xml \
         -M my-moses-dir \
         -f moses:parent_moses:gizaf2e \
         -n 10 -e 10 > aligned.xml
      treealigneval nl-en_125.xml aligned.xml

    Finally, we should introduce history features. So far we have only done
    local classification without considering alignment decisions on other
    nodes. The classifier can also be trained with so-called history
    features, that is, features based on previous decisions. Using such
    features will force the tree aligner to use a sequential classification
    procedure, either bottom-up or top-down. Top-down classification starts
    with the root nodes, and the classifier uses alignment decisions on the
    parent nodes as additional features. You can use these so-called parent
    features like this (adding the flag "-P" to the command line):

      treealign -a nl-en_125.xml -f moses:gizae2f:gizaf2e \
                -n 10 -e 10 -P > aligned.xml

    Compare this to the alignment without the "-P" flag and you will see the
    difference when running evaluation. In bottom-up classification, two
    types of history features are supported: proportion of links between
    immediate children nodes ("-C") and proportion of links between all
    children nodes in the entire subtrees ("-U").

      treealign -a nl-en_125.xml -f moses:gizae2f:gizaf2e \
                -n 10 -e 10 -C -U > aligned.xml

    Note that history features coming from parent links and coming from
    children cannot be combined (for obvious reasons). And don't expect
    improvements in all cases. Especially for rich feature sets no big
    improvements can be expected. Note that alignment will also be (even)
    slower.

   Alignment strategies
    In the default settings a two-step procedure is used: First all node
    pairs are classified using the local classifier, possibly including
    history features. The second step comprises the actual alignment step
    (inference) in which nodes are linked to each other according to the
    link likelihoods assigned by the local classifier in the first step. The
    default strategy is a "greedy" alignment procedure, starting with the
    node pair with the highest link likelihood and running greedily through
    the set of candidates. A necessary constraint is that all nodes are
    aligned at most once (on both sides).
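
    The greedy procedure can be sketched in a few lines of Python. This is
    an illustration of the idea only, not the code in
    Lingua::Align::LinkSearch; the cut-off of 0.5 and the scores are
    assumptions made for the example.

```python
def greedy_align(scores, min_score=0.5):
    """Greedy one-to-one linking.

    scores maps (source node, target node) to a link likelihood.
    min_score is an assumed cut-off below which no link is created.
    """
    links, used_src, used_trg = [], set(), set()
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    for (src, trg), score in ranked:
        if score <= min_score:
            break  # all remaining candidates score even lower
        if src in used_src or trg in used_trg:
            continue  # each node may be aligned at most once
        links.append((src, trg))
        used_src.add(src)
        used_trg.add(trg)
    return links


# Hypothetical classifier scores for a few node pairs.
scores = {
    ("NP-6", "np-2"): 0.9,
    ("DT-7", "np-2"): 0.8,
    ("DT-7", "det-3"): 0.7,
    ("NNS-8", "noun-4"): 0.6,
    ("NNS-8", "det-3"): 0.55,
    ("VP-3", "np-2"): 0.3,
}
links = greedy_align(scores)
print(links)
```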

    You can use other strategies, for example one with an additional
    well-formedness constraint:

      treealign -a nl-en_125.xml \
                -y moses/model/aligned.intersect \
                -f moses:parent_moses:children_moses \
                -l GreedyWellformed \
                -n 10 -e 10 -C -U > aligned.xml

    Compare this to the result obtained with the standard strategy ("-l
    greedy"). Another common technique is to use graph-theoretic algorithms
    modeling tree alignment as a maximum weighted bipartite matching
    problem. Lingua::Align includes a free implementation of the Hungarian
    algorithm (Kuhn-Munkres) that solves this problem in polynomial time.

      treealign -a nl-en_125.xml \
                -y moses/model/aligned.intersect \
                -f moses:parent_moses:children_moses \
                -l munkres \
                -n 10 -e 10 -C -U > aligned.xml
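
    To see why a global assignment can differ from the greedy result,
    consider the following Python sketch. It finds the maximum-weight
    one-to-one assignment by exhaustive search over all permutations, which
    is only feasible for tiny examples; the Hungarian algorithm reaches the
    same optimum in polynomial time. The weight matrix is made up.

```python
from itertools import permutations


def best_assignment(weights):
    """Exhaustive maximum-weight assignment for a small square matrix.

    weights[i][j] is the score for linking source node i to target node j.
    """
    n = len(weights)
    best_total, best_perm = float("-inf"), None
    for perm in permutations(range(n)):
        total = sum(weights[i][perm[i]] for i in range(n))
        if total > best_total:
            best_total, best_perm = total, perm
    return best_total, list(enumerate(best_perm))


# A greedy search would start with the 0.9 link and then be left with 0.1.
weights = [[0.9, 0.8],
           [0.85, 0.1]]
total, assignment = best_assignment(weights)
print(total, assignment)
```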

    Several other inference strategies can be used. Documentation of the
    ones included in Lingua::Align is still largely missing. Look at the
    code in the module Lingua::Align::LinkSearch for more information.

    You can also do without alignment inference and simply use the decisions
    of the local classifier (default: scores above 0.5 indicate a link):

       treealign -a nl-en_125.xml -f moses:gizae2f:gizaf2e \
                 -n 10 -e 10 -P -l threshold > aligned.xml

    For simple feature sets these scores will be much lower. Alignment
    constraints such as the one-to-one link constraint and well-formedness
    of links are important in those cases. For richer feature sets this
    difference fades away.

    One problem with the greedy strategies is that alignment is slow because
    all node pairs have to be considered as candidates for classification
    (and alignment). This is costly because feature extraction is actually
    the bottleneck in the entire alignment procedure (not classification or
    alignment inference). There is a way to speed this up by combining local
    classification with a greedy alignment strategy. This is (again) called
    'bottom-up' alignment, but this time classifier scores are used
    immediately for establishing links between nodes. Alignment starts at
    the leaf nodes and each node pair that receives a score above 0.5 will
    be aligned immediately (and not considered afterwards anymore). After
    this greedy bottom-up procedure the chosen alignment inference strategy
    will be used for the remaining unlinked nodes. Use the option
    "-b bottom-up":

       treealign -a nl-en_125.xml -f moses:gizae2f:gizaf2e -n 10 -e 10 -C -U -v \
                 -l GreedyWellformed -b bottom-up > aligned.xml

    Observe that we can use history features again (but not "-P"). This
    should speed up the alignment process a bit (not as much as you might
    have expected ...). You can get information about the runtime by
    including the verbose output flag ("-v", see above).

   Library structure
    There are several options that can be set. For further information have
    a look at the manpages linked below or just look at the source code.
    Extending the code is quite straightforward even though the
    documentation is not perfect and the code is partially awful (well, it's
    Perl ....). Here is a (hopefully up-to-date) list of modules (many of
    them are under-developed / experimental / non-functioning possible
    projects for the future):

    top-level modules:
          Lingua/Align.pm
          Lingua/Align/Trees.pm

    modules for feature extraction
          Lingua/Align/Features.pm
          Lingua/Align/Features/Cooccurrence.pm
          Lingua/Align/Features/Lexical.pm
          Lingua/Align/Features/Alignment.pm
          Lingua/Align/Features/Tree.pm
          Lingua/Align/Features/Orthography.pm
          Lingua/Align/Features/History.pm

    modules for classification
          Lingua/Align/Classifier.pm
          Lingua/Align/Classifier/Megam.pm
          Lingua/Align/Classifier/Clues.pm
          Lingua/Align/Classifier/Diagonal.pm
          Lingua/Align/Classifier/LibSVM.pm

    modules for alignment inference
          Lingua/Align/LinkSearch.pm
          Lingua/Align/LinkSearch/Assignment.pm
          Lingua/Align/LinkSearch/AssignmentWellFormed.pm
          Algorithm/Munkres.pm
          Lingua/Align/LinkSearch/Cascaded.pm
          Lingua/Align/LinkSearch/GreedyFinalAnd.pm
          Lingua/Align/LinkSearch/GreedyFinal.pm
          Lingua/Align/LinkSearch/Greedy.pm
          Lingua/Align/LinkSearch/GreedyWellFormed.pm
          Lingua/Align/LinkSearch/Intersection.pm
          Lingua/Align/LinkSearch/NTFirst.pm
          Lingua/Align/LinkSearch/NTonly.pm
          Lingua/Align/LinkSearch/PaCoMT.pm
          Lingua/Align/LinkSearch/Src2Trg.pm
          Lingua/Align/LinkSearch/Src2TrgWellFormed.pm
          Lingua/Align/LinkSearch/Threshold.pm
          Lingua/Align/LinkSearch/Tonly.pm
          Lingua/Align/LinkSearch/Trg2Src.pm
          Lingua/Align/LinkSearch/Viterbi.pm

    modules for data manipulation
          Lingua/Align/Corpus.pm
          Lingua/Align/Corpus/Treebank.pm
          Lingua/Align/Corpus/Treebank/AlpinoXML.pm
          Lingua/Align/Corpus/Treebank/Penn.pm
          Lingua/Align/Corpus/Treebank/Stanford.pm
          Lingua/Align/Corpus/Treebank/TigerXML.pm
          Lingua/Align/Corpus/Parallel/Bitext.pm
          Lingua/Align/Corpus/Parallel/Dublin.pm
          Lingua/Align/Corpus/Parallel/Giza.pm
          Lingua/Align/Corpus/Parallel/Moses.pm
          Lingua/Align/Corpus/Parallel/OPUS.pm
          Lingua/Align/Corpus/Parallel/OrderedIds.pm
          Lingua/Align/Corpus/Parallel.pm
          Lingua/Align/Corpus/Parallel/STA.pm
          Lingua/Align/Corpus/Parallel/WPT.pm
          Lingua/Align/Corpus/Factored.pm

   How to do word alignment
    Lingua::Align can, of course, also be used for word alignment. It is
    straightforward if you have parse trees available. Then you can just
    specify the flag "-L" (leaves only) to only consider terminal nodes
    during training and alignment. (Note that you can also align
    non-terminal nodes only using the flag "-N" and if you use both flags
    "-N -L" only nodes of the same type will be aligned).

    Furthermore, you can also use the software to do word alignment on plain
    text files (this is still quite experimental). Look at the example in
    "europarl/wpt03" to see how to run the aligner. Again, you need some
    training data and you have to specify some features to be used for
    classification. Training data can be in the format of the shared task on
    word alignment WPT 2003/2005 (<http://www.cse.unt.edu/~rada/wpt/>):

      0008 4 2 S
      0008 1 1 P
      0008 2 1 P
      0008 3 1 P

    As features you may use, for example, string similarity measures such as
    LCSR score (longest common sub-sequence ratio), Dice scores based on
    co-occurrence frequencies, Moses/Giza++ alignments, binary features such
    as the occurrence of suffix pairs etc. Run "make" in the
    "europarl/wpt03" directory to see an example alignment experiment.

    To run your own experiments you can specify your own setup. Here is a
    simple example:

      treealign -a test.wa.nullalign -A wpt \
                -s test.e -S text \
                -t test.f -T text \
                -f lcsr=3:suffix=4:treespansim \
                -n 20 -e 20 -L > aligned.xml

    This uses the file "test.wa.nullalign" for training and testing, which
    is in "WPT" format ("-A"), and aligns source language texts ("test.e")
    to the target language texts ("test.f"), both in plain text format. The
    features are string similarity (LCSR) between tokens that are at least 3
    characters long, pairs of 4-character suffixes and "tree span
    similarity", which in the case of word alignment is the relative
    position difference between the tokens within the sentences.
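
    For illustration, the longest common subsequence ratio can be computed
    as in the following Python sketch. It uses the common definition (LCS
    length divided by the length of the longer token) and assumes that
    tokens shorter than the minimum length simply score 0; the
    implementation in Lingua::Align may differ in such details.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two strings."""
    prev = [0] * (len(b) + 1)
    for char_a in a:
        cur = [0]
        for j, char_b in enumerate(b, 1):
            if char_a == char_b:
                cur.append(prev[j - 1] + 1)
            else:
                cur.append(max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


def lcsr(a, b, min_length=3):
    """LCS ratio; tokens shorter than min_length score 0 (an assumption)."""
    if len(a) < min_length or len(b) < min_length:
        return 0.0
    return lcs_length(a, b) / max(len(a), len(b))


for pair in [("inflation", "inflation"), ("solution", "solution"), ("to", "to")]:
    print(pair, lcsr(*pair))
```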

    For evaluation you can use the standard evaluation script just
    specifying that the gold standard is in WPT format:

      treealigneval -g wpt test.wa.nullalign aligned.xml

    If you want to use Moses/Giza++ alignments as features, just use the
    same parameters as for tree alignment:

      treealign -a test.wa.nullalign -A wpt \
                -s test.e -S text \
                -t test.f -T text \
                -g moses/giza.e-f/A3.final.447.gz \
                -G moses/giza.f-e/A3.final.447.gz \
                -y moses/model/aligned.grow-diag-final.447 \
                -f moses:gizae2f:gizaf2e \
                -n 20 -e 20 -L > aligned.xml

    Another common feature is co-occurrence which can be measured in various
    ways. You can use the script "bin/coocfreq" to generate co-occurrence
    frequencies from arbitrary parallel corpora that can be plugged into the
    aligner as a feature. An example computing co-occurrence frequencies
    from tokens in the test corpus (which is much too small to compute
    reliable scores) is the following:

      coocfreq -s test.e -t test.f \
               -x word -y word \
               -f word.src -e word.trg -c word.cooc

    This uses the parallel corpus "test.e" and "test.f" (in Moses/Giza++
    plain text format -- corresponding lines are aligned to each other) to
    count frequencies that will be stored in "word.src" (source language
    tokens), "word.trg" (target language tokens) and "word.cooc"
    (co-occurrence frequencies). These scores can then be used in the
    aligner as a feature:

      treealign -a test.wa.nullalign -A wpt \
                -s test.e -S text \
                -t test.f -T text \
                -f dice=word.cooc \
                -n 20 -e 20 -L > aligned.xml

    Don't expect too much as these Dice scores are not reliable from such a
    small corpus! Of course, you can combine these scores with any other
    feature as described above.
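
    The Dice coefficient itself is simple: twice the co-occurrence
    frequency divided by the sum of the two individual frequencies. The
    Python sketch below counts every word type once per sentence pair, which
    is an assumption and not necessarily how "coocfreq" counts; the toy
    corpus is made up.

```python
from collections import Counter
from itertools import product


def count_frequencies(src_sentences, trg_sentences):
    """Count word types once per aligned sentence pair."""
    src_freq, trg_freq, cooc = Counter(), Counter(), Counter()
    for src, trg in zip(src_sentences, trg_sentences):
        src_types, trg_types = set(src.split()), set(trg.split())
        src_freq.update(src_types)
        trg_freq.update(trg_types)
        cooc.update(product(src_types, trg_types))
    return src_freq, trg_freq, cooc


def dice(src_word, trg_word, src_freq, trg_freq, cooc):
    """Dice coefficient: 2 * cooc / (freq_src + freq_trg)."""
    total = src_freq[src_word] + trg_freq[trg_word]
    if total == 0:
        return 0.0
    return 2.0 * cooc[(src_word, trg_word)] / total


src_freq, trg_freq, cooc = count_frequencies(
    ["we use inflation", "we get recovery"],
    ["nous utilisons inflation", "nous obtenons reprise"],
)
for pair in [("we", "nous"), ("we", "reprise"), ("use", "reprise")]:
    print(pair, dice(pair[0], pair[1], src_freq, trg_freq, cooc))
```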

    Co-occurrence frequencies can be computed for various kinds of features
    and feature combinations. For example, you can compute frequencies of
    word suffixes with the following command:

      coocfreq -s test.e -t test.f \
               -x suffix=4 -y suffix=4 \
               -f suffix.src -e suffix.trg -c suffix.cooc

    In order to use several Dice scores in alignment you can give these
    feature types different names (they have to start with 'dice'):

      treealign -a test.wa.nullalign -A wpt \
                -s test.e -S text \
                -t test.f -T text \
                -f diceword=word.cooc:dicesuffix=suffix.cooc \
                -n 20 -e 20 -L > aligned.xml

    It is worth mentioning that these feature types (Dice, LCSR,
    suffix-pairs, etc.) can also be used for tree alignment as explained
    earlier. In particular, Dice scores can be calculated for any feature
    connected to arbitrary nodes in a tree. Examples of such co-occurrence
    measures can be seen in the "smultron/" directory. Here is an example
    for computing co-occurrence frequencies for POS labels and parent
    category labels from parse tree pairs:

      coocfreq -a sophie.xml -A sta \
               -x pos:parent_cat -y pos:parent_cat \
               -f pospcat.src -e pospcat.trg -c pospcat.cooc

    In tree alignment it would also make sense to use the contextual
    co-occurrence features, for example, "-f parent_dicecat=cat.cooc" (if
    "cat.cooc" includes the co-occurrence frequencies of category labels).

    Finally, you can also visualize word alignment using a little tool in
    the bin directory of Lingua::Align:

      compare_wordalign.pl -A corpus \
                    -b wpt -B corpus -S test.e -T test.f \
                    aligned.xml test.wa.nullalign

    This will print link matrices comparing the proposed links with the gold
    standard links. It also computes cumulative evaluation measures
    (precision, recall, AER). It looks like this:

      compare word alignments for: 0157 -- 0157
      -----------------------------------------------|--
                                       · · · · · ·   | he
                                       · · · · · ·   | said
                                       · · · · · ·   | that
       S                                             | if
         S                                           | we
           S                                         | use
             · S                                     | unemployment
                 z ·                                 | as
                 · · · ·                 *           | the
                 · · · ·               *             | solution
                 · z · ·                             | to
                   · · S                             | inflation
                         S                           | ,
                           S · · · ·                 | we
                           · z · · ·                 | will
                           · · z · ·                 | get
                           · · · · P                 | recovery
                                                   S | .
      -----------------------------------------------|--
       s n u l c p c l i , n p r l é , v l p d l g .
       i o t e h o o e n   o o e e c   o a o e e o  
         u i   ô u m   f   u u l   o   i   s     u  
         s l   m r b   l   s r a   n   l   i     v  
           i   a   a   a     r n   o   à   t     e  
           s   g   t   t     o c   m       i     r  
           o   e   t   i     n e   i       o     n  
           n       r   o     s r   e       n     e  
           s       e   n                         m  
                                                 e  
                                                 n  
                                                 t  
         8 x (S) .... proposed = gold = S
         1 x (P) .... proposed = gold = P
         4 x (z) .... proposed = P, gold = S (ok!)
         0 x (d) .... proposed = S, gold = P (ok!)
      !  2 x (*) .... proposed = S, gold = not aligned (wrong!)
         0 x (+) .... proposed = P, gold = not aligned (wrong!)
         0 x (-) .... proposed = not aligned, gold = S (missing!)
        49 x (·) .... proposed = not aligned, gold = P (missing!)
      ----------------------------------------------------------------  
      total: 13 correct, 49 missing, 2 wrong
      this sentence: precision = 0.8667, recall = 1.0000, AER = 0.0741
            average: precision = 0.8338, recall = 0.9613, AER = 0.1150
              total: precision = 0.8526, recall = 0.9695, AER = 0.0941
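
    The scores for "this sentence" can be reproduced from the symbol counts
    in the legend. The following is only a sketch, assuming the usual
    definitions of precision, recall and alignment error rate (AER) over
    sure (S) and possible (P) gold links; the function name
    "alignment_scores" is hypothetical, and the actual computation in the
    evaluation script may differ in details:

```python
def alignment_scores(counts):
    """Compute (precision, recall, AER) from matrix symbol counts.

    `counts` maps the symbols used in the alignment matrix legend
    to their number of occurrences. This is an illustrative sketch,
    not the code used by the toolbox itself.
    """
    # Proposed links whose gold link is sure (S): symbols 'S' and 'z'.
    proposed_sure = counts.get("S", 0) + counts.get("z", 0)
    # Proposed links whose gold link is possible (P): symbols 'P' and 'd'.
    proposed_possible = counts.get("P", 0) + counts.get("d", 0)
    # Proposed links without any gold link: symbols '*' and '+'.
    wrong = counts.get("*", 0) + counts.get("+", 0)
    # Sure gold links that were not proposed: symbol '-'.
    missed_sure = counts.get("-", 0)

    correct = proposed_sure + proposed_possible  # every S link is also a P link
    proposed = correct + wrong                   # all proposed links
    gold_sure = proposed_sure + missed_sure      # all sure gold links

    precision = correct / proposed if proposed else 0.0
    recall = proposed_sure / gold_sure if gold_sure else 0.0
    denominator = proposed + gold_sure
    aer = 1.0 - (proposed_sure + correct) / denominator if denominator else 0.0
    return precision, recall, aer


# Counts taken from the legend of the example sentence above.
p, r, a = alignment_scores({"S": 8, "P": 1, "z": 4, "*": 2, "·": 49})
print(f"precision = {p:.4f}, recall = {r:.4f}, AER = {a:.4f}")
# prints: precision = 0.8667, recall = 1.0000, AER = 0.0741
```

    Under these assumptions the 49 missing possible links do not lower
    recall, because recall is measured against sure gold links only.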

  Documentation & References
    Several man pages are generated from the "pod" documentation in the
    Perl modules and scripts included in Lingua::Align. See the following
    pages:

    Lingua::treealign

    Lingua::treealigneval

    Lingua::Align

    Lingua::Align::Trees

    Lingua::Align::Features

    Lingua::Align::LinkSearch

    Lingua::Align::Corpus

    Lingua::coocfreq

    Lingua::convert_treebank

    Lingua::sta2moses

    Lingua::sta2phrases

    Here are some publications (please cite if you use the software):

    Tiedemann, J. (2010)
        Lingua-Align: An Experimental Toolbox for Automatic Tree-to-Tree
        Alignment. In *Proceedings of the 7th International Conference on
        Language Resources and Evaluation* (LREC'2010), 2010.
        <http://stp.lingfil.uu.se/~joerg/published/lrec2010.pdf>

          @InProceedings{Tiedemann:LREC10,
            author =     {Jörg Tiedemann},
            title =      {Lingua-Align: An Experimental Toolbox for Automatic
                          Tree-to-Tree Alignment},
            booktitle =  {Proceedings of the 7th International Conference on
                          Language Resources and Evaluation (LREC'2010)},
            year =       2010,
            address =    {Valletta, Malta},
          }

    Tiedemann, J. and Kotzé, G. (2009)
        Building a Large Machine-Aligned Parallel Treebank. In *Proceedings
        of the 8th International Workshop on Treebanks and Linguistic
        Theories* (TLT'08), pages 197-208, EDUCatt, Milano/Italy, 2009.
        <http://stp.lingfil.uu.se/~joerg/published/tlt09.pdf>

         @InProceedings{TiedemannKotze:TLT09,
           author =      {Jörg Tiedemann and Gideon Kotzé},
           title =       {Building a Large Machine-Aligned Parallel Treebank},
           booktitle =   {Proceedings of the 8th International Workshop on
                          Treebanks and Linguistic Theories (TLT'08)},
           year =        2009,
           pages =       {197--208},
           isbn =        {978-88-8311-712-1},
           editor =      {Marco Passarotti and Adam Przepiórkowski and
                          Savina Raynaud and Frank Van Eynde},
           publisher =   {EDUCatt, Milano/Italy}
         }

    Tiedemann, J. and Kotzé, G. (2009)
        A Discriminative Approach to Tree Alignment. In *Proceedings of the
        International Workshop on Natural Language Processing Methods and
        Corpora in Translation, Lexicography and Language Learning* (in
        connection with RANLP'09), pages 33-39, 2009.
        <http://stp.lingfil.uu.se/~joerg/published/ranlp09_tree.pdf>

         @InProceedings{TiedemannKotze:RANLP09,
           author =      {Jörg Tiedemann and Gideon Kotzé},
           title =       {A Discriminative Approach to Tree Alignment},
           booktitle =   {Proceedings of the International Workshop on Natural
                          Language Processing Methods and Corpora in
                          Translation, Lexicography and Language Learning (in
                          connection with RANLP'09)},
           pages =       {33--39},
           year =        2009,
           editor =      {Iustina Ilisei and Viktor Pekar and Silvia
                          Bernardini},
           isbn =        {978-954-452-010-6}
         }

  Author
    Joerg Tiedemann, <jorg.tiedemann@lingfil.uu.se>

  Copyright and License
    Copyright (C) 2009, 2010 by Joerg Tiedemann, Gideon Kotzé

    This library is free software; you can redistribute it and/or modify it
    under the same terms as Perl itself, either Perl version 5.8.8 or, at
    your option, any later version of Perl 5 you may have available.

    MegaM is copyright Hal Daume III; see
    <http://www.cs.utah.edu/~hal/megam/> for more information. Paper:
    "Notes on CG and LM-BFGS Optimization of Logistic Regression", 2004,
    <http://www.cs.utah.edu/~hal/docs/daume04cg-bfgs.pdf>

