NAME
    Lingua::Align::Trees::Features - Perl modules for feature extraction for
    the Lingua::Align::Trees tree aligner

SYNOPSIS
      use Lingua::Align::Trees::Features;

      my $FeatString = 'catpos:treespansim:parent_catpos';
      my $extractor = new Lingua::Align::Trees::Features(
                              -features => $FeatString);

      my %features = $extractor->features(\%srctree,\%trgtree,
                                          $srcnode,$trgnode);

      my $FeatString2 = 'giza:gizae2f:gizaf2e:moses';
      my $extractor2 = new Lingua::Align::Trees::Features(
                          -features => $FeatString2,
                          -lexe2f => 'moses/model/lex.0-0.e2f',
                          -lexf2e => 'moses/model/lex.0-0.f2e',
                          -moses_align => 'moses/model/aligned.intersect');

      my %features = $extractor2->features(\%srctree,\%trgtree,
                                           $srcnode,$trgnode);

DESCRIPTION
    Extract features from a pair of nodes from two given syntactic trees
    (source and target language). The trees should be complex hash
    structures as produced by Lingua::Align::Corpus::Treebank::TigerXML. The
    returned features are given as simple key-value pairs (%features)

    Features to be used are specified in the feature string given to the
    constructor ($FeatString). Default is 'inside2:outside2' which refers to
    2 features, the inside score and the outside score as defined by the
    Dublin Sub-Tree Aligner (see
    http://www2.sfs.uni-tuebingen.de/ventzi/Home/Software/Software.html,
    http://ventsislavzhechev.eu/Downloads/Zhechev%20MT%20Marathon%202009.pdf
    ). For this you will need the probabilistic lexicons as created by Moses
    (http://statmt.org/moses/); see the -lexe2f and -lexf2e parameters in
    the constructor of the second example.

    Features in the feature string are separated by ':'. Here is an example
    of a feature string including feature types, tree-level similarity
    scores, tree span similarity scores and category/POS label pairs (more
    information about supported feature types can be found below):

      treelevelsim:treespansim:catpos

    You can also refer to contextual features, meaning that you can extract
    all possible feature types from connected nodes. This is done by
    specifying how the contextual node is connected to the current one. For
    example, you can refer to parent nodes on source and/or target language
    side. A feature with the prefix 'parent_' makes the feature extractor to
    take the corresponding values from the first parent nodes in source and
    target language trees. The prefix 'srcparent_' takes the values from the
    source language parent (but the current target language node) and
    'trgparent_' takes the target language parent but not the source
    language parent. For example 'parent_catpos' gets the labels of the
    parent nodes. These feature types can again be combined with others as
    described above (product, mean, concatenation). We can also use
    'sister_' features 'children_' features which will refer to the feature
    with the maximum value among all sister/children nodes, respectively.

    Finally, there is also a way to address neighbor nodes using the prefix
    'neighborXY_' where X and Y refer to the distance from the current node
    (single digits only!). X gives the distance of the source language
    neighbor and Y the distance of the target language neighbor. Negative
    values refer to left neighbors and positive values (do not use '+' to
    indicate positive values!) refer to neighbors to the right. For terminal
    nodes all surface words are considered to retrieve neighbors. For all
    other nodes only neighbors that are connected via the same parent node
    will be retrieved. Here are some examples of contextual features:

      # category/POS label pairs from the left neighbor on the source side
      # and the current node at the target side
      neighbor-10_catpos

      # Moses word alignment feature from the left source language neighbor
      # and the neighbor 2 positions to the left on the target side
      neighbor-1-2_moses

      # tree level similarity between the source parent and the current target node
      srcparent_treelevelsim

      # average gizae2f score of all children of the current node pair
      children_gizae2f

      # category/POS label pair of the grandparent on the source side
      # and the parent of the target side
      parent_srcparent_catpos

      # lexical "inside2" score feature from the left source neighbor
      # and the right target neighbor
      neighbor-11_inside2

      # category/POS label pairs of ALL combinations of sister nodes
      # of the current node pair (from both sides)
      sister_catpos

    Feature types can also be combined to form complex features. Possible
    combinations are:

    product (*)
        multiply the value of 2 or more feature types, e.g.
        'inside2*outside2' would refer to the product of inside2 and
        outside2 scores

    average (+)
        compute the average (arithmetic mean) of 2 or more features, e.g.
        'inside2+outside2' would refer to the mean of inside2 and outside2
        scores

    concatenation (.)
        merge 2 or more feature keys and compute the average of their
        scores. This can especially be useful for "nominal" feature types
        that have several instantiations. For example, 'catpos' refers to
        the labels of the nodes (category or POS label) and the value of
        this feature is either 1 (present). This means that for 2 given
        nodes the feature might be 'catpos_NN_NP => 1' if the label of the
        source tree node is 'NN' and the label of the target tree node is
        'NP'. Such nominal features can be combined with real valued
        features such as inside2 scores, e.g. 'catpos.inside2' means to
        concatenate the keys of both feature types and to compute the
        arithmetic mean of both scores. Here are some more examples of
        complex features:

          # product of inside and outside scores
          inside2*outside2

          # product of tree level similarity score between source parent and
          # current target and Moses score for parent nodes on both sides
          srcparent_treelevelsim*parent_moses  

  FEATURES
    The following feature types are implemented in the Tree Aligner:

   lexical equivalence features
    Lexical equivalence features evaluate the relations between words
    dominated by the current subtree root nodes (alignment candidates). They
    all use lexical probabilities usually derived from automatic word
    alignment (other types of probabilistic lexica could be used as well).
    The notion of inside words refers to terminal nodes that are dominated
    by the current subtree root nodes and outside words refer to terminal
    nodes that are not dominated by the current subtree root nodes. Various
    variants of scores are possible:

    inside1 (insideST1*insideTS1)
        This is the unnormalized score of words inside of the current
        subtrees (see
        http://ventsislavzhechev.eu/Downloads/Zhechev%20MT%20Marathon%202009
        .pdf). Lexical probabilities are taken from automatic word alignment
        (lex-files). NULL links are also taken into account. It is actually
        the product of insideST1 (probabilities from source-to-target
        lexicon) and insideTS1 (probabilities from target-to-source lexicon)
        which also can be used separately (as individual features).

    outside1 (outsideST1*outsideTS1)
        The same as inside1 but for word pairs outside of the current
        subtrees. NULL links are counted and scores are not normalized.

    inside2 (insideST2*insideTS2)
        This refers to the normalized inside scores as defined in the Dublin
        Subtree Aligner.

    outside2 (outsideST1*outsideTS1)
        The normalized scores of word pairs outside of the subtrees.

    inside3 (insideST3*insideTS3)
        The same as inside1 (unnormalized) but without considering NULL
        links (which makes feature extraction much faster)

    outside3 (outsideST1*outsideTS1)
        The same as outside1 but without NULL links.

    inside4 (insideST4*insideTS4)
        The same as inside2 but without NULL links.

    outside4 (insideST4*insideTS4)
        The same as outside2 but without NULL links.

    maxinside (maxinsideST*maxinsideTS)
        This is basically the same as inside4 but using "max P(x|y)" instead
        of "1/|y \SUM P(x|y)" as in the original definition. maxinsideST is
        using the source-to-target scores and maxinsideTS is using the
        target-to-source scores.

    maxoutside (maxoutsideST*maxoutsideTS)
        The same as maxinside but for outside word pairs

    avgmaxinside (avgmaxinsideST*avgmaxinsideTS)
        This is the same as maxinside but computing the average (1/|x|\SUM_x
        max P(x|y)) instead of the product (\PROD_x max P(x|y))

    avgmaxoutside (avgmaxoutsideST*avgmaxoutsideTS)
        The same as avgmaxinside but for outside word pairs

    unioninside (unioninsideST*unioninsideTS)
        Add all lexical probabilities using the addition rule of independent
        but not mutually exclusive probabilities
        (P(x1|y1)+P(x2|y2)-P(x1|y1)*P(x2|y2))

    unionoutside (unionoutsideST*unionoutsideTS)
        The same as unioninside but for outside word pairs.

   word alignment features
    Word alignment features use the automatic word alignment directly. Again
    we distinguish between words that are dominated by the current subtree
    root nodes (inside) and the ones that are outside. Alignment is binary
    (1 if two words are aligned and 0 if not) and as a score we usuallty
    compute the proportion of interlinked inside word pairs among all links
    involving either source or target inside words. One exception is the
    moselink feature which is only defined for terminal nodes.

    moses
        The proportion of interlinked words (from automatic word alignment)
        inside of the current subtree among all links involving either
        source or target words inside of the subtrees.

    moseslink
        Only for terminal nodes: is set to 1 if the twwo words are linked in
        the automatic word alignment derived from GIZA++/Moses.

    gizae2f
        Link proportion as for moses but now using the assymmetric GIZA++
        alignments only (source-to-target).

    gizaf2e
        Link proportion as for moses but now using the assymmetric GIZA++
        alignments only (target-to-source).

    giza
        Links from gizae2f and gizaf2e combined.

   sub-tree features
    Sub-tree features refer to features that are related to the structure
    and position of the current subtrees.

    treespansim
        This is a feature for measuring the "horizontal" similarity of the
        subtrees under consideration. It is defined as the 1 - the relative
        position difference of the subtree spans. The relative position of a
        subtree is defined as the middle of the span of a subtree
        (begin+end/2) divided by the length of the sentence.

    treelevelsim
        This is a feature measuring the "vertical" similarity of two nodes.
        It is defined as 1 - the relative tree level difference. The
        relative tree level is defined as the distance to the sentence root
        node divided by the size of the tree (which is the maximum distance
        of any node in the tree to the sentence root).

    nrleafsratio
        This is the ratio of the number of leaf nodes dominated by the two
        candidate nodes. The ratio is defined as the
        minimum(nr_src_leafs/nr_trg_leafs,nr_trg_leafs/nr_src_leafs).

   annotation/label features
    "catpos"
        This feature type extracts node label pairs and gives them the value
        1. It uses the "cat" attribute if it exists, otherwise it uses the
        "pos" attribute if that one exists.

    "edge"
        This feature refers to the pairs of edge labels (relations) of the
        current nodes to their immediate parent (only the first parent is
        considered if multiple exist). This is a binary feature and is set
        to 1 for each observed label pair.

   co-occurrence features
    Measures of co-occurrence can also be used as features. Currently Dice
    scores are supported that will be computed "on-the-fly" from
    co-occurrence frequency counts. Frequencies should be stored in plain
    text files using a simple format:

    Source/target language frequencies: First line starts with '#' and
    specifies the node features used (for example '# word' means that the
    actual surface words will be used). All other lines should contain three
    items, the actual token, a unique token ID and the frequency (all
    separated by one TAB character). A file should look like this:

      # word
      learned 682     4
      stamp   722     3
      hat     1056    5
      what    399     20
      again   220     14

    Co-occurrence frequencies are also stored in plain text files with the
    following format: The first two lines specify the files which are used
    to store source and target language frequencies. All other lines contain
    source token ID, target token ID and the corresponding co-occurrence
    frequency. An example could look like this:

      # source frequencies: word.src
      # target frequencies: word.trg
      127     32      4
      127     898     3
      127     31      3
      798     64      3
      798     861     4

    The easiest way to produce such frequency files is to use the script
    "coocfreq" in the "bin" directory of Lingua-Align. Look at the header of
    this script for possible options.

    Features for the frequency counts can be quite complex. Any node
    attribute can be used. Special features are "suffix=X", "prefix=X" which
    refer to word-suffix resp. word-prefix of length X (number of
    characters). Another special feature is "edge" which refers to the
    relation to the head of the current node. Features can also be combined
    (separate each feature with one ':' character). You may also use
    features of context nodes using 'parent_', 'children_' and 'sister_' as
    for the alignment features. Here is an example of a complex feature:

      word:pos:parent_suffix=3:parent_cat

    This refers to the surface word at the current node, its POS label, the
    3-letter-suffix and the category label of the parent node. All these
    feature values will be concatenated with ':' and frequencies refer to
    those concatenated strings.

    "diceNAME=COOCFILE"
        Using features that start with the prefix 'dice' you can use Dice
        scores as features which will be computed from the frequencies in
        COOCFILE (and the source/target frequencies in the files specified
        in COOCFILE). You have to give each dice-feature a unique name
        (NAME) if you want to use several Dice score features. For example
        "dicePOS=pos.coocfreq" enables Dice score features over POS-label
        co-occurrence frequencies stored in pos.coocfreq (if that's what
        you've stored in pos.coocfreq).

        You may again use context features in the same way as for all other
        features, for example, 'sister_dicePOS=pos.coocfreq'. Note that
        co-occurrence features not always exist for all nodes in the tree
        (for example POS labels do not exist for non-terminal nodes).

   "orthographic" features
    You can also use features that are based on the comparison and
    combination of strings. There are (sub)string features, string
    similarity features, string class features and length comparison
    features.

    "lendiff"
        This is the absolute character length difference of the source and
        the target language strings dominated by the current nodes.

    "lenratio"
        This is the character-length ratio of the source and the target
        language strings dominated by the current nodes (shorter string
        divided by longer string)

    "word"
        This is the pair of words at the current node (leaf nodes only).

    "suffix=X"
        This is the pair of suffixes of length X from both source and target
        language words (leaf nodes only).

    "prefix=X"
        This is the pair of prefixes of length X from both source and target
        language words (leaf nodes only).

    "isnumber"
        This feature is set to 1 if both strings match the pattern
        /^[\d\.\,]+\%?$/

    "hasdigit"
        This feature is set to 1 if both strings contain at least one digit.

    "haspunct"
        This feature is set to 1 if both strings contain punctuations.

    "ispunct"
        This feature is set to 1 if both strings are single character
        punctuations.

    "punct"
        This feature is set to the actual pair of strings if both strings
        are single character punctuations.

    "identical=minlength"
        This feature is 1 if both strings are longer than "minlength" and
        are identical.

    "lcsr=minlength"
        This feature is the longest subsequence ratio between the two
        strings if they are both longer than "minlength" characters.

    "lcsrlc=minlength"
        This is the same as "lcsr" but using lowercased strings.

    "lcsrascii=minlength"
        This is the same as "lcsr" but using only the ASCII characters in
        both strings.

    "lcsrcons=minlength"
        This is the same as "lcsr" but uses a simple regex to remove all
        vowels (using a fixed set of characters to match).

SEE ALSO
    For the tree structure see Lingua::Align::Corpus::Treebank. For
    information on the tree aligner look at Lingua::Align::Trees

AUTHOR
    Joerg Tiedemann, <jorg.tiedemann@lingfil.uu.se>

COPYRIGHT AND LICENSE
    Copyright (C) 2009 by Joerg Tiedemann

    This library is free software; you can redistribute it and/or modify it
    under the same terms as Perl itself, either Perl version 5.8.8 or, at
    your option, any later version of Perl 5 you may have available.

