NAME
    coocfreq - count co-occurrence frequencies for arbitrary features of
    nodes in a parallel treebank (corpus).

SYNOPSIS
      coocfreq [OPTIONS]

      # count co-occurrence frequencies between category labels
      # in the parallel treebank of Sophie's World (Smultron)
      # and print the results in plain text files 

      coocfreq -a sophie.xml -A sta -x cat -y cat -f cat.src -e cat.trg -c cat.cooc

      # count co-occurrences of 3-letter-suffix + category label of the parent node
      # of the source language tree with words from the target language tree
      # results will be stored in src.freq, trg.freq and cooc.freq

      coocfreq -a sophie.xml -A sta -x suffix=3:parent_cat -y word

DESCRIPTION
    This script counts frequencies and co-occurrence frequencies of source
    and target language features. It runs through the sentence aligned
    treebank and combines all node pairs. Note that co-occurrence
    frequencies in a sentence are " max( srcfreq(srcfeature) ,
    trgfreq(trgfeature) ) " to ensure Dice scores between 0 and 1!

OPTIONS
    -f src.freq
        Specify the name for the source language frequencies. The file will
        start with a line specifying the source language features used
        (starting with an initial '#'). All other lines have three TAB
        separated items: the feature string, a unique ID, and finally the
        frequency.

         # word
         learned 682     4
         stamp   722     3
         hat     1056    5
         what    399     20
         again   220     14
         of      27      118

    -e trg.freq
        Specify the name for the target language frequencies. The format is
        the same as for the source language.

    -c cooc.freq
        Specify the name for the co-occurrence frequencies. The first two
        lines specify the names of the files with the source and the target
        language frequencies and all other lines contain TAB separated
        source feature ID, target feature ID and co-occurrence frequency.
        Here is an example:

         # source frequencies: word.src
         # target frequencies: word.trg
         127     32      4
         127     898     3
         127     31      3
         127     11      5
         127     138     6
         798     9       4
         1250    1367    3

    -a align-file
        Name of the alignment file (needs to include sentence alignment
        information). Parallel corpora without explicit sentence alignment
        files can also be used. For example, you can leave out this
        parameter if your parallel corpus is a plain text corpus with two
        separate files for source and target language and corresponding
        lines are aligned.

    -A align-file-format
        This argument specifies the format of the sentence alignment file.
        For example, it can be OPUS (XCES format used in OPUS) or STA
        (Stockholm Tree Aligner format).

    -s src-file
        Source language file of your parallel corpus.

    -S src-file-format
        Format of the source language file. Default will be "plain text".

    -s trg-file
        Target language file of your parallel corpus.

    -T trg-file-format
        Format of the target language file. Default will be "plain text".

    -x srcfeatures
        Features in the source language. Default feature is 'word' = surface
        words at each terminal node. All kinds of node attributes and
        combinations of features and contextual features can be used.

    -y trgfeatures
        The same as -x but for the target language trees.

    -m freq-threshold
        The frequency threshold. Default is 2.

    -D  A flag that enables storing the source and target language
        vocabulary in DB_FILE database files on disk to save memory when
        counting. This can be useful especially for complex (long) feature
        strings. Otherwise it doesn't save that much. The co-occurrence
        matrix is the big problem .....

SEE ALSO
    Lingua::treealign, Lingua::Align::Trees, Lingua::Align::Features

AUTHOR
    Joerg Tiedemann, <jorg.tiedemann@lingfil.uu.se>

COPYRIGHT AND LICENSE
    Copyright (C) 2009 by Joerg Tiedemann

    This library is free software; you can redistribute it and/or modify it
    under the same terms as Perl itself, either Perl version 5.8.8 or, at
    your option, any later version of Perl 5 you may have available.

