NAME
    Lingua::JA::WebIDF - WebIDF calculator

SYNOPSIS
      use Lingua::JA::WebIDF;

      my $webidf = Lingua::JA::WebIDF->new(
          api       => 'Yahoo',
          appid     => $appid,
          fetch_df  => 1,
          Furl_HTTP => { timeout => 3 }
      );

      print $webidf->idf("東京"); # low
      print $webidf->idf("スリジャヤワルダナプラコッテ"); # high

DESCRIPTION
    Lingua::JA::WebIDF calculates WebIDF scores.

    WebIDF (Inverse Document Frequency) scores represent the rarity of
    words on the Web. Rare words have high WebIDF scores; common words have
    low ones.

    IDF is based on the intuition that a query term which occurs in many
    documents is not a good discriminator and should be given less weight
    than one which occurs in few documents.

METHODS
  new( %config || \%config )
    Creates a new Lingua::JA::WebIDF instance.

    The following defaults are used for any key you don't set in %config.

      KEY                 DEFAULT VALUE
      -----------         ---------------
      idf_type            1
      api                 'Yahoo'
      appid               undef
      driver              'Storable'
      df_file             undef
      fetch_df            1
      expires_in          365
      documents           250_0000_0000
      Furl_HTTP           undef

    idf_type => 1 || 2 || 3
        Type 1 is the most commonly cited form of IDF.

                           N
          idf(t_i) = log -----  (1)
                          n_i

          N  : the number of documents
          n_i: the number of documents which contain term t_i
          t_i: term

        Type 2 is a simple version of the RSJ (Robertson/Sparck Jones)
        weight.

                      N - n_i + 0.5
          w_i = log ----------------  (2)
                       n_i + 0.5

        Type 3 is a modification of (2).

                      N + 0.5
          w_i = log -----------  (3)
                     n_i + 0.5

    api => 'Yahoo' || 'YahooPremium' || 'Bing'
        Uses the specified Web API when fetching WebDF (Document Frequency)
        scores from the Web.

    driver => 'Storable' || 'TokyoCabinet'
        Fetches and saves WebDF scores with the specified driver.

    df_file => $path
        Saves WebDF scores to the specified path.

        If undef is specified, 'yahoo_utf8.st' is used. This file is located
        in File::ShareDir::dist_dir('Lingua-JA-WebIDF'), and contains the
        WebDF scores of about 100000 words. There are other format files in
        the 'share' directory of this library.

        The 100000 words were taken from the following sources.

        *   Noun.csv and Noun.adjv.csv in IPA dictionary

        *   Japanese WordNet

        It is recommended that you change the file to match the Web API you
        specify, because WebDF scores may differ between APIs.

    fetch_df => 0
        If 0 is specified, WebDF scores are not fetched from the Web.

        If the WebDF score you ask for is already saved, it is used.
        Otherwise, undef is returned.

    expires_in => $days
        The number of days a WebDF score stays valid. If 365 is specified, a
        WebDF score expires 365 days after it is fetched.

    Furl_HTTP => \%option
        Sets the options of Furl::HTTP->new.

        If you want to use a proxy server, you have to set it through this
        option.
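
    The options above can be collected in a plain hash before being passed
    to new. The following is only a sketch: the environment variable name
    and the proxy URL are placeholders, and the proxy key is assumed to be
    handed to Furl::HTTP->new unchanged, as described for Furl_HTTP.

```perl
use strict;
use warnings;

# Example configuration; every value here is illustrative.
my %config = (
    idf_type   => 3,                  # formula (3)
    api        => 'Yahoo',
    appid      => $ENV{YAHOO_APPID},  # hypothetical environment variable
    fetch_df   => 1,                  # fetch missing WebDF scores
    expires_in => 30,                 # refetch scores older than 30 days
    Furl_HTTP  => {
        timeout => 3,
        proxy   => 'http://proxy.example.com:8080',  # placeholder URL
    },
);

# With the module installed, the hash goes straight to the constructor:
# my $webidf = Lingua::JA::WebIDF->new(%config);
```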

  idf($word)
    Calculates the WebIDF score of $word.

    If the WebDF score of $word is not saved or has expired, it is fetched
    with the Web API you specified and then saved.
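
    To make the three idf_type formulas listed under new concrete, here is
    a standalone sketch in plain Perl. It does not use the module. The
    document frequencies are made-up values, and the use of the natural
    logarithm is an assumption, since the base is not stated here.

```perl
use strict;
use warnings;

# N: total number of documents (the default 'documents' value).
my $N = 250_0000_0000;

# $n is the WebDF of a term, i.e. n_i in formulas (1) to (3).
# Perl's log() is the natural logarithm; the base is an assumption.
sub idf_type1 { my ($n) = @_; return log( $N / $n ) }
sub idf_type2 { my ($n) = @_; return log( ($N - $n + 0.5) / ($n + 0.5) ) }
sub idf_type3 { my ($n) = @_; return log( ($N + 0.5) / ($n + 0.5) ) }

# Hypothetical document frequencies, chosen only for illustration.
my %df = ( common => 1_0000_0000, rare => 1000 );

for my $word (sort keys %df) {
    my $n = $df{$word};
    printf "%-6s type1=%.3f type2=%.3f type3=%.3f\n",
        $word, idf_type1($n), idf_type2($n), idf_type3($n);
}
```

    Note that formula (1) is undefined for n_i = 0, and formula (2) turns
    negative once a term occurs in more than half of the documents.
    Formula (3) avoids both cases.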

  df($word)
    Fetches the WebDF score of $word.

    If the WebDF score of $word is not saved or has expired, it is fetched
    with the Web API you specified and then saved.
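
    When fetch_df is 0, a lookup for an unsaved word yields undef, so
    callers may want to separate known words from unknown ones. The helper
    below is hypothetical and not part of this module. It only assumes an
    object with a df($word) method that returns undef for unsaved words.

```perl
use strict;
use warnings;

# Hypothetical helper: split words into those with a saved WebDF score
# and those without one. $webidf can be any object that provides
# df($word), such as an instance created with fetch_df => 0.
sub partition_by_df {
    my ($webidf, @words) = @_;
    my (%known, @unknown);
    for my $word (@words) {
        my $df = $webidf->df($word);
        if (defined $df) { $known{$word} = $df }
        else             { push @unknown, $word }
    }
    return (\%known, \@unknown);
}
```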

  db_open($mode)
    Opens the database file.

    If you use TokyoCabinet, you have to open the database file with this
    method before calling the idf, df, db_close, or purge methods.

    $mode is 'read' or 'write'.

  db_close
    Closes the database file.

    This method is called automatically when the object is destroyed, so
    you may not need to call it explicitly.
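
    One way to keep db_open and db_close paired with the TokyoCabinet
    driver is a small wrapper. This is a sketch only: with_db is a
    hypothetical helper, not a method of this module, and it assumes
    nothing beyond the db_open($mode) and db_close methods described
    above.

```perl
use strict;
use warnings;

# Hypothetical helper: open the database, run $code, and close the
# database again even if $code dies. Returns whatever $code returns
# (call it in list context).
sub with_db {
    my ($webidf, $mode, $code) = @_;
    $webidf->db_open($mode);
    my @result = eval { $code->($webidf) };
    my $error  = $@;          # save before db_close can overwrite $@
    $webidf->db_close;
    die $error if $error;     # rethrow after the database is closed
    return @result;
}

# Possible usage, assuming $webidf was created with
# driver => 'TokyoCabinet' and a df_file of your own:
# my ($idf) = with_db($webidf, 'read', sub { $_[0]->idf("東京") });
```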

  purge($expires_in)
    Purges old data in df_file.

    If 365 is specified, data older than 365 days is purged.

AUTHOR
    pawa <pawapawa@cpan.org>

SEE ALSO
    Lingua::JA::TFWebIDF

    Lingua::JA::WebIDF::Driver::TokyoTyrant

    Bing API: <http://www.bing.com/toolbox/bingdeveloper/>

    Yahoo API: <http://developer.yahoo.co.jp/>

    Tokyo Cabinet: <http://fallabs.com/tokyocabinet/>

    S. Robertson, Understanding inverse document frequency: on theoretical
    arguments for IDF. Journal of Documentation 60, 503-520, 2004.

LICENSE
    This library is free software; you can redistribute it and/or modify it
    under the same terms as Perl itself.

