NAME
    HTML::Untemplate - web scraping assistant

VERSION
    version 0.013

DESCRIPTION
    Suppose you have a set of HTML documents generated by populating the
    same template with the data from some kind of database. HTML::Untemplate
    is a set of command-line tools ("xpathify", "untemplate") and modules
    (HTML::Linear and its dependencies) which assist in retrieving the
    original data.

    This process is also known as wrapper induction
    <https://en.wikipedia.org/wiki/Wrapper_(data_mining)>.

    To achieve this goal, HTML tree nodes are presented as XPath/content
    pairs. HTML documents linearized this way can be easily inspected
    manually or with a diff tool. Please refer to "EXAMPLES".

    Despite being named similarly to HTML::Template, this distribution is
    not directly related to it. Instead, it attempts to reverse the
    templating action, whichever templating engine was used.

  Why?
    Suppose you have a CMS. A typical CMS works roughly like this (data
    flows top-down):

                RDBMS
          scripting language
                 HTML
             HTTP server
                (...)
              HTTP agent
            layout engine
                screen
                 user

    Consider the first 3 steps: "RDBMS => scripting language => HTML"

    This is "applying a template".

    Now, consider this: "HTML => scripting language => RDBMS"

    I would call that "un-applying a template", or "untemplate" :)

    The practical application of this set of tools is to assist in the
    creation of web scrapers.

    A similar (though completely unrelated) approach is described in the
    paper XPath-Wrapper Induction for Data Extraction
    <http://www.coltech.vnu.edu.vn/~thuyhq/papers/10_Khanh_Cuong_thuy_4288a150.pdf>.

  Human-readability
    Consider the following HTML node address representations:

    *   0.1.3.0.0.4.0.0.0.2 (HTML::TreeBuilder internal address
        representation);

    *   "/html/body/div[4]/div/div[1]/table[2]/tr/td/ul/li[3]"
        (HTML::Linear, strict);

    *   "//td[1]/ul[1]/li[3]" (HTML::Linear, strict, shrink);

    *   "/html/body[@class='section_home']/div[@id='content_holder'][1]/div[@id='content']/div[@id='main']/table[@class='content_table'][2]/tr/td/ul/li[@class='rss_content rss_content_col'][2]"
        (HTML::Linear, non-strict);

    *   "//li[@class='rss_content rss_content_col'][2]" (HTML::Linear,
        non-strict, shrink).

    They all point to the same node; however, their verbosity and
    readability vary. The *strict* mode specifies tag names and positions
    only. Disabling *strict* additionally uses the attributes that CSS
    selectors typically target (such as "id" and "class"). *Shrink* mode
    attempts to find the shortest XPath unique for every node
    ("/html/body" is shared among almost all nodes, and thus is likely to
    be irrelevant).
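
    The idea behind *shrink* can be illustrated with a short Python sketch.
    It is not the algorithm HTML::Linear uses; it only assumes that
    "shortest unique XPath" can be approximated by the shortest trailing
    run of steps that no other known path ends with. The function name
    "shrink" and the sample paths are hypothetical, and the list of paths
    is assumed to cover every element of the document.

```python
def shrink(paths):
    """Map each absolute path to its shortest suffix that no other path ends with."""
    split = [p.strip("/").split("/") for p in paths]
    result = {}
    for path, steps in zip(paths, split):
        # Try ever longer tails, stopping before the tail is the whole path.
        for n in range(1, len(steps)):
            tail = steps[-n:]
            matches = sum(1 for other in split if other[-n:] == tail)
            if matches == 1:  # only this path ends with the tail
                result[path] = "//" + "/".join(tail)
                break
        else:
            result[path] = path  # no shorter unique form; keep the full path
    return result


# Hypothetical strict-mode paths, for illustration only.
PATHS = [
    "/html/head/title",
    "/html/body/h1",
    "/html/body/p[1]",
    "/html/body/p[2]",
    "/html/body/p[2]/b",
    "/html/body/div/p[1]",
]

for full, short in shrink(PATHS).items():
    print(f"{full:24} {short}")
```

    Here "/html/body/p[1]" cannot shrink to "//p[1]" because
    "/html/body/div/p[1]" ends with the same step, so one more step of
    context is kept.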

EXAMPLES
  xpathify
    The xpathify tool flattens the HTML tree into a key/value list:

        <!DOCTYPE html>
        <html>
            <head>
                <title>Hello HTML</title>
            </head>
            <body>
                <h1>Hello World!</h1>
                <p>This is a sample HTML</p>
                Beware!
                <p>HTML is <b>not</b> XML!</p>
                Have a nice day.
            </body>
        </html>

    Becomes:

    *(HTML block)*

    The keys are in XPath format, while the values are respective content
    from the HTML tree. Theoretically, it could be possible to reassemble
    the HTML tree from the flat key/value list this tool generates.
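
    To make the flattening concrete, here is a minimal Python sketch of the
    same idea using only the standard library. It is an illustration, not
    the xpathify implementation, and its output format is not claimed to
    match the tool's. Assumptions: the markup is well nested, every path
    step carries an explicit 1-based position, and text nodes are addressed
    with a trailing "/text()". The names "Linearizer", "linearize" and
    "SAMPLE" are hypothetical.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag (partial list, enough for a sketch).
VOID = {"area", "base", "br", "col", "embed", "hr", "img", "input",
        "link", "meta", "source", "track", "wbr"}


class Linearizer(HTMLParser):
    """Collect (xpath, text) pairs while the document is parsed."""

    def __init__(self):
        super().__init__()
        self.stack = []     # currently open elements, as "tag[n]" steps
        self.counts = [{}]  # for each open level: child tag -> count so far
        self.pairs = []

    def handle_starttag(self, tag, attrs):
        seen = self.counts[-1]
        seen[tag] = seen.get(tag, 0) + 1
        if tag in VOID:
            return  # counted as a sibling, but it has no content to record
        self.stack.append(f"{tag}[{seen[tag]}]")
        self.counts.append({})

    def handle_endtag(self, tag):
        # Assumes well-nested markup: close only the innermost open element.
        if self.stack and self.stack[-1].startswith(tag + "["):
            self.stack.pop()
            self.counts.pop()

    def handle_data(self, data):
        text = " ".join(data.split())  # collapse runs of whitespace
        if text and self.stack:
            self.pairs.append(("/" + "/".join(self.stack) + "/text()", text))


def linearize(html):
    parser = Linearizer()
    parser.feed(html)
    parser.close()
    return parser.pairs


SAMPLE = """<!DOCTYPE html>
<html>
    <head>
        <title>Hello HTML</title>
    </head>
    <body>
        <h1>Hello World!</h1>
        <p>This is a sample HTML</p>
        Beware!
        <p>HTML is <b>not</b> XML!</p>
        Have a nice day.
    </body>
</html>
"""

for xpath, content in linearize(SAMPLE):
    print(f"{xpath}\t{content}")
```

    Each printed line is one key/value pair; text that sits directly inside
    "body" gets its own entry, just like text inside "p" or "b".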

  untemplate
    The untemplate tool flattens a set of HTML documents using the
    algorithm from xpathify. Then, it strips the shared key/value pairs. The
    "rest" is composed of original values fed into the template engine.

    This is what the result looks like for two simple real-world examples
    (quotes 1839 <http://bash.org/?1839> and 2486 <http://bash.org/?2486>
    from bash.org <http://bash.org/>):

    *(HTML block)*
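
    The stripping step described at the start of this section can be
    sketched in Python as follows. This is an illustration under stated
    assumptions, not the untemplate implementation: a pair counts as
    "shared" only when the same key and the same value occur in every
    document, and the function name "untemplate" and the sample data are
    hypothetical.

```python
def untemplate(documents):
    """Drop the (xpath, content) pairs that occur in every document.

    documents: a list of documents, each a list of (xpath, content) pairs.
    Returns one list per document holding only the non-shared pairs.
    """
    if not documents:
        return []
    # Pairs present in all documents are assumed to come from the template.
    shared = set(documents[0]).intersection(*map(set, documents[1:]))
    return [[pair for pair in doc if pair not in shared] for doc in documents]


# Hypothetical flattened pages, for illustration only.
PAGE_A = [
    ("/html/head/title/text()", "Quote Database"),
    ("/html/body/p[1]/text()", "Quote #1"),
    ("/html/body/p[2]/text()", "first quote text"),
    ("/html/body/div/text()", "Home / Latest / Search"),
]
PAGE_B = [
    ("/html/head/title/text()", "Quote Database"),
    ("/html/body/p[1]/text()", "Quote #2"),
    ("/html/body/p[2]/text()", "second quote text"),
    ("/html/body/div/text()", "Home / Latest / Search"),
]

for rest in untemplate([PAGE_A, PAGE_B]):
    print(rest)
```

    The title and the navigation text are identical in both pages, so they
    are dropped; what remains per page is the data that varied.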

MODULES
    The following modules may be used to serialize/flatten HTML documents
    on your own:

    *   HTML::Linear - represent HTML::Tree as a flat list

    *   HTML::Linear::Element - represent elements to populate HTML::Linear

    *   HTML::Linear::Path - represent paths inside HTML::Tree

SEE ALSO
    *   Wrapper (data mining)
        <https://en.wikipedia.org/wiki/Wrapper_(data_mining)>

    *   XPath-Wrapper Induction for Data Extraction
        <http://www.coltech.vnu.edu.vn/~thuyhq/papers/10_Khanh_Cuong_thuy_4288a150.pdf>

    *   HTML::TreeBuilder

    *   HTML::Similarity

    *   XML::DifferenceMarkup

AUTHOR
    Stanislaw Pusep <stas@sysd.org>

COPYRIGHT AND LICENSE
    This software is copyright (c) 2012 by Stanislaw Pusep.

    This is free software; you can redistribute it and/or modify it under
    the same terms as the Perl 5 programming language system itself.

