NAME
    RDF::RDFa::Parser - flexible RDFa parser

SYNOPSIS
     use RDF::RDFa::Parser;
 
     ### Create an object...
     $p = RDF::RDFa::Parser->new_from_url($url);
     # or: $p = RDF::RDFa::Parser->new($markup, $base_url);
 
     ### Get an RDF::Trine::Model containing the document's data...
     $data = $p->graph;
 
     ### Get Open Graph Protocol data...
     $title = $p->opengraph('title');

VERSION
    1.09_04

DESCRIPTION
  Constructors
    "$p = RDF::RDFa::Parser->new($markup, $base, [$config], [$storage])"
        This method creates a new RDF::RDFa::Parser object and returns it.

        The $markup variable may contain an XHTML/XML string, or an
        XML::LibXML::Document. If a string, the document is parsed using
        XML::LibXML::Parser or HTML::HTML5::Parser, depending on the
        configuration in $config. XML well-formedness errors will cause the
        function to die.

        $base is a URL used to resolve relative links found in the document.

        If $markup is undef, then RDF::RDFa::Parser will fetch $base to
        obtain the document to be parsed. This is probably not a feature you
        should exploit, as it may behave unexpectedly in certain
        circumstances (e.g. if it follows a redirect trail). Use
        "new_from_uri" if you want to fetch and parse a page in one step.

        $config optionally holds an RDF::RDFa::Parser::Config object which
        determines the set of rules used to parse the RDFa. It defaults to
        XHTML+RDFa 1.0.

        Advanced usage note: $storage optionally holds an RDF::Trine::Store
        object. If undef, then a new temporary store is created.

    "$p = RDF::RDFa::Parser->new_from_url($url, [$config], [$storage])"
        $url is a URL to fetch and parse.

        $config optionally holds an RDF::RDFa::Parser::Config object which
        determines the set of rules used to parse the RDFa. The default is
        to determine the configuration by looking at the HTTP response
        Content-Type header; it's probably sensible to keep the default.

        $storage optionally holds an RDF::Trine::Store object. If undef,
        then a new temporary store is created.

        This function can also be called as "new_from_uri". Same thing.

  Public Methods
    "$p->graph"
        This will return an RDF::Trine::Model containing all the RDFa data
        found on the page.

        Advanced usage note: If passed a graph URI as a parameter, will
        return a single named graph from within the page. This feature is
        only useful if you're using named graphs.
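        As a minimal sketch of working with the returned model, the
        snippet below parses a small inline document and prints its
        triples as Turtle. The markup and the example.com base URL are
        made up for illustration, and RDF::Trine::Serializer is used here
        as one convenient way to dump a model, not the only one.

```perl
use strict;
use warnings;
use RDF::RDFa::Parser;
use RDF::Trine;

# Made-up XHTML+RDFa document with a single dc:title property.
my $markup = <<'XHTML';
<html xmlns="http://www.w3.org/1999/xhtml"
      xmlns:dc="http://purl.org/dc/terms/">
  <head><title>Example</title></head>
  <body><p property="dc:title">Example page</p></body>
</html>
XHTML

# No $config is passed, so the documented default (XHTML+RDFa 1.0) applies.
my $p     = RDF::RDFa::Parser->new($markup, 'http://example.com/page');
my $model = $p->graph;    # an RDF::Trine::Model

# Serialise the whole model as Turtle and print it.
my $serializer = RDF::Trine::Serializer->new('turtle');
print $serializer->serialize_model_to_string($model);
```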

    "$p->graphs"
        Advanced usage only.

        Will return a hashref of all named graphs, where the graph name is a
        key and the value is an RDF::Trine::Model tied to a temporary
        storage.

        This method is only useful if you're using named graphs.

    "$p->opengraph([$property])"
        If $property is provided, will return the value or list of values
        (if called in list context) for that Open Graph Protocol property.
        (In pure RDF terms, it returns the non-bnode objects of triples
        where the subject is the document base URI and the predicate is
        $property. Non-URI $property strings are taken as having the
        implicit prefix 'http://opengraphprotocol.org/schema/'. There is no
        distinction between literal and non-literal values.)

        If $property is omitted, returns a list of possible properties.

        Example:

          foreach my $property (sort $p->opengraph)
          {
            print "$property :\n";
            foreach my $val (sort $p->opengraph($property))
            {
              print "  * $val\n";
            }
          }

        See also: <http://opengraphprotocol.org/>.

    "$p->dom"
        Returns the parsed XML::LibXML::Document.

    "$p->uri( [$other_uri] )"
        Returns the base URI of the document being parsed. This will usually
        be the same as the base URI provided to the constructor, but may
        differ if the document contains a <base> HTML element.

        Optionally it may be passed a parameter - an absolute or relative
        URI - in which case it returns that URI in absolute form, resolved
        relative to the document's base URI.

        This seems like two unrelated functions, but if you consider the
        consequence of passing a relative URI consisting of a zero-length
        string, it in fact makes sense.
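        The two uses can be seen side by side in the sketch below. The
        markup and URLs are invented for illustration; the point is only
        that an empty relative URI resolves to the base itself.

```perl
use strict;
use warnings;
use RDF::RDFa::Parser;

# Made-up document; it contains no <base> element.
my $markup = <<'XHTML';
<html xmlns="http://www.w3.org/1999/xhtml">
  <head><title>Example</title></head>
  <body><p>Hello.</p></body>
</html>
XHTML

my $p = RDF::RDFa::Parser->new($markup, 'http://example.com/dir/page.html');

my $base     = $p->uri;                      # the document's base URI
my $same     = $p->uri('');                  # empty relative URI
my $resolved = $p->uri('images/logo.png');   # relative URI, made absolute

print "$base\n$same\n$resolved\n";
```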

    "$p->errors"
        Returns a list of errors and warnings that occurred during parsing.

    "$p->consume"
        Advanced usage only.

        The document is parsed for RDFa. As of RDF::RDFa::Parser 1.09_04,
        this is called automatically when needed; you probably don't need to
        touch it unless you're doing interesting things with callbacks.

    "$p->set_callbacks(\%callbacks)"
        Advanced usage only.

        Set callback functions for the parser to call on certain events.
        These are only necessary if you want to do something especially
        unusual.

          $p->set_callbacks({
            'pretriple_resource' => sub { ... } ,
            'pretriple_literal'  => sub { ... } ,
            'ontriple'           => undef ,
            'onprefix'           => \&some_function ,
            });

        Either of the two pretriple callbacks can be set to the string
        'print' instead of a coderef. This enables built-in callbacks for
        printing Turtle to STDOUT.
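        A minimal sketch of registering a callback and then triggering the
        parse explicitly is shown below. The markup, base URL and the
        %seen counter are invented for illustration; the callback simply
        counts literal-valued triples per predicate and lets every triple
        through.

```perl
use strict;
use warnings;
use RDF::RDFa::Parser;

# Made-up XHTML+RDFa document with one literal-valued property.
my $markup = <<'XHTML';
<html xmlns="http://www.w3.org/1999/xhtml"
      xmlns:dc="http://purl.org/dc/terms/">
  <head><title>Example</title></head>
  <body><p property="dc:title">Example page</p></body>
</html>
XHTML

my %seen;   # predicate URI => number of literal triples seen

my $p = RDF::RDFa::Parser->new($markup, 'http://example.com/page');

# Callbacks are registered before the document is consumed.
$p->set_callbacks({
    pretriple_literal => sub {
        my ($parser, $element, $subject, $predicate, $value) = @_;
        $seen{$predicate}++;
        return 0;    # 0 = keep the triple
    },
});
$p->consume;

print "$_: $seen{$_}\n" for sort keys %seen;
```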

        For details of the callback functions, see the section CALLBACKS. If
        used, "set_callbacks" must be called *before* "consume".
        "set_callbacks" returns a reference to the parser object itself.

CALLBACKS
    Several callback functions are provided. These may be set using the
    "set_callbacks" function, which takes a hashref of keys pointing to
    coderefs. The keys are named for the event to fire the callback on.

  pretriple_resource
    This is called when a triple has been found, but before preparing the
    triple for adding to the model. It is only called for triples with a
    non-literal object value.

    The parameters passed to the callback function are:

    *   A reference to the "RDF::RDFa::Parser" object

    *   A reference to the "XML::LibXML::Element" being parsed

    *   Subject URI or bnode (string)

    *   Predicate URI (string)

    *   Object URI or bnode (string)

    *   Graph URI or bnode (string or undef)

    The callback should return 1 to tell the parser to skip this triple (not
    add it to the graph); return 0 otherwise.
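    As an illustration, a pretriple_resource callback that drops every
    triple with one particular predicate might look like the sketch below.
    The sub name and the choice of predicate are hypothetical. The callback
    is called by hand with dummy arguments here, so the snippet runs
    without a parser object.

```perl
use strict;
use warnings;

# Hypothetical filter target: links to stylesheets.
my $UNWANTED = 'http://www.w3.org/1999/xhtml/vocab#stylesheet';

sub skip_stylesheet_links {
    my ($parser, $element, $subject, $predicate, $object, $graph) = @_;
    return ($predicate eq $UNWANTED) ? 1 : 0;    # 1 = skip, 0 = keep
}

# Registration would be:
#   $p->set_callbacks({ pretriple_resource => \&skip_stylesheet_links });

# Direct calls with dummy arguments, just to show the return values.
my $skipped = skip_stylesheet_links(undef, undef,
    'http://example.com/', $UNWANTED, 'http://example.com/style.css', undef);
my $kept = skip_stylesheet_links(undef, undef,
    'http://example.com/', 'http://xmlns.com/foaf/0.1/knows', '_:bob', undef);

print "stylesheet link: $skipped, foaf:knows: $kept\n";
```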

  pretriple_literal
    This is the equivalent of pretriple_resource, but is only called for
    triples with a literal object value.

    The parameters passed to the callback function are:

    *   A reference to the "RDF::RDFa::Parser" object

    *   A reference to the "XML::LibXML::Element" being parsed

    *   Subject URI or bnode (string)

    *   Predicate URI (string)

    *   Object literal (string)

    *   Datatype URI (string or undef)

    *   Language (string or undef)

    *   Graph URI or bnode (string or undef)

    Beware: sometimes both a datatype *and* a language will be passed. This
    goes beyond the normal RDF data model.

    The callback should return 1 to tell the parser to skip this triple (not
    add it to the graph); return 0 otherwise.
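    A sketch of a pretriple_literal callback follows. It skips literals
    that are empty or contain only whitespace, which is a made-up policy
    chosen for illustration. As before, the sub name is hypothetical and
    the callback is exercised directly with dummy arguments.

```perl
use strict;
use warnings;

sub skip_blank_literals {
    my ($parser, $element, $subject, $predicate,
        $value, $datatype, $language, $graph) = @_;
    # Skip the triple when the literal has no non-whitespace content.
    return 1 if !defined $value || $value !~ /\S/;
    return 0;
}

# Registration would be:
#   $p->set_callbacks({ pretriple_literal => \&skip_blank_literals });

my $title = 'http://purl.org/dc/terms/title';
my $blank = skip_blank_literals(undef, undef, 'http://example.com/',
    $title, "  \n", undef, 'en', undef);
my $real  = skip_blank_literals(undef, undef, 'http://example.com/',
    $title, 'Example page', undef, 'en', undef);

print "blank literal: $blank, real literal: $real\n";
```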

  ontriple
    This is called once a triple is ready to be added to the graph. (After
    the pretriple callbacks.) The parameters passed to the callback function
    are:

    *   A reference to the "RDF::RDFa::Parser" object

    *   A reference to the "XML::LibXML::Element" being parsed

    *   An RDF::Trine::Statement object.

    The callback should return 1 to tell the parser to skip this triple (not
    add it to the graph); return 0 otherwise. The callback may modify the
    RDF::Trine::Statement object.

  onprefix
    This is called when a new CURIE prefix is discovered. The parameters
    passed to the callback function are:

    *   A reference to the "RDF::RDFa::Parser" object

    *   A reference to the "XML::LibXML::Element" being parsed

    *   The prefix (string, e.g. "foaf")

    *   The expanded URI (string, e.g. "http://xmlns.com/foaf/0.1/")

    The return value of this callback is currently ignored, but you should
    return 0 in case future versions of this module assign significance to
    the return value.
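    For example, an onprefix callback could be used to build a table of
    the prefixes a document declares. In this sketch the %prefixes hash
    and the closure name are invented, and the callback is invoked by hand
    with the example values given above.

```perl
use strict;
use warnings;

my %prefixes;    # prefix => expanded URI

my $on_prefix = sub {
    my ($parser, $element, $prefix, $uri) = @_;
    $prefixes{$prefix} = $uri;
    return 0;    # return value is currently ignored by the parser
};

# Registration would be:
#   $p->set_callbacks({ onprefix => $on_prefix });

# Dummy invocation standing in for the parser.
$on_prefix->(undef, undef, 'foaf', 'http://xmlns.com/foaf/0.1/');

print "$_ => $prefixes{$_}\n" for sort keys %prefixes;
```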

  ontoken
    This is called when a CURIE has been expanded. The parameters are:

    *   A reference to the "RDF::RDFa::Parser" object

    *   A reference to the "XML::LibXML::Element" being parsed

    *   The CURIE or token as a string (e.g. "foaf:name" or "Stylesheet")

    *   The fully expanded URI

    The callback function must return a fully expanded URI, or if it wants
    the CURIE to be ignored, undef.
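    The sketch below shows both possible outcomes: rewriting an expanded
    URI and ignoring a token. The namespace mapping and the "private"
    namespace are purely illustrative assumptions, as is the sub name; the
    callback is called directly with dummy arguments.

```perl
use strict;
use warnings;

# Illustrative mapping only: one namespace rewritten to another.
my $OLD_NS = 'http://opengraphprotocol.org/schema/';
my $NEW_NS = 'http://ogp.me/ns#';

sub rewrite_token {
    my ($parser, $element, $token, $uri) = @_;
    # Explicit undef tells the parser to ignore this CURIE.
    return undef if $uri =~ m{^http://example\.com/private#};
    $uri =~ s/^\Q$OLD_NS\E/$NEW_NS/;
    return $uri;
}

# Registration would be:
#   $p->set_callbacks({ ontoken => \&rewrite_token });

my $rewritten = rewrite_token(undef, undef, 'og:title', $OLD_NS . 'title');
my $ignored   = rewrite_token(undef, undef, 'priv:note',
    'http://example.com/private#note');

print "rewritten: $rewritten\n";
print "ignored: ", (defined $ignored ? $ignored : '(undef)'), "\n";
```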

  onerror
    This is called when an error occurs. The parameters are:

    *   A reference to the "RDF::RDFa::Parser" object

    *   The error level (RDF::RDFa::Parser::ERR_ERROR or
        RDF::RDFa::Parser::ERR_WARNING)

    *   An error code

    *   An error message

    *   A hash of other information

    The return value of this callback is currently ignored, but you should
    return 0 in case future versions of this module assign significance to
    the return value.

    If you do not define an onerror callback, then errors will be output via
    STDERR and warnings will be silent. Either way, you can retrieve errors
    after parsing using the "errors" method.
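    A sketch of an onerror callback that collects problems into an array
    is given below. The @problems array and the dummy values are invented.
    The trailing "other information" arguments are simply gathered into
    @info, since their exact shape is not assumed here.

```perl
use strict;
use warnings;

my @problems;    # one hashref per reported error or warning

my $on_error = sub {
    my ($parser, $level, $code, $message, @info) = @_;
    push @problems, { level => $level, code => $code, message => $message };
    return 0;    # return value is currently ignored by the parser
};

# Registration would be:
#   $p->set_callbacks({ onerror => $on_error });

# Dummy invocation standing in for the parser.
$on_error->(undef, 'dummy-level', 'dummy-code', 'Something went wrong');

printf "%d problem(s) recorded: %s\n",
    scalar @problems, $problems[0]{message};
```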

FEATURES
  HTML Support
    This module is able to handle well-formed XML/XHTML and tag-soup HTML.
    How the input markup is parsed depends on the configuration settings
    passed to the constructor. If you use an XML or XHTML configuration but
    pass non-well-formed markup, the parser will die.

  Atom / DataRSS
    When processing Atom, if the 'atom_elements' option is switched on,
    RDF::RDFa::Parser will treat <feed> and <entry> elements specially. This
    is similar to the special support for <head> and <body> mandated by the
    XHTML+RDFa Recommendation. Essentially <feed> and <entry> elements are
    assumed to have an imaginary "about" attribute which has its value set
    to a brand new blank node.

    If the 'atom_parser' option is switched on, RDF::RDFa::Parser fully
    parses Atom feeds and entries, using the XML::Atom::OWL package. The two
    modules attempt to work together in assigning blank node identifiers
    consistently, etc. Callbacks *should* work properly, but this has not
    been extensively tested. If XML::Atom::OWL is not installed, then this
    option will be silently ignored.

    "RDF::RDFa::Parser::Config" is capable of enabling settings for parsing
    Atom. It switches on the 'atom_elements' option (but not 'atom_parser'),
    adds support for IANA-registered rel/rev keywords, switches off support
    for some XHTML-specific features, enables processing of the xml:base
    attribute, and adds support for embedded chunks of RDF/XML.

    Generally speaking, adding RDFa attributes to elements in the Atom
    namespace themselves can result in some slightly muddy semantics. It's
    best to use an extension namespace and add the RDFa attributes to
    elements in that namespace. DataRSS provides a good example of this. See
    <http://developer.yahoo.com/searchmonkey/smguide/datarss.html>.

  SVG
    The SVG Tiny 1.2 specification makes the use of RDFa attributes within
    SVG images valid.

    "RDF::RDFa::Parser::Config" is capable of enabling settings for parsing
    SVG. It switches off support for some XHTML-specific features, enables
    processing of the xml:base attribute, and adds support for embedded
    chunks of RDF/XML.

  Embedded RDF/XML
    Though a rarely used feature, XHTML allows other XML markup languages to
    be directly embedded into it. In particular, chunks of RDF/XML can be
    included in XHTML. While this is not common in XHTML, it's seen quite
    often in SVG and other XML markup languages.

    When RDF::RDFa::Parser encounters a chunk of RDF/XML in a document it's
    parsing (i.e. an element called 'RDF' with namespace
    'http://www.w3.org/1999/02/22-rdf-syntax-ns#'), there are three
    different courses of action it can take:

    0. Continue straight through it.
        This is the behaviour that XHTML+RDFa seems to suggest is the right
        option. It should mostly not do any harm: triples encoded in RDF/XML
        will be generally ignored (though the chunk itself could
        theoretically end up as part of an XML literal). It will waste a bit
        of time though.

    1. Skip the chunk.
        This will skip over the RDF element entirely, and thus save you a
        bit of time.

    2. Parse the RDF/XML.
        The parser will parse the RDF/XML properly. If named graphs are
        enabled, any triples will be added to a separate graph. This is the
        behaviour that SVG Tiny 1.2 seems to suggest is the correct thing to
        do.

    You can decide which path to take by setting the 'embedded_rdfxml'
    option in the constructor. For HTML and XHTML, you probably want to set
    embedded_rdfxml to '0' (the default) or '1'. For other XML markup
    languages (e.g. SVG or Atom), you probably want to set it to '2'.

  Named Graphs
    The parser has support for named graphs within a single RDFa document.
    To switch this on, use the 'graph' option in the constructor.

    The name of the attribute which indicates graph URIs is by default
    'graph', but can be changed using the 'graph_attr' option. This option
    accepts clark notation to specify a namespaced attribute. By default,
    the attribute value is interpreted as a fragment identifier (like the
    'id' attribute), but if you set 'graph_type' to 'about', it will be
    treated as a URI or safe CURIE (like the 'about' attribute).

    The 'graph_default' option allows you to set the default graph URI/bnode
    identifier.

    Once you're using named graphs, the "graphs" method becomes useful: it
    returns a hashref of { graph_uri => trine_model } pairs. The optional
    parameter to the "graph" method also becomes useful.

    See also <http://buzzword.org.uk/2009/rdfa4/spec>.
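    A rough sketch of switching named graphs on and listing the resulting
    graphs follows. It assumes that options are passed to
    RDF::RDFa::Parser::Config->new after a host language and an RDFa
    version ('xhtml' and '1.0' here); check the RDF::RDFa::Parser::Config
    documentation for the exact form. The markup and URLs are made up.

```perl
use strict;
use warnings;
use RDF::RDFa::Parser;

# Made-up document: one property inside a named graph, one outside.
my $markup = <<'XHTML';
<html xmlns="http://www.w3.org/1999/xhtml"
      xmlns:dc="http://purl.org/dc/terms/">
  <head><title>Example</title></head>
  <body>
    <p graph="g1" property="dc:title">In graph g1</p>
    <p property="dc:description">In the default graph</p>
  </body>
</html>
XHTML

# Assumption: (host, version, %options) with the 'graph' option enabled.
my $config = RDF::RDFa::Parser::Config->new('xhtml', '1.0', graph => 1);
my $p      = RDF::RDFa::Parser->new($markup, 'http://example.com/page', $config);

# graphs returns a hashref of { graph_uri => trine_model } pairs.
my $graphs = $p->graphs;
for my $name (sort keys %$graphs) {
    printf "%s: %d triple(s)\n", $name, $graphs->{$name}->size;
}
```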

  Auto Config
    RDF::RDFa::Parser has a lot of different options that can be switched on
    and off. Sometimes it might be useful to allow the page being parsed to
    control some of the options. If you switch on the 'auto_config' option,
    pages can do this.

    A page can set options using a specially crafted <meta> tag:

      <meta name="http://search.cpan.org/dist/RDF-RDFa-Parser/#auto_config"
         content="xhtml_lang=1&amp;keywords=rdfa+html5+html4+html32" />

    Note that the "content" attribute is an
    application/x-www-form-urlencoded string (which must then be
    HTML-escaped of course). Semicolons may be used instead of ampersands,
    as these tend to look nicer:

      <meta name="http://search.cpan.org/dist/RDF-RDFa-Parser/#auto_config"
         content="xhtml_lang=1;keywords=rdfa+html5+html4+html32" />
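    To make the encoding concrete, the sketch below decodes such a
    "content" value (after HTML-unescaping) into option/value pairs using
    only core Perl. It is an illustration of the format, not the module's
    own implementation, and the sub name is hypothetical.

```perl
use strict;
use warnings;

# Decode a form-urlencoded string whose pairs are separated by '&' or ';'.
sub decode_auto_config {
    my ($content) = @_;
    my %options;
    for my $pair (split /[&;]/, $content) {
        next unless length $pair;
        my ($key, $value) = split /=/, $pair, 2;
        for ($key, $value) {
            next unless defined;
            tr/+/ /;                               # '+' encodes a space
            s/%([0-9A-Fa-f]{2})/chr hex $1/ge;     # %XX escapes
        }
        $options{$key} = $value;
    }
    return \%options;
}

my $options = decode_auto_config('xhtml_lang=1;keywords=rdfa+html5+html4+html32');
print "$_ = $options->{$_}\n" for sort keys %$options;
```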

    Any option allowed in the constructor may be given using auto config,
    except 'use_rtnlx', and of course 'auto_config' itself.

    It's possible to use auto config outside XHTML (e.g. in Atom or SVG)
    using namespaces:

      <xhtml:meta xmlns:xhtml="http://www.w3.org/1999/xhtml"
         name="http://search.cpan.org/dist/RDF-RDFa-Parser/#auto_config"
         content="keywords=iana+rdfa;xml_base=2;atom_elements=1" />

BUGS
    RDF::RDFa::Parser 0.21 passed all approved tests in the XHTML+RDFa test
    suite at the time of its release.

    RDF::RDFa::Parser 0.22 (used in conjunction with HTML::HTML5::Parser
    0.01 and HTML::HTML5::Sanity 0.01) additionally passes all approved
    tests in the HTML4+RDFa and HTML5+RDFa test suites at the time of its
    release; except test cases 0113 and 0121, which the author of this
    module believes mandate incorrect HTML parsing.

    Please report any bugs to <http://rt.cpan.org/>.

    Common gotchas:

    *       Are you using the XML catalogue?

            RDF::RDFa::Parser maintains a locally cached version of the
            XHTML+RDFa DTD. This will normally be within your Perl module
            directory, in a subdirectory named
            "auto/share/dist/RDF-RDFa-Parser/catalogue/". If this is
            missing, the parser should still work, but will be very slow.

SEE ALSO
    RDF::RDFa::Parser::Config, RDF::RDFa::Parser::Profile.

    XML::LibXML, RDF::Trine, HTML::HTML5::Parser, HTML::HTML5::Sanity,
    XML::Atom::OWL.

    <http://www.perlrdf.org/>.

AUTHOR
    Toby Inkster <tobyink@cpan.org>.

ACKNOWLEDGEMENTS
    Kjetil Kjernsmo <kjetilk@cpan.org> wrote much of the stuff for building
    RDF::Trine models. Neubert Joachim taught me to use XML catalogues,
    which massively speeds up parsing of XHTML files that have DTDs.

COPYRIGHT
    Copyright 2008-2010 Toby Inkster

    This library is free software; you can redistribute it and/or modify it
    under the same terms as Perl itself.

