NAME
    AnyEvent::Net::Curl::Queued - Moose wrapper for queued downloads via
    Net::Curl & AnyEvent

VERSION
    version 0.010

SYNOPSIS
        #!/usr/bin/env perl

        package CrawlApache;
        use common::sense;

        use HTML::LinkExtor;
        use Moose;

        extends 'AnyEvent::Net::Curl::Queued::Easy';

        after finish => sub {
            my ($self, $result) = @_;

            say $result . "\t" . $self->final_url;

            if (
                not $self->has_error
                and $self->getinfo('content_type') =~ m{^text/html}
            ) {
                my @links;

                HTML::LinkExtor->new(sub {
                    my ($tag, %links) = @_;
                    push @links,
                        grep { $_->scheme eq 'http' and $_->host eq 'localhost' }
                        values %links;
                }, $self->final_url)->parse(${$self->data});

                for my $link (@links) {
                    $self->queue->prepend(sub {
                        CrawlApache->new({ initial_url => $link });
                    });
                }
            }
        };

        no Moose;
        __PACKAGE__->meta->make_immutable;

        1;

        package main;
        use common::sense;

        use AnyEvent::Net::Curl::Queued;

        my $q = AnyEvent::Net::Curl::Queued->new;
        $q->append(sub {
            CrawlApache->new({ initial_url => 'http://localhost/manual/' })
        });
        $q->wait;

DESCRIPTION
    Efficient and flexible batch downloader with a straight-forward
    interface:

    *   create a queue;

    *   append/prepend URLs;

    *   wait for downloads to end (retry on errors).

    Download init/finish/error handling is defined through Moose's method
    modifiers.

  MOTIVATION
    I am very unhappy with the performance of LWP. It's almost perfect for
    properly handling HTTP headers, cookies & stuff, but it comes at the
    cost of *speed*. While this doesn't matter when you make single
    downloads, batch downloading becomes a real pain.

    When I download large batch of documents, I don't care about cookies or
    headers, only content and proper redirection matters. And, as it is
    clearly an I/O bottleneck operation, I want to make as many parallel
    requests as possible.

    So, this is what CPAN offers to fulfill my needs:

    *   Net::Curl: Perl interface to the all-mighty libcurl
        <http://curl.haxx.se/libcurl/>, is well-documented (opposite to
        WWW::Curl);

    *   AnyEvent: the DBI of event loops. Net::Curl also provides a nice and
        well-documented example of AnyEvent usage (03-multi-event.pl);

    *   MooseX::NonMoose: Net::Curl uses a Pure-Perl object implementation,
        which is lightweight, but a bit messy for my Moose-based projects.
        MooseX::NonMoose patches this gap.

    AnyEvent::Net::Curl::Queued is a glue module to wrap it all together. It
    offers no callbacks and (almost) no default handlers. It's up to you to
    extend the base class AnyEvent::Net::Curl::Queued::Easy so it will
    actually download something and store it somewhere.

  OVERHEAD
    Obviously, the bottleneck of any kind of download agent is the
    connection itself. However, socket handling and header parsing add a
    lots of overhead. The script eg/benchmark.pl compares
    AnyEvent::Net::Curl::Queued against several other download agents. Only
    AnyEvent::Net::Curl::Queued itself, AnyEvent::Curl::Multi and lftp
    <http://lftp.yar.ru/> support parallel connections; thus, forks are used
    to reproduce the same behaviour for the remaining agents. Both
    AnyEvent::Curl::Multi and LWP::Curl are frontends for WWW::Curl. The
    download target is a local copy of the Apache documentation
    <http://httpd.apache.org/docs/2.2/>.

                                    URL/s WWW::Mechanize LWP::UserAgent HTTP::Lite HTTP::Tiny AnyEvent::Curl::Multi  lftp AnyEvent::Net::Curl::Queued AnyEvent::HTTP  curl LWP::Curl  wget
        WWW::Mechanize                196             --           -60%       -80%       -85%                  -86%  -88%                        -89%           -92%  -97%      -97% -100%
        LWP::UserAgent                484           148%             --       -51%       -63%                  -66%  -70%                        -72%           -80%  -93%      -93%  -99%
        HTTP::Lite                    989           405%           104%         --       -25%                  -32%  -39%                        -42%           -59%  -85%      -86%  -99%
        HTTP::Tiny                   1312           569%           170%        33%         --                   -9%  -19%                        -23%           -46%  -80%      -82%  -99%
        AnyEvent::Curl::Multi        1446           638%           198%        46%        10%                    --  -10%                        -16%           -41%  -78%      -80%  -98%
        lftp                         1609           722%           232%        63%        23%                   11%    --                         -6%           -34%  -75%      -77%  -98%
        AnyEvent::Net::Curl::Queued  1713           773%           253%        73%        30%                   18%    6%                          --           -30%  -74%      -76%  -98%
        AnyEvent::HTTP               2437          1144%           403%       146%        86%                   69%   51%                         42%             --  -63%      -66%  -97%
        curl                         6512          3228%          1244%       559%       397%                  351%  305%                        281%           167%    --       -8%  -93%
        LWP::Curl                    7110          3524%          1364%       618%       442%                  391%  341%                        315%           191%    9%        --  -92%
        wget                        88875         45240%         18215%      8877%      6675%                 6045% 5418%                       5092%          3544% 1262%     1151%    --

    AnyEvent::HTTP & LWP::Curl are actually faster, but both lack
    queueing/retry.

ATTRIBUTES
  allow_dups
    Allow duplicate requests (default: false). By default, requests to the
    same URL (more precisely, requests with the same signature are issued
    only once. To seed POST parameters, you must extend the
    AnyEvent::Net::Curl::Queued::Easy class. Setting "allow_dups" to true
    value disables request checks.

  completed
    Count completed requests.

  cv
    AnyEvent condition variable. Initialized automatically, unless you
    specify your own.

  max
    Maximum number of parallel connections (default: 4; minimum value: 1).

  multi
    Net::Curl::Multi instance.

  queue
    "ArrayRef" to the queue. Has the following helper methods:

    *   queue_push: append item at the end of the queue;

    *   queue_unshift: prepend item at the top of the queue;

    *   dequeue: shift item from the top of the queue;

    *   count: number of items in queue.

  share
    Net::Curl::Share instance.

  stats
    AnyEvent::Net::Curl::Queued::Stats instance.

  timeout
    Timeout (default: 60 seconds).

METHODS
  start()
    Populate empty request slots with workers from the queue.

  empty()
    Check if there are active requests or requests in queue.

  add($worker)
    Activate a worker.

  append($worker)
    Put the worker (instance of AnyEvent::Net::Curl::Queued::Easy) at the
    end of the queue. For lazy initialization, wrap the worker in a "sub {
    ... }", the same way you do with the Moose "default => sub { ... }":

        $queue->append(sub {
            AnyEvent::Net::Curl::Queued::Easy->new({ initial_url => 'http://.../' })
        });

  prepend($worker)
    Put the worker (instance of AnyEvent::Net::Curl::Queued::Easy) at the
    beginning of the queue. For lazy initialization, wrap the worker in a
    "sub { ... }", the same way you do with the Moose "default => sub { ...
    }":

        $queue->prepend(sub {
            AnyEvent::Net::Curl::Queued::Easy->new({ initial_url => 'http://.../' })
        });

  wait()
    Shortcut to "$queue->cv->recv".

SEE ALSO
    *   AnyEvent

    *   Moose

    *   Net::Curl

    *   WWW::Curl

    *   AnyEvent::Curl::Multi

AUTHOR
    Stanislaw Pusep <stas@sysd.org>

COPYRIGHT AND LICENSE
    This software is copyright (c) 2011 by Stanislaw Pusep.

    This is free software; you can redistribute it and/or modify it under
    the same terms as the Perl 5 programming language system itself.

