NAME
    AnyEvent::Net::Curl::Queued - Any::Moose wrapper for queued downloads
    via Net::Curl & AnyEvent

VERSION
    version 0.035

SYNOPSIS
        #!/usr/bin/env perl

        package CrawlApache;
        use feature qw(say);
        use strict;
        use utf8;
        use warnings qw(all);

        use HTML::LinkExtor;
        use Any::Moose;

        extends 'AnyEvent::Net::Curl::Queued::Easy';

        after finish => sub {
            my ($self, $result) = @_;

            say $result . "\t" . $self->final_url;

            if (
                not $self->has_error
                and $self->getinfo('content_type') =~ m{^text/html}
            ) {
                my @links;

                HTML::LinkExtor->new(sub {
                    my ($tag, %links) = @_;
                    push @links,
                        grep { $_->scheme eq 'http' and $_->host eq 'localhost' }
                        values %links;
                }, $self->final_url)->parse(${$self->data});

                for my $link (@links) {
                    $self->queue->prepend(sub {
                        CrawlApache->new({ initial_url => $link });
                    });
                }
            }
        };

        no Any::Moose;
        __PACKAGE__->meta->make_immutable;

        1;

        package main;
        use strict;
        use utf8;
        use warnings qw(all);

        use AnyEvent::Net::Curl::Queued;

        my $q = AnyEvent::Net::Curl::Queued->new;
        $q->append(sub {
            CrawlApache->new({ initial_url => 'http://localhost/manual/' })
        });
        $q->wait;

DESCRIPTION
    AnyEvent::Net::Curl::Queued (a.k.a. YADA, *Yet Another Download
    Accelerator*) is an efficient and flexible batch downloader with a
    straightforward interface that lets you:

    *   create a queue;

    *   append/prepend URLs;

    *   wait for downloads to end (retry on errors).

    Download init/finish/error handling is defined through Moose's method
    modifiers.

  MOTIVATION
    I am very unhappy with the performance of LWP. It's almost perfect for
    properly handling HTTP headers, cookies & stuff, but it comes at the
    cost of *speed*. While this doesn't matter when you make single
    downloads, batch downloading becomes a real pain.

    When I download a large batch of documents, I don't care about cookies
    or headers; only content and proper redirection matter. And, as it is
    clearly an I/O-bound operation, I want to make as many parallel
    requests as possible.

    So, this is what CPAN offers to fulfill my needs:

    *   Net::Curl: Perl interface to the all-mighty libcurl
        <http://curl.haxx.se/libcurl/>; it is well-documented (unlike
        WWW::Curl);

    *   AnyEvent: the DBI of event loops. Net::Curl also provides a nice and
        well-documented example of AnyEvent usage (03-multi-event.pl);

    *   MooseX::NonMoose: Net::Curl uses a Pure-Perl object implementation,
        which is lightweight, but a bit messy for my Moose-based projects.
        MooseX::NonMoose bridges this gap.

    AnyEvent::Net::Curl::Queued is a glue module to wrap it all together. It
    offers no callbacks and (almost) no default handlers. It's up to you to
    extend the base class AnyEvent::Net::Curl::Queued::Easy so it will
    actually download something and store it somewhere.

  ALTERNATIVES
    As there's more than one way to do it, I'll list the alternatives which
    can be used to implement batch downloads:

    *   WWW::Mechanize: no (builtin) parallelism, no (builtin) queueing.
        Slow, but very powerful for site traversal;

    *   LWP::UserAgent: no parallelism, no queueing. WWW::Mechanize is built
        on top of LWP, by the way;

    *   LWP::Curl: LWP::UserAgent-alike interface for WWW::Curl. No
        parallelism, no queueing. Fast and simple to use;

    *   HTTP::Tiny: no parallelism, no queueing. Fast and part of CORE since
        Perl v5.13.9;

    *   HTTP::Lite: no parallelism, no queueing. Also fast;

    *   Furl: no parallelism, no queueing. Very fast;

    *   Mojo::UserAgent: capable of non-blocking parallel requests, no
        queueing;

    *   AnyEvent::Curl::Multi: queued parallel downloads via WWW::Curl.
        Queues are non-lazy, thus large ones can use a lot of RAM;

    *   Parallel::Downloader: queued parallel downloads via AnyEvent::HTTP.
        Very fast and is pure-Perl (compiling event driver is optional). You
        only access results when the whole batch is done; so huge batches
        will require lots of RAM to store contents.

  BENCHMARK
    (see also: CPAN modules for making HTTP requests
    <http://neilb.org/reviews/http-requesters.html>)

    Obviously, the bottleneck of any kind of download agent is the
    connection itself. However, socket handling and header parsing add a
    lot of overhead.

    The script eg/benchmark.pl compares AnyEvent::Net::Curl::Queued against
    several other download agents. Only AnyEvent::Net::Curl::Queued itself,
    AnyEvent::Curl::Multi, Parallel::Downloader, Mojo::UserAgent and lftp
    <http://lftp.yar.ru/> support parallel connections natively; thus,
    Parallel::ForkManager is used to reproduce the same behaviour for the
    remaining agents. Both AnyEvent::Curl::Multi and LWP::Curl are frontends
    for WWW::Curl. Parallel::Downloader uses AnyEvent::HTTP as its backend.

    The download target is a copy of the Apache documentation
    <http://httpd.apache.org/docs/2.2/> on a local Apache server. The test
    platform configuration:

    *   Intel® Core™ i7-2600 CPU @ 3.40GHz with 8 GB RAM;

    *   Ubuntu 11.10 (64-bit);

    *   Perl v5.16.1 (installed via perlbrew);

    *   libcurl 7.27.0 (without AsynchDNS, which slows down curl_easy_init()
        <http://curl.haxx.se/libcurl/c/curl_easy_init.html>).

    The script eg/benchmark.pl uses Benchmark::Forking and Class::Load to
    keep UA modules isolated and loaded only once.

        $ perl benchmark.pl --count 100 --parallel 4 --repeat 5

                                 Request rate WWW::M LWP::UA Mojo::UA HTTP::Tiny HTTP::Lite AE::C::M P::D lftp YADA Furl wget curl LWP::Curl
        WWW::Mechanize v1.72            303/s     --    -65%     -80%       -82%       -85%     -86% -91% -91% -93% -95% -96% -96%      -97%
        LWP::UserAgent v6.04            873/s   187%      --     -44%       -48%       -58%     -60% -74% -74% -79% -87% -89% -89%      -90%
        Mojo::UserAgent v3.39          1558/s   412%     78%       --        -7%       -24%     -29% -54% -54% -63% -76% -80% -80%      -82%
        HTTP::Tiny v0.017              1672/s   451%     92%       8%         --       -19%     -24% -51% -51% -60% -74% -79% -79%      -81%
        HTTP::Lite v2.4                2058/s   577%    136%      32%        23%         --      -6% -39% -39% -51% -68% -74% -74%      -77%
        AnyEvent::Curl::Multi v1.1     2203/s   624%    152%      41%        31%         7%       -- -35% -35% -47% -66% -72% -72%      -75%
        Parallel::Downloader v0.121560 3378/s  1015%    288%     118%       102%        65%      54%   --  -0% -19% -48% -57% -57%      -61%
        lftp v4.3.1                    3401/s  1018%    289%     118%       103%        65%      55%   0%   -- -19% -48% -57% -57%      -61%
        YADA v0.027                    4167/s  1276%    379%     169%       150%       103%      90%  23%  23%   -- -36% -47% -47%      -52%
        Furl v0.40                     6502/s  2041%    645%     318%       288%       216%     196%  92%  91%  56%   -- -17% -18%      -26%
        wget v1.12                     7874/s  2493%    803%     406%       371%       283%     258% 133% 132%  88%  21%   --  -0%      -10%
        curl v7.27.0                   7899/s  2501%    806%     408%       372%       284%     260% 133% 133%  89%  22%   0%   --      -10%
        LWP::Curl v0.12                8757/s  2780%    902%     462%       423%       326%     298% 158% 158% 109%  35%  11%  11%        --

        (output formatted to show module versions at row labels and keep column labels abbreviated)

ATTRIBUTES
  allow_dups
    Allow duplicate requests (default: false). By default, requests to the
    same URL (more precisely, requests with the same signature) are issued
    only once. To seed POST parameters, you must extend the
    AnyEvent::Net::Curl::Queued::Easy class. Setting "allow_dups" to a true
    value disables request checks.
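
    The idea of issuing each signature only once can be sketched in plain
    Perl. This is a toy model, not the module's code: the "should_fetch"
    helper, the %seen hash and the choice of a SHA-1 digest of the URL as
    the signature are all assumptions made for illustration.

```perl
use strict;
use warnings;
use Digest::SHA qw(sha1_hex);

# Hypothetical stand-in for the queue's signature cache.
my %seen;

# Returns true when a request should be issued, false for a repeat.
# Assumption: the signature is derived from the URL alone.
sub should_fetch {
    my ($url, $allow_dups) = @_;
    return 1 if $allow_dups;            # checks disabled entirely
    my $signature = sha1_hex($url);
    return !$seen{$signature}++;        # true only on first sighting
}

my @urls = qw(
    http://localhost/a
    http://localhost/b
    http://localhost/a
);

# With duplicate checks enabled, the repeated URL is skipped.
my @issued = grep { should_fetch($_, 0) } @urls;
print scalar(@issued), " of ", scalar(@urls), " requests issued\n";
```

    A signature built from the URL alone cannot tell two POST requests to
    the same address apart, which is why seeding POST parameters requires
    extending the worker class.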

  common_opts
    The "opts" attribute of AnyEvent::Net::Curl::Queued::Easy, shared by
    all workers initialized under the same queue. You may define the
    "User-Agent" string here.
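
    A minimal sketch of passing queue-wide options to the constructor,
    assuming the hash is handed to each worker's setopt() keyed by
    Net::Curl::Easy constants. The user agent string and the numeric
    values are placeholders, not recommendations.

```perl
use strict;
use warnings;

use AnyEvent::Net::Curl::Queued;
use Net::Curl::Easy qw(:constants);

my $q = AnyEvent::Net::Curl::Queued->new({
    max         => 8,     # parallel connections (documented default: 4)
    timeout     => 30,    # seconds (documented default: 60)
    common_opts => {
        # Assumption: numeric CURLOPT_* keys are accepted here.
        # The trailing () stops "=>" from quoting the constant name.
        CURLOPT_USERAGENT()      => 'MyCrawler/0.1',
        CURLOPT_FOLLOWLOCATION() => 1,
    },
});

printf "queue allows %d parallel connections\n", $q->max;
```

    Workers appended to $q afterwards would then share these options
    without each one having to set them individually.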

  completed
    Count completed requests.

  cv
    AnyEvent condition variable. Initialized automatically, unless you
    specify your own. Also reset automatically after "wait", so keep your
    own reference if you really need it!

  max
    Maximum number of parallel connections (default: 4; minimum value: 1).

  multi
    Net::Curl::Multi instance.

  queue
    "ArrayRef" to the queue. Has the following helper methods:

  queue_push
    Append item at the end of the queue.

  queue_unshift
    Prepend item at the top of the queue.

  dequeue
    Shift item from the top of the queue.

  count
    Number of items in queue.

  share
    Net::Curl::Share instance.

  stats
    AnyEvent::Net::Curl::Queued::Stats instance.

  timeout
    Timeout (default: 60 seconds).

  unique
    Signature cache.

  watchdog
    The last resort against the non-deterministic chaos of evil lurking
    sockets.

METHODS
  start()
    Populate empty request slots with workers from the queue.

  empty()
    Check if there are active requests or requests in queue.

  add($worker)
    Activate a worker.

  append($worker)
    Put the worker (instance of AnyEvent::Net::Curl::Queued::Easy) at the
    end of the queue. For lazy initialization, wrap the worker in a "sub {
    ... }", the same way you do with the Moose "default => sub { ... }":

        $queue->append(sub {
            AnyEvent::Net::Curl::Queued::Easy->new({ initial_url => 'http://.../' })
        });

  prepend($worker)
    Put the worker (instance of AnyEvent::Net::Curl::Queued::Easy) at the
    beginning of the queue. For lazy initialization, wrap the worker in a
    "sub { ... }", the same way you do with the Moose "default => sub { ...
    }":

        $queue->prepend(sub {
            AnyEvent::Net::Curl::Queued::Easy->new({ initial_url => 'http://.../' })
        });
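
    Why the "sub { ... }" wrapper matters can be shown with a toy queue in
    plain Perl. "ToyWorker" and the loop below are hypothetical stand-ins,
    not the module's internals; the point is only that a wrapped worker is
    not constructed until it is taken off the queue.

```perl
use strict;
use warnings;

my $constructed = 0;

{
    # Hypothetical worker class standing in for an Easy subclass.
    package ToyWorker;

    sub new {
        my ($class, %args) = @_;
        $constructed++;                  # count real object creations
        return bless {%args}, $class;
    }
}

# Queue 1000 lazy workers: only small closures are stored for now.
my @queue = map {
    my $url = "http://localhost/page/$_";
    sub { ToyWorker->new(initial_url => $url) }
} 1 .. 1000;

print "constructed before processing: $constructed\n";

# Build each worker only at the moment it leaves the queue.
while (defined(my $item = shift @queue)) {
    my $worker = ref $item eq 'CODE' ? $item->() : $item;
    # ... a real queue would activate $worker here ...
}

print "constructed after processing: $constructed\n";
```

    With eager workers, all 1000 objects would exist from the moment they
    are queued; with closures, only the ones currently being processed do.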

  wait()
    Process queue.

CAVEAT
    *   Many sources suggest compiling libcurl <http://curl.haxx.se/> with
        c-ares <http://c-ares.haxx.se/> support. This only improves
        performance if you need to perform many DNS resolutions (e.g.
        access many hosts). If you are fetching many documents from a single
        server, "c-ares" initialization will actually slow down the whole
        process!

SEE ALSO
    *   AnyEvent

    *   Any::Moose

    *   Net::Curl

    *   WWW::Curl

    *   AnyEvent::Curl::Multi

AUTHOR
    Stanislaw Pusep <stas@sysd.org>

COPYRIGHT AND LICENSE
    This software is copyright (c) 2012 by Stanislaw Pusep.

    This is free software; you can redistribute it and/or modify it under
    the same terms as the Perl 5 programming language system itself.

