#!perl -w

# RSS2Leafnode -- copy RSS feeds to a local news spool

# Copyright 2007, 2008, 2009, 2010, 2011 Kevin Ryde
#
# This file is part of RSS2Leafnode.
#
# RSS2Leafnode is free software; you can redistribute it and/or modify it
# under the terms of the GNU General Public License as published by the Free
# Software Foundation; either version 3, or (at your option) any later
# version.
#
# RSS2Leafnode is distributed in the hope that it will be useful, but
# WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY
# or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License
# for more details.
#
# You should have received a copy of the GNU General Public License along
# with RSS2Leafnode.  If not, see <http://www.gnu.org/licenses/>.

use 5.010;
use strict;
use warnings;
use App::RSS2Leafnode;

use Encode;           # for Encode::PERLQQ
use PerlIO::encoding; # for fallback
# version 0.06 for bug fix of a struct size for perl 5.10 (there's some
# fragile duplication)
use PerlIO::locale 0.06;

our $VERSION = 54;

# locale encoding conversion on the tty, wide-chars everywhere internally
# for instance $subject from an item might be wide chars printed when --verbose
{ no warnings 'once';
  local $PerlIO::encoding::fallback = Encode::PERLQQ; # \x{1234} style
  (binmode (STDOUT, ':locale') && binmode (STDERR, ':locale'))
    or die "Cannot set :locale on stdout/stderr: $!\n";
}

my $r2l = App::RSS2Leafnode->new;
exit $r2l->command_line;

__END__

=for stopwords rss2leafnode rss leafnode NNTP config leafnode undef charset utf-8 non-ascii charsets builtins misconfigured Eg Unrendered Google pre-releases Ryde PNG libxml multibyte codings feed's NOAA XHTML unescaping X-From-Url X-RSS-Url X-RSS-Generator eg sn codepage unescape favicon kbytes repost r2l.perl

=head1 NAME

rss2leafnode -- post RSS or Atom feeds and web pages to newsgroups

=head1 SYNOPSIS

 rss2leafnode [--options]

=head1 DESCRIPTION

RSS2Leafnode downloads RSS or Atom feeds and posts items as messages to an
NNTP news server.  It's designed to make simple text items available in
local newsgroups, not propagating anywhere (though that's not enforced).

Desired feeds are given in a configuration file F<.rss2leafnode.conf> in
your home directory.  For example, to put a feed in group "r2l.perl",

    fetch_rss ('r2l.perl', 'http://log.perl.org/atom.xml');

This is actually Perl code, so comment lines begin with C<#> and you can
write conditionals etc.  The target newsgroup must exist (see L</Leafnode>
below).  With that done, run C<rss2leafnode> as

    rss2leafnode

You can automate with C<cron> or similar.  If you do it under user C<news>
it could be just after a normal news fetch.  The C<--config> option below
lets you run different config files at different times, etc.
A sample config file is included in the RSS2Leafnode sources.
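
Because the file is Perl, ordinary conditionals work.  As a minimal sketch
(the weekday test is only an illustration, not something RSS2Leafnode
requires), a feed could be fetched on weekdays only,

    # localtime element 6 is the day of the week, 0=Sunday .. 6=Saturday
    my $wday = (localtime(time()))[6];
    if ($wday >= 1 && $wday <= 5) {
      fetch_rss ('r2l.perl', 'http://log.perl.org/atom.xml');
    }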

Messages are added to the news spool using NNTP "POST" commands.  When a
feed is re-downloaded any items previously added are not repeated.  Multiple
feeds can be put into a single newsgroup.  Feeds are inserted as they're
downloaded, so the first articles appear while the rest are still in
progress.

The target newsgroup can also be a C<news:> or C<nntp:> URL of a server on a
different host or a different port number if running a personal server on a
high port.

    fetch_rss('news://somehost.mydomain.org:8119/r2l.weather',
              'http://feeds.feedburner.com/PTCC');

=head2 Web Pages

Plain web pages can be downloaded too.  Each time the page changes a new
article is injected.  This is good for a latest news or status page which
doesn't have an RSS feed.  For example

    fetch_html ('r2l.music',
      'http://www.abc.net.au/rage/playlist/print/saturday_print.htm');

The target can also be an image or similar directly; it's simply put into a
news message with its indicated MIME type.  How well it displays depends on
your newsreader.
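
A sketch of an image target, assuming a hypothetical webcam URL at
C<www.example.org> and a group name C<r2l.pictures> made up for the example,

    fetch_html ('r2l.pictures',
      'http://www.example.org/webcam/latest.png');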

The message "Subject" is the HTML C<< <title> >>, or something better from
C<URI::Title> or C<Image::ExifTool> if you've got them.  C<URI::Title> has
special cases for a few unhelpful sites and C<Image::ExifTool> can get a PNG
image title.

Since the conf file is Perl code you can write something to construct a URL
with a date etc if there isn't a single updating page.  It may be worth
fetching the latest and previous if you're not quite certain when the new
one becomes available.
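
As a sketch of that idea, assuming a hypothetical site which publishes one
page per day under a C<YYYY-MM-DD> name (the URL and group name here are
made up), today's and yesterday's pages could be requested with

    use POSIX 'strftime';
    foreach my $days_ago (0, 1) {
      # 86400 seconds per day is close enough for picking a date
      my $date = strftime ('%Y-%m-%d',
                           localtime (time() - $days_ago * 86400));
      fetch_html ('r2l.status',
                  "http://www.example.org/status/$date.html");
    }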

=head2 Re-Downloading

HTTP C<ETag> and C<Last-Modified> headers are used, if provided by the
server, to avoid re-downloading unchanged content (feeds or web pages).
C<< <thr:count> >> is used to check for unchanged comments feeds.  Values
seen from the last run are saved in a F<.rss2leafnode.status> file in your
home directory.

If you've got C<XML::RSS::Timing> then it's used for RSS C<ttl>,
C<updateFrequency>, etc from a feed.  This means the feed is not
re-downloaded until its specified update times.  Only a few feeds have good
timing info; most merely give a C<ttl> advising for instance 5 minutes
between rechecks.

With C<--verbose> the next calculated update time is printed in case you're
wondering why nothing is happening.  The easiest way to force a re-download
is to delete the F<~/.rss2leafnode.status> file.  Old status file entries
are automatically dropped if you don't fetch a particular feed for a while,
so that file should normally need no maintenance.

=head2 Leafnode

C<rss2leafnode> was originally created with the C<leafnode> program in mind,
but can be used with any server accepting posts.  It's your responsibility
to be careful where a target newsgroup propagates.  Don't make automated
postings to the world!

For leafnode version 2 see its F<README> file section "LOCAL NEWSGROUPS" on
creating local-only groups.  Basically add a line to the
F</etc/news/leafnode/local.groups> file like

    r2l.stuff	y	My various feeds

The group name is arbitrary and the description is optional, but note it
must be a tab character between the name and the "y" and between the "y" and
any description.  "y" means posting is allowed.

=head2 Small News

The Small News "sn" program is a possible local server too.  Create groups
in it with C<snnewgroup r2l.something>.  When running C<snntpd> from
C<inetd> or similar don't forget a logger program argument on the command
line as shown in its F<INSTALL.run>, otherwise log messages go to the client
connection and will upset most programs, including C<Net::NNTP> as used by
C<rss2leafnode>.

=head2 Copyright

It's your responsibility to check the terms of use for any feeds or web
pages you download with C<rss2leafnode>.  Pay particular attention if
propagating or re-transmitting resulting messages.

Copyright or license statements in a feed are included in the messages as
C<X-Copyright> headers.  Unless the content is in the public domain such
copyright notices must be retained.

The transformations RSS2Leafnode makes to turn feed items into messages are
purely mechanical and the author believes they don't cause the program's
terms (ie. GPL, per L</"LICENSE"> below) to be imposed on the results.

=head1 COMMAND LINE OPTIONS

The command line options are

=over 4

=item C<--config=/some/filename>

Read the specified configuration file instead of F<~/.rss2leafnode.conf>.

=item C<--help>

Print some brief help information.

=item C<--verbose>

Print some diagnostics about what's being done.  With C<--verbose=2> print
various technical details.

=item C<--version>

Print the program version number and exit.

=back

=head1 CONFIG OPTIONS

The following variables can be set in the configuration file

=over 4

=item $rss_get_links (default 0)

If true then download links in each item and include the content in the news
message.  For example,

    $rss_get_links = 1;
    fetch_rss ('r2l.finance',
      'http://au.biz.yahoo.com/financenews/htt/financenews.xml');

Not all feeds have interesting things at their link.  Sometimes the RSS has
the full item text already.  But if the RSS is a summary then
C<$rss_get_links> can make the full article ready to read immediately,
instead of having to click through from the message.

Only the immediate link target URL is retrieved.  No images within the page
are downloaded (which is often a good thing), and you'll probably have
trouble if the link uses frames (a set of HTML pages instead of just one).

=item $rss_get_comments (default 0)

If true then download the comments feeds for items and post as followup news
articles.  For example,

    $rss_get_comments = 1;
    fetch_rss ('r2l.food',
      'http://wickedgooddinner.blogspot.com/feeds/posts/default');

To send a followup comment you generally must go to the links in the
original article (or the followups) and use some sort of web form.  Posting
a message to the newsgroup goes nowhere.

When a feed is available in both Atom and RSS formats sometimes only the
Atom one includes a comments feed URL.

Comments feeds are followed for as long as an article appears in the feed,
though in the current implementation they might be checked for new comments only
when the originating feed changes.

=item $render (default 0)

If true then render HTML to text for the news messages.  Normally item text,
C<$rss_get_links> downloaded parts, and C<fetch_html> pages are all
presented as C<text/html>.  If your newsreader doesn't handle HTML very well
then C<$render> is a good way to see just the text.  Setting C<1> uses
C<HTML::FormatText>.

    $render = 1;
    fetch_rss ('r2l.weather',
      'http://xml.weather.yahoo.com/forecastrss?p=ASXX0001&u=f');

Setting C<"WithLinks"> uses the C<HTML::FormatText::WithLinks> variant (you
must have that module) which shows HTML links as footnotes.

    $render = 'WithLinks';
    fetch_rss ('r2l.stuff',
               'http://rss.sciam.com/sciam/basic-science');
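
Since the conf file is Perl, the choice can be made to depend on whether the
module is installed.  A sketch, using the usual C<eval>/C<require> idiom
(this is ordinary Perl, not a special RSS2Leafnode feature),

    # 'WithLinks' if the module loads, otherwise plain HTML::FormatText
    $render = (eval { require HTML::FormatText::WithLinks; 1 }
               ? 'WithLinks' : 1);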

Settings C<elinks>, C<lynx> or C<w3m> dump through the respective external
program (you must have C<HTML::FormatExternal> and the program).

    $render = 'lynx';
    $rss_get_links = 1;
    fetch_rss ('r2l.sport',
               'http://fr.news.yahoo.com/rss/rugby.xml');

=item $render_width (default 60)

The number of columns to use when rendering HTML to plain text or when
wrapping Atom text.  You can set this to whatever you find easiest to read,
or any special width needed by a particular feed.
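
For example, to render at 72 columns (the width is an arbitrary choice for
illustration),

    $render = 1;
    $render_width = 72;
    fetch_rss ('r2l.weather',
      'http://xml.weather.yahoo.com/forecastrss?p=ASXX0001&u=f');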

=item $get_icon (default 0)

Download an RSS/Atom icon or HTML favicon as an image for the C<Face>
header.  The C<Face> header is shown by Gnus and perhaps only a few other
news readers.  In Gnus it appears with the "From" in the article mode
display on a graphical screen.  It can be a good visual cue to the channel
origin, but may not always be worth the extra download.

    $get_icon = 1;
    fetch_rss ('r2l.whatsnew',
               'http://www.archive.org/services/collection-rss.php');

C<Image::Magick> is required to process the images.  Banner images which are
much wider than high are suppressed as probably advertising and in any case
not suited to the 48x48 size of the Face header specification.  A 48x48 image
may add perhaps 4 kbytes or more to each message.

For plain RSS and Atom feeds an image is normally per-channel so is the same
for all articles from the feed.  But an C<itunes:image> can be per-item and
is used if present.

=back

=head2 Obscure Options

=over 4

=item $rss_charset_override (default undef)

If set then force RSS content to be interpreted in this charset,
irrespective of what the document says.  See L<XML::Parser/ENCODINGS> for
the charsets supported (F<.enc> files under
F</usr/lib/perl5/XML/Parser/Encodings/> plus some builtins).

Use this option if the document is wrong or has no charset specified and
isn't the XML default utf-8.  Usually you'll only want this for a particular
offending feed.  For example,

    # AIR is latin-1, but doesn't have a <?xml> saying that
    $rss_charset_override = 'iso-8859-1';
    fetch_rss ('r2l.finance', 'http://www.aireview.com.au/rss.php');
    $rss_charset_override = undef;

By default RSS2Leafnode attempts to cope with bad multibyte sequences by
re-coding to the feed's claimed charset.  If that works then the text will
have some substitute characters (either U+FFFD or question marks "?") and a
warning is given like

    Feed http://example.org/feed.xml
      recoded utf-8 to parse, expect substitutions for bad non-ascii
      (line 214, column 75, byte 13196)

Bad single-byte codings generally aren't detected and will just go through
to display something incorrect (eg. if MS-DOS codepage 1252 is used where
Latin-1 is claimed).  Nose around the raw feed as necessary to see where it
goes wrong.

=item $html_charset_from_content (default 0)

If true then the charset used for C<fetch_html> content is taken from the
HTML itself, rather than the server's HTTP headers.  Normally the server
should be believed, but if a particular server is misconfigured then you can
try this.

    $html_charset_from_content = 1;
    fetch_html ('r2l.stuff',
               'http://www.somebadserver.com/newspage.html');

=back

=head2 Variable Extent

Variables take effect from the point they're set, through to the end of the
file, or until a new setting.  The Perl C<local> feature and a braces block
can confine a setting to a particular few feeds.  Eg.

    { local $rss_get_links = 1;
      fetch_rss ('r2l.finance',
                 'http://www.debian.org/News/weekly/dwn.en.rdf');
    }
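
Several settings can be confined the same way.  In the following sketch (the
second feed URL is made up for illustration) both variables revert to their
previous values at the closing brace,

    { local $render = 'lynx';
      local $render_width = 72;
      fetch_rss ('r2l.stuff',
                 'http://rss.sciam.com/sciam/basic-science');
      fetch_rss ('r2l.stuff',
                 'http://www.example.org/feed.xml');
    }
    # $render and $render_width are as they were before the block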

=head1 OTHER DETAILS

Non-ascii RSS and Atom text and rendered HTML text are all coded as utf-8 in
the generated messages so for non-ascii content you'll need a newsreader
which supports that.  Unrendered HTML is left in the charset the server
gave, to ensure it matches any C<< <meta http-equiv> >> in the document.  In
all cases the charset is specified in the MIME message headers or attachment
parts.  Transfer format in the message body is chosen by C<MIME::Entity>
(except Atom base64 C<< <content> >>) which normally means quoted-printable
if there's any non-ascii or very long lines.

Links are shown for

    <link>                 RSS and Atom
    <enclosure>            RSS
    <comments>             RSS
    <content>              Atom externals, except other XML feeds
    <source>               RSS and Atom
    <wfw:comment>          well-formed web
    <wiki:diff> 
    <wiki:history>
    <sioc:has_creator>
    <sioc:has_discussion>
    <sioc:links_to>
    <sioc:reply_of>
    Author <url>           Atom and wiki, not downloaded

Comment or reply links show a count from any of

    <thr:total>
    count="123"         \ attribute of <link>
    thr:count="123"     /
    <slash:comments>    sub-element of <comments>

The RSS format comment feeds used by C<$rss_get_comments> are as follows.
"appication" is a typo from WordPress pre 2.5 and still sometimes found in
use as of Feb 2011.

    <wfw:commentRss>
    <link rel='replies' type='application/atom+xml' ...>
    <link rel='replies' type='appication/atom+xml' ...>

Comments links in the resulting news messages are shown as "Replies" or "RSS
Replies".  If an RSS comment feed hasn't been detected as RSS it may show up
as a plain "Replies" instead of "RSS Replies" (and won't be downloaded by
C<$rss_get_comments>).

Common Alerts Protocol (CAP) fields for weather alerts etc are shown if
present (eg. from the US NOAA).  This can have more detail than just the
text.  Pseudo-link footnotes are shown for,

    <geo:lat>,<geo:long>
    <geo:Point>
    <georss:point>
    <statusnet:origin>      possibly with URL target too
    <media:credit>

Unrecognised item fields are shown in XML at the end of the message so as
not to drop information, and to perhaps suggest extra things RSS2Leafnode
might present or interpret.

An attempt is made to repair bad XML from a feed with C<XML::Liberal> if you
have that module.  It uses C<XML::LibXML> and the C<libxml> library and is
often successful on annoying things like bad entities, at least enough to
process something.  On hopelessly wrong data it might be a bit slow.

The most common XML problem is too much or too little entity escaping.  Too
little can turn HTML markup into nested XML elements.  RSS2Leafnode treats
that as if it was XHTML elements, though the result is likely to be
imperfect.  Too much escaping currently ends up displaying raw or semi-raw
HTML C<< <p> >> or C<&foo;> etc.  An option for extra unescaping might
improve the display of some bad feeds, but in practice that's unlikely to be
successful since each bad feed tends to be bad in its own special way.

=head2 Message Headers

For reference the headers in the messages are generated roughly as follows,

=over

=item From:

First non-empty of

    <author>
    <dc:creator>
    <dc:contributor>
    <wiki:username>
    <itunes:author>
    <managingEditor>
    <webMaster>
    <dc:publisher>
    <itunes:owner>
    channel <title>

If there's no identifiable mailbox part then C<nobody@rss2leafnode.dummy> is
added to make an RFC822 address.  The channel title as a fallback shows
something about where a message came from when there's no other author
identified.  An author's home page is shown in the links (as noted above).

=item Subject:

C<< <title> >> or C<< <dc:subject> >>.  A C<< <dc:subject> >> is normally
only a keyword but might be better than nothing.

=item Date:

First present of

    <pubDate>
    <dc:date>
    <modified>
    <updated>
    <issued>
    <created>
    <lastBuildDate>
    <published>

C<dc:date> is ISO format "2000-01-01T12:00:00Z" etc and anything in that
form is converted to RFC822 style for the messages.  An unrecognised form is
put through unmodified.

=item Date-Received:

The date/time when C<rss2leafnode> made the message.

=item Message-ID:

First of

    <id>                         (Atom)
    <guid isPermaLink="true">
    <link>                       Yahoo Finance special case
    <guid isPermaLink="false">   and feed URL
    MD5 hash                     of various fields and feed URL

Yahoo Finance items repeated in different feeds are noticed using a special
match of the C<< <link> >> so that just one copy is posted.  (As of March
2010 those items don't offer RSS C<guid> identifiers.)

=item Keywords:

All of

    <category>
    <itunes:category>
    <cap:category>
    <itunes:keywords>
    <media:keywords>
    <dc:subject>
    <slash:section>

The sub-category system of C<< <itunes:category> >> is not currently put through.

=item In-Reply-To:

C<< <thr:in-reply-to> >> elements (per RFC 4685) turned into Message-IDs the
same way as an Atom C<< <id> >>.  This might help thread display in a news reader
if the parent item was downloaded too.

C<< <sioc:reply_of> >> is not used.  It'd be a possibility, but would
probably need a hard-coded mapping of URL to Message-ID.  For now it's just
shown as a link (as noted above).

=item Content-Location:

The URL of a C<fetch_html()> or a C<$rss_get_links> attachment part.  Good
newsreaders use this to resolve relative links in a HTML part.

=item Content-Language:

First of

    <language>
    <dc:language>
    <twitter:lang>
    xml:lang=""
    HTTP response Content-Language header

C<xml:lang> is a standard XML attribute which may be present on any element
and is sometimes found on Atom C<< <content> >> text.

=item Content-MD5:

From the corresponding HTTP header of a C<fetch_html()> or C<$rss_get_links>
download part, though in practice this is almost never used.

=item Importance:

=item Priority:

Common Alerts Protocol C<< <cap:severity> >> levels Extreme and Severe are
treated as "Importance: high" and "Priority: urgent".
C<< <wiki:importance> >> "minor" is "Importance: low".  These headers are
only supposed to be for X.400 inter-operation though.

=item Precedence:

"list" for certain Google Groups lists, identified by their link URLs per
C<List-Post> below.  Perhaps other feeds which come from mailing lists could
be identified.

=item Face:

As per the C<$get_icon> option above, the first item or channel element

     <image>           RSS
     <icon>            Atom
     <logo>            Atom
     <itunes:image>
     <statusnet:postIcon>
     <activity:actor><link rel="avatar">
     HTML favicon      for fetch_html()

Gnus and perhaps other newsreaders can display C<Face:>, see
L<http://quimby.gnus.org/circus/face>.

It'd be possible to generate an C<X-Face:> as well or instead, but it's
black and white and a conversion from a colour image out of the feeds is
unlikely to look good most of the time.

=item List-Post:

Mailbox of a Google Groups mailing list feed such as
L<http://groups.google.com/group/cake-php/feed/rss_v2_0_msgs.xml>.  This may
help post a followup to the list, depending on the newsreader.  (A followup
to an C<rss2leafnode> newsgroup will normally go nowhere.)

=item PICS-Label:

Channel C<< <rating> >>.  Perhaps C<< <itunes:explicit> >> or
C<< <media:adult> >> could be turned into a rating too.

=item X-Mailer:

"RSS2Leafnode/VERSION" plus the usual from C<MIME::Entity> (see
L<MIME::Entity/build PARAMHASH>).

=item X-Copyright:

An RSS2Leafnode extension, being all of the following.  See L</Copyright> above.

    <rights>
    <copyright>
    <dc:license>
    <dc:rights>
    <creativeCommons:license>
    <cc:license>

=item X-RSS-Url:

An RSS2Leafnode extension, being the originating C<fetch_rss()> feed URL
downloaded.  This is handy if an item has come out badly and you want to check
the raw feed.

=item X-RSS-Generator:

An RSS2Leafnode extension, being the channel C<< <generator> >>.  This might
help assign blame for bad feed content etc.

=back

Of course all this mapping wouldn't be necessary if RSS had been news to
start with.  A news server already serves short messages, either read-only
or with followups, and if news servers hadn't got a fairly well deserved
reputation for being a pain to administer, and if it hadn't been based on
transferring gigabytes of "full feed" instead of on-demand, then RSS might
never have been needed.  Of course the other side is that if you're
accustomed to HTTP for web pages then everything looks like a web resource,
and if you're used to HTML then an edifice like XML to encapsulate a half
dozen bits of text seems like a good idea.

=head1 BUGS

The way Message-IDs are checked on the news server means that the server
should be set up to retain messages for at least as long as the feed retains
items.  If that's not so then old articles will be re-posted by the next
C<fetch_rss> and will look like new articles to a newsreader.

Letting the news server track articles keeps down the amount of state
C<rss2leafnode> must maintain and means multiple users can insert a feed
without duplication.  But perhaps long running or mothballed feeds will need
further repost protection.

Some pre-releases of leafnode 2 have trouble with posts to local newsgroups
while a C<fetchnews> run is in progress.  The local articles don't show up
until after a subsequent further C<fetchnews>.

No attention is paid to C<< <atom:updated> >> or other changes in an item.
Should an updated item be re-posted?  Is the C<Supersedes:> header better,
replacing the article?  Something allowing readers to see or not see updates
according to user preference would be good.  Currently if C<< <atom:id> >>
changes then the item is reposted, or if there's no C<id> and the content is
different enough to make the MD5 hash change.  But C<id> is supposed to stay
the same for an update, isn't it?

The way C<$rss_get_links> only gets the immediate link target could perhaps
be extended to fetch images, frame parts, etc of a HTML page and include
them in the message as RFC 2557 style "MHTML".  Not sure that any news
readers will actually display that though.

=head1 ENVIRONMENT VARIABLES

=over 4

=item C<NNTPSERVER>

=item C<NEWSHOST>

Default news server as per C<Net::NNTP>.

=back

=head1 FILES

=over 4

=item F<~/.rss2leafnode.conf>

Configuration file.

=item F<~/.rss2leafnode.status>

Status file, recording "last modified" dates for downloads.  This can be
deleted if something bad seems to have happened to it; the next
C<rss2leafnode> run will recreate it.

=item F</etc/perl/Net/libnet.cfg>

=item F<~/.libnet.cfg>

Defaults per C<Net::NNTP> and C<Net::Config>.

=back

=head1 SEE ALSO

L<leafnode(8)>,
L<HTML::FormatText>, L<HTML::FormatText::WithLinks>, L<HTML::FormatExternal>,
L<lynx(1)>,
L<URI::Title>, L<XML::Parser>, L<XML::Liberal>, L<Image::Magick>,
L<Net::NNTP>, L<Net::Config>

L<Plagger>, L<feed2imap(1)>, L<rss2email(1)>, L<rssdrop(1)>, L<toursst(1)>,
L<http://www.gwene.org>

=head1 HOME PAGE

L<http://user42.tuxfamily.org/rss2leafnode/index.html>

=head1 LICENSE

Copyright 2007, 2008, 2009, 2010, 2011 Kevin Ryde

RSS2Leafnode is free software; you can redistribute it and/or modify it
under the terms of the GNU General Public License as published by the Free
Software Foundation; either version 3, or (at your option) any later
version.

RSS2Leafnode is distributed in the hope that it will be useful, but WITHOUT
ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License for
more details.

You should have received a copy of the GNU General Public License along with
RSS2Leafnode.  If not, see L<http://www.gnu.org/licenses/>.

=cut
