PROPAGANDA:

See http://www.math.uio.no/~janl/w3mir/ for propaganda.

--------------------------------------------------------------------------
FAQS:

Q: Where can I get a new version of w3mir?
A: http://www.math.uio.no/~janl/w3mir/

Q: Are there any mailing lists?
A: Yes, see below.

Q: Should I subscribe to any of the mailing lists?
A: Yes, if you use w3mir at all you should subscribe to
   w3mir-info@usit.uio.no; send e-mail to janl@math.uio.no to be
   subscribed.

Q: I found a bug!
A: See below.

Q: Does it handle CGIs?
A: In a manner of speaking.  The default is to just execute the CGI and
   save the result.  Alternatively, you can ignore all references to
   CGIs and leave them pointing back to the original site.  This
   requires use of a config file and directives like:
     Ignore: *.cgi
     Ignore: *-cgi

Q: Does it handle imagemaps?
A: It handles client-side imagemaps, but not server-side imagemaps.
   Server-side imagemaps can be handled like CGIs; ignored references
   point back to the original site:
     Ignore: *.map
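
   Putting the two answers above together, a sketch of a complete
   config file might look like the following.  Only the Ignore lines
   are taken from the answers above; the URL directive, the target
   directory name and the comment syntax are assumptions here, so
   check the w3mir documentation for the exact directive set:

```
# Hypothetical example config -- only the Ignore lines come from the
# answers above; 'URL:' and 'my-mirror' are assumed, not documented here.
URL: http://www.example.org/ my-mirror
# Leave CGI and server-side imagemap references pointing at the
# original site:
Ignore: *.cgi
Ignore: *-cgi
Ignore: *.map
```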

Q: Does it handle Applets/java/activeX?
A: Only if the "codebase" directory is browsable and contains all the
   objects the applet needs, and if all the other URL attributes in the
   OBJECT/APPLET tag are relative to the codebase.  Support for this
   will be made more complete in a (near) future release.

Q: Does it handle CSSes (Cascading Style Sheets)?
A: Limited support, in the sense that the CSSes are retrieved but are
   not examined to find and/or edit URLs.  <style> tags are also
   handled in the sense that w3mir does not do _anything_ to them,
   except making sure they are not damaged.  w3mir will be extended to
   support CSSes more fully in the near future.

Q: Does it handle scripts?
A: No.  w3mir does not damage scripts, but they are not executed
   either, and URLs embedded in scripts will not be found.

--------------------------------------------------------------------------
BUGS:

- The -lc switch does not work very well.

Please see below for how to report bugs.

--------------------------------------------------------------------------
FEATURES (NOT bugs):

- URLs with two /es ('//') in the path component do not work as some
  might expect.  According to my reading of the HTTP/URL
  specifications it is an illegal construct, which is a Good Thing,
  because I don't know how to handle it if it were legal.
- If you start at http://foo/bar/ then index.html might be gotten twice.
- Some documents refer to a point above the server root, i.e.,
  http://some.server/../stuff.html.  Netscape and other browsers, in
  defiance of the URL standard documents, will change the URL to
  http://some.server/stuff.html.  w3mir will not.

--------------------------------------------------------------------------
MAIL LISTS, REPORTING BUGS:

Please send bug reports to w3mir-core@usit.uio.no, and include the URL
and command line that triggered the bug.  Send ideas (but see the todo
lists further down first), questions about usage, general discussion
and other related talk to w3mir-info@usit.uio.no.  To subscribe to
these lists, email janl@math.uio.no.  The w3mir-core list is intended
for w3mir hackers only.

--------------------------------------------------------------------------
COPYRIGHTS:

w3mir, w3http.pm, w3pdfuri.pm and htmlop.pm are free software, but
they are copyrighted by the various hackers involved.  If you want to
copy, hack or distribute w3mir you may do so provided you comply with
the 'Artistic License' enclosed in the w3mir distribution in the file
named Artistic.

--------------------------------------------------------------------------
CREDITS:

- Oscar Nierstrasz: Wrote htget
- Gorm Haug Eriksen: Started w3mir on the foundations of htget,
	contributed code later.
- Nicolai Langfeldt: Learning from Oscar's and Gorm's mistakes, rewrote
  everything.
- Chris Szurgot: Adapted w3mir to win32, good ideas, code
  contributions, debugging, and criticism.
- Ed Jordan: Patches and debugging.
- Rik Faith: Uses w3mir extensively, not shy about complaining and
  commenting and suggesting.  
- The libwww-perl author(s), who made adding some new features
  ridiculously easy.

--------------------------------------------------------------------------
TODO LIST:

Currently I'm preparing for version 1 of w3mir.

* TODO for version 1:

- Fix bugs discovered.
- Release 1.0

* TODO, after version 1:

Some of these are speculative, some others are very useful.

- CSS parsing/support at the same level as HTML
- Full support for APPLETS/OBJECT tags.
- Alias rules.  These would enable w3mir to map ukoln.bath.ac.uk and 
  bubl.bath.ac.uk to www.ukoln.ac.uk and know that the objects contained
  in these are all the same.  Another use would be to mirror from a
  mirror instead of the _real_ site, when the original site you have
  references to is on a slow link while the mirror is on a fast link.
- FTP support (easy if through a http style ftp proxy, but is that what
  we want?)
- Retrieve recursively until N-th order links are reached.  This
  differs significantly from the directory recursion we do now.  When
  this is done w3mir should also know the difference between inline
  and external links; inline links should always be retrieved.  Trivia
  question: What order is needed to reach every findable document on
  the web from Yahoo?
- SSL support
- Integrate with cvs or rcs (or another version control system) to
  make the retriever able to reproduce the mirrored site for any given
  date.
- Some text processing:  Adding and removing text/sgml comments when suitable
  options and tags are found.  Suggested by Ed Jordan.
- Put retrieval date-stamps as comments in HTML files, to document
  when and how each document was retrieved.
- Example: you're mirroring a site primarily to get at the papers, but
  the site has n versions of each paper: foo.ps.gz, foo.ps.Z,
  foo.dvi.gz, foo.dvi.Z, foo.tar.gz, foo.zip, and you only need one.
  Implement a way to get only one version of documents provided in
  multiple versions, something like a multi-axis preference list to
  get only the most attractive version of each document.
- Logging of retrievals to file; requires changing every print to a
  function call.
- Your suggestion here.

* TODO, http related
- Use Keep-alive.  Then we should probably stop using 30 second pauses
  between document retrievals.
- HTTP/1.1?  HTTP/1.1 servers should do keep-alive even with 1.0 requests.
- Separate queues for each server, interleave requests.

* If perl gets threads:
- Make the retrieval and analysis engines separate threads, and have
  one retrieval thread per method/server/port to do parallel
  retrievals.

