
WWW::Search and AutoSearch and WebSearch
========================================


WHAT IS NEW IN WWW::Search 2.29?  (2002-03-22)
----------------------------------------------

overview: 
 * BUGFIX do not try to delete the TreeBuilder

For details, see the ChangeLog file and/or the pod of each affected
module.


WHAT IS WWW::Search?
--------------------

WWW::Search is a collection of Perl modules which provide an API to
search engines on the world-wide web (and similar engines).
Currently, WWW::Search includes backends for WebCrawler, among others.
Backends for many engines can be obtained separately, such as
AltaVista, Ebay, HotBot, and Yahoo.  This distribution includes two
applications built from this library: AutoSearch, a program to
automate tracking of search results over time; and WebSearch, a small
demonstration program to drive the library.

WWW::Search does NOT try to emulate the default search that you would
get with each search engine's GUI.  That is, WWW::Search does NOT
necessarily return the same results you would get by visiting the
search engine's web page.  A few backends implement the gui_query
method, which does return the same results as a search from the
engine's default web page; see `perldoc WWW::Search` for details.  See
also FUTURE PLANS below.  WWW::Search performs the search in a
way that is efficient and convenient for text processing.  This might
include using the "advanced search" interface; getting "text-only"
pages; making "OR" the default query term operator instead of "AND";
ungrouping same-site results; making sure descriptions are turned on;
and increasing the number of hits per page, among other tricks.
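
As a rough illustration of the API described above, here is a minimal
sketch that drives the library from the shell.  It assumes the calls
documented in `perldoc WWW::Search` (new, native_query, escape_query,
maximum_to_retrieve, next_result).  The function name run_demo_search,
the query string, and the choice of the Null backend are placeholders
for illustration only; substitute a backend you have installed.

```shell
# Sketch: run one query through WWW::Search and print the result URLs.
# ASSUMPTIONS: run_demo_search, the default backend "Null", and the
# default query are illustrative and not part of the distribution.
run_demo_search() {
    backend=${1:-Null}
    query=${2:-sushi restaurant}
    # Skip gracefully when perl or the library is not available.
    if ! perl -MWWW::Search -e 1 2>/dev/null; then
        echo "WWW::Search is not installed; nothing to do"
        return 0
    fi
    # "perl -" reads the program from the here-document below.
    perl - "$backend" "$query" <<'EOF'
use strict;
use WWW::Search;
my ($backend, $query) = @ARGV;
my $search = WWW::Search->new($backend);
# escape_query URL-encodes the query text before it is submitted.
$search->native_query(WWW::Search::escape_query($query));
$search->maximum_to_retrieve(10);    # ask for at most about ten hits
while (my $result = $search->next_result()) {
    print $result->url, "\n";
}
EOF
}

run_demo_search Null "sushi restaurant"
```

With the Null backend the loop is expected to print nothing, which
makes it a convenient way to check that the library loads before
pointing the same code at a real engine.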

Because WWW::Search depends on parsing the HTML output of web search
engines, it will fail if the search engine operators change their
format (an unfortunately frequent occurrence).  WWW::Search includes a
test suite for a few backends, which verifies that they are
functioning correctly.  The test suite can be run by typing 'make
test_parsing'; see TESTING below for details.

This base WWW::Search distribution contains backends for the following
search engines.  Unfortunately, almost none are operational.  We would
like to have volunteers to fix and/or take over maintenance of these
backends.

Crawler                 not working
ExciteForWebServers     not working
Fireball                partially working (not in test suite)
FolioViews              not working
Gopher                  not working? (not in test suite)
HotFiles                not working
Livelink                not working? (not in test suite)
MetaCrawler             not working?
Metapedia               partially working? (not in test suite)
MSIndexServer           not working?
NetFind                 not working?
Null                    working
PLweb                   not working
Profusion               defunct
Search97                not working
SFgate                  partially working?
Simple                  not working? (not in test suite)
Verity                  not working (not in test suite)
VoilaFr                 partially working? (not in test suite)

"Partially working" indicates that some tests passed and some failed.

The following backends (and more!) are registered at CPAN
independently (not included with this WWW::Search distribution):

AltaVista       http://www.perl.com/CPAN-local/authors/by-module/WWW
AP              in the WWW::Search::News distribution
Ebay            http://www.perl.com/CPAN-local/authors/by-module/WWW
Euroseek        http://www.perl.com/CPAN-local/authors/by-module/WWW/JSMYSER/
Go              http://www.perl.com/CPAN-local/authors/by-module/WWW
GoTo            http://www.perl.com/CPAN-local/authors/by-module/WWW/JSMYSER/
Google          http://www.perl.com/CPAN-local/authors/by-module/WWW/JSMYSER/
HotBot          http://www.perl.com/CPAN-local/modules/by-module/WWW/
LookSmart       http://www.perl.com/CPAN-local/modules/by-module/WWW/JSMYSER
Lycos           http://www.perl.com/CPAN-local/modules/by-module/WWW/MTHURN/
Magellan        http://www.perl.com/CPAN-local/modules/by-module/WWW/MTHURN/
Newsbytes       in the WWW::Search::News distribution
Nomade          http://www.perl.com/CPAN-local/authors/by-module/WWW
NorthernLight   http://www.perl.com/CPAN-local/authors/by-module/WWW/JSMYSER/
OpenDirectory   http://www.perl.com/CPAN-local/authors/by-module/WWW/JSMYSER/
PRWire          http://www.perl.com/CPAN-local/authors/by-module/WWW
Pubmed          http://www.perl.com/CPAN-local/authors/by-module/WWW
Snap            http://www.perl.com/CPAN-local/authors/by-module/WWW/JSMYSER/
Yahoo           http://www.perl.com/CPAN-local/modules/by-module/WWW
ZDNet           http://www.perl.com/CPAN-local/authors/by-module/WWW/JSMYSER/
WashPost        in the WWW::Search::News distribution
WashTech        in the WWW::Search::News distribution

There are even more backends available for manual download and
installation at http://www.idexer.com/backends/


REQUIREMENTS
------------

WWW::Search requires Perl5, the libwww-perl module suite, the URI
module, the HTML::Parser module, and a few other modules (see
Makefile.PL for a complete list).  For information on Perl5, see
<http://www.perl.com>.  For modules, see
<http://www.perl.com/CPAN-local/modules>.


AVAILABILITY
------------

The latest version of WWW::Search should always be available on CPAN.
Here is a good URL for finding it:
http://www.perl.com/CPAN-local/modules/by-module/WWW


INSTALLATION
------------

It is highly recommended that you use CPAN.pm to install WWW::Search.
It will automatically install all the prerequisite modules and all the
backends and put everything in the right places.  On a Unix or Linux
system, while connected to the internet, just type

   perl -MCPAN -e 'install WWW::Search'

Otherwise, you can install WWW::Search as you would any perl module
library, by running these commands in the WWW-Search-x.xx directory
after unpacking the archive (and after installing all the prerequisite
modules):

    perl Makefile.PL
    make test
    make install

On Win32, maintenance and testing are done with Microsoft's nmake.exe;
if you use it too, substitute 'nmake' for 'make' in the above
sequence of commands.

When you do `perl Makefile.PL` on Win32, you might get warnings that a
whole bunch of 'zero*.out' files are missing.  This seems to be a bug
in some versions of WinZip, which refuse to extract empty files from
the archive.  Since those files are supposed to be empty anyway, you
can ignore these warnings.

If you want to install a private copy of WWW::Search in your home
directory, then you should do the installation with something like
these commands:

    perl Makefile.PL INSTALLDIRS=perl PREFIX=/my/perl/lib 
    make test
    make pure_perl_install UNINST=1

Don't forget to add /my/perl/lib to your PERL5LIB environment variable
(or use lib '/my/perl/lib'; or unshift @INC, '/my/perl/lib')!
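
For the PERL5LIB route, a minimal sketch for a Bourne-style shell
follows.  It assumes the modules ended up under /my/perl/lib, the
example path used above; the variable name private_lib is only for
illustration.

```shell
# Prepend the private library directory to PERL5LIB, keeping any
# entries that are already there and avoiding a duplicate entry.
private_lib=/my/perl/lib      # example path from the text above
case ":${PERL5LIB:-}:" in
    *":$private_lib:"*) ;;    # already listed, nothing to do
    *) PERL5LIB="$private_lib${PERL5LIB:+:$PERL5LIB}" ;;
esac
export PERL5LIB
echo "PERL5LIB=$PERL5LIB"
```

Putting these lines in your shell startup file makes the setting
permanent for new login sessions.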


TESTING
-------

[This section of the documentation is primarily for backend authors
and maintainers.]

The "make test_parsing" command compares expected output
(precalculated and shipped with the archive) with actual output (from
the internet).  Sorry, the "make test_parsing" command does not run on
Win32, only in UNIX-type shells.  You can give arguments to the
test_parsing program by using the TEST_ARGS macro.  For example, the
following command only runs the external queries for WebCrawler:

    make test_parsing TEST_ARGS='-e WebCrawler -x'

To see all the available options, do this:

    make test_parsing TEST_ARGS='-help'

The "test_parsing" utility detects two kinds of errors:

- internal parsing:
        First it checks to make sure that your system computes
        the same results as my system based on some saved
        Web queries.  This test should always pass for working
        backends; if it doesn't, send me mail.

- external queries:
        Second, it makes real queries against the search engines
        and compares them with some saved results.

External queries can fail for several reasons:

- new pages have been added which match the test queries, or matching
  pages have been deleted, causing the page count to go too far out of
  whack from the expected number (not necessarily a bad thing)

- changes in the web search engine output which break WWW::Search's
  parsers, usually resulting in no URLs being returned (a bad thing)

If the external tests fail, please either investigate the error or
send a description of the problem, the name and version of your
operating system, your perl version number, and the relevant output of
"make test_parsing" to the maintainer of the backend for the search
engine that fails.


WHAT IS AutoSearch?
-------------------

WWW::Search's primary client is AutoSearch.  AutoSearch performs a
web-based search and puts the result set in a web page.  It
periodically updates this web page, indicating how the search changes
over time.  Sample output from AutoSearch can be found at
<http://www.isi.edu/lsam/tools/autosearch/>.  Output format is
configurable.

See `perldoc AutoSearch` for details, or the DEMONSTRATION section
below for quick-start instructions.


DISCUSSION, BUG REPORTS, AND IMPROVEMENTS
-----------------------------------------

When submitting a bug report or request for help, please remember to
include:
  - the operating system name and version
  - the version of perl
  - the version of WWW::Search
  - the version of the backend
  - the code you ran to produce the error (PLEASE cut-and-paste, do not just summarize!)
  - actual output showing the error (PLEASE cut-and-paste, do not just summarize!)
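
Most of the version details in the list above can be gathered with a
few commands.  The following is only a sketch: the function name
report_environment is made up for this example, and it relies on the
library setting $VERSION in the usual way.

```shell
# Print the operating system, perl, and WWW::Search versions.
# Missing tools are reported rather than treated as errors.
report_environment() {
    echo "OS: $(uname -sr 2>/dev/null || echo unknown)"
    if command -v perl >/dev/null 2>&1; then
        echo "perl: $(perl -e 'print $]')"
        echo "WWW::Search: $(perl -MWWW::Search \
            -e 'print WWW::Search->VERSION' 2>/dev/null \
            || echo 'not installed')"
    else
        echo "perl: not found"
    fi
}

report_environment
```

The same one-liner pattern should work for a backend, for example by
substituting WWW::Search::Ebay for WWW::Search in both places.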

There is a mailing list for WWW::Search discussion.  To subscribe,
send "subscribe info-www-search" as the body of a message to
<info-www-search-request@isi.edu>.  If you use WWW::Search at all, you
should subscribe to the mailing list.  Bug fixes are usually posted
there as soon as they are available.

Feedback about WWW::Search is encouraged.  If you're using it for a
neat application, please let us know.  If you would like to implement
and publish a new backend for WWW::Search (or already have), let us
know at <mthurn@cpan.org> so we don't duplicate work.

Backend-related bug reports ("search engine ABC doesn't work") should
be sent to the author of the backend (backend authors are identified
in the corresponding man page and in the output of `make
test_parsing`).  

All other feedback, bug reports, fixes, and new backends (if you want
them to be included with the base distribution) should be sent to
Martin Thurn <mthurn@cpan.org>.  When sending e-mail, please please
put [WWW::Search] in the subject line (or risk me losing the message
among the spam).


DEMONSTRATION
-------------

After installing the distribution, connect to the internet and type:

        WebSearch '"Your Name Here"'

or, if you are on Win32:

        WebSearch "\"Your Name Here\""

to see who's talking about you on the web.  Then (in a browsable web
page directory), try:

        cd /path/to/your/web/pages
        AutoSearch -n me_on_the_web -s '"Your Name Here"' me
        netscape /path/to/your/web/pages/me/index.html

or, if you are on Win32:

        cd /path/to/your/web/pages
        AutoSearch -n me_on_the_web -s "\"Your Name Here\"" me
        netscape /path/to/your/web/pages/me/index.html

If you are on UNIX you can add

        0 3 * * 1 AutoSearch /path/to/your/web/pages/me

to your crontab to update this search every Monday at 3:00 in the
morning.  If you install WWW::Search::Ebay, and add the --mail option
to AutoSearch, you'll have your own private replacement for eBay's
personal search service... WITHOUT the three-query limit!
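
If you would rather add the crontab line from a script than by hand,
here is one possible sketch.  The helper name add_autosearch_entry is
made up for this example, and the entry is the weekly line shown
above.  The helper reads a crontab listing on standard input and
writes it back out, appending the entry only if it is not already
present.

```shell
# Weekly AutoSearch entry, using the example path from above.
entry='0 3 * * 1 AutoSearch /path/to/your/web/pages/me'

# Copy the crontab listing from stdin to stdout and append the entry
# unless an identical line already exists.
add_autosearch_entry() {
    existing=$(cat)
    if [ -n "$existing" ]; then
        printf '%s\n' "$existing"
    fi
    if ! printf '%s\n' "$existing" | grep -Fxq -e "$entry"; then
        printf '%s\n' "$entry"
    fi
}

# Dry run against an empty crontab: prints just the new entry.
printf '' | add_autosearch_entry
```

To install the result for real, pipe your current crontab through the
helper: `crontab -l 2>/dev/null | add_autosearch_entry | crontab -`.
Running it a second time leaves the crontab unchanged.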


DOCUMENTATION
-------------

See `perldoc WWW::Search` after installation for an overview of the
library.  POD-style documentation is also included in all modules and
programs, so you can do `perldoc WebSearch` and `perldoc AutoSearch`
and `perldoc WWW::Search::Crawler` after installation.


FUTURE PLANS
------------

Some things we need, and ideas for new features:

 - more robust test mechanism, i.e. more than just counting the number
of URLs returned; e.g. look at the various values and make sure
they're being parsed correctly (change_date() is really a date, URL is
not double-encoded, results are not duplicated, etc.).  Contact
<mthurn@cpan.org>

 - updates to each backend to implement the submit() method.  Contact
each backend's maintainer.

 - updates to each backend that will force WWW::Search to perform the
same search as the engine's web GUI (I'm looking for contributions of
the precise arguments that will produce such a search for each engine;
i.e. the hash that should be passed as the second argument to
native_query).  Contact <mthurn@cpan.org>

 - test cases for WebSearch.  Contact <mthurn@cpan.org>

 - test cases for AutoSearch.  Contact <mthurn@cpan.org>

 - use LWP::ParallelUA to speed up multiple backend search requests
(I'm trying to decide what the API will look like; please send
suggestions).  Contact <mthurn@cpan.org>

 - add a "language" parameter to the WWW::Search object?  We would
need a critical mass of backends/engines that can search multiple
languages before this would be useful.

 - more widespread use of result tags such as description, date, size,
etc. across all backends.  Contact backend maintainers.

 - a freeze/restore interface to suspend and resume in-progress queries.

 - more backends!

Contributions are always welcome.  Send me e-mail if you plan a new
backend, or to discuss architectural changes (to avoid duplicating
work).  Contact <mthurn@cpan.org>


SUPPORT AND CREDITS
-------------------

The WWW::Search architecture was originally written by John Heidemann,
with feedback from other contributors listed below.  NOTE: This list
is no longer updated; consult the on-line documentation and/or the
output of `make test_parsing` to find out who is currently maintaining
each component.

PLATFORM SUPPORT:
	Unix			John Heidemann <johnh@isi.edu>
	Windows			Jim Smyser <jsmyser@bigfoot.com>
                		(see <http://members.xoom.com/WWW_Search>)

COOKIE & HTTP_REFERER TESTING:  Jerry Hermel <jerryxh@earthlink.net>

APPLICATIONS:
	WebSearch		John Heidemann
	AutoSearch 		William Scheding <wls@isi.edu>

BACKENDS:
	AltaVista		John Heidemann
	Dejanews		Cesare Feroldi de Rosa <C.Feroldi@it.net>
				and Martin Thurn <mthurn@cpan.org>
	Crawler			Andreas Borchert
	Excite			GLen Pringle <pringle@cs.monash.edu.au>
				and Martin Thurn
	ExciteForWebServers	Paul Lindner <lindner@reliefweb.int>
	Fireball		Andreas Borchert
	FolioViews		Paul Lindner
	Gopher			Paul Lindner
	HotBot			William Scheding and Martin Thurn
	HotFiles		Jim Smyser
	Infoseek		Cesare Feroldi de Rosa and Martin Thurn
	Livelink		Paul Lindner
	Lycos			William Scheding and John Heidemann,
				Martin Thurn
	Magellan		Martin Thurn
	MSIndexServer		Paul Lindner
	NorthernLight		Jim Smyser
	Null			Paul Lindner
	OpenDirectory		Jim Smyser
	PLWeb			Paul Lindner
	Profusion		Jim Smyser
	Search97		Paul Lindner
	SFgate			Paul Lindner
	Simple			Paul Lindner
	Snap			Jim Smyser
	Verity			Paul Lindner
	WebCrawler		Martin Thurn
	Yahoo			William Scheding and Martin Thurn
	ZDNet			Jim Smyser

AutoSearch is based on an earlier implementation by Kedar Jog
<jog@isi.edu> with advice from Joe Touch <touch@isi.edu>.

Bugs and extensions (to the software and documentation) have been
identified by William Scheding <wls@isi.edu>, T. V. Raman
<raman@adobe.com> (proxy support), C. Feroldi <C.Feroldi@it.net>,
Larry Virden <lvirden@cas.org>, Paul Lindner <paul.lindner@itu.int>,
Guy Decoux <decoux@moulon.inra.fr>, R Chandrasekar (Mickey)
<mickeyc@linc.cis.upenn.edu>, Martin Thurn <mthurn@cpan.org>,
Chris Nandor <pudge@pobox.com>, Martin Valldeby
<martin.valldeby@pakom.se>, Jim Smyser <jsmyser@bigfoot.com>, Darren
Stalder <darren@u.washington.edu>, Neil Bowers
<neilb@cre.canon.co.uk>, Ave Wrigley <wrigley@cre.canon.co.uk>,
Andreas Borchert <borchert@mathematik.uni-ulm.de>.

Bugs have been reported by Joseph McDonald <joe@smartlink.net>, Juan Jose
Amor <jjamor@infor.es>, Bowen Dwelle <bowen@hotwired.com>, Vassilis
Papadimos <vpapad@dblab.ece.ntua.gr>, Vidyut Luther <vluther@hpctc.org>, 
Chris P. Acantilado <cacantil@spawar.navy.mil>.


COPYRIGHT
---------

Copyright (c) 1996 University of Southern California.
All rights reserved.                                            
                                                               
Redistribution and use in source and binary forms are permitted
provided that the above copyright notice and this paragraph are
duplicated in all such forms and that any documentation, advertising
materials, and other materials related to such distribution and use
acknowledge that the software was developed by the University of
Southern California, Information Sciences Institute.  The name of the
University may not be used to endorse or promote products derived from
this software without specific prior written permission.

THIS SOFTWARE IS PROVIDED "AS IS" AND WITHOUT ANY EXPRESS OR IMPLIED
WARRANTIES, INCLUDING, WITHOUT LIMITATION, THE IMPLIED WARRANTIES OF
MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE.


Portions of this README are derived from the README for libwww-perl.

