grepmail - search mailboxes for a particular email

Grepmail searches a normal, gzip'd, bzip'd, or tzip'd mailbox for a given
regular expression, and returns those emails that match it. Piped input is
allowed, and date and size restrictions are supported, as are searches using
logical operators.

New in version 5.00:
- grepmail is now orders of magnitude faster for mailboxes which have very
  large (>30MB) emails in them
- "grep" is now used to find the start of emails, if it is installed. For
  mailboxes with large emails in them, this can speed things up by about 5x.
- Reduced memory consumption by about a factor of 3.
- -- now marks the end of options and the beginning of folders
- -f now reads patterns from a file like GNU grep does.
- Added smail compatibility.
- Date specifications without times (e.g. "today") are interpreted as midnight
  of the given day instead of the current time of that day.
- Fixed -i when used with -Y -- it was always case sensitive before.
- Updated t/functionality.t to avoid running gzip-related test cases when gzip
  is not installed on the system.
- Improved some error messages so that they prepend "grepmail: " as they
  should
- The "**" prefix on warnings has been changed to "grepmail:"
- Cleaned up some warnings about ambiguous hash values
- Added a warning about the version of perl required for new pattern features
- -t flag renamed to -j
- Fixed broken Gnus support
- Improved test case for Gnus


SOME NOTES

perl version:

If you plan to use advanced pattern features such as "(?>...)", you will need
to make sure that your version of perl supports them.


-s flag:

*** WARNING ***

As of version 4.91, the semantics of -s have changed. For example:
  grepmail -s 1234 file
now matches emails whose size is exactly 1234 bytes. Use
  grepmail -s '<1234' file
if you prefer the older semantics.

Caching:

Caching appears to be working, but is currently disabled by default because
the speedup is only 10-20%, and because it requires a file to be created in
the user's directory. I'm hoping that the caching infrastructure can lead to
further improvements. For example, we should cache the boundaries of
attachments for the -M flag. Also, the current self-rolled implementation
might be better replaced with Cache::FileCache.

Complex queries:

The -E flag allows you to perform complex searches involving logical
operators. For example,

  $email_header =~ /^From: .*\@coppit.org/ && $email =~ /grepmail/i

will find all emails which originate from coppit.org (you must escape the "@"
sign with a backslash), and which contain the keyword "grepmail" anywhere in
the message, in any capitalization.
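When the expression is typed at a shell prompt, the "$" signs need protecting
from the shell as well. The sketch below shows one way to do that with single
quotes. The variable name "query" and the file name "mailbox" are placeholders
of mine, and the commented-out invocation assumes that -E takes the expression
as its argument; check "perldoc grepmail" for the exact usage.

```shell
# Single quotes stop the shell from expanding $email_header and $email,
# so the expression reaches grepmail (and Perl) unchanged.
query='$email_header =~ /^From: .*\@coppit.org/ && $email =~ /grepmail/i'
printf '%s\n' "$query"

# Assumed invocation; "mailbox" is a placeholder file name.
# grepmail -E "$query" mailbox
```

With double quotes instead, the shell would replace $email_header and $email
with (usually empty) shell variables before grepmail ever saw them.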

NOTE: -E support is experimental right now. I'm looking for feedback on the
following:

- Do you like the feature?
- Do you like the Perl-based syntax? Is there an alternative which is easier?
- How should date and size constraints be integrated? Should they be
  "variables", a la: "$email =~ /grepmail/ && $date <= 'sep 20 1998' || $size
  > 50000"?
- Should -i, -h, and -b be supported in conjunction with -E? (Where "-h
  pattern" would mean augmenting the -E pattern with "$email_header =~
  /pattern/ && ")
- -S ignores signatures. If/when this feature is implemented for -E, should it
  be "global" for all $email_body matches, or should it be possible to specify
  this for each $email_body match? For example, one can append an "i" modifier
  to an individual pattern match to make it case-insensitive. Should there be
  a standard way of dealing with such "global" pattern matching options on an
  individual pattern match basis? 

Message IDs:

NOTE: For emails without message ids, grepmail will use Digest::MD5 to
compute a hash based on the email header. If you don't have
Digest::MD5, grepmail will just use the header itself as the message
id. The Digest::MD5 checksum takes a little while to compute, but
saves a lot of space. Currently there is no easy way to choose space
over time. Let me know if this is a problem.
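
To get a feel for the space saving, here is a rough illustration using the
md5sum command line tool. This is not grepmail's code (grepmail uses the
Digest::MD5 Perl module), the header is made-up sample data, and the sketch
assumes a system where GNU md5sum is available.

```shell
# Illustration only: grepmail itself uses Perl's Digest::MD5, not md5sum.
# The header text below is made-up sample data.
header='From: alice@example.com
Subject: hello
Date: Thu, 9 Jul 1998 10:00:00 -0400'

# Hash the header; the digest stands in for the missing Message-Id.
id=$(printf '%s' "$header" | md5sum | cut -d' ' -f1)

echo "header length: ${#header} characters"
echo "digest length: ${#id} characters"
```

The digest is always 32 hexadecimal characters, however long the header is,
which is why the hash saves space when many emails are tracked.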


MODULE DEPENDENCIES

- Mail::Mbox::MessageParser: required
- Date::Parse: required if you want to search based on date (-d)
- Date::Manip: required if you want to search using complex date
  specifications (-d)
- Digest::MD5: not required, but can help grepmail use less memory if
  you are checking for unique emails (-u) and your emails don't have a
  Message-Id header

The modules can be found here:

Mail::Mbox::MessageParser: http://search.cpan.org/search?dist=MailMboxMessageParser
Date::Parse (in TimeDate): http://search.cpan.org/search?dist=TimeDate
Date::Manip:               http://search.cpan.org/search?dist=DateManip
Digest::MD5:               http://search.cpan.org/search?dist=Digest-MD5

Installation can also be done automatically using the CPAN module:

  perl -MCPAN -e 'install Mail::Mbox::MessageParser'
  perl -MCPAN -e 'install Date::Parse'
  perl -MCPAN -e 'install Date::Manip'
  perl -MCPAN -e 'install Digest::MD5'


INSTALLATION

=> On Non-Windows systems:

  % perl Makefile.PL
  % make
  % make test
  % make install

The "perl Makefile.PL" command will prompt you for an installation location if
you run it interactively, and will use the default values if it is run
non-interactively. You can force it to run non-interactively by specifying
either "PREFIX=/installation/path" (for installation into a custom location),
"INSTALLDIRS=site" (for installation into site-specific Perl directories), or
"INSTALLDIRS=perl" (for installation into standard Perl directories).
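
As a sketch of the non-interactive form, the steps above can be wrapped in a
small shell function. The function name "install_grepmail" and the example
prefix "$HOME/local" are my own placeholders, not part of the distribution.

```shell
# Hypothetical helper: build and install grepmail under a custom prefix.
# Run it from the unpacked grepmail source directory.
install_grepmail() {
  prefix=$1
  # Passing PREFIX=... makes "perl Makefile.PL" run without prompting.
  perl Makefile.PL PREFIX="$prefix" &&
    make &&
    make test &&
    make install
}

# Example call; "$HOME/local" is an assumed location:
# install_grepmail "$HOME/local"
```

Chaining the steps with "&&" stops the installation as soon as one of them
fails, so a failing "make test" never reaches "make install".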

If make test fails, please see the INSTALLATION PROBLEMS section below.

=> On Windows systems:

- Just copy "grepmail" to a place in your path. You may want to rename it
  "grepmail.pl" if you've associated .pl files with perl.exe.


CONFIGURATION

You may want to set your MAIL environment variable so that grepmail will know
the default location to search for mailboxes.
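
For Bourne-style shells, a minimal sketch of such a setting follows. The
location "$HOME/Mail" is only an assumption; use wherever your mailboxes
actually live.

```shell
# Assumption: mailboxes live in "$HOME/Mail"; adjust to your setup.
MAIL="$HOME/Mail"
export MAIL
echo "MAIL is set to: $MAIL"
```

Placing the first two lines in your shell startup file makes the setting
available in every new session.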

If you are terribly concerned about performance, you may want to modify the
value of the variable READ_CHUNK_SIZE located in the code. This variable
controls how much text is read from the mailbox at a time. If the value is set
to 0, the entire file is read into memory. (There is no user-visible option
for setting this value.) You may also want to hack the code to not use
Digest::MD5, thereby trading space for time.

If you frequently use the same set of flags, you may wish to alias "grepmail"
to "grepmail -flags" within your command interpreter (shell). See the
documentation for your shell for details on how to do this.
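
In Bourne-style shells, a function gives the same effect as an alias and also
works in scripts. The sketch below always adds -i; the choice of flag is just
an example.

```shell
# Hypothetical wrapper that always passes -i (case-insensitive matching).
# "command" bypasses the function itself and runs the real grepmail.
grepmail() {
  command grepmail -i "$@"
}

# Roughly equivalent alias for interactive use:
# alias grepmail='grepmail -i'
```

Any further flags and arguments you type are passed through unchanged after
the default ones.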


INSTALLATION PROBLEMS

If "make test" fails, run

  make test TEST_VERBOSE=1

and see which test(s) are failing. Please email the test##.stderr and
test##.stdout files for the failing test, which are located in t/results, to
the address below. Also include the output of running the test with the -D
flag, e.g.:

  blib/script/grepmail library -D -d "before July 9 1998" t/mailarc-1.txt \
    > test##.debug

If you see errors about your timezone, and you are in an uncommon timezone, it
may be the case that Date::Manip does not support your timezone yet. Try this:

  perl -MDate::Manip -e 'print "TIMEZONE: ".&Date::Manip::Date_TimeZone."\n"'

If you get an error, contact the author of Date::Manip.

For other bugs, see the section REPORTING BUGS below.


DOCUMENTATION

Run "perldoc grepmail". After installation on Unix systems, you can also run
"man grepmail".


HOMEPAGE

Visit http://grepmail.sourceforge.net/ for the latest version, mailing lists,
discussion forums, CVS access, cool utilities, and more.


TODO/WISHLIST

- Michael D. Schleif <mds@helices.org> suggested grepmail have support for
  compressed mail directories. Adding support for this is not easy, and I'm
  not sure many people would benefit from the feature. Let me know if you too
  want this support, and I may implement it.


REPORTING BUGS

You can report bugs at http://sourceforge.net/bugs/?group_id=2207.  Please
attach the output of running grepmail with the -D switch. If the bug is
related to processing of a particular mailbox, try to trim the mailbox to the
smallest set of emails that still exhibit the problem.  Then use the
"anonymize_mailbox" program that comes with grepmail to remove any sensitive
information, and attach the mailbox to the bug report.


PRIMARY AUTHOR

Written by David Coppit (david@coppit.org, http://coppit.org/), with the
generous help of many kind people. See the file CHANGES for detailed
information.


LICENSE

This code is distributed under the GNU General Public License (GPL). See
http://www.opensource.org/gpl-license.html and http://www.opensource.org/.
