

bp -- bibliography package for Perl
Dana Jacobsen (dana@acm.org)
12 March 1996

bp.pl is the main file that gets included.  It will include all the other
files as needed.  Since this package was started before Perl version 5, I
wrote dynamic loading routines myself using the require command.  This is
not as spiffy as Perl 5's dynamic loading, but works just fine, and allows
Perl 4 to be used.  The variable $glb_bpprefix in the main package controls
the prefix used for the additional files.  This is "bp-" right now.  If you
want to change this, go ahead.  The two things I had in mind when I made
this were "bp/" and "/usr/local/lib/bp/".  The first would have all the bp
files in a subdirectory of the directory bp.pl was in.  The second would
only clutter /usr/lib/perl with one file, but isn't recommended, since an
even cleaner solution is to set your BPHOME directory to wherever you put
all the bp files.  Personally, I just leave them in my source directory
and set BPHOME to point to them.

Modules are searched for by using require, so they can be found anywhere in
your @INC (properly written bp programs will add BPHOME to your @INC as the
first thing they do).  The rules for module names (not enforced currently)
are that they should not be named 'p-...', 's-...', or 'cs-...' because
those are reserved for package, style, and character sets, respectively.
Names like "xxx2yyy" are reserved for converters.  I'm trying to keep all
the files down to 8 characters (not counting the bpprefix) just in case
some people use crippled operating systems like DOS.


--- timing commands used here are
 time perl5 tconv.pl -convert -format=refer,bibtex ad425.ref > foo
 time perl5 tconv.pl -convert -format=bibtex,refer foo > foo.ref
 time perl5 tconv.pl ad425.ref > foo2.ref
---

 version notes:

 0.2.2  perl5/ad425.ref->ad425.bib  in 20.0 seconds
	perl5/timetest.pl  2.91s total, 0.92, 0.94, 3.11

	Added fontcheck to validate font changes.
	All new Medline module.
	Changed options parsing.
	Unimplemented and unsupported methods have new stdbib routines.
	Modified name_to_canon routine.
	Added EndNote support to Refer module.
	Fixed nasty typo in bp.pl's tocanon.
	(changed perl version to 5.002 and slight machine changes)

 0.2.0	perl5/ad425.ref->ad425.bib  in 22.1 seconds
	perl5/timetest.pl  2.95s total, 1.05, 1.00, 3.32

	Changed BibTeX string parsing.
	Minor cleanup to distribution.

 0.1.4	perl5/ad425.ref->ad425.bib  in 22.6 seconds
	perl5/timetest.pl  2.89s total, 1.09, 1.02, 3.66

	General cleanup.
	Put refer tocanon code through optimizer and generally cleaned up.
	All formats and character sets use the 8bit code.
	New apple and ISO 8859-1 character sets.
	Some changes to HTML output.

 0.1.3	perl5/ad425.ref->ad425.bib  in 30.7 seconds
	perl5/timetest.pl  2.71s total, 1.07, 1.09, 3.63

	Major changes to fonts.  We use 8bit internally with one escape char.
	Changed to use $cs_sep instead of $; everywhere.
	Names are seperated with $cs_sep and parts by $cs_sep2.

 0.1.2  perl5/ad425.ref->ad425.bib  in 25.0 seconds
	perl5/timetest.pl  2.36s total, 1.10, 1.10, 3.62

	Changed glb_filename to glb_Ifilename in bp-p-errors.pl.
	We open >>- instead of >- by default.
	bp-bibtex.pl sets glb_vloc to the line number.
	Changed format option parsing routine in bp-p-options.
	added opt_reverseauthor to bp-refer, and added options parsing.

 0.1.1  perl5/ad425.ref->ad425.bib  in 25.3 seconds
	perl5/timetest.pl  2.31s total, 1.03, 1.01, 3.52

	Added p-output.pl.  HTML output calls this.  Minor changes to other
	programs.  Added example (lousy) Melvyl reader.

 0.1.0  Added html charset and output format.
	XXX check out RFC 1345 on charsets.  It would be nice to let us read
	these descriptions in.  That may be another project though.

 0.0.13 perl5/ad425.ref->ad425.bib  in 24.9 seconds
	perl5/timetest.pl  2.07s total, 1.12, 1.14, 3.53

        Added new formats (ieee, medline, tib).  Fleshed out TeX charset
        conversion including changes to fonts.  Minor changes in library.
        New programs count, sort, and rdup.

 0.0.12 perl5/ad425.ref->ad425.bib  in 22.9 seconds
	perl5/timetest.pl  1.94s total, 1.20, 1.22, 3.68

	MAJOR changes in the way files are handled internally.  Input and
	Output files are now stored in seperate locations.  There is no
	openwrite function any more -- open is called with a prefixed mode,
	just like the system open, i.e. "open('foo')" vs. "open('>foo')".
	The system filehandles are not supported -- use '-' for STDIN or
	STDOUT.

	Changed the way the auto format handles character sets (and formats to
	a lesser degree).  Debugging dumps with glb_debug==2.  More dump info
	printed.  bp_find_files supports 'rehash' to re-read directories.

	Finally removed the global rfmt and rcset info.  Use the specific
	data for the global input and output files, which change each time
	an I/O operation is done, rather than when format is called.

	Tried stripping all the comments and debugging calls from the main
	package, and stuffing everything into one file.  (About half of the
	code size is comments).  Package loading time went down maybe .1 - .2
	seconds (base of .75).  No noticible change in running time, which
	means the debugging statements aren't slowing us down enough to worry
	about dropping them.

 0.0.11 perl5/ad425.ref->ad425.bib  in 22.7 seconds
	perl5/timetest.pl  1.98 total, 1.02, 1.01, 3.39

	Added format registration.  Key generation and registration added to
	bp_util, and the _output_ format is responsible for calling it if it
	wants a key.  This means we don't have to be generating keys if we
	don't need them.

	changes to TeX protection.  Unrolled cs-tex tocanon map_simp. We
	now only call parse_format in non-time-critical routines like format
	and open.

 0.0.10 perl5/ad425.ref->ad425.bib  in 23.7 seconds
	perl5/timetest.pl  1.89s total, 0.93, 0.95, 3.33

	Revamped option and stdarg parsing.  Added glb_in/out cset so we
	make fewer calls to parse_format, which is why our time went back
	down.  Also changed debugging levels a bit, to reserve 0 and 1 for
	true and false.

	Split into modules, as the bp.pl file got over 40k.  I should
	probably split it even further.

 0.0.9  perl5/ad425.ref->ad425.bib  in 25.3 seconds
	perl5/timetest.pl  1.89s total, 0.96, 0.96, 3.38

	major changes to font parsing.  Time suffers because we now protect
	TeX characters, so we have to do that for _every_ string we get,
	instead of doing a quick '<|>' check to skip it.  That one change
	makes the 2608 record time go from 127 seconds to 140 seconds.

	Lots of changes to BibTeX and Refer parsers also, but little to the
	main package.

 0.0.8  perl5/ad425.ref->ad425.bib  in 23.5 seconds
	perl5/timetest.pl  1.59s total, 0.94, 0.91, 3.35

 0.0.7  perl5/ad425.ref->ad425.bib  in 28.6 seconds
	perl5/timetest.pl  1.36s total, 0.92, 0.91, 3.11

	passes tests.

 0.0.5  perl5/ad425.ref->ad425.bib  in 55.6 seconds

 time perl5 tconv.pl -canon -informat=refer -outformat=bibtex ad425.ref >f.bib
 time perl5 tconv.pl -canon -informat=bibtex -outformat=refer f.bib > f.ref
 time perl5 tconv.pl ad425.ref > fo.ref

 diff fo.ref f.ref


Optimization issues:
>	One of the hardest issues I've had to deal with is the slowness of
	function calls in perl.  A null procedure call is about 150 - 700
	times slower than C, depending on the platform, and whether you've
	compiled the perl code (yes, even compiled perl code still does
	function calls very slow as of perl compiler a3).  This means there
	is a tradeoff between modular code and speed.  In 0.3.0, setting
	options once per record adds about 10% to the run time!  This is one
	reason I put something like "&debugs(message) if $glb_debug" -- the
	test for $glb_debug saves a _lot_ of time.


 Issues to deal with:
>      Anything with an XXXXX

>      Look into the possibility of using references for our associative
       array calls.  This might save us some time, but opens up some interesting
       problems as well.  In 0.0.10, it made the time go from 24.6 to 23.9
       seconds to change the calls in to and from canon.  It should be safe
       there, as we're passing in a reference to our local copy of the users
       data, but I'd want to do some more tests to make sure.

>      Do we force them to call &format, or set a default?  Right now, we
       set a default.  If we remove the default, then we must check in
       open that the format is loaded.  And hope they don't call read without
       opening a file...

>      Right now we do:
	      tocanon:        ics/ifm -> ccs/ifm -> ccs/cfm
	      fromcanon:      ccs/cfm -> ccs/ofm -> ocs/ofm

       Consider:
	      tocanon:        ics/ifm -> ics/cfm -> ccs/cfm
	      fromcanon:      ccs/cfm -> ocs/cfm -> ocs/ofm

       where cs=character set, fm=format, i=input, o=output, c=canonical.

         The advantage of our current system is that the format routines are
       independent of the character set used.  This is a big plus when we
       want to program tib, which can then reuse most of the refer routines.
         The disadvantage is that some formats can be better written if they
       can see their special characters.
          My feeling as of 0.2.2 is that the current system is what we ought
       to do.

>      Complexity option, standardized.  We should have some sort of scale:
	         0    Stupid as can be
	        10    Simple -- assume best case
	       100    Optimistic -- formatted correctly, but complicated
	      1000    Pessimistic -- Try to handle anything thrown at you
       Maybe a scale that can be returned which has an indication of how
       complex of stuff can be handled.  This might be useful when deciding
       whether to make to in->canon or the canon->out converter use
       heuristics.

>      Character sets are second-class.  For instance, we allow special
       converters for formats, but not for charsets.  We don't allow options
       to be passed to them.  Setting the correct charset is flakey.
       (version 0.0.12 looks like it improved internal handling, but they're
       still 2nd class)

>      Check for variable suicide.  We use fmt and cset a lot.

>      Add a new function, 'formatfunction' that calls a special function
       for that format.  Some formats may wish to export functions that
       might be useful, such as validation checkers.

>      What if we have a special converter, but no information on the formats
       involved?  Could we somehow set that up so we could just call convert
       and be happy?

>

