languages/regex: Compile regular expressions into optimized bytecode.

RUNNING
=======

To run a bunch of tests:

  make test

To compile an arbitrary regexp down to code:

  perl regex.pl '(a*|b)a'

To turn off optimization so you can tell what's going on a little better:

  perl regex.pl --no-optimize '(a*|b)a'

Actually, there are two optimization phases: one for tree-based
optimizations (reorganizing the parse tree, adding and deleting nodes)
and one for list-based optimizations (really just peephole
optimizations). You can enable just one or the other using 't' for
Tree and 'l' for List:

  perl regex.pl --optimize=t '(a*|b)a'
  perl regex.pl --optimize=l '(a*|b)a'

For development, it may also be useful to dump out the tree (remember,
by default it will be the optimized tree):

  perl regex.pl '(a*|b)a' dump

Or you can have it render the tree back to a regular expression:

  perl regex.pl '(a*|b)a' render

All of this really ought to be in a usage message.

STATUS
======

Everything should be more or less working, though many operators are
untested. Theoretically, this release should support:

 RS      - sequences
 R|S     - alternation
 R*      - greedy Kleene closure
 R*?     - nongreedy/parsimonious Kleene closure
 R?      - greedy optional
 R??     - nongreedy/parsimonious optional
 R+      - greedy one or more ("more or one")
 R+?     - nongreedy one or more
 (R)     - capturing groups
 (?:R)   - noncapturing grouping
 a       - codepoint literals
 [a-z]   - character classes
 {n,m}   - greedily match n..m times
 {n,m}?  - nongreedily match n..m times

Missing perl5 features:
 \w      - named character classes
 \x{}    - arbitrary codepoints
 \p      - properties
 \b      - word boundary
 /ism    - flags
 \1      - back references
 \G      - start position
 (?=R)   - look-ahead assertion
 (?!R)   - negative look-ahead assertion
 (?<=)   - look-behind assertion
 (?<!)   - negative look-behind assertion
 (?{ })  - embedded code
 (??{ }) - match-time evaluated subexpression
 (?>R)   - independent subexpression
 (?(cond)R|S)
         - conditional expression
 (R?)*   - empty match suppression

Missing perl6 features:
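 :        - do not backtrack into the preceding atom
 ::       - on backtrack, fail the enclosing group of alternatives
 :::      - on backtrack, fail the enclosing rule
 <commit> - on backtrack, fail the entire match
 <cut>    - like <commit>, but also discards the input matched so far
 hypotheticals - code refs with two entry points (enter and backtrack)
 @var     - run-time alternatives
 <@var>   - run-time alternative regexps (requires run-time regex compiler)
 %var     - hash key alternatives
 <%var>   - as above, but values are always regexps
 <!...>   - negation (open question: is this just negative lookahead?)
 <rule>   - rule invocation
 ...      - and many, many, many more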

Other features I'd like to have:
 (?1)    - recursive expressions
 @A =~   - array matching
 ?       - reentrant, suspendable, coroutinish regexes
 ?       - two-dimensional regexes

Regular expressions are compiled down to ordinary opcodes (not the
rx_* set of opcodes). P0 and P1 are PerlArrays containing the starting
and ending indexes, respectively, of () groups. The user stack is used
as the backtracking stack. See rx.ops for a good description of how
operators are converted to code sequences (except I don't use the rx_*
ops that it describes). Marks are the value -1; indices are
nonnegative integers. (Except in debugging mode, when marks are
instead strings describing what they're marking.) For details on
exactly how things are translated with this compiler, read Rewrite.pm
and Rewrite/Stackless.pm.
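
To illustrate the mark/index convention only, here is a minimal Python
sketch of a backtracking stack that is unwound to the nearest mark. It
is not the generated code: pop_to_mark is a hypothetical helper, and how
Rewrite.pm actually uses marks may differ.

```python
MARK = -1  # marks are -1; saved indices are always nonnegative


def pop_to_mark(stack):
    """Pop entries down to and including the nearest mark.

    Returns the indices that were discarded, most recent first.
    (Hypothetical helper, for illustration only.)
    """
    discarded = []
    while stack:
        top = stack.pop()
        if top == MARK:
            break
        discarded.append(top)
    return discarded


stack = [MARK, 0, 1, MARK, 2, 3]
inner = pop_to_mark(stack)  # removes 3, 2 and the inner mark
```

Because indices can never be negative, a mark can share the stack with
saved positions without any extra tagging.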

Register Usage:
 I0 - temporary, not preserved between tree op rewrites
      (it's just a very short-term temp register)
 I1 - current position within the input string
 I2 - length of the input string
 P0 - array of start indices for parenthesis matches
 P1 - array of end indices for parenthesis matches
 I3.. - callee-saved temporary registers
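
As a sketch of how the two index arrays could be consumed after a match
(Python, for illustration): the code below assumes that end indices are
exclusive and that entry 0 describes the whole match. Neither
assumption is confirmed by this README, and the index values are made
up.

```python
def captures(text, starts, ends):
    """Return the substring for each group, given parallel index arrays.

    Assumes ends[i] is one past the last matched character of group i.
    """
    return [text[start:end] for start, end in zip(starts, ends)]


# Hypothetical result of matching (a*|b)a against "aab":
# group 0 (assumed to be the whole match) spans 0..2, group 1 spans 0..1.
groups = captures("aab", [0, 0], [2, 1])
```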

Optimizations implemented (notation: parentheses here are non-capturing):

 aR|aS    -> a(R|S)
 Rb|Sb    -> (R|S)b
 R|       -> R?
 |R       -> R??
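
The first rewrite can be sketched in Python on regex source text rather
than on the parse tree. This is an illustration under a strong
assumption: both branches are plain alphanumeric literals. factor_prefix
is a hypothetical name, not a function in this compiler.

```python
import re


def factor_prefix(left, right):
    """Rewrite 'aR|aS' as 'a(?:R|S)' when both branches share a first character.

    Assumes left and right are plain alphanumeric literals.
    """
    if left and right and left[0] == right[0] and left[0].isalnum():
        return f"{left[0]}(?:{left[1:]}|{right[1:]})"
    return f"{left}|{right}"


def same_language(pattern_a, pattern_b, samples):
    """Check that two patterns accept exactly the same sample strings."""
    return all(
        bool(re.fullmatch(pattern_a, s)) == bool(re.fullmatch(pattern_b, s))
        for s in samples
    )


factored = factor_prefix("ab", "ac")
```

The factored form tests the shared character once, where the original
tests it again after backtracking into the second branch.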

Future plans:

test.pl generates stack cleanup code, but regex.pl does not. I'm still
not sure exactly where that code ought to go, especially since the
ability to resume a match will be critical once I really start on
Perl6 regular expressions.

Relatively soon, I would like to add array-based regular
expressions. A simple cut of this should be nearly trivial.

I'd really like to put in recursive regular expressions. See
<http://www.puffinry.freeserve.co.uk/regex-extension.html>.
UPDATE: Looks like Larry liked the idea too. Perl6 has 'em.

Near-term optimizations planned:

 Simple subexpression alternation: the code for alternations can be
 simplified if the subexpressions do not contain backtrack points.

 Disjunctive alternation: if you see R|S, and know that only one of R
 or S can ever match at a given point in any input, then no
 backtracking information needs to be kept. For example, consider
 cat|fish (or somewhat more generally, cR|fS). The input cannot start
 with both c and f, so just try matching 'c' first. If it matches, keep
 it and never go back to trying 'f'. Otherwise, forget about it
 completely and try 'f'.

 As a follow-on to the above, implement jump tables.
    c    -> $start_R
    f    -> $start_S
    else -> backtrack
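
 The dispatch for cat|fish might look roughly like this in Python. It
 is a sketch of the idea, not of the code the compiler would emit;
 match_cat_or_fish is a hypothetical name, and None stands in for
 "backtrack".

```python
def match_cat_or_fish(text, pos=0):
    """Match cat|fish at pos by dispatching on the first character.

    Returns the end position of the match, or None to signal "backtrack".
    """
    table = {"c": "cat", "f": "fish"}  # first char -> the only viable branch
    word = table.get(text[pos:pos + 1])
    if word is None:
        return None  # else -> backtrack
    # Only one branch can match here, so no backtrack point is recorded.
    return pos + len(word) if text.startswith(word, pos) else None
```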

 Multi-character literals: currently, "abc" expands to "match a then
 match b then match c". I don't plan to do a substring match anytime
 soon, but I would like to eliminate two of the three end-of-input tests.
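
 A Python sketch of the difference, with hypothetical function names:
 the first version tests for end of input before every character, the
 second tests once for the whole literal and then compares characters
 one at a time (still no substring match).

```python
def match_literal_naive(text, pos, literal):
    """Match literal at pos, testing for end of input before each character."""
    for ch in literal:
        if pos >= len(text):  # end-of-input test, repeated per character
            return None
        if text[pos] != ch:
            return None
        pos += 1
    return pos


def match_literal_one_check(text, pos, literal):
    """Match literal at pos with a single end-of-input test up front."""
    if pos + len(literal) > len(text):  # the only end-of-input test
        return None
    for offset, ch in enumerate(literal):
        if text[pos + offset] != ch:
            return None
    return pos + len(literal)
```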

Vague longer-term optimization ideas:

 Find maximal subsequences of regex ops that can be converted to
 DFAs. Translate them into in-line DFAs. The jump tables above are a
 primitive form of this. The hard part is figuring out whether a DFA
 would produce exactly the same results as an NFA for a given
 expression. (It's a cost/benefit game. Some expressions
 trivially behave the same, some trivially behave differently, and some
 are difficult to determine.)
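
 One case where the two trivially differ is a|ab. The Python sketch
 below contrasts ordered, backtracking alternation (Python's re module
 behaves this way) with longest-match selection, assuming the DFA in
 question reports the longest match. The function names are
 hypothetical and the alternatives are restricted to literal strings.

```python
import re


def leftmost_alternative(alternatives, text):
    """Backtracking semantics: the first alternative that matches wins."""
    pattern = "|".join(re.escape(alt) for alt in alternatives)
    m = re.match(pattern, text)
    return m.group(0) if m else None


def longest_alternative(alternatives, text):
    """Longest-match semantics, assumed here for a DFA-style matcher."""
    hits = [alt for alt in alternatives if text.startswith(alt)]
    return max(hits, key=len) if hits else None


nfa_result = leftmost_alternative(["a", "ab"], "ab")
dfa_result = longest_alternative(["a", "ab"], "ab")
```

 For cat|fish the two functions agree on every input, which is what
 makes the jump-table case safe.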

BUGS
====

- The miserable state of documentation
- The ad-hoc list op data structure
- The lack of complex test cases

DEVELOPER NOTES
===============

If you make changes to Grammar.y, you'll need Parse::Yapp to
regenerate Grammar.pm. Run 'make' with no arguments; it passes the
correct command-line parameters.

Original author: Steve Fink <steve@fink.com>
