languages/regex: Compile regular expressions into optimized bytecode.

RUNNING
=======

To run a bunch of tests:

  make test

To compile an arbitrary regexp down to code:

  perl regex.pl '(a*|b)a'


STATUS
======

Everything should be more or less working, though many operators are
untested. In theory, this release supports:

 RS      - sequences
 R|S     - alternation
 R*      - greedy Kleene closure
 R*?     - nongreedy/parsimonious Kleene closure
 R?      - greedy optional
 R??     - nongreedy/parsimonious optional
 R+      - greedy one or more (more or one?)
 R+?     - nongreedy one or more
 (R)     - capturing groups
 (?:R)   - noncapturing grouping
 a       - codepoint literals
 R{n,m}  - greedily match n..m times
 R{n,m}? - nongreedily match n..m times

Missing perl5 features:
 [...]   - character classes
 \w      - character classes
 \U      - arbitrary codepoints
 \p      - properties
 \b      - word boundary
 /ism    - flags
 \1      - back references
 \G      - start position
 (?=R)   - look-ahead assertion
 (?!R)   - negative look-ahead assertion
 (?<=R)  - look-behind assertion
 (?<!R)  - negative look-behind assertion
 (?{ })  - embedded code
 (??{ }) - match-time evaluated subexpression
 (?>R)   - independent subexpression
 (?(cond)R|S)
         - conditional expression
 (R?)*   - empty match suppression

Other features I'd like to have:
 (?1)    - recursive expressions
 @A =~   - array matching
 ?       - reentrant, suspendable, coroutinish regexes
 ?       - two-dimensional regexes

Regular expressions are compiled down to regular opcodes, not to the
rx_* set of opcodes. P0 and P1 are PerlArrays containing the starting
and ending indexes, respectively, of () groups. The user stack is used
as the backtracking stack. See rx.ops for a good description of how
operators are converted to code sequences. (This compiler does not use
the rx_* ops described there, but the translation to regular opcodes
is easy to work out.) Marks are the value -1; indices are nonnegative
integers, except in debugging mode, where marks are instead strings
describing what they are marking.
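
A minimal sketch, in Python rather than Parrot assembly, of how a
mark-delimited backtracking stack can drive a match, here for the
hypothetical pattern (a*)b. The function name and the use of plain
lists for the stack and for P0/P1 are assumptions made for
illustration; this is not the code the compiler emits.

```python
MARK = -1  # marks are -1; saved indices are always nonnegative


def match_group_star_then_literal(text, star_char, final_char):
    """Match (star_char*)final_char anchored at position 0.

    Returns (starts, ends, match_end) or None. starts/ends stand in for
    P0/P1 (assumption: one entry per capturing group).
    """
    stack = [MARK]  # the mark sits below this construct's saved indices
    pos = 0
    # Greedy closure: save every position where the loop could stop.
    while True:
        stack.append(pos)
        if pos < len(text) and text[pos] == star_char:
            pos += 1
        else:
            break
    # Try the continuation from the most recently saved index first;
    # each failure pops the next older index (that is, backtracks).
    while True:
        pos = stack.pop()
        if pos == MARK:
            return None  # hit the mark: nothing left to retry
        if pos < len(text) and text[pos] == final_char:
            return [0], [pos], pos + 1
```

With the input "aaa" and the pattern (a*)a, the continuation is tried
at index 3 first, fails, and then succeeds after backing up to index
2, leaving the group spanning 0..2.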

Optimizations implemented (notation: parentheses here are non-capturing):

 aR|aS    -> a(R|S)
 R|       -> R?
 |R       -> R??
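
The three rewrites above could be sketched as follows. This is an
illustration in Python, not the compiler's own code: the
representation (a sequence is a list of items, and the tuples "alt"
and "opt" stand for alternation and optional nodes) and the function
name are assumptions.

```python
def simplify_alternation(left, right):
    """Rewrite the alternation left|right, where each side is a list
    of items, into an equivalent and cheaper sequence of items."""
    if left == right:
        return list(left)                      # R|R -> R
    if not right:
        return [("opt", left, "greedy")]       # R|  -> R?
    if not left:
        return [("opt", right, "nongreedy")]   # |R  -> R??
    # aR|aS -> a(R|S): count the leading items both sides share.
    n = 0
    while n < len(left) and n < len(right) and left[n] == right[n]:
        n += 1
    if n == 0:
        return [("alt", left, right)]          # nothing to factor out
    # Keep the shared prefix once, then simplify what remains.
    return left[:n] + simplify_alternation(left[n:], right[n:])
```

Because the order of the two branches is preserved, the rewritten
expression tries alternatives in the same order as the original. For
example, a|ab becomes a followed by a nongreedy optional b.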

Future plans:

Primary plan is to stop messing up the user stack! Right now, the
compiler is generating invalid subroutines because they don't put
things back the way they found them. I'm still undecided on whether to
clean up the stack when done (not so good for reentrant regexen), make
a new PMC, or reuse array.pmc or perlarray.pmc. Or a hybrid -- use the
user stack, but stuff it into a PMC before returning.

Relatively soon, I would like to add array-based regular
expressions. A simple cut of this should be nearly trivial.

I'd really like to put in recursive regular expressions. See
<http://www.puffinry.freeserve.co.uk/regex-extension.html>.

Near-term optimizations planned:

 Length-check suppression: /xyz/ should check the length of the string
 once, not three times.
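
 A rough Python sketch of the difference, assuming a literal is
 matched one character at a time. Both function names and the
 "checks" counter are hypothetical and exist only to make the number
 of end-of-input tests visible.

```python
def match_literal_naive(text, pos, literal):
    """One end-of-input test per character of the literal."""
    checks = 0
    for ch in literal:
        checks += 1
        if pos >= len(text):       # end-of-input test, repeated each time
            return None, checks
        if text[pos] != ch:
            return None, checks
        pos += 1
    return pos, checks


def match_literal_checked_once(text, pos, literal):
    """A single length test up front covers the whole literal."""
    checks = 1
    if len(text) - pos < len(literal):
        return None, checks
    for ch in literal:
        if text[pos] != ch:        # safe: the length was already verified
            return None, checks
        pos += 1
    return pos, checks
```

 Both functions return the new position (or None) together with the
 number of length tests performed, so matching "xyz" against "xyz"
 costs three tests in the first version and one in the second.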

 Simple subexpression alternation: the code for alternations can be
 simplified if the subexpressions do not contain backtrack points.

 Disjunctive alternation: if you see R|S, and know that only one of R
 or S will ever hold at a given point in any input, then no
 backtracking information needs to be kept. For example, consider
 cat|fish (or somewhat more generally, cR|fS). The input cannot start
 with both c and f, so just try matching 'c' first. If it matches, keep
 it and never go back to trying 'f'. Otherwise, forget about it
 completely and try 'f'.

 As a follow-on to the above, implement jump tables keyed on the next
 input character. For cR|fS:
    c    -> $start_R
    f    -> $start_S
    else -> backtrack
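
 A small Python sketch of the jump-table idea for cat|fish, assuming
 the branches are plain literals with distinct first characters. The
 dict stands in for the jump table, and the function names are
 hypothetical.

```python
def match_literal(text, pos, literal):
    """Return the position after literal if it occurs at pos, else None."""
    return pos + len(literal) if text.startswith(literal, pos) else None


def match_disjoint_alt(text, pos, table):
    """Dispatch on the next input character. No backtrack entry is
    saved, because at most one branch can match at this position."""
    if pos >= len(text):
        return None                  # end of input: backtrack
    literal = table.get(text[pos])
    if literal is None:
        return None                  # "else" entry: backtrack
    return match_literal(text, pos, literal)


# Jump table for cat|fish: first character -> branch to run.
CAT_OR_FISH = {"c": "cat", "f": "fish"}
```

 With this table, match_disjoint_alt("fishing", 0, CAT_OR_FISH) goes
 straight to the fish branch without first trying cat.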

 Multi-character literals: currently, "abc" expands to "match a then
 match b then match c". I don't plan to do a substring match anytime
 soon, but I would like to eliminate two of the three end-of-input tests.

Vague ideas for longer-term optimizations:

 Find maximal subsequences of regex ops that can be converted to
 DFAs. Translate them into in-line DFAs. The jump tables above are a
 primitive form of this. The hard part is figuring out whether a DFA
 would produce exactly the same results as an NFA for a given
 expression. (It's a cost/benefit game. Some expressions
 trivially behave the same, some trivially behave differently, and some
 are difficult to determine.)
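
 One expression that trivially behaves differently is a|ab. The
 sketch below uses Python's re module purely as a stand-in
 backtracking matcher (an assumption for illustration; it is not this
 compiler) and compares it with the longest match that a DFA scanning
 for the longest accepted prefix would report. The helper names are
 hypothetical.

```python
import re


def backtracking_match(pattern, text):
    """What a backtracking matcher reports: the first alternative
    that succeeds, in left-to-right order."""
    m = re.match(pattern, text)
    return m.group(0) if m else None


def longest_match(pattern, text):
    """Longest prefix of text accepted by the pattern, found by trying
    prefixes from longest to shortest."""
    for end in range(len(text), -1, -1):
        if re.fullmatch(pattern, text[:end]):
            return text[:end]
    return None
```

 On the input "ab", backtracking_match("a|ab", "ab") yields "a" while
 longest_match("a|ab", "ab") yields "ab". For cat|fish the two always
 agree, which is what makes that kind of expression a safe candidate.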

BUGS
====

Trashing the user stack is a bug.

The miserable lack of documentation is a bug.

The ad-hoc list op data structure is a bug.

The lack of complex test cases is a bug.

DEVELOPER NOTES
===============

If you make changes to Grammar.y, you'll need Parse::Yapp to
regenerate Grammar.pm. Running 'make' with no arguments does the
regeneration with the correct command-line parameters.

Original author: Steve Fink <steve@fink.com>
