languages/regex: Compile regular expressions into optimized bytecode.

NOTE: This Is Not The Matching Engine You're Looking For. See the
bottom of this readme, in the section "WHAT ABOUT P6GE?"

RUNNING
=======

To run a bunch of tests:

  make test

To compile an arbitrary regexp down to code:

  perl regex.pl '(a*|b)a'

(Note that on some platforms, including Windows, you may need to use
double quotes in place of the single quotes.)

To turn off optimization so you tell what's going on a little better:

  perl regex.pl --no-optimize '(a*|b)a'

Actually, there are two optimization phases: one for tree-based
optimizations (reorganizing the parse tree, adding and deleting nodes)
and list-based optimizations (really just peephole optimizations). You
can enable just one or the other using 't' for Tree and 'l' for List:

  perl regex.pl --optimize=t '(a*|b)a'
  perl regex.pl --optimize=l '(a*|b)a'

To do the same thing, only generating perl5 backend code instead of
PIR code:

  perl regex.pl --language=perl5 '(a*|b)a'

For development, it may also be useful to dump out the tree (remember,
by default it will be the optimized tree):

  perl regex.pl '(a*|b)a' dump

Or you can have it render the tree back to a (Perl5 syntax) regular
expression:

  perl regex.pl '(a*|b)a' render

All of this really ought to be in a usage message.

To run any of the generated code, you need the Match and MatchRange
PMCs. They should be generated automatically now by the root makefile,
but if for some reason they aren't, run 'make' in src/dynpmc/
to generate them.

The above commands only generate subroutines for matching regular
expressions. You will need to provide your own code to call the
matching routine with some input. Or, just to try it out, you can use
the --main argument to regex.pl to generate a sample driver program
that accepts an input string on the command line:

  perl regex.pl --main -o mytest.pir 'a*x'
  ../../parrot mytest.pir baax
 (output)
  Match found
  0: 1..3

Or for the Perl5 backend:

  perl regex.pl --language=perl5 --main -o mytest.pl 'a*x'
  perl mytest.pl baax

Note that the output in this case is a bunch of nasty Perl data
structures encoding the matching information.

New stuff: now you can use the compiler as an embedded Parrot
compiler. Run 'make regex-compiler.pbc' to generate
regex-compiler.pbc. Then run ../../parrot 01_basic.pir, ../../parrot
02_date.pir, etc. These will dynamically load in the regex-compiler as
the compiler for the "regex" language, and then use it to compile some
code and execute it.

Also, I have hacked multiple rule definitions into the silly frontend
parser. I don't intend to stick with this syntax for long, since I
want to get to a point soon where I can bootstrap this engine's own
parser, but for now you can define rules with '&rulename=...'. So to
compile a simple expression parser, put this in a file 'expr.rx':

  &expr = <?term> \* <?expr>
        | <?term> \/ <?expr>
        | <?term>

  &term = <?atom> \+ <?term>
        | <?atom> \- <?term>
        | <?atom>

  &atom = [0-9]+
        | \( <?expr> \)

  &default = <?expr>

and run (on Unix):

  perl regex.pl --main --language=perl5 --file=expr.rx > expr.pl
  perl expr.pl '3+(45*2)'

All of the <?foo> things are there so that the rules get nicely
captured -- by default, nested rule calls are *not* capturing.

You can also insert actions (coded in Perl5, or the language of
whichever backend you are using), but currently there is no nice way
of setting the value of a rule, so they are of limited usefulness. An
example:

  perl regex.pl --language=perl5 --main 'a* { print "Matched some as\n" } b'

In some cases, you can also access the current state of the match:

  perl regex.pl --language=perl5 --main 'a* { print "Matched $MATCH{0}\n" } b'

The first example should now also work with --language=pir (the
default) as well. The second example wouldn't, because it isn't valid
PIR.

Examples of using regex.pl to generate code that is compiled on the
fly are in 01_basic.pir and 02_date.pir.

If you use the -d flag, you'll get a very verbose description of the
matching progress as you run.

STATUS
======

Everything should be more or less working, though many operators are
untested. Theoretically, this release should support (using Perl5
syntax here):

 RS      - sequences
 R|S     - alternation
 R*      - greedy Kleene closure
 R*?     - nongreedy (parsimonious) Kleene closure
 R?      - greedy optional
 R??     - nongreedy (parsimonious) optional
 R+      - greedy one or more ("more or one")
 R+?     - nongreedy one or more
 (R)     - capturing groups
 (?:R)   - noncapturing grouping
 a       - codepoint literals
 [a-z]   - character classes
 {n,m}   - greedily match n..m times
 {n,m}?  - nongreedily match n..m times
 <rule>  - rule invocation (Perl6 feature)
 (?{ })  - embedded code (just {...} in Perl6)

Missing perl5 features:
 \w      - named character classes
 \U      - arbitrary codepoints
 \p      - properties
 \b      - word boundary
 /ism    - flags
 \1      - back references
 \G      - start position
 (?=R)   - look-ahead assertion
 (?!R)   - negative look-ahead assertion
 (?<=)   - look-behind assertion
 (?<!)   - negative look-behind assertion
 (??{ }) - match-time evaluated subexpression
 (?>R)   - independent subexpression
 (?(cond)R|S)
         - conditional expression
 (R?)*   - empty match suppression

Missing perl6 features:
:
::
:::
<commit>
<cut>
hypotheticals - code refs with two entry points (enter and backtrack)
@var     - run-time alternatives
<@var>   - run-time alternative regexps (requires run-time regex compiler)
%var     - hash key alternatives
<%var>   - as above, but values are always regexps
<!...>   - negation (what is this?) (is this just negative lookahead?)
...      - and many, many, many more

Other features I'd like to have:
 @A =~   - array matching
 ?       - reentrant, suspendable, coroutinish regexes
 ?       - two-dimensional regexes

Regular expressions are compiled down to regular opcodes (not the rx_*
set of opcodes). A *Array PMC of some sort is used as the backtracking
stack. See rx.ops for a good description of how operators are
converted to code sequences (except I don't use the rx_* ops that it
describes). Marks are the value -1; indices are nonnegative integers.
(Except in debugging mode, when marks are instead strings describing
what they're marking.) For details on exactly how things are
translated with this compiler, read Rewrite.pm.

"Local" variables:
 * One integer temporary, not preserved between tree op rewrites
      (it's just a very short-term temp register)
 * The current position within the input string
 * The length of the input string
 * A Match PMC, for holding start and stop positions for matched groups
 * Other stuff, but this list is hopelessly out of date at the moment

Optimizations implemented (notation: parentheses here are non-capturing):

 aR|aS    -> a(R|S)
 Rb|Sb    -> (R|S)b
 R|       -> R?
 |R       -> R??

Future plans:

Relatively soon, I would like to add array-based regular
expressions. A simple cut of this should be nearly trivial.

Near-term optimizations planned:

 Simple subexpression alternation: the code for alternations can be
 simplified if the subexpressions do not contain backtrack points.

 Disjunctive alternation: if you see R|S, and know that only one of R
 or S will ever hold at a given point in any input, then no
 backtracking information needs to be kept. For example, consider
 cat|fish (or somewhat more generally, cR|fS). The input cannot both
 start with c and f, so just matching 'c' first. If it matches, keep
 it and never go back to trying 'f'. Otherwise, forget about it
 completely and try 'f'.

 As a follow-on to the above, implement jump tables.
    c    -> $start_R
    f    -> $start_S
    else -> backtrack

 Length check suppression: if you try to match /abc/, the unoptimized
 engine will check to be sure there's at least one character left,
 then match 'a', then check to be sure there's still at least one
 character left, then match 'b', etc. The tree optimization pass will
 transform this into: check if there are at least 3 characters left,
 then match 'a', then match 'b', then match 'c'. Obviously, this
 particular case might be further optimized by doing some kind of
 fixed substring match (or a Boyer-Moore "skip a number of characters
 determined by the input and the fixed substring"), but this
 optimization works for more general cases (such as
 C<< a.[0-9](x|\w) >>).

Longer-term optimization vague ideas:

 Find maximal subtrees of regex ops that can be converted to DFAs.
 Translate them into in-line DFAs. The jump tables above are a
 primitive form of this. The hard part is figuring out whether a DFA
 would produce exactly the same results as an NFA for a given
 expression. (It's a cost/benefit game. Some expressions trivially
 behave the same, some trivially behave differently, and some are
 difficult to determine.)

BUGS
====

- The miserable state of documentation
- The ad-hoc list op data structure
- The lack of complex test cases

DEVELOPER NOTES
===============

If you make changes to Grammar.y, you'll need Parse::Yapp to
regenerate Grammar.pm. Run 'make' with no options to pass the correct
command-line parameters.

For a painfully detailed, blow-by-blow description of the latter part
of the development of this compiler, see
http://0xdeadbeef.net/wiki/wiki.pl?FinkBlog

Original author: Steve Fink <steve@fink.com>

WHAT ABOUT P6GE?
================

This rule compiler is not the officially blessed one. That would be
Patrick Michaud's PGE (Parrot Grammar Engine) , which you can find in
compilers/pge. This engine (in languages/regex, referred to in the
following text as l/rx) predates pge by a year or two, but never
managed to generate sufficient interest to get anyone else involved.

I am assuming that PGE is going to get the momentum of the community
behind it, so I would advise anyone interested in working on a rule
engine to look there first. (Look for discussion on the
perl6-internals and perl6-compiler mailing lists.) However, I still
intend to work on this engine for a while longer, and welcome any
interested participants. (Send any requests/comments/suggestions
either to perl6-internals or directly to me at steve@fink.com.) So
far, I have only briefly looked at PGE, but I think this
languages/regex engine has enough of a different approach that it is
still valuable for gathering lessons -- and may still make the most
sense in the long run.

My take on the comparison between l/rx and PGE as of the first public
release of PGE:

It sounds like l/rx handles pretty much exactly the same things as
PGE, probably a few more and a few less. I haven't actually looked at
the code, but from the description I'd guess that the main differences
are:
 
  - PGE is implemented in C; l/rx in Perl5. Both are reaching towards
    the "bootstrap point", when they'll be implemented in PIR.
  - PGE generates PIR; l/rx has both PIR and Perl5 backends
  - PGE uses coroutines and continuations; I have always been too
    wary of their stability, so I use plain subs (with a 'mode'
    parameter to tell it whether to try to match or backtrack)
  - Both allow you to "continue" a match to find all other possible
    matches, but that's really because it falls out of any sane
    implementation (you have to keep all that state around somehow
    anyway)
  - l/rx uses match objects (src/dynpmc/match.pmc) and automatically
    generates a parse tree out of them. PGE has a dynamically created
    Parrot class "PGE::Match" that I assume does something similar.
  - PGE has built-in "dump out the matching info" routines; I make my
    test harnesses generate their own. I'm jealous.
  - The feature sets are nearly identical. Makes sense, I suppose --
    low-hanging fruit and all that.
  - It sounds like the internal design is rather different. I try hard
    to compile down to very minimalistic PIR ops. It sounds like PGE
    uses lots of higher-level operations, to do things like processing
    a whole chunk of input at a time. (Although on the other hand, PGE
    uses more native Parrot flow control mechanisms than I do.) (And
    I really haven't looked closely enough to substantiate this.)
  - Closely related to the above, I have a number of optimizations
    already implemented, but I suspect PGE will end up with a very
    different set of optimizations.
