2008-02-09

The slowest regexp engine in the world

Much has been (and still is being) written about the speed and efficiency of regular expression matching. I find this sort of competition a somewhat trivial exercise in optimization. Writing the slowest regexp engine, that's a challenge!

A regular expression is an expression that compactly denotes a regular language. A regular language, in turn, is a set LS of strings over an alphabet S such that the test for membership of a string w in L can be carried out by a finite automaton (from now on I'll imply the alphabet and drop the subscript.)

In my quest for inefficiency, I will denote regular languages by extension. In other words, a regular language L will be exactly the set of the strings belonging to it. This set may be very large, and in fact it is not unexpected for it to be infinite; so I will need lazy lists or sequences:

type 'a node = Nil | Cons of 'a * 'a seq
and  'a seq  = 'a node Lazy.t

The simplest of all regular languages is the empty language:

let nil = lazy Nil

It has no strings. The next simplest language is the one containing just the simplest of strings, the empty one:

let eps = lazy (Cons ("", nil))

Any member c of the alphabet S can be regarded as a string. I can form regular languages from characters, too:

let sym a = lazy (Cons (String.make 1 a, nil))

Now, even though it is a start, I can't do much with this. Given regular languages L and L', I can form the union M = LL', where a string belongs to it if it belongs either to L or to L'.

The standard way to represent sets as lists is to keep them sorted somehow, so that no element appears more than once. Then set union is a so-called natural merge on these lists. I will sort strings not simply lexicographically, since that would put the string "aaa…" formed by an arbitrary number of a's before the string "b", for instance. I'll instead sort shorter strings first, and strings of equal length in lexicographical order:

let shortlex s t =
  Pervasives.compare (String.length s, s) (String.length t, t)

let rec alt r r' =
  let step = function
  | Nil        , r'            -> r'
  | r          , Nil           -> r
  | Cons (h, t), Cons (h', t') ->
    let c = shortlex h h' in
    if c < 0 then Cons (h , alt t r') else
    if c > 0 then Cons (h', alt r t') else
    Cons (h, alt t t')
  in lazy (step (Lazy.force r, Lazy.force r'))

I must be careful in executing the reductions, forcing just as much as needed. Now I can represent a simple language, for instance ε|a, where ε denotes the empty string:

# let r = alt eps (sym 'a') ;;
val r : string seq = <lazy>

Well, regular sets are opaque. I need a way to extract some values from them, for instance out to a list:

let rec take n s =
  if n = 0 then [] else match Lazy.force s with
  | Nil -> []
  | Cons (h, t) -> h :: take (pred n) t

Now, let's see what r contains:

# take 10 r ;;
- : string list = [""; "a"]

As is to be expected. Operating on sequences is straightforward, apart from having to administer laziness carefully. For instance, mapping a function over a sequence is analogous to its list counterpart:

let rec map f s =
  let step = function
  | Nil         -> Nil
  | Cons (h, t) -> Cons (f h, map f t)
  in lazy (step (Lazy.force s))

Given regular languages L and L', I can form the concatenation M = L · L', where a string belongs to it if it can be decomposed in a prefix belonging to L and a suffix belonging to L'. In other words, the strings of M are the concatenation of all the strings of L with those of L'. This Cartesian shuffle of sequences is somewhat tricky. Fortunately, union distributes over concatenation; so that (v :: s)·(w :: t) = vw :: ([vts·(w :: t)):

let rec cat r r' =
  let step = function
  | Nil, _     | _, Nil        -> Nil
  | Cons (h, t), Cons (h', t') ->
    Cons (h ^ h', alt (map ((^) h) t') (cat t r'))
  in lazy (step (Lazy.force r, Lazy.force r'))

This inlines the concatenation of the singleton set [v] directly as a map. Let's test how it works:

# take 10 ( cat (alt (sym 'a') (sym 'b')) (alt eps (sym 'c'))) ;;
- : string list = ["a"; "b"; "ac"; "bc"]

Up to now, there is no way to represent anything other than (very) small languages. Given a regular language L, I can form the reflexive-transitive closure M = L*, where a string belongs to it if it is formed by zero or more concatenations of one of the strings in L. In other words, M is the smallest set M containing the empty string ε and the concatenation of L with M itself:

let rec clo r =
  let rec step = function
  | Nil          -> Cons ("", nil)
  | Cons ("", t) -> step (Lazy.force t)
  | _            -> Cons ("", cat r (clo r))
  in lazy (step (Lazy.force r))

The closure of the null regular language is the empty one. If the regular language being closed already contains the empty string, it must keep it and close over the rest. Otherwise, it contains the empty string and the concatenation of every one of its strings with the closure. Let's try it:

# take 10 (cat (clo (sym 'a')) (sym 'b')) ;;
- : string list =
["b"; "ab"; "aab"; "aaab"; "aaaab"; "aaaaab"; "aaaaaab"; "aaaaaaab";
 "aaaaaaaab"; "aaaaaaaaab"]

It works, and it is clear that the shortlex ordering is instrumental in generating the languages. Now, what I've accomplished up to here is to devise an effective enumeration of the regular language denoted by a regular expression. This incidentally proves constructively that regular languages are recursively enumerable. To match a string against a regular expression amounts to testing it for membership in the regular set denoted by the expression:

let rec matches s r = match Lazy.force r with
| Nil         -> false
| Cons (h, t) ->
  let c = shortlex s h in
  c = 0 || c > 0 && matches s t

I must be careful to use the well-ordering on the regular language; otherwise membership becomes semi-decidable, taking literally forever to look for a string that is simply not there. Now let me be realistic: I'll try to match my e-mail address (not really). I need a way to compactly generate a character set:

let rec set x y =
  if x > y then nil else
  alt (sym x) (set (Char.chr (succ (Char.code x))) y)
 

Now the set of names is comprised of the concatenation of one or more alphabetic characters (it could be more complete, but that would be unwieldy):

let rep r = cat r (clo r)

let name = rep (set 'a' 'z')

The ten-thousandth name is still four letters long! Now the set of emails is simply:

let email = cat name (cat (sym '@') (cat name (rep (cat (sym '.') name))))

It (sort of) works:

# take 10 email ;;
- : string list =
["a@a.a"; "a@a.b"; "a@a.c"; "a@a.d"; "a@a.e"; "a@a.f"; "a@a.g"; "a@a.h";
 "a@a.i"; "a@a.j"]

I'll leave the test to you.

1 comment:

Mark Dominus said...

There's a very similar regex engine on pages 272­–287 of Higher-Order Perl. The main point of the section is to devlop a system which can enumerate the strings matched by a regex, but the author observes on page 286 that this can be used to actually do the matching.

I may write a blog entry comparing the two implementations, since they're essentially the same program.