2008-10-29

Extreme Recursion

For a data processing application, I needed a way to group data records together according to some criteria. This is the "reduce" phase of the map-reduce transformation, or the group-by phase of a select-project-group query.

My specific problem called for a way to group records in an input list into output sublists by testing them against the first such record considered. More specifically, the records were timestamped, and I had to group them into one-hour segments. If the leading record was timestamped on, say, 12:23:15PM October 29th, 2008, it would be the first in a group including records timestamped up to 1:23:14PM October 29th, 2008.

There's a venerable technique, the control break with advanced reading to solve this problem. I reasoned that I needed to keep track of the current group, and of the list of as-yet-unexamined records. Using accumulating parameters, the following wrote itself:

let group ~by = function
| []      -> []
| x :: xs ->
  let rec scan g gs e = function
  | []      -> List.rev ((e :: List.rev g)  ::  gs)
  | x :: xs ->
    if by e x
    then scan (x :: g) gs e xs
    else scan [] ((e :: List.rev g)  ::  gs) x xs
  in scan [] [] x xs

The grouping function is by, and an empty list contains no groups. Otherwise, the first element x in the list would be the group leader e in the "advanced read", used to find how far that group would extend with a call to scan. This function keeps track of the current group g and the list of found groups gs, and examines the input list so far. If it's empty, it puts the group leader e at the head of the current group, and outputs the result as the last group of the output list. Since the accumulating parameters are built head-first, this requires a couple of reversals. Otherwise, the current record x is tested to see if it belongs in the same group as the leader e; if so, it is added to the current group; otherwise, the current group is closed and a new one is started with that record as the new group leader. In any case, recursion is in tail position, which betrays that this algorithm is imperative and sequential in nature.

All in all, correct by design:

# group ~by:(fun e x -> x - e < 3) [1;2;3;4;5;6;7;8;9;10;11;12];;
- : int list list = [[1; 2; 3]; [4; 5; 6]; [7; 8; 9]; [10; 11; 12]]

and deeply unsatisfactory. What would a Haskeller do? That is, what would be a non-operational, divide-and-conquer truly recursive solution look like? Well, I reasoned, given a record e, split the remaining records into the group led by it and the remainder, and output the group and the result of grouping the remainder by its group leader. After some back-and-forth refactoring and a couple of tries, I came up with:

let group ~by =
  let rec split e l = match l with
  | []      -> [], []
  | x :: xs ->
    if not (by e x) then [], l else
    let g, ys = split e xs in
    x :: g, ys
  in
  let rec scan e l =
    let g, ys = split e l in
    (e :: g) :: match ys with
    | []      -> []
    | x :: xs -> scan x xs
  in function [] -> [] | x :: xs -> scan x xs

I actually sketched scan first: I supposed the function split was given as I described above, recurred on the remainder, and tacked the found group on the result. Refactoring the common e :: g that I originally wrote on both arms of the match ys with… resulted in the rather odd-looking function above. Then, writing split was relatively easy: an empty list gives an empty group and remainder, but a list with a record must be tested for membership on the current group. If not a member, the group is closed and the remainder includes that record; otherwise, split the rest and tack the record at the beginning of the found group.

In truth, I reversed that conditional on a later rewrite, because the arms of the if looked really unbalanced. I didn't think I had improved matters, however: first, OCaml, not being lazy, doesn't really benefit from using right recursion; second, this code is both longer and —to my old-fashioned, structured-programming brain— more roundabout.

But a bit of refactoring never hurts: note that the top-level match (the very last one) and that inside scan are exactly the same. In this case, factoring out that code involved a recursive call made directly to group; this made scan a non-recursive helper that I manually inlined, thus:

let rec group ~by =
  let rec split e l = match l with
  | []      -> [], []
  | x :: xs ->
    if not (by e x) then [], l else
    let g, ys = split e xs in
    x :: g, ys
  in function
  | []      -> []
  | x :: xs ->
    let g, ys = split x xs in
    (x :: g) :: group ~by ys

A guard pattern makes things a bit more functional, even to the point where I can merge two patterns into one:

let rec group ~by =
  let rec split e = function
  | x :: xs when by e x ->
    let g, ys = split e xs in
    x :: g, ys
  | l       -> [], l
  in function
  | []      -> []
  | x :: xs ->
    let g, ys = split x xs in
    (x :: g) :: group ~by ys

Which is exactly right, and what I should've come up with in the first place! For comparison, here's again the first function, with a bit of clean-up done to it:

let group ~by =
  let snoc x xs = x :: List.rev xs in
  let rec scan g gs e = function
  | []      -> snoc e g  ::  gs
  | x :: xs ->
    if by e x
    then scan (x :: g) gs e xs
    else scan [] (snoc e g  ::  gs) x xs
  in function
  | []      -> []
  | x :: xs -> List.rev (scan [] [] x xs)

A clear mongrel of a function. Refactoring functional code is not complete, I think, until the code is "just right", which sometimes is never. Hence the need to step back and rethink the whole point to the function in question; there shouldn't be a single ugly function in a program. The problem with this is that whereas I couldn't explain what I mean by "ugly" or "pretty" function, I know when I've reached aesthetic closure; and yet I'd like to know, what is the "process" by which great Haskell programmers write their code?

No comments: