2010-10-17

Text to PDF

I find that the best way for me to learn is to reexpress what I read in a different but equivalent form. Transcribing notes and translation are two applications of this heuristic. In this case, I've translated this simple and pretty Factor PDF text formatter into OCaml. The result is almost as compact, at less than 200 lines of code, thanks to the compositional nature that OCaml and Factor share, even if OCaml's library of combinators is rather poor. As a conscious style decision, I try to keep all definitions as short as possible. As usual, I start with an operator for functional composition:

let ( % ) f g x = f (g x)

PDF is essentially a textual format, so it's convenient to perform the conversion in memory by using strings. A PDF object is a string or stream tagged by a reference number:

let pdf_object n str =
  String.concat "\n" [string_of_int n ^ " 0 obj"; str; "endobj"]

These objects can be referenced later by such a number:

let pdf_ref n = string_of_int n ^ " 0 R"

PDF strings are quoted by parentheses like in Postscript, so they must be escaped. The function escape_with is somewhat crude but sufficient for the small number of escapes required:

let escape_with escapes str =
  let buf = Buffer.create (String.length str) in
  String.iter (fun c ->
    try Buffer.add_string buf (List.assoc c escapes)
    with Not_found -> Buffer.add_char buf c) str;
  Buffer.contents buf

let pdf_escapes = [
  '\008', "\\b";
  '\t'  , "\\t";
  '\n'  , "\\n";
  '\012', "\\f";
  '\r'  , "\\r";
  '('   , "\\(";
  ')'   , "\\)";
  '\\'  , "\\\\";
]

let pdf_string str = "(" ^ escape_with pdf_escapes str ^ ")"

A PDF dictionary is a list of key-value pairs with special syntax. I indulge in a bit of special-casing for the sake of better formatting:

let pdf_dict kvs =
  let buf = Buffer.create 16 in
  Buffer.add_string buf "<<";
  begin match kvs with
  | []     -> ()
  | [k, v] -> Printf.bprintf buf " /%s %s " k v
  | kvs    ->
    Buffer.add_char buf '\n';
    List.iter (fun (k, v) -> Printf.bprintf buf "/%s %s\n" k v) kvs
  end;
  Buffer.add_string buf ">>";
  Buffer.contents buf

This way short dictionaries occupy a single line. Every PDF file is annotated by metadata describing the particulars of its creation:

let iso_time () =
  let tm = Unix.gmtime (Unix.time ()) in
  Printf.sprintf "%04d%02d%02d%02d%02d%02d"
    (tm.Unix.tm_year + 1900) (tm.Unix.tm_mon + 1) tm.Unix.tm_mday
    tm.Unix.tm_hour tm.Unix.tm_min tm.Unix.tm_sec

let app_name = Filename.basename Sys.executable_name
and usr_name = try Sys.getenv "USER" with Not_found -> "unknown"

let pdf_info () =
  pdf_dict [
    "CreationDate", "D:" ^ iso_time ();
    "Producer"    , pdf_string app_name;
    "Author"      , pdf_string usr_name;
    "Creator"     , pdf_string "created with OCaml";
  ]

The pages in a PDF document form a tree rooted in a special page table referenced by a Catalog object. Here, as in the original program, the reference numbers are hardwired since the document structure is statically known:

let pdf_catalog () =
  pdf_dict [
    "Type" , "/Catalog";
    "Pages", pdf_ref 4;
  ]

All fonts in use must be similarly declared beforehand:

let pdf_font () =
  pdf_dict [
    "Type"    , "/Font";
    "Subtype" , "/Type1";
    "BaseFont", "/Courier";
  ]

As with dictionaries, arrays have their own syntax, very much like OCaml lists:

let pdf_array vs =
  let buf = Buffer.create 16
  and any = ref false in
  Buffer.add_string buf "[";
  List.iter (fun v -> Printf.bprintf buf " %s" v; any := true) vs;
  if !any then Buffer.add_char buf ' ';
  Buffer.add_string buf "]";
  Buffer.contents buf

PDF pages have definite sizes represented by a media box which serves as a coordinate system and clipping region:

let paper_size = function
| `A4     -> ( 595,  842)
| `Letter -> ( 612,  792)
| _       -> failwith "paper_size"

let media_box paper =
  let (w, h) = paper_size paper in
  pdf_array ["0"; "0"; string_of_int w; string_of_int h]

As I mentioned before, pages are rooted on a page table that forward references each child. Each page is comprised of a page dictionary and of its contents themselves, hence the need to reference every other object:

let rec table f n =
  let rec go i =
    if i = n then [] else f i :: go (succ i)
  in go 0

let pdf_pages ?(paper=`Letter) npages =
  pdf_dict [
    "Type"    , "/Pages";
    "MediaBox", media_box paper;
    "Count"   , string_of_int npages;
    "Kids"    , pdf_array (table (fun i -> pdf_ref (5 + 2 * i)) npages);
  ]

The page dictionary links back to the page table it depends on, the contents that follow it and any resources it needs:

let pdf_page pageno =
  pdf_dict [
    "Type"     , "/Page";
    "Parent"   , pdf_ref 4;
    "Contents" , pdf_ref (succ pageno);
    "Resources", pdf_dict ["Font", pdf_dict ["F1", pdf_ref 3]];
  ]

The page contents are represented by a PDF stream. In this case, I use the simplest formatting I can get away with:

let pdf_stream str =
  String.concat "\n" [
    pdf_dict ["Length", string_of_int (String.length str + 1)];
    "stream";
    str;
    "endstream"
  ]

Textual content is a kind of PDF stream comprised of text layout commands. Lines of text are shipped out by the ' (quote) operator one by one, after defining the characteristics (origin, font, etc) of the text box they are displayed in:

let pdf_text text =
  pdf_stream (String.concat "\n" ([
    "BT";
    "54 738 Td";
    "/F1 10 Tf";
    "12 TL";
  ] @ List.map (fun l -> pdf_string l ^ "'") text @ [
    "ET"
  ]))

All a PDF file contains is a sequence of cross-referenced objects. Given a collection of lines of text grouped into pages, the program writes the metadata, the catalog, the required resources, the page table and the sequence of pages given by their dictionary and contents:

let mapi f =
  let rec go i = function
  | [] -> []
  | x :: xs -> f i x :: go (succ i) xs
  in go 0

let objects_of_pages pages =
  mapi (pdf_object % succ) (List.concat ([
    pdf_info ();
    pdf_catalog ();
    pdf_font ();
    pdf_pages (List.length pages);
  ] :: mapi (fun i p -> [pdf_page (5 + 2 * i); pdf_text p]) pages))

Again, since the structure of the PDF file is known the references can be hardwired. Once the content is written, the PDF file ends with a trailer that cross-references the positions of all its contained objects by stream length. I fold over the list of objects to compute the file offsets at which they begin, starting at 9 to account for the PDF header:

let pdf_trailer objects =
  let len, tot, xref = List.fold_left ( fun (i, n, l) s ->
    (succ i, String.length s + 1 + n, Printf.sprintf "%010x 00000 n" n :: l) )
    (1, 9, []) objects in
  String.concat "\n" ([
    "xref";
    "0 " ^ string_of_int len;
    "0000000000 65535 f";
  ] @ List.rev xref @ [
    "trailer";
    pdf_dict [
      "Size", string_of_int len;
      "Info", pdf_ref 1;
      "Root", pdf_ref 2;
    ];
    "startxref";
    string_of_int tot;
    "%%EOF";
  ])

In this simple case the cross-reference table is written at the end; a stream-aware PDF document would have the table at the beginning, but that would need a two-pass algorithm. That is all that it's needed to write the PDF document:

let pdf_of_objects objects =
  String.concat "\n" (["%PDF-1.4"] @ objects @ [pdf_trailer objects])

(the header has length 9, as indicated above). It remains to format the text into lines. Lines will be split by terminator, accounting for all possibilities and taking care not to lose empty lines at the beginning or end of the text:

let split_lines = Str.split_delim (Str.regexp "\n\\|\r\n?")

In the same way that the original Factor code tab expansion is rather naïve:

let expand_tabs =
  String.concat "    " % Str.split_delim (Str.regexp_string "\t")

Composing both, making sure like in Factor that empty lines have at least something to print:

let lines_of_string =
  List.map (fun s -> if s = "" then " " else s) % split_lines % expand_tabs

Line wrapping and page layout are performed by grouping content lists by length. In the case of lines, I group lists of characters, so I need to convert from strings to lists and back. This is unnecessary in Factor since the word <groups> is generic on sequences, and so it would be in Haskell, but this is OCaml:

let explode str =
  let res = ref [] in
  String.iter (fun c -> res := c :: !res) str;
  List.rev !res

and implode cs =
  let buf = Buffer.create 16 in
  List.iter (Buffer.add_char buf) cs;
  Buffer.contents buf

Grouping lists by length is a simple function with a number of special cases:

let groups_of n =
  let rec go gss gs i = function
  | [] when gs = [] -> List.rev                 gss
  | []              -> List.rev (List.rev gs :: gss)
  | xs when i = n   -> go (List.rev gs :: gss) [] 0  xs
  | x :: xs         -> go gss (x :: gs)     (succ i) xs
  in go [] [] 0

As with the Factor code, I format text in pages of 57 lines of 84 characters. Except for the reading direction and use of ordnance, this function reads exactly like the original:

let pages_of_lines =
  groups_of 57 % List.concat % List.map (List.map implode % groups_of 84 % explode)

That's it. Formatting a text into a PDF document is a pipeline connecting all the pieces:

let pdf_of_text =
  pdf_of_objects % objects_of_pages % pages_of_lines % lines_of_string

As a bonus, a simple file filter:

let unwind ~(protect:'a -> unit) f (x : 'a) =
  try let y = f x in protect x; y with e -> protect x; raise e

let with_input_file  f file = unwind ~protect:close_in  f (open_in_bin file)
and with_output_file f file = unwind ~protect:close_out f (open_out    file)

let file_contents = with_input_file (fun inch ->
  let len = in_channel_length inch in
  let buf = String.create len in
  really_input inch buf 0 len;
  buf)

let pdf_of_file inf =
  let pdf = pdf_of_text (file_contents inf) in
  with_output_file (fun otch -> output_string otch pdf)

The entire program comes down to less than four pages in a formatted PDF document. As the original author remarks, this solution also compares favorably to a 600 line Python version and a 450 line C version. I expect that a Haskell version would be much nearer the 140 lines in the original program.