2012-07-02

Assessing Abstractions

Back to OCaml! Catching up with StackOverflow's OCaml questions, I found an interesting one about private type abbreviations in module signatures. One thing in that conversation that struck me as odd was the assertion that the compiler optimizes single-constructor variants, thereby subsuming the semantics of Haskell's all three declarators, data, type and newtype, into one. Definitive proof would be obtainable by diving into the compiler's source. I decided instead to read the output of ocamlopt to try to catch a glimpse of this, even if I know it doesn't constitute definitive proof. What I found is both good and bad, or as we say here una de cal y una de arena.

Consider the following module definitions:

module S = struct
  type t = P of int * int
  let make x y = P (x, y)
end

module T = struct
  type t = int * int
  let make x y = x, y
end

Module S (for "single sum type") declares a single-variant type, while T (for "tuple type") declares an abbreviation. Putting them in a file and compiling it with ocamlopt -S -c generates the following listing (prelude and other specificities omitted):

S.makeT.make
_camlAbbrev__make_1033:
        subq    $8, %rsp
        movq    %rax, %rdi
.L101:  subq    $24, %r15
        movq    _caml_young_limit@GOTPCREL(%rip), %rax
        cmpq    (%rax), %r15
        jb      .L102
        leaq    8(%r15), %rax
        movq    $2048, -8(%rax)
        movq    %rdi, (%rax)
        movq    %rbx, 8(%rax)
        addq    $8, %rsp
        ret
.L102:  call    _caml_call_gc
        jmp     .L101
_camlAbbrev__make_1038:
        subq    $8, %rsp
        movq    %rax, %rdi
.L105:  subq    $24, %r15
        movq    _caml_young_limit@GOTPCREL(%rip), %rax
        cmpq    (%rax), %r15
        jb      .L106
        leaq    8(%r15), %rax
        movq    $2048, -8(%rax)
        movq    %rdi, (%rax)
        movq    %rbx, 8(%rax)
        addq    $8, %rsp
        ret
.L106:  call    _caml_call_gc
        jmp     .L105

(ocamlopt -version is 3.12.1). The compiler generates the exact same code in both cases. The bolded instructions build the heap-allocated 3-word pair as tag, first component, second component. In order to test modular abstraction, I coerce these implementations into the following signatures:

module A : sig
  type t = int * int
  val make : int -> int -> t
end = T

module B : sig
  type t = private int * int
  val make : int -> int -> t
end = T

module X : sig
  type t = P of int * int
  val make : int -> int -> t
end = S

module Y : sig
  type t = private P of int * int
  val make : int -> int -> t
end = S

Modules A and B coerce module T with a public and a private type definition, respectively. Modules X and Y do the same with module S. The following code shows how to use each variant of type t in a destructuring pattern match:

let a () = A.(let   (x, y) =  make 2 3               in x + y)
let b () = B.(let   (x, y) = (make 2 3 :> int * int) in x + y)
let x () = X.(let P (x, y) =  make 2 3               in x + y)
let y () = Y.(let P (x, y) =  make 2 3               in x + y)

As explained in the manual, a private type abbreviation requires an explicit coercion for the types to be compatible. The use of a single-variant type makes for not only safer coding (the intent behind Haskell's newtype), it also allows for lighter syntax. I was surprised by the assembly code generated by ocamlopt (somewhat simplified):

ab
_camlAbbrev__a_1060:
        subq    $8, %rsp
.L109:  subq    $24, %r15
        movq    _caml_young_limit@GOTPCREL(%rip), %rax
        cmpq    (%rax), %r15
        jb      .L110
        leaq    8(%r15), %rax
        movq    $2048, -8(%rax)
        movq    $5, (%rax)
        movq    $7, 8(%rax)
        movq    8(%rax), %rbx
        movq    (%rax), %rax
        leaq    -1(%rax, %rbx), %rax
        addq    $8, %rsp
        ret
.L110:  call    _caml_call_gc
        jmp     .L109
_camlAbbrev__b_1063:
        subq    $8, %rsp
.L113:  subq    $24, %r15
        movq    _caml_young_limit@GOTPCREL(%rip), %rax
        cmpq    (%rax), %r15
        jb      .L114
        leaq    8(%r15), %rax
        movq    $2048, -8(%rax)
        movq    $5, (%rax)
        movq    $7, 8(%rax)
        movq    8(%rax), %rbx
        movq    (%rax), %rax
        leaq    -1(%rax, %rbx), %rax
        addq    $8, %rsp
        ret
.L114:  call    _caml_call_gc
        jmp     .L113
cd
_camlAbbrev__x_1066:
        subq    $8, %rsp
.L117:  subq    $24, %r15
        movq    _caml_young_limit@GOTPCREL(%rip), %rax
        cmpq    (%rax), %r15
        jb      .L118
        leaq    8(%r15), %rax
        movq    $2048, -8(%rax)
        movq    $5, (%rax)
        movq    $7, 8(%rax)
        movq    8(%rax), %rbx
        movq    (%rax), %rax
        leaq    -1(%rax, %rbx), %rax
        addq    $8, %rsp
        ret
.L118:  call    _caml_call_gc
        jmp     .L117
_camlAbbrev__y_1070:
        subq    $8, %rsp
.L121:  subq    $24, %r15
        movq    _caml_young_limit@GOTPCREL(%rip), %rax
        cmpq    (%rax), %r15
        jb      .L122
        leaq    8(%r15), %rax
        movq    $2048, -8(%rax)
        movq    $5, (%rax)
        movq    $7, 8(%rax)
        movq    8(%rax), %rbx
        movq    (%rax), %rax
        leaq    -1(%rax, %rbx), %rax
        addq    $8, %rsp
        ret
.L122:  call    _caml_call_gc
        jmp     .L121

Identical machine code generated in all four cases, which I find very encouraging. Here the compiler inlines the tupling constructor make (the three first bolded lines; note that an integer n is tagged as 2·n + 1 so that 2 is represented by 5 and 3 by 7). What I find a bummer is that the tuple is immediately destructured (into rbx and rax, next two bolded lines) and operated on its components (last bolded line; note in passing how addition of tagged integers is done with leaq, since 2·(m + n) + 1 = (2·m + 1) + (2·n + 1) - 1), without any kind of CSE performed. This is perfectly summarized by Pascal Cuoq in a comment to another StackOverflow question:

[…] Last time I checked it didn't even optimize the allocation of a, b in match a, b with x, y -> .... If you wish to check by yourself, I found that reading the x86 assembly generated by ocamlopt -S was convenient for me because I didn't have to learn a new representation.

This, I suspect, is an artifact of inlining in the presence of code emission that must be correct in the face of separate compilation. Disturbingly enough, ocamlopt seems to inline even across module boundaries. Separating the test into an abbrev.ml implementation:

module S = struct
  type t = P of int * int
  let make x y = P (x, y)
end

module T = struct
  type t = int * int
  let make x y = x, y
end

module A = T
module B = T
module X = S
module Y = S

with interface abbrev.mli:

module A : sig type t =              int * int val make : int -> int -> t end
module B : sig type t = private      int * int val make : int -> int -> t end
module X : sig type t =         P of int * int val make : int -> int -> t end
module Y : sig type t = private P of int * int val make : int -> int -> t end

the following code in test.ml:

open Abbrev

let a () = A.(let   (x, y) =  make 2 3               in x + y)
let b () = B.(let   (x, y) = (make 2 3 :> int * int) in x + y)
let x () = X.(let P (x, y) =  make 2 3               in x + y)
let y () = Y.(let P (x, y) =  make 2 3               in x + y)

compiles to the same assembly as above! Xavier Leroy spells out the conditions for ocamlopt to do cross-module inlining; I think it obvious that these tests do fulfill them. An interesting tidbit in that message is the command-line argument -dcmm to dump a RTL-like representation of the generated code:

(function camlTest__a_1038 (param/1056: addr)
 (let match/1057 (alloc 2048 5 7)
   (+ (+ (load match/1057) (load (+a match/1057 8))) -1)))

(function camlTest__b_1041 (param/1058: addr)
 (let match/1059 (alloc 2048 5 7)
   (+ (+ (load match/1059) (load (+a match/1059 8))) -1)))

(function camlTest__x_1044 (param/1060: addr)
 (let match/1061 (alloc 2048 5 7)
   (+ (+ (load match/1061) (load (+a match/1061 8))) -1)))

(function camlTest__y_1048 (param/1062: addr)
 (let match/1063 (alloc 2048 5 7)
   (+ (+ (load match/1063) (load (+a match/1063 8))) -1)))

I think it is not very productive to worry too much about the cost of abstractions in OCaml, as it probably is offset by the naïveté of the code generator.

2 comments:

gasche said...

Pattern matching is compiled away (and, in the process, optimizes things such as `match 2,3 with (x,y) -> ...`) in an early stage of OCaml's compilation, that is common to both bytecode and native compilers. Inlining is performed later, in the native compiler only.

This means that the pattern matching compiler won't get access to additional optimizations that will be made possible by the inlining pass. Ordering or repeating compiler passes is a delicate problem (should be performed on the source, a low-level intermediate language, both?) in compiler design, and OCaml makes the choice of simplicity here.

If you can produce real-world code example where optimizing this kind of codes makes sensible difference in the performances (even better if it's already existing code), you should submit a request on the OCaml bugtracker.

Matías Giovannini said...

It clearly is not an issue with the pattern matching compiler; the tuple is allocated and built just to be destructured and used well after that. Whether this optimization should be performed on the RTL or as a peephole pass over the generated instructions, I don't know. I don't know what the "real-word" (them fighting words, Dijkstra was famously wary of those) impact of this is, either, and frankly it is not something that concerns me here; the exercise was admittedly artificial and done purely to satisfy my curiosity, and I derive no value judgements from it.