Enter the void *

What's in an ADT ?

Etienne Millon — Wed, 14 Dec 2011 00:00:00 UT

Introduction

Algebraic Data Types, or ADTs for short, are a core feature of functional languages such as OCaml or Haskell. They are a handy model of closed disjoint unions and unfortunately, outside of the functional realm, they are only seldom used.

In this article, I will explain what ADTs are, how they are used in OCaml and what trimmed-down versions of them exist in other languages. I will use OCaml, but the big picture is about the same in Haskell.

Principles

Functional languages offer a myriad of types for the programmer.

some base types, such as int, char or bool.
functions, i.e. arrow types. A function with domain a and codomain b has type a -> b.
tuples, i.e. product types. A tuple is a heterogeneous, fixed-width container type (its set-theoretic counterpart is the cartesian product). For example, (2, true, 'x') has type int * bool * char. Record types are a (mostly) syntactic extension that gives names to their fields.
some parametric types. For example, if t is a type, t list is the type of homogeneous linked lists of elements having type t.
what we are talking about today, algebraic types (or sum types, or variant types).

If product types represent the cartesian product, algebraic types represent the disjoint union. In other words, they are well suited to case analysis.

We will take the example of integer ranges. One can say that an integer range is either :

the empty range
of the form ]-∞;a]
of the form [a;+∞[
an interval of the form [a;b] (where a ≤ b)
the whole range (i.e. ℤ)

With the following properties :

Disjunction : no range can be of two forms at a time.
Injectivity : if [a;b] = [c;d], then a = c and b = d (and similarly for other forms).
Exhaustiveness : it cannot be of another form.

Syntax & semantics

This can be encoded as an ADT :

type range =
  | Empty
  | HalfLeft of int
  | HalfRight of int
  | Range of int * int
  | FullRange

Empty, HalfLeft, HalfRight, Range and FullRange are the constructors of range. They are the only way to build a value of type range. For example, Empty, HalfLeft 3 and Range (2, 5) are all values of type range¹. They each have a specific arity (the number of arguments they take).

To deconstruct a value of type range, we have to use a powerful construct, pattern matching, which is about matching a value against a sequence of patterns (yes, that’s about it).

To illustrate this, we will write a function that computes the minimum value of such a range. Of course, this can be ±∞ too, so we have to define a type to represent the return value.

type ext_int =
  | MinusInfinity
  | Finite of int
  | PlusInfinity

In a math textbook, we would write the case analysis as :

min ∅ = +∞
min ]-∞;a] = -∞
min [a;+∞[ = a
min [a;b] = a
min ℤ = -∞

That translates to the following (executable !) OCaml code :

let range_min x =
  match x with
  | Empty -> PlusInfinity
  | HalfLeft a -> MinusInfinity
  | HalfRight a -> Finite a
  | Range (a, b) -> Finite a
  | FullRange -> MinusInfinity

In the pattern HalfLeft a, a is a variable name, so it gets bound to the argument’s value. In other words, match (HalfLeft 2) with HalfLeft x -> e binds x to 2 in e.
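As a further illustration of the same construct, here is a minimal sketch of a membership test on ranges. The function name mem is an assumption made for this example, and the type definition is repeated so that the snippet stands alone.

```ocaml
type range =
  | Empty
  | HalfLeft of int
  | HalfRight of int
  | Range of int * int
  | FullRange

(* mem n r tells whether the integer n belongs to the range r.
   Each branch mirrors one of the five forms listed earlier. *)
let mem n r =
  match r with
  | Empty -> false
  | HalfLeft a -> n <= a              (* ]-∞;a] *)
  | HalfRight a -> n >= a             (* [a;+∞[ *)
  | Range (a, b) -> a <= n && n <= b  (* [a;b] *)
  | FullRange -> true
```

If one of the five branches is left out, the compiler reports a non-exhaustive match, which is the exhaustiveness property at work.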

It’s functions all the way down

Pattern matching seems magical at first, but it is only a syntactic trick. Indeed, the definition of the above type is equivalent to the following definition :

type range

(* The following is not syntactically correct *)
val Empty : range
val HalfLeft : int -> range
val HalfRight : int -> range
val Range : int * int -> range
val FullRange : range
(* Moreover, we know that they are injective and mutually disjoint *)

val deconstruct_range :
  (unit -> 'a) ->
  (int -> 'a) ->
  (int -> 'a) ->
  (int * int -> 'a) ->
  (unit -> 'a) ->
  range ->
  'a

deconstruct_range is what replaces pattern matching. It also embodies the notion of exhaustiveness, because given any value of type range, we can build a deconstructed value out of it.

Its type looks scary at first, but if we look closer, its arguments are a sequence of case-specific deconstructors² and the value to get “matched” against.

To show the equivalence, we can implement deconstruct_range using pattern matching and range_min using deconstruct_range³ :

let deconstruct_range
      f_empty
      f_halfleft
      f_halfright
      f_range
      f_fullrange
      x
    =
  match x with
  | Empty -> f_empty ()
  | HalfLeft a -> f_halfleft a
  | HalfRight a -> f_halfright a
  | Range (a, b) -> f_range (a, b)
  | FullRange -> f_fullrange ()

let range_min' x =
  deconstruct_range
    (fun () -> PlusInfinity)
    (fun a -> MinusInfinity)
    (fun a -> Finite a)
    (fun (a, b) -> Finite a)
    (fun () -> MinusInfinity)
    x

Implementation

After this trip in denotational-land, let’s get back to operational-land : how is this implemented ?

In OCaml, no type information exists at runtime. Everything has a uniform representation and is either an integer or a pointer to a block. Each block starts with a header holding a tag and a size, followed by its fields.

With the Obj module (kids, don’t try this at home), it is possible to inspect blocks at runtime. Let’s write a dumper for range values and look at its output :

(* Range of integers between a and b *)
let rec rng a b =
  if a > b then
    []
  else
    a :: rng (a+1) b

let view_block o =
  if (Obj.is_block o) then
    begin
      let tag = Obj.tag o in
      let sz = Obj.size o in
      let f n =
        let f = Obj.field o n in
        assert (Obj.is_int f);
        Obj.obj f
      in
      tag :: List.map f (rng 0 (sz-1))
    end
  else if Obj.is_int o then
    [Obj.obj o]
  else
    assert false

let examples () =
  let p_list l =
    String.concat ";" (List.map string_of_int l)
  in
  let explore_range r =
    print_endline (p_list (view_block (Obj.repr r)))
  in
  List.iter explore_range
    [ Empty
    ; HalfLeft 8
    ; HalfRight 13
    ; Range (2, 5)
    ; FullRange
    ]

When we run examples (), it outputs :

0
0;8
1;13
2;2;5
1

We can see the following distinction :

0-ary constructors (Empty and FullRange) are encoded as simple integers.
other ones are encoded as blocks with a constructor number as tag (0 for HalfLeft, 1 for HalfRight and 2 for Range), followed by their arguments.

Thanks to this uniform representation, pattern-matching is straightforward : the runtime system will only look at the tag number to decide which constructor has been used, and if there are arguments to be bound, they are just after in the same block.

Conclusion

Algebraic Data Types are a simple model of disjoint unions, for which case analyses are the most natural. In more mainstream languages, some alternatives exist but they are more limited when modeling the same problem.

For example, in object-oriented languages, the Visitor pattern is the natural way to do it. But class trees are inherently “open”, thus breaking the exhaustiveness property.

The closest implementation is tagged unions in C, but they require you to roll your own solution using enums, structs and unions. This also means that all your hand-allocated blocks will have the same size.

Oh, and I would love to know how this problem is solved with other paradigms !

¹ Unfortunately, so is Range (10, 2). The invariant that a ≤ b has to be enforced by the programmer when using this constructor.
² For 0-ary constructors, the type has to be unit -> 'a instead of 'a to allow side effects to happen during pattern matching.
³ More precisely, we would have to show that any function written with pattern matching can be adapted to use the deconstructor instead. I hope that this example is general enough to get the idea.

Making type inference explode

Etienne Millon — Wed, 21 May 2014 00:00:00 UT

Hindley-Milner type systems are in a sweet spot in that they are both expressive and easy to infer. For example, type inference can turn this program:

let rec length = function
  | [] -> 0 
  | x::xs -> 1 + length xs

into this one (the top-level type 'a list -> int is usually what is interesting but the compiler has to infer the type of every subexpression):

let rec length : 'a list -> int = function
  | [] -> (0 : int)
  | (x:'a)::(xs : 'a list) -> (1 : int)
        + ((length : 'a list -> int) (xs : 'a list) : int)

Because the compiler does so much work, it is reasonable to wonder whether it is efficient. The theoretical answer to this question is that type inference is EXP-complete, but given reasonable constraints on the program, it can be done in quasi-linear time (n log n where n is the size of the program).

Still, one may wonder what kind of pathological cases show this exponential effect. Here is one such example:

let p x y = fun z -> z x y ;;

let r () =
let x1 = fun x -> p x x in
let x2 = fun z -> x1 (x1 z) in
let x3 = fun z -> x2 (x2 z) in
x3 (fun z -> z);;

The type signature of r is already daunting:

% ocamlc -i types.ml
val p : 'a -> 'b -> ('a -> 'b -> 'c) -> 'c
val r :
  unit ->
  (((((((('a -> 'a) -> ('a -> 'a) -> 'b) -> 'b) ->
       ((('a -> 'a) -> ('a -> 'a) -> 'b) -> 'b) -> 'c) ->
      'c) ->
     ((((('a -> 'a) -> ('a -> 'a) -> 'b) -> 'b) ->
       ((('a -> 'a) -> ('a -> 'a) -> 'b) -> 'b) -> 'c) ->
      'c) ->
     'd) ->
    'd) ->
   ((((((('a -> 'a) -> ('a -> 'a) -> 'b) -> 'b) ->
       ((('a -> 'a) -> ('a -> 'a) -> 'b) -> 'b) -> 'c) ->
      'c) ->
     ((((('a -> 'a) -> ('a -> 'a) -> 'b) -> 'b) ->
       ((('a -> 'a) -> ('a -> 'a) -> 'b) -> 'b) -> 'c) ->
      'c) ->
     'd) ->
    'd) ->
   'e) ->
  'e

But what’s interesting about this program is that we can add (or remove) lines to study how input size can alter the processing time and output type size. It explodes:

n	wc -c	time	leaves(n)
1	98	15ms	1
2	167	15ms	2
3	610	15ms	8
4	11630	38ms	128
5	4276270	6.3s	32768

Observing the number of ('a -> 'a) leaves in the output type reveals that it is squared and doubled at each step, leading to an exponential growth.
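To make the experiment easier to repeat, here is a small sketch. The function program builds the source text for n chained definitions (program 3 is the program shown earlier), and leaves follows the "squared and doubled" recurrence of the leaves(n) column. Both names are assumptions made for this example.

```ocaml
(* Source text of the test program with n chained definitions x1 .. xn. *)
let program n =
  let buf = Buffer.create 256 in
  Buffer.add_string buf "let p x y = fun z -> z x y ;;\n\n";
  Buffer.add_string buf "let r () =\n";
  Buffer.add_string buf "let x1 = fun x -> p x x in\n";
  for i = 2 to n do
    Printf.bprintf buf "let x%d = fun z -> x%d (x%d z) in\n" i (i - 1) (i - 1)
  done;
  Printf.bprintf buf "x%d (fun z -> z);;\n" n;
  Buffer.contents buf

(* leaves(n) as in the table: leaves(1) = 1, and each step squares
   and doubles the previous value, which gives 2^(2^(n-1) - 1).
   The result overflows a 63-bit OCaml int from n = 7 onwards. *)
let rec leaves n =
  if n <= 1 then 1
  else
    let l = leaves (n - 1) in
    2 * l * l
```

Writing program n to a file and running ocamlc -i on it, as above, is enough to measure the other columns. That step is left out of the sketch.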

In practice, this effect does not appear in day-to-day programs because programmers annotate the top-level declarations with their types. In that case, the size of the types would be merely proportional to the size of the program, because the type annotation would be gigantic.

Also, programmers tend to write functions that do something useful, which these do not seem to do ☺.

NaBoMaMo 2016 writeup

Etienne Millon — Wed, 01 Feb 2017 00:00:00 UT

Hello! It’s 2016, it’s November, and apparently it rhymes with #NaBoMaMo 2016, the National Bot Making Month. I made a bot!

Full disclosure: it’s actually 2017, but I started writing this in 2016 so it’s OK. Also I’m not actually from the US, but I’ll relax the definition a bit and let’s pretend it means International Bot Making Year. Close enough!

Bots are all the rage - Twitter bots, IRC bots, Telegram bots… I decided to make a Slack bot to get more familiar with that API.

I wanted this to be a small project - write and forget, basically. I started by defining some specs and locking those down:

that bot works on Slack
it uses the “will it rain in the next hour” API from Météo France.
the bot understands 3 commands:
- tell you whether it will rain or not.
- show you a graph of rain level over the next hour.
- tell you when to go out to avoid the rain.

The next step was choosing the tech stack. For hosting itself I was sold on using Heroku from previous projects (or another PaaS host, for what it’s worth).

As for the programming language itself, I hesitated between three choices:

focus on the all-included experience: something that has libraries, tooling, but somehow boring;
focus on the shipping experience: stuff that I use daily, but looking to get something online quickly;
focus on learning something new.

The first one means something like Python or Ruby. I am familiar with the languages and am pretty sure that there are libraries that can take care of the Slack API without me having to ever worry about HTTP endpoints. That means also first-class deployment and hosting.

The second one is about OCaml: it’s a programming language I use daily at work, but the real goal would be to focus on shipping: create a project, write tests, write implementation, deploy, repeat for new features, forget.

The third one means a totally new programming language. I heard a lot of good things about Elixir for backend applications and figured that it would be a good intro project. Learning a new language is always an interesting experience, because it makes you a better programmer in all languages, and having clear specs would make this manageable.

The Python/Ruby solution seemed a bit boring. I probably would not learn a lot, except maybe add a couple of libraries to my toolbelt.

Elixir sounds great, but learning a new language and a new project at the same time is too hard and too time consuming. I would rather write in a new language something I previously wrote in another language. Though for something small and focused like this, that could have worked.

I first created the project structure: GitHub repo, OCaml project (topkg, opam, etc). I like to use TDD for this kind of project, so I added a small alcotest suite. I also created the 12factor separation: a Procfile, a small bin/ shell that reads the application configuration from the environment and starts a bot from lib/.

I asked myself what to test: the cohttp library is nice, because servers and clients are built using normal functions that take a request and return a response. That makes it possible to test almost everything at the OCaml level without having to go to the HTTP level. This is especially important since there is no way to mock values and functions in OCaml. Everything has to be real objects.

However, even if it was possible to test everything, I decided to just focus on the domain logic without testing the HTTP part: for example, I would pass data structures directly to my bot object rather than building a cohttp request.

A part that is important for me, even for a small project like that, is to have some sort of CI: have Travis run my test suite, and make a binary ready to be deployed to Heroku. That way, it is impossible to forget how to make changes, test and deploy, since this is all in a script.

The other part that needed work is the actual Slack integration. The “slash” command API is pretty simple: it is possible to configure a Slack team such that typing /rain will hit a particular URL. Some options are passed as POST data and whatever is returned is displayed in Slack.

I set up the Slack integration, wrote a function to distinguish between /rain and /rain list (using the POST data), and by the end of the second iteration I had my second feature implemented, working, and deployed.
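A function of that kind could look like the sketch below. It is not the bot’s actual code: the type command, its constructors and parse_command are names made up for this example, and it assumes that the text typed after the command name has already been extracted from the POST data (Slack sends it in a field called text).

```ocaml
(* The sub-commands the sketch knows about. *)
type command =
  | Rain               (* plain "/rain" *)
  | Rain_list          (* "/rain list" *)
  | Unknown of string  (* anything else, kept for an error message *)

(* Decide which command was typed, from the text following "/rain". *)
let parse_command text =
  match String.trim text with
  | "" -> Rain
  | "list" -> Rain_list
  | other -> Unknown other
```

Because this function only deals with strings, it can be tested directly without building an HTTP request.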

All in all, that was pretty great. The code or the bot itself are not particularly fantastic, but I learned some important lessons:

When you do not want to spend a lot of time on a task, invest in planning and keep the list of features short. That is pretty obvious in the context of paid work, but this applies well to hobby programming too.
Know what to test and what not to. Tests are useful to ensure that changes can be made without breaking everything, but testing that your HTTP library can parse POST data is a waste of time.
In languages where it is not possible to mock or monkey patch functions, dependency injection is still possible. One may even argue that it leads to a better solution, since it removes the coupling between the different components.
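The last point can be illustrated with a minimal sketch, which is not taken from the bot itself. The forecast type, the answer function and its fetch_forecast parameter are all hypothetical. The idea is that the code that talks to the outside world is passed in as an argument, so a test can supply a plain function instead.

```ocaml
(* A deliberately tiny model of what a forecast service could return. *)
type forecast =
  | Dry
  | Rain_in of int  (* minutes before the rain starts *)

(* The domain logic takes its data source as a parameter
   instead of calling an HTTP client directly. *)
let answer ~fetch_forecast place =
  match fetch_forecast place with
  | Dry -> "No rain expected in " ^ place
  | Rain_in minutes ->
      Printf.sprintf "Rain expected in %s in %d minutes" place minutes

(* In a test, the dependency is just a function returning canned data. *)
let fake_fetch place = if place = "Paris" then Rain_in 10 else Dry

let paris_message = answer ~fetch_forecast:fake_fetch "Paris"
```

In production the same answer function would receive a function that performs the real request. Nothing in answer needs to change.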

You can find the source of this bot on GitHub. See you next year, #NaBoMaMo! And thanks to Tully Hansen for organizing this.

Fuzzing OCamlFormat with AFL and Crowbar

Etienne Millon — Mon, 03 Aug 2020 00:00:00 UT

This article was first published on the Tarides blog.

AFL (and fuzzing in general) is often used to find bugs in low-level code like parsers, but it also works very well to find bugs in high level code, provided the right ingredients. We applied this technique to feed random programs to OCamlFormat and found many formatting bugs.

OCamlFormat is a tool to format source code. To do so, it parses the source code to an Abstract Syntax Tree (AST) and then applies formatting rules to the AST.

It can be tricky to correctly format the output. For example, say we want to format (a+b)*c. The corresponding AST will look like Apply("*", Apply ("+", Var "a", Var "b"), Var "c"). A naive formatter would look like this:

let rec format = function
  | Var s -> s
  | Apply (op, e1, e2) ->
      Printf.sprintf "%s %s %s" (format e1) op (format e2)

But this is not correct, as it will print (a+b)*c as a+b*c, which is a different program. In this particular case, the common solution would be to track the relative precedence of the expressions and to emit only necessary parentheses.
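As an illustration of that common solution, here is a sketch for the toy AST above. It is not how OCamlFormat works. The expr type is spelled out, the precedence table only knows four left-associative operators, and the names prec, format_prec and format_expr are assumptions made for this example.

```ocaml
type expr =
  | Var of string
  | Apply of string * expr * expr

(* Precedence of the binary operators handled by this sketch. *)
let prec = function
  | "*" | "/" -> 2
  | "+" | "-" -> 1
  | _ -> 0

(* ctx is the lowest precedence that may appear here without
   parentheses; looser operators get wrapped. *)
let rec format_prec ctx = function
  | Var s -> s
  | Apply (op, e1, e2) ->
      let p = prec op in
      (* Operators are treated as left-associative: the right operand
         must bind strictly tighter to stay unparenthesized. *)
      let s =
        Printf.sprintf "%s %s %s" (format_prec p e1) op (format_prec (p + 1) e2)
      in
      if p < ctx then "(" ^ s ^ ")" else s

let format_expr e = format_prec 0 e
```

With this version, the AST of (a+b)*c is printed as (a + b) * c, while a+(b*c) is printed as a + b * c without extra parentheses.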

OCamlFormat has similar cases. To make sure we do not change a program when formatting it, there is an extra check at the end to parse the output and compare the output AST with the input AST. This ensures that, in case of bugs, OCamlFormat exits with an error rather than changing the meaning of the input program.

When we consider the whole OCaml language, the rules are complex and it is difficult to make sure that we are correctly handling all programs. There are two main failure modes: either we put too many parentheses, and the program does not look good, or we do not put enough, and the AST changes (and OCamlFormat exits with an error). We need a way to make sure that the latter does not happen. Tests work to some extent, but some edge cases happen only when a certain combination of language features is used. Because of this combinatorial explosion, it is impossible to get good coverage using tests only.

Fortunately there is a technique we can use to automatically explore the program space: fuzzing. For a primer on using this technique on OCaml programs, one can refer to this article.

To make this work we need two elements: a random program generator, and a property to check. Here, we are interested in programs that are valid (in the sense that they parse correctly) but do not format correctly. We can use the OCamlFormat internals to do the following:

try to parse input: in case of a parse error, just reject this input as invalid.
otherwise, we have a valid program. Try to format it. If this happens with no error at all, reject this input as well.
otherwise, it means that the AST changed, comments moved, or something similar, in a valid program. This is what we are after.
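The three steps above can be sketched as a small classification function. This does not use the real OCamlFormat internals: parse and format are hypothetical parameters standing in for them, and the toy versions below work on integers only to make the snippet self-contained.

```ocaml
type verdict =
  | Invalid_input          (* does not parse: not interesting *)
  | Formats_fine           (* parses and formats: not interesting *)
  | Interesting of string  (* valid program that fails to format *)

(* parse and format are supplied by the caller and return a result. *)
let classify ~parse ~format source =
  match parse source with
  | Error _ -> Invalid_input
  | Ok ast ->
      (match format ast with
       | Ok _ -> Formats_fine
       | Error msg -> Interesting msg)

(* Toy stand-ins: a "program" is an integer, and the "formatter"
   has a bug that makes it fail on negative numbers. *)
let toy_parse s =
  match int_of_string_opt s with
  | Some n -> Ok n
  | None -> Error "parse error"

let toy_format n =
  if n >= 0 then Ok (string_of_int n) else Error "formatter failed"

let toy_classify = classify ~parse:toy_parse ~format:toy_format
```

Only the Interesting case should be reported to the fuzzer as a crash. The other two are simply discarded.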

Generating random programs is a bit more difficult. We can feed random strings to AFL, but even with a corpus of existing valid code it will generate many invalid programs. We are not interested in these for this project; we would rather start from valid programs.

A good way to do that is to use Crowbar to directly generate AST values. Thanks to ppx_deriving_crowbar and ppx_import it is possible to generate random values for an external type like Parsetree.structure (the contents of .ml files). Even more fortunately somebody already did the work. Thanks, Mindy!

This approach works really well: it generates 5k-10k programs per second, which is very good performance (AFL starts complaining below 100/s).

Quickly, AFL was able to find crashes related to attributes. These are “labels” attached to various nodes of the AST. For example the expression (x || y) [@a] (logical or between x and y, attach attribute a to the “or” expression) would get formatted as x || y [@a] (attribute a is attached to the y variable). Once again, there is a check in place in OCamlFormat to make sure that it does not save the file in this case, but it would exit with an error.

After the fuzzer has run for a bit longer, it found crashes where comments would jump around in expressions like f (*a*) (*bb*) x. Wait, what? We never told the program generator how to generate comments. Inspecting the intermediate AST, the part in the middle is actually an integer literal with value "(*a*) (*bb*)" (integer literals are represented as strings so that a third party library could add literals for arbitrary precision numbers for example).

AFL comes with a program called afl-tmin that is used to minimize a crash. It will try to find a smaller example of a program that crashes OCamlFormat. It works well even with Crowbar in between. For example it is able to turn (new aaaaaa & [0;0;0;0])[@aaaaaaaaaa] into (0&0)[@a] (neither AFL nor OCamlFormat knows about types, so they can operate on nonsensical programs. Finding a well-typed version of a crash is usually not very difficult, but it has to be done manually).

In total, letting AFL run overnight on a single core (that is relatively short in terms of fuzzing) caused 453 crashes. After minimization and deduplication, this corresponded to about 30 unique issues.

Most of them are related to attributes that OCamlFormat did not try to include in the output, or where it forgot to add parentheses. Fortunately, there are safeguards in OCamlFormat: since it checks that the formatting preserves the AST structure, it will exit with an error instead of outputting a different program.

Once again, fuzzing has proved itself as a powerful technique to find actual bugs (including high-level ones). A possible approach for a next iteration is to try to detect more problems during formatting, such as finding cases where lines are longer than allowed. It is also possible to extend the random program generator so that it tries to generate comments, and let OCamlFormat check that they are all laid out correctly in the output. We look forward to employing fuzzing more extensively for OCamlFormat development in the future.

Introducing tree-sitter-dune

Etienne Millon — Fri, 26 Jul 2024 00:00:00 UT

I made a tree-sitter plugin for dune files. It is available on GitHub.

Tree-sitter is a parsing system that can be used in text editors. Dune is a build system for OCaml projects. Its configuration language lives in dune files which use an s-expression syntax.

This makes highlighting challenging: the lexing part of the language is very simple (atoms, strings, parentheses), but it is not enough to make a good highlighter.

In the following example, with-stdout-to and echo are “actions” that we could highlight in a special way, but these names can also appear in places where they are not interpreted as actions, and doing so would be confusing (for example, we could write to a file named echo instead of foo.txt).

(rule
 (action
  (with-stdout-to
   foo.txt
   (echo "testing"))))

Tree-sitter solves this, because it creates an actual parser that goes beyond lexing.

In this example, I created grammar rules that parse the contents of (action ...) as an action, recognizing the various constructs of this DSL.
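To show why position matters and not only the atom itself, here is a toy sketch in OCaml. It is unrelated to the actual tree-sitter grammar: the sexp type and the functions action_names and action_heads are made up for this example, and the rule "every nested list inside (action ...) is an action" is a simplifying assumption.

```ocaml
type sexp =
  | Atom of string
  | List of sexp list

(* Collect the atoms that sit in action position: the head of a list
   found inside an (action ...) field, recursively. *)
let rec action_names = function
  | Atom _ -> []
  | List (Atom "action" :: body) -> List.concat_map action_heads body
  | List items -> List.concat_map action_names items

and action_heads = function
  | List (Atom name :: args) -> name :: List.concat_map action_heads args
  | Atom _ | List _ -> []

(* Same rule as in the example, but writing to a file named echo. *)
let tricky =
  List
    [ Atom "rule"
    ; List
        [ Atom "action"
        ; List
            [ Atom "with-stdout-to"
            ; Atom "echo"
            ; List [ Atom "echo"; Atom "\"testing\"" ]
            ]
        ]
    ]
```

Here action_names tricky contains echo only once: the file name is skipped because it is not at the head of a list, which a purely lexical highlighter cannot tell.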

The output of the parser is this syntax tree with location information (for some reason, line numbers start at 0 which is normal and unusual at the same time).

(source_file [0, 0] - [5, 0]
  (stanza [0, 0] - [4, 22]
    (stanza_name [0, 1] - [0, 5])
    (field_name [1, 2] - [1, 8])
    (action [2, 2] - [4, 20]
      (action_name [2, 3] - [2, 17])
      (file_name_target [3, 3] - [3, 10]
        (file_name [3, 3] - [3, 10]))
      (action [4, 3] - [4, 19]
        (action_name [4, 4] - [4, 8])
        (quoted_string [4, 9] - [4, 18])))))

The various strings are annotated with their type: we have stanza names (rule), field names (action), action names (with-stdout-to, echo), file names (foo.txt), and plain strings ("testing").

By itself, that is not useful, but it’s possible to write queries to make this syntax tree do interesting stuff.

The first one is highlighting: we can set styles for various “patterns” (in practice, I only used node names) by defining queries:

(stanza_name) @function
(field_name) @property
(quoted_string) @string
(multiline_string) @string
(action_name) @keyword

The parts with @ map to “highlight groups” used in text editors.

Another type of query is called “injections”. It is used to link different types of grammars together. For example, dune files can start with a special comment that indicates that the rest of the file is an OCaml program. In that case, the parser emits a single ocaml_syntax node and the following injection indicates that this file should be parsed using an OCaml parser:

((ocaml_syntax) @injection.content
 (#set! injection.language "ocaml"))

Another use case for this is system actions: these strings in dune files could be interpreted using a shell parser.

In the other direction, it is possible to inject dune files into another document. For example, a markdown parser can use injections to highlight code blocks.

I’m happy to have explored this technology. The toolchain seemed complex at first: there’s a compiler which seems to be a mix of node and rust, which generates C, which is compiled into a dynamically loaded library; but this is actually pretty well integrated in nix and neovim so the details are made invisible.

The testing mechanism is similar to the cram tests we use in Dune, but I was a bit confused with the colors at first: when the output of a test changes, Dune considers that the new output is a + in the diff, and highlights it in green; while tree-sitter considers that the “expected output” is green.

There are many ways to improve this prototype: either by adding queries (it’s possible to define text objects, folding expressions, etc), or by improving coverage for dune files (in most cases, the parser uses an s-expression fallback). I’m also curious to see if it’s possible to use this parser to provide a completion source. Since the strings are tagged with their type (are we expecting a library name, a module name, etc), I think we could use that to provide context-specific completions, but that’s probably difficult to do.

Thanks teej for the initial idea and the useful resources.