<?xml version="1.0" encoding="utf-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom"
    xmlns:dc="http://purl.org/dc/elements/1.1/">
    <channel>
        <title>Enter the void *</title>
        <link>http://blog.emillon.org</link>
        <description><![CDATA[Yet another random hacker]]></description>
        <atom:link href="http://blog.emillon.org/feeds/ocaml.xml" rel="self"
                   type="application/rss+xml" />
        <lastBuildDate>Wed, 14 Dec 2011 00:00:00 UT</lastBuildDate>
        <item>
    <title>What's in an ADT ?</title>
    <link>http://blog.emillon.org/posts/2011-12-14-what-s-in-an-adt.html</link>
    <description><![CDATA[<h2 id="introduction">Introduction</h2>
<p>Algebraic Data Types, or ADTs for short, are a core feature of functional
languages such as OCaml or Haskell. They are a handy model of closed disjoint
unions and unfortunately, outside of the functional realm, they are only seldom
used.</p>
<p>In this article, I will explain what ADTs are, how they are used in OCaml and
what trimmed-down versions of them exist in other languages. I will use OCaml,
but the big picture is about the same in Haskell.</p>
<h2 id="principles">Principles</h2>
<p>Functional languages offer a myriad of types for the programmer.</p>
<ul>
<li>some <em>base</em> types, such as <code>int</code>, <code>char</code> or <code>bool</code>.</li>
<li>functions, ie <em>arrow</em> types. A function with domain <code>a</code> and codomain <code>b</code> has
type <code>a -&gt; b</code>.</li>
<li>tuples, ie <em>product</em> types. A tuple is a heterogeneous, fixed-width
container type (its set-theoretic counterpart is the cartesian product). For
example, <code>(2, true, 'x')</code> has type <code>int * bool * char</code>. <em>record</em> types are a
(mostly) syntactic extension that gives names to their fields.</li>
<li>some <em>parametric</em> types. For example, if <code>t</code> is a type, <code>t list</code> is the type
of homogeneous linked list of elements having type <code>t</code>.</li>
<li>what we are talking about today, <em>algebraic</em> types (or <em>sum</em> types, or
<em>variant</em> types).</li>
</ul>
<p>If product types represent the cartesian product, algebraic types represent the
disjoint union. In other words, they are well suited to case
analysis.</p>
<p>We will take the example of integer ranges. One can say that an integer range is
either :</p>
<ul>
<li>the empty range</li>
<li>of the form <code>]-∞;a]</code></li>
<li>of the form <code>[a;+∞[</code></li>
<li>an interval of the form <code>[a;b]</code> (where a ≤ b)</li>
<li>the whole range (ie, ℤ)</li>
</ul>
<p>With the following properties :</p>
<ul>
<li>Disjunction : no range can be of two forms at a time.</li>
<li>Injectivity : if <code>[a;b]</code> = <code>[c;d]</code>, then <code>a</code> = <code>c</code> and <code>b</code> = <code>d</code> (and
similarly for other forms).</li>
<li>Exhaustiveness : it cannot be of another form.</li>
</ul>
<h2 id="syntax-semantics">Syntax &amp; semantics</h2>
<p>This can be encoded as an ADT :</p>
<div class="sourceCode" id="cb1"><pre class="sourceCode ocaml"><code class="sourceCode ocaml"><span id="cb1-1"><a href="#cb1-1" aria-hidden="true" tabindex="-1"></a><span class="kw">type</span> range =</span>
<span id="cb1-2"><a href="#cb1-2" aria-hidden="true" tabindex="-1"></a>  | Empty</span>
<span id="cb1-3"><a href="#cb1-3" aria-hidden="true" tabindex="-1"></a>  | HalfLeft <span class="kw">of</span> <span class="dt">int</span></span>
<span id="cb1-4"><a href="#cb1-4" aria-hidden="true" tabindex="-1"></a>  | HalfRight <span class="kw">of</span> <span class="dt">int</span></span>
<span id="cb1-5"><a href="#cb1-5" aria-hidden="true" tabindex="-1"></a>  | Range <span class="kw">of</span> <span class="dt">int</span> * <span class="dt">int</span></span>
<span id="cb1-6"><a href="#cb1-6" aria-hidden="true" tabindex="-1"></a>  | FullRange</span></code></pre></div>
<p><code>Empty</code>, <code>HalfLeft</code>, <code>HalfRight</code>, <code>Range</code> and <code>FullRange</code> are <code>range</code>’s
<em>constructors</em>. They are the only way to build a value of type <code>range</code>. For example,
<code>Empty</code>, <code>HalfLeft 3</code> and <code>Range (2, 5)</code> are all values of type <code>range</code><a href="#fn1" class="footnote-ref" id="fnref1" role="doc-noteref"><sup>1</sup></a>. They
each have a specific <em>arity</em> (the number of arguments they take).</p>
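<p>As footnote 1 points out, nothing stops a caller from writing <code>Range (10, 2)</code>. One common way to protect the invariant is a “smart constructor”. The sketch below is only an illustration: <code>make_range</code> is a hypothetical helper that is not part of the original code, and mapping an inverted interval to <code>Empty</code> is a design choice I am assuming here (raising an exception would be another option).</p>

```ocaml
type range =
  | Empty
  | HalfLeft of int
  | HalfRight of int
  | Range of int * int
  | FullRange

(* Hypothetical helper (not in the original article): build an interval
   only when the bounds are in order, and fall back to Empty otherwise,
   so that every Range (a, b) built through it has a no greater than b. *)
let make_range a b =
  if a > b then Empty else Range (a, b)
```

<p>The type system still accepts a direct use of <code>Range</code>; hiding the constructor behind a module signature would be needed to make the helper the only entry point.</p>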
<p>To <em>deconstruct</em> a value of type <code>range</code>, we have to use a powerful construct,
<em>pattern matching</em>, which is about matching a value against a sequence of
patterns (yes, that’s about it).</p>
<p>To illustrate this, we will write a function that computes the minimum value of
such a range. Of course, this can be ±∞ too, so we have to define a type to
represent the return value.</p>
<div class="sourceCode" id="cb2"><pre class="sourceCode ocaml"><code class="sourceCode ocaml"><span id="cb2-1"><a href="#cb2-1" aria-hidden="true" tabindex="-1"></a><span class="kw">type</span> ext_int =</span>
<span id="cb2-2"><a href="#cb2-2" aria-hidden="true" tabindex="-1"></a>  | MinusInfinity</span>
<span id="cb2-3"><a href="#cb2-3" aria-hidden="true" tabindex="-1"></a>  | Finite <span class="kw">of</span> <span class="dt">int</span></span>
<span id="cb2-4"><a href="#cb2-4" aria-hidden="true" tabindex="-1"></a>  | PlusInfinity</span></code></pre></div>
<p>In a math textbook, we would write the case analysis as :</p>
<ul>
<li>min ∅ = +∞</li>
<li>min ]-∞;a] = -∞</li>
<li>min [a;+∞[ = a</li>
<li>min [a;b] = a</li>
<li>min ℤ = -∞</li>
</ul>
<p>That translates to the following (executable !) OCaml code :</p>
<div class="sourceCode" id="cb3"><pre class="sourceCode ocaml"><code class="sourceCode ocaml"><span id="cb3-1"><a href="#cb3-1" aria-hidden="true" tabindex="-1"></a><span class="kw">let</span> range_min x =</span>
<span id="cb3-2"><a href="#cb3-2" aria-hidden="true" tabindex="-1"></a>  <span class="kw">match</span> x <span class="kw">with</span></span>
<span id="cb3-3"><a href="#cb3-3" aria-hidden="true" tabindex="-1"></a>  | Empty -&gt; PlusInfinity</span>
<span id="cb3-4"><a href="#cb3-4" aria-hidden="true" tabindex="-1"></a>  | HalfLeft a -&gt; MinusInfinity</span>
<span id="cb3-5"><a href="#cb3-5" aria-hidden="true" tabindex="-1"></a>  | HalfRight a -&gt; Finite a</span>
<span id="cb3-6"><a href="#cb3-6" aria-hidden="true" tabindex="-1"></a>  | Range (a, b) -&gt; Finite a</span>
<span id="cb3-7"><a href="#cb3-7" aria-hidden="true" tabindex="-1"></a>  | FullRange -&gt; MinusInfinity</span></code></pre></div>
<p>In the pattern <code>HalfLeft a</code>, <code>a</code> is a variable name, so it gets bound to the
argument’s value. In other words, <code>match (HalfLeft 2) with HalfLeft x -&gt; e</code>
binds <code>x</code> to 2 in <code>e</code>.</p>
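<p>To see variable binding at work on another function, here is a sketch of the symmetric case, the maximum of a range. The name <code>range_max</code> and the convention max ∅ = -∞ are my assumptions, chosen to mirror <code>range_min</code>. The wildcard <code>_</code> matches an argument without binding it.</p>

```ocaml
type range =
  | Empty
  | HalfLeft of int
  | HalfRight of int
  | Range of int * int
  | FullRange

type ext_int =
  | MinusInfinity
  | Finite of int
  | PlusInfinity

(* Mirror image of range_min; max of the empty set is taken to be -oo. *)
let range_max r =
  match r with
  | Empty -> MinusInfinity
  | HalfLeft a -> Finite a        (* a is bound to the upper bound *)
  | HalfRight _ -> PlusInfinity   (* the lower bound is ignored *)
  | Range (_, b) -> Finite b      (* only the second component is bound *)
  | FullRange -> PlusInfinity
```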
<h2 id="its-functions-all-the-way-down">It’s functions all the way down</h2>
<p>Pattern matching seems magical at first, but it is only a syntactic trick.
Indeed, the definition of the above type is equivalent to the following
definition :</p>
<div class="sourceCode" id="cb4"><pre class="sourceCode ocaml"><code class="sourceCode ocaml"><span id="cb4-1"><a href="#cb4-1" aria-hidden="true" tabindex="-1"></a><span class="kw">type</span> range</span>
<span id="cb4-2"><a href="#cb4-2" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb4-3"><a href="#cb4-3" aria-hidden="true" tabindex="-1"></a><span class="co">(* The following is not syntactically correct *)</span></span>
<span id="cb4-4"><a href="#cb4-4" aria-hidden="true" tabindex="-1"></a><span class="kw">val</span> Empty : range</span>
<span id="cb4-5"><a href="#cb4-5" aria-hidden="true" tabindex="-1"></a><span class="kw">val</span> HalfLeft : <span class="dt">int</span> -&gt; range</span>
<span id="cb4-6"><a href="#cb4-6" aria-hidden="true" tabindex="-1"></a><span class="kw">val</span> HalfRight : <span class="dt">int</span> -&gt; range</span>
<span id="cb4-7"><a href="#cb4-7" aria-hidden="true" tabindex="-1"></a><span class="kw">val</span> Range : <span class="dt">int</span> * <span class="dt">int</span> -&gt; range</span>
<span id="cb4-8"><a href="#cb4-8" aria-hidden="true" tabindex="-1"></a><span class="kw">val</span> FullRange : range</span>
<span id="cb4-9"><a href="#cb4-9" aria-hidden="true" tabindex="-1"></a><span class="co">(* Moreover, we know that they are injective and mutually disjoint *)</span></span>
<span id="cb4-10"><a href="#cb4-10" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb4-11"><a href="#cb4-11" aria-hidden="true" tabindex="-1"></a><span class="kw">val</span> deconstruct_range :</span>
<span id="cb4-12"><a href="#cb4-12" aria-hidden="true" tabindex="-1"></a>  (<span class="dt">unit</span> -&gt; &#39;a) -&gt;</span>
<span id="cb4-13"><a href="#cb4-13" aria-hidden="true" tabindex="-1"></a>  (<span class="dt">int</span> -&gt; &#39;a) -&gt;</span>
<span id="cb4-14"><a href="#cb4-14" aria-hidden="true" tabindex="-1"></a>  (<span class="dt">int</span> -&gt; &#39;a) -&gt;</span>
<span id="cb4-15"><a href="#cb4-15" aria-hidden="true" tabindex="-1"></a>  (<span class="dt">int</span> * <span class="dt">int</span> -&gt; &#39;a) -&gt;</span>
<span id="cb4-16"><a href="#cb4-16" aria-hidden="true" tabindex="-1"></a>  (<span class="dt">unit</span> -&gt; &#39;a) -&gt;</span>
<span id="cb4-17"><a href="#cb4-17" aria-hidden="true" tabindex="-1"></a>  range -&gt;</span>
<span id="cb4-18"><a href="#cb4-18" aria-hidden="true" tabindex="-1"></a>  &#39;a</span></code></pre></div>
<p><code>deconstruct_range</code> is what replaces pattern matching. It also embodies the notion of
exhaustiveness, because given any value of type <code>range</code>, we can build a
deconstructed value out of it.</p>
<p>Its type looks scary at first, but if we look closer, its arguments are a
sequence of case-specific deconstructors<a href="#fn2" class="footnote-ref" id="fnref2" role="doc-noteref"><sup>2</sup></a> and the value to get “matched”
against.</p>
<p>To show the equivalence, we can implement <code>deconstruct_range</code> using pattern
matching and <code>range_min</code> using <code>deconstruct_range</code><a href="#fn3" class="footnote-ref" id="fnref3" role="doc-noteref"><sup>3</sup></a> :</p>
<div class="sourceCode" id="cb5"><pre class="sourceCode ocaml"><code class="sourceCode ocaml"><span id="cb5-1"><a href="#cb5-1" aria-hidden="true" tabindex="-1"></a><span class="kw">let</span> deconstruct_range</span>
<span id="cb5-2"><a href="#cb5-2" aria-hidden="true" tabindex="-1"></a>      f_empty</span>
<span id="cb5-3"><a href="#cb5-3" aria-hidden="true" tabindex="-1"></a>      f_halfleft</span>
<span id="cb5-4"><a href="#cb5-4" aria-hidden="true" tabindex="-1"></a>      f_halfright</span>
<span id="cb5-5"><a href="#cb5-5" aria-hidden="true" tabindex="-1"></a>      f_range</span>
<span id="cb5-6"><a href="#cb5-6" aria-hidden="true" tabindex="-1"></a>      f_fullrange</span>
<span id="cb5-7"><a href="#cb5-7" aria-hidden="true" tabindex="-1"></a>      x</span>
<span id="cb5-8"><a href="#cb5-8" aria-hidden="true" tabindex="-1"></a>    =</span>
<span id="cb5-9"><a href="#cb5-9" aria-hidden="true" tabindex="-1"></a>  <span class="kw">match</span> x <span class="kw">with</span></span>
<span id="cb5-10"><a href="#cb5-10" aria-hidden="true" tabindex="-1"></a>  | Empty -&gt; f_empty ()</span>
<span id="cb5-11"><a href="#cb5-11" aria-hidden="true" tabindex="-1"></a>  | HalfLeft a -&gt; f_halfleft a</span>
<span id="cb5-12"><a href="#cb5-12" aria-hidden="true" tabindex="-1"></a>  | HalfRight a -&gt; f_halfright a</span>
<span id="cb5-13"><a href="#cb5-13" aria-hidden="true" tabindex="-1"></a>  | Range (a, b) -&gt; f_range (a, b)</span>
<span id="cb5-14"><a href="#cb5-14" aria-hidden="true" tabindex="-1"></a>  | FullRange -&gt; f_fullrange ()</span></code></pre></div>
<div class="sourceCode" id="cb6"><pre class="sourceCode ocaml"><code class="sourceCode ocaml"><span id="cb6-1"><a href="#cb6-1" aria-hidden="true" tabindex="-1"></a><span class="kw">let</span> range_min&#39; x =</span>
<span id="cb6-2"><a href="#cb6-2" aria-hidden="true" tabindex="-1"></a>  deconstruct_range</span>
<span id="cb6-3"><a href="#cb6-3" aria-hidden="true" tabindex="-1"></a>    (<span class="kw">fun</span> () -&gt; PlusInfinity)</span>
<span id="cb6-4"><a href="#cb6-4" aria-hidden="true" tabindex="-1"></a>    (<span class="kw">fun</span> a -&gt; MinusInfinity)</span>
<span id="cb6-5"><a href="#cb6-5" aria-hidden="true" tabindex="-1"></a>    (<span class="kw">fun</span> a -&gt; Finite a)</span>
<span id="cb6-6"><a href="#cb6-6" aria-hidden="true" tabindex="-1"></a>    (<span class="kw">fun</span> (a, b) -&gt; Finite a)</span>
<span id="cb6-7"><a href="#cb6-7" aria-hidden="true" tabindex="-1"></a>    (<span class="kw">fun</span> () -&gt; MinusInfinity)</span>
<span id="cb6-8"><a href="#cb6-8" aria-hidden="true" tabindex="-1"></a>    x</span></code></pre></div>
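<p>One can push the idea further and drop the variant type altogether, representing a range directly by its deconstructor (this is sometimes called a Scott encoding). The following is a sketch under my own naming: the <code>handlers</code> record, its <code>on_*</code> fields and the lowercase constructor functions are all hypothetical, and <code>ext_int</code> is kept as a plain ADT for brevity.</p>

```ocaml
type ext_int =
  | MinusInfinity
  | Finite of int
  | PlusInfinity

(* One handler per constructor, mirroring deconstruct_range's arguments. *)
type 'a handlers = {
  on_empty : unit -> 'a;
  on_halfleft : int -> 'a;
  on_halfright : int -> 'a;
  on_range : int * int -> 'a;
  on_fullrange : unit -> 'a;
}

(* A range *is* its deconstructor: given handlers, it picks one. *)
type range = { deconstruct : 'a. 'a handlers -> 'a }

(* "Constructors" are ordinary values and functions. *)
let empty = { deconstruct = (fun h -> h.on_empty ()) }
let half_left a = { deconstruct = (fun h -> h.on_halfleft a) }
let half_right a = { deconstruct = (fun h -> h.on_halfright a) }
let range a b = { deconstruct = (fun h -> h.on_range (a, b)) }
let full_range = { deconstruct = (fun h -> h.on_fullrange ()) }

(* range_min written without any pattern matching on ranges. *)
let range_min r =
  r.deconstruct {
    on_empty = (fun () -> PlusInfinity);
    on_halfleft = (fun _ -> MinusInfinity);
    on_halfright = (fun a -> Finite a);
    on_range = (fun (a, _) -> Finite a);
    on_fullrange = (fun () -> MinusInfinity);
  }
```

<p>Exhaustiveness is still enforced, in a different way: a record with a missing handler does not typecheck.</p>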
<h2 id="implementation">Implementation</h2>
<p>After this trip in denotational-land, let’s get back to operational-land : how
is this implemented ?</p>
<p>In OCaml, no type information exists at runtime. Every value has a
uniform representation and is either an integer or a pointer to a block. Each
block starts with a header holding a tag and a size, followed by its fields.</p>
<p>With the <code>Obj</code> module (kids, don’t try this at home), it is possible to inspect
blocks at runtime. Let’s write a dumper for <code>range</code> values and watch the output :</p>
<div class="sourceCode" id="cb7"><pre class="sourceCode ocaml"><code class="sourceCode ocaml"><span id="cb7-1"><a href="#cb7-1" aria-hidden="true" tabindex="-1"></a><span class="co">(* Range of integers between a and b *)</span></span>
<span id="cb7-2"><a href="#cb7-2" aria-hidden="true" tabindex="-1"></a><span class="kw">let</span> <span class="kw">rec</span> rng a b =</span>
<span id="cb7-3"><a href="#cb7-3" aria-hidden="true" tabindex="-1"></a>  <span class="kw">if</span> a &gt; b <span class="kw">then</span></span>
<span id="cb7-4"><a href="#cb7-4" aria-hidden="true" tabindex="-1"></a>    []</span>
<span id="cb7-5"><a href="#cb7-5" aria-hidden="true" tabindex="-1"></a>  <span class="kw">else</span></span>
<span id="cb7-6"><a href="#cb7-6" aria-hidden="true" tabindex="-1"></a>    a :: rng (a+<span class="dv">1</span>) b</span>
<span id="cb7-7"><a href="#cb7-7" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb7-8"><a href="#cb7-8" aria-hidden="true" tabindex="-1"></a><span class="kw">let</span> view_block o =</span>
<span id="cb7-9"><a href="#cb7-9" aria-hidden="true" tabindex="-1"></a>  <span class="kw">if</span> (<span class="dt">Obj</span>.is_block o) <span class="kw">then</span></span>
<span id="cb7-10"><a href="#cb7-10" aria-hidden="true" tabindex="-1"></a>    <span class="kw">begin</span></span>
<span id="cb7-11"><a href="#cb7-11" aria-hidden="true" tabindex="-1"></a>      <span class="kw">let</span> tag = <span class="dt">Obj</span>.tag o <span class="kw">in</span></span>
<span id="cb7-12"><a href="#cb7-12" aria-hidden="true" tabindex="-1"></a>      <span class="kw">let</span> sz = <span class="dt">Obj</span>.size o <span class="kw">in</span></span>
<span id="cb7-13"><a href="#cb7-13" aria-hidden="true" tabindex="-1"></a>      <span class="kw">let</span> f n =</span>
<span id="cb7-14"><a href="#cb7-14" aria-hidden="true" tabindex="-1"></a>        <span class="kw">let</span> f = <span class="dt">Obj</span>.field o n <span class="kw">in</span></span>
<span id="cb7-15"><a href="#cb7-15" aria-hidden="true" tabindex="-1"></a>        <span class="kw">assert</span> (<span class="dt">Obj</span>.is_int f);</span>
<span id="cb7-16"><a href="#cb7-16" aria-hidden="true" tabindex="-1"></a>        <span class="dt">Obj</span>.obj f</span>
<span id="cb7-17"><a href="#cb7-17" aria-hidden="true" tabindex="-1"></a>      <span class="kw">in</span></span>
<span id="cb7-18"><a href="#cb7-18" aria-hidden="true" tabindex="-1"></a>      tag :: <span class="dt">List</span>.map f (rng <span class="dv">0</span> (sz<span class="dv">-1</span>))</span>
<span id="cb7-19"><a href="#cb7-19" aria-hidden="true" tabindex="-1"></a>    <span class="kw">end</span></span>
<span id="cb7-20"><a href="#cb7-20" aria-hidden="true" tabindex="-1"></a>  <span class="kw">else</span> <span class="kw">if</span> <span class="dt">Obj</span>.is_int o <span class="kw">then</span></span>
<span id="cb7-21"><a href="#cb7-21" aria-hidden="true" tabindex="-1"></a>    [<span class="dt">Obj</span>.obj o]</span>
<span id="cb7-22"><a href="#cb7-22" aria-hidden="true" tabindex="-1"></a>  <span class="kw">else</span></span>
<span id="cb7-23"><a href="#cb7-23" aria-hidden="true" tabindex="-1"></a>    <span class="kw">assert</span> <span class="kw">false</span></span>
<span id="cb7-24"><a href="#cb7-24" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb7-25"><a href="#cb7-25" aria-hidden="true" tabindex="-1"></a><span class="kw">let</span> examples () =</span>
<span id="cb7-26"><a href="#cb7-26" aria-hidden="true" tabindex="-1"></a>  <span class="kw">let</span> p_list l =</span>
<span id="cb7-27"><a href="#cb7-27" aria-hidden="true" tabindex="-1"></a>    <span class="dt">String</span>.concat <span class="st">&quot;;&quot;</span> (<span class="dt">List</span>.map <span class="dt">string_of_int</span> l)</span>
<span id="cb7-28"><a href="#cb7-28" aria-hidden="true" tabindex="-1"></a>  <span class="kw">in</span></span>
<span id="cb7-29"><a href="#cb7-29" aria-hidden="true" tabindex="-1"></a>  <span class="kw">let</span> explore_range r =</span>
<span id="cb7-30"><a href="#cb7-30" aria-hidden="true" tabindex="-1"></a>    <span class="dt">print_endline</span> (p_list (view_block (<span class="dt">Obj</span>.repr r)))</span>
<span id="cb7-31"><a href="#cb7-31" aria-hidden="true" tabindex="-1"></a>  <span class="kw">in</span></span>
<span id="cb7-32"><a href="#cb7-32" aria-hidden="true" tabindex="-1"></a>  <span class="dt">List</span>.iter explore_range</span>
<span id="cb7-33"><a href="#cb7-33" aria-hidden="true" tabindex="-1"></a>    [ Empty</span>
<span id="cb7-34"><a href="#cb7-34" aria-hidden="true" tabindex="-1"></a>    ; HalfLeft <span class="dv">8</span></span>
<span id="cb7-35"><a href="#cb7-35" aria-hidden="true" tabindex="-1"></a>    ; HalfRight <span class="dv">13</span></span>
<span id="cb7-36"><a href="#cb7-36" aria-hidden="true" tabindex="-1"></a>    ; Range (<span class="dv">2</span>, <span class="dv">5</span>)</span>
<span id="cb7-37"><a href="#cb7-37" aria-hidden="true" tabindex="-1"></a>    ; FullRange</span>
<span id="cb7-38"><a href="#cb7-38" aria-hidden="true" tabindex="-1"></a>    ]</span></code></pre></div>
<p>When we run <code>examples ()</code>, it outputs :</p>
<pre><code>0
0;8
1;13
2;2;5
1</code></pre>
<p>We can see the following distinction :</p>
<ul>
<li>0-ary constructors (<code>Empty</code> and <code>FullRange</code>) are encoded as simple
integers.</li>
<li>other ones are encoded as blocks with a constructor number as tag (0 for
<code>HalfLeft</code>, 1 for <code>HalfRight</code> and 2 for <code>Range</code>) and their argument list
afterwards.</li>
</ul>
<p>Thanks to this uniform representation, pattern-matching is straightforward : the
runtime system will only look at the tag number to decide which constructor has
been used, and if there are arguments to be bound, they are just after in the
same block.</p>
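<p>To make this dispatch concrete, here is a hand-written emulation of what a compiled <code>match</code> might do for <code>range_min</code>. It is an illustration only, not what the compiler literally emits: <code>range_min_unsafe</code> is a hypothetical name, and the code relies on the numbering observed in the output above (constant constructors 0 and 1, block tags 0, 1 and 2).</p>

```ocaml
type range =
  | Empty
  | HalfLeft of int
  | HalfRight of int
  | Range of int * int
  | FullRange

type ext_int =
  | MinusInfinity
  | Finite of int
  | PlusInfinity

(* Illustration only: dispatch by hand on the runtime representation. *)
let range_min_unsafe (r : range) : ext_int =
  let o = Obj.repr r in
  if Obj.is_int o then
    (* Constant constructors: 0 is Empty, 1 is FullRange. *)
    (match (Obj.obj o : int) with
     | 0 -> PlusInfinity
     | _ -> MinusInfinity)
  else
    (* Blocks: the tag selects the constructor. *)
    (match Obj.tag o with
     | 0 -> MinusInfinity                  (* HalfLeft _ *)
     | _ ->
       (* HalfRight a and Range (a, _): the answer is the first field. *)
       Finite (Obj.obj (Obj.field o 0) : int))
```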
<h2 id="conclusion">Conclusion</h2>
<p>Algebraic Data Types are a simple model of disjoint unions, for which
case analysis is the most natural operation. More mainstream languages offer some
alternatives, but they are more limited when modelling the same problem.</p>
<p>For example, in object-oriented languages, the Visitor pattern is the natural
way to do it. But class trees are inherently “open”, thus breaking the
exhaustiveness property.</p>
<p>The closest implementation is tagged unions in C, but they require you to roll your
own solution using <code>enum</code>s, <code>struct</code>s and <code>union</code>s. This also means that all
your hand-allocated blocks will have the same size.</p>
<p>Oh, and I would love to know how this problem is solved with other paradigms !</p>
<section id="footnotes" class="footnotes footnotes-end-of-document" role="doc-endnotes">
<hr />
<ol>
<li id="fn1"><p>Unfortunately, so is <code>Range (10, 2)</code>. The invariant that a ≤ b has to be
enforced by the programmer when using this constructor.<a href="#fnref1" class="footnote-back" role="doc-backlink">↩︎</a></p></li>
<li id="fn2"><p>For 0-ary constructors, the type has to be <code>unit -&gt; 'a</code> instead of <code>'a</code> to
allow side effects to happen during pattern matching.<a href="#fnref2" class="footnote-back" role="doc-backlink">↩︎</a></p></li>
<li id="fn3"><p>More precisely, we would have to show that any function written with
pattern matching can be adapted to use the deconstructor instead. I hope
that this example is general enough to get the idea.<a href="#fnref3" class="footnote-back" role="doc-backlink">↩︎</a></p></li>
</ol>
</section>]]></description>
    <pubDate>Wed, 14 Dec 2011 00:00:00 UT</pubDate>
    <guid>http://blog.emillon.org/posts/2011-12-14-what-s-in-an-adt.html</guid>
    <dc:creator>Etienne Millon</dc:creator>
</item>
<item>
    <title>Making type inference explode</title>
    <link>http://blog.emillon.org/posts/2014-05-21-making-type-inference-explode.html</link>
    <description><![CDATA[<p>Hindley-Milner type systems are in a sweet spot in that they are both expressive
and easy to infer. For example, type inference can turn this program:</p>
<div class="sourceCode" id="cb1"><pre class="sourceCode ocaml"><code class="sourceCode ocaml"><span id="cb1-1"><a href="#cb1-1" aria-hidden="true" tabindex="-1"></a><span class="kw">let</span> <span class="kw">rec</span> length = <span class="kw">function</span></span>
<span id="cb1-2"><a href="#cb1-2" aria-hidden="true" tabindex="-1"></a>  | [] -&gt; <span class="dv">0</span> </span>
<span id="cb1-3"><a href="#cb1-3" aria-hidden="true" tabindex="-1"></a>  | x::xs -&gt; <span class="dv">1</span> + length xs</span></code></pre></div>
<p>into this one (the top-level type <code>'a list -&gt; int</code> is usually what is
interesting but the compiler has to infer the type of every subexpression):</p>
<div class="sourceCode" id="cb2"><pre class="sourceCode ocaml"><code class="sourceCode ocaml"><span id="cb2-1"><a href="#cb2-1" aria-hidden="true" tabindex="-1"></a><span class="kw">let</span> <span class="kw">rec</span> length : &#39;a <span class="dt">list</span> -&gt; <span class="dt">int</span> = <span class="kw">function</span></span>
<span id="cb2-2"><a href="#cb2-2" aria-hidden="true" tabindex="-1"></a>  | [] -&gt; (<span class="dv">0</span> : <span class="dt">int</span>)</span>
<span id="cb2-3"><a href="#cb2-3" aria-hidden="true" tabindex="-1"></a>  | (x:&#39;a)::(xs : &#39;a <span class="dt">list</span>) -&gt; (<span class="dv">1</span> : <span class="dt">int</span>)</span>
<span id="cb2-4"><a href="#cb2-4" aria-hidden="true" tabindex="-1"></a>        + ((length : &#39;a <span class="dt">list</span> -&gt; <span class="dt">int</span>) (xs : &#39;a <span class="dt">list</span>) : <span class="dt">int</span>)</span></code></pre></div>
<p>Because the compiler does so much work, it is reasonable to wonder whether it is
efficient. The theoretical answer to this question is that type inference is
EXPTIME-complete, but given reasonable constraints on the program, it can be done in
quasi-linear time (<span class="math inline"><em>n</em> log <em>n</em></span> where <span class="math inline"><em>n</em></span> is the size of the program).</p>
<p>Still, one may wonder what kind of pathological cases show this exponential
effect. Here is one such example:</p>
<div class="sourceCode" id="cb3"><pre class="sourceCode ocaml"><code class="sourceCode ocaml"><span id="cb3-1"><a href="#cb3-1" aria-hidden="true" tabindex="-1"></a><span class="kw">let</span> p x y = <span class="kw">fun</span> z -&gt; z x y ;;</span>
<span id="cb3-2"><a href="#cb3-2" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb3-3"><a href="#cb3-3" aria-hidden="true" tabindex="-1"></a><span class="kw">let</span> r () =</span>
<span id="cb3-4"><a href="#cb3-4" aria-hidden="true" tabindex="-1"></a><span class="kw">let</span> x1 = <span class="kw">fun</span> x -&gt; p x x <span class="kw">in</span></span>
<span id="cb3-5"><a href="#cb3-5" aria-hidden="true" tabindex="-1"></a><span class="kw">let</span> x2 = <span class="kw">fun</span> z -&gt; x1 (x1 z) <span class="kw">in</span></span>
<span id="cb3-6"><a href="#cb3-6" aria-hidden="true" tabindex="-1"></a><span class="kw">let</span> x3 = <span class="kw">fun</span> z -&gt; x2 (x2 z) <span class="kw">in</span></span>
<span id="cb3-7"><a href="#cb3-7" aria-hidden="true" tabindex="-1"></a>x3 (<span class="kw">fun</span> z -&gt; z);;</span></code></pre></div>
<p>The type signature of <code>r</code> is already daunting:</p>
<pre><code>% ocamlc -i types.ml
val p : &#39;a -&gt; &#39;b -&gt; (&#39;a -&gt; &#39;b -&gt; &#39;c) -&gt; &#39;c
val r :
  unit -&gt;
  ((((((((&#39;a -&gt; &#39;a) -&gt; (&#39;a -&gt; &#39;a) -&gt; &#39;b) -&gt; &#39;b) -&gt;
       (((&#39;a -&gt; &#39;a) -&gt; (&#39;a -&gt; &#39;a) -&gt; &#39;b) -&gt; &#39;b) -&gt; &#39;c) -&gt;
      &#39;c) -&gt;
     (((((&#39;a -&gt; &#39;a) -&gt; (&#39;a -&gt; &#39;a) -&gt; &#39;b) -&gt; &#39;b) -&gt;
       (((&#39;a -&gt; &#39;a) -&gt; (&#39;a -&gt; &#39;a) -&gt; &#39;b) -&gt; &#39;b) -&gt; &#39;c) -&gt;
      &#39;c) -&gt;
     &#39;d) -&gt;
    &#39;d) -&gt;
   (((((((&#39;a -&gt; &#39;a) -&gt; (&#39;a -&gt; &#39;a) -&gt; &#39;b) -&gt; &#39;b) -&gt;
       (((&#39;a -&gt; &#39;a) -&gt; (&#39;a -&gt; &#39;a) -&gt; &#39;b) -&gt; &#39;b) -&gt; &#39;c) -&gt;
      &#39;c) -&gt;
     (((((&#39;a -&gt; &#39;a) -&gt; (&#39;a -&gt; &#39;a) -&gt; &#39;b) -&gt; &#39;b) -&gt;
       (((&#39;a -&gt; &#39;a) -&gt; (&#39;a -&gt; &#39;a) -&gt; &#39;b) -&gt; &#39;b) -&gt; &#39;c) -&gt;
      &#39;c) -&gt;
     &#39;d) -&gt;
    &#39;d) -&gt;
   &#39;e) -&gt;
  &#39;e</code></pre>
<p>But what’s interesting about this program is that we can add (or remove) lines
to study how input size can alter the processing time and output type size. It
explodes:</p>
<table>
<thead>
<tr>
<th>n</th>
<th style="text-align: right;">wc -c</th>
<th style="text-align: right;">time</th>
<th style="text-align: right;">leaves(n)</th>
</tr>
</thead>
<tbody>
<tr>
<td>1</td>
<td style="text-align: right;">98</td>
<td style="text-align: right;">15ms</td>
<td style="text-align: right;">1</td>
</tr>
<tr>
<td>2</td>
<td style="text-align: right;">167</td>
<td style="text-align: right;">15ms</td>
<td style="text-align: right;">2</td>
</tr>
<tr>
<td>3</td>
<td style="text-align: right;">610</td>
<td style="text-align: right;">15ms</td>
<td style="text-align: right;">8</td>
</tr>
<tr>
<td>4</td>
<td style="text-align: right;">11630</td>
<td style="text-align: right;">38ms</td>
<td style="text-align: right;">128</td>
</tr>
<tr>
<td>5</td>
<td style="text-align: right;">4276270</td>
<td style="text-align: right;">6.3s</td>
<td style="text-align: right;">32768</td>
</tr>
</tbody>
</table>
<p>Observing the number of <code>('a -&gt; 'a)</code> leaves in the output type reveals that it
is squared and doubled at each step (leaves(n+1) = 2 · leaves(n)²), leading to doubly exponential growth.</p>
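<p>For readers who want to reproduce the experiment, here is a small sketch that generates the test program for a given <code>n</code> and computes the leaves(n) column of the table from the “square and double” rule. The helper names <code>gen_program</code> and <code>leaves</code> are mine, and I am assuming that <code>n</code> counts the <code>x1</code> … <code>xn</code> definitions, as in the three-definition example above.</p>

```ocaml
(* Build the source of the test program with definitions x1 .. xn
   (n is assumed to be at least 1). *)
let gen_program n =
  let buf = Buffer.create 256 in
  Buffer.add_string buf "let p x y = fun z -> z x y ;;\n\nlet r () =\n";
  Buffer.add_string buf "let x1 = fun x -> p x x in\n";
  for i = 2 to n do
    Buffer.add_string buf
      (Printf.sprintf "let x%d = fun z -> x%d (x%d z) in\n" i (i - 1) (i - 1))
  done;
  Buffer.add_string buf (Printf.sprintf "x%d (fun z -> z);;\n" n);
  Buffer.contents buf

(* leaves 1 = 1, then each step squares the previous count and doubles it. *)
let rec leaves n =
  if n > 1 then
    let l = leaves (n - 1) in
    2 * l * l
  else 1

(* Print the generated program for n = 3 and the first five leaf counts. *)
let () =
  print_string (gen_program 3);
  List.iter (fun n -> Printf.printf "leaves(%d) = %d\n" n (leaves n))
    [1; 2; 3; 4; 5]
```

<p>With native 63-bit integers, <code>leaves</code> overflows from n = 7 onwards, which says something about how quickly the types grow.</p>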
<p>In practice, this effect does not appear in day-to-day programs because
programmers annotate the top-level declarations with their types. In that case,
the size of the types would be merely proportional to the size of the program,
because the type annotation would be gigantic.</p>
<p>Also, programmers tend to write functions that do something useful, which these
do not seem to do ☺.</p>]]></description>
    <pubDate>Wed, 21 May 2014 00:00:00 UT</pubDate>
    <guid>http://blog.emillon.org/posts/2014-05-21-making-type-inference-explode.html</guid>
    <dc:creator>Etienne Millon</dc:creator>
</item>
<item>
    <title>NaBoMaMo 2016 writeup</title>
    <link>http://blog.emillon.org/posts/2017-02-01-nabomamo-2016-writeup.html</link>
    <description><![CDATA[<p>Hello! It’s 2016, it’s November, and apparently it rhymes with <a href="http://nabomamo.botally.net/">#NaBoMaMo</a> 2016,
the National Bot Making Month. <a href="https://github.com/emillon/rain-bot">I made a bot!</a>.</p>
<p><em>Full disclosure:</em> it’s actually 2017, but I started writing this in 2016 so
it’s OK. Also I’m not actually from the US, but I’ll relax the definition a bit
and let’s pretend it means International Bot Making Year. Close enough!</p>
<p>Bots are all the rage - Twitter bots, IRC bots, Telegram bots… I decided to
make a Slack bot to get more familiar with that API.</p>
<p>I wanted this to be a small project - write and forget, basically. I started by
defining some specs and locking those down:</p>
<ul>
<li>that bot works on Slack</li>
<li>it uses the “will it rain in the next hour” API from Météo France.</li>
<li>the bot understands 3 commands:
<ul>
<li>tell you whether it will rain or not.</li>
<li>show you a graph of rain level over the next hour.</li>
<li>tell you when to go out to avoid the rain.</li>
</ul></li>
</ul>
<p>The next step was choosing the tech stack. For hosting itself I was sold on
using Heroku from previous projects (or another PaaS host, for what it’s worth).</p>
<p>As for the programming language itself, I hesitated between three choices:</p>
<ol type="1">
<li>focus on the all-included experience: something that has libraries, tooling,
but somehow boring;</li>
<li>focus on the shipping experience: stuff that I use daily, but looking to get
something online quickly;</li>
<li>focus on learning something new.</li>
</ol>
<p>The first one means something like Python or Ruby. I am familiar with the
languages and am pretty sure that there are libraries that can take care of the
Slack API without me having to ever worry about HTTP endpoints. That means also
first-class deployment and hosting.</p>
<p>The second one is about OCaml: it’s a programming language I use daily at work,
but the real goal would be to focus on shipping: create a project, write tests,
write implementation, deploy, repeat for new features, forget.</p>
<p>The third one means a totally new programming language. I heard a lot of good
things about Elixir for backend applications and figured that it would be a good
intro project. Learning a new language is always an interesting experience,
because it makes you a better programmer in all languages, and having clear
specs would make this manageable.</p>
<p>The Python/Ruby solution seemed a bit boring. I probably would not learn a lot,
except maybe add a couple of libraries to my toolbelt.</p>
<p>Elixir sounds great, but learning a new language and a new project at the same
time is too hard and too time consuming. I would rather write in a new language
something I previously wrote in another language. Though for something small and
focused like this, that could have worked.</p>
<p>I first created the project structure: github repo, ocaml project (topkg, opam,
etc). I like to use TDD for this kind of project, so I added a small <a href="https://github.com/mirage/alcotest">alcotest</a>
suite. I also created the 12factor separation: a <code>Procfile</code>, a small <code>bin/</code>
shell that reads the application configuration from the environment and starts a
bot from <code>lib/</code>.</p>
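<p>As an illustration of that separation, here is roughly what the configuration-reading part of such a <code>bin/</code> shell could look like. This is a sketch, not the actual code of the bot: <code>port_of_env</code>, its default value and the choice of the <code>PORT</code> variable (the one Heroku sets for web processes) are assumptions on my part.</p>

```ocaml
(* Hypothetical helper: read the listening port from the environment,
   falling back to a default when the variable is missing or malformed.
   The lookup function is passed in so that the logic can be tested
   without touching the real environment. *)
let port_of_env ?(default = 8080) lookup =
  match lookup "PORT" with
  | None -> default
  | Some s ->
    (match int_of_string_opt s with
     | Some p -> p
     | None -> default)

(* In the real executable, the standard library lookup would be used. *)
let () =
  Printf.printf "would listen on port %d\n" (port_of_env Sys.getenv_opt)
```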
<p>I asked myself what to test: the <a href="https://github.com/mirage/ocaml-cohttp">cohttp</a> library is nice, because servers and
clients are built using normal functions that take a request and return a
response. That makes it possible to test almost everything at the ocaml level
without having to go to the HTTP level. This is especially important since there
is no way to mock values and functions in ocaml. Everything has to be real
objects.</p>
<p>However, even if it was possible to test everything, I decided to just focus on
the domain logic without testing the HTTP part: for example, I would pass data
structures directly to my bot object rather than building a cohttp request.</p>
<p>One part that is important to me, even for a small project like this, is to have
some sort of CI: have Travis run my test suite, and make a binary ready to be
deployed to Heroku. That way, it is impossible to forget how to make changes,
test and deploy, since this is all in a script.</p>
<p>The other part that needed work is the actual Slack integration. The “slash”
command API is pretty simple: it is possible to configure a Slack team such that
typing <code>/rain</code> will hit a particular URL. Some options are passed as <code>POST</code> data
and whatever is returned is displayed in Slack.</p>
<p>I set up the Slack integration, wrote a function to distinguish between
<code>/rain</code> and <code>/rain list</code> (using the POST data), and by the end of the second
iteration I had my second feature implemented, working, and deployed.</p>
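<p>As a rough sketch of that dispatch step (not the actual code of the bot): Slack sends whatever was typed after the command in a <code>text</code> form field. Assuming the <code>POST</code> body has already been decoded into a list of key/value pairs, the two commands can be told apart as follows. The <code>command</code> type and the <code>parse_command</code> function are names made up for this illustration.</p>
<div class="sourceCode"><pre class="sourceCode ocaml"><code class="sourceCode ocaml">(* Hypothetical command type: the bot only knows two commands. *)
type command = Rain | Rain_list | Unknown of string

(* [params] is the decoded form data,
   e.g. [ (&quot;command&quot;, &quot;/rain&quot;); (&quot;text&quot;, &quot;list&quot;) ]. *)
let parse_command params =
  match List.assoc_opt &quot;text&quot; params with
  | None -&gt; Rain
  | Some text -&gt; (
      match String.trim text with
      | &quot;&quot; -&gt; Rain
      | &quot;list&quot; -&gt; Rain_list
      | other -&gt; Unknown other)

let () =
  assert (parse_command [ (&quot;text&quot;, &quot;&quot;) ] = Rain);
  assert (parse_command [ (&quot;text&quot;, &quot; list &quot;) ] = Rain_list)</code></pre></div>
<p>Returning a variant rather than acting directly keeps this function pure, so it can be tested without any HTTP machinery.</p>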
<p>All in all, that was pretty great. Neither the code nor the bot itself is
particularly fantastic, but I learned some important lessons:</p>
<ul>
<li>When you do not want to spend a lot of time on a task, invest in planning and
keep the list of features short. That is pretty obvious in the context of paid
work, but this applies well to hobby programming too.</li>
<li>Know what to test and what not to. Tests are useful to ensure that changes can
be made without breaking everything, but testing that your HTTP library can
parse POST data is a waste of time.</li>
<li>In languages where it is not possible to mock or monkey patch functions,
dependency injection is still possible. One may even argue that it leads to
a better solution, since it removes the coupling between the different
components.</li>
</ul>
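<p>To illustrate the last point, here is a minimal sketch of dependency injection done by passing functions as arguments. The names (<code>forecast</code>, <code>reply</code>, <code>fetch</code>) are invented for this example and are not taken from the source of the bot: the only idea is that the domain logic receives the function that talks to the outside world, so a test can hand it a stub.</p>
<div class="sourceCode"><pre class="sourceCode ocaml"><code class="sourceCode ocaml">(* Hypothetical forecast returned by some weather service. *)
type forecast = { city : string; rain_mm : float }

(* The domain logic takes [fetch] as an argument instead of calling
   an HTTP client directly. *)
let reply ~fetch city =
  match fetch city with
  | Some { rain_mm; _ } when rain_mm &gt; 0. -&gt;
      Printf.sprintf &quot;Take an umbrella in %s (%.1f mm expected)&quot; city rain_mm
  | Some _ -&gt; Printf.sprintf &quot;No rain expected in %s&quot; city
  | None -&gt; Printf.sprintf &quot;No forecast available for %s&quot; city

(* In a test, a stub replaces the real network call. *)
let stub_fetch = function
  | &quot;Paris&quot; -&gt; Some { city = &quot;Paris&quot;; rain_mm = 2.5 }
  | &quot;Nice&quot; -&gt; Some { city = &quot;Nice&quot;; rain_mm = 0. }
  | _ -&gt; None

let () =
  assert (reply ~fetch:stub_fetch &quot;Nice&quot; = &quot;No rain expected in Nice&quot;);
  assert (
    reply ~fetch:stub_fetch &quot;Atlantis&quot; = &quot;No forecast available for Atlantis&quot;)</code></pre></div>
<p>In production, the <code>bin/</code> entry point would pass a real implementation of <code>fetch</code>; nothing in <code>reply</code> needs to change.</p>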
<p>You can find <a href="https://github.com/emillon/rain-bot">the source of this bot on GitHub</a>.
See you next year, <a href="http://nabomamo.botally.net/">#NaBoMaMo</a>!
And thanks to Tully Hansen for organizing this.</p>]]></description>
    <pubDate>Wed, 01 Feb 2017 00:00:00 UT</pubDate>
    <guid>http://blog.emillon.org/posts/2017-02-01-nabomamo-2016-writeup.html</guid>
    <dc:creator>Etienne Millon</dc:creator>
</item>
<item>
    <title>Fuzzing OCamlFormat with AFL and Crowbar</title>
    <link>http://blog.emillon.org/posts/2020-08-03-fuzzing-ocamlformat-with-afl-and-crowbar.html</link>
    <description><![CDATA[<p><em>This article was first published on the <a href="https://tarides.com/blog/2020-08-03-fuzzing-ocamlformat-with-afl-and-crowbar/">Tarides blog</a>.</em></p>
<p><a href="https://lcamtuf.coredump.cx/afl/">AFL</a> (and fuzzing in general) is often used
to find bugs in low-level code like parsers, but it also works very well to find
bugs in high level code, provided the right ingredients. We applied this
technique to feed random programs to OCamlFormat and found many formatting bugs.</p>
<p>OCamlFormat is a tool to format source code. To do so, it parses the source code
to an Abstract Syntax Tree (AST) and then applies formatting rules to the AST.</p>
<p>It can be tricky to correctly format the output. For example, say we want to
format <code>(a+b)*c</code>. The corresponding AST will look like <code>Apply("*", Apply ("+", Var "a", Var "b"), Var "c")</code>. A naive formatter would look like this:</p>
<div class="sourceCode" id="cb1"><pre class="sourceCode ocaml"><code class="sourceCode ocaml"><span id="cb1-1"><a href="#cb1-1" aria-hidden="true" tabindex="-1"></a><span class="kw">let</span> <span class="kw">rec</span> <span class="dt">format</span> = <span class="kw">function</span></span>
<span id="cb1-2"><a href="#cb1-2" aria-hidden="true" tabindex="-1"></a>  | Var s -&gt; s</span>
<span id="cb1-3"><a href="#cb1-3" aria-hidden="true" tabindex="-1"></a>  | Apply (op, e1, e2) -&gt;</span>
<span id="cb1-4"><a href="#cb1-4" aria-hidden="true" tabindex="-1"></a>      <span class="dt">Printf</span>.sprintf <span class="st">&quot;%s %s %s&quot;</span> (<span class="dt">format</span> e1) op (<span class="dt">format</span> e2)</span></code></pre></div>
<p>But this is not correct, as it will print <code>(a+b)*c</code> as <code>a+b*c</code>, which is a
different program. In this particular case, the common solution would be to
track the relative precedence of the expressions and to emit only necessary
parentheses.</p>
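<p>A minimal sketch of that idea, assuming a toy two-level precedence table and left-associative operators (this illustrates the general technique, not how OCamlFormat is implemented; <code>prec</code> and <code>format_at</code> are names chosen for the example):</p>
<div class="sourceCode"><pre class="sourceCode ocaml"><code class="sourceCode ocaml">type expr = Var of string | Apply of string * expr * expr

(* Toy precedence table: a higher number binds tighter. *)
let prec = function &quot;*&quot; | &quot;/&quot; -&gt; 2 | _ -&gt; 1

(* [ctx] is the minimum precedence the context accepts without parentheses. *)
let rec format_at ctx = function
  | Var s -&gt; s
  | Apply (op, e1, e2) -&gt;
      let p = prec op in
      (* The right operand gets a stricter context, so that a right-nested
         operator of the same level keeps its parentheses. *)
      let s = format_at p e1 ^ op ^ format_at (p + 1) e2 in
      if p &lt; ctx then &quot;(&quot; ^ s ^ &quot;)&quot; else s

let format e = format_at 0 e

let () =
  let e = Apply (&quot;*&quot;, Apply (&quot;+&quot;, Var &quot;a&quot;, Var &quot;b&quot;), Var &quot;c&quot;) in
  assert (format e = &quot;(a+b)*c&quot;)</code></pre></div>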
<p>OCamlFormat has similar cases. To make sure we do not change a program when
formatting it, there is an extra check at the end to parse the output and
compare the output AST with the input AST. This ensures that, in case of bugs,
OCamlFormat exits with an error rather than changing the meaning of the input
program.</p>
<p>When we consider the whole OCaml language, the rules are complex and it is
difficult to make sure that we are correctly handling all programs. There are
two main failure modes: either we put too many parentheses, and the program does
not look good, or we do not put enough, and the AST changes (and OCamlFormat
exits with an error). We need a way to make sure that the latter does not
happen. Tests work to some extent, but some edge cases happen only when a
certain combination of language features is used. Because of this combinatorial
explosion, it is impossible to get good coverage using tests only.</p>
<p>Fortunately there is a technique we can use to automatically explore the program
space: fuzzing. For a primer on using this technique on OCaml programs, one can
refer to <a href="https://tarides.com/blog/2019-09-04-an-introduction-to-fuzzing-ocaml-with-afl-crowbar-and-bun">this article</a>.</p>
<p>To make this work we need two elements: a random program generator, and a
property to check. Here, we are interested in programs that are valid (in the
sense that they parse correctly) but do not format correctly. We can use the
OCamlFormat internals to do the following:</p>
<ol type="1">
<li>try to parse input: in case of a parse error, just reject this input as
invalid.</li>
<li>otherwise, we have a valid program. Try to format it. If this succeeds with
no error at all, reject this input as well.</li>
<li>otherwise, it means that the AST changed, comments moved, or something
similar, in a valid program. This is what we are after.</li>
</ol>
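<p>The three steps above can be sketched as a small classification function. This is a simplified model under stated assumptions: <code>parse</code> returns <code>None</code> on a parse error, and a formatting failure is modelled as "the output does not parse back to the same AST". The <code>verdict</code> type and the toy integer "language" are invented for the example and do not come from the OCamlFormat code base.</p>
<div class="sourceCode"><pre class="sourceCode ocaml"><code class="sourceCode ocaml">type verdict = Invalid_input | Formats_fine | Bug_found

(* [parse] returns [None] on a parse error; [format] prints an AST back
   to text. *)
let classify ~parse ~format input =
  match parse input with
  | None -&gt; Invalid_input (* step 1: reject *)
  | Some ast -&gt; (
      match parse (format ast) with
      | Some ast' when ast' = ast -&gt; Formats_fine (* step 2: reject *)
      | _ -&gt; Bug_found (* step 3: what the fuzzer is after *))

(* Toy instance: the language is a single integer. *)
let parse s = int_of_string_opt (String.trim s)
let good_format = string_of_int
let buggy_format n = string_of_int (abs n) (* drops the sign *)

let () =
  assert (classify ~parse ~format:good_format &quot;oops&quot; = Invalid_input);
  assert (classify ~parse ~format:good_format &quot; -3 &quot; = Formats_fine);
  assert (classify ~parse ~format:buggy_format &quot;-3&quot; = Bug_found)</code></pre></div>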
<p>Generating random programs is a bit more difficult. We can feed random strings
to AFL, but even with a corpus of existing valid code it will generate many
invalid programs. We are not interested in these for this project; we would
rather start from valid programs.</p>
<p>A good way to do that is to use Crowbar to directly generate AST values. Thanks
to <a href="https://github.com/yomimono/ppx_deriving_crowbar"><code>ppx_deriving_crowbar</code></a> and <a href="https://github.com/ocaml-ppx/ppx_import"><code>ppx_import</code></a>
it is possible to generate random values for an external type like
<code>Parsetree.structure</code> (the contents of <code>.ml</code> files). Even more fortunately
<a href="https://github.com/yomimono/ocaml-test-omp/blob/d086037027537ba4e23ce027766187979c85aa3d/test/parsetree_405.ml">somebody already did the work</a>. Thanks, Mindy!</p>
<p>This approach works really well: it generates 5k-10k programs per second, which
is very good performance (AFL starts complaining below 100/s).</p>
<p>AFL quickly found crashes related to attributes. These are “labels”
attached to various nodes of the AST. For example the expression <code>(x || y) [@a]</code>
(logical or between <code>x</code> and <code>y</code>, attach attribute <code>a</code> to the “or” expression)
would get formatted as <code>x || y [@a]</code> (attribute <code>a</code> is attached to the <code>y</code>
variable). Once again, there is a check in place in OCamlFormat to make sure
that it does not save the file in this case; instead, it exits with an error.</p>
<p>After the fuzzer had run for a bit longer, it found crashes where comments would
jump around in expressions like <code>f (*a*) (*bb*) x</code>. Wait, what? We never told
the program generator how to generate comments. Inspecting the intermediate AST,
the part in the middle is actually an integer literal with value <code>"(*a*) (*bb*)"</code> (integer literals are represented as strings so that <a href="https://github.com/Drup/Zarith-ppx">a third party
library could add literals for arbitrary precision numbers</a> for
example).</p>
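<p>A toy model makes this easier to see. In the sketch below, the <code>expr</code> type and the <code>print</code> function are made up for illustration; the only thing borrowed from the real AST is that an integer literal carries its text as a string. A generator that fills that string with arbitrary characters can then produce something that looks like comments once printed.</p>
<div class="sourceCode"><pre class="sourceCode ocaml"><code class="sourceCode ocaml">(* Toy AST: integer literals keep their source text as a string. *)
type expr = Ident of string | Int_literal of string | App of expr * expr list

(* Literals are printed verbatim. *)
let rec print = function
  | Ident s | Int_literal s -&gt; s
  | App (f, args) -&gt; String.concat &quot; &quot; (print f :: List.map print args)

(* A value that a type-driven generator is free to build. *)
let surprising = App (Ident &quot;f&quot;, [ Int_literal &quot;(*a*) (*bb*)&quot;; Ident &quot;x&quot; ])

let () = assert (print surprising = &quot;f (*a*) (*bb*) x&quot;)</code></pre></div>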
<p>AFL comes with a program called <code>afl-tmin</code> that is used to minimize a crash. It
will try to find a smaller example of a program that crashes OCamlFormat. It
works well even with Crowbar in between. For example it is able to turn <code>(new aaaaaa &amp; [0;0;0;0])[@aaaaaaaaaa]</code> into <code>(0&amp;0)[@a]</code> (neither AFL nor OCamlFormat
knows about types, so they can operate on nonsensical programs. Finding a
well-typed version of a crash is usually not very difficult, but it has to be
done manually).</p>
<p>In total, letting AFL run overnight on a single core (that is relatively short
in terms of fuzzing) caused 453 crashes. After minimization and deduplication,
this corresponded to <a href="https://github.com/ocaml-ppx/ocamlformat/issues?q=label%3Afuzz">about 30 unique issues</a>.</p>
<p>Most of them are related to attributes that OCamlFormat did not try to include
in the output, or where it forgot to add parentheses. Fortunately, there are
safeguards in OCamlFormat: since it checks that the formatting preserves the AST
structure, it will exit with an error instead of outputting a different program.</p>
<p>Once again, fuzzing has proved itself as a powerful technique to find actual
bugs (including high-level ones). A possible approach for a next iteration is to
try to detect more problems during formatting, such as finding cases where lines
are longer than allowed. It is also possible to extend the random program
generator so that it tries to generate comments, and let OCamlFormat check that
they are all laid out correctly in the output. We look forward to employing
fuzzing more extensively for OCamlFormat development in the future.</p>]]></description>
    <pubDate>Mon, 03 Aug 2020 00:00:00 UT</pubDate>
    <guid>http://blog.emillon.org/posts/2020-08-03-fuzzing-ocamlformat-with-afl-and-crowbar.html</guid>
    <dc:creator>Etienne Millon</dc:creator>
</item>
<item>
    <title>Introducing tree-sitter-dune</title>
    <link>http://blog.emillon.org/posts/2024-07-26-introducing-tree-sitter-dune.html</link>
    <description><![CDATA[<p>I made a <a href="https://tree-sitter.github.io/tree-sitter/">tree-sitter</a> plugin for
<code>dune</code> files. It is available <a href="https://github.com/emillon/tree-sitter-dune">on
GitHub</a>.</p>
<p>Tree-sitter is a parsing system that can be used in text editors.
<a href="https://dune.build/">Dune</a> is a build system for OCaml projects.
Its configuration language lives in <code>dune</code> files, which use an s-expression
syntax.</p>
<p>This makes highlighting challenging: the lexing part of the language is very
simple (atoms, strings, parentheses), but it is not enough to make a good
highlighter.</p>
<p>In the following example, <code>with-stdout-to</code> and <code>echo</code> are “actions” that we
could highlight in a special way, but these names can also appear in places
where they are not interpreted as actions, and doing so would be confusing (for
example, we could write to a file named <code>echo</code> instead of <code>foo.txt</code>).</p>
<div class="sourceCode" id="cb1"><pre class="sourceCode scheme"><code class="sourceCode scheme"><span id="cb1-1"><a href="#cb1-1" aria-hidden="true" tabindex="-1"></a>(rule</span>
<span id="cb1-2"><a href="#cb1-2" aria-hidden="true" tabindex="-1"></a> (action</span>
<span id="cb1-3"><a href="#cb1-3" aria-hidden="true" tabindex="-1"></a>  (with-stdout-to</span>
<span id="cb1-4"><a href="#cb1-4" aria-hidden="true" tabindex="-1"></a>   foo.txt</span>
<span id="cb1-5"><a href="#cb1-5" aria-hidden="true" tabindex="-1"></a>   (echo <span class="st">&quot;testing&quot;</span>))))</span></code></pre></div>
<p>Tree-sitter solves this, because it creates an actual parser that goes beyond
lexing.</p>
<p>In this example, I created grammar rules that parse the contents of <code>(action ...)</code> as an action, recognizing the various constructs of this DSL.</p>
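<p>To give an idea of what context-dependent parsing buys us, here is a small OCaml sketch. It is an illustration of the idea only, not the tree-sitter grammar: the <code>sexp</code> and <code>tag</code> types and the <code>tag_*</code> functions are invented for the example, and real <code>dune</code> files have many more cases. The head of a list is tagged as an action name only when it appears inside an <code>action</code> field.</p>
<div class="sourceCode"><pre class="sourceCode ocaml"><code class="sourceCode ocaml">type sexp = Atom of string | List of sexp list
type tag = Stanza_name | Field_name | Action_name | Other

(* Outside of any known context, every atom is plain. *)
let rec tag_plain = function
  | Atom a -&gt; [ (a, Other) ]
  | List l -&gt; List.concat_map tag_plain l

(* Inside an action, the head of each list is an action name. *)
let rec tag_action = function
  | List (Atom name :: args) -&gt;
      (name, Action_name) :: List.concat_map tag_action args
  | sexp -&gt; tag_plain sexp

(* Only the [action] field switches to the action sub-language. *)
let tag_field = function
  | List (Atom &quot;action&quot; :: body) -&gt;
      (&quot;action&quot;, Field_name) :: List.concat_map tag_action body
  | List (Atom name :: args) -&gt;
      (name, Field_name) :: List.concat_map tag_plain args
  | sexp -&gt; tag_plain sexp

let tag_stanza = function
  | List (Atom name :: fields) -&gt;
      (name, Stanza_name) :: List.concat_map tag_field fields
  | sexp -&gt; tag_plain sexp

(* Same rule as above, but writing to a file named [echo]. *)
let example =
  List [ Atom &quot;rule&quot;;
         List [ Atom &quot;action&quot;;
                List [ Atom &quot;with-stdout-to&quot;; Atom &quot;echo&quot;;
                       List [ Atom &quot;echo&quot;; Atom &quot;testing&quot; ] ] ] ]

let () =
  assert (tag_stanza example =
    [ (&quot;rule&quot;, Stanza_name); (&quot;action&quot;, Field_name);
      (&quot;with-stdout-to&quot;, Action_name); (&quot;echo&quot;, Other);
      (&quot;echo&quot;, Action_name); (&quot;testing&quot;, Other) ])</code></pre></div>
<p>The two occurrences of <code>echo</code> receive different tags, which a purely lexical highlighter cannot do.</p>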
<p>The output of the parser is this syntax tree with location information (for
some reason, line numbers start at 0 which is normal and unusual at the same
time).</p>
<pre><code>(source_file [0, 0] - [5, 0]
  (stanza [0, 0] - [4, 22]
    (stanza_name [0, 1] - [0, 5])
    (field_name [1, 2] - [1, 8])
    (action [2, 2] - [4, 20]
      (action_name [2, 3] - [2, 17])
      (file_name_target [3, 3] - [3, 10]
        (file_name [3, 3] - [3, 10]))
      (action [4, 3] - [4, 19]
        (action_name [4, 4] - [4, 8])
        (quoted_string [4, 9] - [4, 18])))))</code></pre>
<p>The various strings are annotated with their type: we have stanza names
(<code>rule</code>), field names (<code>action</code>), action names (<code>with-stdout-to</code>, <code>echo</code>), file
names (<code>foo.txt</code>), and plain strings (<code>"testing"</code>).</p>
<p>By itself, that is not useful, but it’s possible to write <em>queries</em> to make
this syntax tree do interesting stuff.</p>
<p>The first one is highlighting: we can set styles for various “patterns” (in
practice, I only used node names) by defining queries:</p>
<div class="sourceCode" id="cb3"><pre class="sourceCode scheme"><code class="sourceCode scheme"><span id="cb3-1"><a href="#cb3-1" aria-hidden="true" tabindex="-1"></a>(stanza_name) @function</span>
<span id="cb3-2"><a href="#cb3-2" aria-hidden="true" tabindex="-1"></a>(field_name) @property</span>
<span id="cb3-3"><a href="#cb3-3" aria-hidden="true" tabindex="-1"></a>(quoted_string) @string</span>
<span id="cb3-4"><a href="#cb3-4" aria-hidden="true" tabindex="-1"></a>(multiline_string) @string</span>
<span id="cb3-5"><a href="#cb3-5" aria-hidden="true" tabindex="-1"></a>(action_name) @keyword</span></code></pre></div>
<p>The parts with <code>@</code> map to “highlight groups” used in text editors.</p>
<p>Another type of query is called “injections”. It is used to link different
types of grammars together. For example, <code>dune</code> files can start with a special
comment that indicates that the rest of the file is an OCaml program. In that
case, the parser emits a single <code>ocaml_syntax</code> node and the following injection
indicates that this file should be parsed using an OCaml parser:</p>
<div class="sourceCode" id="cb4"><pre class="sourceCode scheme"><code class="sourceCode scheme"><span id="cb4-1"><a href="#cb4-1" aria-hidden="true" tabindex="-1"></a>((ocaml_syntax) @injection.content</span>
<span id="cb4-2"><a href="#cb4-2" aria-hidden="true" tabindex="-1"></a> (#<span class="kw">set!</span> injection.language <span class="st">&quot;ocaml&quot;</span>))</span></code></pre></div>
<p>Another use case for this is <code>system</code> actions: these strings in <code>dune</code> files
could be interpreted using a shell parser.</p>
<p>In the other direction, it is possible to inject <code>dune</code> files into another
document. For example, a markdown parser can use injections to highlight code
blocks.</p>
<p>I’m happy to have explored this technology. The toolchain seemed complex at
first: there’s a compiler which seems to be a mix of node and rust, which
generates C, which is compiled into a dynamically loaded library; but this is
actually pretty well integrated in nix and neovim, so the details are made
invisible.</p>
<p>The testing mechanism is similar to the cram tests we use in Dune, but I was a
bit confused with the colors at first: when the output of a test changes, Dune
considers that the new output is a <code>+</code> in the diff, and highlights it in green;
while tree-sitter considers that the “expected output” is green.</p>
<p>There are many ways to improve this prototype: either by adding queries (it’s
possible to define text objects, folding expressions, etc), or by improving
coverage for <code>dune</code> files (in most cases, the parser uses an s-expression
fallback). I’m also curious to see if it’s possible to use this parser to
provide a completion source. Since the strings are tagged with their type (are
we expecting a library name, a module name, etc), I think we could use that to
provide context-specific completions, but that’s probably difficult to do.</p>
<p>Thanks to <a href="https://x.com/teej_dv">teej</a> for the initial idea and the useful
resources.</p>]]></description>
    <pubDate>Fri, 26 Jul 2024 00:00:00 UT</pubDate>
    <guid>http://blog.emillon.org/posts/2024-07-26-introducing-tree-sitter-dune.html</guid>
    <dc:creator>Etienne Millon</dc:creator>
</item>

    </channel>
</rss>
