<?xml version="1.0" encoding="utf-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom"
    xmlns:dc="http://purl.org/dc/elements/1.1/">
    <channel>
        <title>Enter the void *</title>
        <link>http://blog.emillon.org</link>
        <description><![CDATA[Yet another random hacker]]></description>
        <atom:link href="http://blog.emillon.org/feeds/tree-sitter.xml" rel="self"
                   type="application/rss+xml" />
        <lastBuildDate>Fri, 26 Jul 2024 00:00:00 UT</lastBuildDate>
        <item>
    <title>Introducing tree-sitter-dune</title>
    <link>http://blog.emillon.org/posts/2024-07-26-introducing-tree-sitter-dune.html</link>
    <description><![CDATA[<p>I made a <a href="https://tree-sitter.github.io/tree-sitter/">tree-sitter</a> plugin for
<code>dune</code> files. It is available <a href="https://github.com/emillon/tree-sitter-dune">on
GitHub</a>.</p>
<p>Tree-sitter is a parsing system that can be used in text editors.
<a href="https://dune.build/">Dune</a> is a build system for OCaml projects.
Its configuration language lives in <code>dune</code> files which use a s-expression
syntax.</p>
<p>This makes highlighting challenging: the lexing part of the language is very
simple (atoms, strings, parentheses), but it is not enough to make a good
highlighter.</p>
<p>In the following example, <code>with-stdout-to</code> and <code>echo</code> are “actions” that we
could highlight in a special way, but these names can also appear in places
where they are not interpreted as actions, and doing so would be confusing (for
example, we could write to a file named <code>echo</code> instead of <code>foo.txt</code>.</p>
<div class="sourceCode" id="cb1"><pre class="sourceCode scheme"><code class="sourceCode scheme"><span id="cb1-1"><a href="#cb1-1" aria-hidden="true" tabindex="-1"></a>(rule</span>
<span id="cb1-2"><a href="#cb1-2" aria-hidden="true" tabindex="-1"></a> (action</span>
<span id="cb1-3"><a href="#cb1-3" aria-hidden="true" tabindex="-1"></a>  (with-stdout-to</span>
<span id="cb1-4"><a href="#cb1-4" aria-hidden="true" tabindex="-1"></a>   foo.txt</span>
<span id="cb1-5"><a href="#cb1-5" aria-hidden="true" tabindex="-1"></a>   (echo <span class="st">&quot;testing&quot;</span>))))</span></code></pre></div>
<p>Tree-sitter solves this, because it creates an actual parser that goes beyond
lexing.</p>
<p>In this example, I created grammar rules that parse the contents of <code>(action ...)</code> as an action, recognizing the various constructs of this DSL.</p>
<p>The output of the parser is this syntax tree with location information (for
some reason, line numbers start at 0 which is normal and unusual at the same
time).</p>
<pre><code>(source_file [0, 0] - [5, 0]
  (stanza [0, 0] - [4, 22]
    (stanza_name [0, 1] - [0, 5])
    (field_name [1, 2] - [1, 8])
    (action [2, 2] - [4, 20]
      (action_name [2, 3] - [2, 17])
      (file_name_target [3, 3] - [3, 10]
        (file_name [3, 3] - [3, 10]))
      (action [4, 3] - [4, 19]
        (action_name [4, 4] - [4, 8])
        (quoted_string [4, 9] - [4, 18])))))</code></pre>
<p>The various strings are annotated with their type: we have stanza names
(<code>rule</code>), field names (<code>action</code>), action names (<code>with-stdout-to</code>, <code>echo</code>), file
names (<code>foo.txt</code>), and plain strings (<code>"testing"</code>).</p>
<p>By itself, that is not useful, but it’s possible to write <em>queries</em> to make
this syntax tree do interesting stuff.</p>
<p>The first one is highlighting: we can set styles for various “patterns” (in
practice, I only used node names) by defining queries:</p>
<div class="sourceCode" id="cb3"><pre class="sourceCode scheme"><code class="sourceCode scheme"><span id="cb3-1"><a href="#cb3-1" aria-hidden="true" tabindex="-1"></a>(stanza_name) @function</span>
<span id="cb3-2"><a href="#cb3-2" aria-hidden="true" tabindex="-1"></a>(field_name) @property</span>
<span id="cb3-3"><a href="#cb3-3" aria-hidden="true" tabindex="-1"></a>(quoted_string) @string</span>
<span id="cb3-4"><a href="#cb3-4" aria-hidden="true" tabindex="-1"></a>(multiline_string) @string</span>
<span id="cb3-5"><a href="#cb3-5" aria-hidden="true" tabindex="-1"></a>(action_name) @keyword</span></code></pre></div>
<p>The parts with <code>@</code> map to “highlight groups” used in text editors.</p>
<p>Another type of query is called “injections”. It is used to link different
types of grammars together. For example, <code>dune</code> files can start with a special
comment that indicates that the rest of the file is an OCaml program. In that
case, the parser emits a single <code>ocaml_syntax</code> node and the following injection
indicates that this file should be parsed using an OCaml parser:</p>
<div class="sourceCode" id="cb4"><pre class="sourceCode scheme"><code class="sourceCode scheme"><span id="cb4-1"><a href="#cb4-1" aria-hidden="true" tabindex="-1"></a>((ocaml_syntax) @injection.content</span>
<span id="cb4-2"><a href="#cb4-2" aria-hidden="true" tabindex="-1"></a> (#<span class="kw">set!</span> injection.language <span class="st">&quot;ocaml&quot;</span>))</span></code></pre></div>
<p>Another use case for this is <code>system</code> actions: these strings in <code>dune</code> files
could be interpreted using a shell parser.</p>
<p>In the other direction, it is possible to inject <code>dune</code> files into another
document. For example, a markdown parser can use injections to highlight code
blocks.</p>
<p>I’m happy to have explored this technology. The toolchain seemed complex at
first: there’s a compiler which seems to be a mix of node and rust, which
generates C, which is compiled into a dynamically loaded library; but this is
actually pretty well integrated in nix and neovim to the details are made
invisible.</p>
<p>The testing mechanism is similar to the cram tests we use in Dune, but I was a
bit confused with the colors at first: when the output of a test changes, Dune
considers that the new output is a <code>+</code> in the diff, and highlights it in green;
while tree-sitter considers that the “expected output” is green.</p>
<p>There are many ways to improve this prototype: either by adding queries (it’s
possible to define text objects, folding expressions, etc), or by improving
coverage for <code>dune</code> files (in most cases, the parser uses a s-expression
fallback). I’m also curious to see if it’s possible to use this parser to
provide a completion source. Since the strings are tagged with their type (are
we expecting a library name, a module name, etc), I think we could use that to
provide context-specific completions, but that’s probably difficult to do.</p>
<p>Thanks <a href="https://x.com/teej_dv">teej</a> for the initial idea and the useful
resources.</p>]]></description>
    <pubDate>Fri, 26 Jul 2024 00:00:00 UT</pubDate>
    <guid>http://blog.emillon.org/posts/2024-07-26-introducing-tree-sitter-dune.html</guid>
    <dc:creator>Etienne Millon</dc:creator>
</item>

    </channel>
</rss>
