Enter the void *

Introducing tree-sitter-dune

by Etienne Millon on July 26, 2024

I made a tree-sitter plugin for dune files. It is available on GitHub.

Tree-sitter is a parsing system that can be used in text editors. Dune is a build system for OCaml projects. Its configuration language lives in dune files, which use an s-expression syntax.

This makes highlighting challenging: the lexing part of the language is very simple (atoms, strings, parentheses), but lexing alone is not enough to make a good highlighter.

In the following example, with-stdout-to and echo are “actions” that we could highlight in a special way. But these names can also appear in places where they are not interpreted as actions, and highlighting them there would be confusing (for example, we could write to a file named echo instead of foo.txt).

(rule
 (action
  (with-stdout-to
   foo.txt
   (echo "testing"))))
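
To see why lexing alone falls short, here is a minimal Python sketch of a lexer for the three token kinds mentioned above. It is an illustration only, not dune's actual lexer: the `tokenize` helper and the token names are made up for this example. It rewrites the rule so that the target file is named echo, and shows that both occurrences of echo come out as the same kind of token.

```python
import re

# Toy lexer for parentheses, quoted strings and atoms.
# Assumption: no comments and no multiline strings, unlike real dune files.
TOKEN_RE = re.compile(r'''
    (?P<paren>[()])
  | (?P<string>"(?:[^"\\]|\\.)*")
  | (?P<atom>[^\s()"]+)
''', re.VERBOSE)


def tokenize(source):
    """Return a list of (kind, text) pairs, skipping whitespace."""
    return [(m.lastgroup, m.group()) for m in TOKEN_RE.finditer(source)]


# Same rule as above, but the output goes to a file named "echo".
as_file = '(rule (action (with-stdout-to echo (echo "testing"))))'

# The file name and the action name are both plain atoms for the lexer.
echo_tokens = [tok for tok in tokenize(as_file) if tok[1] == "echo"]
print(echo_tokens)
```

Both entries of `echo_tokens` are `('atom', 'echo')`, so a highlighter that only sees tokens has no way to style the action and the file name differently.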

Tree-sitter solves this, because it creates an actual parser that goes beyond lexing.

In this example, I created grammar rules that parse the contents of (action ...) as an action, recognizing the various constructs of this DSL.

The output of the parser is this syntax tree with location information (for some reason, line numbers start at 0, which is normal and unusual at the same time).

(source_file [0, 0] - [5, 0]
  (stanza [0, 0] - [4, 22]
    (stanza_name [0, 1] - [0, 5])
    (field_name [1, 2] - [1, 8])
    (action [2, 2] - [4, 20]
      (action_name [2, 3] - [2, 17])
      (file_name_target [3, 3] - [3, 10]
        (file_name [3, 3] - [3, 10]))
      (action [4, 3] - [4, 19]
        (action_name [4, 4] - [4, 8])
        (quoted_string [4, 9] - [4, 18])))))

The various strings are annotated with their type: we have stanza names (rule), field names (action), action names (with-stdout-to, echo), file names (foo.txt), and plain strings ("testing").
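
The same idea can be sketched outside of tree-sitter. The following toy Python tagger is not the actual grammar: it works on the example already turned into nested lists, knows only the constructs used here, and its function names are hypothetical. It shows how the position of a string, rather than its spelling, decides its type.

```python
# Toy context-aware tagger, for illustration only (not tree-sitter-dune).
# Assumption: the input is already parsed into nested Python lists and
# only contains the constructs from the example.
ACTIONS_WITH_TARGET = {"with-stdout-to"}  # their first argument is a file name


def tag_action(node):
    """Tag an action such as (echo "testing") or (with-stdout-to file action)."""
    name, *args = node
    tagged = [("action_name", name)]
    for index, arg in enumerate(args):
        if isinstance(arg, list):
            tagged.append(tag_action(arg))        # nested action
        elif name in ACTIONS_WITH_TARGET and index == 0:
            tagged.append(("file_name", arg))     # a target, even if spelled "echo"
        elif arg.startswith('"'):
            tagged.append(("quoted_string", arg))
        else:
            tagged.append(("atom", arg))
    return tagged


def tag_stanza(node):
    """Tag a stanza such as (rule (action ...))."""
    name, *fields = node
    tagged = [("stanza_name", name)]
    for field_name, *values in fields:
        tagged.append(("field_name", field_name))
        if field_name == "action":
            tagged.extend(tag_action(value) for value in values)
        else:
            tagged.append(("untagged", values))   # generic fallback
    return tagged


# The rule from the example, with the target file renamed to "echo".
stanza = ["rule", ["action", ["with-stdout-to", "echo", ["echo", '"testing"']]]]
tagged = tag_stanza(stanza)
print(tagged)
```

The first `echo` is tagged as a `file_name` and the second one as an `action_name`, even though they are the same string.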

By itself, that is not useful, but it’s possible to write queries to make this syntax tree do interesting stuff.

The first one is highlighting: we can set styles for various “patterns” (in practice, I only used node names) by defining queries:

(stanza_name) @function
(field_name) @property
(quoted_string) @string
(multiline_string) @string
(action_name) @keyword

The parts with @ map to “highlight groups” used in text editors.
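
As a rough picture of what an editor does with such a query, here is a small Python sketch that applies the same mapping to the leaf nodes of the tree shown earlier. It does not use tree-sitter's query engine; the `highlights` function and the data layout are assumptions made for the example, which also assumes single-line, ASCII-only nodes.

```python
# Node type -> highlight group, mirroring the query above.
CAPTURES = {
    "stanza_name": "function",
    "field_name": "property",
    "quoted_string": "string",
    "multiline_string": "string",
    "action_name": "keyword",
}

SOURCE = '''(rule
 (action
  (with-stdout-to
   foo.txt
   (echo "testing"))))
'''

# Leaf nodes copied from the syntax tree:
# (type, (start_row, start_col), (end_row, end_col)), zero-based, end exclusive.
NODES = [
    ("stanza_name", (0, 1), (0, 5)),
    ("field_name", (1, 2), (1, 8)),
    ("action_name", (2, 3), (2, 17)),
    ("file_name", (3, 3), (3, 10)),
    ("action_name", (4, 4), (4, 8)),
    ("quoted_string", (4, 9), (4, 18)),
]


def highlights(source, nodes, captures):
    """Return (text, group) for every node whose type has a capture."""
    lines = source.splitlines()
    result = []
    for node_type, (row, start_col), (_, end_col) in nodes:
        group = captures.get(node_type)
        if group is not None:  # nodes without a capture keep the default style
            result.append((lines[row][start_col:end_col], group))
    return result


styled = highlights(SOURCE, NODES, CAPTURES)
print(styled)
```

Here foo.txt is left unstyled because the query has no capture for `file_name`, while both action names map to the keyword group.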

Another type of query is called “injections”. It is used to link different types of grammars together. For example, dune files can start with a special comment that indicates that the rest of the file is an OCaml program. In that case, the parser emits a single ocaml_syntax node and the following injection indicates that this file should be parsed using an OCaml parser:

((ocaml_syntax) @injection.content
 (#set! injection.language "ocaml"))
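
Conceptually, the editor ends up with a list of regions, each handled by its own parser. The Python sketch below models that decision for a whole file. It is not how tree-sitter implements injections, the `injection_regions` function is hypothetical, and the marker comment it checks for, `(* -*- tuareg -*- *)`, is an assumption that is not stated in this post.

```python
# Assumption: this is the special comment that switches a dune file to
# OCaml syntax. The exact matching rules used by dune are not modelled.
OCAML_MARKER = "(* -*- tuareg -*- *)"


def injection_regions(source):
    """Return (language, start, end) regions, as character offsets.

    A file starting with the marker is handed to the OCaml parser as a
    whole (the marker itself is a valid OCaml comment); anything else is
    treated as a regular dune file.
    """
    if source.startswith(OCAML_MARKER):
        return [("ocaml", 0, len(source))]
    return [("dune", 0, len(source))]


ocaml_file = OCAML_MARKER + "\nlet () = print_endline \"(rule)\"\n"
plain_file = "(rule (action (echo \"testing\")))\n"

print(injection_regions(ocaml_file))
print(injection_regions(plain_file))
```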

Another use case for this is system actions: these strings in dune files could be interpreted using a shell parser.

In the other direction, it is possible to inject dune files into another document. For example, a markdown parser can use injections to highlight code blocks.

I’m happy to have explored this technology. The toolchain seemed complex at first: there’s a compiler which seems to be a mix of node and rust, which generates C, which is compiled into a dynamically loaded library. But this is actually pretty well integrated in nix and neovim, so the details are made invisible.

The testing mechanism is similar to the cram tests we use in Dune, but I was a bit confused by the colors at first: when the output of a test changes, Dune considers that the new output is a + in the diff, and highlights it in green, while tree-sitter considers that the “expected output” is green.
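
The Dune convention matches an ordinary unified diff from expected to actual output. A short Python sketch with the standard `difflib` module shows where the + comes from; the two syntax trees in it are made-up test outputs, and neither Dune nor tree-sitter is involved.

```python
import difflib

# Made-up test outputs: what the test file records, and what the parser now prints.
expected = ["(source_file", "  (stanza))"]
actual = ["(source_file", "  (stanza", "    (stanza_name)))"]

# Diffing from expected to actual marks the new output with "+"
# and the recorded output with "-".
diff = list(difflib.unified_diff(expected, actual, "expected", "actual", lineterm=""))
print("\n".join(diff))
```

In this direction the lines starting with + are the new output, which is the side Dune shows in green.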

There are many ways to improve this prototype: either by adding queries (it’s possible to define text objects, folding expressions, etc.), or by improving coverage for dune files (in most cases, the parser uses an s-expression fallback). I’m also curious to see if it’s possible to use this parser to provide a completion source. Since the strings are tagged with their type (are we expecting a library name, a module name, etc.), I think we could use that to provide context-specific completions, but that’s probably difficult to do.

Thanks teej for the initial idea and the useful resources.