Introducing tree-sitter-dune
by Etienne Millon on July 26, 2024
Tagged as: ocaml, dune, hacking-days, tree-sitter.
I made a tree-sitter plugin for dune files. It is available on GitHub.
Tree-sitter is a parsing system that can be used in text editors.
Dune is a build system for OCaml projects.
Its configuration language lives in dune files, which use an s-expression syntax.
This makes highlighting challenging: the lexing part of the language is very simple (atoms, strings, parentheses), but it is not enough to make a good highlighter.
In the following example, with-stdout-to and echo are “actions” that we could highlight in a special way, but these names can also appear in places where they are not interpreted as actions, and highlighting them there would be confusing (for example, we could write to a file named echo instead of foo.txt).
(rule
 (action
  (with-stdout-to
   foo.txt
   (echo "testing"))))
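To make the limitation concrete, here is a minimal Python sketch of an s-expression lexer. It is only an illustration and not part of tree-sitter-dune; it assumes a simplified syntax with no comments and no escape sequences in strings, and the names TOKEN_RE and lex are made up for this example. It lexes a variant of the rule above that writes to a file named echo.

```python
import re

# Simplified token grammar (assumption: no comments, no string escapes).
TOKEN_RE = re.compile(
    r'\s*(?:(?P<paren>[()])|(?P<string>"[^"]*")|(?P<atom>[^\s()"]+))'
)


def lex(source):
    """Return a list of (kind, text) pairs for a simplified s-expression."""
    source = source.rstrip()
    tokens = []
    pos = 0
    while pos < len(source):
        match = TOKEN_RE.match(source, pos)
        if match is None:
            raise ValueError(f"unexpected input at offset {pos}")
        kind = match.lastgroup  # name of the alternative that matched
        tokens.append((kind, match.group(kind)))
        pos = match.end()
    return tokens


# The first "echo" is a file name, the second one is an action.
tokens = lex('(with-stdout-to echo (echo "testing"))')
echo_tokens = [token for token in tokens if token[1] == "echo"]
print(echo_tokens)
```

Both occurrences come out as the same kind of token, an atom, so a highlighter that only sees tokens has no way to style one as an action and the other as a file name.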
Tree-sitter solves this, because it creates an actual parser that goes beyond lexing.
In this example, I created grammar rules that parse the contents of (action ...)
as an action, recognizing the various constructs of this DSL.
The output of the parser is this syntax tree with location information (for some reason, line numbers start at 0, which is normal and unusual at the same time).
(source_file [0, 0] - [5, 0]
  (stanza [0, 0] - [4, 22]
    (stanza_name [0, 1] - [0, 5])
    (field_name [1, 2] - [1, 8])
    (action [2, 2] - [4, 20]
      (action_name [2, 3] - [2, 17])
      (file_name_target [3, 3] - [3, 10]
        (file_name [3, 3] - [3, 10]))
      (action [4, 3] - [4, 19]
        (action_name [4, 4] - [4, 8])
        (quoted_string [4, 9] - [4, 18])))))
The various strings are annotated with their type: we have stanza names (rule), field names (action), action names (with-stdout-to, echo), file names (foo.txt), and plain strings ("testing").
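The positions in the tree are 0-based [row, column] pairs with an exclusive end. As a quick sanity check, this Python sketch slices the example source using the ranges printed above. The helper node_text is made up for this illustration, and it treats columns as character offsets, which is only safe here because the input is ASCII.

```python
SOURCE = '''(rule
 (action
  (with-stdout-to
   foo.txt
   (echo "testing"))))
'''


def node_text(source, start, end):
    """Return the text between two 0-based (row, column) points, end exclusive."""
    lines = source.split("\n")
    (start_row, start_col), (end_row, end_col) = start, end
    if start_row == end_row:
        return lines[start_row][start_col:end_col]
    first = lines[start_row][start_col:]
    middle = lines[start_row + 1:end_row]
    last = lines[end_row][:end_col]
    return "\n".join([first] + middle + [last])


# Leaf nodes copied from the syntax tree: (node type, start, end).
LEAVES = [
    ("stanza_name", (0, 1), (0, 5)),
    ("field_name", (1, 2), (1, 8)),
    ("action_name", (2, 3), (2, 17)),
    ("file_name", (3, 3), (3, 10)),
    ("action_name", (4, 4), (4, 8)),
    ("quoted_string", (4, 9), (4, 18)),
]

texts = [(kind, node_text(SOURCE, start, end)) for kind, start, end in LEAVES]
for kind, text in texts:
    print(f"{kind}: {text}")
```

Each range maps back to the string described above, which is a handy way to read a tree dump when the positions look off by one.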
By itself, that is not useful, but it’s possible to write queries to make this syntax tree do interesting stuff.
The first one is highlighting: we can set styles for various “patterns” (in practice, I only used node names) by defining queries:
(stanza_name) @function
(field_name) @property
(quoted_string) @string
(multiline_string) @string
(action_name) @keyword
The parts with @ map to “highlight groups” used in text editors.
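Since these queries only match on node names, their effect can be approximated by a plain lookup table. The following Python sketch is a toy model of that idea, not how tree-sitter or an editor actually evaluates queries; CAPTURES and highlight are names invented for this example.

```python
# Mirrors the highlight queries above: node name -> capture name.
CAPTURES = {
    "stanza_name": "function",
    "field_name": "property",
    "quoted_string": "string",
    "multiline_string": "string",
    "action_name": "keyword",
}


def highlight(nodes):
    """Pair each node's text with its capture name, or None if it is unstyled."""
    return [(text, CAPTURES.get(node_type)) for node_type, text in nodes]


# Leaf nodes of the example rule, in document order.
nodes = [
    ("stanza_name", "rule"),
    ("field_name", "action"),
    ("action_name", "with-stdout-to"),
    ("file_name", "foo.txt"),
    ("action_name", "echo"),
    ("quoted_string", '"testing"'),
]
styled = highlight(nodes)
for text, group in styled:
    print(f"{text!r} -> {group}")
```

In this model foo.txt stays unstyled because no query mentions file_name, while both action names get the keyword group.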
Another type of query is called “injections”. It is used to link different types of grammars together. For example, dune files can start with a special comment that indicates that the rest of the file is an OCaml program. In that case, the parser emits a single ocaml_syntax node and the following injection indicates that this file should be parsed using an OCaml parser:
((ocaml_syntax) @injection.content
 (#set! injection.language "ocaml"))
Another use case for this is system actions: these strings in dune files could be interpreted using a shell parser.
In the other direction, it is possible to inject dune files into another document. For example, a markdown parser can use injections to highlight code blocks.
I’m happy to have explored this technology. The toolchain seemed complex at first: there’s a compiler which seems to be a mix of node and rust, which generates C, which is compiled into a dynamically loaded library; but this is actually pretty well integrated in nix and neovim, so the details are made invisible.
The testing mechanism is similar to the cram tests we use in Dune, but I was a bit confused with the colors at first: when the output of a test changes, Dune considers that the new output is a + in the diff, and highlights it in green; while tree-sitter considers that the “expected output” is green.
There are many ways to improve this prototype: either by adding queries (it’s possible to define text objects, folding expressions, etc.), or by improving coverage for dune files (in most cases, the parser uses an s-expression fallback). I’m also curious to see if it’s possible to use this parser to provide a completion source. Since the strings are tagged with their type (are we expecting a library name, a module name, etc.), I think we could use that to provide context-specific completions, but that’s probably difficult to do.
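As a rough idea of what the lookup side of such a completion source could look like, here is a small Python sketch. Everything in it is hypothetical: the library_name node type, the candidate lists, and the complete function are invented for illustration and do not come from the grammar. It also skips the hard part, which is getting a usable node type out of a file that is still being typed.

```python
# Hypothetical node types and candidates; the real grammar may differ.
CANDIDATES = {
    "action_name": ["echo", "run", "with-stdout-to"],
    "library_name": ["str", "unix"],
}


def complete(node_type, prefix):
    """Return sorted candidates for the node type under the cursor."""
    candidates = CANDIDATES.get(node_type, [])
    return sorted(c for c in candidates if c.startswith(prefix))


print(complete("action_name", "e"))
print(complete("library_name", ""))
print(complete("file_name", "e"))
```

The point is only that the same prefix can yield different suggestions depending on the node type, which is the information a plain s-expression parser would not have.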
Thanks teej for the initial idea and the useful resources.