rehype-tree-sitter
during the migration of my blog from handwritten html to a fancier web framework, i noticed that there weren't many code block highlighting options to choose from. in terms of astro, there are only two plugins that are capable of syntax highlighting: shiki and prismjs. shiki uses textmate grammars. i'm not the hugest fan of regex-based syntax highlighting engines and i've spent way too much time creating custom grammars for new languages before, only to be met with a buggy mess in the end. (skill issue?) prism also uses regex rules to do its highlighting, so that's also not an option.
in my opinion, tree-sitter is currently the best option for highlighting source code. there's no contest. not only is it performant, but the dsl is much easier to wrap your head around. i knew right away that this was the tool i wanted to use to power the pretty little colors on my website. the real question was figuring out how.
after searching around, i discovered that astro uses unified as its underlying engine for working with html and markdown/mdx. to get what i want, i'd have to create a plugin that fits somewhere in-between these conversion steps and produces the correct output.
iteration one – command line interface
in the beginning, i had a hard time deciding on which part of the html/markdown processing chain i wanted to inject my logic into. `rehype` is responsible for transforming and working with html syntax trees (which they call `hast`). `remark` is the same, but for markdown (`mdast`). in the end i decided upon `rehype`, because to get an mvp out, i wanted to use tree-sitter's built-in html highlight functionality.
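to make the shape of that tree concrete, here's a small hand-written sketch of the `hast` fragment a highlighted code block starts out as. the tree literal and the `languageClassOf` helper are made up for illustration (they aren't part of rehype or of the plugin); the node fields (`type`, `tagName`, `properties`, `children`, `value`) are the standard hast ones.

```javascript
// a hand-written hast fragment for:
//   <pre><code class="language-javascript">let x = 1;</code></pre>
const tree = {
  type: "root",
  children: [
    {
      type: "element",
      tagName: "pre",
      properties: {},
      children: [
        {
          type: "element",
          tagName: "code",
          // hast stores the `class` attribute as a `className` array
          properties: { className: ["language-javascript"] },
          children: [{ type: "text", value: "let x = 1;" }],
        },
      ],
    },
  ],
};

// hypothetical helper: pull the language class off a `code` element, if any
function languageClassOf(node) {
  const classes = (node.properties && node.properties.className) || [];
  return classes.find((name) => name.startsWith("language-")) ?? null;
}

const codeNode = tree.children[0].children[0];
console.log(languageClassOf(codeNode)); // language-javascript
console.log(codeNode.children[0].value); // let x = 1;
```

the source text lives in a single `text` child, which is why a plugin can read it straight off the `code` element.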
to create a `rehype` plugin, i first needed to create a node package that will export the function responsible for transforming the relevant syntax tree.
    import { visit } from "unist-util-visit";
    import { rehype } from "rehype";

    // matches `<code>` elements in the hast tree
    const isCodeBlockElement = (node) => node.tagName === "code";

    export function rehypeTreeSitter(options) {
      return function (tree) {
        visit(tree, isCodeBlockElement, (node, index, parent) => {
          // placeholder: dump the tree for every code element found
          console.log(tree);
        });
      };
    }
before messing around with astro and real data, i created a [jest](https://jestjs.io) test to make sure my code was correct. this also allows for faster iteration speed, since we only execute code we care about.
    import { rehypeTreeSitter } from "./index.js";
    import { rehype } from "rehype";

    test("basic math", () => {
      expect(2 + 2).toBe(4);
    });
now that i had some tooling in place, i could begin coding. i knew that if i executed `tree-sitter highlight -H code.js`, it'd spit some html out like below.
    <body>
      <table>
        <tr>
          <td class="line-number">1</td>
          <td class="line">
            <span style="color: #5f00d7">function</span>
            <span style="color: #005fd7">sum</span><span style="color: #4e4e4e">(</span>
            <span style="text-decoration: underline;">a</span><span style="color: #4e4e4e">,</span>
            <span style="text-decoration: underline;">b</span><span style="color: #4e4e4e">)</span>
            <span style="color: #4e4e4e">{</span>
          </td>
        </tr>
        <tr>
          <td class="line-number">2</td>
          <td class="line">
            <span style="color: #5f00d7">return</span>
            <span style="text-decoration: underline;">a</span>
            <span style="font-weight: bold;color: #4e4e4e">+</span>
            <span style="text-decoration: underline;">b</span>
            <span style="color: #4e4e4e">;</span>
          </td>
        </tr>
        <tr>
          <td class="line-number">3</td>
          <td class="line"><span style="color: #4e4e4e">}</span></td>
        </tr>
      </table>
    </body>
barring my initial concerns with the per-element styling, this looked workable! in my mind, i simply had to re-use whatever facilities unified has for parsing html into `hast`, and replace the code block in the original `hast` with the tree-sitter version. this ended up looking like this:
    import { execSync } from "node:child_process";
    import { writeFileSync, unlinkSync } from "node:fs";
    // assumption: `temporaryFile` comes from the `tempy` package
    import { temporaryFile } from "tempy";

    const exampleScopeMap = {
      "language-javascript": "source.js",
      "language-sh": "source.bash",
      "language-xml": "source.xml",
    };

    function doesNodeHaveChildTable(node) {
      return node.children.some((child) => child.tagName === "table");
    }

    export function rehypeTreeSitter(options) {
      return function (tree) {
        visit(tree, isCodeBlockElement, (node, index, parent) => {
          if (Object.keys(node.properties).length === 0) return;
          const code = node.children[0].value;
          const language = node.properties.className[0];
          if (!(language in exampleScopeMap)) return;

          const temporarySourceFile = temporaryFile();
          writeFileSync(temporarySourceFile, code);
          try {
            const result = execSync(
              `tree-sitter highlight --scope ${exampleScopeMap[language]} -H ${temporarySourceFile}`
            ).toString();
            const resultTree = rehype().parse(result);
            const resultBody = resultTree.children[1].children[2];
            if (!doesNodeHaveChildTable(resultBody)) {
              console.error(`No 'highlights.scm' found for ${language}`);
              return;
            }
            const codeTable = resultBody.children[1];
            parent.children[0].children = [codeTable];
          } catch (error) {
            console.error(error);
          } finally {
            // always clean up, including on the early return above
            unlinkSync(temporarySourceFile);
          }
        });
      };
    }
here's a high-level breakdown of the process.
- create a temporary file to store the source code in
- execute `tree-sitter highlight --scope <scope> -H <temporary file location>` to get the html
- reparse the html into `hast` (with some safety checks)
- set the parent's child to contain the code table
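the "safety checks" step relies on the table sitting at a fixed position in the parsed output. as a hedged alternative (not what the plugin does), here's a sketch of a position-independent lookup. `findElement` and the `parsed` stand-in tree are hypothetical names invented for this example.

```javascript
// hypothetical helper: depth-first search for the first element with `tagName`
function findElement(node, tagName) {
  if (node.type === "element" && node.tagName === tagName) return node;
  for (const child of node.children ?? []) {
    const found = findElement(child, tagName);
    if (found) return found;
  }
  return null;
}

// stand-in for the parsed cli output: html > (head, body > table)
const parsed = {
  type: "root",
  children: [
    {
      type: "element",
      tagName: "html",
      properties: {},
      children: [
        { type: "element", tagName: "head", properties: {}, children: [] },
        {
          type: "element",
          tagName: "body",
          properties: {},
          children: [
            { type: "text", value: "\n" },
            { type: "element", tagName: "table", properties: {}, children: [] },
          ],
        },
      ],
    },
  ],
};

const table = findElement(parsed, "table");
if (table === null) {
  // the same failure the plugin reports when no highlights were produced
  console.error("no table in the highlighter output");
} else {
  console.log(table.tagName); // table
}
```

searching by tag name trades a little speed for not breaking when the parser inserts or drops a whitespace text node.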
this code works, and for a couple of hours of work, produces stellar results.
however, there were a couple bugs that i encountered that made this approach unacceptable.
- whatever was converting the final `hast` into html inserted as many newlines into the `pre` element as there were lines of code in the original source block. this took a while to diagnose, and even longer to fix. [1]
- the colors were determined by an external tool (the tree-sitter configuration). this made it difficult to customize and control. i also wanted to have css change the color scheme based on the user's system appearance.
- the produced code utilized a table to get the lines of code to match up with their line number. while this did work, i had suspicions that it wasn't the correct approach for supporting line numbers.
wanting more flexibility, i decided it was time to dig deeper into the underlying code. having studied how that exact command line invocation worked, i now knew that tree-sitter ships libraries that give you more control over the process. notably, `tree-sitter-highlight`, which gives you an iterator over your code annotating spans with their highlight, and `tree-sitter-loader`, which handles compilation behind the scenes when supplying grammars.
iteration two – native rust node plugin
the big question (again) is how. how can i execute some rust code from within the node process? the answer is node-api. node-api lets you create native add-ons (traditionally written in c/c++), which you can import into your javascript code and call like normal. knowing this, i began working on the rust code first.
here's the test program i used for verifying everything worked before attempting to port it to the node-api.
    use std::path::PathBuf;
    use tree_sitter_highlight::{HighlightEvent, Highlighter};

    // note: `Error` is a custom error enum whose definition isn't shown here.
    fn main() -> Result<(), Box<dyn std::error::Error>> {
        let language_root_path = std::env::var("TREE_SITTER_LANGUAGE_ROOT")?;
        let mut loader = tree_sitter_loader::Loader::new()?;
        loader.find_all_languages(&tree_sitter_loader::Config {
            parser_directories: vec![PathBuf::from(language_root_path)],
            ..Default::default()
        })?;

        let mut process_arg_iter = std::env::args().skip(1);
        let root_scope = format!(
            "source.{}",
            process_arg_iter.next().ok_or(Error::NoScopeProvided)?
        );
        let source = std::io::read_to_string(std::io::stdin())?;

        let (language, language_configuration) = loader
            .language_configuration_for_scope(&*root_scope)?
            .ok_or(Error::MissingLanguageConfiguration)?;
        let highlight_config = language_configuration
            .highlight_config(language)?
            .ok_or(Error::MissingHighlightConfiguration)?
            .clone();

        let mut highlighter = Highlighter::new();
        let mut highlight_iter =
            highlighter.highlight(highlight_config, source.as_bytes(), None, |_| None)?;

        // the loop ends at the end of the stream or at the first error event
        while let Some(Ok(event)) = highlight_iter.next() {
            match event {
                // echo the source text back; highlight boundaries are ignored for now
                HighlightEvent::Source { start, end } => print!("{}", &source[start..end]),
                HighlightEvent::HighlightStart(_) | HighlightEvent::HighlightEnd => {}
            }
        }
        println!();
        Ok(())
    }
the real meat of the function is near the end. using `while let` i'm able to extract the `HighlightEvent`s from that particular session and act accordingly. i don't translate any of the highlight calls, but being able to recreate the text from the indices is a good start. now that i can react to each 'chunk' of data produced by tree-sitter, it's time to port this to node.
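before moving on, here's a rough sketch (in javascript rather than rust) of what reacting to each event could look like once the highlight boundaries are translated too. the event shapes (`type`, `name`, `start`, `end`) and the `renderEvents` and `escapeHtml` helpers are made up for illustration. they loosely mirror the three `HighlightEvent` variants, but they aren't the crate's api or the plugin's actual code.

```javascript
// hypothetical event stream for the source below
const source = "return a + b;";
const events = [
  { type: "start", name: "keyword" },
  { type: "source", start: 0, end: 6 },
  { type: "end" },
  { type: "source", start: 6, end: 9 },
  { type: "start", name: "operator" },
  { type: "source", start: 9, end: 10 },
  { type: "end" },
  { type: "source", start: 10, end: 13 },
];

function escapeHtml(text) {
  return text.replace(/&/g, "&amp;").replace(/</g, "&lt;").replace(/>/g, "&gt;");
}

function renderEvents(source, events) {
  // offsets are byte offsets, so slice the utf-8 bytes rather than the string
  const bytes = Buffer.from(source, "utf8");
  let html = "";
  for (const event of events) {
    if (event.type === "source") {
      html += escapeHtml(bytes.subarray(event.start, event.end).toString("utf8"));
    } else if (event.type === "start") {
      html += `<span class="${event.name}">`;
    } else if (event.type === "end") {
      html += "</span>";
    }
  }
  return html;
}

console.log(renderEvents(source, events));
// <span class="keyword">return</span> a <span class="operator">+</span> b;
```

start and end events always arrive in matched pairs, so emitting an opening and closing tag for each keeps the output well-formed.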
`node-bindgen` is an excellent package for this. it allows me to simply annotate a function with a decorator (`proc-macro`) and use the resulting binary in my node project. below is all the code i needed to change.
    #[node_bindgen]
    fn driver<F: Fn(SerializableHighlightEvent)>(
        language_root_path: String,
        root_scope: String,
        source: String,
        callback: F,
    ) -> Result<(), NjError> {
        // ...
    }
i could then `require` my `dylib` and call `driver`. awesome!

you might be wondering what `callback` is doing there. the general idea behind this approach is that node hands control over the highlighting process to rust. rust is responsible for executing the code that parses the incoming text, but the business logic is determined by the `callback`. in `rehype-tree-sitter`'s case, it needs to react to the event and modify the `hast` tree accordingly. let's jump back to the plugin.
    export default function rehypeTreeSitter(options) {
      if (options === undefined)
        throw new Error("Need to provide `options.treeSitterGrammarRoot`");
      if (options.treeSitterGrammarRoot === undefined)
        throw new Error("Need to provide `options.treeSitterGrammarRoot`");
      return function (tree) {
        visit(tree, isCodeBlockElement, (node, index, parent) => {
          if (Object.keys(node.properties).length === 0) return;
          const code = node.children[0].value;
          const language = node.properties.className[0];
          if (!(language in (options.scopeMap || exampleScopeMap))) return;
          node.children = [];
          const highlightStack = [];
          core.driver(
            options.treeSitterGrammarRoot,
            (options.scopeMap || exampleScopeMap)[language],
            code,
            (event) => {
              if (event.source !== undefined) {
                const sourceChunk = stringByteSlice(
                  code,
                  Number(event.source.start),
                  Number(event.source.end)
                );
                if (highlightStack.length === 0) {
                  node.children.push({ type: "text", value: sourceChunk });
                } else {
                  node.children.push(
                    h("span", { class: highlightStack.join(" ") }, sourceChunk)
                  );
                }
              } else if (event.highlightStart !== undefined) {
                highlightStack.push(event.highlightStart.highlightName);
              } else if (event === "HighlightEnd") {
                highlightStack.pop();
              }
            }
          );
        });
      };
    }
there's a bunch of extra functionality that wasn't here before.
- `options.treeSitterGrammarRoot`: this lets the consumer specify which directory `tree-sitter-loader` should load and compile its grammars from.
- `options.scopeMap`: this option lets consumers specify a mapping between the language class markdown parsers insert into the `code` element and the tree-sitter scope. this is how we communicate to tree-sitter which grammar to use.
- `highlightStack`: keeps track of the highlight classes. this is helpful for denoting tokens that belong to more than one class.
- `stringByteSlice`: originally, i used `source.slice()`, but that broke on code containing multi-byte utf-8 characters. this was because javascript strings are indexed in utf-16 code units, which means that the utf-8 byte indices tree-sitter gives us are incompatible with the indices `slice` expects.
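to show the index mismatch concretely, here's a minimal sketch of a byte-aware slice. this is a hypothetical reimplementation for illustration; the plugin's real `stringByteSlice` may be written differently. it assumes the offsets are utf-8 byte offsets that fall on character boundaries.

```javascript
// hypothetical reimplementation; the plugin's real `stringByteSlice` may differ
function stringByteSlice(text, start, end) {
  const bytes = new TextEncoder().encode(text); // utf-8 bytes
  return new TextDecoder().decode(bytes.subarray(start, end));
}

// "\u00e9" (é) is one utf-16 code unit but two utf-8 bytes,
// so every byte offset after it is one ahead of the string index
const code = 'let s = "h\u00e9llo"; // ok';

// suppose the highlighter reports the comment at bytes 18..23
console.log(stringByteSlice(code, 18, 23)); // "// ok"
console.log(code.slice(18, 23)); // "/ ok" (off by one: starts a character late)
```

encoding the whole string on every call is wasteful for large blocks; encoding once per code block and reusing the byte array would be the obvious next step.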
aside from some packaging work [2], this was all that i needed to get comprehensive syntax highlighting suitable for static site generators. i'm not quite sure (but curious) about server-side rendered components. in theory, if whatever is producing the html has access to the tree-sitter grammars, then it should work there as well. the code snippets you've been reading have been produced by this plugin. try changing your desktop color scheme! 🤩
how to use `rehype-tree-sitter` in your astro project
to use `rehype-tree-sitter`, install it through `npm`:
    npm install --save rehype-tree-sitter
once you have the plugin installed, add it to your astro config.
    import { defineConfig } from "astro/config";
    import rehypeTreeSitter from "rehype-tree-sitter";

    // https://astro.build/config
    export default defineConfig({
      markdown: {
        syntaxHighlight: false,
        rehypePlugins: [
          [
            rehypeTreeSitter,
            {
              treeSitterGrammarRoot: "/Users/haze/tsg",
              scopeMap: {
                "language-javascript": "source.js",
                "language-sh": "source.bash",
                "language-xml": "source.xml",
                "language-rust": "source.rust",
                "language-html": "text.html.basic",
              },
            },
          ],
        ],
      },
    });
things i learned when publishing my first `npm` package
- don't delete your package if you push garbage. i did this and had to wait a day to re-publish. (i got a 403 forbidden when attempting to publish the fixed variant.)
- i was able to get away with a `cjs` style `require` in an `esm` project. i had to use `createRequire`, which the docs suck for.
- using rust code from node is way easier than i thought.
- this technique is powerful. i can control the css for every token, with extreme granularity, for every language. for some languages (like hare or html) the syntax is pretty simple, and the theme can be relatively simple as well. for languages like rust, it helps to have more colors distinguishing things like lifetimes and loop labels.
Footnotes:
[1] i hacked around this in a couple ways. one way was disabling the usage of `rehype-raw` internally. you could dig even deeper into `hast-util-to-html` and edit the `text` callback to stop it from inserting newlines, but this is clearly not the solution, and will break other things.
[2] i had to figure out how to execute some commands to compile the rust library. i learned about the beauties and dangers of `postinstall` npm scripts.