tree-sitter is currently the best option for highlighting source code. There’s no contest. Not only is it performant, but the DSL is much easier to wrap your head around. I knew right away that this was the tool I wanted to use to power the pretty little colors on my website. The real question was figuring out how.
After searching around, I discovered that Astro uses unified as its underlying engine for working with HTML and Markdown/MDX. To get what I wanted, I’d have to create a plugin that fits somewhere in between these conversion steps and produces the correct output.
In the beginning, I had a hard time deciding which part of the HTML/Markdown processing chain to inject my logic into. rehype is responsible for transforming and working with HTML syntax trees (which they call hast). remark is the same, but for Markdown (mdast). In the end I decided on rehype because, to get an MVP out, I wanted to use tree-sitter’s built-in HTML highlight functionality.
To create a rehype plugin, I first needed to create a node package that exports the function responsible for transforming the relevant syntax tree.
import { visit } from "unist-util-visit";
import { rehype } from "rehype";

const isCodeBlockElement = (node) => node.tagName === "code";

export function rehypeTreeSitter(options) {
  return function (tree) {
    visit(tree, isCodeBlockElement, (node, index, parent) => {
      console.log(tree);
    });
  };
}
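To make the shape of the data concrete, here is a dependency-free sketch of what the visitor walks over. The `walk` helper is a hypothetical stand-in for `unist-util-visit` (the real package does much more), and the tree literal is hand-written to resemble the hast a fenced Markdown code block usually turns into; neither is taken from the plugin itself.

```javascript
// Hand-written tree resembling the hast for:
// <p>hi</p><pre><code class="language-javascript">const x = 1;</code></pre>
const tree = {
  type: "root",
  children: [
    {
      type: "element",
      tagName: "p",
      properties: {},
      children: [{ type: "text", value: "hi" }],
    },
    {
      type: "element",
      tagName: "pre",
      properties: {},
      children: [
        {
          type: "element",
          tagName: "code",
          properties: { className: ["language-javascript"] },
          children: [{ type: "text", value: "const x = 1;\n" }],
        },
      ],
    },
  ],
};

const isCodeBlockElement = (node) => node.tagName === "code";

// Hypothetical stand-in for unist-util-visit: depth-first traversal that
// calls `visitor(node, index, parent)` for every node passing `test`.
function walk(node, test, visitor, index = null, parent = null) {
  if (test(node)) visitor(node, index, parent);
  (node.children ?? []).forEach((child, i) => walk(child, test, visitor, i, node));
}

const found = [];
walk(tree, isCodeBlockElement, (node, index, parent) => {
  found.push({
    language: node.properties.className[0],
    parentTag: parent.tagName,
    index,
  });
});
console.log(found);
```

The visitor receives the `code` element together with its parent `pre` and its position inside it, which is everything a plugin needs to swap the element’s children for highlighted ones.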
Before messing around with Astro and real data, I created a Jest test to make sure my code was correct. This also allows for faster iteration speed, since we only execute code we care about.
import { rehypeTreeSitter } from "./index.js";
import { rehype } from "rehype";

test("basic math", () => {
  expect(2 + 2).toBe(4);
});
Now that I had some tooling in place, I could begin coding. I knew that if I executed tree-sitter highlight -H code.js, it’d spit out some HTML like the below.
<body>
<table>
<tr>
<td class="line-number">1</td>
<td class="line">
<span style="color: #5f00d7">function</span>
<span style="color: #005fd7">sum</span><span style="color: #4e4e4e">(</span>
<span style="text-decoration: underline;">a</span><span style="color: #4e4e4e">,</span>
<span style="text-decoration: underline;">b</span><span style="color: #4e4e4e">)</span>
<span style="color: #4e4e4e">{</span>
</td>
</tr>
<tr>
<td class="line-number">2</td>
<td class="line">
<span style="color: #5f00d7">return</span>
<span style="text-decoration: underline;">a</span>
<span style="font-weight: bold;color: #4e4e4e">+</span>
<span style="text-decoration: underline;">b</span>
<span style="color: #4e4e4e">;</span>
</td>
</tr>
<tr>
<td class="line-number">3</td>
<td class="line"><span style="color: #4e4e4e">}</span></td>
</tr>
</table>
</body>
Barring my initial concerns with the per-element styling, this looked workable! In my mind, I simply had to re-use whatever facilities unified has for parsing HTML into hast, and replace the code block in the original hast with the tree-sitter version. It ended up looking like this:
// Assumed imports; the original snippet omits them.
import { execSync } from "node:child_process";
import { unlinkSync, writeFileSync } from "node:fs";
import { temporaryFile } from "tempy";

const exampleScopeMap = {
  "language-javascript": "source.js",
  "language-sh": "source.bash",
  "language-xml": "source.xml",
};

function doesNodeHaveChildTable(node) {
  for (const child of node.children) {
    if (child.tagName === "table") return true;
  }
  return false;
}

export function rehypeTreeSitter(options) {
  return function (tree) {
    visit(tree, isCodeBlockElement, (node, index, parent) => {
      if (Object.keys(node.properties).length === 0) return;
      const code = node.children[0].value;
      const language = node.properties.className[0];
      if (!(language in exampleScopeMap)) return;
      const temporarySourceFile = temporaryFile();
      writeFileSync(temporarySourceFile, code);
      try {
        // --scope takes the mapped tree-sitter scope, -H the file to highlight.
        const result = execSync(
          `tree-sitter highlight --scope ${exampleScopeMap[language]} -H ${temporarySourceFile}`
        ).toString();
        const resultTree = rehype().parse(result);
        const resultBody = resultTree.children[1].children[2];
        if (!doesNodeHaveChildTable(resultBody)) {
          console.error(`No 'highlights.scm' found for ${language}`);
          return;
        }
        const codeTable = resultBody.children[1];
        parent.children[0].children = [codeTable];
      } catch (error) {
        console.error(error);
      } finally {
        // Clean up on every path, including the early return above.
        unlinkSync(temporarySourceFile);
      }
    });
  };
}
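Here’s a high-level breakdown of the process.
1. Write the contents of the code block to a temporary file
2. Execute tree-sitter highlight --scope <language> -H <temporary file location> to get the HTML
3. Parse that HTML into hast
4. Replace the contents of the original code block with the highlighted table (with some safety checks)

This code works, and for a couple of hours of work, produces stellar results.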
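The lookup and command-building steps can be pulled out into small pure functions, which makes them easy to test without touching the file system. This is only a sketch: `scopeForNode` and `highlightArgs` are hypothetical helper names, and the only tree-sitter flags used are the `--scope` and `-H` ones from the command above.

```javascript
// Same mapping as in the post: class on the <code> element -> tree-sitter scope.
const exampleScopeMap = {
  "language-javascript": "source.js",
  "language-sh": "source.bash",
  "language-xml": "source.xml",
};

// Hypothetical helper: returns the scope for a hast <code> node, or undefined
// when the node has no class or its language is not in the map.
function scopeForNode(node, scopeMap = exampleScopeMap) {
  const className = node.properties?.className?.[0];
  if (className === undefined) return undefined;
  // hasOwnProperty avoids matching inherited keys such as "toString",
  // which the `in` operator would accept.
  return Object.prototype.hasOwnProperty.call(scopeMap, className)
    ? scopeMap[className]
    : undefined;
}

// Hypothetical helper: builds an argument vector rather than a shell string,
// so paths containing spaces or quotes need no escaping.
function highlightArgs(scope, filePath) {
  return ["highlight", "--scope", scope, "-H", filePath];
}

const node = { properties: { className: ["language-sh"] } };
console.log(highlightArgs(scopeForNode(node), "/tmp/example.sh"));
```

An argument array like this can be handed to `execFileSync("tree-sitter", args)` from `node:child_process`, which skips the shell entirely.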
However, there were a couple of bugs that I encountered that made this approach unacceptable.
- Converting the hast into HTML inserted as many newlines into the pre element as there were lines of code in the original source block. This took a while to diagnose, and even longer to fix. [1]

Wanting more flexibility, I decided it was time to dig deeper into the underlying code. Having studied how that exact command line invocation worked, I now knew that tree-sitter ships libraries that give you more control over the process. Notably, there’s tree-sitter-highlight, which gives you an iterator over your code that annotates spans with their highlight, and tree-sitter-loader, which handles compilation behind the scenes when supplying grammars.
The big question (again) is how. How can I execute some Rust code from within the Node process? The answer is the Node-API. The Node-API lets you create C/C++ add-ons, which you can import into your JavaScript code and call like normal. Knowing this, I began working on the Rust code first.
Here’s the test program I used for verifying everything worked before attempting to port it to the Node-API.
use std::path::PathBuf;
use tree_sitter_highlight::Highlighter;
// `Error` below is the program's own error enum; its definition is not shown.

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let language_root_path = std::env::var("TREE_SITTER_LANGUAGE_ROOT")?;
    let mut loader = tree_sitter_loader::Loader::new()?;
    loader.find_all_languages(&tree_sitter_loader::Config {
        parser_directories: vec![PathBuf::from(language_root_path)],
        ..Default::default()
    })?;
    // The first CLI argument is the language suffix, e.g. `js` for `source.js`.
    let mut process_arg_iter = std::env::args().skip(1);
    let root_scope = format!(
        "source.{}",
        process_arg_iter.next().ok_or(Error::NoScopeProvided)?
    );
    let source = std::io::read_to_string(std::io::stdin())?;
    let (language, language_configuration) = loader
        .language_configuration_for_scope(&*root_scope)?
        .ok_or(Error::MissingLanguageConfiguration)?;
    let highlight_config = language_configuration
        .highlight_config(language)?
        .ok_or(Error::MissingHighlightConfiguration)?
        .clone();
    let mut highlighter = Highlighter::new();
    let mut highlight_iter =
        highlighter.highlight(highlight_config, source.as_bytes(), None, |_| None)?;
    // Stops at the first error or when the events run out.
    while let Some(Ok(event)) = highlight_iter.next() {
        match event {
            // `start` and `end` are byte offsets into `source`.
            tree_sitter_highlight::HighlightEvent::Source { start, end } => {
                print!("{}", &source[start..end])
            }
            tree_sitter_highlight::HighlightEvent::HighlightStart(_)
            | tree_sitter_highlight::HighlightEvent::HighlightEnd => {}
        }
    }
    println!();
    Ok(())
}
The real meat of the function is near the end. Using while let, I’m able to extract the HighlightEvents from that particular session and act accordingly. I don’t handle the highlight start and end events yet, but being able to recreate the text from the indices is a good start. Now that I can react to each ‘chunk’ of data produced by tree-sitter, it’s time to port this to Node.
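As a sketch of where this is headed, here is the same event loop written in JavaScript with the start and end events actually handled. Everything here is illustrative: the events are hand-written rather than real tree-sitter output, their object shape is an assumption, and `renderEvents` and `escapeHtml` are hypothetical helpers.

```javascript
// Minimal HTML escaping for text placed inside elements.
function escapeHtml(text) {
  return text.replace(/&/g, "&amp;").replace(/</g, "&lt;").replace(/>/g, "&gt;");
}

// Folds a highlight event stream into an HTML string.
// Assumed event shapes:
//   { source: { start, end } }            byte range of plain source text
//   { highlightStart: { highlightName } }  a highlight scope opens
//   "HighlightEnd"                         the innermost scope closes
function renderEvents(source, events) {
  const bytes = Buffer.from(source, "utf8"); // offsets are UTF-8 byte offsets
  const stack = []; // names of the currently open highlights
  let html = "";
  for (const event of events) {
    if (event === "HighlightEnd") {
      stack.pop();
    } else if (event.highlightStart !== undefined) {
      stack.push(event.highlightStart.highlightName);
    } else if (event.source !== undefined) {
      const { start, end } = event.source;
      const chunk = escapeHtml(bytes.subarray(start, end).toString("utf8"));
      html += stack.length === 0
        ? chunk
        : `<span class="${stack.join(" ")}">${chunk}</span>`;
    }
  }
  return html;
}

// Hand-written events for the source text "return a;".
const source = "return a;";
const events = [
  { highlightStart: { highlightName: "keyword" } },
  { source: { start: 0, end: 6 } },
  "HighlightEnd",
  { source: { start: 6, end: 7 } },
  { highlightStart: { highlightName: "variable" } },
  { source: { start: 7, end: 8 } },
  "HighlightEnd",
  { highlightStart: { highlightName: "punctuation.delimiter" } },
  { source: { start: 8, end: 9 } },
  "HighlightEnd",
];
const html = renderEvents(source, events);
console.log(html);
```

Emitting class names instead of inline colors leaves the palette to a stylesheet, which sidesteps the per-element styling concern from the CLI output.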
node-bindgen is an excellent package for exposing Rust code to Node. It allows me to simply annotate a function with an attribute (a proc-macro) and use the resulting binary in my node project. Below is all the code I needed to change.
#[node_bindgen]
fn driver<F: Fn(SerializableHighlightEvent)>(
    language_root_path: String,
    root_scope: String,
    source: String,
    callback: F,
) -> Result<(), NjError> {
    // ...
}
I could then require my dylib and call driver. Awesome!
You might be wondering what callback is doing there. The general idea behind this approach is that Node hands control over the highlighting process to Rust. Rust is responsible for executing the code that parses the incoming text, but the business logic is determined by the callback. In rehype-tree-sitter’s case, it needs to react to the event and modify the hast tree accordingly. Let’s jump back to the plugin.
export default function rehypeTreeSitter(options) {
  if (options === undefined) throw new Error("Need to provide `options.treeSitterGrammarRoot`");
  if (options.treeSitterGrammarRoot === undefined)
    throw new Error("Need to provide `options.treeSitterGrammarRoot`");
  return function (tree) {
    visit(tree, isCodeBlockElement, (node, index, parent) => {
      if (Object.keys(node.properties).length === 0) return;
      const code = node.children[0].value;
      const language = node.properties.className[0];
      if (!(language in (options.scopeMap || exampleScopeMap))) return;
      node.children = [];
      const highlightStack = [];
      core.driver(
        options.treeSitterGrammarRoot,
        (options.scopeMap || exampleScopeMap)[language],
        code,
        (event) => {
          if (event.source !== undefined) {
            const sourceChunk = stringByteSlice(
              code,
              Number(event.source.start),
              Number(event.source.end)
            );
            if (highlightStack.length === 0) {
              node.children.push({ type: "text", value: sourceChunk });
            } else {
              node.children.push(h("span", { class: highlightStack.join(" ") }, sourceChunk));
            }
          } else if (event.highlightStart !== undefined) {
            highlightStack.push(event.highlightStart.highlightName);
          } else if (event === "HighlightEnd") {
            highlightStack.pop();
          }
        }
      );
    });
  };
}
There’s a bunch of extra functionality that wasn’t here before.
- options.treeSitterGrammarRoot: This lets the consumer specify which directory tree-sitter-loader should load and compile its grammars from.
- options.scopeMap: This option lets consumers specify a mapping between the language class Markdown parsers insert into the code element and the tree-sitter scope. This is how we communicate to tree-sitter which grammar to use.
- highlightStack: This keeps track of the highlight classes. This is helpful for denoting tokens that belong to more than one class.
- stringByteSlice: Originally, I used source.slice(), but that broke on code containing non-ASCII characters. This was because JavaScript strings are encoded in UTF-16, which means that the UTF-8 byte indices tree-sitter gives us are incompatible with the UTF-16 code unit indices that slice expects.

Aside from some packaging work [2], this was all that I needed to get comprehensive syntax highlighting suitable for static site generators. I’m not quite sure (but curious) about server-side rendered components. In theory, if whatever is producing the HTML has access to the tree-sitter grammars, then it should work there as well. The code snippets you’ve been reading have been produced by this plugin. Try changing your desktop color scheme! 🤩
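The byte-versus-code-unit mismatch behind `stringByteSlice` is easy to reproduce. The post doesn’t show how that helper is implemented, so the version below is just one plausible way to write it, using Node’s `Buffer`.

```javascript
// Slice a JavaScript string by UTF-8 byte offsets (the kind tree-sitter reports)
// instead of UTF-16 code unit offsets (the kind String.prototype.slice uses).
function stringByteSlice(text, startByte, endByte) {
  return Buffer.from(text, "utf8").subarray(startByte, endByte).toString("utf8");
}

// "é" is one UTF-16 code unit but two UTF-8 bytes, so every byte offset after
// it is shifted by one relative to the string index.
const code = "é = 1";

// Suppose the highlighter reports the identifier at bytes [0, 2)
// and the number at bytes [5, 6).
const identifier = stringByteSlice(code, 0, 2);
const number = stringByteSlice(code, 5, 6);

// The same offsets passed to slice() land on the wrong characters.
const wrongIdentifier = code.slice(0, 2);
const wrongNumber = code.slice(5, 6);

console.log({ identifier, number, wrongIdentifier, wrongNumber });
```

Encoding the whole source on every call is wasteful for long code blocks; encoding once per block and reusing the buffer would be the obvious refinement.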
Using rehype-tree-sitter in your Astro project

To use rehype-tree-sitter, install it through npm:
npm install --save rehype-tree-sitter
Once you have the plugin installed, add it to your Astro config.
import { defineConfig } from "astro/config";
import rehypeTreeSitter from "rehype-tree-sitter";

// https://astro.build/config
export default defineConfig({
  markdown: {
    syntaxHighlight: false,
    rehypePlugins: [
      [
        rehypeTreeSitter,
        {
          treeSitterGrammarRoot: "/Users/haze/tsg",
          scopeMap: {
            "language-javascript": "source.js",
            "language-sh": "source.bash",
            "language-xml": "source.xml",
            "language-rust": "source.rust",
            "language-html": "text.html.basic",
          },
        },
      ],
    ],
  },
});
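The grammar root in that config is an absolute path on one machine. If the config is shared, one option is to resolve the directory at build time. The helper below is a sketch under assumptions: `grammarRoot` is a hypothetical function, the plugin itself is not known to read any environment variable, and the `TREE_SITTER_LANGUAGE_ROOT` name is simply borrowed from the Rust test program earlier.

```javascript
// Hypothetical helper for astro.config.mjs: prefer an environment variable,
// then fall back to a `tsg` directory under the user's home directory.
function grammarRoot(env = process.env) {
  if (env.TREE_SITTER_LANGUAGE_ROOT) return env.TREE_SITTER_LANGUAGE_ROOT;
  if (env.HOME) return `${env.HOME}/tsg`;
  throw new Error(
    "Set TREE_SITTER_LANGUAGE_ROOT to the directory containing your grammars"
  );
}

console.log(grammarRoot({ HOME: "/home/example" }));
```

In the config, this would replace the string literal: `treeSitterGrammarRoot: grammarRoot()`.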
- npm package
- cjs-style require in an ESM project. I had to use createRequire, which the docs suck for.

I’d like to extend a warm thank you to veritas and Danny. Their help was instrumental to my success. They assisted me with creating a node package, publishing it to npm, debugging, and testing. I’d also like to thank my wonderful girlfriend Katherine for proofreading.
[1] I hacked around this in a couple of ways. One way was disabling the usage of rehype-raw internally. You could dig even deeper into hast-util-to-html and edit the text callback to stop it from inserting newlines, but this is clearly not the solution, and will break other things.
[2] I had to figure out how to execute some commands to compile the Rust library. I learned about the beauties and dangers of postinstall npm scripts.