rehype-tree-sitter

during the migration of my blog from handwritten html to a fancier web framework, i noticed that there weren't many code block highlighting options to choose from. in terms of astro, there are only two plugins that are capable of syntax highlighting: shiki and prismjs. shiki uses textmate grammars. i'm not the hugest fan of regex-based syntax highlighting engines and i've spent way too much time creating custom grammars for new languages before, only to be met with a buggy mess in the end. (skill issue?) prism also uses regex rules to do its highlighting, so that's also not an option.

in my opinion, tree-sitter is currently the best option for highlighting source code. there's no contest. not only is it performant, but the dsl is much easier to wrap your head around. i knew right away that this was the tool i wanted to use to power the pretty little colors on my website. the real question was figuring out how.

after searching around, i discovered that astro uses unified as its underlying engine for working with html and markdown/mdx. to get what i wanted, i'd have to create a plugin that slots in between these conversion steps and produces the correct output.

iteration one – command line interface

in the beginning, i had a hard time deciding which part of the html/markdown processing chain to inject my logic into. rehype is responsible for transforming and working with html syntax trees (which they call hast). remark is the same, but for markdown (mdast). in the end i decided on rehype because, to get an mvp out, i wanted to use tree-sitter's built-in html highlight functionality.
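for orientation, here's a rough sketch of the hast shape a fenced markdown code block ends up as, and how a visitor finds it. the tree is hand-written for illustration, and `walk` is a hypothetical stand-in for unist-util-visit so the snippet runs without any packages.

```javascript
// hand-written hast for a fenced javascript code block (illustrative only)
const tree = {
  type: "root",
  children: [
    {
      type: "element",
      tagName: "pre",
      properties: {},
      children: [
        {
          type: "element",
          tagName: "code",
          // the language arrives as a class name on the code element
          properties: { className: ["language-javascript"] },
          children: [{ type: "text", value: "const x = 1;\n" }],
        },
      ],
    },
  ],
};

// hypothetical stand-in for unist-util-visit:
// call visitor on every node that passes the test
function walk(node, test, visitor, parent = null) {
  if (test(node)) visitor(node, parent);
  for (const child of node.children ?? []) walk(child, test, visitor, node);
}

const found = [];
walk(
  tree,
  (node) => node.tagName === "code",
  (node, parent) => {
    found.push({
      language: node.properties.className[0],
      source: node.children[0].value,
      parentTag: parent.tagName,
    });
  }
);
console.log(found);
```

inline code spans are also `code` elements, but they normally carry no class, so a plugin has to skip nodes with empty `properties`.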

to create a rehype plugin, i first needed to create a node package that will export the function responsible for transforming the relevant syntax tree.

import { visit } from "unist-util-visit";
import { rehype } from "rehype";

const isCodeBlockElement = (node) => node.tagName === "code";
export function rehypeTreeSitter(options) {
  return function (tree) {
    visit(tree, isCodeBlockElement, (node, index, parent) => {
      console.log(tree);
    });
  };
}

before messing around with astro and real data, i created a [jest](https://jestjs.io) test to make sure my code was correct. this also allows for faster iteration speed, since we only execute code we care about.

import { rehypeTreeSitter } from "./index.js";
import { rehype } from "rehype";

test("basic math", () => {
  expect(2 + 2).toBe(4);
});

now that i had some tooling in place, i could begin coding. i knew that if i executed tree-sitter highlight -H code.js, it'd spit out some html like below.

<body>
  <table>
    <tr>
      <td class="line-number">1</td>
      <td class="line">
        <span style="color: #5f00d7">function</span>
        <span style="color: #005fd7">sum</span><span style="color: #4e4e4e">(</span>
        <span style="text-decoration: underline;">a</span><span style="color: #4e4e4e">,</span>
        <span style="text-decoration: underline;">b</span><span style="color: #4e4e4e">)</span>
        <span style="color: #4e4e4e">{</span>
      </td>
    </tr>
    <tr>
      <td class="line-number">2</td>
      <td class="line">
        <span style="color: #5f00d7">return</span>
        <span style="text-decoration: underline;">a</span>
        <span style="font-weight: bold;color: #4e4e4e">+</span>
        <span style="text-decoration: underline;">b</span>
        <span style="color: #4e4e4e">;</span>
      </td>
    </tr>
    <tr>
      <td class="line-number">3</td>
      <td class="line"><span style="color: #4e4e4e">}</span></td>
    </tr>
  </table>
</body>

barring my initial concerns with the per-element styling, this looked workable! in my mind, i simply had to re-use whatever facilities unified has for parsing html into hast, and replace the code block in the original hast with the tree-sitter version. this ended up looking like this:

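import { execSync } from "node:child_process";
import { writeFileSync, unlinkSync } from "node:fs";
// assumption: temporaryFile comes from the tempy package
import { temporaryFile } from "tempy";

const exampleScopeMap = {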
  "language-javascript": "source.js",
  "language-sh": "source.bash",
  "language-xml": "source.xml",
};

function doesNodeHaveChildTable(node) {
  for (const child of node.children) {
    if (child.tagName === "table") return true;
  }
  return false;
}

export function rehypeTreeSitter(options) {
  return function (tree) {
    visit(tree, isCodeBlockElement, (node, index, parent) => {
      if (Object.keys(node.properties).length === 0) return;
      const code = node.children[0].value;
      const language = node.properties.className[0];
      if (!(language in exampleScopeMap)) return;
      const temporarySourceFile = temporaryFile();
      writeFileSync(temporarySourceFile, code);
      try {
        const result = execSync(
          `tree-sitter highlight --scope ${exampleScopeMap[language]} -H ${temporarySourceFile}`
        ).toString();
        const resultTree = rehype().parse(result);
        const resultBody = resultTree.children[1].children[2];
        if (!doesNodeHaveChildTable(resultBody)) {
          console.error(`No 'highlights.scm' found for ${language}`);
          return;
        }
        const codeTable = resultTree.children[1].children[2].children[1];
        parent.children[0].children = [codeTable];
      } catch (error) {
        console.error(error);
      } finally {
        // always clean up, including on the early return above
        unlinkSync(temporarySourceFile);
      }
    });
  };
}

here's a high-level breakdown of the process.

  1. create a temporary file to store the source code in
  2. execute tree-sitter highlight --scope <language> -H <temporary file location> to get the html
  3. reparse the html into hast (with some safety checks)
  4. set the parent's child to contain the code table

this code works, and for a couple of hours of work, produces stellar results.

[figure: tree-sitter built-in html highlighter]

however, there were a couple bugs that i encountered that made this approach unacceptable.

  1. whatever was converting the final hast into html inserted as many newlines into the pre element as there were lines of code in the original source block. this took a while to diagnose, and even longer to fix.[1]
  2. the colors were determined by an external tool (the tree-sitter configuration). this made it difficult to customize and control. i also wanted to have css change the color scheme based on the user's system appearance.
  3. the produced code utilized a table to get the lines of code to match up with their line number. while this did work, i had suspicions that it wasn't the correct approach for supporting line numbers.

wanting more flexibility, i decided it was time to dig deeper into the underlying code. having studied how that exact command line invocation worked, i now knew that tree-sitter ships libraries that give you more control over the process. the notable ones are tree-sitter-highlight, which gives you an iterator over your code that annotates spans with their highlight, and tree-sitter-loader, which handles compiling the grammars you supply behind the scenes.

iteration two – native rust node plugin

the big question (again) was how. how can i execute some rust code from within the node process? the answer is the node-api. the node-api lets you create c/c++ add-ons, which you can import into your javascript code and call like normal. knowing this, i began working on the rust code first.

here's the test program i used for verifying everything worked before attempting to port it to the node-api.

fn main() -> Result<(), Box<dyn std::error::Error>> {
  let language_root_path = std::env::var("TREE_SITTER_LANGUAGE_ROOT")?;
  let mut loader = tree_sitter_loader::Loader::new()?;
  loader.find_all_languages(&tree_sitter_loader::Config {
    parser_directories: {
      let mut vec = Vec::new();
      vec.push(PathBuf::from(language_root_path));
      vec
    },
    ..Default::default()
  })?;

  let mut process_arg_iter = std::env::args().skip(1);
  let root_scope = format!(
    "source.{}",
    process_arg_iter.next().ok_or(Error::NoScopeProvided)?
  );
  let source = std::io::read_to_string(std::io::stdin())?;
  let (language, language_configuration) = loader
    .language_configuration_for_scope(&*root_scope)?
    .ok_or(Error::MissingLanguageConfiguration)?;
  let highlight_config = language_configuration
    .highlight_config(language)?
    .ok_or(Error::MissingHighlightConfiguration)?
    .clone();
  let mut highlighter = Highlighter::new();
  let mut highlight_iter =
    highlighter.highlight(highlight_config, source.as_bytes(), None, |_| None)?;
  while let Some(Ok(event)) = highlight_iter.next() {
    match event {
      tree_sitter_highlight::HighlightEvent::Source { start, end } => {
        print!("{}", &source[start..end])
      }
      tree_sitter_highlight::HighlightEvent::HighlightStart(_) |
      tree_sitter_highlight::HighlightEvent::HighlightEnd => {}
    }
  }
  println!("");
  Ok(())
}

the real meat of the function is near the end. using while let, i'm able to pull each HighlightEvent out of that particular session and act accordingly. i don't act on the highlight start and end events yet, but being able to recreate the text from the byte indices is a good start. now that i can react to each 'chunk' of data produced by tree-sitter, it's time to port this to node.

node-bindgen is an excellent package for this. it allows me to simply annotate a function with an attribute (a proc-macro) and use the resulting binary in my node project. below is all the code i needed to change.

#[node_bindgen]
fn driver<F: Fn(SerializableHighlightEvent)>(
        language_root_path: String,
        root_scope: String,
        source: String,
        callback: F,
) -> Result<(), NjError> {
        // ...
}

i could then require my dylib and call driver. awesome!

you might be wondering what callback is doing there. the general idea behind this approach is that node hands control over the highlighting process to rust. rust is responsible for executing the code that parses the incoming text, but the business logic is determined by the callback. in rehype-tree-sitter's case, it needs to react to the event and modify the hast tree accordingly. let's jump back to the plugin.

export default function rehypeTreeSitter(options) {
  if (options?.treeSitterGrammarRoot === undefined)
    throw new Error("Need to provide `options.treeSitterGrammarRoot`");
  return function (tree) {
    visit(tree, isCodeBlockElement, (node, index, parent) => {
      if (Object.keys(node.properties).length === 0) return;
      const code = node.children[0].value;
      const language = node.properties.className[0];
      if (!(language in (options.scopeMap || exampleScopeMap))) return;
      node.children = [];
      const highlightStack = [];
      core.driver(
        options.treeSitterGrammarRoot,
        (options.scopeMap || exampleScopeMap)[language],
        code,
        (event) => {
          if (event.source !== undefined) {
            const sourceChunk = stringByteSlice(
              code,
              Number(event.source.start),
              Number(event.source.end)
            );
            if (highlightStack.length === 0) {
              node.children.push({ type: "text", value: sourceChunk });
            } else {
              node.children.push(h("span", { class: highlightStack.join(" ") }, sourceChunk));
            }
          } else if (event.highlightStart !== undefined) {
            highlightStack.push(event.highlightStart.highlightName);
          } else if (event === "HighlightEnd") {
            highlightStack.pop();
          }
        }
      );
    });
  };
}

there's a bunch of extra functionality that wasn't here before.

  1. options.treeSitterGrammarRoot: this lets the consumer specify which directory tree-sitter-loader should load and compile its grammars from.
  2. options.scopeMap: this option lets consumers specify a mapping between the language class markdown parsers insert into the code element and the tree-sitter scope. this is how we communicate to tree-sitter which grammar to use.
  3. highlightStack keeps track of the highlight classes. this is helpful for denoting tokens that belong to more than one class.
  4. stringByteSlice: originally, i used the string's built-in slice(), but that broke on code containing non-ascii characters. javascript strings are indexed by utf-16 code units, while tree-sitter hands back utf-8 byte offsets, so the two only line up for pure ascii text.
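to make items 3 and 4 concrete, here's a minimal sketch that replays a hand-written event list the way the callback above does. everything in it is an assumption for illustration: this `stringByteSlice` is my guess at a helper with that behavior (the plugin's real one may differ), `replayEvents` is a hypothetical name, the events are made up rather than real tree-sitter output, and plain objects stand in for the nodes `h()` would build.

```javascript
// hypothetical stand-in for the plugin's stringByteSlice helper:
// slice by utf-8 byte offsets instead of utf-16 code units
function stringByteSlice(text, byteStart, byteEnd) {
  const bytes = new TextEncoder().encode(text);
  return new TextDecoder().decode(bytes.subarray(byteStart, byteEnd));
}

// replay events the same way the plugin's callback reacts to them
function replayEvents(code, events) {
  const children = [];
  const highlightStack = [];
  for (const event of events) {
    if (event === "HighlightEnd") {
      highlightStack.pop();
    } else if (event.highlightStart !== undefined) {
      highlightStack.push(event.highlightStart.highlightName);
    } else if (event.source !== undefined) {
      const value = stringByteSlice(code, event.source.start, event.source.end);
      children.push(
        highlightStack.length === 0
          ? { type: "text", value }
          : { type: "span", className: highlightStack.join(" "), value }
      );
    }
  }
  return children;
}

const code = 'let s = "é";';
// made-up events; the byte offsets count "é" as two bytes
const events = [
  { highlightStart: { highlightName: "keyword" } },
  { source: { start: 0, end: 3 } },
  "HighlightEnd",
  { source: { start: 3, end: 8 } },
  { highlightStart: { highlightName: "string" } },
  { source: { start: 8, end: 12 } },
  "HighlightEnd",
  { source: { start: 12, end: 13 } },
];

const nodes = replayEvents(code, events);
console.log(nodes);
// the naive version drifts once a multi-byte character appears
console.log(JSON.stringify(code.slice(8, 12)));
```

re-encoding the whole string on every call is wasteful for large inputs; encoding once and reusing the byte array would be the obvious refinement.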

aside from some packaging work[2], this was all that i needed to get comprehensive syntax highlighting suitable for static site generators. i'm not quite sure (but curious) about server-side rendered components. in theory, if whatever is producing the html has access to the tree-sitter grammars, then it should work there as well. the code snippets you've been reading have been produced by this plugin. try changing your desktop color scheme! 🤩

how to use rehype-tree-sitter in your astro project

to use rehype-tree-sitter, install it through npm:

npm install --save rehype-tree-sitter

once you have the plugin installed, add it to your astro config.

import { defineConfig } from "astro/config";
import rehypeTreeSitter from "rehype-tree-sitter";

// https://astro.build/config
export default defineConfig({
  markdown: {
    syntaxHighlight: false,
    rehypePlugins: [
      [
        rehypeTreeSitter,
        {
          treeSitterGrammarRoot: "/Users/haze/tsg",
          scopeMap: {
            "language-javascript": "source.js",
            "language-sh": "source.bash",
            "language-xml": "source.xml",
            "language-rust": "source.rust",
            "language-html": "text.html.basic",
          },
        },
      ],
    ],
  },
});

things i learned when publishing my first npm package

  1. don't delete your package if you push garbage. i did this and had to wait a day to re-publish. (i got a 403 forbidden when attempting to publish the fixed variant.)
  2. i was able to get away with a cjs-style require in an esm project. i had to use createRequire (from node's module package), which the docs suck for.
  3. using rust code from node is way easier than i thought.
  4. this technique is powerful. i can control the css for every token, with extreme granularity, for every language. for some languages (like hare or html) the syntax is pretty simple, and the theme can be relatively simple as well. for languages like rust, it helps to have more colors distinguishing things like lifetimes and loop labels.

special thanks

i'd like to extend a warm thank you to veritas and danny. their help was instrumental to my success. they assisted me with creating a node package, publishing it to npm, debugging, and testing. i'd also like to thank my wonderful girlfriend katherine for proofreading.

Footnotes:

[1] i hacked around this in a couple of ways. one way was disabling the usage of rehype-raw internally. you could dig even deeper into hast-util-to-html and edit the text callback to stop it from inserting newlines, but this is clearly not the solution, and will break other things.

[2] i had to figure out how to execute some commands to compile the rust library. i learned about the beauties and dangers of postinstall npm scripts.
