Exporting Language Server Data to SQL

October 16, 2024

Let's do what we always do... let's export the data from our language server into a SQL database. After all, the Language Server has all of this information, but its query language is a tedious JSON-RPC situation.

This post was an excuse for me to learn a bit more about language servers; come along for the ride!

Much of our mutual drudgery is slurping data from one end of our systems to another through a tiny, awkwardly-shaped straw. Some examples, to wit:

Data	Straw	Query System	Alternative
Linux Process Information	`/proc` file system	`ps`	`osquery`
Open Files, Ports	`/proc` on Linux	`lsof` ...	`grep`
AWS Instance Metadata	`DescribeInstances` API	`aws` CLI with `--filters` flags	IASql?
File System Metadata	`listdir()`, `stat`	`find`	`fselect`?
Git	`git log`	`git log`	`mergestat-lite`?
Type Information about your Code	Language Servers	VSCode, compiler	🌟This blog post!🌟

Compared to OS X's infrastructure for listing processes, the /proc file system is lovely; we have tools (like grep) for querying file systems!

Let me know if you've got items to add to my handy table.

Language Server History #

Microsoft, with Visual Studio Code, led the way in standardizing Language Servers. They describe their history on github. The specification is at https://microsoft.github.io/language-server-protocol/.

With a nod to XKCD 927, VSCode doesn't actually use a Typescript Language Server! It uses a "TypeScript Server" (tsserver). The Language Server protocol evolved as a generalization, but the migration hasn't happened. For this project, I used one of the wrappers that wraps tsserver in a language server.

Language Server Protocol #

Language servers typically run as a separate process, and communicate with their parent via stdin/stdout pipes. This makes them behind the scenes, as they're managed by your IDE, but it means it's tricky to see their logs, and, at least for me, it made it a bit tricky to attach a debugger to the language server.

Using stdin/stdout, and needing to "read lines" to parse the content-length exposes you to some fun with Python's bytes versus strings, UTF8 encoding, blocking reads and writes, etc. I'm sure my script breaks if you're not using UTF-8, but surely you're using UTF8 everywhere.

The protocol pushes a Content-Length line, \r\n\r\n newlines, and then a JSON blob of content. This happens bi-directionally. There are notifications (which go one way and don't expect a response) and requests (which expect a response). The specification has a list of methods. A reasonable person would use a pre-existing client library with typed support. This blog post was not written reasonably.

The protocol is stateful. You pretend to be an editor and say that you "didOpen" a document, and then you can ask for hover annotations.

Often, you can script APIs by inspecting what they do in the Network tab and using "Copy as curl" and scripting with bash. This is not easy to do for language servers. Sometimes CLI tools for the API are the way to go, but I didn't find much (though maybe should have tried lsp-cli).

Having said that, I nerd-sniped myself, and here's querying language servers in bash. Note how counting content-length is a pain, but we have printf to pad our JSON for us. The sleep is insidious: the language server is happy to exit whenever its input stream is closed, and Node sys.stdout.write() is asynchronous. (See this note: "Writes may be synchronous depending on what the stream is connected to"... Or also this investigation.) Let me know if I've nerd-sniped you, and you figured out how to get rid of the sleep without resorting to reasonable approaches like python.

{
        printf "Content-Length: 1000\r\n\r\n%1000s" '{"jsonrpc":"2.0","method":"initialize","id":1,"params":{"capabilities":[],"processId":'$$'}}'
        printf "Content-Length: 10000\r\n\r\n%10000s" '{"jsonrpc":"2.0","method":"textDocument/didOpen","params":{"textDocument":{"uri":"file:///Users/philip/src/hono/src/utils/url.ts","languageId":"typescript","version":1,"text":'"$(cat /Users/philip/src/hono/src/utils/url.ts | jq -Rs .)"'}}}'
        printf "Content-Length: 1000\r\n\r\n%1000s" '{"jsonrpc":"2.0","method":"textDocument/documentSymbol","id":3,"params":{"textDocument":{"uri":"file:///Users/philip/src/hono/src/utils/url.ts"}}}'
        sleep 2
} | foo/node_modules/.bin/typescript-language-server --stdio | grep '"id":3,' | jq | head -n 10
{
  "jsonrpc": "2.0",
  "id": 3,
  "result": [
    {
      "name": "_decodeURI",
      "kind": 14,
      "location": {
        "uri": "file:///Users/philip/src/hono/src/utils/url.ts",
        "range": {

🥁🥁 Here's the same data, queried sensibly:

$duckdb language-server-db/sample/hono.db --line "
	SELECT *
	FROM symbols
	WHERE uri LIKE '%url.ts'
	ORDER BY symbol_start_line DESC
	LIMIT 1
"
                   uri = file:///Users/philip/src/hono/src//utils/url.ts
           symbol_name = decodeURIComponent_
           symbol_kind = Constant
     symbol_start_line = 309
symbol_start_character = 13
       symbol_end_line = 309
  symbol_end_character = 53
 symbol_container_name =

Visualization #

So, how should we look at this data? Probably the editors, with their "inlay hints" and their mouse-overs are doing it right, but maybe like an annotated ("glossed") medieval manuscript? Maybe like my student's copy of The Aeneid, with lots of hard to decipher vocabulary notes?

Medieval Gloss — Bibliothèque nationale de France, MS Latin 7980, detail of fol. 5v. from medievalcodes.ca

This is a long-winded blog post, so we also get the opportunity to quote Bret Victor:

We expect programmers to write code that manipulates variables, without ever seeing the values of those variables. We expect readers to understand code that manipulates variables, without ever seeing the values of the variables. The entire purpose of code is to manipulate data, and we never see the data. We write with blindfolds, and we read by playing pretend with data-phantoms in our imaginations.

One of the all-time most popular programming models is the spreadsheet. A spreadsheet is the dual of a conventional programming language -- a language shows all the code, but hides the data. A spreadsheet shows all the data, but hides the code. Some people believe that spreadsheets are popular because of their two-dimensional grid, but that's a minor factor. Spreadsheets rule because they show the data.

It's a stretch, but, in this case, the hidden data is the types the compiler knows but are hidden to us as the readers.

And, so, I fed hono into it. (Hono is like Express, but it's the default web framework for val.town...)

Sample of Glossed bit of Hono's url.ts

Source #

https://philz.github.io/language-server-db/ and https://github.com/philz/language-server-db have the Python script, a full page of sample output, etc.