
Observability in Trouble

Writing down and sharing some tools and tricks that got us out of a jam. These tools and jams are quite generic, and I wouldn't hesitate to re-implement them in new contexts.

The Simplest of Profilers

A specific kind of request for a very specific customer was really, really slow. We could see the slowness in the logs (hey, why is that taking 30+ seconds?), but we couldn't tell why, or what had changed. An incident formed, and folks started combing through commits. git log -p this-release ^previous-release and all that.

This particular system was NodeJS-based, and it turns out that if you send SIGUSR1 (via something like kill -usr1 $(pgrep -f my-process-name)) to a NodeJS process, it will start listening for a debugger. It'll spew something like the following to stderr:

Debugger listening on ws://127.0.0.1:9229/898e0871-03cf-4bcc-8e0a-17a5579e6dae
For help, see: https://nodejs.org/en/docs/inspector

Then, you use some ssh -L9229:9229 production-host incantations to tunnel your local port 9229 to production's 9229, and then you can use chrome://inspect to attach Chrome's debugger to your remote process.

And then you use the simplest of profilers: you hit pause, note the stack trace, and hit play. And you do that a few more times. And, surprisingly, the stack trace is the same every time: that's your hot spot!

Once we had the stack trace, and we could inspect some variables and arguments for state, we quickly realized what was "accidentally quadratic" (hat tip to the Accidentally Quadratic blog) and the rest was history...

(Ctrl-C Profiling is in the same spirit.)

Profilers in Production

The above--connecting to production, SSH tunnels, clicking around in your debugger--is a bit imposing. It can be finicky. It should be gated with permissions and processes.

Instead, set up something that allows you to start a profiler via some HTTP path (preferably gated with permissions), possibly via a magic header. Have that profiler run for, say, 60 seconds, and dump its output to S3. Then log the S3 path, behind a proxy that lets you download the profile, to your logging system.
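
Here's a minimal sketch of what that endpoint can look like in Node, using the built-in inspector's CPU profiler. The Express app, the auth check, and the uploadToS3 helper are assumptions; adapt them to your own router, permissions, and blob storage:

const inspector = require('node:inspector');
const { promisify } = require('node:util');

const session = new inspector.Session();
session.connect();
const post = promisify(session.post.bind(session));

// Hypothetical Express route; gate it with real permissions in practice.
app.post('/debug/cpu-profile', async (req, res) => {
	if (req.headers['x-debug-token'] !== process.env.DEBUG_TOKEN) {
		return res.status(403).end();
	}
	await post('Profiler.enable');
	await post('Profiler.start');
	setTimeout(async () => {
		const { profile } = await post('Profiler.stop');
		const key = `cpu-profiles/${Date.now()}.cpuprofile`;
		// uploadToS3 is a stand-in for your blob-storage client.
		await uploadToS3(key, JSON.stringify(profile));
		console.log(`cpu profile uploaded to ${key}`); // lands in your logs
	}, 60_000);
	res.status(202).send('profiling for 60 seconds\n');
});

The resulting .cpuprofile JSON can be loaded into Chrome DevTools or speedscope.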

Now, you can trigger a profile, trigger the errant action, and analyze the profile, all from the convenience of home! Since profiles only have function names and timing information, they are, unlike customer data, allowed to be on your dev machine.

If you want to go one step further, integrate with Speedscope and serve the profiler UI directly.
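
One lightweight version of that, short of self-hosting the UI, is to bounce through speedscope.app's profileURL hash parameter; the profile URL needs CORS that lets speedscope fetch it, and signedS3Url below is a hypothetical helper:

app.get('/debug/cpu-profile/:key/view', async (req, res) => {
	// Redirect to speedscope, pointed at a short-lived signed URL for the profile.
	const url = await signedS3Url(`cpu-profiles/${req.params.key}`);
	res.redirect(`https://www.speedscope.app/#profileURL=${encodeURIComponent(url)}`);
});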

Canonical log lines

See Stripe's write-up on Canonical Log Lines.

Have a structured log line for every request and point your log analysis tool (e.g., Kibana) at it. Just looking for the slowest requests can often pinpoint a source of trouble. Having p95 latencies in your monitoring stack is nice and all, but, at the end of the day, you need to find some actual requests that experienced those latencies, and it sure is handy when that's easy.
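
A minimal sketch of such a line, as an assumed Express middleware; the field names are illustrative, not prescriptive:

app.use((req, res, next) => {
	const start = process.hrtime.bigint();
	res.on('finish', () => {
		// One structured line per request: everything you'd want when hunting
		// a slow or failing request, greppable and aggregatable in Kibana.
		console.log(JSON.stringify({
			event: 'canonical_request',
			method: req.method,
			path: req.path,
			status: res.statusCode,
			duration_ms: Number(process.hrtime.bigint() - start) / 1e6,
			trace_id: req.headers['x-trace-id'], // or your tracing header
			user_id: req.user?.id,               // assumes auth middleware ran earlier
			db_query_count: req.dbQueryCount,    // hypothetical per-request counters
			db_query_ms: req.dbQueryMs,
		}));
	});
	next();
});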

Sampled logging and SQL queries

If your app talks to a database, a common source of latency and load is the SQL queries made in responding to a request. If you have a misbehaving query, it's hard to figure out what code path is invoking it (especially if it's in a library or an ORM layer is involved). If you have a slow request, it's hard to know which queries are slowing it down. Sure, you can keep a query counter and a query latency total in your canonical log line, but you still need to figure out which queries are at the heart of the issue.

So, for every Nth request (N=10,000 is reasonable), log all the queries that request makes! If you have enough QPS flowing through your system, you'll have a pretty effective sample of all those queries in your log system for further analysis.
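
A minimal sketch, assuming a db.query driver call and a per-request ctx object that you already thread through your data layer:

const SAMPLE_EVERY_N = 10_000;
let requestCounter = 0;

// Call once at the start of each request.
function startRequestContext() {
	return { sampled: ++requestCounter % SAMPLE_EVERY_N === 0 };
}

// Wrap your query path; db.query is a stand-in for your driver or ORM hook.
async function query(ctx, sql, params) {
	const start = Date.now();
	const result = await db.query(sql, params);
	if (ctx.sampled) {
		// Log the SQL text and timing, not the parameters (customer data).
		console.log(JSON.stringify({
			event: 'sampled_query',
			sql,
			duration_ms: Date.now() - start,
		}));
	}
	return result;
}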

If you have a tracing header in your system (e.g., OpenTelemetry's trace-id), you can tie the sampling to the trace-id. If you're comfortable with the security implications, you can, perhaps, learn how to craft a trace-id that ensures the particular request you're making right now with curl is one of the requests that gets sampled.
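
A sketch of that variant: hash the trace-id and sample when it lands in a 1-in-N bucket, which also means a deliberately crafted trace-id is guaranteed to be sampled:

const crypto = require('node:crypto');

function isSampled(traceId, n = 10_000) {
	// Deterministic: the same trace-id is sampled (or not) by every service.
	const hash = crypto.createHash('sha256').update(traceId).digest();
	return hash.readUInt32BE(0) % n === 0;
}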

It's often useful to know the stack trace that triggered a SQL query to be executed. Taking stack traces (via new Error() or the like) can be expensive in your runtime. Sampling can come to the rescue again: do it every Mth request that's already sampled, and your logs will give you stack traces.
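
Extending the sketch above, one way to keep the expensive part rare:

const STACK_EVERY_M = 10;
let sampledCounter = 0;

function startRequestContext() {
	const sampled = ++requestCounter % SAMPLE_EVERY_N === 0;
	// Only every Mth *sampled* request pays for new Error() stacks.
	const withStacks = sampled && ++sampledCounter % STACK_EVERY_M === 0;
	return { sampled, withStacks };
}

// ...and in query(), add to the logged object:
//   stack: ctx.withStacks ? new Error().stack : undefined,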

I've also, in the past, built a CI job that annotated all SQL queries and shoved them into a SQLite file (viewable, e.g., with https://datasette.io). That worked, but logs turned out more useful. You can also put the stack trace into a query comment, as suggested by Henry.

Dynamic languages and profiles

Sometimes, you want to sneak a bit of data into your profile. For (a made-up) example, say you're executing a query, and you want to replace scanTable() with scanTableT123456(), so that you can discern which table is being scanned. In a sufficiently dynamic language (Java, Python, JS all qualify), you can sneak a string into a profile like so:

function logStack() {
	console.log(new Error().stack);
}

const identifierRegex = /^[A-Za-z$_][A-Za-z0-9$_]*$/;

function named(name, f) {
	if (!identifierRegex.test(name)) {
		throw new Error('Invalid name.');
	}
	return eval(`let z = function ${name}(F) { F(); }; z`);
}

named("SUPER_DUPER_NAME")(logStack);

The output looks like:

$ node sneaky.js
Error
    at logStack (/Users/philip/src/blog/content/blog/observability-in-trouble/sneaky.js:2:14)
    at SUPER_DUPER_NAME (eval at named (/Users/philip/src/blog/content/blog/observability-in-trouble/sneaky.js:11:9), <anonymous>:1:40)
    at Object.<anonymous> (/Users/philip/src/blog/content/blog/observability-in-trouble/sneaky.js:14:26)
    at Module._compile (node:internal/modules/cjs/loader:1376:14)
    at Module._extensions..js (node:internal/modules/cjs/loader:1435:10)
    at Module.load (node:internal/modules/cjs/loader:1207:32)
    at Module._load (node:internal/modules/cjs/loader:1023:12)
    at Function.executeUserEntryPoint [as runMain] (node:internal/modules/run_main:135:12)
    at node:internal/main/run_main_module:28:49

Note that Content Security Policies can disable the use of eval(), and with it, this trick.