Having spent a good deal of time staring at profiles, I wanted to explore what sampling profiles and Flame Graphs are in a bit more detail. Brendan Gregg's ACM article on flamegraphs describes them:
A flame graph visualizes a collection of stack traces (aka call stacks), shown as an adjacency diagram with an inverted icicle layout.
The dataset produced by a sampling profiler is quite simple: it is a collection
of call stacks. If a function appears frequently, it's hot, and may be a target
for optimization. The simplest analysis might be to just look for common stacks with
cat stacks | sort | uniq -c
(or the like), but visualizing the data
helps weed out the frequent, correlated, unimportant bits.
(A "Poor Man's Profiler", e.g. jpmp for Java,
may be as simple as calling jstack in a loop.)
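In that spirit, here's a minimal DuckDB version of that sort | uniq -c pass. This is just a sketch: stacks.txt is a made-up filename, assumed to contain one raw stack per line.

-- Read the file one line per row ('|' doesn't appear in the data),
-- then count how often each identical stack shows up.
SELECT line AS stack, count(*) AS samples
FROM read_csv('stacks.txt', delim='|', columns = { 'line': 'varchar' })
GROUP BY line
ORDER BY samples DESC
LIMIT 20;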
Brendan Gregg's
flamegraph.pl
takes a "semi-colon delimited stack with count" text file and produces an SVG. Can we
do the same, but use DuckDB for the data munging and Vega-Lite (as wrapped by Altair)
for the visualization? Sure we can!
For this example, we've borrowed a sample stack trace from Speedscope's repo.
%%capture
# Install dependencies. The capture cell magic hides the output.
%pip install duckdb notebook jupysql pandas matplotlib duckdb-engine pyarrow
%pip install altair vl-convert-python anywidget
import duckdb
import pandas as pd
import altair as alt
# Connect to in-memory DuckDB
%load_ext sql
conn = duckdb.connect()
%sql conn --alias duckdb
Load our file into DuckDB by pretending it's a CSV file. We use "|" as the delimiter, which doesn't appear in the data, so we get one row per line. We use the row_number()
window function to assign an id to each stack.
%%sql
DROP TABLE IF EXISTS stack_data_lines;
CREATE TABLE stack_data_lines AS
SELECT *, row_number() OVER () AS line_id
FROM read_csv('perf-vertx-stacks-01-collapsed-all.txt',
              delim='|',
              columns = { 'line': 'varchar' });
SELECT * FROM stack_data_lines WHERE line_id = 3
line | line_id |
---|---|
java;start_thread;java_start;GCTaskThread::run;StealTask::do_it;PSPromotionManager::drain_stacks_depth;oopDesc* PSPromotionManager::copy_to_survivor_space | 3 |
Now split out the weight and the stack. We parse the stack into an array.
%%sql
DROP TABLE IF EXISTS stacks;
CREATE TABLE stacks AS
SELECT string_split(regexp_extract(line, '^(.*) (\d+)$', 1), ';') AS stack,
       CAST(regexp_extract(line, '^(.*) (\d+)$', 2) AS int64) AS weight,
       line_id
FROM stack_data_lines;
SELECT * FROM stacks WHERE line_id = 3
stack | weight | line_id |
---|---|---|
['java', 'start_thread', 'java_start', 'GCTaskThread::run', 'StealTask::do_it', 'PSPromotionManager::drain_stacks_depth', 'oopDesc* PSPromotionManager::copy_to_survivor_space | 1 | 3 |
Using UNNEST, we create one row per stack frame. Two UNNESTs in the same SELECT statement pair up their results, which gives us the frame index, which we call depth. We also record the ancestry.
This took a bit of sleuthing to find. In PostgreSQL, the equivalent is WITH ORDINALITY.
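For comparison, here's a rough PostgreSQL-flavored sketch (not run here, and assuming a stacks table with the same columns):

-- PostgreSQL: unnest(...) WITH ORDINALITY yields each element plus its 1-based index.
SELECT s.weight,
       s.line_id,
       f.ord - 1 AS depth,  -- make it 0-based to match the DuckDB version
       f.func
FROM stacks s,
     unnest(s.stack) WITH ORDINALITY AS f(func, ord);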
(Originally, I wrote a recursive CTE to build out the ancestry, but given array functions and depth, we don't need to!)
%%sql
DROP TABLE IF EXISTS stack_frames;
CREATE TABLE stack_frames AS
SELECT weight,
       line_id,
       UNNEST(RANGE(array_length(stack))) AS depth,
       UNNEST(stack) AS func,
       stack[:depth] AS ancestry,
       stack
FROM stacks;
SELECT * FROM stack_frames WHERE line_id = 3
weight | line_id | depth | func | ancestry | stack |
---|---|---|---|---|---|
1 | 3 | 0 | java | [] | ['java', 'start_thread', 'java_start', 'GCTaskThread::run', 'StealTask::do_it', 'PSPromotionManager::drain_stacks_depth', 'oopDesc* PSPromotionManager::copy_to_survivor_space |
1 | 3 | 1 | start_thread | ['java'] | ['java', 'start_thread', 'java_start', 'GCTaskThread::run', 'StealTask::do_it', 'PSPromotionManager::drain_stacks_depth', 'oopDesc* PSPromotionManager::copy_to_survivor_space |
1 | 3 | 2 | java_start | ['java', 'start_thread'] | ['java', 'start_thread', 'java_start', 'GCTaskThread::run', 'StealTask::do_it', 'PSPromotionManager::drain_stacks_depth', 'oopDesc* PSPromotionManager::copy_to_survivor_space |
1 | 3 | 3 | GCTaskThread::run | ['java', 'start_thread', 'java_start'] | ['java', 'start_thread', 'java_start', 'GCTaskThread::run', 'StealTask::do_it', 'PSPromotionManager::drain_stacks_depth', 'oopDesc* PSPromotionManager::copy_to_survivor_space |
1 | 3 | 4 | StealTask::do_it | ['java', 'start_thread', 'java_start', 'GCTaskThread::run'] | ['java', 'start_thread', 'java_start', 'GCTaskThread::run', 'StealTask::do_it', 'PSPromotionManager::drain_stacks_depth', 'oopDesc* PSPromotionManager::copy_to_survivor_space |
1 | 3 | 5 | PSPromotionManager::drain_stacks_depth | ['java', 'start_thread', 'java_start', 'GCTaskThread::run', 'StealTask::do_it'] | ['java', 'start_thread', 'java_start', 'GCTaskThread::run', 'StealTask::do_it', 'PSPromotionManager::drain_stacks_depth', 'oopDesc* PSPromotionManager::copy_to_survivor_space |
1 | 3 | 6 | oopDesc* PSPromotionManager::copy_to_survivor_space | ['java', 'start_thread', 'java_start', 'GCTaskThread::run', 'StealTask::do_it', 'PSPromotionManager::drain_stacks_depth'] | ['java', 'start_thread', 'java_start', 'GCTaskThread::run', 'StealTask::do_it', 'PSPromotionManager::drain_stacks_depth', 'oopDesc* PSPromotionManager::copy_to_survivor_space |
1 | 3 | 7 | InstanceKlass::oop_push_contents | ['java', 'start_thread', 'java_start', 'GCTaskThread::run', 'StealTask::do_it', 'PSPromotionManager::drain_stacks_depth', 'oopDesc* PSPromotionManager::copy_to_survivor_space | ['java', 'start_thread', 'java_start', 'GCTaskThread::run', 'StealTask::do_it', 'PSPromotionManager::drain_stacks_depth', 'oopDesc* PSPromotionManager::copy_to_survivor_space |
Now we can finally start computing the layout. First, for a given ancestry, we compute the cumulative weights. In one dimension, this is like a stacked bar chart.
%%sql
DROP TABLE IF EXISTS stack_frames_with_weight;
CREATE TABLE stack_frames_with_weight AS
WITH b AS (
    SELECT depth, func, ancestry, SUM(weight) AS weight
    FROM stack_frames
    GROUP BY 1, 2, 3
    ORDER BY 4 DESC
)
SELECT *,
       SUM(weight) OVER (
           PARTITION BY ancestry
           ORDER BY weight DESC
       ) AS running_weight
FROM b;
SELECT * FROM stack_frames_with_weight WHERE depth = 3
depth | func | ancestry | weight | running_weight |
---|---|---|---|---|
3 | JavaThread::run | ['java', 'start_thread', 'java_start'] | 263 | 263 |
3 | GCTaskThread::run | ['java', 'start_thread', 'java_start'] | 14 | 277 |
3 | VMThread::run | ['java', 'start_thread', 'java_start'] | 4 | 281 |
3 | hypercall_page_[k] | ['java', 'write', 'check_events_[k]'] | 3 | 3 |
3 | hypercall_page_[k] | ['java', 'read', 'check_events_[k]'] | 1 | 1 |
Now, finally, we use a recursive common table expression (CTE) to build out where every rectangle needs to go. At the bottom, we have a stacked bar chart of all the functions at depth 0. As we go up the stack, though, we want to align functions with their ancestry.
%%sql
DROP TABLE IF EXISTS flame_graph;
CREATE TABLE flame_graph AS
WITH RECURSIVE a AS (
    SELECT depth, func, ancestry, weight,
           running_weight - weight AS x,
           running_weight AS x2
    FROM stack_frames_with_weight
    WHERE depth = 0
  UNION ALL
    SELECT s.depth, s.func, s.ancestry, s.weight,
           s.running_weight - s.weight + a.x AS x,
           s.running_weight + a.x AS x2
    FROM stack_frames_with_weight s
    JOIN a ON s.ancestry = list_append(a.ancestry, a.func)
          AND s.depth = a.depth + 1
), t AS (
    SELECT sum(weight) AS w
    FROM stack_frames_with_weight
    WHERE depth = 0
)
SELECT a.*,
       a.x / w AS x_pct,
       a.x2 / w AS x2_pct,
       weight / w AS weight_pct
FROM a JOIN t ON TRUE;
SELECT * FROM flame_graph WHERE depth = 3
depth | func | ancestry | weight | x | x2 | x_pct | x2_pct | weight_pct |
---|---|---|---|---|---|---|---|---|
3 | JavaThread::run | ['java', 'start_thread', 'java_start'] | 263 | 0 | 263 | 0.0 | 0.9228070175438596 | 0.9228070175438596 |
3 | GCTaskThread::run | ['java', 'start_thread', 'java_start'] | 14 | 263 | 277 | 0.9228070175438596 | 0.9719298245614035 | 0.04912280701754386 |
3 | VMThread::run | ['java', 'start_thread', 'java_start'] | 4 | 277 | 281 | 0.9719298245614035 | 0.9859649122807017 | 0.014035087719298246 |
3 | hypercall_page_[k] | ['java', 'write', 'check_events_[k]'] | 3 | 281 | 284 | 0.9859649122807017 | 0.9964912280701754 | 0.010526315789473684 |
3 | hypercall_page_[k] | ['java', 'read', 'check_events_[k]'] | 1 | 284 | 285 | 0.9964912280701754 | 1.0 | 0.0035087719298245615 |
Finally, we can visualize this with Vega-Lite. Here I'm using Altair as a Python front-end. I'm inverting depth because I slightly prefer "icicle" graphs, where the stacks droop from the ceiling.
Vega-Lite does not seem to be able to truncate text marks to a dynamic size based on the width of the bar I want the text to fit into. I've resorted to doing the truncation here in Python land; please let me know if you've got a better solution. (Nothing prevents us from doing it in SQL.)
result = %sql SELECT * FROM flame_graph
df = result.DataFrame()

# Invert to create an icicle chart
df['depth'] = -df['depth']

# Truncate function names. We use a fixed-width font to make this somewhat correct.
HEIGHT_OF_BAR = 20
WIDTH = 800
FONT_SIZE = 11
FONT_WIDTH = 11 * 0.75
df['func_short'] = df.apply(
    lambda row: row['func'][:int(row['weight_pct'] * WIDTH / FONT_WIDTH)],
    axis=1)

tooltip = alt.Tooltip(['func', 'ancestry', 'weight', 'weight_pct'])
chart = (alt.Chart(df)
         .transform_calculate(xxx='datum.weight')
         .encode(alt.X('x_pct'), alt.X2('x2_pct'),
                 alt.Y('depth').scale(domainMax=0.5))
         .properties(width=WIDTH,
                     height=HEIGHT_OF_BAR * (df['depth'].abs().max() + 1)))
bars = (chart
        .mark_bar(height=HEIGHT_OF_BAR)
        .encode(color=alt.Color('func', legend=None), tooltip=tooltip))
text = (chart
        .mark_text(dx=2, align='left', clip=True, font='monospace')
        .encode(text='func_short', tooltip=tooltip))
flamegraph = bars + text
flamegraph
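As noted above, nothing stops us from doing the truncation in SQL instead. Here's a minimal sketch against the flame_graph table, hard-coding the same 800-pixel chart width and 8.25-pixel character width used above:

-- Approximate how many monospace characters fit in each bar and cut the name there.
-- 800 is the chart width in pixels; 8.25 is the assumed per-character width.
SELECT *,
       substr(func, 1, greatest(1, CAST(floor(weight_pct * 800 / 8.25) AS INTEGER))) AS func_short
FROM flame_graph;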
Implementing zoom and testing at scale are left as an exercise for the reader.
As is hoisting this into a web app. As is implementing "time order" graphs:
instead of cumulative sums, you need to merge adjacent, identical frame
rectangles by identifying runs, e.g. with DENSE_RANK() OVER (PARTITION BY func ORDER BY frame_id).
Evidently this is called the "Gaps and Islands" problem.
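Here's a rough sketch of that run-merging step, using the classic row_number difference trick instead of DENSE_RANK. It assumes a hypothetical frame_id column that preserves sample order (our collapsed input doesn't carry one):

-- Gaps and islands: consecutive frame_ids with the same (depth, func, ancestry)
-- get the same island value, because frame_id and row_number() grow in lockstep.
WITH runs AS (
    SELECT depth, func, ancestry, weight, frame_id,
           frame_id - row_number() OVER (PARTITION BY depth, func, ancestry
                                         ORDER BY frame_id) AS island
    FROM stack_frames  -- assumed to be built per sample, with a frame_id column
)
SELECT depth, func, ancestry,
       SUM(weight) AS weight,
       MIN(frame_id) AS first_frame  -- where the merged rectangle starts
FROM runs
GROUP BY depth, func, ancestry, island
ORDER BY first_frame;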
Google's Perfetto works by keeping its data in an in-memory columnar SQL database (see documentation).
The Algebra of Profiles #
If a profile is just a collection of stack traces, you can combine them by
concatenating. (For pprof
profiles, one such tool is
pprof-merge. For our semi-colon
format, cat
works fine.) You can also diff them with subtraction, and
differential flame graphs
are a thing. The Firefox Profiler has tooling for comparison as well.
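For the collapsed text format, a diff is essentially a full outer join on the stack string plus a subtraction. A minimal DuckDB sketch, where before.txt and after.txt are made-up filenames in the same "stack count" format:

WITH before_prof AS (
    SELECT regexp_extract(line, '^(.*) (\d+)$', 1) AS stack,
           CAST(regexp_extract(line, '^(.*) (\d+)$', 2) AS int64) AS weight
    FROM read_csv('before.txt', delim='|', columns = { 'line': 'varchar' })
), after_prof AS (
    SELECT regexp_extract(line, '^(.*) (\d+)$', 1) AS stack,
           CAST(regexp_extract(line, '^(.*) (\d+)$', 2) AS int64) AS weight
    FROM read_csv('after.txt', delim='|', columns = { 'line': 'varchar' })
)
-- Positive delta: the stack gained samples; negative: it lost them.
SELECT coalesce(a.stack, b.stack) AS stack,
       coalesce(a.weight, 0) - coalesce(b.weight, 0) AS delta
FROM before_prof b
FULL OUTER JOIN after_prof a ON a.stack = b.stack
ORDER BY delta;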
pprof and flamegraphs #
Let's take a very mini recursive implementation of gron that is deliberately slow. Something like:
function slow() {
const x = Date.now();
while (Date.now() < x + 500) {
}
}
function r(obj, currentPath = '') {
if (typeof obj === 'object' && obj !== null) {
if (Array.isArray(obj)) {
obj.forEach((item, index) => { r(item, `${currentPath}[${index}]`); });
} else {
Object.keys(obj).forEach(key => { r(obj[key], `${currentPath}.${key}`); });
}
} else {
slow();
console.log(`${currentPath.substring(1)}: ${obj}`);
}
}
const s = '{"a": [1,2,3], "b": {"ba": 2, "bb": {"bba": 3, "bbb": 4}}, "c": 5}';
const o = JSON.parse(s);
r(o);
If we profile this, we should see that slow()
is,
well, really slow. I used node-pprof to profile this. (node-pprof
uses
the normal v8 profiling bindings, and then converts
from the Node profile representation to pprof's protobuf
representation.) Let's look at the pprof graph and
the flame graph.
The pprof graph draws an edge between pairs of callers and
callees. We spend all of our time in slow()
as you'd expect.
The flame graph, however, has multiple slow()
calls, because
we invoke the function at different recursion depths. Both
representations are useful: sometimes you have a hot function,
but it's only hot when it comes in via a particular stack trace,
and you need to figure that out.
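Since the frames are just rows in a table, you can look at it both ways with a GROUP BY. A small sketch against a stack_frames-style table like the one we built above (the function names there come from the Java data, not this JavaScript example):

-- Total weight attributed to each function, ignoring which stack it was on.
-- Note: a function that appears several times in a single stack (like r() in the
-- recursive example) is counted once per appearance.
SELECT func,
       SUM(weight) AS total_weight
FROM stack_frames
GROUP BY func
ORDER BY total_weight DESC;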
Other Profiling Resources and Tips #
- Go read Nelson Elhage's "Performance engineering, profilers, and seeing the invisible".
- ProfilerPedia connects all the world's profilers and their data formats. The typical profiler schema is very simple (a collection of stack traces, maybe with some time stamps and filename/line number metadata), but there are many, many implementations, file formats, and analysis tools. [Speedscope's schema description](https://github.com/jlfwong/speedscope/wiki/Importing-from-custom-sources) is a reasonable starting point.
- Speedscope is my go-to Flamegraph-esque visualizer. It sports both a Flamegraph view and a "time order" view.
- Sneaking in metadata. For the most part, profiles contain only function names and statistics. You can, however, sneak in some custom metadata if you'd like. Let's say you want to associate part of a profile with a request. Somewhere, you have a handleRequest() method, but you'd really like to know much more about that request from your logs. If you're using a dynamic language, you can create a no-op passthrough function. In the example below, I snuck the string SNEAKY into the stack trace. This will show up just fine in profiles and so forth. (You can do this in JavaScript, Python, and Java for sure; doing it in C might be a fun blog post for someone.) You can do this with a request id or the like. Whether the overhead is significant depends on what you're up to and how it compares to the request lifecycle.
> const name = "SNEAKY";
> eval(`function ${name}(f) { return f(); }; ${name}`)( () => { throw new Error(); } );
Uncaught Error
at REPL26:1:92
at SNEAKY (eval at <anonymous> (REPL26:1:24), <anonymous>:1:29)
- Friction to getting a profile is an enemy. Typically, to get a profile, you have to SSH into production, install some tools, jump through a few containers, edit /proc/sys/kernel/perf_event_paranoid, take the profile, and then exfiltrate the result from production back to your machine for analysis. Yuck. Profiles don't have user data (they have function names and maybe line numbers!). Could you perhaps set an x-profile-me-for-duration: 10s header on a request and then get a URL to the data back (either directly or via your logging system)? If you have benchmarks, add a PROFILE=1 flag to automatically produce the profile too. Maybe even spit it out as part of your CI system with the flame graph or trace data all ready to go.
- Even better, perhaps profiling will be always on? I've not used it yet, but Elastic (née Prodfiler) has an offering in this space. Some of these systems do fun eBPF magic to profile everything, and the magic is in whether you get decent symbols and line numbers and such...
- An implication of what we're doing here with DuckDB is that you're not limited to the queries and visualizations that others have built. If you can get your profiles into queryable form, you can build out your own custom queries that make sense for you, like the sketch below.
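For example, a quick sketch of one such ad-hoc question against the stacks table from earlier: what share of the total weight has a given function anywhere on the stack? (The function name here is just a placeholder.)

-- Fraction of all sample weight whose stack includes the given function,
-- no matter at what depth it appears.
SELECT SUM(weight) FILTER (WHERE list_contains(stack, 'GCTaskThread::run'))
         / SUM(weight) AS share_of_weight
FROM stacks;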