blog.philz.dev

Drawing flamegraphs with DuckDB and Vega-Lite

Having spent a good deal of time staring at profiles, I wanted to explore what sampling profiles and Flame Graphs are in a bit more detail. Brendan Gregg's ACM article on flamegraphs describes them:

A flame graph visualizes a collection of stack traces (aka call stacks), shown as an adjacency diagram with an inverted icicle layout.

The dataset produced by a sampling profiler is quite simple: it is a collection of call stacks. If a function appears frequently, it's hot, and may be a target for optimization. The simplest analysis might be to just look for common functions with cat stacks | sort | uniq -c (or the like), but visualizing the data helps weed out the frequent, correlated, unimportant bits. A "Poor Man's Profiler" (e.g., jpmp for Java) may be as simple as calling jstack in a loop.
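The sort | uniq -c idea is worth seeing concretely. A minimal sketch in Python, using made-up collapsed stacks (one semicolon-delimited stack per line, the format we'll work with below):

```python
from collections import Counter

# Each line is one sampled call stack, frames separated by semicolons.
# Identical stacks are tallied; the most common ones are the hot paths.
lines = [
    "main;parse;tokenize",
    "main;parse;tokenize",
    "main;render",
]
counts = Counter(lines)
for stack, n in counts.most_common():
    print(n, stack)
```

This prints `2 main;parse;tokenize` before `1 main;render` — the same information `sort | uniq -c | sort -rn` would give you.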

Brendan Gregg's flamegraph.pl takes a "semi-colon delimited stack with count" text file and produces an SVG. Can we do the same, but use DuckDB for the data munging and Vega-Lite (as wrapped by Altair) for the visualization? Sure we can!

For this example, we've borrowed a sample stack trace from Speedscope's repo.


In [8]
%%capture
# Install dependencies. Capture cell magic hides the output.
%pip install duckdb notebook jupysql pandas matplotlib duckdb-engine pyarrow
%pip install altair vl-convert-python anywidget
In [9]
import duckdb
import pandas as pd
import altair as alt
In [10]
# Connect to in-memory DuckDB
%load_ext sql
conn = duckdb.connect()
%sql conn --alias duckdb

Load our file into DuckDB by pretending it's a CSV file. We use "|" as the delimiter, which doesn't appear in the data, so we get one row per line. We use the row_number() window function to assign ids to our stacks.

In [15]
%%sql
DROP TABLE IF EXISTS stack_data_lines;

CREATE TABLE
  stack_data_lines AS
SELECT *, row_number() OVER () AS line_id 
FROM read_csv('perf-vertx-stacks-01-collapsed-all.txt', delim='|', columns = { 'line': 'varchar' });

SELECT * FROM stack_data_lines WHERE line_id = 3
Out [15]
Running query in 'duckdb'
Out [15]
line line_id
java;start_thread;java_start;GCTaskThread::run;StealTask::do_it;PSPromotionManager::drain_stacks_depth;oopDesc* PSPromotionManager::copy_to_survivor_space;InstanceKlass::oop_push_contents 1 3

Now split out the weight and the stack. We parse the stack into an array.

In [27]
%%sql
DROP TABLE if EXISTS stacks;

CREATE TABLE
  stacks AS
SELECT
  string_split(regexp_extract(line, '^(.*) (\d+)$', 1), ';') as stack,
  CAST(regexp_extract (line, '^(.*) (\d+)$', 2) AS int64) AS weight,
  line_id
FROM
stack_data_lines;

SELECT * FROM stacks WHERE line_id = 3
Out [27]
Running query in 'duckdb'
Out [27]
stack weight line_id
['java', 'start_thread', 'java_start', 'GCTaskThread::run', 'StealTask::do_it', 'PSPromotionManager::drain_stacks_depth', 'oopDesc* PSPromotionManager::copy_to_survivor_space', 'InstanceKlass::oop_push_contents'] 1 3

Using UNNEST, we create one row per stack frame. Two UNNESTs in the same SELECT statement pair up their results; we use this to get the frame index, which we call depth. We also record the ancestry.

This took a bit of sleuthing to find. In PostgreSQL, the equivalent is WITH ORDINALITY.

(Originally, I wrote a recursive CTE to build out the ancestry, but given array functions and depth, we don't need to!)

In [32]
%%sql
DROP TABLE if EXISTS stack_frames;

CREATE TABLE
  stack_frames AS
SELECT
  weight,
  line_id,
  UNNEST (RANGE (array_length(stack))) AS depth,
  UNNEST (stack) AS func,
  stack[:depth] AS ancestry,
  stack
FROM
  stacks;

SELECT * FROM stack_frames WHERE line_id = 3
Out [32]
Running query in 'duckdb'
Out [32]
weight line_id depth func ancestry stack
1 3 0 java [] ['java', 'start_thread', 'java_start', 'GCTaskThread::run', 'StealTask::do_it', 'PSPromotionManager::drain_stacks_depth', 'oopDesc* PSPromotionManager::copy_to_survivor_space', 'InstanceKlass::oop_push_contents']
1 3 1 start_thread ['java'] ['java', 'start_thread', 'java_start', 'GCTaskThread::run', 'StealTask::do_it', 'PSPromotionManager::drain_stacks_depth', 'oopDesc* PSPromotionManager::copy_to_survivor_space', 'InstanceKlass::oop_push_contents']
1 3 2 java_start ['java', 'start_thread'] ['java', 'start_thread', 'java_start', 'GCTaskThread::run', 'StealTask::do_it', 'PSPromotionManager::drain_stacks_depth', 'oopDesc* PSPromotionManager::copy_to_survivor_space', 'InstanceKlass::oop_push_contents']
1 3 3 GCTaskThread::run ['java', 'start_thread', 'java_start'] ['java', 'start_thread', 'java_start', 'GCTaskThread::run', 'StealTask::do_it', 'PSPromotionManager::drain_stacks_depth', 'oopDesc* PSPromotionManager::copy_to_survivor_space', 'InstanceKlass::oop_push_contents']
1 3 4 StealTask::do_it ['java', 'start_thread', 'java_start', 'GCTaskThread::run'] ['java', 'start_thread', 'java_start', 'GCTaskThread::run', 'StealTask::do_it', 'PSPromotionManager::drain_stacks_depth', 'oopDesc* PSPromotionManager::copy_to_survivor_space', 'InstanceKlass::oop_push_contents']
1 3 5 PSPromotionManager::drain_stacks_depth ['java', 'start_thread', 'java_start', 'GCTaskThread::run', 'StealTask::do_it'] ['java', 'start_thread', 'java_start', 'GCTaskThread::run', 'StealTask::do_it', 'PSPromotionManager::drain_stacks_depth', 'oopDesc* PSPromotionManager::copy_to_survivor_space', 'InstanceKlass::oop_push_contents']
1 3 6 oopDesc* PSPromotionManager::copy_to_survivor_space ['java', 'start_thread', 'java_start', 'GCTaskThread::run', 'StealTask::do_it', 'PSPromotionManager::drain_stacks_depth'] ['java', 'start_thread', 'java_start', 'GCTaskThread::run', 'StealTask::do_it', 'PSPromotionManager::drain_stacks_depth', 'oopDesc* PSPromotionManager::copy_to_survivor_space', 'InstanceKlass::oop_push_contents']
1 3 7 InstanceKlass::oop_push_contents ['java', 'start_thread', 'java_start', 'GCTaskThread::run', 'StealTask::do_it', 'PSPromotionManager::drain_stacks_depth', 'oopDesc* PSPromotionManager::copy_to_survivor_space'] ['java', 'start_thread', 'java_start', 'GCTaskThread::run', 'StealTask::do_it', 'PSPromotionManager::drain_stacks_depth', 'oopDesc* PSPromotionManager::copy_to_survivor_space', 'InstanceKlass::oop_push_contents']

Now we can finally start computing the layout. First, for a given ancestry, we compute the cumulative weights. In one dimension, this is like a stacked bar chart.

In [38]
%%sql
DROP TABLE if EXISTS stack_frames_with_per_ancestry_weight;

CREATE TABLE
  stack_frames_with_per_ancestry_weight AS
WITH
  b AS (
    SELECT
      depth,
      func,
      ancestry,
      SUM(weight) AS weight
    FROM
      stack_frames
    GROUP BY
      1,
      2,
      3
    ORDER BY
      4 desc
  )
SELECT
  *,
  SUM(weight) OVER (
    PARTITION BY
      ancestry
    ORDER BY
      weight desc
  ) AS running_weight
FROM
  b;

SELECT * FROM stack_frames_with_per_ancestry_weight WHERE depth = 3
Out [38]
Running query in 'duckdb'
Out [38]
depth func ancestry weight running_weight
3 JavaThread::run ['java', 'start_thread', 'java_start'] 263 263
3 GCTaskThread::run ['java', 'start_thread', 'java_start'] 14 277
3 VMThread::run ['java', 'start_thread', 'java_start'] 4 281
3 hypercall_page_[k] ['java', 'write', 'check_events_[k]'] 3 3
3 hypercall_page_[k] ['java', 'read', 'check_events_[k]'] 1 1
Now, finally, we use a recursive common table expression (CTE) to build out where every rectangle needs to go. At the bottom, we have a stacked bar chart of all the functions at depth 0. As we go up the stack, though, we want to align functions with their ancestry.
In [115]
%%sql
DROP TABLE if EXISTS flame_graph;

CREATE TABLE
  flame_graph AS
WITH RECURSIVE
  a AS (
    SELECT
      depth,
      func,
      ancestry,
      weight,
      running_weight - weight AS x,
      running_weight AS x2
    FROM
stack_frames_with_per_ancestry_weight
    WHERE
      depth = 0
    
    UNION ALL
    
    SELECT
      s.depth,
      s.func,
      s.ancestry,
      s.weight,
      s.running_weight - s.weight + a.x AS x,
      s.running_weight + a.x AS x2
    FROM
stack_frames_with_per_ancestry_weight s
      JOIN a ON s.ancestry = list_append(a.ancestry, a.func)
      AND s.depth = a.depth + 1
), t as (SELECT sum(weight) as w FROM stack_frames_with_per_ancestry_weight WHERE depth = 0)
SELECT a.*, a.x / w as x_pct, a.x2 / w as x2_pct, weight / w as weight_pct FROM a JOIN t ON TRUE;
SELECT * FROM flame_graph WHERE depth = 3
Out [115]
Running query in 'duckdb'
Out [115]
depth func ancestry weight x x2 x_pct x2_pct weight_pct
3 JavaThread::run ['java', 'start_thread', 'java_start'] 263 0 263 0.0 0.9228070175438596 0.9228070175438596
3 GCTaskThread::run ['java', 'start_thread', 'java_start'] 14 263 277 0.9228070175438596 0.9719298245614035 0.04912280701754386
3 VMThread::run ['java', 'start_thread', 'java_start'] 4 277 281 0.9719298245614035 0.9859649122807017 0.014035087719298246
3 hypercall_page_[k] ['java', 'write', 'check_events_[k]'] 3 281 284 0.9859649122807017 0.9964912280701754 0.010526315789473684
3 hypercall_page_[k] ['java', 'read', 'check_events_[k]'] 1 284 285 0.9964912280701754 1.0 0.0035087719298245615

Finally, we can visualize this with Vega-Lite. Here I'm using Altair as a Python front-end. I'm inverting depth because I slightly prefer "icicle" graphs, where the stacks droop from the ceiling.

Vega-Lite does not seem to be able to truncate text marks to a dynamic size based on the width of the bar I wish to fit the text into. Please let me know if you've got a better solution, but I've resorted to doing the truncation here in Python land. (Nothing prevents us from doing it in SQL.)

In [122]
result = %sql SELECT * FROM flame_graph
df = result.DataFrame()

# Invert to create an icicle chart
df['depth'] = -df['depth']

# Truncate function names. We use a fixed-width font to make this somewhat correct.
HEIGHT_OF_BAR = 20
WIDTH = 800
FONT_SIZE = 11
FONT_WIDTH = FONT_SIZE * 0.75
df['func_short'] = df.apply(lambda row: row['func'][:int(row['weight_pct'] * WIDTH / FONT_WIDTH)], axis=1)

tooltip = alt.Tooltip(['func', 'ancestry', 'weight', 'weight_pct'])
chart = (alt.Chart(df)
         .encode(alt.X('x_pct'),
                 alt.X2('x2_pct'), 
                 alt.Y('depth').scale(domainMax=0.5))
         .properties(width=WIDTH, height=HEIGHT_OF_BAR*(df['depth'].abs().max() + 1)))

bars = (chart
        .mark_bar(height=HEIGHT_OF_BAR)
        .encode(color=alt.Color('func', legend=None), tooltip=tooltip))
text = (chart
        .mark_text(dx=2, align='left', clip=True, font='monospace')
        .encode(text='func_short', tooltip=tooltip))
flamegraph = bars + text
flamegraph

Out [122]
Running query in 'duckdb'
Out [122]

Implementing zoom and testing at scale are left as an exercise to the reader. As is hoisting this into a web app. As is implementing "time order" graphs: instead of cumulative sums, you need to merge adjacent, identical frame rectangles, by identifying runs using DENSE_RANK() OVER (PARTITION BY func ORDER BY frame_id). Evidently this is called the "Gaps and Islands" problem.

Google's Perfetto works by keeping its data in an in-memory columnar SQL database (see documentation).

The Algebra of Profiles #

If a profile is just a collection of stack traces, you can combine them by concatenating. (For pprof profiles, one such tool is pprof-merge. For our semi-colon format, cat works fine.) You can also diff them with subtraction, and differential flame graphs are a thing. The Firefox Profiler has tooling for comparison as well.
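In the collapsed format, both operations fall out of treating a profile as a multiset of stacks. A sketch with Python's Counter, on made-up sample counts:

```python
from collections import Counter

# Two profiles as {collapsed stack: sample count} multisets (invented data).
before = Counter({"main;parse": 10, "main;render": 5})
after = Counter({"main;parse": 4, "main;render": 5, "main;gc": 2})

# Merging profiles is addition; diffing is subtraction.
# Counter subtraction drops non-positive counts, so each direction
# keeps only the stacks that grew.
combined = before + after
regressions = after - before    # stacks that gained samples
improvements = before - after   # stacks that lost samples
print(regressions)   # Counter({'main;gc': 2})
print(improvements)  # Counter({'main;parse': 6})
```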

pprof and flamegraphs #

Let's take a very mini recursive implementation of gron that is deliberately slow. Something like:

function slow() {
	const x = Date.now();
	while (Date.now() < x + 500) {
	}
}

function r(obj, currentPath = '') {
	if (typeof obj === 'object' && obj !== null) {
		if (Array.isArray(obj)) {
			obj.forEach((item, index) => { r(item, `${currentPath}[${index}]`); });
		} else {
			Object.keys(obj).forEach(key => { r(obj[key], `${currentPath}.${key}`); });
		}
	} else {
		slow();
		console.log(`${currentPath.substring(1)}: ${obj}`);
	}
}
const s = '{"a": [1,2,3], "b": {"ba": 2, "bb": {"bba": 3, "bbb": 4}}, "c": 5}';
const o = JSON.parse(s);
r(o);

If we profile this, we should see that slow() is, well, really slow. I used node-pprof to profile this. (node-pprof uses the normal v8 profiling bindings, and then converts from the Node profile representation to pprof's protobuf representation.) Let's look at the pprof graph and the flame graph.

The pprof graph draws an edge between pairs of callers and callees. We spend all of our time in slow() as you'd expect. The flame graph, however, has multiple slow() calls, because we invoke the function at different recursion depths. Both representations are useful: sometimes you have a hot function, but it's only hot when it comes in via a particular stack trace, and you need to figure that out.

Other Profiling Resources and Tips #

> const name = "SNEAKY";
> eval(`function ${name}(f) { return f(); }; ${name}`)( () => { throw new Error(); } );
Uncaught Error
    at REPL26:1:92
    at SNEAKY (eval at <anonymous> (REPL26:1:24), <anonymous>:1:29)