Having spent a good deal of time staring at profiles, I wanted to explore what sampling profiles and Flame Graphs are in a bit more detail. Brendan Gregg's ACM article on flamegraphs describes them:
A flame graph visualizes a collection of stack traces (aka call stacks), shown as an adjacency diagram with an inverted icicle layout.
The dataset produced by a sampling profiler is quite simple: it is a collection
of call stacks. If a function appears frequently, it's hot, and may be a target
for optimization. The simplest analysis might be to just look for common stacks with
cat stacks | sort | uniq -c
(or the like), but visualizing the data
helps weed out the frequent, correlated, unimportant bits.
(A "Poor Man's Profiler", e.g. jpmp for Java,
may be as simple as calling jstack in a loop.)
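In that spirit, here's a minimal DuckDB version of that sort | uniq -c pass. This is just a sketch: stacks.txt is a made-up filename, assumed to contain one raw stack per line.

-- Read the file one line per row ('|' doesn't appear in the data),
-- then count how often each identical stack shows up.
SELECT line AS stack, count(*) AS samples
FROM read_csv('stacks.txt', delim='|', columns = { 'line': 'varchar' })
GROUP BY line
ORDER BY samples DESC
LIMIT 20;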
Brendan Gregg's
flamegraph.pl
takes a "semi-colon delimited stack with count" text file and produces an SVG. Can we
do the same, but use DuckDB for the data munging and Vega-Lite (as wrapped by Altair)
for the visualization? Sure we can!
For this example, we've borrowed a sample stack trace from Speedscope's repo.
%%capture
# Install dependencies. The capture cell magic hides the output.
%pip install duckdb notebook jupysql pandas matplotlib duckdb-engine pyarrow
%pip install altair vl-convert-python anywidget
import duckdb
import pandas as pd
import altair as alt
# Connect to in-memory DuckDB
%load_ext sql
conn = duckdb.connect()
%sql conn --alias duckdb
Load our file into DuckDB by pretending it's a CSV file. We use "|" as the delimiter, which doesn't appear in the data, so we get one row per line. We use the row_number()
window function to assign an id to each stack.
%%sql
DROP TABLE IF EXISTS stack_data_lines;
CREATE TABLE stack_data_lines AS
SELECT *, row_number() OVER () AS line_id
FROM read_csv('perf-vertx-stacks-01-collapsed-all.txt',
              delim='|',
              columns = { 'line': 'varchar' });
SELECT * FROM stack_data_lines WHERE line_id = 3
line | line_id |
---|---|
java;start_thread;java_start;GCTaskThread::run;StealTask::do_it;PSPromotionManager::drain_stacks_depth;oopDesc* PSPromotionManager::copy_to_survivor_space | 3 |
Now split out the weight and the stack. We parse the stack into an array.
%%sql
DROP TABLE IF EXISTS stacks;
CREATE TABLE stacks AS
SELECT string_split(regexp_extract(line, '^(.*) (\d+)$', 1), ';') AS stack,
       CAST(regexp_extract(line, '^(.*) (\d+)$', 2) AS int64) AS weight,
       line_id
FROM stack_data_lines;
SELECT * FROM stacks WHERE line_id = 3
stack | weight | line_id |
---|---|---|
['java', 'start_thread', 'java_start', 'GCTaskThread::run', 'StealTask::do_it', 'PSPromotionManager::drain_stacks_depth', 'oopDesc* PSPromotionManager::copy_to_survivor_space | 1 | 3 |
Using UNNEST, we create one row per stack frame. Two UNNESTs in the same SELECT statement pair up their results, which gives us the frame index, which we call depth. We also record the ancestry.
This took a bit of sleuthing to find. In PostgreSQL, the equivalent is WITH ORDINALITY.
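For comparison, here's a rough PostgreSQL-flavored sketch (not run here, and assuming a stacks table with the same columns):

-- PostgreSQL: unnest(...) WITH ORDINALITY yields each element plus its 1-based index.
SELECT s.weight,
       s.line_id,
       f.ord - 1 AS depth,  -- make it 0-based to match the DuckDB version
       f.func
FROM stacks s,
     unnest(s.stack) WITH ORDINALITY AS f(func, ord);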
(Originally, I wrote a recursive CTE to build out the ancestry, but given array functions and depth, we don't need to!)
%%sql
DROP TABLE IF EXISTS stack_frames;
CREATE TABLE stack_frames AS
SELECT weight,
       line_id,
       UNNEST(RANGE(array_length(stack))) AS depth,
       UNNEST(stack) AS func,
       stack[:depth] AS ancestry,
       stack
FROM stacks;
SELECT * FROM stack_frames WHERE line_id = 3
weight | line_id | depth | func | ancestry | stack |
---|---|---|---|---|---|
1 | 3 | 0 | java | [] | ['java', 'start_thread', 'java_start', 'GCTaskThread::run', 'StealTask::do_it', 'PSPromotionManager::drain_stacks_depth', 'oopDesc* PSPromotionManager::copy_to_survivor_space |
1 | 3 | 1 | start_thread | ['java'] | ['java', 'start_thread', 'java_start', 'GCTaskThread::run', 'StealTask::do_it', 'PSPromotionManager::drain_stacks_depth', 'oopDesc* PSPromotionManager::copy_to_survivor_space |
1 | 3 | 2 | java_start | ['java', 'start_thread'] | ['java', 'start_thread', 'java_start', 'GCTaskThread::run', 'StealTask::do_it', 'PSPromotionManager::drain_stacks_depth', 'oopDesc* PSPromotionManager::copy_to_survivor_space |
1 | 3 | 3 | GCTaskThread::run | ['java', 'start_thread', 'java_start'] | ['java', 'start_thread', 'java_start', 'GCTaskThread::run', 'StealTask::do_it', 'PSPromotionManager::drain_stacks_depth', 'oopDesc* PSPromotionManager::copy_to_survivor_space |
1 | 3 | 4 | StealTask::do_it | ['java', 'start_thread', 'java_start', 'GCTaskThread::run'] | ['java', 'start_thread', 'java_start', 'GCTaskThread::run', 'StealTask::do_it', 'PSPromotionManager::drain_stacks_depth', 'oopDesc* PSPromotionManager::copy_to_survivor_space |
1 | 3 | 5 | PSPromotionManager::drain_stacks_depth | ['java', 'start_thread', 'java_start', 'GCTaskThread::run', 'StealTask::do_it'] | ['java', 'start_thread', 'java_start', 'GCTaskThread::run', 'StealTask::do_it', 'PSPromotionManager::drain_stacks_depth', 'oopDesc* PSPromotionManager::copy_to_survivor_space |
1 | 3 | 6 | oopDesc* PSPromotionManager::copy_to_survivor_space | ['java', 'start_thread', 'java_start', 'GCTaskThread::run', 'StealTask::do_it', 'PSPromotionManager::drain_stacks_depth'] | ['java', 'start_thread', 'java_start', 'GCTaskThread::run', 'StealTask::do_it', 'PSPromotionManager::drain_stacks_depth', 'oopDesc* PSPromotionManager::copy_to_survivor_space |
1 | 3 | 7 | InstanceKlass::oop_push_contents | ['java', 'start_thread', 'java_start', 'GCTaskThread::run', 'StealTask::do_it', 'PSPromotionManager::drain_stacks_depth', 'oopDesc* PSPromotionManager::copy_to_survivor_space | ['java', 'start_thread', 'java_start', 'GCTaskThread::run', 'StealTask::do_it', 'PSPromotionManager::drain_stacks_depth', 'oopDesc* PSPromotionManager::copy_to_survivor_space |
Now we can finally start computing the layout. First, for a given ancestry, we compute the cumulative weights. In one dimension, this is like a stacked bar chart.
%%sql
DROP TABLE IF EXISTS stack_frames_with_weight;
CREATE TABLE stack_frames_with_weight AS
WITH b AS (
    SELECT depth, func, ancestry, SUM(weight) AS weight
    FROM stack_frames
    GROUP BY 1, 2, 3
    ORDER BY 4 DESC
)
SELECT *,
       SUM(weight) OVER (
           PARTITION BY ancestry
           ORDER BY weight DESC
       ) AS running_weight
FROM b;
SELECT * FROM stack_frames_with_weight WHERE depth = 3
depth | func | ancestry | weight | running_weight |
---|---|---|---|---|
3 | JavaThread::run | ['java', 'start_thread', 'java_start'] | 263 | 263 |
3 | GCTaskThread::run | ['java', 'start_thread', 'java_start'] | 14 | 277 |
3 | VMThread::run | ['java', 'start_thread', 'java_start'] | 4 | 281 |
3 | hypercall_page_[k] | ['java', 'write', 'check_events_[k]'] | 3 | 3 |
3 | hypercall_page_[k] | ['java', 'read', 'check_events_[k]'] | 1 | 1 |
Now, finally, we use a recursive common table expression (CTE) to build out where every rectangle needs to go. At the bottom, we have a stacked bar chart of all the functions at depth 0. As we go up the stack, though, we want to align functions with their ancestry.
%%sql
DROP TABLE IF EXISTS flame_graph;
CREATE TABLE flame_graph AS
WITH RECURSIVE a AS (
    SELECT depth, func, ancestry, weight,
           running_weight - weight AS x,
           running_weight AS x2
    FROM stack_frames_with_weight
    WHERE depth = 0
  UNION ALL
    SELECT s.depth, s.func, s.ancestry, s.weight,
           s.running_weight - s.weight + a.x AS x,
           s.running_weight + a.x AS x2
    FROM stack_frames_with_weight s
    JOIN a ON s.ancestry = list_append(a.ancestry, a.func)
          AND s.depth = a.depth + 1
), t AS (
    SELECT sum(weight) AS w
    FROM stack_frames_with_weight
    WHERE depth = 0
)
SELECT a.*,
       a.x / w AS x_pct,
       a.x2 / w AS x2_pct,
       weight / w AS weight_pct
FROM a JOIN t ON TRUE;
SELECT * FROM flame_graph WHERE depth = 3
depth | func | ancestry | weight | x | x2 | x_pct | x2_pct | weight_pct |
---|---|---|---|---|---|---|---|---|
3 | JavaThread::run | ['java', 'start_thread', 'java_start'] | 263 | 0 | 263 | 0.0 | 0.9228070175438596 | 0.9228070175438596 |
3 | GCTaskThread::run | ['java', 'start_thread', 'java_start'] | 14 | 263 | 277 | 0.9228070175438596 | 0.9719298245614035 | 0.04912280701754386 |
3 | VMThread::run | ['java', 'start_thread', 'java_start'] | 4 | 277 | 281 | 0.9719298245614035 | 0.9859649122807017 | 0.014035087719298246 |
3 | hypercall_page_[k] | ['java', 'write', 'check_events_[k]'] | 3 | 281 | 284 | 0.9859649122807017 | 0.9964912280701754 | 0.010526315789473684 |
3 | hypercall_page_[k] | ['java', 'read', 'check_events_[k]'] | 1 | 284 | 285 | 0.9964912280701754 | 1.0 | 0.0035087719298245615 |
Finally, we can visualize this with Vega-Lite. Here I'm using Altair as a Python front-end. I'm inverting depth because I slightly prefer "icicle" graphs, where the stacks droop from the ceiling.
Vega-Lite does not seem to be able to truncate text marks to a dynamic size based on the width of the bar I want the text to fit into. I've resorted to doing the truncation here in Python land; please let me know if you've got a better solution. (Nothing prevents us from doing it in SQL.)
result = %sql SELECT * FROM flame_graph
df = result.DataFrame()

# Invert to create an icicle chart
df['depth'] = -df['depth']

# Truncate function names. We use a fixed-width font to make this somewhat correct.
HEIGHT_OF_BAR = 20
WIDTH = 800
FONT_SIZE = 11
FONT_WIDTH = 11 * 0.75
df['func_short'] = df.apply(
    lambda row: row['func'][:int(row['weight_pct'] * WIDTH / FONT_WIDTH)],
    axis=1)

tooltip = alt.Tooltip(['func', 'ancestry', 'weight', 'weight_pct'])
chart = (alt.Chart(df)
         .transform_calculate(xxx='datum.weight')
         .encode(alt.X('x_pct'), alt.X2('x2_pct'),
                 alt.Y('depth').scale(domainMax=0.5))
         .properties(width=WIDTH,
                     height=HEIGHT_OF_BAR * (df['depth'].abs().max() + 1)))
bars = (chart
        .mark_bar(height=HEIGHT_OF_BAR)
        .encode(color=alt.Color('func', legend=None), tooltip=tooltip))
text = (chart
        .mark_text(dx=2, align='left', clip=True, font='monospace')
        .encode(text='func_short', tooltip=tooltip))
flamegraph = bars + text
flamegraph
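As noted above, nothing stops us from doing the truncation in SQL instead. Here's a minimal sketch against the flame_graph table, hard-coding the same 800-pixel chart width and 8.25-pixel character width used above:

-- Approximate how many monospace characters fit in each bar and cut the name there.
-- 800 is the chart width in pixels; 8.25 is the assumed per-character width.
SELECT *,
       substr(func, 1, greatest(1, CAST(floor(weight_pct * 800 / 8.25) AS INTEGER))) AS func_short
FROM flame_graph;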
Implementing zoom and testing at scale are left as an exercise for the reader.
As is hoisting this into a web app. As is implementing "time order" graphs:
instead of cumulative sums, you need to merge adjacent, identical frame
rectangles by identifying runs, e.g. with DENSE_RANK() OVER (PARTITION BY func ORDER BY frame_id).
Evidently this is called the "Gaps and Islands" problem.
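Here's a rough sketch of that run-merging step, using the classic row_number difference trick instead of DENSE_RANK. It assumes a hypothetical frame_id column that preserves sample order (our collapsed input doesn't carry one):

-- Gaps and islands: consecutive frame_ids with the same (depth, func, ancestry)
-- get the same island value, because frame_id and row_number() grow in lockstep.
WITH runs AS (
    SELECT depth, func, ancestry, weight, frame_id,
           frame_id - row_number() OVER (PARTITION BY depth, func, ancestry
                                         ORDER BY frame_id) AS island
    FROM stack_frames  -- assumed to be built per sample, with a frame_id column
)
SELECT depth, func, ancestry,
       SUM(weight) AS weight,
       MIN(frame_id) AS first_frame  -- where the merged rectangle starts
FROM runs
GROUP BY depth, func, ancestry, island
ORDER BY first_frame;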
Google's Perfetto works by keeping its data in an in-memory columnar SQL database (see documentation).
The Algebra of Profiles #
If a profile is just a collection of stack traces, you can combine them by
concatenating. (For pprof
profiles, one such tool is
pprof-merge. For our semi-colon
format, cat
works fine.) You can also diff them with subtraction, and
differential flame graphs
are a thing. The Firefox Profiler has tooling for comparison as well.
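For the collapsed text format, a diff is essentially a full outer join on the stack string plus a subtraction. A minimal DuckDB sketch, where before.txt and after.txt are made-up filenames in the same "stack count" format:

WITH before_prof AS (
    SELECT regexp_extract(line, '^(.*) (\d+)$', 1) AS stack,
           CAST(regexp_extract(line, '^(.*) (\d+)$', 2) AS int64) AS weight
    FROM read_csv('before.txt', delim='|', columns = { 'line': 'varchar' })
), after_prof AS (
    SELECT regexp_extract(line, '^(.*) (\d+)$', 1) AS stack,
           CAST(regexp_extract(line, '^(.*) (\d+)$', 2) AS int64) AS weight
    FROM read_csv('after.txt', delim='|', columns = { 'line': 'varchar' })
)
-- Positive delta: the stack gained samples; negative: it lost them.
SELECT coalesce(a.stack, b.stack) AS stack,
       coalesce(a.weight, 0) - coalesce(b.weight, 0) AS delta
FROM before_prof b
FULL OUTER JOIN after_prof a ON a.stack = b.stack
ORDER BY delta;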
pprof and flamegraphs #
Let's take a very mini recursive implementation of gron that is deliberately slow. Something like:
function slow() {
const x = Date.now();
while (Date.now() < x + 500) {
}
}
function r(obj, currentPath = '') {
if (typeof obj === 'object' && obj !== null) {
if (Array.isArray(obj)) {
obj.forEach((item, index) => { r(item, `${currentPath}[${index}]`); });
} else {
Object.keys(obj).forEach(key => { r(obj[key], `${currentPath}.${key}`); });
}
} else {
slow();
console.log(`${currentPath.substring(1)}: ${obj}`);
}
}
const s = '{"a": [1,2,3], "b": {"ba": 2, "bb": {"bba": 3, "bbb": 4}}, "c": 5}';
const o = JSON.parse(s);
r(o);
If we profile this, we should see that slow()
is,
well, really slow. I used node-pprof to profile this. (node-pprof
uses
the normal v8 profiling bindings, and then converts
from the Node profile representation to pprof's protobuf
representation.) Let's look at the pprof graph and
the flame graph.
The pprof graph draws an edge between pairs of callers and
callees. We spend all of our time in slow()
as you'd expect.
The flame graph, however, has multiple slow()
calls, because
we invoke the function at different recursion depths. Both
representations are useful: sometimes you have a hot function,
but it's only hot when it comes in via a particular stack trace,
and you need to figure that out.
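Since the frames are just rows in a table, you can look at it both ways with a GROUP BY. A small sketch against a stack_frames-style table like the one we built above (the function names there come from the Java data, not this JavaScript example):

-- Total weight attributed to each function, ignoring which stack it was on.
-- Note: a function that appears several times in a single stack (like r() in the
-- recursive example) is counted once per appearance.
SELECT func,
       SUM(weight) AS total_weight
FROM stack_frames
GROUP BY func
ORDER BY total_weight DESC;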
Other Profiling Resources and Tips #
- Go read Nelson Elhage's "Performance engineering, profilers, and seeing the invisible".
- ProfilerPedia connects all the world's profilers and their data formats. The typical profiler schema is very simple (a collection of stack traces, maybe with some time stamps and filename/line number metadata), but there are many, many implementations, file formats, and analysis tools. [Speedscope's schema description](https://github.com/jlfwong/speedscope/wiki/Importing-from-custom-sources) is a reasonable starting point.
- Speedscope is my go-to Flamegraph-esque visualizer. It sports both a Flamegraph view and a "time order" view.
- Sneaking in metadata. For the most part, profiles contain only function names and statistics. You can, however, sneak in some custom metadata if you'd like. Let's say you want to associate part of a profile with a request. Somewhere, you have a handleRequest() method, but you'd really like to know much more about that request from your logs. If you're using a dynamic language, you can create a no-op passthrough function. In the example below, I snuck the string SNEAKY into the stack trace. This will show up just fine in profiles and so forth. (You can do this in JavaScript, Python, and Java for sure; doing it in C might be a fun blog post for someone.) You can do this with a request id or the like. Whether the overhead is significant depends on what you're up to and how it compares to the request lifecycle.
> const name = "SNEAKY";
> eval(`function ${name}(f) { return f(); }; ${name}`)( () => { throw new Error(); } );
Uncaught Error
at REPL26:1:92
at SNEAKY (eval at <anonymous> (REPL26:1:24), <anonymous>:1:29)
- Friction to getting a profile is an enemy. Typically, to get a profile, you have to SSH into production, install some tools, jump through a few containers, edit /proc/sys/kernel/perf_event_paranoid, take the profile, and then exfiltrate the result from production back to your machine for analysis. Yuck. Profiles don't have user data (they have function names and maybe line numbers!). Could you perhaps set an x-profile-me-for-duration: 10s header on a request and then get a URL to the data back (either directly or via your logging system)? If you have benchmarks, add a PROFILE=1 flag to automatically produce the profile too. Maybe even spit it out as part of your CI system with the flame graph or trace data all ready to go.
- Even better, perhaps profiling will be always on? I've not used it yet, but Elastic (née Prodfiler) has an offering in this space. Some of these systems do fun eBPF magic to profile everything, and the magic is in whether you get decent symbols and line numbers and such...
- An implication of what we're doing here with DuckDB is that you're not limited to the queries and visualizations that others have built. If you can get your profiles into queryable form, you can build out your own custom queries that make sense for you, like the sketch below.
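For example, a quick sketch of one such ad-hoc question against the stacks table from earlier: what share of the total weight has a given function anywhere on the stack? (The function name here is just a placeholder.)

-- Fraction of all sample weight whose stack includes the given function,
-- no matter at what depth it appears.
SELECT SUM(weight) FILTER (WHERE list_contains(stack, 'GCTaskThread::run'))
         / SUM(weight) AS share_of_weight
FROM stacks;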