Every program consumes CPU cycles, memory, and sometimes I/O bandwidth. But which part of your program is slow? Why is it slow? That’s what performance profiling helps you discover.
From startups to hyperscale infrastructure, knowing where your code wastes time is the first step to making it faster, cheaper, and more reliable.
In this article series, we’ll start from basic Linux command-line tools and move step-by-step toward advanced techniques like perf
, uftrace
, and flamegraphs.
🚀 What You'll Learn
- How to think about performance from first principles
- Basic performance metrics: wall time, CPU time, IPC, cache misses
- Real-world profiling examples using C
- Flamegraph generation from
perf
Let’s say your program is "slow". But what does that actually mean?
In Linux systems, most performance issues boil down to one or more of these bottlenecks:
Resource Type |
Example Metrics |
Fundamental Limitation |
CPU |
Instructions, IPC |
CPU can only do N things per clock tick |
Memory |
Cache misses, TLB miss |
DRAM access is orders slower than L1/L2 |
I/O |
IO wait, latency |
SSDs, disks, networks have long latency |
We want to find: where the time is spent, and why.
⚛️ Step 1: Start from a basic example
- Let’s implement a simple vector dot product — a core operation in transformer models like LLaMA or GPT.
Create a file dot.c
:
Compile with debug symbols and optimization:
gcc -O2 -g -o dot.out dot.c
-O2 enables most common optimizations without being too aggressive like -O3, and it keeps the binary relatively stable.
-g activate call stack when sampling
✅ Step 2: Measure Wall-Clock Time
Start with the most basic measurement — how long does your program take?
You'll see output like:
- real: total elapsed time
- user: time spent in user-space (your code)
- sys: time spent in kernel (e.g. syscalls)
If real ≫ user + sys → you may have I/O or wait issues.
🔍 Step 3: Profile with perf
Let’s collect low-level CPU performance data:
Example output:
Now dig into call graphs:

go to line contains dot.out
, press a

press e

Here you learn:
- IPC (instructions per cycle): a key efficiency measure
- Total instructions vs cycles
- Cache miss rate (high = bad locality or large working set)
To visualize:
Open flamegraph.svg in your browser 🔥 and zoom into dot_product.

its call stack:
right side:
means cpu spend lots of time deal with page fault
Step 4: Use uftrace
to Inspect Function Call Time
uftrace is for function level tracing, for example if you want to trace what function and its execution time.
modify code:
Make sure to compile with:
gcc -pg -g -O2 -o dot.out dot.c
Then:
gcc -finstrument-functions -g -O2 -o dot.out dot.c
sudo uftrace record -P dot_product ./dot.out
Or focus on dot_product()
only:
gcc -finstrument-functions -g -O2 -o dot.out dot.c
sudo uftrace record -P dot_product ./dot.out
sudo uftrace tui

🧰 Step 5: Use Valgrind to Analyze Memory Behavior
Memory Peak with Massif
valgrind --tool=massif ./dot.out
ms_print massif.out.<pid>
Output example:
--------------------------------------------------------------------------------
Command: ./dot
Massif arguments: (none)
ms_print arguments: massif.out.10493
--------------------------------------------------------------------------------
MB
8.009^ #
|:::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::#
|: #
|: #
|: #
|: #
|: #
|: #
|: #
|: #
|: #
|: #
|: #
|: #
|: #
|: #
|: #
|: #
|: #
|: #
0 +----------------------------------------------------------------------->Mi
0 603.4
Number of snapshots: 8
Detailed snapshots: [4 (peak)]
--------------------------------------------------------------------------------
n time(i) total(B) useful-heap(B) extra-heap(B) stacks(B)
--------------------------------------------------------------------------------
0 0 0 0 0 0
1 156,159 4,198,344 4,194,304 4,040 0
2 156,217 8,396,688 8,388,608 8,080 0
3 632,718,953 8,397,720 8,389,632 8,088 0
4 632,721,819 8,397,720 8,389,632 8,088 0
99.90% (8,389,632B) (heap allocation functions) malloc/new/new[], --alloc-fns, etc.
->49.95% (4,194,304B) 0x10910E: main (dot.c:23)
|
->49.95% (4,194,304B) 0x109127: main (dot.c:24)
|
->00.01% (1,024B) in 1+ places, all below ms_print's threshold (01.00%)
--------------------------------------------------------------------------------
n time(i) total(B) useful-heap(B) extra-heap(B) stacks(B)
--------------------------------------------------------------------------------
5 632,721,819 4,199,376 4,195,328 4,048 0
6 632,721,849 1,032 1,024 8 0
7 632,724,812 0 0 0 0
Or GUI:

Cache Usage with Cachegrind
kcachegrind callgrind.out.<pid>
Output example:

Instruction Counts with Callgrind
valgrind --tool=cachegrind ./dot.out
cg_annotate cachegrind.out.<pid>
--------------------------------------------------------------------------------
-- Metadata
--------------------------------------------------------------------------------
Invocation: /usr/bin/cg_annotate cachegrind.out.9770
Command: ./dot
Events recorded: Ir
Events shown: Ir
Event sort order: Ir
Threshold: 0.1%
Annotation: on
--------------------------------------------------------------------------------
-- Summary
--------------------------------------------------------------------------------
Ir__________________
632,713,576 (100.0%) PROGRAM TOTALS
--------------------------------------------------------------------------------
-- File:function summary
--------------------------------------------------------------------------------
Ir__________________________ file:function
< 632,557,613 (100.0%, 100.0%) /home/alanhc/workspace/example/profiling/dot.c:
629,148,800 (99.4%) dot_product
3,408,813 (0.5%) main
--------------------------------------------------------------------------------
-- Function:file summary
--------------------------------------------------------------------------------
Ir_________________________ function:file
> 629,148,800 (99.4%, 99.4%) dot_product:/home/alanhc/workspace/example/profiling/dot.c
> 3,408,817 (0.5%, 100.0%) main:
3,408,813 (0.5%) /home/alanhc/workspace/example/profiling/dot.c
--------------------------------------------------------------------------------
-- Annotated source file: /home/alanhc/workspace/example/profiling/dot.c
--------------------------------------------------------------------------------
Ir_________________
-- line 5 ----------------------------------------
.
. #define N (1024 * 1024)
. #define REPEAT 100
.
. float *a;
. float *b;
.
. __attribute__((noinline))
1,300 (0.0%) float dot_product(const float *a, const float *b, int n) {
300 (0.0%) float result = 0.0f;
314,573,500 (49.7%) for (int i = 0; i < n; i++) {
314,572,800 (49.7%) result += a[i] * b[i];
. }
100 (0.0%) return result;
800 (0.0%) }
.
9 (0.0%) int main() {
. // Allocate large arrays on heap
5 (0.0%) a = (float *)malloc(sizeof(float) * N);
524,300 (0.1%) b = (float *)malloc(sizeof(float) * N);
.
. // Ensure pages are committed
. memset(a, 0, sizeof(float) * N);
. memset(b, 0, sizeof(float) * N);
.
786,435 (0.1%) for (int i = 0; i < N; i++) {
786,432 (0.1%) a[i] = (float)i / 255.0f;
1,310,720 (0.2%) b[i] = (float)(N - i) / 255.0f;
. }
.
. float result = 0.0f;
. for (int j = 0; j < 1; j++) { // outer loop for longer runtime
200 (0.0%) for (int i = 0; i < REPEAT; i++) {
698 (0.0%) result += dot_product(a, b, N);
. }
. }
.
1 (0.0%) printf("a \u00b7 b (repeated) = %f\n", result);
.
2 (0.0%) free(a);
2 (0.0%) free(b);
3 (0.0%) return 0;
6 (0.0%)
--------------------------------------------------------------------------------
-- Annotation summary
--------------------------------------------------------------------------------
Ir__________________
632,557,613 (100.0%) annotated: files known & above threshold & readable, line numbers known
0 annotated: files known & above threshold & readable, line numbers unknown
0 unannotated: files known & above threshold & two or more non-identical
0 unannotated: files known & above threshold & unreadable
155,360 (0.0%) unannotated: files known & below threshold
603 (0.0%) unannotated: files unknown