A Beginner's Guide to Linux Performance Profiling: From First Principles to Flamegraphs

🧠 Why Performance Analysis Matters

Every program consumes CPU cycles, memory, and sometimes I/O bandwidth. But which part of your program is slow? Why is it slow? That’s what performance profiling helps you discover.

From startups to hyperscale infrastructure, knowing where your code wastes time is the first step to making it faster, cheaper, and more reliable.

In this article series, we’ll start from basic Linux command-line tools and move step-by-step toward advanced techniques like perf, uftrace, and flamegraphs.

🚀 What You'll Learn

How to think about performance from first principles
Basic performance metrics: wall time, CPU time, IPC, cache misses
Real-world profiling examples using C
Flamegraph generation from perf

🔧 Step 0: The First Principles of Performance

Let’s say your program is "slow". But what does that actually mean?

Performance = Useful Work / Time

In Linux systems, most performance issues boil down to one or more of these bottlenecks:

Resource Type	Example Metrics	Fundamental Limitation
CPU	Instructions, IPC	CPU can only do N things per clock tick
Memory	Cache misses, TLB miss	DRAM access is orders slower than L1/L2
I/O	IO wait, latency	SSDs, disks, networks have long latency

We want to find: where the time is spent, and why.

⚛️ Step 1: Start from a basic example

Let’s implement a simple vector dot product — a core operation in transformer models like LLaMA or GPT.

Create a file dot.c:

// dot.c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define N (1024 * 1024)
#define REPEAT 100000

float *a;
float *b;

__attribute__((noinline))
float dot_product(const float *a, const float *b, int n) {
    float result = 0.0f;
    for (int i = 0; i < n; i++) {
        result += a[i] * b[i];
    }
    return result;
}

int main() {
    // Allocate large arrays on heap
    a = (float *)malloc(sizeof(float) * N);
    b = (float *)malloc(sizeof(float) * N);

    // Ensure pages are committed
    memset(a, 0, sizeof(float) * N);
    memset(b, 0, sizeof(float) * N);

    for (int i = 0; i < N; i++) {
        a[i] = (float)i / 255.0f;
        b[i] = (float)(N - i) / 255.0f;
    }

    float result = 0.0f;
    for (int j = 0; j < 1; j++) {               // outer loop for longer runtime
        for (int i = 0; i < REPEAT; i++) {
            result += dot_product(a, b, N);
        }
    }

    printf("a \u00b7 b (repeated) = %f\n", result);

    free(a);
    free(b);
    return 0;
}

Compile with debug symbols and optimization:
gcc -O2 -g -o dot.out dot.c

-O2 enables most common optimizations without being too aggressive like -O3, and it keeps the binary relatively stable.
-g activate call stack when sampling

✅ Step 2: Measure Wall-Clock Time

Start with the most basic measurement — how long does your program take?

time ./dot.out

You'll see output like:

a · b (repeated) = 27185328239738355712.000000

real    0m0.025s
user    0m0.018s
sys     0m0.007s

real: total elapsed time
user: time spent in user-space (your code)
sys: time spent in kernel (e.g. syscalls)

If real ≫ user + sys → you may have I/O or wait issues.

🔍 Step 3: Profile with `perf`

Let’s collect low-level CPU performance data:

sudo perf stat -e cycles,instructions,minor-faults,major-faults,dTLB-loads,dTLB-load-misses,iTLB-loads,iTLB-load-misses,page-faults,cache-references,cache-misses  ./dot.out

Example output:

a · b (repeated) = 27185328239738355712.000000

 Performance counter stats for './dot.out':

     <not counted>      cpu_atom/cycles/                                                        (0.00%)
        32,890,984      cpu_core/cycles/                                                      
     <not counted>      cpu_atom/instructions/                                                  (0.00%)
        44,322,936      cpu_core/instructions/           #    1.35  insn per cycle            
             2,110      minor-faults                                                          
                 0      major-faults                                                          
     <not counted>      cpu_atom/dTLB-loads/                                                    (0.00%)
         5,586,232      cpu_core/dTLB-loads/                                                  
     <not counted>      cpu_atom/dTLB-load-misses/                                              (0.00%)
             1,029      cpu_core/dTLB-load-misses/       #    0.02% of all dTLB cache accesses
   <not supported>      cpu_atom/iTLB-loads/                                                  
   <not supported>      cpu_core/iTLB-loads/                                                  
     <not counted>      cpu_atom/iTLB-load-misses/                                              (0.00%)
             2,341      cpu_core/iTLB-load-misses/                                            
             2,110      page-faults                                                           
     <not counted>      cpu_atom/cache-references/                                              (0.00%)
           329,831      cpu_core/cache-references/                                            
     <not counted>      cpu_atom/cache-misses/                                                  (0.00%)
           155,188      cpu_core/cache-misses/           #   47.05% of all cache refs         

       0.025495190 seconds time elapsed

       0.016633000 seconds user
       0.008806000 seconds sys

Now dig into call graphs:

sudo perf record -g ./dot.out 
sudo perf report

go to line contains dot.out, press a

press e

Here you learn:

IPC (instructions per cycle): a key efficiency measure
Total instructions vs cycles
Cache miss rate (high = bad locality or large working set)

To visualize:

git clone https://github.com/brendangregg/Flamegraph.git
sudo perf record -g ./dot.out
sudo perf script > out.perf
./Flamegraph/stackcollapse-perf.pl out.perf > out.folded
./Flamegraph/flamegraph.pl out.folded > flamegraph.svg

Open flamegraph.svg in your browser 🔥 and zoom into dot_product.

its call stack:

dot.out
 └── _start
     └── __libc_start_main@@GLIBC_2.34
         └── main
             └── dot_product

right side:

asm_exc_page_fault
 └── exc_page_fault
     └── do_user_addr_fault
         └── handle_mm_fault
             └── handle_pte_fault
                 └── do_anonymous_page

means cpu spend lots of time deal with page fault

Step 4: Use `uftrace` to Inspect Function Call Time

uftrace is for function level tracing, for example if you want to trace what function and its execution time.

modify code:

// dot.c
...
#define REPEAT 100
...
    for (int j = 0; j < 1; j++) {             
    ...

Make sure to compile with:
gcc -pg -g -O2 -o dot.out dot.c
Then:

sudo uftrace record ./dot.out
sudo uftrace report

  Total time   Self time       Calls  Function
  ==========  ==========  ==========  ====================
    3.633 ms    2.732 ms           1  main
  751.685 us  751.685 us           1  dot_product
  128.561 us  128.561 us           2  free
   11.118 us   11.118 us           1  __printf_chk
    9.824 us    9.824 us           2  calloc
    0.472 us    0.472 us           1  __monstartup
    0.112 us    0.112 us           1  __cxa_atexit

gcc -finstrument-functions -g -O2 -o dot.out dot.c
sudo uftrace record -P dot_product ./dot.out

Or focus on dot_product() only:
gcc -finstrument-functions -g -O2 -o dot.out dot.c
sudo uftrace record -P dot_product ./dot.out
sudo uftrace tui

🧰 Step 5: Use Valgrind to Analyze Memory Behavior

Memory Peak with Massif

valgrind --tool=massif ./dot.out
ms_print massif.out.<pid>
Output example:

--------------------------------------------------------------------------------
Command:            ./dot
Massif arguments:   (none)
ms_print arguments: massif.out.10493
--------------------------------------------------------------------------------


    MB
8.009^                                                                       #
     |:::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::#
     |:                                                                      #
     |:                                                                      #
     |:                                                                      #
     |:                                                                      #
     |:                                                                      #
     |:                                                                      #
     |:                                                                      #
     |:                                                                      #
     |:                                                                      #
     |:                                                                      #
     |:                                                                      #
     |:                                                                      #
     |:                                                                      #
     |:                                                                      #
     |:                                                                      #
     |:                                                                      #
     |:                                                                      #
     |:                                                                      #
   0 +----------------------------------------------------------------------->Mi
     0                                                                   603.4

Number of snapshots: 8
 Detailed snapshots: [4 (peak)]

--------------------------------------------------------------------------------
  n        time(i)         total(B)   useful-heap(B) extra-heap(B)    stacks(B)
--------------------------------------------------------------------------------
  0              0                0                0             0            0
  1        156,159        4,198,344        4,194,304         4,040            0
  2        156,217        8,396,688        8,388,608         8,080            0
  3    632,718,953        8,397,720        8,389,632         8,088            0
  4    632,721,819        8,397,720        8,389,632         8,088            0
99.90% (8,389,632B) (heap allocation functions) malloc/new/new[], --alloc-fns, etc.
->49.95% (4,194,304B) 0x10910E: main (dot.c:23)
| 
->49.95% (4,194,304B) 0x109127: main (dot.c:24)
| 
->00.01% (1,024B) in 1+ places, all below ms_print's threshold (01.00%)

--------------------------------------------------------------------------------
  n        time(i)         total(B)   useful-heap(B) extra-heap(B)    stacks(B)
--------------------------------------------------------------------------------
  5    632,721,819        4,199,376        4,195,328         4,048            0
  6    632,721,849            1,032            1,024             8            0
  7    632,724,812                0                0             0            0

Or GUI:

Cache Usage with Cachegrind

kcachegrind callgrind.out.<pid>
Output example:

Instruction Counts with Callgrind

valgrind --tool=cachegrind ./dot.out
cg_annotate cachegrind.out.<pid>

--------------------------------------------------------------------------------
-- Metadata
--------------------------------------------------------------------------------
Invocation:       /usr/bin/cg_annotate cachegrind.out.9770
Command:          ./dot
Events recorded:  Ir
Events shown:     Ir
Event sort order: Ir
Threshold:        0.1%
Annotation:       on

--------------------------------------------------------------------------------
-- Summary
--------------------------------------------------------------------------------
Ir__________________ 

632,713,576 (100.0%)  PROGRAM TOTALS

--------------------------------------------------------------------------------
-- File:function summary
--------------------------------------------------------------------------------
  Ir__________________________  file:function

< 632,557,613 (100.0%, 100.0%)  /home/alanhc/workspace/example/profiling/dot.c:
  629,148,800  (99.4%)            dot_product
    3,408,813   (0.5%)            main

--------------------------------------------------------------------------------
-- Function:file summary
--------------------------------------------------------------------------------
  Ir_________________________  function:file

> 629,148,800 (99.4%,  99.4%)  dot_product:/home/alanhc/workspace/example/profiling/dot.c

>   3,408,817  (0.5%, 100.0%)  main:
    3,408,813  (0.5%)            /home/alanhc/workspace/example/profiling/dot.c

--------------------------------------------------------------------------------
-- Annotated source file: /home/alanhc/workspace/example/profiling/dot.c
--------------------------------------------------------------------------------
Ir_________________ 

-- line 5 ----------------------------------------
          .          
          .          #define N (1024 * 1024)
          .          #define REPEAT 100
          .          
          .          float *a;
          .          float *b;
          .          
          .          __attribute__((noinline))
      1,300  (0.0%)  float dot_product(const float *a, const float *b, int n) {
        300  (0.0%)      float result = 0.0f;
314,573,500 (49.7%)      for (int i = 0; i < n; i++) {
314,572,800 (49.7%)          result += a[i] * b[i];
          .              }
        100  (0.0%)      return result;
        800  (0.0%)  }
          .          
          9  (0.0%)  int main() {
          .              // Allocate large arrays on heap
          5  (0.0%)      a = (float *)malloc(sizeof(float) * N);
    524,300  (0.1%)      b = (float *)malloc(sizeof(float) * N);
          .          
          .              // Ensure pages are committed
          .              memset(a, 0, sizeof(float) * N);
          .              memset(b, 0, sizeof(float) * N);
          .          
    786,435  (0.1%)      for (int i = 0; i < N; i++) {
    786,432  (0.1%)          a[i] = (float)i / 255.0f;
  1,310,720  (0.2%)          b[i] = (float)(N - i) / 255.0f;
          .              }
          .          
          .              float result = 0.0f;
          .              for (int j = 0; j < 1; j++) {               // outer loop for longer runtime
        200  (0.0%)          for (int i = 0; i < REPEAT; i++) {
        698  (0.0%)              result += dot_product(a, b, N);
          .                  }
          .              }
          .          
          1  (0.0%)      printf("a \u00b7 b (repeated) = %f\n", result);
          .          
          2  (0.0%)      free(a);
          2  (0.0%)      free(b);
          3  (0.0%)      return 0;
          6  (0.0%)  

--------------------------------------------------------------------------------
-- Annotation summary
--------------------------------------------------------------------------------
Ir__________________ 

632,557,613 (100.0%)    annotated: files known & above threshold & readable, line numbers known
          0             annotated: files known & above threshold & readable, line numbers unknown
          0           unannotated: files known & above threshold & two or more non-identical
          0           unannotated: files known & above threshold & unreadable 
    155,360   (0.0%)  unannotated: files known & below threshold
        603   (0.0%)  unannotated: files unknown