Benchmarking:
It's About Time

Matt Godbolt
C++Now 2026

"As-If" By Magic

About me

jsbeeb Miracle Specbolt Compiler Explorer HRT

Games in the 90s

Games in the 00s

Where nanoseconds matter

Browsers
Browsers
Finance
Finance
Games
Games
Audio
Audio
Realtime / embedded
Realtime / embedded
HPC
HPC

V8 logo CC BY-SA 3.0; NYSE trading floor by Kevin Hutchinson, CC BY 2.0; DualSense Edge by Evan-Amos, CC0; Neve 81 console CC BY-SA 4.0; Ford SYNC ECU FCC ID public submission. All via Wikimedia Commons.

The Question

An hourglass with sand falling

Photo: Michael Himbeault, CC BY 2.0, via Wikimedia Commons.

How long does this take?

Seems simple.

Well, let's see...

Let's benchmark something


/// hide
#include <chrono>
#include <print>

/// unhide
namespace sc = std::chrono;

int sum(const std::span<const int> v) {
  int total = 0;
  for (auto x : v) total += x;
  return total;
}

int main() {
  constexpr std::array data {1, 2, 3, 4, 5};
  const auto start = sc::system_clock::now();
  auto result = sum(data);
  const auto end = sc::system_clock::now();
  std::print("result={}, took {}\n", 
             result, end - start);
}
/// unhide
          

What can go wrong?

  • Clocks
  • Compilers
  • CPUs
  • Confounding factors

What is now()?

Spaceballs: 'What is now?'

Spaceballs (1987) © Brooksfilms/MGM. Used under fair use for commentary/education. danoshinsky.com

std::chrono clocks


struct some_clock {
  using rep        = /* rep */;
  using period     = /* period per s */;
  using duration   = duration<rep, period>;
  using time_point = time_point<some_clock>;
  static constexpr bool is_steady = /* ... */;

  static time_point now() noexcept {
    // magic to find out the current time
  }
};
          

std::chrono clocks

  • system_clock: wall time
  • steady_clock: monotonic
  • high_resolution_clock: steady_clock in a trenchcoat

Type safety


/// hide
#include <chrono>
namespace sc = std::chrono;
/// unhide
auto t1 = sc::steady_clock::now();  // **steady** clock
auto t2 = sc::system_clock::now();  // **system** clock
auto diff = t2 - t1;
          

error: no match for 'operator-':
auto diff = t2 - t1;
            ~~ ^ ~~
            |    |
            |    time_point<std::chrono::_V2::steady_clock>
            time_point<std::chrono::_V2::system_clock>
          

What happens in now()?


/// hide
#include <chrono>
namespace sc = std::chrono;
/// unhide
auto get_time() {
  return sc::steady_clock::now();
}
          

Pop quiz: What system call?

What happens in now()?


std::chrono::_V2::steady_clock::now():
sub  rsp, 0x18              ; timespec ts;
mov  edi, 0x01              ; param0 = CLOCK_MONOTONIC
mov  rsi, rsp               ; param1 = &ts
call __clock_gettime        ; clock_gettime(
                            ;     CLOCK_MONOTONIC, &ts)
imul rax, [rsp], 0x3b9aca00 ; r = ts.tv_sec * 1 billion
add  rax, [rsp+0x8]         ; r += ts.tv_nsec
add  rsp, 0x18              ; restore stack
ret                         ; return r

Interestingly CLOCK_MONOTONIC

What syscall?


/// hide
#include <chrono>
/// unhide
int main() {
  using clock = std::chrono::steady_clock;
  for (int i = 0; i < 1'000'000; ++i)
    clock::now();
}
          

$ g++ clock.cpp
$ strace ./a.out 2>&1 | grep -iE 'clock|time'
$
          

What happens in clock_gettime?

glibc/sysdeps/unix/clock_gettime.c


int clock_gettime(clockid_t clk_id, timespec *tp) {
  switch (clk_id) {
    case CLOCK_MONOTONIC:
    case CLOCK_REALTIME:
      return INLINE_VSYSCALL(clock_gettime, clk_id, tp);
    // ...
  }
}

What happens in clock_gettime?


// Magically populated by the ELF loader...
int (*__vdso_clock_gettime)(int, timeval *);

// Macro magic sort of expands to:
inline int inline_vsyscall_clock_gettime(
    clockid_t clk_id, timespec *tp) { 
  if (__vdso_clock_gettime) {
    return __vdso_clock_gettime(clk_id, tp);
  }
  return syscall(clock_gettime, sc_err, 2, clk_id, tp);
}

What happens in vDSO?


$ gdb /bin/true
(gdb) starti
Starting program: /usr/bin/true
(gdb) disassemble __vdso_clock_gettime
   0x00007ffff7fbd1e0 <+0>:     jmp    0x7ffff7fbc930
(gdb) disassemble 0x7ffff7fbc930,+0x400
   0x00007ffff7fbc930:  push   %rbp
   0x00007ffff7fbc931:  mov    %rsp,%rbp
   0x00007ffff7fbc934:  push   %r14
  ...

What happens in vDSO?


__vdso_clock_gettime:
push rbp
mov  rbp, rsp
push r14
push rbx
and  rsp, -16
sub  rsp, 0x20
cmp  edi, 0x17
ja   _doSyscall
mov  eax, 0x1
mov  ecx, edi
lea  r11, [rip-26966]
shl  eax, cl
mov  edx, eax
and  edx, 0x883

je   _slowPath
mov  r9d, [r11]
mov  r10d, r9d
and  r10d, 0x1
jne  _seqlockFail
mov  eax, [r11+0x4]
cmp  eax, 0x1
jne  _notTsc
rdtscp
xchg ax, ax
shl  rdx, 0x20
or   rdx, rax
btr  rdx, 0x3f
movsxd r8, edi
...mul / shift / ret

The code


const auto *cfg = clock_data_for(clk_id);
uint64_t seq, ns, aux;
do {
  seq = cfg->seq; // "volatile" read
  if (seq & 1) continue;

  const uint64_t delta = 
      __builtin_ia32_rdtscp(&aux) - cfg->cycle_last;
  ns = (delta * cfg->mult + cfg->base) >> cfg->shift;

} while (cfg->seq != seq); // "volatile" read
return ns;
          

PSEUDOCODE! not remotely safe...

Based off code in do_hres() from kernel lib/vdso/gettimeofday.c.

The code

  • Shared memory; sequence lock
  • mult, shift, cycle_last updated on tick
  • MONOTONIC: mult steered by NTP
  • MONOTONIC_RAW: mult fixed at boot

The whole process


std::chrono::steady_clock::now()
  → __clock_gettime(CLOCK_MONOTONIC)
    → vDSO (no syscall!)
      seq lock { rdtsc + calibration maths }
    ← struct timespec
  ← time_point{timespec to nanos}
          

Function calls & a magic instruction.

No syscall!

How long does it take?


/// hide
#include <chrono>
#include <limits>
#include <numeric>
#include <print>
/// unhide
int main() {
  using clock = std::chrono::steady_clock;
  auto last = clock::now();
  auto minDelta = std::numeric_limits<
      clock::duration>::max();
  for (int i = 0; i < 5'000; ++i) {
    auto now = clock::now();
    auto delta = now - last;
    minDelta = std::min(minDelta, delta);
    last = now;
  }
  std::print("Min = {}\n", minDelta);
}
          

Min = 0ns

How long does it take?


/// hide
#include <chrono>
#include <limits>
#include <numeric>
#include <print>
/// unhide
int main() {
  using clock = std::chrono::steady_clock;
  auto last = clock::now();
  auto minDelta = std::numeric_limits<
      clock::duration>::max();
  for (int i = 0; i < 5'000; ++i) {
    auto now = clock::now();
    auto delta = now - last;
    minDelta = std::min(minDelta, delta);
    last = now;
  }
  std::print("Min = {}\n", minDelta);
}
          

How long does it take?


/// hide
#include <chrono>
#include <limits>
#include <numeric>
#include <print>
using namespace std::literals;
/// unhide
int main() {
  using clock = std::chrono::steady_clock;
  auto last = clock::now();
  auto minDelta = 1'000'000'000ns;
  for (int i = 0; i < 5'000; ++i) {
    auto now = clock::now();
    auto delta = now - last;
    minDelta = std::min(minDelta, delta);
    last = now;
  }
  std::print("Min = {}\n", minDelta);
}
          

20-30ns!

Let's get benchmarking!

A US Coast and Geodetic Survey BENCH MARK disc set into red brick

Photo: Elliott R. Plack, CC0, via Wikimedia Commons (Harbor East Geodetic Disk).

When is now()?

Spaceballs: 'When is now?'

Spaceballs (1987) © Brooksfilms/MGM. Used under fair use for commentary/education. danoshinsky.com

Benchmarking


/// hide
#include <span>
#include <cstdint>
#include <chrono>
namespace sc = std::chrono;
/// unhide
constexpr std::array data {1, 2, 3, 4, 5};
static int sum(std::span<const int> v) {
  int total = 0;
  for (auto x : v) total += x;
  return total;
}

auto benchmark() {
  auto start = sc::steady_clock::now();
  int result = sum(data);
  return sc::steady_clock::now() - start;
}
          

error: unused variable 'result' [-Werror=unused-variable]
|     int result = sum(data);
|         ^~~~~~

Benchmarking


/// hide
#include <span>
#include <cstdint>
#include <chrono>
namespace sc = std::chrono;
/// unhide
constexpr std::array data {1, 2, 3, 4, 5};
static int sum(std::span<const int> v) {
  int total = 0;
  for (auto x : v) total += x;
  return total;
}

auto benchmark() {
  auto start = sc::steady_clock::now();
  [[maybe_unused]] int result = sum(data);
  return sc::steady_clock::now() - start;
}
          

volatile "fix"


/// hide
#include <span>
#include <cstdint>
#include <chrono>
namespace sc = std::chrono;
/// unhide
constexpr std::array data {1, 2, 3, 4, 5};
static int sum(std::span<const int> v) {
  int total = 0;
  for (auto x : v) total += x;
  return total;
}

auto benchmark() {
  auto start = sc::steady_clock::now();
  [[maybe_unused]] volatile int result = sum(data);
  return sc::steady_clock::now() - start;
}
          

Data hiding


/// hide
#include <span>
#include <cstdint>
#include <chrono>
namespace sc = std::chrono;
/// unhide
/*constexpr*/ std::array data {1, 2, 3, 4, 5};
static int sum(std::span<const int> v) {
  int total = 0;
  for (auto x : v) total += x;
  return total;
}

auto benchmark() {
  auto start = sc::steady_clock::now();
  [[maybe_unused]] volatile int result = sum(data);
  return sc::steady_clock::now() - start;
}
          

Why volatile isn't the right answer

Accesses through volatile glvalues are evaluated strictly according to the rules of the abstract machine.

[intro.abstract] Section 8 - the "as-if" rule

Why volatile isn't the right answer

The order of volatile operations cannot change relative to other volatile operations, but may change relative to non-volatile operations.

P1152R0 Deprecating volatile (JF Bastien)

A bigger problem?


/// hide
#include <span>
#include <cstdint>
#include <chrono>
namespace sc = std::chrono;
constexpr std::array data {1, 2, 3, 4, 5};
static int sum(std::span<const int> v) {
  int total = 0;
  for (auto x : v) total += x;
  return total;
}

/// unhide
auto benchmark() {
  auto start = sc::steady_clock::now();
  [[maybe_unused]] volatile int result = sum(data);
  return sc::steady_clock::now() - start;
}
          

A bigger problem?


/// hide
#include <span>
#include <cstdint>
#include <chrono>
namespace sc = std::chrono;
constexpr std::array data {1, 2, 3, 4, 5};
static int sum(std::span<const int> v) {
  int total = 0;
  for (auto x : v) total += x;
  return total;
}

/// unhide
auto benchmark() {
  // Is this a valid transform?
  [[maybe_unused]] volatile int result = sum(data);

  auto start = sc::steady_clock::now();
  return sc::steady_clock::now() - start;
}
          

A bigger problem?

..I had embarrassingly neglected the possibility that the compiler would reorder the calculation out of the timing region...
Without decimating the as-if rule, there appears to be no way to normatively require such timings to be correct. Nevertheless, timing a block of code or an algorithm is not devoid of meaning.

P0342R0 Timing barriers, Mike Spertus

Interlude

The closed front curtain of the Bolshoi Theatre, deep red with gold detailing

Bolshoi Theatre curtain: Bgelo777, CC BY-SA 4.0, via Wikimedia Commons.

GCC inline asm syntax

Interlude: GCC inline asm syntax


asm <optionally volatile> (
  "template string %0, %1 ..."
  : outputs
  : inputs
  : clobbers
);

Interlude: Constraints

"type"(expression)

Types

"r"
register
"m"
memory
"r,m"
reg or mem
"i"
immediate
"g"
anything
"a""b", …
specific reg

Modifiers

=
write only
+
read/write
&
early clobber

e.g. "=r"(dest), "+m"(buf)

Interlude: GCC inline asm syntax


/// hide
#include <cstdint>
  void test() {
/// unhide
uint64_t source = 1234;
uint64_t dest;
asm /*volatile*/ (
  "mov %1, %0"   // AT&T syntax
  : "=r" (dest)  // outputs
  : "r" (source) // inputs
  :              // no clobbers
);
/// hide
}

Interlude: GCC inline asm syntax

GCC's optimizers discard asm statements if there is no need for the output variables. The optimizers may move code out of loops if the code always returns the same result. Using the volatile qualifier disables these optimizations.

DoNotOptimize


/// hide
#include <array>
/// unhide
template<typename T>
void DoNotOptimize(const T &value) {
  asm volatile(
    ""              // No instructions at all!
    :               // no outputs
    : "r,m" (value) // input
    :               // no clobbers
  );
}
///hide

std::array data {1, 2, 3, 4, 5};

void benchmark_sum() {
  int total = 0;
  for (auto x : data) total += x;

  DoNotOptimize(total);
}
    

DoNotOptimize


/// hide
#include <array>
#include <chrono>
namespace sc = std::chrono;
/// unhide
template<typename T>
void DoNotOptimize(const T &value) {
  asm volatile("" : : "r,m" (value));
}
/// hide
std::array data {1, 2, 3, 4, 5};
static void benchmark_sum() {
  int total = 0;
  for (auto x : data) total += x;

  DoNotOptimize(total);
}
/// unhide
auto benchmark_sum_many() {
  auto now = sc::steady_clock::now();
  for (int i = 0; i < 16; ++i) {
    benchmark_sum();
  }
  return sc::steady_clock::now() - now;
}
    

ClobberMemory()


/// hide
#include <array>
#include <chrono>
namespace sc = std::chrono;
template<typename T>
void DoNotOptimize(const T &value) {
  asm volatile("" : : "r,m" (value));
}
std::array data {1, 2, 3, 4, 5};
static void benchmark_sum() {
  int total = 0;
  for (auto x : data) total += x;

  DoNotOptimize(total);
}
/// unhide
inline void ClobberMemory() {
  asm volatile("" : : : "memory");
}
auto benchmark_sum_many() {
  auto now = sc::steady_clock::now();
  for (int i = 0; i < 16; ++i) {
    ClobberMemory();
    benchmark_sum();
  }
  return sc::steady_clock::now() - now;
}
    

The standardisation gap?

  • P0342R0 (Spertus, 2016): timing_fence() - rejected
    • "If the timing fence is inside now(), and now() is in another TU, how does the compiler know there is a fence?"
  • P0412R0 (Maltsev, 2016): keep() / touch()
    • Solves elimination. Stalled at R0.
  • Rust: std::hint::black_box. Zig: mem.doNotOptimizeAway.
  • C++: …

When will then be now()?

Spaceballs: 'When is now?'

Spaceballs (1987) © Brooksfilms/MGM. Used under fair use for commentary/education. danoshinsky.com

Hardware counters

High-resolution die shot of an Intel Pentium 4 Prescott processor

Pentium 4 Prescott die: Martijn Boer, Public Domain, via Wikimedia Commons.

When we say "now", when do we mean?

Hardware counters

  • Time-Stamp Count rdtsc, rdtscp
  • Stable since Pentium 4
  • Invariant since Nehalem (Phenom for AMD)

(ARM has CNTVCT_EL0)

Let's write our own!

rdtsc - read counter into edx:eax


/// hide
#include <cstdint>
///unhide
uint64_t now() {
  asm("rdtsc" 
  : // outputs
  : // inputs
  : // clobbers
  );
}
    

Let's write our own!

rdtsc - read counter into edx:eax


/// hide
#include <cstdint>
///unhide
uint64_t now() {
  asm("rdtsc" 
  : "=a"(...), "=d"(...) // outputs
  : // inputs
  : // clobbers
  );
}
    

Let's write our own!

rdtsc - read counter into edx:eax


/// hide
#include <cstdint>
///unhide
uint64_t now() {
  uint32_t lo, hi;
  asm("rdtsc" 
  : "=a"(lo), "=d"(hi) // outputs
  : // inputs
  : // clobbers
  );
}
    

Let's write our own!

rdtsc - read counter into edx:eax


/// hide
#include <cstdint>
///unhide
uint64_t now() {
  uint32_t lo, hi;
  asm("rdtsc" 
  : "=a"(lo), "=d"(hi)
  : // inputs
  : // clobbers
  );
  return static_cast<uint64_t>(lo) 
      | static_cast<uint64_t>(hi) << 32;
}
    

Let's write our own!

rdtsc - read counter into edx:eax


/// hide
#include <cstdint>
///unhide
uint64_t now() {
  uint64_t lo, hi;
  asm("rdtsc" 
  : "=a"(lo), "=d"(hi)
  : // inputs
  : // clobbers
  );
  return lo | (hi << 32);
}
    

Let's test it!


/// hide
#include <cstdint>
inline uint64_t now() {
  uint64_t lo, hi;
  asm("rdtsc" : "=a"(lo), "=d"(hi));
  return lo | (hi << 32);
}
/// unhide
void testFunction();
auto benchmark() {
  const auto start = now();
  testFunction();
  const auto end = now();
  return end - start;
}
    

Final version


/// hide
#include <cstdint>
///unhide
uint64_t now() {
  uint64_t lo, hi;
  asm volatile("rdtsc" : "=a"(lo), "=d"(hi));
  return lo | (hi << 32);
}
    

But best to leave to experts


uint64_t now() {
  return __builtin_ia32_rdtsc();
}
    

Out of Order Execution

  • Modern CPUs don't execute in order
  • Instructions run as soon as ready
  • "As-if" all over again

rdtsc vs rdtscp

  • RDTSC - immediate
  • RDTSCP - after prior instrs completed
"The RDTSCP instruction ... wait[s] until all previous instructions have executed ... [but] ... subsequent instructions may begin execution before the read operation is performed."

Fencing


      lfence      ; ensure all loads complete
      rdtsc       ; read time

      ; code under test here

      rdtscp      ; rdtsc but partially serializing
      lfence      ; ensure nothing beyond migrates above
    

Fencing


  #include <cstdint>

  auto test() {
    __builtin_ia32_lfence();
    const auto start = __builtin_ia32_rdtsc();
    //...
    std::uint32_t aux;
    const auto end = __builtin_ia32_rdtscp(&aux);
    __builtin_ia32_lfence();
    return end - start;
  }
      

Does it matter?


      extern int x, y, z, w;
      auto someFunction() {
        return x % y % z % w;
      }
      

Cost of 3 idivs?

rdtsc ~111 ns
rdtscp ~187 ns
lfence;rdtsc
rdtscp;lfence
~159 ns

Probably only matters for microbenchmarks

So, what does clock_gettime do?


...
and  r10d, 0x1
jne  _seqlockFail
mov  eax, [r11+0x4]
cmp  eax, 0x1
jne  _notTsc
rdtscp              ; rdtscp but no fences in sight!
xchg ax, ax
shl  rdx, 0x20
or   rdx, rax
btr  rdx, 0x3f
movsxd r8, edi
...

The cost of asking the time

ns cycles instr IPC
rdtsc 14 27 10 0.36
rdtscp 23 43 11 0.25
steady_clock::now() 35 68 100 1.46

~25 cycles vDSO overhead.

~89 more instructions?

Accounting for the clock overhead

  • Subtract overhead
  • Amortise over N runs

Delta benchmarking


// base = overhead + N x cost
const auto base  = measure(N);
// bench = overhead + 2N x cost
const auto bench = measure(N * 2);
// result = (overhead + 2N x cost) - overhead + N x cost
//        = N x cost
const auto result = (bench - base) / N;
    

Cancel overhead.

Which clock to use?

  • steady_clock::now() mostly
  • Fenced rdtsc/rdtscp for µbenchmarks
  • Unfenced rdtsc for speed

Confounding Factors

Pyramidal foam wedges of an anechoic chamber lit in deep blue

Photo: Tom Marvel, CC BY 2.0, via Wikimedia Commons (Anechoic chamber blue-1).

So

  • Use a good clock
  • Defeat optimizer
  • Be aware of OoO execution
  • Consider overheads
  • Surely that's all?

More CPU fun

Branch predictor: history register FFTT indexing into a table of 2-bit saturating counters

From an earlier talk on CPU microarchitecture.

Branching


/// hide
#include <cstdint>
#include <vector>
/// unhide
int state_machine(const std::vector<int> &data,
                  int threshold) {
  int state = 0, acc = 0;
  for (auto v : data) {
    if (state == 0) {
      if (v > threshold) { acc += v * v; state = 1; } 
      else { acc += v; }
    } else {
      if (v & 1) { acc ^= v; state = 0; } 
      else { acc += v >> 2; }
    }
  }
  return acc;
}
          

Branching


/// hide
#include <cstdint>
#include <vector>
int state_machine(const std::vector<int> &data,
                  int threshold) {
  int state = 0, acc = 0;
  for (auto v : data) {
    if (state == 0) {
      if (v > threshold) { acc += v * v; state = 1; } 
      else { acc += v; }
    } else {
      if (v & 1) { acc ^= v; state = 0; } 
      else { acc += v >> 2; }
    }
  }
  return acc;
}
/// unhide
auto benchmark(size_t dataSize, size_t N) {
  // seeded random data...
  auto data = make_data(dataSize); 

  auto start = now();
  for (size_t repeat = 0; repeat < N; ++repeat) {
    DoNotOptimize(state_machine(data, 50));
  }
  auto end = now();

  return end - start;
}
          

Branching


$ ./benchmark.tsc_fenced state_machine_256
# samples: 150 (+10 warmup, discarded)
name                                min
----                                ---
state_machine_256                  0.59 ns/elem
    

Different machines, different stories

ns/elem vs dataSize (log-x) for state_machine on three CPUs: AMD EPYC VM stays flat, Comet Lake cliffs at dataSize=8K, Lunar Lake cliffs at dataSize=16K

Random data, mt19937(42)..

Predictable branches

Same three CPUs, sorted data: all lines stay flat near 0.5 ns/elem. No cliff.

Same values & cache footprint.

perf stat


$ perf stat -e cycles,instructions,branches,branch-misses \
    taskset -c 2 ./benchmark.tsc_fenced state_machine_4096

   531,281,741  cycles
 1,347,442,053  instructions  #  2.54 insn per cycle
   419,833,335  branches
     1,839,889  branch-misses #  0.44% of all branches
          
dataSize Instructions Branches Miss % IPC
1,024 1.35B 420M 0.02% 3.22
4,096 1.35B 420M 0.44% 2.54
8,192 1.35B 419M 8.67% 0.96
32,768 1.35B 421M 18.10% 0.59

Branching

  • Is this real?
  • What do we want to measure?

And…

  • OoO overlap
  • Latency vs throughput
  • Cache & memory

…each its own rabbit hole.

What else can perturb you?

  • Scheduler, IRQs
  • Frequency scaling, boost, thermals
  • ASLR, link order, env
  • Noisy neighbours

So, what can we do?

A rural road forking into two paths through trees

Approaches

A rural road forking into two paths through trees

Photo: Wikideas1, CC0, via Wikimedia Commons.

  • Embrace it
  • Minimise it

Microbenchmarks

A compound optical microscope

Photo: Acagastya, CC0, via Wikimedia Commons.

...have been the focus of this talk...

System noise mitigations


# pin and isolate
taskset -c 3 ./bench
# at boot: isolcpus=3 nohz_full=3 rcu_nocbs=3

# frequency
sudo cpupower frequency-set -g performance
echo 1 | sudo \
  tee /sys/devices/system/cpu/intel_pstate/no_turbo

# IRQs off your core
echo fffffff7 | sudo tee /proc/irq/*/smp_affinity
    

Take the minimum?

CDF of state_machine_4096 on a casually-loaded laptop. Min sits at 1.77 ns/elem, the curve climbs slowly through p50 = 2.10, mean = 2.31, out to p99 = 3.50.

Noise only makes things slower.

Which question?

  • Best case
  • Typical
  • Bounded worst case

Pick one.

Sensitivity

  • BPU-heavy?
  • Cache-reliant?
  • Scheduler-jitter?

perf & CDF shape.

Slow but steady

Real workloads

Benchmark says faster.

App says no difference.

In-situ stats


struct ScopedTimer {
  uint64_t &accumulator;
  uint64_t start;
  explicit ScopedTimer(uint64_t &acc)
    : accumulator(acc), start(rdtsc()) {}
  ~ScopedTimer() { accumulator += rdtsc() - start; }
};

void hot_path() {
  ScopedTimer t{stats.hot_path};
  // ...
}
    

Benchmarking vs profiling

  • Benchmark: how fast?
  • Profile: why that fast?

You need both.

There's more perf

perf stat
counters
perf record
sample profile
perf report
interactive view
perf annotate
per-instruction
perf top
live profile
perf list
what counters exist?

perf list | wc -l7158 events (branch-misses, LLC-load-misses, resource_stalls.sb, …)

Pick the right counter.

Other tools

  • Google Benchmark / Catch2
    • (don't write your own)
  • Tracy - pipeline timing
  • Valgrind / Callgrind - exact but slow
  • perf-cpp - wrapper for perf_event_open
  • uiCA / uops.info - low level info

Conclusion

Aspen Mountain, Colorado: pine-covered slopes under blue sky

Aspen Mountain: Hunter Bryant, CC0 (Unsplash), via Wikimedia Commons.

If any single optimisation makes a routine run two or more times faster, then you’ve broken the code.

- Matt Godbolt, April 2005

Full circle

In 1993: border colours.

2026: rdtscp, perf, Catch2...

Measure your code!

Problems

  • Compiler: can eliminate work
  • CPU: predicts, overlaps work
  • Caches can make data unrealistically fast
  • The OS can interrupt your measurement
  • The clock itself has overhead

What to actually do

  • Compiler limits, work around them
  • Clock choice + tradeoffs
  • CPU surprises (BPU, OoO, cache)
  • Know your sensitivity
  • Distribution: fastest? bounded?
  • Tools: Catch2, gbench, in-situ
Silhouette of a top-hatted stage magician with arms outstretched

Magician silhouette: Firkin, CC0, via Wikimedia Commons (after an 1880 illustration).

"Some people say optimisation is magic. People who optimise say no, it isn't magic.
But really it is like magic: once you show people how it's done, it goes from amazing to 'oh, just that?'"

- Pat Farley

HRT office: people sitting on a wide wooden amphitheater staircase, working on laptops

mattgodbolt@hudson-trading.com

hudsonrivertrading.com

This is (some) of what we do at HRT

HRT trading floor: rows of desks with multiple monitors, people working under blue-tinted natural light

Godbolt's Law (2005)

If any single optimisation makes a routine run two or more times faster, then you've broken the code.

“I proceed to dance around the room… then, just before I commit, I remember to actually check the results - and yes, the reason it's going three times faster is because it's not actually doing any work at all.”

xania.org/200504/godbolt's-law

Mistakes we made preparing this talk

  • Benchmark filter state_machine_16 also matched state_machine_16384
    → perf counters counted both, instruction count was 36% too high
  • “Suspicious” N=16 data: 6% branch miss rate, IPC 1.25
    → accepted it as “startup noise” instead of investigating
  • Delta benchmarking (N vs 2N): noisier than simple timing at high iteration counts
    → elegant theory, worse in practice for our use case

A benchmarking bug in our benchmark of benchmarking.

rdtsc pipeline trace (uiCA)

rdtsc              |P  QI    D                        ER
                   |P    QI    D E                     R
                   |P    QI     D E                    R
                   |P    QI    D E                     R
                   |P    QI      D E                    R
                   |P     QI    D E                     R
                   |P     QI    D E                     R
                   |P     QI     D E                    R
                   |P     QI     D E                     R
                   |P      QI     D E                    R
                   |P      QI     D E                    R
                   |P      QI      D E                   R
                   |P      QI      D E                    R
                   |P       QI    D E                     R
                   |P       QI     D E                    R
                   |P       QI      D E                   R
                   |P       QI       D E                   R
                   |P        QI    D E                     R
shl rdx, 0x20      |P          QI    D E                   R
or rax, rdx        |P          QI      D E                 R
mov rcx, rax       |P          QE                           R
add rcx, 0x2a      |P          QI       D E                 R
imul rcx, rcx, 0x7 | P          QI       D  E               R
    

rdtscp pipeline trace (uiCA)

rdtscp             |P  QI    D ER
                   |P    QI    D ER
                   |P    QI     D ER
                   |P    QI    D E R
                   |P    QI      D ER
                   |P     QI    D E R
                   |P     QI     D ER
                   |P     QI      D ER
                   |P     QI       D ER
                   |P      QI     D E R
                   |P      QI    D E  R
                   |P      QI     D E R
                   |P      QI      D E R
                   |P       QI      D ER
                   |P       QI       D ER
                   |P       QI        D ER
                   |P       QI    D E    R
                   |P        QI    D E   R
                   |P        QI     D E  R
                   |P        QI      D E  R
shl rdx, 0x20      |P          QI    D E  R
or rax, rdx        |P          QI      D ER
mov rcx, rax       |P          QE         R
add rcx, 0x2a      | P         QI       D ER
imul rcx, rcx, 0x7 | P          QI       D  ER
    

You won't believe until you measure

Chrome Omnibox: 25,000 allocations per keypress

std::string responsible for nearly half of all Chrome browser process allocations.

Discovered by profiling. Not by guessing.

Caching ruins everything

memory, filesystem, etc etc

do you want hot or cold? (same question as bpu)

Same code, different sizes

  • 32×32: fits in L1. Blazing fast.
  • 512×512: fits in L3. Slower.
  • 2048×2048: spills to DRAM. Way slower.

Same code. Same algorithm. Same binary.

Three completely different performance profiles.

TODO: actual benchmark numbers - chart/table of ns/element at different matrix sizes

Loop order matters


/// hide
#include <mdspan>

using mat = std::mdspan<double, std::dextents<int, 2>>;
using cmat = std::mdspan<const double, std::dextents<int, 2>>;
/// unhide
// ijk - cache-hostile (stride-N on B)
void multiply_ijk(cmat A, cmat B, mat C) {
  int N = A.extent(0);
  for (int i = 0; i < N; ++i)
    for (int j = 0; j < N; ++j)
      for (int k = 0; k < N; ++k)
        C[i, j] += A[i, k] * B[k, j];
}

// ikj - cache-friendly (stride-1 on B and C)
void multiply_ikj(cmat A, cmat B, mat C) {
  int N = A.extent(0);
  for (int i = 0; i < N; ++i)
    for (int k = 0; k < N; ++k)
      for (int j = 0; j < N; ++j)
        C[i, j] += A[i, k] * B[k, j];
}
          

5–10× difference. And the compiler vectorises only one of them.

Machine-dependent results

  • Same binary, laptop vs Xeon: different L3 sizes
  • The "fits in L3" crossover point moves
  • Your benchmark is true… on your machine

TODO: benchmark numbers from two different machines (laptop vs Xeon)

The measurement perturbs the system

  • Context switches during measurement
  • Interrupts: timer, NIC, anything
  • CPU frequency scaling: first iterations at 2.4 GHz, warm ones at 4.8 GHz
  • Your "average" blends two different CPUs

Distribution matters

The minimum of many runs represents "what happens when nothing goes wrong." The mean includes all the ways the OS interrupted you.
  • Warm up
  • Pin to a core
  • Use the right clock
  • Many iterations