It's About Time

Benchmarking:
It's About Time

Matt Godbolt
C++Now 2026

"As-If" By Magic

About me

Games in the 90s

Border colour trick:
Poke *(volatile Uint32*)(0xa05f8040) = colour. Red on entry, black on exit. Scanlines = timing. Display IS the clock.
Red Dog: DEBUG_BORDER(colour) - commented out in shipped code.

Games in the 00s

Red Dog (Dreamcast, 2000).

Profile bars:
Border trick fails past 1/60s. Built pbMark() instead.
Coloured bars at screen bottom. Tick marks at 1, 2, 3 frames.
Artist builds: red > half frame = red overlay on entire screen.
CrashOut(): flash two colours = crash type. Visible across room.

Tracy = modern descendant. Same idea, timeline viewer not CRT.

Transition: "1999. Now I need microsecond accuracy. How do you measure time in 2026?"

Where nanoseconds matter

V8 logo CC BY-SA 3.0; NYSE trading floor by Kevin Hutchinson, CC BY 2.0; DualSense Edge by Evan-Amos, CC0; Neve 81 console CC BY-SA 4.0; Ford SYNC ECU FCC ID public submission. All via Wikimedia Commons.

C++ = where nanoseconds matter. Trading, games, HPC, browsers, audio, embedded.

Drop six images into images/: ns-browsers.jpg, ns-finance.jpg, ns-games.jpg, ns-audio.jpg, ns-embedded.jpg, ns-hpc.jpg. Fragments reveal them one at a time so you can name-check each domain.

The Question

Photo: Michael Himbeault, CC BY 2.0, via Wikimedia Commons.

How long does this take?

Seems simple.

Well, let's see...

Let's benchmark something


/// hide
#include <array>
#include <chrono>
#include <print>
#include <span>

/// unhide
namespace sc = std::chrono;

int sum(const std::span<const int> v) {
  int total = 0;
  for (auto x : v) total += x;
  return total;
}

int main() {
  constexpr std::array data {1, 2, 3, 4, 5};
  const auto start = sc::system_clock::now();
  auto result = sum(data);
  const auto end = sc::system_clock::now();
  std::print("result={}, took {}\n", 
             result, end - start);
}
/// unhide

What can go wrong?

Clocks
Compilers
CPUs
Confounding factors

What is `now()`?

Spaceballs (1987) © Brooksfilms/MGM. Used under fair use for commentary/education. danoshinsky.com

`std::chrono` clocks


struct some_clock {
  using rep        = /* arithmetic type holding the tick count */;
  using period     = /* std::ratio: seconds per tick */;
  using duration   = std::chrono::duration<rep, period>;
  using time_point = std::chrono::time_point<some_clock>;
  static constexpr bool is_steady = /* ... */;

  static time_point now() noexcept {
    // magic to find out the current time
  }
};

`std::chrono` clocks

system_clock: wall time
steady_clock: monotonic
high_resolution_clock: another clock in a trenchcoat (typically steady_clock in libc++ and MSVC, system_clock in libstdc++)
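
Which clock is which can be checked at compile time on any given standard library. The following is a minimal sketch; `describe_clock` and `high_resolution_clock_identity` are hypothetical helper names introduced here for illustration, not part of the talk.

```cpp
#include <chrono>
#include <string>
#include <type_traits>

namespace sc = std::chrono;

// Hypothetical helper: summarises a clock's compile-time properties.
template <typename Clock>
std::string describe_clock(const std::string& name) {
  return name + ": is_steady=" + (Clock::is_steady ? "true" : "false") +
         ", tick=" + std::to_string(Clock::period::num) + "/" +
         std::to_string(Clock::period::den) + " s";
}

// Hypothetical helper: reports which clock high_resolution_clock aliases.
// The standard allows it to be a synonym for either of the other two.
inline std::string high_resolution_clock_identity() {
  if (std::is_same_v<sc::high_resolution_clock, sc::steady_clock>)
    return "steady_clock";
  if (std::is_same_v<sc::high_resolution_clock, sc::system_clock>)
    return "system_clock";
  return "a distinct clock type";
}
```

Calling `describe_clock<sc::system_clock>("system_clock")` and `high_resolution_clock_identity()` on the target toolchain shows what is under the trenchcoat there.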

Type safety


/// hide
#include <chrono>
namespace sc = std::chrono;
/// unhide
auto t1 = sc::steady_clock::now();  // **steady** clock
auto t2 = sc::system_clock::now();  // **system** clock
auto diff = t2 - t1;


error: no match for 'operator-':
auto diff = t2 - t1;
            ~~ ^ ~~
            |    |
            |    time_point<std::chrono::_V2::steady_clock>
            time_point<std::chrono::_V2::system_clock>
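
The same property can be checked without triggering a compile error. This sketch uses the C++17 detection idiom; `can_subtract` and `elapsed_since` are illustrative names, not library facilities.

```cpp
#include <chrono>
#include <type_traits>
#include <utility>

namespace sc = std::chrono;

// Detects whether the expression `A - B` compiles.
template <typename A, typename B, typename = void>
struct can_subtract : std::false_type {};

template <typename A, typename B>
struct can_subtract<
    A, B, std::void_t<decltype(std::declval<A>() - std::declval<B>())>>
    : std::true_type {};

using steady_tp = sc::steady_clock::time_point;
using system_tp = sc::system_clock::time_point;

// Time points from the same clock subtract to a duration...
static_assert(can_subtract<steady_tp, steady_tp>::value,
              "same-clock subtraction should compile");
// ...but mixing clocks is rejected by the type system.
static_assert(!can_subtract<system_tp, steady_tp>::value,
              "cross-clock subtraction should not compile");

// Same-clock arithmetic in use: elapsed time since `start`.
inline sc::nanoseconds elapsed_since(steady_tp start) {
  return sc::duration_cast<sc::nanoseconds>(sc::steady_clock::now() - start);
}
```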

What happens in `now()`?


/// hide
#include <chrono>
namespace sc = std::chrono;
/// unhide
auto get_time() {
  return sc::steady_clock::now();
}

Pop quiz: What system call?

What happens in `now()`?


std::chrono::_V2::steady_clock::now():
sub  rsp, 0x18              ; timespec ts;
mov  edi, 0x01              ; param0 = CLOCK_MONOTONIC
mov  rsi, rsp               ; param1 = &ts
call __clock_gettime        ; clock_gettime(
                            ;     CLOCK_MONOTONIC, &ts)
imul rax, [rsp], 0x3b9aca00 ; r = ts.tv_sec * 1 billion
add  rax, [rsp+0x8]         ; r += ts.tv_nsec
add  rsp, 0x18              ; restore stack
ret                         ; return r

Interestingly, that's CLOCK_MONOTONIC rather than CLOCK_MONOTONIC_RAW.
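
Read back into C++, the listing above amounts to roughly the following. This is a sketch for POSIX systems only; `monotonic_nanos` is a made-up name, and the real library code differs in detail.

```cpp
#include <cstdint>
#include <time.h>  // POSIX: clock_gettime, CLOCK_MONOTONIC, timespec

// Hypothetical equivalent of the disassembled steady_clock::now() body:
// fetch a timespec, then flatten it to a single nanosecond count.
inline std::int64_t monotonic_nanos() {
  timespec ts{};
  clock_gettime(CLOCK_MONOTONIC, &ts);
  return static_cast<std::int64_t>(ts.tv_sec) * 1'000'000'000 + ts.tv_nsec;
}
```

The multiply by one billion is the `imul ... 0x3b9aca00` in the assembly, and the add of `tv_nsec` is the `add rax, [rsp+0x8]`.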

What syscall?


/// hide
#include <chrono>
/// unhide
int main() {
  using clock = std::chrono::steady_clock;
  for (int i = 0; i < 1'000'000; ++i)
    clock::now();
}


$ g++ clock.cpp
$ strace ./a.out 2>&1 | grep -iE 'clock|time'
$

What happens in `clock_gettime`?

glibc/sysdeps/unix/clock_gettime.c


int clock_gettime(clockid_t clk_id, timespec *tp) {
  switch (clk_id) {
    case CLOCK_MONOTONIC:
    case CLOCK_REALTIME:
      return INLINE_VSYSCALL(clock_gettime, clk_id, tp);
    // ...
  }
}

What happens in `clock_gettime`?


// Magically populated by the ELF loader...
int (*__vdso_clock_gettime)(clockid_t, timespec *);

// Macro magic sort of expands to:
inline int inline_vsyscall_clock_gettime(
    clockid_t clk_id, timespec *tp) { 
  if (__vdso_clock_gettime) {
    return __vdso_clock_gettime(clk_id, tp);
  }
  return syscall(clock_gettime, sc_err, 2, clk_id, tp);
}
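
The "magic" has a visible trace: on Linux the kernel passes the address of the vDSO image to each process in the auxiliary vector, under the key `AT_SYSINFO_EHDR`. The sketch below (Linux with glibc or musl assumed; `vdso_base` and `looks_like_elf` are illustrative names) reads that address and checks that an ELF header sits there.

```cpp
#include <cstring>
#include <sys/auxv.h>  // Linux: getauxval, AT_SYSINFO_EHDR

// Address of the vDSO's ELF header, or 0 if the kernel did not provide one.
inline unsigned long vdso_base() { return getauxval(AT_SYSINFO_EHDR); }

// True if the four bytes at `addr` are the ELF magic number.
inline bool looks_like_elf(unsigned long addr) {
  static const unsigned char magic[4] = {0x7f, 'E', 'L', 'F'};
  return addr != 0 &&
         std::memcmp(reinterpret_cast<const void*>(addr), magic, 4) == 0;
}
```

The dynamic loader starts from this same address to find symbols such as `__vdso_clock_gettime`.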

What happens in `vDSO`?


$ gdb /bin/true
(gdb) starti
Starting program: /usr/bin/true
(gdb) disassemble __vdso_clock_gettime
   0x00007ffff7fbd1e0 <+0>:     jmp    0x7ffff7fbc930
(gdb) disassemble 0x7ffff7fbc930,+0x400
   0x00007ffff7fbc930:  push   %rbp
   0x00007ffff7fbc931:  mov    %rsp,%rbp
   0x00007ffff7fbc934:  push   %r14
  ...

What happens in `vDSO`?


__vdso_clock_gettime:
push rbp
mov  rbp, rsp
push r14
push rbx
and  rsp, -16
sub  rsp, 0x20
cmp  edi, 0x17
ja   _doSyscall
mov  eax, 0x1
mov  ecx, edi
lea  r11, [rip-26966]
shl  eax, cl
mov  edx, eax
and  edx, 0x883


je   _slowPath
mov  r9d, [r11]
mov  r10d, r9d
and  r10d, 0x1
jne  _seqlockFail
mov  eax, [r11+0x4]
cmp  eax, 0x1
jne  _notTsc
rdtscp
xchg ax, ax
shl  rdx, 0x20
or   rdx, rax
btr  rdx, 0x3f
movsxd r8, edi
...mul / shift / ret

The code


const auto *cfg = clock_data_for(clk_id);
uint64_t seq, ns;
unsigned aux; // receives IA32_TSC_AUX; unused here
for (;;) {
  seq = cfg->seq; // "volatile" read
  if (seq & 1) continue; // update in progress: retry

  const uint64_t delta = 
      __builtin_ia32_rdtscp(&aux) - cfg->cycle_last;
  ns = (delta * cfg->mult + cfg->base) >> cfg->shift;

  if (cfg->seq == seq) break; // "volatile" read
}
return ns;

PSEUDOCODE! not remotely safe...

Based on the code in do_hres() in the kernel's lib/vdso/gettimeofday.c.

The code

Shared memory; sequence lock
mult, shift, cycle_last updated on tick
MONOTONIC: mult steered by NTP
MONOTONIC_RAW: mult fixed at boot
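
For comparison, here is how the sequence-lock idea might be written in portable C++ with `std::atomic`. It is a sketch, not the kernel's implementation: the names `Calibration`, `CalibrationSeqlock`, `publish` and `cycles_to_ns` are invented here, a single writer is assumed, and the cycle count is passed in rather than read from the TSC.

```cpp
#include <atomic>
#include <cstdint>

// Hypothetical calibration record, mirroring the fields named on the slide.
struct Calibration {
  std::uint64_t cycle_last, mult, base;
  std::uint32_t shift;
};

class CalibrationSeqlock {
 public:
  // Single writer assumed. Odd seq = update in progress, even = stable.
  void publish(const Calibration& c) {
    const std::uint32_t s = seq_.load(std::memory_order_relaxed);
    seq_.store(s + 1, std::memory_order_relaxed);
    std::atomic_thread_fence(std::memory_order_release);
    cycle_last_.store(c.cycle_last, std::memory_order_relaxed);
    mult_.store(c.mult, std::memory_order_relaxed);
    base_.store(c.base, std::memory_order_relaxed);
    shift_.store(c.shift, std::memory_order_relaxed);
    seq_.store(s + 2, std::memory_order_release);
  }

  // Readers never block the writer; they retry if they raced with an update.
  std::uint64_t cycles_to_ns(std::uint64_t cycles) const {
    for (;;) {
      const std::uint32_t before = seq_.load(std::memory_order_acquire);
      if (before & 1) continue;  // writer is mid-update
      const Calibration c{cycle_last_.load(std::memory_order_relaxed),
                          mult_.load(std::memory_order_relaxed),
                          base_.load(std::memory_order_relaxed),
                          shift_.load(std::memory_order_relaxed)};
      std::atomic_thread_fence(std::memory_order_acquire);
      if (seq_.load(std::memory_order_relaxed) != before) continue;  // torn
      return ((cycles - c.cycle_last) * c.mult + c.base) >> c.shift;
    }
  }

 private:
  std::atomic<std::uint32_t> seq_{0};
  std::atomic<std::uint64_t> cycle_last_{0}, mult_{1}, base_{0};
  std::atomic<std::uint32_t> shift_{0};
};
```

The data fields are atomics read with relaxed ordering so that a racing read is not undefined behaviour; the fences are what tie them to the sequence counter.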

The whole process


std::chrono::steady_clock::now()
  → __clock_gettime(CLOCK_MONOTONIC)
    → vDSO (no syscall!)
      seq lock { rdtsc + calibration maths }
    ← struct timespec
  ← time_point{timespec to nanos}

Function calls & a magic instruction.

No syscall!

How long does it take?


/// hide
#include <chrono>
#include <limits>
#include <algorithm>
#include <print>
/// unhide
int main() {
  using clock = std::chrono::steady_clock;
  auto last = clock::now();
  auto minDelta = std::numeric_limits<
      clock::duration>::max();
  for (int i = 0; i < 5'000; ++i) {
    auto now = clock::now();
    auto delta = now - last;
    minDelta = std::min(minDelta, delta);
    last = now;
  }
  std::print("Min = {}\n", minDelta);
}

Min = 0ns

How long does it take?


/// hide
#include <chrono>
#include <limits>
#include <algorithm>
#include <print>
/// unhide
int main() {
  using clock = std::chrono::steady_clock;
  auto last = clock::now();
  auto minDelta = std::numeric_limits<
      clock::duration>::max();
  for (int i = 0; i < 5'000; ++i) {
    auto now = clock::now();
    auto delta = now - last;
    minDelta = std::min(minDelta, delta);
    last = now;
  }
  std::print("Min = {}\n", minDelta);
}

How long does it take?


/// hide
#include <chrono>
#include <limits>
#include <algorithm>
#include <print>
using namespace std::literals;
/// unhide
int main() {
  using clock = std::chrono::steady_clock;
  auto last = clock::now();
  auto minDelta = 1'000'000'000ns;
  for (int i = 0; i < 5'000; ++i) {
    auto now = clock::now();
    auto delta = now - last;
    minDelta = std::min(minDelta, delta);
    last = now;
  }
  std::print("Min = {}\n", minDelta);
}

20-30ns!
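
The earlier `Min = 0ns` is consistent with `std::numeric_limits` having no specialization for duration types: the primary template's `max()` returns a value-initialized `T`, which for a duration is zero. A small sketch of the pitfall and the member function that avoids it (the helper names are illustrative only):

```cpp
#include <chrono>
#include <limits>

using steady_duration = std::chrono::steady_clock::duration;

// The pitfall: the unspecialized numeric_limits<T>::max() yields T(),
// i.e. a zero duration, so it is useless as a starting "minimum".
inline bool limits_max_is_zero() {
  return std::numeric_limits<steady_duration>::max() ==
         steady_duration::zero();
}

// The intended value: duration has its own static max().
inline steady_duration largest_duration() { return steady_duration::max(); }
```

Seeding `minDelta` with `clock::duration::max()` would be an alternative to the `1'000'000'000ns` literal used on the slide.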

Let's get benchmarking!

A US Coast and Geodetic Survey BENCH MARK disc set into red brick

Photo: Elliott R. Plack, CC0, via Wikimedia Commons (Harbor East Geodetic Disk).

When is `now()`?

Spaceballs (1987) © Brooksfilms/MGM. Used under fair use for commentary/education. danoshinsky.com

Benchmarking


/// hide
#include <span>
#include <array>
#include <chrono>
namespace sc = std::chrono;
/// unhide
constexpr std::array data {1, 2, 3, 4, 5};
static int sum(std::span<const int> v) {
  int total = 0;
  for (auto x : v) total += x;
  return total;
}

auto benchmark() {
  auto start = sc::steady_clock::now();
  int result = sum(data);
  return sc::steady_clock::now() - start;
}


error: unused variable 'result' [-Werror=unused-variable]
|     int result = sum(data);
|         ^~~~~~

Benchmarking


/// hide
#include <span>
#include <array>
#include <chrono>
namespace sc = std::chrono;
/// unhide
constexpr std::array data {1, 2, 3, 4, 5};
static int sum(std::span<const int> v) {
  int total = 0;
  for (auto x : v) total += x;
  return total;
}

auto benchmark() {
  auto start = sc::steady_clock::now();
  [[maybe_unused]] int result = sum(data);
  return sc::steady_clock::now() - start;
}

`volatile` "fix"


/// hide
#include <span>
#include <array>
#include <chrono>
namespace sc = std::chrono;
/// unhide
constexpr std::array data {1, 2, 3, 4, 5};
static int sum(std::span<const int> v) {
  int total = 0;
  for (auto x : v) total += x;
  return total;
}

auto benchmark() {
  auto start = sc::steady_clock::now();
  [[maybe_unused]] volatile int result = sum(data);
  return sc::steady_clock::now() - start;
}

Data hiding


/// hide
#include <span>
#include <array>
#include <chrono>
namespace sc = std::chrono;
/// unhide
/*constexpr*/ std::array data {1, 2, 3, 4, 5};
static int sum(std::span<const int> v) {
  int total = 0;
  for (auto x : v) total += x;
  return total;
}

auto benchmark() {
  auto start = sc::steady_clock::now();
  [[maybe_unused]] volatile int result = sum(data);
  return sc::steady_clock::now() - start;
}

Why volatile isn't the right answer

Accesses through volatile glvalues are evaluated strictly according to the rules of the abstract machine.

[intro.abstract] Section 8 - the "as-if" rule

Why volatile isn't the right answer

The order of volatile operations cannot change relative to other volatile operations, but may change relative to non-volatile operations.

P1152R0 Deprecating volatile (JF Bastien)

A bigger problem?


/// hide
#include <span>
#include <array>
#include <chrono>
namespace sc = std::chrono;
constexpr std::array data {1, 2, 3, 4, 5};
static int sum(std::span<const int> v) {
  int total = 0;
  for (auto x : v) total += x;
  return total;
}

/// unhide
auto benchmark() {
  auto start = sc::steady_clock::now();
  [[maybe_unused]] volatile int result = sum(data);
  return sc::steady_clock::now() - start;
}

A bigger problem?


/// hide
#include <span>
#include <array>
#include <chrono>
namespace sc = std::chrono;
constexpr std::array data {1, 2, 3, 4, 5};
static int sum(std::span<const int> v) {
  int total = 0;
  for (auto x : v) total += x;
  return total;
}

/// unhide
auto benchmark() {
  // Is this a valid transform?
  [[maybe_unused]] volatile int result = sum(data);

  auto start = sc::steady_clock::now();
  return sc::steady_clock::now() - start;
}

A bigger problem?

...I had embarrassingly neglected the possibility that the compiler would reorder the calculation out of the timing region...
Without decimating the as-if rule, there appears to be no way to normatively require such timings to be correct. Nevertheless, timing a block of code or an algorithm is not devoid of meaning.

P0342R0 Timing barriers, Mike Spertus

Interlude

The closed front curtain of the Bolshoi Theatre, deep red with gold detailing

Bolshoi Theatre curtain: Bgelo777, CC BY-SA 4.0, via Wikimedia Commons.

GCC inline asm syntax

Interlude: GCC inline asm syntax


asm <optionally volatile> (
  "template string %0, %1 ..."
  : outputs
  : inputs
  : clobbers
);

Interlude: Constraints

"type"(expression)

Types

"r": register
"m": memory
"r,m": reg or mem
"i": immediate
"g": anything
"a", "b", …: specific reg

Modifiers

=: write only
+: read/write
&: early clobber

e.g. "=r"(dest), "+m"(buf)
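
As a small illustration of the `+` modifier, an empty template with a read/write register operand tells the compiler the value may have been changed, even though no instruction is emitted. This is a sketch for GCC and Clang only; `opaque_int` is a hypothetical name.

```cpp
// GCC/Clang extension: extended inline asm.
// "+r"(value) means: value is read AND written, and lives in a register.
// The template is empty, so nothing happens at run time, but the optimizer
// can no longer assume it knows what value holds afterwards.
inline int opaque_int(int value) {
  asm volatile("" : "+r"(value));
  return value;
}
```

At run time `opaque_int(x)` returns `x` unchanged; its only effect is on what the optimizer is allowed to assume.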

Interlude: GCC inline asm syntax


/// hide
#include <cstdint>
  void test() {
/// unhide
uint64_t source = 1234;
uint64_t dest;
asm /*volatile*/ (
  "mov %1, %0"   // AT&T syntax
  : "=r" (dest)  // outputs
  : "r" (source) // inputs
  :              // no clobbers
);
/// hide
}

Interlude: GCC inline asm syntax

GCC's optimizers discard asm statements if there is no need for the output variables. The optimizers may move code out of loops if the code always returns the same result. Using the volatile qualifier disables these optimizations.

DoNotOptimize


/// hide
#include <array>
/// unhide
template<typename T>
void DoNotOptimize(const T &value) {
  asm volatile(
    ""              // No instructions at all!
    :               // no outputs
    : "r,m" (value) // input
    :               // no clobbers
  );
}
///hide

std::array data {1, 2, 3, 4, 5};

void benchmark_sum() {
  int total = 0;
  for (auto x : data) total += x;

  DoNotOptimize(total);
}

DoNotOptimize


/// hide
#include <array>
#include <chrono>
namespace sc = std::chrono;
/// unhide
template<typename T>
void DoNotOptimize(const T &value) {
  asm volatile("" : : "r,m" (value));
}
/// hide
std::array data {1, 2, 3, 4, 5};
static void benchmark_sum() {
  int total = 0;
  for (auto x : data) total += x;

  DoNotOptimize(total);
}
/// unhide
auto benchmark_sum_many() {
  auto now = sc::steady_clock::now();
  for (int i = 0; i < 16; ++i) {
    benchmark_sum();
  }
  return sc::steady_clock::now() - now;
}

`ClobberMemory()`


/// hide
#include <array>
#include <chrono>
namespace sc = std::chrono;
template<typename T>
void DoNotOptimize(const T &value) {
  asm volatile("" : : "r,m" (value));
}
std::array data {1, 2, 3, 4, 5};
static void benchmark_sum() {
  int total = 0;
  for (auto x : data) total += x;

  DoNotOptimize(total);
}
/// unhide
inline void ClobberMemory() {
  asm volatile("" : : : "memory");
}
auto benchmark_sum_many() {
  auto now = sc::steady_clock::now();
  for (int i = 0; i < 16; ++i) {
    ClobberMemory();
    benchmark_sum();
  }
  return sc::steady_clock::now() - now;
}

Without ClobberMemory, the entire function compiles to `ret`. buf is local + never read, so every memset is a dead store, the loop disappears.

DoNotOptimize (DNO) doesn't help here: memset returns dst but the interesting output is the side effect, not the return value. DNO(buf) would force buf to live in memory but stores still coalesce across iterations (each one dead w.r.t. the next).

ClobberMemory says "treat any memory I might have touched as observed at this point" — stores must complete. Empty asm, no operands, just :"memory".

Conceptual divide:
- DNO is anchored on a VALUE — "compute this and don't elide it"
- ClobberMemory is anchored on a POINT IN TIME — "memory ops before me must complete"

Pair these and you have the Google Benchmark / Folly primitive set.
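
The notes above refer to a memset into a local `buf`, which the slide's code does not show. A sketch of what such a case might look like, with the two primitives paired, follows. `fill_many` and its buffer size are assumptions made for illustration; GCC or Clang is required for the inline asm.

```cpp
#include <cstring>

template <typename T>
inline void DoNotOptimize(const T& value) {
  asm volatile("" : : "r,m"(value));
}

inline void ClobberMemory() {
  asm volatile("" : : : "memory");
}

// Hypothetical workload: repeatedly fill a local buffer.
inline int fill_many(int iterations) {
  char buf[64];
  char* p = buf;
  // Anchored on a VALUE: the asm has now "seen" the address of buf,
  // so the compiler must assume later asm statements may read through it.
  DoNotOptimize(p);
  int done = 0;
  for (; done < iterations; ++done) {
    std::memset(p, done & 0xff, sizeof buf);
    // Anchored on a POINT IN TIME: anything reachable from asm may be
    // read here, so this iteration's stores cannot be dropped as dead.
    ClobberMemory();
  }
  return done;
}
```

Whether the stores survive is best confirmed by looking at the generated assembly for the compiler and flags in use.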

The standardisation gap?

P0342R0 (Spertus, 2016): timing_fence() - rejected
- "If the timing fence is inside now(), and now() is in another TU, how does the compiler know there is a fence?"
P0412R0 (Maltsev, 2016): keep() / touch()
- Solves elimination. Stalled at R0.
Rust: std::hint::black_box. Zig: mem.doNotOptimizeAway.
C++: …

When will then be `now()`?

Spaceballs (1987) © Brooksfilms/MGM. Used under fair use for commentary/education. danoshinsky.com

Hardware counters

High-resolution die shot of an Intel Pentium 4 Prescott processor

Pentium 4 Prescott die: Martijn Boer, Public Domain, via Wikimedia Commons.

When we say "now", when do we mean?