Testing, Benchmarks & Profiling - Proving It Works and Finding the Slow Part

Q: In a benchmark, why do you wrap the code under test in `for i := 0; i < b.N; i++`?

Go runs the operation b.N times - a count it picks automatically - and divides, so a single fast call isn't swamped by clock noise

Q: Your test suite reports 100% coverage. What does that actually guarantee?

Every line of code executed at least once during the tests - nothing about whether results were checked

Q: You suspect a function is your bottleneck. What does pprof give you that a benchmark doesn't?

It samples the real run and ranks where time and memory actually went, so you optimize the proven hot spot instead of your hunch

Back in Phase 8 you saw the shape of a Go test: a function named TestXxx(t *testing.T), a plain if, and a t.Errorf when reality disagreed with you. That's enough to write your first test. It is not enough to test a real codebase without drowning.

This phase is the rest of the iceberg. The same testing package - no plugins, no DSL - also gives you a clean way to run dozens of cases through one function, a way to measure how fast and how memory-hungry your code is, and a way to find the slow part instead of guessing at it. The mental model to carry through all of it: the Go toolchain refuses to let you operate on vibes. It pushes you to write the cases down, measure the numbers, and look at where the time actually went. Once you internalize that, "is it correct?" and "is it fast?" stop being arguments and start being commands you run.

Table-driven tests - the idiomatic Go pattern

What it actually is. A table-driven test is a single test function that holds a slice of test cases and loops over them, running the same assertion logic against each one. Each case is a little struct: some inputs, the expected output, and a name. Instead of copy-pasting the same three lines for every scenario, you add a row to the table.

📝 Table-driven test - a test where the cases live in data (a slice of structs), and one loop runs every case through the same check. Adding a scenario means adding a row, not a new function.

Why Go prefers this over assertion libraries. Coming from other languages, you might expect assertThat(x).isEqualTo(6) or expect(x).toBe(6). Go deliberately doesn't ship that, and the community mostly doesn't want it. The reasoning: an assertion library is a second little language you have to learn, and its failure messages are written by someone else. A table-driven test is just Go - a slice, a for loop, an if. The data and the logic are separated, so the table reads like a specification you could hand to a stranger, and you control exactly what a failure says.

A real example. Say we're testing a Clamp function that pins a number into a [min, max] range:

// clamp.go
package mathx

func Clamp(n, min, max int) int {
	if n < min {
		return min
	}
	if n > max {
		return max
	}
	return n
}

// clamp_test.go
package mathx

import "testing"

func TestClamp(t *testing.T) {
	cases := []struct {
		name           string
		n, min, max    int
		want           int
	}{
		{"inside range", 5, 0, 10, 5},
		{"below min", -3, 0, 10, 0},
		{"above max", 99, 0, 10, 10},
		{"equal to min", 0, 0, 10, 0},
	}

	for _, tc := range cases {
		t.Run(tc.name, func(t *testing.T) {
			got := Clamp(tc.n, tc.min, tc.max)
			if got != tc.want {
				t.Errorf("Clamp(%d, %d, %d) = %d; want %d",
					tc.n, tc.min, tc.max, got, tc.want)
			}
		})
	}
}

$ go test -v ./...
=== RUN   TestClamp
=== RUN   TestClamp/inside_range
=== RUN   TestClamp/below_min
=== RUN   TestClamp/above_max
=== RUN   TestClamp/equal_to_min
--- PASS: TestClamp (0.00s)
    --- PASS: TestClamp/inside_range (0.00s)
    --- PASS: TestClamp/below_min (0.00s)
    --- PASS: TestClamp/above_max (0.00s)
    --- PASS: TestClamp/equal_to_min (0.00s)
PASS
ok  	example/mathx	0.004s

What just happened: We declared an anonymous struct slice - the table - with one row per scenario, each carrying a human name, inputs, and the expected result. The for loop walked every row and ran the same check against it. Adding a fifth case ("negative range", say) is one new line, not a whole new function. The -v flag printed each case as it ran, and because each row got its own t.Run, the output is a neat tree, not a wall of undifferentiated PASSes.

💡 Key point. The win isn't just less typing - it's that the behavior under test is now a visible list. A reviewer can scan the table and immediately ask "where's the case for min > max?" That question is much harder to ask when each scenario is buried in its own function.

Subtests, helpers, and parallelism

The t.Run(name, func) you just used is worth understanding on its own, because it unlocks three things.

Subtests (t.Run). Each t.Run is an independent sub-test with its own name and its own pass/fail. Crucially, you can run just one without running the rest - the names are addressable:

$ go test -run TestClamp/below_min -v ./...
=== RUN   TestClamp
=== RUN   TestClamp/below_min
--- PASS: TestClamp (0.00s)
    --- PASS: TestClamp/below_min (0.00s)
PASS

What just happened: -run takes a regular expression matched against test names, and the / lets you reach inside a parent into one subtest. When a single table row is failing, this is how you re-run only that row in a tight loop instead of re-running the whole suite.

Helpers (t.Helper). When you extract repeated assertion logic into a helper function, call t.Helper() at the top of it. This tells the testing package "I'm a helper - when something fails in here, blame the caller's line, not mine."

func assertEqual(t *testing.T, got, want int) {
	t.Helper() // failures report the caller's line number, not this one
	if got != want {
		t.Errorf("got %d; want %d", got, want)
	}
}

What just happened: Without t.Helper(), every failure points at the t.Errorf line inside assertEqual - useless, because all your failures look like they came from the same place. With it, the failure points at the line in your actual test that called the helper, which is the line you need to fix.

Parallelism (t.Parallel). Calling t.Parallel() inside a subtest signals that it's safe to run alongside other parallel tests, which can speed up an I/O-heavy suite. ⚠️ But parallel tests share the process: if two of them touch the same global variable or the same file, you've created a race. Reach for t.Parallel() only when each case is genuinely independent, and run with the race detector (go test -race) to catch the cases where it isn't.

Benchmarks - measuring speed and allocations

Correctness is one question. Speed is a different one, and Go answers it with the same testing package.

📝 Benchmark - a function named BenchmarkXxx(b *testing.B) that the toolchain runs many times to measure how long one operation takes. The key piece is the loop for i := 0; i < b.N; i++: you put the code-under-test inside it, and Go chooses b.N for you, dialing it up until the timing is statistically stable.

Why the b.N loop is shaped that way. You can't time a single function call reliably - it's over in nanoseconds, swamped by clock noise. So Go runs your operation b.N times (maybe millions), measures the total, and divides. You never set b.N yourself; the framework starts small, sees how long that took, and scales up until it trusts the average. Your only job is to make sure the loop body does exactly the work you want to measure.

A real example. Let's benchmark two ways of building a string from a slice - naive += concatenation versus strings.Builder:

// build_test.go
package strbuild

import (
	"strings"
	"testing"
)

var parts = []string{"a", "b", "c", "d", "e", "f", "g", "h"}

func BenchmarkConcat(b *testing.B) {
	for i := 0; i < b.N; i++ {
		s := ""
		for _, p := range parts {
			s += p // re-allocates the whole string every time
		}
		_ = s
	}
}

func BenchmarkBuilder(b *testing.B) {
	for i := 0; i < b.N; i++ {
		var sb strings.Builder
		for _, p := range parts {
			sb.WriteString(p)
		}
		_ = sb.String()
	}
}

$ go test -bench=. -benchmem ./...
goos: linux
goarch: amd64
pkg: example/strbuild
cpu: Intel(R) Core(TM) i7-9750H CPU @ 2.60GHz
BenchmarkConcat-12     8123140    142.6 ns/op    104 B/op    7 allocs/op
BenchmarkBuilder-12   19543210     61.3 ns/op     56 B/op    3 allocs/op
PASS
ok  	example/strbuild	2.913s

What just happened: go test -bench=. ran every benchmark (. is a regex matching all of them), and -benchmem added the memory columns. Read the output right to left for the story:

ns/op - nanoseconds per operation. Builder is ~61 ns vs Concat's ~143 ns: more than twice as fast.
B/op - bytes allocated per operation. Builder allocates 56 bytes; Concat allocates 104.
allocs/op - number of heap allocations per operation. This is the killer column: Concat does 7 allocations (one per +=, because strings are immutable so each += builds a brand-new string), while Builder does 3.

The -12 suffix on each name is the value of GOMAXPROCS (the CPU count) the benchmark ran with. The headline lesson is the one the numbers force on you: += in a loop quietly re-allocates the whole string on every iteration, and the benchmark makes that invisible cost visible.

💡 Key point. allocs/op is usually the number to watch first. Allocations create work for the garbage collector (the subject of Phase 14), so cutting allocations often improves not just this benchmark but the whole program's tail latency. A change that halves ns/op but doubles allocs/op may be a bad trade under real load.

Profiling with pprof - measure, don't guess

A benchmark tells you that a function is slow. It doesn't tell you where inside it the time goes. For that, Go has pprof.

📝 pprof - a profiler (and its analysis tool, go tool pprof) that samples your running program to record where it spends CPU time or allocates memory, then lets you inspect the result as a ranked list, a call graph, or annotated source. "Profile" = the recorded data; "pprof" = the tool that reads it.

The mental model: stop guessing. Every engineer has a confident hunch about which function is the bottleneck, and that hunch is wrong often enough to be dangerous. You optimize the function you suspected, ship it, and the program is exactly as slow as before - because the real cost was somewhere you never looked. A profiler replaces the hunch with a measurement: it watches the actual run and reports, in ranked order, where the time and memory truly went. The discipline is simple and non-negotiable: profile first, then optimize the thing the profile points at.

Capturing a profile from a benchmark. The easiest on-ramp is the benchmark you already have - go test can dump a CPU profile while it runs:

$ go test -bench=BenchmarkConcat -cpuprofile=cpu.out ./...
...
$ go tool pprof cpu.out
File: strbuild.test
Type: cpu
Entering interactive mode (type "help" for commands)
(pprof) top5
Showing nodes accounting for 2.31s, 91.3% of 2.53s total
      flat  flat%   sum%        cum   cum%
     1.04s 41.1%  41.1%      1.04s 41.1%  runtime.concatstrings
     0.62s 24.5%  65.6%      0.71s 28.1%  runtime.mallocgc
     0.31s 12.3%  77.9%      0.31s 12.3%  runtime.memmove
     0.21s  8.3%  86.2%      0.21s  8.3%  runtime.nextFreeFast
     0.13s  5.1%  91.3%      2.18s 86.2%  strbuild.BenchmarkConcat
(pprof)

What just happened: -cpuprofile=cpu.out wrote a profile file during the benchmark, and go tool pprof cpu.out opened it interactively. The top5 command ranked functions by flat time - time spent in that function itself, not its callees. The story leaps off the screen: 41% of all CPU went into runtime.concatstrings and another 25% into runtime.mallocgc (the allocator). The profile confirms what the benchmark hinted: this code's cost is string concatenation and the allocations it triggers. cum (cumulative) counts a function plus everything it calls, which is why BenchmarkConcat itself shows a low flat but a high cum.

Profiling a live server. For long-running services you don't use a benchmark - you import net/http/pprof, which registers profiling endpoints on your HTTP server:

import (
	"net/http"
	_ "net/http/pprof" // blank import: registers /debug/pprof/ handlers as a side effect
)

// ... with your server running, profiles are now served under /debug/pprof/

$ go tool pprof http://localhost:8080/debug/pprof/profile?seconds=30

What just happened: The blank import (_) pulls in the package purely for its init() side effect - it wires up handlers under /debug/pprof/. The go tool pprof command then collected 30 seconds of live CPU samples from the running server and dropped you into the same interactive analysis as before. (Inside pprof, web renders a visual call graph and list <func> shows your source annotated with per-line cost - both are worth knowing once you're comfortable with top.)

⚠️ Don't expose /debug/pprof/ on a public interface. It leaks internals and lets anyone trigger expensive profiles. Bind it to localhost, an admin port, or behind authentication.

This sets up the next phase: profiling is the evidence-gathering step. Phase 17 is about what to actually do once the profile tells you where the time is.

Coverage and fuzzing - what tests miss, and finding inputs you didn't

Coverage measures how much of your code your tests actually exercised. The toolchain tracks which lines ran:

$ go test -cover ./...
ok  	example/mathx	0.004s	coverage: 87.5% of statements

$ go test -coverprofile=cover.out ./...
$ go tool cover -html=cover.out   # opens a browser: green = covered, red = not

What just happened: -cover printed a single percentage - the share of statements that ran during the tests. -coverprofile wrote the per-line detail to a file, and go tool cover -html turned it into a color-coded view of your source: green lines were exercised, red lines never ran. The red is the useful part - it shows you the branches your tests forgot.

⚠️ 100% coverage is not "bug-free." This is the trap everyone walks into. Coverage tells you a line executed; it says nothing about whether you checked the result or whether the inputs that break it were ever tried. A test that calls Clamp once and asserts nothing can light up the whole function green. Treat coverage as a map of the untested - chase the red - not as a score to max out. High coverage with weak assertions is worse than honest medium coverage, because it feels safe.

Fuzzing attacks the other blind spot: the inputs you never thought to write down. A fuzz test generates random, mutating inputs and hammers your function, hunting for one that crashes it or breaks a property you asserted.

📝 Fuzzing - automated testing where the framework feeds your function a flood of generated inputs (evolved from "seed" examples) to discover ones that cause a panic or violate an invariant. In Go it's built in: a FuzzXxx(f *testing.F) function.

// reverse_test.go
package strbuild

import "testing"

func Reverse(s string) string {
	r := []rune(s)
	for i, j := 0, len(r)-1; i < j; i, j = i+1, j-1 {
		r[i], r[j] = r[j], r[i]
	}
	return string(r)
}

func FuzzReverse(f *testing.F) {
	f.Add("hello")        // seed corpus: starting examples
	f.Add("Göteborg")     // includes multi-byte runes on purpose
	f.Fuzz(func(t *testing.T, s string) {
		// property: reversing twice returns the original
		if Reverse(Reverse(s)) != s {
			t.Errorf("double reverse changed %q", s)
		}
	})
}

$ go test -fuzz=FuzzReverse
fuzz: elapsed: 0s, gathering baseline coverage: 0/192 completed
fuzz: elapsed: 3s, execs: 412903 (137621/sec), new interesting: 18
...
fuzz: elapsed: 12s, execs: 1843201 (no new interesting), 0 failures
^C

What just happened: f.Add seeded a few starting strings, and f.Fuzz ran a property check - "reversing twice gets the original back" - against a torrent of generated inputs derived from those seeds. We asserted a property rather than specific outputs, because we don't know in advance what inputs Go will invent. If a generated string ever broke the property, Go would stop, shrink it to the smallest failing input, and save it to disk so the failure becomes a permanent regression test. (Run a normal go test and the fuzz function still executes its seed corpus as ordinary cases - fuzzing only does the open-ended generation under -fuzz.)

💡 Key point. Coverage and fuzzing answer opposite questions. Coverage asks "what code did my chosen inputs reach?" Fuzzing asks "what inputs did I never think to choose?" Coverage finds untested code; fuzzing finds untested cases - the empty string, the lone multi-byte rune, the input that overflows. They're complementary, and neither alone makes your code safe.

Recap

Table-driven tests put your cases in a slice of structs and loop one assertion over all of them - adding a scenario is a new row, not a new function. Go prefers this to assertion libraries because it's just Go: readable as a spec, with failure messages you control.
t.Run creates addressable subtests (re-run one with -run TestX/case), t.Helper() makes a helper's failures point at the caller, and t.Parallel() opts a test into concurrent execution (pair it with -race).
Benchmarks (BenchmarkXxx(b *testing.B)) run your code b.N times - a count Go chooses - and go test -bench=. -benchmem reports ns/op, B/op, and the all-important allocs/op.
pprof replaces guesswork with measurement: capture a profile (-cpuprofile from a benchmark, or net/http/pprof on a server), open it with go tool pprof, and let top/list point you at the real hot spot before you optimize anything.
Coverage (go test -cover) maps which lines ran - use it to find the red, never as a score, because ⚠️ 100% coverage with weak assertions proves nothing.
Fuzzing (FuzzXxx(f *testing.F)) generates inputs you'd never write by hand and checks a property; it finds the edge case, shrinks the failure, and saves it as a regression.

You can now prove your Go is correct, measure how fast it is, and find the slow part with evidence instead of instinct. Next we turn that lens on the standard library itself - not as a list of packages, but as a masterclass in Go's design philosophy.

Quick check

Test yourself on the ideas that separate "I ran a test" from "I measured my program":

[
  {
    "q": "In a benchmark, why do you wrap the code under test in `for i := 0; i < b.N; i++`?",
    "choices": [
      "Go runs the operation b.N times - a count it picks automatically - and divides, so a single fast call isn't swamped by clock noise",
      "b.N is a constant you must set to the number of CPU cores",
      "The loop makes the benchmark allocate less memory",
      "It's required syntax with no effect on the measurement"
    ],
    "answer": 0,
    "explain": "A single nanosecond-scale call can't be timed reliably. Go dials b.N up until the total run time is statistically stable, then reports the per-operation average. You never set b.N yourself - you just make the loop body do exactly the work you want measured."
  },
  {
    "q": "Your test suite reports 100% coverage. What does that actually guarantee?",
    "choices": [
      "Every line of code executed at least once during the tests - nothing about whether results were checked",
      "The code has no bugs",
      "Every possible input was tested",
      "All assertions in the tests passed correctly"
    ],
    "answer": 0,
    "explain": "Coverage only measures which lines ran. A test that calls a function and asserts nothing still marks those lines green. Treat coverage as a map of the untested (chase the red); high coverage with weak assertions feels safe but proves almost nothing."
  },
  {
    "q": "You suspect a function is your bottleneck. What does pprof give you that a benchmark doesn't?",
    "choices": [
      "It samples the real run and ranks where time and memory actually went, so you optimize the proven hot spot instead of your hunch",
      "It automatically rewrites the slow function to be faster",
      "It guarantees the function has no allocations",
      "It only works on functions named BenchmarkXxx"
    ],
    "answer": 0,
    "explain": "A benchmark tells you that something is slow; pprof tells you where inside it the time goes. The discipline is to profile first and then optimize the thing the profile points at - because the function you suspected is wrong often enough to waste real effort."
  }
]

← Phase 14: The Runtime: Scheduler, Memory & GC · Guide overview · Phase 16: The Standard Library as Design →

Check your understanding 3 questions

1. In a benchmark, why do you wrap the code under test in `for i := 0; i < b.N; i++`?

2. Your test suite reports 100% coverage. What does that actually guarantee?

3. You suspect a function is your bottleneck. What does pprof give you that a benchmark doesn't?