Observability: Logs, Metrics & Traces
Your service is misbehaving. The dashboard says response times are up, but it doesn't say why. You SSH into a box and start tailing logs, but you don't even know which of your six services is the slow one. You're poking at a black box, hoping to bump into the problem. That feeling - knowing something is wrong but having no way to ask the system what - is what observability exists to fix.
Here's the reframe that makes the whole topic click: a running system is constantly throwing off three different kinds of evidence about itself. Logs are the diary of individual events. Metrics are the numbers it counts over time. Traces are the story of one request's whole journey. None of them is "the right one" - each sees something the others can't. Once you know what each kind is best at, you stop guessing and start asking precise questions: which service, how slow, and why. This guide gets you there.
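To make the three kinds of evidence concrete, here is a minimal sketch of a single request emitting all three at once. Everything in it is hypothetical and chosen for illustration: the `handle_request` function, the `/checkout` route, the `db.query` span name, and the in-memory structures standing in for a real logging pipeline, metrics client, and tracing SDK.

```python
import json
import time
import uuid
from collections import Counter

# Hypothetical in-memory stand-ins for real backends (assumptions for illustration).
request_count = Counter()   # metric: a counter keyed by (route, status)
latencies_ms = []           # metric: raw observations a histogram would bucket
spans = []                  # trace: timed steps that share one trace_id


def record_span(trace_id, name, start, end):
    """Store one span: a named, timed step belonging to a single request."""
    spans.append({
        "trace_id": trace_id,
        "name": name,
        "duration_ms": (end - start) * 1000,
    })


def handle_request(route):
    """Serve one fake request and emit a log line, metrics, and spans for it."""
    trace_id = uuid.uuid4().hex
    start = time.perf_counter()

    # Pretend downstream call, recorded as a child step of the request.
    db_start = time.perf_counter()
    time.sleep(0.01)
    record_span(trace_id, "db.query", db_start, time.perf_counter())

    end = time.perf_counter()
    record_span(trace_id, route, start, end)

    # Metrics: aggregate numbers with no per-request detail.
    request_count[(route, 200)] += 1
    latencies_ms.append((end - start) * 1000)

    # Log: one discrete event, carrying the trace_id so it can be joined to the trace.
    print(json.dumps({
        "level": "info",
        "msg": "request handled",
        "route": route,
        "status": 200,
        "trace_id": trace_id,
    }))
    return trace_id


trace_id = handle_request("/checkout")
```

Notice what each structure can and cannot tell you afterward. The counter knows how many requests hit `/checkout` but nothing about any single one; the spans know where this one request spent its time; the log line records what happened, and its `trace_id` is the thread that ties it back to the trace.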
How to read this
- Already drowning in a slowdown right now? Jump to Phase 3: Putting Them Together - it walks the exact metric → trace → log path from "something's slow" to "here's the line of code."
- Want observability to finally make sense? Read in order. We start with the mental model (Phase 1), then meet the three kinds of evidence (Phase 2), then use them together (Phase 3).
The phases
- Phase 1: Monitoring vs Observability - the core distinction: monitoring watches the things you already knew to watch; observability lets you ask new questions about a misbehaving system without shipping new code to answer them.
- Phase 2: The Three Pillars - logs (discrete events), metrics (numbers over time: counters, gauges, histograms), and traces (one request across services, broken into spans). What each is genuinely best at, and where each falls down.
- Phase 3: Putting Them Together - debugging a real slowdown end to end (metric alert → trace finds the slow service → logs explain why), a quick map of the tool landscape, and the two traps that bite teams: cardinality explosions and alert fatigue.
This guide stays at the level of concepts and how they fit together. The hands-on, tool-specific deep dives live in their own guides: reading logs line by line, Prometheus and Grafana for metrics, and reading a Dynatrace trace. Read this first to get the map; read those when you're in a specific tool.