Reading Graylog (Log Search & Streams)
On your laptop, when something breaks, you open one file and grep for the error. That works because
there's one file. Now picture the same problem in production: a dozen containers, three app servers, two
load balancers, a queue worker - and the one log line that explains the outage is sitting on whichever
box happened to handle that one unlucky request. You can't ssh into all of them and grep in parallel
while the pager is going off. That's the moment people discover they're drowning, not because there are
too many logs, but because the logs are scattered.
Here's the relief: tools like Graylog (and the very similar ELK and OpenSearch stacks) do one
profound thing - they ship every log line from every machine into one place and put a search box on top.
The skill you already have for reading a single log file still applies; what changes is that now your
grep reaches across the entire fleet at once, you can scope it to a five-minute window, and you can
follow a single request as it bounces between services. This guide gives you that skill, plus the mental
model underneath it, so the search box stops feeling like a slot machine.
⏭️ New to reading logs at all? Start with Reading Logs Without Drowning - what a log line is, what the levels mean, how to follow one request. This guide is the centralized, many-servers version of that skill.
How to read this
- Mid-incident, need to find the failing request right now? Jump to Phase 2: Searching Effectively and use the cheat-card at the top.
- Want centralized logging to finally make sense? Read in order - Phase 1 installs the mental model (one search box over everything, structured fields vs. raw text), and the rest builds on it.
The phases
- Why Centralized Logs - why grep on one box stops working across a fleet, what Graylog/ELK actually collect and where, and the two ideas everything rests on: one search box over everything, and structured fields vs. raw text.
- Searching Effectively - the query model: field:value searches, time-range scoping (your #1 lever), boolean operators, following one request by its correlation id, and reading the histogram to find the spike.
- Streams, Dashboards & Alerts - routing subsets of logs into streams (e.g. just prod ERRORs), saving dashboards you can glance at, and alerting on log conditions so the system pages you instead of you discovering the fire by accident.
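To make the two core ideas concrete before you dive in, here is a small Python sketch of what "one search box over structured fields" amounts to. It is not Graylog's API or query syntax: the records, the field names (source, level, request_id) and the search helper are all hypothetical, chosen only to mimic a field:value search scoped to a time window, followed by tracing one request via its correlation id.

```python
from datetime import datetime, timedelta

# Hypothetical structured log records, as a central store might hold them
# after collecting from several machines. All field names are illustrative.
LOGS = [
    {"timestamp": datetime(2024, 5, 1, 12, 0, 5), "source": "app-1", "level": "INFO",
     "request_id": "req-41", "message": "GET /cart 200"},
    {"timestamp": datetime(2024, 5, 1, 12, 1, 10), "source": "lb-1", "level": "INFO",
     "request_id": "req-42", "message": "routing POST /checkout to app-2"},
    {"timestamp": datetime(2024, 5, 1, 12, 1, 11), "source": "app-2", "level": "ERROR",
     "request_id": "req-42", "message": "payment gateway timeout"},
    {"timestamp": datetime(2024, 5, 1, 12, 1, 12), "source": "worker-1", "level": "WARN",
     "request_id": "req-42", "message": "retry scheduled for order"},
    {"timestamp": datetime(2024, 5, 1, 12, 30, 0), "source": "app-1", "level": "ERROR",
     "request_id": "req-77", "message": "cache miss storm"},
]


def search(logs, start, end, **fields):
    """Return records inside [start, end] whose fields all equal the given values."""
    return [
        rec for rec in logs
        if start <= rec["timestamp"] <= end
        and all(rec.get(name) == value for name, value in fields.items())
    ]


window_start = datetime(2024, 5, 1, 12, 0, 0)
window_end = window_start + timedelta(minutes=5)

# Roughly "level:ERROR", scoped to a five-minute window.
errors = search(LOGS, window_start, window_end, level="ERROR")

# Follow one request across machines by its correlation id.
trail = search(LOGS, window_start, window_end, request_id=errors[0]["request_id"])
for rec in trail:
    print(rec["timestamp"].time(), rec["source"], rec["level"], rec["message"])
```

The point of the sketch is the shape of the workflow: narrow by time first, filter on a field rather than on raw text, then pivot on the request id to see every machine that touched the failing request.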
This guide stays at the level of reading and searching centralized logs. The deeper operational side - running the cluster, designing index/retention policies, parsing pipelines, and wiring logs together with metrics and traces - is its own topic. For where logs sit in the bigger picture, see Observability: Logs, Metrics & Traces.