The Warehouse — A Database Built for Analytics
You already know what a database is — the thing behind your app that stores users, orders, and sessions. So when someone says "data warehouse," it's tempting to picture the same thing, just bigger. That picture will quietly lead you wrong, and it's worth fixing before anything else.
A warehouse is a database. But it's tuned for a completely different job: answering big analytical questions across your whole history, fast. Once you see what that job demands, every design choice a warehouse makes — and every bill it hands you — starts to make sense.
Two different jobs: running the app vs. understanding the app
What it actually is. There are two fundamentally different kinds of database work, and the industry has names for them.
📝 OLTP (Online Transaction Processing) is your app's database: lots of tiny, fast operations — "insert this order," "fetch user 4827," "update this row." It cares about reading and writing individual records quickly and safely.
📝 OLAP (Online Analytical Processing) is the warehouse's job: a few enormous questions over millions or billions of rows — "total revenue per country per month for the last three years." It cares about scanning and aggregating huge ranges quickly.
Why people get this wrong. The instinct is "just run the analytics query on the app database." For a
while you can, and then one day a SELECT SUM(...) ... GROUP BY ... over the whole orders table locks
things up and the app slows for real users. The two jobs compete for the same machine. A warehouse exists
so the heavy analytical questions live somewhere they can't hurt production — on hardware shaped for exactly
that work.
OLTP (your app DB) OLAP (the warehouse)
───────────────── ────────────────────
"fetch order #4827" "sum revenue by month, 3 years"
millions of tiny reads/writes a few huge scans + aggregations
row-by-row, low latency column-by-column, high throughput
optimized for: keep the app running optimized for: answer big questions
Schema-on-write: structure decided up front
What it actually is. A warehouse is schema-on-write: you define the structure — table names, columns, types — before the data goes in. Data that doesn't fit the shape gets rejected or transformed to fit at load time. The structure is the price of admission.
What it does in real life. When you load data, you're committing to a contract: this column is a
DATE, that one is an INTEGER, this one can't be null. Once it's in, every query can trust that shape.
$ bq query --use_legacy_sql=false \
'SELECT country, SUM(amount) AS revenue
FROM `shop.analytics.orders`
WHERE order_date >= "2026-01-01"
GROUP BY country
ORDER BY revenue DESC'
+---------+-----------+
| country | revenue |
+---------+-----------+
| US | 1842300.5 |
| DE | 612885.0 |
| JP | 498120.0 |
+---------+-----------+
What just happened: Because every row in orders was forced into the same known structure on the way in,
the warehouse could scan the whole table and aggregate it with confidence — no guessing what amount means
or whether country is even there. That up-front discipline is what makes the query both fast and
trustworthy. (This example uses BigQuery's bq CLI; Snowflake and Redshift run the same SQL through their
own clients.)
⚠️ Gotcha — schema-on-write makes change slow. The flip side of that contract: when the business adds a new field, or a source system starts sending data in a new shape, you have to evolve the schema and often backfill or reload. Structure that's great for querying is friction when reality shifts. Hold onto this tension — it's exactly what the lake in Phase 2 trades away.
Columnar storage: why aggregations are fast
What it actually is. This is the single most important idea in the phase, so it earns its own section. A
normal app database stores data row by row — all of order #4827's fields sit together on disk. A
warehouse stores data column by column — every amount sits together, every country sits together.
Why people get this wrong. It sounds like a trivial implementation detail. It isn't — it's the whole
reason warehouses are fast at analytics. Picture the question "what's the total amount?"
ROW STORE (app DB) COLUMN STORE (warehouse)
────────────────── ────────────────────────
[id|date|country|amount] id : 1, 2, 3, 4, ...
[id|date|country|amount] date : ..., ..., ...
[id|date|country|amount] country: US, DE, JP, ...
[id|date|country|amount] amount : 30, 12, 9, 45, ... ◄── read ONLY this
to sum amount you must touch to sum amount you read one
every row (and every column) tightly-packed column
What it does in real life. To total amount, a column store reads only the amount column and skips
everything else — far less data off disk. And because every value in a column is the same type and often
similar (lots of repeated countries, dates in order), it compresses extremely well, shrinking what has to
be read even further.
💡 Key point. Columnar storage is the reason a warehouse can aggregate over billions of rows quickly. It's not magic and it's not just "more servers" — it's that analytical queries usually touch a few columns across many rows, and column storage is built for precisely that access pattern.
Why this saves you later. When a warehouse query over a giant table returns in seconds and the "same" query on your app database would crawl, you'll know it's not luck. And you'll understand the trade-off: column stores are slow at the OLTP job (fetching or updating one whole row means touching many separate columns), which is exactly why you don't run your app on one.
What it costs you
A warehouse isn't free, and the costs are worth naming honestly so the bill never surprises you.
- Money, often per query or per compute-time. Managed warehouses like BigQuery, Snowflake, and Redshift
typically charge for the compute a query uses (and/or the data it scans). A careless
SELECT *over a huge table, or a dashboard auto-refreshing every minute, can run up real cost. Concrete pricing changes often and varies by vendor — check the current pricing page before you commit; don't trust a number you half-remember. - Rigidity. Schema-on-write means structural change has a tax (see the gotcha above).
- It wants modeled, clean data. A warehouse is happiest with structured, well-shaped tables. Raw, messy, or unstructured data (images, logs, free text, half-known JSON) is an awkward fit — which is the exact gap the lake fills next.
🪖 War story. A team I know pointed a popular BI tool at their warehouse and left the dashboards on auto-refresh for a launch. The dashboards were lovely. The end-of-month bill was not — every viewer refresh had been firing fresh scans over the largest table they had. The fix was caching and scheduled refreshes, but the lesson stuck: in a pay-per-query warehouse, who runs what, how often is a cost decision, not just a UX one.
Recap
- A warehouse is a database tuned for OLAP (big analytical questions), not OLTP (your app's tiny reads and writes) — so heavy analytics can't slow down production.
- It's schema-on-write: structure is decided up front, which makes queries fast and trustworthy but makes change slow.
- Columnar storage is why aggregations over huge tables are fast — it reads only the columns a query needs and compresses them well.
- The costs are money (often per query/compute), rigidity, and a poor fit for raw or unstructured data.
That last cost — the awkwardness with raw, messy, everything-data — is the reason the data lake exists. That's next.
← Guide overview · Phase 2: The Lake (and Lakehouse) →
Check your understanding 3 questions
1. What is the difference between OLTP and OLAP?
2. Why is columnar storage the reason warehouses are fast at analytics?
3. What does schema-on-write mean for a warehouse, and its downside?