Joining & Combining
Up to now we've worked one table: the sales DataFrame, all by itself. But real analysis is almost never one
table. Your sales rows know the product name and a units count — but what category is that product?
Who's the supplier? That information lives somewhere else, in a products table, because repeating
"Widgets are Hardware, supplied by Acme" on every single sales row would be wasteful and error-prone. So the
data gets split across tables on purpose, linked by a shared value. This phase is about putting it back
together.
📝 Here's the mental model: combining means matching rows from one table to rows in another using a shared
key. If you've touched SQL, you already know this move by its real name — it's a JOIN, and pandas'
merge is that join, just with DataFrame syntax instead of SELECT ... JOIN ... ON. Same picture, same
join types, same gotchas. If joins have never quite clicked, the calm walkthrough in
SQL Joins, Finally Explained draws the picture in pure SQL terms; everything
there maps one-to-one onto what we're about to do.
We'll keep the running sales dataset, and introduce a second products table to join to it:
=
= *
=
What just happened: we built two tables that share a product column — that shared column is the key
that lets us tie a sales row to its product details. Notice the deliberate mismatch: sales has a Gizmo
that isn't in products, and products has a Sprocket that never sold. Those non-matches are exactly
where join types start to matter.
merge and the join types
The core function is pd.merge. You hand it two DataFrames, tell it the key with on=, and tell it which
rows to keep with how=. Start with the default, how="inner":
=
product units category supplier
0 Widget 10 Hardware Acme
1 Widget 7 Hardware Acme
2 Gadget 4 Electronics Globex
3 Gadget 12 Electronics Globex
What just happened: merge matched each sales row to the products row with the same product, and
glued the category and supplier columns on. The how="inner" means keep only rows that matched on
both sides — so the Gizmo sale (no product entry) and the Sprocket product (no sale) both vanished.
Inner join = the intersection.
Often you don't want rows to disappear — you want every sales row preserved, matched detail where it exists
and blanks where it doesn't. That's a left join (how="left"): keep all of the left table:
=
product units category supplier
0 Widget 10 Hardware Acme
1 Gadget 4 Electronics Globex
2 Widget 7 Hardware Acme
3 Gadget 12 Electronics Globex
4 Gizmo 5 NaN NaN
What just happened: every original sales row survived — including the Gizmo sale. But Gizmo has no
entry in products, so pandas had nothing to fill category and supplier with and put NaN there.
⚠️ This is the thing to internalize about left/right/outer joins: non-matches don't drop the row, they
fill the borrowed columns with NaN. A sudden crop of NaNs after a merge usually means keys that didn't
line up, not missing source data.
The how= options, in one breath:
inner— only rows that match on both sides (the default). The intersection.left— every row from the left table;NaNin the right's columns where there's no match.right— mirror image: every row from the right table;NaNon the left where there's no match.outer— every row from both tables;NaNwherever either side had no match. The union.
An outer join shows both lonely rows at once:
=
product units category
0 Gadget 4.0 Electronics
1 Gadget 12.0 Electronics
2 Gizmo 5.0 NaN
3 Sprocket NaN Hardware
4 Widget 7.0 Hardware
5 Widget 10.0 Hardware
What just happened: the outer join kept everything — the Gizmo sale with no product (category is
NaN) and the Sprocket product with no sale (units is NaN). Notice units turned into floats
(4.0): pandas widens an integer column to float so it can hold NaN, since plain ints can't represent
missing. 💡 Pick the join type by asking "which rows must survive even if they don't match?" — none (inner),
the left ones (left), or all of them (outer).
Choosing the join keys
on="product" works when both tables name the key column identically. They often don't. Say sales calls
it product but products calls it item — use left_on and right_on to name each side:
=
=
product item category
0 Widget Widget Hardware
1 Widget Widget Hardware
What just happened: left_on="product" and right_on="item" told merge that those two differently-named
columns hold the same key. The match worked exactly as before; you just get both key columns in the result
(drop one with .drop(columns="item") if it bothers you). If your key sits in the index rather than a
column, use left_index=True / right_index=True the same way.
Now the trap that bites everyone at least once. ⚠️ A merge multiplies rows when the key isn't unique on the
side you're joining to. If products accidentally listed Widget twice, every Widget sale would match
both rows and your result would silently double:
= # Widget appears twice
5 sales rows -> 6 after merge
What just happened: the duplicate Widget in the lookup table turned a 5-row merge into a 6-row one — each
of the two Widget sales matched two product rows... but here only one extra appeared because of how the
counts fell, and on bigger data this kind of many-to-many blow-up can multiply your rows into the millions.
The defense is to declare what you expect with validate=:
MergeError: Merge keys are not unique in right dataset; not a one-to-many merge
What just happened: validate="many_to_one" asserts "many sales rows, but each product key appears once in
the lookup." Because the key wasn't unique on the right, pandas refused the merge and told you why — far
better than a silently inflated result you discover three steps later. Reach for validate= whenever you're
joining a fact table to what's supposed to be a one-row-per-key lookup.
concat: stacking, not matching
merge relates tables side by side on a key. The other combining move is just stacking — gluing rows
on top of each other, or columns next to each other, with no key matching at all.
📝 pd.concat([df1, df2]) stacks tables. By default it stacks rows (one DataFrame's rows after the
other's) — perfect for "I have January sales in one DataFrame and February in another, give me one table":
=
=
=
2 + 3 -> 5 rows
What just happened: concat lined the two frames up and stacked February's rows below January's into one
5-row table. The two frames have the same columns, which is what makes stacking sensible — concat aligns
by column name and fills NaN for any column one frame lacks. Pass axis=1 instead and concat glues columns
side by side (aligning on the index) rather than stacking rows.
So when do you reach for which? merge when the tables relate by a key (sales ↔ products: different shapes, joined on a shared value). concat when the tables are more of the same thing (Jan + Feb sales: same columns, appended). That distinction — relate by key vs. stack more of the same — covers the vast majority of data assembly you'll ever do.
Suffixes and verifying the join
One last practicality. When both tables have a non-key column with the same name, merge can't keep two
columns named the same, so it disambiguates with _x (left) and _y (right) suffixes:
# both tables happen to have a "region" column
=
=
product region_x region_y
0 Widget North Imported
1 Widget North Imported
What just happened: both frames carried a region column, so merge kept the sales one as region_x and
the product one as region_y. Cryptic. Set readable names yourself with suffixes=:
pd.merge(sales, prod_region, on="product", suffixes=("_sale", "_origin")) gives you region_sale and
region_origin.
⚠️ And the habit that saves you the most grief: check the row count before and after every merge. A left/inner merge should never increase your left table's row count — if it does, your key isn't unique and rows multiplied. A merge that was meant to enrich your data but changed how many rows you have is a bug, not a result:
=
=
before: 5 after: 5
What just happened: a left join with a clean one-row-per-product lookup left the row count untouched at 5 —
exactly what you want. The two-second len() check is the cheapest bug-catcher in pandas; make it reflex.
💡 The whole phase boils down to two verbs: merge to look up related data by key, concat to stack
more of the same — and almost every "combine these datasets" task is one or the other.
Recap
- Combining = matching rows across tables on a shared key.
pd.mergeis exactly the SQL JOIN (SQL Joins, Finally Explained) in DataFrame form. how=picks which rows survive:inner(matches only),left(all left rows),right(all right),outer(all rows from both). ⚠️ Non-matches don't drop the row in left/right/outer — they fill the borrowed columns withNaN.- Keys:
on=for an identically-named column,left_on/right_onfor differently-named ones,left_index/right_indexto join on the index. - The multiplication trap: if the key isn't unique on the side you join to, rows multiply (a
many-to-many join can explode the row count). Guard with
validate=. concatstacks rather than matches — rows by default, columns withaxis=1. Use it for "more of the same" (Jan + Feb sales); use merge for "relate by key."- Verify: overlapping column names get
_x/_y(override withsuffixes=), and always check the row count before/after a merge — an unexpected change is a bug.
Quick check
Make sure the join types and the row-count instinct stuck:
[
{
"q": "You merge sales (5 rows) with a products lookup using how=\"left\", and a few sales have a product not in the lookup. What happens to those rows?",
"choices": [
"They're kept, with NaN filled into the columns borrowed from products",
"They're dropped, because there's no match",
"The whole merge raises an error",
"They're duplicated until a match is found"
],
"answer": 0,
"explain": "A left join keeps every left (sales) row. Where there's no matching product, pandas has nothing to fill the borrowed columns with, so it puts NaN there — the row stays."
},
{
"q": "Your products lookup accidentally lists \"Widget\" on two rows. You inner-merge sales onto it. What's the danger?",
"choices": [
"Each Widget sale matches both Widget rows, multiplying your row count",
"The merge silently drops all Widget rows",
"pandas automatically deduplicates the lookup for you",
"Nothing — duplicate keys are always fine"
],
"answer": 0,
"explain": "A non-unique key on the side you join to multiplies rows: every sale matches every duplicate. Declaring validate=\"many_to_one\" makes pandas refuse the merge instead of silently inflating it."
},
{
"q": "You have January sales and February sales in two DataFrames with identical columns, and want one combined table. Which tool fits?",
"choices": [
"pd.concat([jan, feb]) — stacking more of the same",
"pd.merge(jan, feb, on=\"date\") — relate by key",
"Neither; you must loop and append row by row",
"pd.merge with how=\"outer\" on every column"
],
"answer": 0,
"explain": "Same shape, same columns, just more rows of the same thing — that's concat (stacking). merge is for relating different tables by a shared key, which isn't what's happening here."
}
]
← Phase 6: GroupBy & Aggregation · Guide overview · Phase 8: Time Series & Dates →
Check your understanding
1. You merge sales (5 rows) with a products lookup using how="left", and a few sales have a product not in the lookup. What happens to those rows?
2. Your products lookup accidentally lists "Widget" on two rows. You inner-merge sales onto it. What's the danger?
3. You have January sales and February sales in two DataFrames with identical columns, and want one combined table. Which tool fits?