Saving the Data, and Where to Take It
We've got a thousand clean records sitting in memory. The moment the program ends, they're gone. This phase fixes that - we write them to CSV and JSON - and then pulls every piece of the scraper together into one complete script. After that, the fun part: a tour of where you'd take this when a weekend project meets a real need.
Python's standard library handles both formats. No installs. csv for the
spreadsheet people, json for the program-talking-to-program people. We'll write
both, because they answer different questions and cost nothing extra.
Write to CSV
CSV opens in Excel, Numbers, Google Sheets, and every data tool on earth. Our
records are a list of dicts with identical keys, which is exactly what
csv.DictWriter is built for.
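Here's a minimal sketch of that write. The function name save_csv and the default path books.csv are placeholders for illustration, and it assumes every record has the same keys as the first one.

```python
import csv


def save_csv(records, path="books.csv"):
    """Write a list of same-keyed dicts to a CSV file with a header row."""
    if not records:
        return  # nothing to write, and no keys to build a header from
    fieldnames = list(records[0].keys())
    # newline="" and encoding="utf-8" are the two arguments discussed in the text.
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(records)
```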
Two details that save you grief. newline="" stops Python from adding blank
lines between every row on Windows - leave it off and your CSV looks
double-spaced in Excel. And encoding="utf-8" makes sure titles with accents or
symbols survive instead of turning into garbage. Always pass both when writing
CSV.
Write to JSON
JSON keeps your data's shape - nested structures, real numbers, None as
null. It's the format you'd hand to another program or a web front-end.
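A minimal sketch of the JSON write, using the same list of dicts. The name save_json and the path books.json are placeholders, not names confirmed by the earlier phases.

```python
import json


def save_json(records, path="books.json"):
    """Dump the records to a JSON file as an indented array of objects."""
    with open(path, "w", encoding="utf-8") as f:
        # indent=2 pretty-prints; ensure_ascii=False keeps £ and é as-is.
        json.dump(records, f, indent=2, ensure_ascii=False)
```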
indent=2 makes the file human-readable instead of one giant line.
ensure_ascii=False lets non-ASCII characters (£, é) appear as themselves rather
than \u escapes. Drop both and JSON still works, but you'll thank yourself for
the readable version when you open it to debug.
The whole thing, in one file
Here's the complete scraper - fetch, parse, extract defensively, paginate
politely, save both formats. This is the program the project was building toward.
Save it as scraper.py.
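What follows is a sketch of how those pieces can fit together, not a verbatim copy of the earlier phases. The start URL, the User-Agent string, every CSS selector, the record keys, and the books.json filename are assumptions modeled on a typical book-catalogue practice site; swap in the values you actually used. It needs requests and beautifulsoup4 installed.

```python
import csv
import json
import re
import time
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup

# Assumptions for illustration: start URL, User-Agent, and all selectors below.
START_URL = "https://books.toscrape.com/catalogue/page-1.html"
DELAY_SECONDS = 1.0  # polite pause between page requests
HEADERS = {"User-Agent": "weekend-scraper/1.0 (learning project)"}
RATINGS = {"One": 1, "Two": 2, "Three": 3, "Four": 4, "Five": 5}


def fetch(url):
    """Download one page and return it as parsed HTML."""
    response = requests.get(url, headers=HEADERS, timeout=10)
    response.raise_for_status()
    # Passing bytes lets BeautifulSoup detect the page's own encoding.
    return BeautifulSoup(response.content, "html.parser")


def text_of(card, selector):
    """Return the stripped text of the first match, or None if absent."""
    tag = card.select_one(selector)
    return tag.get_text(strip=True) if tag is not None else None


def parse_book(card, page_url):
    """Build one record from a product card; missing pieces become None."""
    link = card.select_one("h3 a")
    price_match = re.search(r"\d+(?:\.\d+)?", text_of(card, ".price_color") or "")
    rating_tag = card.select_one(".star-rating")
    rating_classes = rating_tag.get("class", []) if rating_tag is not None else []
    stock_text = text_of(card, ".availability")
    has_link = link is not None and link.get("href")
    return {
        "title": link.get("title") if link is not None else None,
        "price": float(price_match.group()) if price_match else None,
        "rating": next((RATINGS[c] for c in rating_classes if c in RATINGS), None),
        "in_stock": "in stock" in stock_text.lower() if stock_text else None,
        "url": urljoin(page_url, link["href"]) if has_link else None,
    }


def scrape_all(start_url, fetch_page=fetch, delay=DELAY_SECONDS):
    """Follow 'next' links from start_url, collecting every record."""
    records = []
    url = start_url
    page = 1
    while url:
        print(f"Page {page}: {url}")
        soup = fetch_page(url)
        records.extend(parse_book(card, url) for card in soup.select("article.product_pod"))
        next_link = soup.select_one("li.next a")
        url = urljoin(url, next_link["href"]) if next_link is not None else None
        if url:
            time.sleep(delay)  # wait before asking for the next page
        page += 1
    return records


def save_csv(records, path):
    """Write the records to CSV with a header row."""
    if not records:
        return
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=list(records[0].keys()))
        writer.writeheader()
        writer.writerows(records)


def save_json(records, path):
    """Write the records to an indented JSON file."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(records, f, indent=2, ensure_ascii=False)


def main():
    records = scrape_all(START_URL)
    save_csv(records, "books.csv")
    save_json(records, "books.json")
    print(f"Saved {len(records)} records to books.csv and books.json")
```

To run this as scraper.py, finish the file with the usual entry-point guard: an if __name__ == "__main__": line followed by an indented call to main(). Passing fetch_page and delay into scrape_all is a design choice of this sketch; it lets you exercise the pagination loop against saved HTML without touching the network.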
Run it from a terminal with python scraper.py.
Watch it walk the pages, then open books.csv in a spreadsheet. There's your
weekend's work: a thousand books with titles, prices, ratings, stock, and links -
sortable, filterable, yours. That's a finished, working scraper.
Where to take it next
You've got the core skill. Here's the map of what's past the edge of this project, roughly in order of effort.
| Upgrade | What it buys you | First tool to look at |
|---|---|---|
| A database | Query, dedupe, update over time | sqlite3 (built in) |
| Scheduling | Runs itself on a timer | cron, Task Scheduler |
| Concurrency | Many pages at once, faster | httpx + asyncio |
| Headless browser | Scrape JS-built pages | Playwright |
| A framework | Big crawls, built-in plumbing | Scrapy |
A few of those deserve a sentence.
A database. When you scrape the same site repeatedly, a CSV per run gets
messy fast. SQLite - which ships with Python as sqlite3 - lets you store
records in a real table, ask questions with SQL, and update yesterday's data
instead of duplicating it. It's the natural next step when "save a file" stops
being enough.
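As a rough sketch of that step, here's one way to push the records into SQLite so a second run updates rows instead of piling up duplicates. The table name, the column names, and the choice of url as the primary key are all assumptions for illustration; match them to the keys your records really have.

```python
import sqlite3


def save_to_sqlite(records, path="books.db"):
    """Insert records into a books table, replacing rows with the same url."""
    conn = sqlite3.connect(path)
    try:
        with conn:  # commits on success, rolls back if anything raises
            conn.execute(
                """CREATE TABLE IF NOT EXISTS books (
                       url TEXT PRIMARY KEY,
                       title TEXT,
                       price REAL,
                       rating INTEGER,
                       in_stock INTEGER
                   )"""
            )
            # OR REPLACE swaps out any existing row that has the same url.
            conn.executemany(
                """INSERT OR REPLACE INTO books (url, title, price, rating, in_stock)
                   VALUES (:url, :title, :price, :rating, :in_stock)""",
                records,
            )
    finally:
        conn.close()
```

Once the rows are in a table, a question like "which five-star books cost under ten pounds" becomes a single SELECT with a WHERE clause rather than a loop over a list.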
Scheduling. A scraper that runs itself is worth ten you have to remember to
run. On macOS or Linux, cron fires your script on a schedule; on Windows, Task
Scheduler does the same. Point it at python scraper.py nightly and wake up to
fresh data.
Headless browsers, for the sites that fight back. Here's the wall you'll hit
eventually: some pages arrive nearly empty and build their content with
JavaScript after loading. requests only sees that empty shell - it doesn't
run JavaScript. When response.text is missing data you can plainly see in your
browser, that's the symptom. The cure is a headless browser like Playwright,
which drives a real (invisible) browser, lets the JavaScript run, and then
hands you the finished HTML to feed into the very same BeautifulSoup code you
wrote this weekend. Everything you learned still applies - you've upgraded the
fetch step, nothing else.
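A minimal sketch of that upgraded fetch step, using Playwright's synchronous API. The function name fetch_rendered_html is hypothetical, and waiting for "networkidle" is just one reasonable guess at when a page has finished building itself; some sites need a wait for a specific element instead. It assumes you've run pip install playwright and then playwright install chromium.

```python
def fetch_rendered_html(url):
    """Load a page in headless Chromium and return the HTML after scripts run."""
    # Imported here so the rest of the scraper still loads where
    # Playwright is not installed.
    from playwright.sync_api import sync_playwright

    with sync_playwright() as p:
        browser = p.chromium.launch()  # headless by default
        try:
            page = browser.new_page()
            # Wait until network activity settles so scripted content has loaded.
            page.goto(url, wait_until="networkidle")
            return page.content()
        finally:
            browser.close()
```

The string it returns drops straight into BeautifulSoup(html, "html.parser"), so the parsing and extraction code stays exactly as it was.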
Scrapy. When a one-file script grows into a serious crawler - many sites, retries, pipelines, politeness baked in - Scrapy is the framework built for it. It's more to learn, so reach for it when you've outgrown a script, not before.
Where we are
You built a real web scraper this weekend. It fetches pages, parses messy HTML, extracts clean and typed records, walks an entire catalog at a respectful pace, and saves the results to formats you can actually use. Every piece is code you understand, because you wrote it one phase at a time.
The same five-box loop - fetch, parse, extract, next, save - scales from this practice site to almost anything you'll want to point it at. Swap the selectors for a new site's HTML, keep the politeness, and you're scraping. Go find some public data worth having, and treat the servers kindly while you get it.