Extracting Structured Data
So far we've printed pieces - a title here, a price there. A real scraper produces records: one structured object per thing, with the same fields every time, clean enough to drop into a spreadsheet without hand-fixing. This phase builds that. By the end you'll have a function that turns one book's HTML into one tidy dictionary, and a loop that gives you a list of them.
The dictionary is our record. Each book becomes
{"title": ..., "price": ..., "rating": ..., "in_stock": ..., "url": ...}.
Same keys, every book. That sameness is what makes the next phase - saving -
trivial.
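To make "same keys, every book" concrete, here is a minimal sketch that checks a list of records for a uniform shape. The FIELDS name, the helper, and both sample records are made up for illustration; only the five key names come from the record shape above.

```python
# Hypothetical field list; the key names mirror the record shape in the text.
FIELDS = ["title", "price", "rating", "in_stock", "url"]

# Two made-up records, only here to show the shape.
records = [
    {"title": "Book A", "price": 51.77, "rating": "Three",
     "in_stock": "In stock", "url": "https://example.com/a"},
    {"title": "Book B", "price": 12.50, "rating": "One",
     "in_stock": "In stock", "url": "https://example.com/b"},
]


def has_expected_keys(record):
    """True when a record has exactly the expected keys, no more and no fewer."""
    return set(record) == set(FIELDS)


uniform = all(has_expected_keys(r) for r in records)
print(uniform)
```

A check like this is cheap to run at the end of a scrape, and it tends to catch a mistyped key long before the data reaches a spreadsheet.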
Extract one book into a dict
Create extract.py. We'll write a function that takes a single book element and
returns a dictionary. Look at the page's HTML again: the rating lives in a
class like star-rating Three, the stock status is text in a p.instock, and
the link is a relative href we'll need to fix up.
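    import requests
    from bs4 import BeautifulSoup

    # Assumed from the earlier phases: the site URL and the CSS selectors below.
    BASE = "https://books.toscrape.com/"

    def parse_book(book):
        link = book.select_one("h3 a")
        title = link["title"]
        price = book.select_one("p.price_color").text
        # Rating is encoded in the class, e.g. "star-rating Three"
        rating_tag = book.select_one("p.star-rating")
        rating = rating_tag["class"][1]  # ["star-rating", "Three"] -> "Three"
        in_stock = book.select_one("p.instock").text
        url = BASE + link["href"]
        return {"title": title, "price": price, "rating": rating,
                "in_stock": in_stock, "url": url}

    response = requests.get(BASE)
    # Pass bytes so BeautifulSoup can work out the page's encoding itself.
    soup = BeautifulSoup(response.content, "html.parser")
    first = soup.select_one("article.product_pod")
    print(parse_book(first))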
Run python extract.py. You'll get a dictionary - but a slightly grubby one.
The stock text is wrapped in whitespace and newlines, and the price has a stray
character on the front. Let's clean it.
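One note on the url field before we clean up. Gluing a base URL onto a relative href works as long as every href is relative to that same base. As an alternative sketch, not necessarily what this tutorial's code does, the standard library's urllib.parse.urljoin resolves a link against the URL of the page it was found on. Both URLs below are placeholders.

```python
from urllib.parse import urljoin

# Placeholder values: the page we fetched and a relative link found on it.
page_url = "https://example.com/shop/page-2.html"
href = "items/some-book/index.html"

# urljoin resolves the href relative to the page's own location.
absolute = urljoin(page_url, href)
print(absolute)

# It also copes with "../" segments that plain concatenation would leave in place.
parent_link = urljoin(page_url, "../other/index.html")
print(parent_link)
```

This matters once links are collected from pages at different depths, where a single fixed prefix no longer fits every href.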
Clean the text as you pull it
Raw HTML text is full of indentation, newlines, and the odd encoding artifact. Clean it at the moment of extraction, so every record downstream is already tidy. A few moves cover almost everything:
| Problem | Fix |
|---|---|
| Leading/trailing whitespace, newlines | .text.strip() |
| A currency symbol you want as a number | .replace("£", "") then float(...) |
| Internal double spaces | " ".join(text.split()) |
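Here are the three moves from the table applied to plain strings, so you can try them without fetching anything. The raw values are made-up samples, not output copied from the site.

```python
# Made-up raw strings standing in for what .text might hand back.
raw_stock = "\n\n    In stock\n  "
raw_price = "£51.77"
raw_title = "Some  Book   Title"

stock = raw_stock.strip()                   # drop surrounding whitespace and newlines
price = float(raw_price.replace("£", ""))   # remove the symbol, then convert to a number
title = " ".join(raw_title.split())         # collapse runs of internal whitespace

print(repr(stock), price, repr(title))
```

Note that float() raises ValueError if anything other than a number is left in the string, so the replace step has to remove every non-numeric character first.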
Here's parse_book with cleaning built in, turning the price into a real number
we can sort and total later:
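    # BASE and the selectors are assumed from earlier in extract.py.
    def parse_book(book):
        link = book.select_one("h3 a")
        title = link["title"].strip()
        price_text = book.select_one("p.price_color").text.strip()  # e.g. "£51.77"
        price = float(price_text.replace("£", ""))
        rating_tag = book.select_one("p.star-rating")
        rating = rating_tag["class"][1]
        in_stock = book.select_one("p.instock").text.strip()
        url = BASE + link["href"]
        return {"title": title, "price": price, "rating": rating,
                "in_stock": in_stock, "url": url}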
Now price is 51.77, a float, not a string. Decide on the type you want for
each field at extraction time - a scraper that emits clean, typed records is
worth ten that emit strings someone has to scrub later.
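The same decision applies to the rating, which arrives as a word such as "Three". If you would rather store a number, a small lookup table is one way to do it. This is a hypothetical extension, not part of parse_book, and it assumes the site only uses the words One through Five.

```python
# Hypothetical lookup; assumes ratings are spelled One through Five.
RATING_WORDS = {"One": 1, "Two": 2, "Three": 3, "Four": 4, "Five": 5}


def rating_to_int(word):
    """Return the rating as an int, or None for a missing or unknown word."""
    return RATING_WORDS.get(word)


print(rating_to_int("Three"))
```

Returning None for an unknown word keeps the record intact and leaves an honest gap instead of a guessed number.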
Survive a missing field
Here's the thing that separates a script that works once from a scraper you can
trust: real pages are inconsistent. One book is missing a rating. Another has no
price because it's out of stock. The moment you call .text on the result of a
select_one that found nothing, you get AttributeError: 'NoneType' object has no attribute 'text', and the whole run dies on item 47 of 1000.
The fix is to check before you reach in. A small helper keeps the main function readable:
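    def safe_text(element, selector, default=None):  # helper name assumed
        found = element.select_one(selector)
        return found.text.strip() if found else default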
select_one returns None when nothing matches, and None is falsy, so the
if found guard catches it. When the element is missing you get your default
instead of a crash. Wire it in:
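    # safe_text is the helper described above; BASE and the selectors are assumed.
    def parse_book(book):
        link = book.select_one("h3 a")
        title = link["title"].strip() if link else None
        price_text = safe_text(book, "p.price_color")
        price = float(price_text.replace("£", "")) if price_text else None
        rating_tag = book.select_one("p.star-rating")
        rating = rating_tag["class"][1] if rating_tag else None
        in_stock = safe_text(book, "p.instock")
        url = BASE + link["href"] if link else None
        return {"title": title, "price": price, "rating": rating,
                "in_stock": in_stock, "url": url}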
Now a missing price becomes None, not a stack trace. None is a deliberate
"we looked and it wasn't there" - far more useful than an empty string, because
later you can ask "which records are missing a price?" and get a real answer.
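That question takes one line to ask. A minimal sketch, using made-up records in place of real scraped ones:

```python
# Made-up records; the second one is missing its price.
records = [
    {"title": "Book A", "price": 51.77},
    {"title": "Book B", "price": None},
    {"title": "Book C", "price": 12.50},
]

# "is None" picks out the deliberately-missing values and nothing else.
missing_price = [r for r in records if r["price"] is None]

print([r["title"] for r in missing_price])
```

Testing with "is None" instead of plain truthiness matters here: a legitimate price of 0.0 is falsy too, and you would not want it counted as missing.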
Pull the whole page into records
Put it together: loop every book, build a list of dicts, and report.
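    # Uses requests, BeautifulSoup, BASE and parse_book from earlier in extract.py.
    # Function names and the article.product_pod selector are assumed.
    def fetch_soup(url):
        response = requests.get(url)
        soup = BeautifulSoup(response.content, "html.parser")
        return soup

    def scrape_page(url):
        soup = fetch_soup(url)
        books = soup.select("article.product_pod")
        records = []
        for book in books:
            records.append(parse_book(book))
        return records

    records = scrape_page(BASE)
    print(f"Extracted {len(records)} books")
    for record in records:
        print(record)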
Run it. Twenty clean dictionaries, and because the price is a real number, you can find the cheapest book with one line. That's the payoff of typing your data as you extract it.
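Here is a sketch of that one-liner, using made-up records in place of scraped ones. Because a missing price is stored as None, the sketch filters those records out first; calling min() on a mix of floats and None would raise a TypeError.

```python
# Made-up records standing in for the scraped list.
records = [
    {"title": "Book A", "price": 51.77},
    {"title": "Book B", "price": 12.50},
    {"title": "Book C", "price": None},   # missing price
]

# Keep only records that actually have a price, then compare on that field.
priced = [r for r in records if r["price"] is not None]
cheapest = min(priced, key=lambda r: r["price"])

print(cheapest["title"], cheapest["price"])
```

If every record could be missing a price, min() would be handed an empty list and raise ValueError, so a real script may want to check that priced is non-empty first.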
Where we are
You have a list of clean, structured, typed records, and code that won't fall over when a page leaves a field blank. One problem remains: this is only the first 20 books. There are a thousand. Next phase we follow the "next" link through every page - and we do it without being a nuisance to the server.