Pagination and Being Polite
We've been scraping one page. The catalog has fifty. This phase teaches the scraper to walk from page to page until there are no more - and, equally important, to do it in a way that doesn't hammer the server or get you blocked. By the end you'll have a loop that collects the whole catalog at a respectful pace.
I'm putting the manners and the law in the same phase as the loop on purpose. The mechanics of pagination take ten minutes. Knowing how not to be a problem is the part that keeps you out of trouble, and it's the part a lot of tutorials skip. We won't.
Find the "next" link
Open the site and scroll to the bottom - there's a "next" button. Inspect it.
On books.toscrape.com it's a <li class="next"> containing an <a> whose
href points at the following page. When you're on the last page, that element
isn't there at all. That presence-or-absence is exactly the signal we need: keep
going while there's a next link, stop when there isn't.
=
= # e.g. "catalogue/page-2.html"
= None # we're on the last page
One wrinkle: that href is relative. From the catalogue pages it might be
page-2.html, meaning "relative to where I am now." Python's standard library
has the right tool so you don't guess: urljoin combines the current page's URL
with a relative link and gives you a correct absolute URL.
=
# -> https://books.toscrape.com/catalogue/page-2.html
Set a User-Agent first
Before we loop, one courtesy and one practicality. Every request carries a
User-Agent header that says who's calling. By default requests sends
something like python-requests/2.x, which is honest but anonymous. Many servers
treat the default Python agent with suspicion or block it outright.
Set a real one that identifies you and, ideally, how to reach you. This is both polite (the site owner can see who you are) and effective (you're less likely to be filtered). A shared session attaches the header to every request:
=
Don't impersonate a real browser to sneak past defenses. If a site plainly doesn't want bots, that's information, not a challenge - more on that below.
The pagination loop
Now the full walk. Start at page one, parse it, find the next link, sleep a moment, repeat - until there's no next link.
=
= 1.0 # seconds between requests
=
=
=
=
return
=
=
= 1
=
=
=
=
+= 1
# be a good guest
return
=
Run it. It walks every page, printing as it goes, pausing a second between
fetches, and ends with all thousand books in one list. The while url: loop is
the engine: url becomes None the moment there's no next link, and the loop
ends on its own.
Why the delay matters
That time.sleep(DELAY) isn't decoration. Without it, a fast loop fires
requests as quickly as Python can manage - dozens per second - and to the server
that's indistinguishable from an attack. You can knock a small site over, and
you will absolutely get your IP blocked. A one-second pause makes you a normal
visitor. Slow is polite, and polite is what keeps you scraping.
A second knob worth knowing for bigger jobs: retries with backoff. If a request fails, wait a bit and try again - and wait longer each time rather than pounding a struggling server. For this project a fixed delay is enough; keep backoff in your back pocket for production.
robots.txt - read it, respect it
Most sites publish a file at /robots.txt saying which paths automated clients
may and may not touch. It's a request, not a wall, but ignoring it is rude and,
depending on where you are and what you're doing, can be a factor against you
legally. Python's standard library reads it for you:
=
=
can_fetch returns True or False. The honest move is to check it before you
scrape a path and skip what's disallowed. (Our practice site allows everything;
real sites often disallow /search, /cart, login areas, and the like.)
The ethics and the law, plainly
I'm not your lawyer, and this isn't legal advice - but here's the working understanding a careful person operates with:
- Public, factual data is the safe ground. Facts (prices, titles, public listings) generally aren't copyrightable. Re-publishing someone's creative content wholesale is a different matter.
- Terms of Service exist. Many sites' ToS forbid automated access. Violating them can be a breach of contract even when the data itself is public. Read them for anything you'll do more than once.
- Don't scrape personal data carelessly. Names, emails, profiles - privacy law (GDPR, CCPA, and friends) applies regardless of whether data is "public."
- Don't degrade the service. Hammering a server can cross from rude into unlawful (computer-misuse statutes). The delay isn't only manners.
- Logins and paywalls change everything. Scraping behind authentication, or bypassing access controls, is a sharp escalation. Don't, unless you have clear permission.
- Prefer an API. If the site offers one, use it. It's faster, more stable, and it's them inviting you in.
The short version: scrape public facts, slowly, with an honest User-Agent, while respecting robots.txt and ToS, and never for personal data or behind a login. That posture covers the vast majority of legitimate scraping.
graph TD
A[Want some data] --> B{Is there an API?}
B -->|yes| C[Use the API]
B -->|no| D{robots.txt + ToS allow it?}
D -->|no| E[Stop or ask permission]
D -->|yes| F[Scrape slowly, honest UA]
Where we are
The scraper now collects an entire catalog, page by page, at a pace that won't get you blocked or sued - with a real User-Agent and a robots.txt check in the toolkit. The data's all in memory, though, and memory vanishes when the program exits. Last phase: write it to disk, then look at where you'd take this next.