A scraper on your laptop is fine until you close it mid-run, your home IP gets rate-limited, or you want the same job to run every hour whether you're awake or not. Moving it to a VPS fixes all three: it stays up 24/7, it doesn't burn your home IP's reputation, and cron or a systemd timer runs it on schedule without you. Here's how to set that up, how much server you actually need, and the parts most guides quietly skip.
Why a VPS beats your machine for this
- 24/7 and scheduled. A scraper that runs hourly needs a host that's always on. A laptop isn't.
- Your home IP stays clean. Scraping from home means your residential IP takes the rate-limits and blocks. Do it from a server and your own connection stays untouched.
- Stability. No sleep, no wifi drops mid-crawl, a steady datacenter connection, and a place to accumulate results.
The stack, and what each part needs
Two very different weight classes, and picking the wrong plan wastes money or starves the job:
- httpx / requests (Python): for APIs, JSON endpoints, and static HTML. This is light: the process sits in tens of megabytes, network-bound, not CPU-bound. A $3 Nano (1 vCPU / 1 GB) runs this comfortably, even with concurrency via asyncio.
- Playwright / headless Chromium: for JavaScript-rendered sites where you need a real browser. This is the heavy one. Headless Chromium is roughly 300–400 MB per browser instance, plus 100–200 MB per open context/tab, plus your runtime. Budget:
  - $5 Micro (2 vCPU / 2 GB): one or two browser contexts at a time.
  - $8 Small (4 vCPU / 4 GB): several parallel contexts, or heavier pages.
The rule of thumb: your code is almost never the bottleneck — Chromium is. Size for the browser, not the scraper.
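To make the sizing concrete, here is a rough back-of-the-envelope calculator built on the per-browser and per-context figures above. The `reserved_mb` allowance for the OS and the Python runtime is an assumption, not a measured number, and real pages can blow well past the per-context estimate, so treat the result as a paper ceiling rather than a target.

```python
# Rough RAM budget for one headless Chromium instance.
# The ranges below are the estimates quoted in this article.
BROWSER_MB = (300, 400)  # per browser instance (low, high)
CONTEXT_MB = (100, 200)  # per open context/tab (low, high)


def max_contexts(ram_mb: int, reserved_mb: int = 512) -> tuple[int, int]:
    """Return (pessimistic, optimistic) context counts that fit in ram_mb.

    reserved_mb is an assumed allowance for the OS and your own runtime.
    """

    def fit(browser_mb: int, context_mb: int) -> int:
        free = ram_mb - reserved_mb - browser_mb
        return max(free // context_mb, 0)

    # Pessimistic uses the high estimates, optimistic uses the low ones.
    return fit(BROWSER_MB[1], CONTEXT_MB[1]), fit(BROWSER_MB[0], CONTEXT_MB[0])
```

With these assumptions, `max_contexts(1024)` returns `(0, 2)` and `max_contexts(2048)` returns `(5, 12)`. The plan suggestions above sit well below that ceiling on purpose: heavy pages, memory spikes, and the lack of swap on small boxes eat the difference quickly.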
Scheduling: systemd timer over cron
cron works, but a systemd timer is the better default on a server you maintain: logs through journalctl, catch-up if the box was down, and inspectable per-run status. A minimal setup:
```ini
# /etc/systemd/system/scrape.service
[Unit]
Description=Run scraper

[Service]
Type=oneshot
User=scraper
WorkingDirectory=/home/scraper/job
ExecStart=/home/scraper/job/venv/bin/python scrape.py
```

```ini
# /etc/systemd/system/scrape.timer
[Unit]
Description=Hourly scrape

[Timer]
OnCalendar=hourly
Persistent=true

[Install]
WantedBy=timers.target
```

```shell
sudo systemctl enable --now scrape.timer
journalctl -u scrape.service -f   # watch runs
```
Persistent=true is the bit plain cron can't do (anacron aside): if the server was off at run time, the job fires once after boot instead of silently skipping.
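Per-run status is only useful if the script reports failure honestly. A minimal sketch of an entry point for scrape.py, assuming the real work lives in a function (`run_job` here is a hypothetical placeholder): it logs to stderr, which journald captures for the unit, and returns a non-zero code when the job raises.

```python
import logging
import sys

log = logging.getLogger("scrape")


def run_job() -> int:
    """Hypothetical placeholder for the real scraping work; returns rows saved."""
    return 0


def main(job=run_job) -> int:
    """Run one scrape and map the outcome to a process exit code."""
    logging.basicConfig(level=logging.INFO, stream=sys.stderr)
    try:
        rows = job()
    except Exception:
        # The traceback goes to stderr, so it shows up in journalctl.
        log.exception("scrape failed")
        return 1
    log.info("scrape finished, %d rows saved", rows)
    return 0
```

End the script with `sys.exit(main())`. A non-zero exit marks that run of the oneshot service as failed, so `systemctl status scrape.service` shows the failure instead of a quiet success.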
Where results go
Keep it simple and match the volume: SQLite for structured data you'll query (one file, zero setup), CSV for quick tabular dumps, or an S3-compatible object store when results outgrow the box or you want them off-server. Rotate your logs (logrotate or journald limits) so a chatty scraper doesn't slowly fill the disk.
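For the SQLite route, a minimal sketch using only the standard library. The `items` table, its columns, and the `results.db` filename are assumptions for illustration; shape them to whatever you actually collect. Keying on the URL means a re-scrape overwrites the old row instead of piling up duplicates.

```python
import sqlite3
from datetime import datetime, timezone


def open_store(path: str = "results.db") -> sqlite3.Connection:
    """Open (or create) the results database and make sure the table exists."""
    conn = sqlite3.connect(path)
    conn.execute(
        """CREATE TABLE IF NOT EXISTS items (
               url        TEXT PRIMARY KEY,
               title      TEXT,
               fetched_at TEXT NOT NULL
           )"""
    )
    return conn


def save_item(conn: sqlite3.Connection, url: str, title: str) -> None:
    """Insert a row, replacing any earlier row for the same URL."""
    now = datetime.now(timezone.utc).isoformat()
    with conn:  # commits on success, rolls back if the insert raises
        conn.execute(
            "INSERT OR REPLACE INTO items (url, title, fetched_at) VALUES (?, ?, ?)",
            (url, title, now),
        )
```

Call `open_store()` once per run and `save_item()` per result. Passing `":memory:"` as the path gives you a throwaway database for testing.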
The honest part: egress IP and reputation
This is the detail that decides whether your scraper works for a week or gets blocked on day one.
On a NAT plan, outbound traffic shares one egress IP with other customers. That IP's reputation is shared — a neighbor scraping the same target can get the address rate-limited before you send a single request. Fine for light, occasional scraping; a liability at volume.
A dedicated IP gives you your own egress reputation — nobody else's behavior affects it. But it cuts both ways: aggressive scraping burns your own clean IP, and once a target blocks it, it's blocked. A dedicated IP is control, not immunity.
At real scale, you need external proxy pools. No single IP — shared or dedicated — can spread load across many addresses, which is what serious scraping against IP-rate-limited targets requires. Proxies are a generic, third-party layer you add on top; the VPS runs the scraper, the proxy pool provides the addresses. Don't expect one server IP to do a proxy pool's job.
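On the scraper side, "the proxy pool provides the addresses" usually comes down to picking a different proxy per request. A minimal round-robin sketch; the proxy URLs are hypothetical placeholders, and a real pool (plus its credentials and rotation rules) comes from whichever provider you use.

```python
from itertools import cycle

# Hypothetical proxy endpoints, for illustration only.
PROXIES = [
    "http://proxy-a.example:8080",
    "http://proxy-b.example:8080",
    "http://proxy-c.example:8080",
]


class ProxyRotator:
    """Hand out proxy URLs in round-robin order."""

    def __init__(self, proxies: list[str]) -> None:
        if not proxies:
            raise ValueError("proxy list must not be empty")
        self._pool = cycle(proxies)

    def next_proxy(self) -> str:
        return next(self._pool)
```

Call `next_proxy()` before each request and pass the result to your HTTP client's proxy setting. Round-robin is the simplest policy; many providers rotate on their side behind a single gateway address, in which case you don't need this at all.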
Ethics and the AUP — not optional
Scraping lives in a legal and ethical gray zone, so be clear-eyed:
- Respect rate limits and robots.txt. Throttle your requests. A polite scraper looks like traffic; an impolite one looks like an attack.
- Don't DoS your target. Hammering a site until it falls over isn't scraping, it's a denial-of-service — and it's an acceptable-use violation here that gets the service terminated.
- Scrape only what you're allowed to. Legal, permitted data collection is the line. Cross it and it's on you.
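The first two points can be sketched with the standard library's robots.txt parser. This is an illustration under assumptions, not a complete crawler: the user agent string is hypothetical, `fetch` stands in for whatever HTTP call you actually make, and the one-second default delay is an arbitrary conservative choice for sites that don't declare a Crawl-delay.

```python
import time
from urllib.robotparser import RobotFileParser

USER_AGENT = "example-scraper"  # hypothetical; use your own identifier


def load_rules(robots_lines: list[str]) -> RobotFileParser:
    """Parse the lines of an already-downloaded robots.txt."""
    parser = RobotFileParser()
    parser.parse(robots_lines)
    return parser


def polite_delay(parser: RobotFileParser, default_seconds: float = 1.0) -> float:
    """Use the site's Crawl-delay if it declares one, else a default."""
    delay = parser.crawl_delay(USER_AGENT)
    return float(delay) if delay is not None else default_seconds


def crawl(urls, parser, fetch, sleep=time.sleep):
    """Fetch only allowed URLs, pausing between consecutive requests."""
    delay = polite_delay(parser)
    results = []
    for url in urls:
        if not parser.can_fetch(USER_AGENT, url):
            continue  # robots.txt disallows this path
        if results:
            sleep(delay)  # pause between requests, not before the first
        results.append(fetch(url))
    return results
```

In a real job you would download robots.txt once per host (`RobotFileParser.set_url()` followed by `read()` does that over the network) and reuse the parser for the whole run.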
The bottom line
A VPS is the right home for a scraper: always-on, scheduled, and off your home IP. Match the plan to the stack — $3 Nano for httpx, $5 Micro to $8 Small for Playwright — schedule with a systemd timer, and be honest about IPs: shared egress shares reputation, a dedicated IP is yours to build or burn, and real scale means proxy pools. Lock the box down first with the new-VPS security checklist, size it right using the VPS sizing guide, and if paying-without-a-card privacy matters, the anonymous VPS breakdown is the honest version. Scrape responsibly — the AUP is real.