What VPS specs do I need for web scraping?

Depends on the stack. A plain HTTP scraper (httpx/requests hitting APIs or static HTML) is light: a $3 Nano with 1 GB is plenty. The moment you need a real browser for JavaScript-heavy sites, Playwright with headless Chromium wants 2–4 GB: a $5 Micro for one or two browser contexts, an $8 Small if you run several in parallel. Chromium is the RAM hog, not your code.

Can I scrape from the server's own IP, or do I need proxies?

For low-volume, well-behaved scraping of sites that allow it, the server's IP is fine. At scale, or against sites that rate-limit by IP, you'll need an external proxy pool — one IP (shared or dedicated) can't spread the load, and hammering from a single address gets it blocked fast. A dedicated IP gives you a clean reputation you control; proxies give you many.

cron or systemd timer for scheduled scrapes?

A systemd timer, in almost every case. Unlike cron, it gives you proper logging via journalctl, dependency ordering, automatic catch-up if the box was down (with Persistent=true), and per-run status you can inspect. cron still works for dead-simple jobs, but a timer is the better default on a server you actually maintain.

Is web scraping allowed on the VPS?

Legal, respectful scraping — yes. Aggressive scraping that ignores rate limits or robots.txt, or that amounts to hammering a target into a denial of service, is an acceptable-use violation and gets the service terminated. Scrape what you're allowed to, throttle yourself, and don't turn a scraper into an attack.

VPS for web scraping: setup, sizing, and honest limits

A scraper on your laptop is fine until you close it mid-run, your home IP gets rate-limited, or you want the same job to run every hour whether you're awake or not. Moving it to a VPS fixes all three: it stays up 24/7, it doesn't burn your home IP's reputation, and cron or a systemd timer runs it on schedule without you. Here's how to set that up, how much server you actually need, and the parts most guides quietly skip.

Why a VPS beats your machine for this

24/7 and scheduled. A scraper that runs hourly needs a host that's always on. A laptop isn't.
Your home IP stays clean. Scraping from home means your residential IP takes the rate-limits and blocks. Do it from a server and your own connection stays untouched.
Stability. No sleep, no wifi drops mid-crawl, a steady datacenter connection, and a place to accumulate results.

The stack, and what each part needs

Two very different weight classes, and picking the wrong plan wastes money or starves the job:

httpx / requests (Python): for APIs, JSON endpoints, and static HTML. This is light: the process uses tens of megabytes of RAM and is network-bound, not CPU-bound. A $3 Nano (1 vCPU / 1 GB) runs this comfortably, even with asyncio concurrency in httpx.
Playwright / headless Chromium — for JavaScript-rendered sites where you need a real browser. This is the heavy one. Headless Chromium is roughly 300–400 MB per browser instance, plus 100–200 MB per open context/tab, plus your runtime. Budget:
- $5 Micro (2 vCPU / 2 GB) — one or two browser contexts at a time.
- $8 Small (4 vCPU / 4 GB) — several parallel contexts, or heavier pages.

The rule of thumb: your code is almost never the bottleneck — Chromium is. Size for the browser, not the scraper.
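The per-instance and per-context figures above can be turned into a quick budget check. This is a minimal sketch, not a measurement: the 300–400 MB and 100–200 MB ranges come from the text, while OVERHEAD_MB (OS, Python runtime, headroom) is an assumption you should replace with what you observe on your own box.

```python
# Rough RAM budget for headless Chromium, using the ranges quoted in the text.
# All figures are estimates; measure real pages before committing to a plan.

BROWSER_MB = (300, 400)   # per browser instance (low, high), from the text
CONTEXT_MB = (100, 200)   # per open context/tab (low, high), from the text
OVERHEAD_MB = 400         # ASSUMPTION: OS + Python runtime + headroom


def estimate_ram_mb(contexts: int, browsers: int = 1) -> tuple[int, int]:
    """Return the (low, high) estimated RAM use in MB for the given load."""
    low = OVERHEAD_MB + browsers * BROWSER_MB[0] + contexts * CONTEXT_MB[0]
    high = OVERHEAD_MB + browsers * BROWSER_MB[1] + contexts * CONTEXT_MB[1]
    return low, high


def fits(plan_ram_mb: int, contexts: int, browsers: int = 1) -> bool:
    """True if even the high-end estimate fits in the plan's RAM."""
    return estimate_ram_mb(contexts, browsers)[1] <= plan_ram_mb
```

Under these assumptions, estimate_ram_mb(2) gives (900, 1200), which leaves a 2 GB plan some room for heavy pages, and estimate_ram_mb(8) gives (1500, 2400), which is 4 GB territory.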

Scheduling: systemd timer over cron

cron works, but a systemd timer is the better default on a server you maintain: logs through journalctl, catch-up if the box was down, and inspectable per-run status. A minimal setup:

# /etc/systemd/system/scrape.service
[Unit]
Description=Run scraper
[Service]
Type=oneshot
User=scraper
WorkingDirectory=/home/scraper/job
ExecStart=/home/scraper/job/venv/bin/python scrape.py

# /etc/systemd/system/scrape.timer
[Unit]
Description=Hourly scrape
[Timer]
OnCalendar=hourly
Persistent=true
[Install]
WantedBy=timers.target

sudo systemctl enable --now scrape.timer
journalctl -u scrape.service -f   # watch runs

Persistent=true is the bit plain cron can't do: if the server was off at a scheduled run time, the job fires once when the timer starts again at boot instead of silently skipping.
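Per-run status is only useful if the script reports failure through its exit code, because that is what systemd records for a oneshot service. A minimal sketch of how scrape.py could be laid out follows; run_scrape is a hypothetical placeholder for your real job, and the logging setup is one reasonable choice, not a requirement.

```python
import logging
import sys

# journald captures the service's stdout/stderr, so plain logging is enough.
logging.basicConfig(level=logging.INFO, stream=sys.stderr,
                    format="%(levelname)s %(message)s")
log = logging.getLogger("scrape")


def run_scrape() -> int:
    """Hypothetical placeholder: do the real work, return rows collected."""
    return 0


def main(job=run_scrape) -> int:
    """Return 0 on success and 1 on failure, for use as the process exit code."""
    try:
        rows = job()
    except Exception:
        # Full traceback goes to the journal; the run is marked as failed.
        log.exception("scrape failed")
        return 1
    log.info("scrape finished, %d rows", rows)
    return 0

# In scrape.py, finish with:  raise SystemExit(main())
```

With that last line in place, a crashed run shows up as failed in systemctl status scrape.service instead of looking like a quiet success.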

Where results go

Keep it simple and match the volume: SQLite for structured data you'll query (one file, zero setup), CSV for quick tabular dumps, or an S3-compatible object store when results outgrow the box or you want them off-server. Rotate your logs (logrotate or journald limits) so a chatty scraper doesn't slowly fill the disk.
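For the SQLite option, a sketch using only the standard library might look like this. The table name, columns, and the results.db filename are illustrative assumptions; adapt them to whatever you actually collect.

```python
import sqlite3
from datetime import datetime, timezone


def open_db(path: str = "results.db") -> sqlite3.Connection:
    """Open (or create) the results database and make sure the table exists."""
    conn = sqlite3.connect(path)
    conn.execute(
        """CREATE TABLE IF NOT EXISTS pages (
               url        TEXT PRIMARY KEY,
               title      TEXT,
               fetched_at TEXT NOT NULL
           )"""
    )
    return conn


def save_page(conn: sqlite3.Connection, url: str, title: str) -> None:
    """Store one result; re-scraping the same URL replaces the old row."""
    now = datetime.now(timezone.utc).isoformat()
    with conn:  # commits on success, rolls back on error
        conn.execute(
            "INSERT OR REPLACE INTO pages (url, title, fetched_at) "
            "VALUES (?, ?, ?)",
            (url, title, now),
        )
```

Keying on the URL means an hourly job overwrites stale rows rather than growing the file with duplicates; if you want history instead, drop the primary key and keep every fetch.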

The honest part: egress IP and reputation

This is the detail that decides whether your scraper works for a week or gets blocked on day one.

On a NAT plan, outbound traffic shares one egress IP with other customers. That IP's reputation is shared — a neighbor scraping the same target can get the address rate-limited before you send a single request. Fine for light, occasional scraping; a liability at volume.

A dedicated IP gives you your own egress reputation — nobody else's behavior affects it. But it cuts both ways: aggressive scraping burns your own clean IP, and once a target blocks it, it's blocked. A dedicated IP is control, not immunity.

At real scale, you need external proxy pools. A single IP, shared or dedicated, is still one address, and serious scraping against IP-rate-limited targets needs its load spread across many. Proxies are a generic, third-party layer you add on top: the VPS runs the scraper, the proxy pool provides the addresses. Don't expect one server IP to do a proxy pool's job.
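The split between "the VPS runs the scraper" and "the pool provides the addresses" can be sketched as simple round-robin rotation. The proxy URLs below are hypothetical placeholders (a real pool comes from your provider), and the standard library's urllib is used only to keep the sketch dependency-free; with httpx you would hand the chosen proxy to the client instead.

```python
import itertools
import urllib.request

# HYPOTHETICAL placeholder addresses; a real pool comes from a proxy provider.
PROXIES = [
    "http://proxy-1.example.net:8080",
    "http://proxy-2.example.net:8080",
    "http://proxy-3.example.net:8080",
]


class ProxyRotator:
    """Hand out proxies round-robin so consecutive requests use different IPs."""

    def __init__(self, proxies):
        proxies = list(proxies)
        if not proxies:
            raise ValueError("proxy pool is empty")
        self._cycle = itertools.cycle(proxies)

    def next_proxy(self) -> str:
        return next(self._cycle)

    def opener(self) -> urllib.request.OpenerDirector:
        """Build an opener that routes HTTP and HTTPS through the next proxy."""
        proxy = self.next_proxy()
        handler = urllib.request.ProxyHandler({"http": proxy, "https": proxy})
        return urllib.request.build_opener(handler)
```

Round-robin is the simplest policy; real pools often also drop addresses that start returning blocks, which this sketch does not attempt.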

Ethics and the AUP — not optional

Scraping lives in a legal and ethical gray zone, so be clear-eyed:

Respect rate limits and robots.txt. Throttle your requests. A polite scraper looks like traffic; an impolite one looks like an attack.
Don't DoS your target. Hammering a site until it falls over isn't scraping, it's a denial-of-service attack, and it's an acceptable-use violation here that gets the service terminated.
Scrape only what you're allowed to. Legal, permitted data collection is the line. Cross it and it's on you.
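The first two rules above are cheap to enforce in code. The following is a sketch using the standard library's urllib.robotparser plus a small delay helper; USER_AGENT and the one-second default delay are assumptions, and the robots.txt content is expected to have been fetched already by whatever HTTP client you use.

```python
import time
import urllib.robotparser

USER_AGENT = "example-scraper"  # HYPOTHETICAL; use a name that identifies you


def load_robots(robots_txt: str) -> urllib.robotparser.RobotFileParser:
    """Parse robots.txt content that has already been downloaded."""
    parser = urllib.robotparser.RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def polite_delay(parser, default: float = 1.0) -> float:
    """Use the site's Crawl-delay if it declares one, else a default."""
    delay = parser.crawl_delay(USER_AGENT)
    return float(delay) if delay is not None else default


class Throttle:
    """Enforce a minimum interval between consecutive requests."""

    def __init__(self, min_interval: float):
        self.min_interval = min_interval
        self._last = None  # monotonic time of the previous request, if any

    def wait(self) -> None:
        if self._last is not None:
            remaining = self.min_interval - (time.monotonic() - self._last)
            if remaining > 0:
                time.sleep(remaining)
        self._last = time.monotonic()
```

In a crawl loop you would call parser.can_fetch(USER_AGENT, url) before each request and throttle.wait() right before sending it, skipping any URL the site disallows.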

The bottom line

A VPS is the right home for a scraper: always-on, scheduled, and off your home IP. Match the plan to the stack ($3 Nano for httpx, $5 Micro to $8 Small for Playwright), schedule with a systemd timer, and be honest about IPs: shared egress shares reputation, a dedicated IP is yours to build or burn, and real scale means proxy pools. Lock the box down first with the new-VPS security checklist, size it right using the VPS sizing guide, and if paying without a card matters to you for privacy, the anonymous VPS breakdown is the honest version. Scrape responsibly: the AUP is real.