Sending every prompt to someone else's API is fine — until it isn't. Maybe the data is sensitive and you'd rather it never leave your server. Maybe you're tired of rate limits, or of a model version changing under your feet, or of a per-token meter ticking while you experiment. At some point "what if I just ran my own?" stops being a thought experiment.
Ollama makes that genuinely easy. The harder question is what fits on a VPS — and here the honest answer matters more than the hype.
What you can actually run on CPU
No GPU? Then you're doing CPU inference, and model size is everything. Quantization (storing each weight in about 4 bits instead of the usual 16) is what makes this practical: you lose a sliver of quality and cut the memory the weights need to roughly a quarter.
Rough numbers, the ones that matter:
- 3B model, 4-bit — ~3 GB RAM. Snappy enough for chat and simple tasks.
- 7B model, 4-bit — ~5 GB RAM. The sweet spot: noticeably smarter, still runs at a readable pace.
- 13B and up — 8-10 GB+ and slow on CPU. Technically possible, practically annoying.
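Those figures can be sanity-checked with back-of-the-envelope arithmetic. The sketch below is an estimate, not a measurement: it assumes about 4.5 bits per weight (4-bit formats carry some per-block metadata) and a flat 1 GB for context and runtime overhead. Both numbers are illustrative assumptions, and the function name is made up for this example.

```python
def estimate_ram_gb(params_billion: float,
                    bits_per_weight: float = 4.5,
                    overhead_gb: float = 1.0) -> float:
    """Rough RAM estimate (in GB) for running a quantized model on CPU.

    bits_per_weight and overhead_gb are assumptions, not measured values.
    """
    # One billion weights at 8 bits each is 1 GB, so scale by bits / 8.
    weights_gb = params_billion * bits_per_weight / 8
    return weights_gb + overhead_gb


for size in (3, 7, 13):
    print(f"{size}B: ~{estimate_ram_gb(size):.1f} GB")
```

Under these assumptions the estimates land near the list above: a little under 3 GB for 3B, about 5 GB for 7B, and over 8 GB for 13B. Longer contexts push the overhead up.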
Speed, honestly: on a few vCPUs you'll see a handful of tokens per second. Perfectly fine for a chatbot or a coding assistant where you read as it types. Not fine for batch-processing a million documents — that's a GPU job, and we don't pretend otherwise. We don't offer GPU instances. If your plan needs a 70B model or heavy throughput, a CPU VPS — ours or anyone's — is the wrong tool, and you should know that before you spend a cent.
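To see why batch work is a GPU job, here is the arithmetic as a small sketch. The inputs are assumptions picked for illustration: 200 generated tokens per document and 5 tokens per second. It counts only generated tokens and ignores the time spent reading each prompt, so it is optimistic.

```python
def batch_days(num_docs: int, tokens_per_doc: int, tokens_per_sec: float) -> float:
    """Days needed to generate output for num_docs documents, one at a time."""
    total_tokens = num_docs * tokens_per_doc
    seconds = total_tokens / tokens_per_sec
    return seconds / 86_400  # seconds in a day


# Assumed workload: a million documents, 200 output tokens each, 5 tokens/s.
print(f"{batch_days(1_000_000, 200, 5):.0f} days")  # prints "463 days"
```

More than a year of wall-clock time for a single pass is the point where a CPU box stops making sense.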
But a private 7B that answers your questions and never phones home? That runs comfortably on a 6 GB box.
The install — three commands
Spin up the server, SSH in, and:
curl -fsSL https://ollama.com/install.sh | sh # installs Ollama
ollama run llama3.2:3b # pulls + runs a 3B model
That's it — you're chatting in the terminal. To use it from your own code, Ollama already serves an HTTP API on port 11434:
curl http://localhost:11434/api/generate -d '{
  "model": "llama3.2:3b",
  "prompt": "Summarize this changelog in two lines: ...",
  "stream": false
}'
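The same call from Python might look like the sketch below, using only the standard library. It assumes Ollama is listening on localhost:11434 and that the non-streaming reply carries the generated text in a `response` field. The helper names `build_generate_request` and `generate` are invented for this example.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"


def build_generate_request(prompt: str, model: str = "llama3.2:3b",
                           url: str = OLLAMA_URL) -> urllib.request.Request:
    """Build the POST request for a single, non-streaming completion."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),  # a body makes this a POST
        headers={"Content-Type": "application/json"},
    )


def generate(prompt: str, model: str = "llama3.2:3b",
             timeout: float = 300.0) -> str:
    """Send the prompt to the local Ollama server and return the reply text."""
    req = build_generate_request(prompt, model)
    # CPU inference is slow, so the timeout is deliberately generous.
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        body = json.load(resp)
    return body["response"]
```

With the server running, `generate("Summarize this changelog in two lines: ...")` would return the model's answer as a string. Splitting request construction from sending keeps the payload easy to inspect without a live server.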
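One gotcha worth knowing up front: by default that API binds to localhost, so nothing outside the server can reach it. Keep it that way and tunnel over SSH when you need remote access. If you rebind it to a public interface instead, it's exposed to the internet, and Ollama has no built-in authentication. If you want it reachable, put it behind auth; don't leave an open model endpoint on a public IP.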
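A quick way to confirm the endpoint is not reachable from outside is to try a TCP connection from another machine. The sketch below is a generic port check, not an Ollama feature, and `port_is_open` is a name made up for this example.

```python
import socket


def port_is_open(host: str, port: int = 11434, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False
```

Run from your laptop, `port_is_open("203.0.113.10")` (a placeholder address, swap in your server's public IP) should return False while the API is bound to localhost. For remote access, an SSH tunnel such as `ssh -L 11434:localhost:11434 user@your-server` (user and host are placeholders) forwards the port to your own machine without opening it publicly.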
Picking the box
Match the plan to the model, not the other way around:
| You want to run | RAM you need | Sensible plan |
|---|---|---|
| 3B model, light use | ~3 GB | Small (4 GB) |
| 7B model, comfortably | ~5-6 GB | Medium (6 GB) |
| Bigger / high throughput | GPU territory | not a CPU VPS |
For most self-hosters, Medium (6 GB) is the honest recommendation — enough headroom for a 7B model plus your app and the OS. Small (4 GB) works if you stick to 3B. Anything below that is too tight once the OS and context eat in.
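The sizing rule can be written down as a tiny lookup. This is a sketch of the reasoning in the table, not a tool we ship: the plan sizes come from the table, while the 1 GB of headroom for the OS and your app is an assumed figure and `pick_plan` is a hypothetical helper.

```python
from typing import Optional

# (plan name, RAM in GB), taken from the table above.
PLANS = [("Small", 4), ("Medium", 6)]


def pick_plan(model_ram_gb: float, headroom_gb: float = 1.0) -> Optional[str]:
    """Return the smallest plan that fits the model plus headroom, or None."""
    needed = model_ram_gb + headroom_gb
    for name, ram_gb in PLANS:
        if needed <= ram_gb:
            return name
    return None  # does not fit on a CPU plan; this is GPU territory
```

With these assumptions, a 3 GB model maps to Small, a 5 GB model to Medium, and anything needing 8 GB or more returns None.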
Why do it here
If you're self-hosting an LLM, privacy is usually half the reason — so it'd be odd to hand over your ID to rent the box. You don't have to: you can pay in USDC or USDT (or a card via the on-ramp), no KYC, and the server is yours in about a minute. Crypto-native, agent-friendly, and the data stays on a machine you control.
The trade-off is the one we've been honest about: CPU only, a 6 GB ceiling, small models. Within that, self-hosting is great. Outside it, don't let anyone sell you a CPU box for a job that needs a GPU.
Ready to try? Pick a plan, pay, and you'll have root in about 60 seconds — then it's three commands to your own private model.