Can I run an LLM without a GPU?

Yes, for small models. A 3B or 7B model in 4-bit quantization runs on CPU and answers at a readable pace, which is fine for a chatbot, a coding helper, or background tasks where you're not watching the cursor. What doesn't work well on CPU is serving a big model (13B and up) or pushing high throughput; that's where a GPU stops being optional.

How much RAM do I need for Llama 7B?

Around 5 GB for a 4-bit 7B model once you add the context window and the OS, so a 6 GB box is the comfortable floor. A 3B model fits in about 3 GB. A lower-bit quantization such as Q4 trades a little quality for a lot less memory, which is the right trade on a VPS.

Is self-hosting an LLM cheaper than an API?

It depends on volume. A hosted API charges per token and costs nothing when idle; a VPS is a flat monthly bill whether you use it or not. If you run a steady stream of requests, or you mainly care about privacy and no rate limits, self-hosting wins. For occasional one-off prompts, a metered API is cheaper.
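To put a number on "it depends on volume", here is a minimal sketch of the break-even arithmetic. Both prices are made-up placeholders, not quotes from any provider, so plug in your own.

```shell
# Break-even volume: tokens per month at which a flat-rate VPS
# costs the same as a metered API. Prices are illustrative assumptions.
breakeven_tokens() {
  vps_usd_per_month=$1      # flat monthly VPS bill (assumed)
  api_usd_per_million=$2    # API price per million tokens (assumed)
  awk -v v="$vps_usd_per_month" -v a="$api_usd_per_million" \
    'BEGIN { printf "%.0f\n", v / a * 1000000 }'
}

# Hypothetical $10/month box vs. a hypothetical $2 per million tokens
breakeven_tokens 10 2   # prints 5000000
```

Below that volume the metered API is cheaper on price alone; above it the flat bill wins. Privacy and rate limits don't show up in this sum, so weigh them separately.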

Why self-host instead of using a cloud API?

Three reasons people actually do it: the data never leaves your server (which matters for legal or medical data, or just private notes), there are no rate limits or per-token meter, and the model won't change or get deprecated under you. The cost is that you manage the box and live within its hardware.

What's the biggest model I can realistically run on a CPU VPS?

A 7B model in 4-bit is the sweet spot on a 6 GB box. You can technically load a 13B with enough RAM, but on CPU it gets slow enough that you'll feel it. Past that you want a GPU, which is a different kind of host.

Self-hosting a local LLM on a VPS with Ollama — what actually works

Sending every prompt to someone else's API is fine — until it isn't. Maybe the data is sensitive and you'd rather it never leave your server. Maybe you're tired of rate limits, or of a model version changing under your feet, or of a per-token meter ticking while you experiment. At some point "what if I just ran my own?" stops being a thought experiment.

Ollama makes that genuinely easy. The harder question is what fits on a VPS — and here the honest answer matters more than the hype.

What you can actually run on CPU

No GPU? Then you're doing CPU inference, and model size is everything. Quantization (squeezing the weights down to 4-bit) is what makes this practical — you lose a sliver of quality and save a pile of memory.

Rough numbers, the ones that matter:

3B model, 4-bit — ~3 GB RAM. Snappy enough for chat and simple tasks.
7B model, 4-bit — ~5 GB RAM. The sweet spot: noticeably smarter, still runs at a readable pace.
13B and up — 8-10 GB+ and slow on CPU. Technically possible, practically annoying.
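The rough numbers above follow from simple arithmetic: weights take roughly parameters times bits per weight divided by 8, plus some room for the context window and runtime. Here is a back-of-the-envelope sketch; the 1.5 GB overhead is an assumption chosen for illustration, and real quantized files tend to be a little larger than a pure 4 bits per weight, so treat the result as a floor.

```shell
# Rough RAM estimate for a quantized model, in GB.
# weights = params (billions) * bits per weight / 8
# overhead_gb (context window, runtime) defaults to an assumed 1.5 GB.
estimate_ram_gb() {
  params_billion=$1
  bits_per_weight=$2
  overhead_gb=${3:-1.5}
  awk -v p="$params_billion" -v b="$bits_per_weight" -v o="$overhead_gb" \
    'BEGIN { printf "%.1f\n", p * b / 8 + o }'
}

estimate_ram_gb 3 4    # prints 3.0
estimate_ram_gb 7 4    # prints 5.0
estimate_ram_gb 13 4   # prints 8.0
```

A longer context window pushes the overhead up, so pass a larger third argument if you plan to feed the model long documents.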

Speed, honestly: on a few vCPUs you'll see a handful of tokens per second. Perfectly fine for a chatbot or a coding assistant where you read as it types. Not fine for batch-processing a million documents — that's a GPU job, and we don't pretend otherwise. We don't offer GPU instances. If your plan needs a 70B model or heavy throughput, a CPU VPS — ours or anyone's — is the wrong tool, and you should know that before you spend a cent.
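To see why chat is fine and batch work isn't, here is the arithmetic as a quick sketch. The 5 tokens per second and 200 tokens per reply are assumed figures standing in for "a handful"; measure your own box before trusting them. Prompt processing time is ignored.

```shell
# Seconds to generate N tokens at a given speed.
generation_seconds() {
  awk -v n="$1" -v s="$2" 'BEGIN { printf "%.0f\n", n / s }'
}

# Days to generate a whole batch: docs * tokens per doc / tokens per second.
batch_days() {
  awk -v d="$1" -v t="$2" -v s="$3" \
    'BEGIN { printf "%.1f\n", d * t / s / 86400 }'
}

generation_seconds 200 5      # one 200-token reply: prints 40
batch_days 1000000 200 5      # a million such replies: prints 463.0
```

Forty seconds for a reply you read as it streams is tolerable. More than a year for the batch is not.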

But a private 7B that answers your questions and never phones home? That runs comfortably on a 6 GB box.

The install — three commands

Spin up the server, SSH in, and:

curl -fsSL https://ollama.com/install.sh | sh   # installs Ollama
ollama run llama3.2:3b                            # pulls + runs a 3B model

That's it — you're chatting in the terminal. To use it from your own code, Ollama already serves an HTTP API on port 11434:

curl http://localhost:11434/api/generate -d '{
  "model": "llama3.2:3b",
  "prompt": "Summarize this changelog in two lines: ...",
  "stream": false
}'
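The reply comes back as JSON, with the generated text in its response field. If you only want the answer, a small wrapper helps. This sketch assumes jq is installed (it is a separate tool, not part of Ollama); the function names build_payload and ask are made up for illustration.

```shell
# Sketch: ask a local Ollama server a question and print only the answer.
# Assumes jq is installed and the server listens on localhost:11434.

# Build the JSON body with jq so quotes in the prompt are escaped properly.
build_payload() {
  jq -cn --arg model "$1" --arg prompt "$2" \
    '{model: $model, prompt: $prompt, stream: false}'
}

# ask MODEL PROMPT -> prints the "response" field of the reply
ask() {
  build_payload "$1" "$2" |
    curl -s http://localhost:11434/api/generate -d @- |
    jq -r '.response'
}

# Usage (needs a running server, so it is left commented out here):
#   ask llama3.2:3b "Summarize this changelog in two lines: ..."
```

Building the body with jq rather than pasting the prompt into a string keeps things from breaking the first time a prompt contains a double quote.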

One gotcha worth knowing up front: by default that API binds to localhost, so it is reachable only from the server itself. Keep it that way and reach it through an SSH tunnel. If you rebind it to a public interface, anyone on the internet can reach it, so put it behind auth first. Don't leave an open model endpoint on a public IP.
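Tunnelling over SSH can be as small as one command run on your own machine. In this sketch, user@your-vps is a placeholder for your own login and host, and open_tunnel is just an illustrative name.

```shell
# Sketch: reach a localhost-only Ollama API from your laptop over SSH.
# "user@your-vps" below is a placeholder; substitute your own login and host.

# Forward local port 11434 to port 11434 on the server's loopback interface.
# -N: run no remote command, just hold the tunnel open; -L: local port forward.
open_tunnel() {
  ssh -N -L 11434:localhost:11434 "$1"
}

# Usage, on your own machine (commented out so nothing connects here):
#   open_tunnel user@your-vps &
#   curl http://localhost:11434/api/generate \
#     -d '{"model": "llama3.2:3b", "prompt": "hi", "stream": false}'
```

With the tunnel up, your local code talks to localhost:11434 as if the model were running on your laptop, and nothing is exposed on the server's public IP. On the server, ss -ltn should show port 11434 listening on 127.0.0.1 rather than 0.0.0.0.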

Picking the box

Match the plan to the model, not the other way around:

You want to run	RAM you need	Sensible plan
3B model, light use	~3 GB	Small (4 GB)
7B model, comfortably	~5-6 GB	Medium (6 GB)
Bigger / high throughput	GPU territory	not a CPU VPS

For most self-hosters, Medium (6 GB) is the honest recommendation — enough headroom for a 7B model plus your app and the OS. Small (4 GB) works if you stick to 3B. Anything below that is too tight once the OS and context eat in.

Why do it here

If you're self-hosting an LLM, privacy is usually half the reason — so it'd be odd to hand over your ID to rent the box. You don't have to: you can pay in USDC or USDT (or a card via the on-ramp), no KYC, and the server is yours in about a minute. Crypto-native, agent-friendly, and the data stays on a machine you control.

The trade-off is the one we've been honest about: CPU only, a 6 GB ceiling, small models. Within that, self-hosting is great. Outside it, don't let anyone sell you a CPU box for a job that needs a GPU.

Ready to try? Pick a plan, pay, and you'll have root in about 60 seconds — then it's three commands to your own private model.
