Affiliate disclosure: Some links in this article are affiliate links. If you sign up through them, we may earn a commission at no extra cost to you. Recommendations are based on documented platform capabilities and official pricing as of May 2026.
Self-Host Ollama on a $7 VPS: Complete Setup Guide (2026)
Running your own LLM inference server costs less than a streaming subscription. Ollama makes it straightforward to run models like Llama 3, Mistral, Qwen, and Phi locally or on a VPS — and a CPU-only server is enough for many use cases.
This guide covers the complete setup: choosing a VPS, installing Ollama, picking models that fit the hardware, securing the API endpoint, and keeping costs predictable. No GPU required for the $7 tier.
When CPU-only Ollama is actually useful
GPU hosting is where Ollama shines for speed — but CPU-only inference is not useless. It works well for:
- Development and experimentation — testing prompts, evaluating models, building prototypes before committing to GPU costs
- Latency-tolerant simple tasks (short classification, simple RAG queries on small corpora)
- Small models — Phi-3 Mini (3.8B), Gemma 2 (2B), Llama 3.2 (1B/3B) run adequately on a 4-core CPU with 8 GB RAM
- Always-available fallback — if your primary GPU inference provider has an outage, a CPU Ollama instance handles the reduced workload
For production inference on large models (13B+), you will want GPU hosting (Lambda Labs, RunPod, or a Hetzner dedicated GPU server). This guide targets the CPU use case explicitly.
Hardware requirements by model size
| Model | Parameters | Minimum RAM | Recommended RAM | CPU-only usable? |
|---|---|---|---|---|
| Phi-3 Mini | 3.8B | 4 GB | 6 GB | Yes — reasonable speed |
| Llama 3.2 | 1B/3B | 2–4 GB | 4 GB | Yes — fast |
| Gemma 2 | 2B/9B | 4–8 GB | 8 GB | 2B yes; 9B slow |
| Qwen 2.5 | 7B | 6 GB | 8 GB | Usable, slow |
| Mistral 7B | 7B | 6 GB | 8 GB | Usable, slow |
| Llama 3 | 8B | 6 GB | 8 GB | Slow on CPU |
| Llama 3.1 | 70B | 40+ GB | 48 GB | CPU not practical |
Rule of thumb: RAM needed ≈ 1.5× the model file size. A Q4-quantized 7B model is ~4 GB on disk, so plan for roughly 6 GB of available RAM (model weights plus inference overhead).
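The rule of thumb can be turned into a quick sizing check. This is a minimal sketch: the function names are hypothetical, and the 1 GB reserved for the operating system is an assumption, not a measured figure.

```python
def estimate_ram_gb(model_file_gb: float, overhead_factor: float = 1.5) -> float:
    """Estimate RAM needed to run a quantized model, using the 1.5x rule of thumb."""
    return model_file_gb * overhead_factor


def fits_in_ram(model_file_gb: float, total_ram_gb: float, reserved_gb: float = 1.0) -> bool:
    """True if the estimate fits after reserving RAM for the OS (assumed 1 GB)."""
    return estimate_ram_gb(model_file_gb) <= total_ram_gb - reserved_gb


# A ~4 GB Q4 7B model: 6.0 GB estimated
print(estimate_ram_gb(4.0))   # 6.0
print(fits_in_ram(4.0, 8.0))  # True  (8 GB server, 7 GB usable)
print(fits_in_ram(4.0, 4.0))  # False (4 GB server, 3 GB usable)
```

Treat the result as a first filter only; actual usage also depends on context length and concurrent requests.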
Choosing a VPS
For CPU-only Ollama, the sweet spot is a 4 vCPU / 8 GB RAM instance. This handles 7B models (slowly) and smaller models (adequately).
Hetzner Cloud (recommended)
Hetzner offers the best price-to-performance for this use case:
| Instance | vCPU | RAM | Monthly | Best for |
|---|---|---|---|---|
| CX22 | 2 AMD | 4 GB | €4.85 (~$5.20) | Llama 3.2 3B, Phi-3 Mini |
| CX32 | 4 AMD | 8 GB | €9.68 (~$10.40) | Mistral 7B, Qwen 7B |
| CX42 | 8 AMD | 16 GB | €19.35 (~$21) | Larger models |
Hetzner’s network locations: Nuremberg, Falkenstein (Germany), Helsinki (Finland), Hillsboro (US). Choose the one closest to your users.
DigitalOcean
| Droplet | vCPU | RAM | Monthly |
|---|---|---|---|
| Basic 4 GB | 2 | 4 GB | $24 |
| Basic 8 GB | 4 | 8 GB | $48 |
DigitalOcean is more expensive than Hetzner for equivalent specs but has a more beginner-friendly dashboard and more global datacenter locations.
What to avoid
- Undersized instances such as Hetzner CX11 or DigitalOcean Basic 1 GB, which have insufficient RAM for any meaningful model
- ARM instances — Ollama runs on ARM but binary availability varies; x86 is safer for initial setup
Step 1: Provision your VPS
This guide uses a Hetzner CX32 (4 vCPU, 8 GB RAM, Ubuntu 22.04).
After creating the server in the Hetzner Cloud console:
# SSH in with your key
ssh root@your-server-ip

# Update the system
apt update && apt upgrade -y

# Create a non-root user (recommended)
useradd -m -s /bin/bash ollama
usermod -aG sudo ollama
Step 2: Install Ollama
The official install script handles the binary, systemd service, and user setup:
curl -fsSL https://ollama.com/install.sh | sh
This installs Ollama as a systemd service that starts automatically. Verify:
systemctl status ollama
# Should show: active (running)
ollama --version
By default, Ollama listens on localhost:11434 and is not externally accessible. This is intentional — the API has no built-in authentication.
Step 3: Pull your first model
# As the ollama user or root
ollama pull llama3.2:3b   # 2 GB download, good baseline
ollama pull phi3:mini     # 2.3 GB, fast on CPU
ollama pull mistral:7b    # 4.1 GB (default 4-bit quantization), better quality, slower on CPU

# List downloaded models
ollama list
Test locally:
ollama run phi3:mini "What is the capital of Japan?"
For API usage (while on the server):
curl http://localhost:11434/api/generate \
  -d '{"model":"phi3:mini","prompt":"What is the MCP protocol?","stream":false}'
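The same request can be made from Python with only the standard library. This is a sketch under the guide's assumptions (Ollama on localhost:11434, non-streaming mode); build_generate_request and generate are hypothetical helper names.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434"  # default local endpoint from this guide


def build_generate_request(model: str, prompt: str, base_url: str = OLLAMA_URL) -> urllib.request.Request:
    """Build a non-streaming POST request for /api/generate."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        f"{base_url}/api/generate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(model: str, prompt: str, timeout: float = 300.0) -> str:
    """Send the request and return the generated text (the "response" field)."""
    request = build_generate_request(model, prompt)
    with urllib.request.urlopen(request, timeout=timeout) as reply:
        return json.loads(reply.read())["response"]
```

On the server, `print(generate("phi3:mini", "What is the capital of Japan?"))` should print the model's answer. The generous timeout reflects how slow CPU inference can be.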
Step 4: Expose the API securely
By default, Ollama only listens on localhost. To expose it externally, you have two options.
Option A: nginx reverse proxy with bearer token auth (recommended)
Install nginx and configure a reverse proxy with token authentication:
apt install nginx -y
Create /etc/nginx/sites-available/ollama:
server {
    listen 443 ssl;
    server_name ollama.yourdomain.com;

    ssl_certificate /etc/letsencrypt/live/ollama.yourdomain.com/fullchain.pem;
    ssl_certificate_key /etc/letsencrypt/live/ollama.yourdomain.com/privkey.pem;

    location / {
        # Simple bearer token auth
        set $auth_token "Bearer YOUR_STRONG_TOKEN_HERE";
        if ($http_authorization != $auth_token) {
            return 401 "Unauthorized\n";
        }

        proxy_pass http://localhost:11434;
        # Ollama rejects unfamiliar Host headers when bound to localhost
        proxy_set_header Host localhost:11434;
        proxy_set_header X-Real-IP $remote_addr;

        # Required for streaming responses
        proxy_buffering off;
        proxy_read_timeout 300s;
        proxy_connect_timeout 300s;
    }
}
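The placeholder YOUR_STRONG_TOKEN_HERE needs a real random value. One way to produce it, sketched here with Python's standard secrets module (the helper names are hypothetical):

```python
import secrets


def new_bearer_token(num_bytes: int = 32) -> str:
    """Return a URL-safe random token; 32 bytes gives 256 bits of entropy."""
    return secrets.token_urlsafe(num_bytes)


def authorization_header(token: str) -> dict:
    """Build the header value that the nginx check compares against."""
    return {"Authorization": f"Bearer {token}"}


token = new_bearer_token()
print(token)                        # paste into the nginx config in place of the placeholder
print(authorization_header(token))  # what clients must send
```

The token uses only letters, digits, hyphens, and underscores, so it can sit inside the quoted nginx string without escaping.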
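Get a TLS certificate with Certbot. The certonly subcommand obtains the certificate without rewriting your nginx configuration, so the server block above stays in control of this hostname:
apt install certbot python3-certbot-nginx -y
certbot certonly --nginx -d ollama.yourdomain.com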
Enable the site and reload nginx:
ln -s /etc/nginx/sites-available/ollama /etc/nginx/sites-enabled/
nginx -t && systemctl reload nginx
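Disabling proxy buffering matters because Ollama streams its reply as newline-delimited JSON, one object per fragment, with a final object marked "done": true. A minimal parsing sketch follows; iter_stream_text is a hypothetical helper, and the sample lines are illustrative rather than captured output.

```python
import json
from typing import Iterable, Iterator


def iter_stream_text(lines: Iterable[bytes]) -> Iterator[str]:
    """Yield text fragments from an Ollama NDJSON stream until "done" is true."""
    for raw in lines:
        if not raw.strip():
            continue  # skip blank lines
        chunk = json.loads(raw)
        yield chunk.get("response", "")
        if chunk.get("done"):
            break


# Sample lines shaped like a streaming /api/generate reply (illustrative only)
sample = [
    b'{"response": "Tok", "done": false}\n',
    b'{"response": "yo", "done": false}\n',
    b'{"response": "", "done": true}\n',
]
print("".join(iter_stream_text(sample)))  # Tokyo
```

With a live connection, the response object returned by urllib.request.urlopen can be passed in directly, since iterating over it yields one line at a time.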
Option B: Bind Ollama to all interfaces with firewall rules
Edit Ollama’s systemd service to bind to 0.0.0.0:
systemctl edit ollama
Add:
[Service]
Environment="OLLAMA_HOST=0.0.0.0"
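Restart Ollama so the override takes effect (systemctl restart ollama), then restrict access via UFW to specific IP addresses. Allow SSH before enabling the firewall, or you will lock yourself out:
ufw allow OpenSSH
ufw allow from YOUR_IP to any port 11434 proto tcp
ufw enable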
Note: Option A with nginx is more robust — it gives you TLS, proper auth, and easy future expansion (rate limiting, multiple upstreams).
Step 5: Configure for production use
Set model memory limits
Add to Ollama’s systemd override to control memory usage:
systemctl edit ollama
[Service]
Environment="OLLAMA_MAX_LOADED_MODELS=1"
Environment="OLLAMA_NUM_PARALLEL=2"
OLLAMA_MAX_LOADED_MODELS=1 ensures only one model is resident in RAM at once, which is critical on 8 GB RAM. OLLAMA_NUM_PARALLEL=2 allows two concurrent requests to the same model. Run systemctl restart ollama after saving the override so the new settings take effect.
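On the client side, it can help to cap in-flight requests at the same number as OLLAMA_NUM_PARALLEL so that extra requests wait in your application rather than on the server. The sketch below is illustrative: run_concurrently and fake_send are hypothetical names, and fake_send stands in for a real HTTP call so the example runs anywhere.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List


def run_concurrently(prompts: List[str], send: Callable[[str], str], workers: int = 2) -> List[str]:
    """Send prompts with at most `workers` in flight; results keep input order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(send, prompts))


def fake_send(prompt: str) -> str:
    """Stand-in for a real request to the Ollama API."""
    return prompt.upper()


print(run_concurrently(["a", "b", "c"], fake_send))  # ['A', 'B', 'C']
```

Swapping fake_send for a function that calls your endpoint keeps the concurrency cap in one place.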
Set up log rotation
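With the default systemd install, Ollama logs to the journal (view with journalctl -u ollama), and journald manages its own size limits. If you redirect Ollama's output to files under /var/log/ollama/, configure rotation: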
cat > /etc/logrotate.d/ollama << 'EOF'
/var/log/ollama/*.log {
    daily
    rotate 7
    compress
    delaycompress
    missingok
    notifempty
}
EOF
Enable automatic restarts
Ollama’s default systemd unit includes Restart=always. Verify:
systemctl cat ollama | grep Restart
If it’s not set, add it via systemctl edit ollama.
Step 6: Use with your applications
OpenAI-compatible API
Ollama exposes an OpenAI-compatible API at /v1/. Applications using the OpenAI Python or JavaScript SDK can talk to Ollama with a base URL override:
from openai import OpenAI

client = OpenAI(
    base_url="https://ollama.yourdomain.com/v1",
    # The SDK adds the "Bearer " prefix itself, so pass only the raw token
    api_key="YOUR_STRONG_TOKEN_HERE",
)

response = client.chat.completions.create(
    model="phi3:mini",
    messages=[{"role": "user", "content": "Explain Docker volumes briefly."}],
)
print(response.choices[0].message.content)
LangChain integration
from langchain_community.llms import Ollama
llm = Ollama(
base_url="https://ollama.yourdomain.com",
model="phi3:mini",
headers={"Authorization": "Bearer YOUR_STRONG_TOKEN_HERE"}
)
response = llm.invoke("What is the difference between Railway and Fly.io?")
Claude Code agent fallback
In a Claude Code agent, configure Ollama as a fallback for tasks where Claude isn’t needed:
import os

import anthropic
from openai import OpenAI

# Use Claude for complex reasoning
claude = anthropic.Anthropic()

# Use local Ollama for simple classification/routing
local_llm = OpenAI(
    base_url=os.environ["OLLAMA_URL"] + "/v1",
    api_key=os.environ["OLLAMA_TOKEN"],  # raw token, without the "Bearer " prefix
)


def classify_intent(text: str) -> str:
    """Simple classification; Ollama is fast enough for this."""
    response = local_llm.chat.completions.create(
        model="phi3:mini",
        messages=[{"role": "user", "content": f"Classify as 'question', 'command', or 'other': {text}"}],
    )
    return response.choices[0].message.content.strip()
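Small models do not always answer with a bare label; they may add quotes, punctuation, or a full sentence. A defensive step is to map the raw reply onto the expected labels before routing on it. This is a sketch with hypothetical names (VALID_LABELS, normalize_label), not part of any SDK.

```python
VALID_LABELS = ("question", "command", "other")


def normalize_label(raw: str, default: str = "other") -> str:
    """Map free-form model output onto one of the expected labels."""
    cleaned = raw.strip().lower().strip("'\".!")
    if cleaned in VALID_LABELS:
        return cleaned
    # Otherwise pick the valid label that appears earliest in the reply
    hits = [(cleaned.find(label), label) for label in VALID_LABELS if label in cleaned]
    return min(hits)[1] if hits else default


print(normalize_label(" 'Question'.\n"))      # question
print(normalize_label("This is a command."))  # command
print(normalize_label("I am not sure"))       # other
```

Wrapping the return value of classify_intent in normalize_label keeps downstream routing code from having to handle unexpected strings.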
Cost comparison: self-hosted vs. API
Running phi3:mini on a Hetzner CX32 ($10.40/month):
| Scenario | Self-hosted (Hetzner) | Anthropic Claude Haiku | Notes |
|---|---|---|---|
| 100k tokens/month | ~$10.40 flat | ~$0.10 | API is far cheaper at low volume |
| 1M tokens/month | ~$10.40 flat | ~$1 | API is still cheaper |
| 10M tokens/month | ~$10.40 flat | ~$10 | Break-even zone; self-hosted wins above this |
The real advantage of self-hosted is not cost for typical workloads — it’s data privacy and no per-token anxiety. If you need to process sensitive documents, experiment freely, or run high-volume batch jobs, the flat monthly rate makes sense.
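The break-even point can be worked out directly from the two pricing models. The sketch below uses the guide's own figures; the $1 per million tokens rate is the approximate blended price implied by the table, not an official quote, and the function names are hypothetical.

```python
def api_cost_usd(tokens_per_month: int, usd_per_million_tokens: float) -> float:
    """Pay-per-token cost for one month."""
    return tokens_per_month / 1_000_000 * usd_per_million_tokens


def break_even_tokens(flat_monthly_usd: float, usd_per_million_tokens: float) -> int:
    """Monthly token volume at which the flat server cost equals the API bill."""
    return round(flat_monthly_usd / usd_per_million_tokens * 1_000_000)


FLAT_USD = 10.40      # Hetzner CX32 figure used in this guide
API_USD_PER_M = 1.00  # assumed blended rate implied by the table

for volume in (100_000, 1_000_000, 10_000_000):
    cost = api_cost_usd(volume, API_USD_PER_M)
    print(f"{volume:>10,} tokens: API ${cost:.2f} vs flat ${FLAT_USD:.2f}")
print(f"Break-even: {break_even_tokens(FLAT_USD, API_USD_PER_M):,} tokens/month")
```

Plug in the current per-token price of whichever API you are comparing against; the break-even volume moves proportionally.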
Common issues
OOM kills. If your server runs out of RAM mid-inference, reduce OLLAMA_MAX_LOADED_MODELS to 1 and consider a smaller quantization level (e.g., q4_K_S instead of q8_0).
Slow inference on CPU. Expected. A 7B model on a 4-core CPU generates ~2–5 tokens/second. For interactive use, prefer models under 4B. For batch use, the speed is fine.
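To see what those rates mean in practice, the arithmetic can be sketched as follows. The 200-token reply length is an arbitrary example, and the monthly ceiling assumes a single stream generating nonstop, which real workloads will not reach.

```python
def seconds_to_generate(num_tokens: int, tokens_per_second: float) -> float:
    """Wall-clock time to generate a reply of the given length."""
    return num_tokens / tokens_per_second


def monthly_token_ceiling(tokens_per_second: float, days: int = 30) -> int:
    """Upper bound on tokens per month if the server generated nonstop."""
    return int(tokens_per_second * 60 * 60 * 24 * days)


for rate in (2, 5):  # the 7B-on-4-cores range quoted above
    print(f"{rate} tok/s: 200-token reply in {seconds_to_generate(200, rate):.0f} s, "
          f"at most {monthly_token_ceiling(rate):,} tokens/month")
```

At 2 tokens/second a 200-token reply takes 100 seconds, which is why smaller models are the better fit for interactive use.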
Port 11434 not accessible. Check that UFW or Hetzner’s firewall rules allow the port, or use the nginx reverse proxy approach.
Model download fails. Ensure the server has at least 2× the model size in free disk space during download (the download and the final model both take space temporarily). Hetzner CX32 includes 40 GB disk — sufficient.
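A quick pre-flight check for the 2× disk rule can be written with the standard library. This is a sketch: has_room_for_model is a hypothetical helper, and the default path "/" is an assumption; point it at the filesystem where Ollama stores its models.

```python
import shutil


def has_room_for_model(model_size_gb: float, path: str = "/", safety_factor: float = 2.0) -> bool:
    """Check that free space at `path` is at least safety_factor x the model size."""
    free_gb = shutil.disk_usage(path).free / 1024**3
    return free_gb >= model_size_gb * safety_factor


# True if "/" has at least ~8.2 GB free for a 4.1 GB model
print(has_room_for_model(4.1))
```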
Next steps
If you want GPU inference instead of CPU:
- [Modal vs Replicate vs RunPod for GPU Inference](https://hostingpundit.com/modal-vs-replicate-vs-runpod/)
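- Hetzner dedicated GPU servers, hosted in EU data centers (the CCX53 and CCX63 cloud plans are dedicated-vCPU instances without GPUs, so check current GPU availability before ordering)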
If you want to use Ollama as an MCP server backend:
- Build an MCP server that wraps the Ollama API
- Deploy the MCP server to Railway or Fly.io
- Connect from Claude Code or Claude Desktop
Prices verified May 2026. Check official documentation before provisioning.