Self-Host Ollama on a $7 VPS: Complete Setup Guide (2026)

Affiliate disclosure: Some links in this article are affiliate links. If you sign up through them, we may earn a commission at no extra cost to you. Recommendations are based on documented platform capabilities and official pricing as of May 2026.


Running your own LLM inference server costs less than a streaming subscription. Ollama makes it straightforward to run models like Llama 3, Mistral, Qwen, and Phi locally or on a VPS — and a CPU-only server is enough for many use cases.

This guide covers the complete setup: choosing a VPS, installing Ollama, picking models that fit the hardware, securing the API endpoint, and keeping costs predictable. No GPU required for the $7 tier.


When CPU-only Ollama is actually useful

Ollama is fastest on a GPU, but CPU-only inference is still useful. It works well for:

  • Development and experimentation: testing prompts, evaluating models, building prototypes before committing to GPU costs
  • Short, simple tasks: brief classification, simple RAG queries on small corpora
  • Small models: Phi-3 Mini (3.8B), Gemma 2 (2B), and Llama 3.2 (1B/3B) run adequately on a 4-core CPU with 8 GB RAM
  • Always-available fallback: if your primary GPU inference provider has an outage, a CPU Ollama instance can handle a reduced workload

For production inference on large models (13B+), you need GPU hardware (for example Lambda Labs, RunPod, or Hetzner's dedicated GPU servers). This guide explicitly targets the CPU use case.


Hardware requirements by model size

| Model | Parameters | Minimum RAM | Recommended RAM | CPU-only usable? |
|---|---|---|---|---|
| Phi-3 Mini | 3.8B | 4 GB | 6 GB | Yes, reasonable speed |
| Llama 3.2 | 1B/3B | 2–4 GB | 4 GB | Yes, fast |
| Gemma 2 | 2B/9B | 4–8 GB | 8 GB | 2B yes; 9B slow |
| Qwen 2.5 | 7B | 6 GB | 8 GB | Usable, slow |
| Mistral 7B | 7B | 6 GB | 8 GB | Usable, slow |
| Llama 3 | 8B | 6 GB | 8 GB | Slow on CPU |
| Llama 3.1 | 70B | 40+ GB | 48 GB | CPU not practical |

Rule of thumb: budget roughly 1.5× the model file size in available RAM (model weights plus inference overhead). A Q4-quantized 7B model is ~4 GB on disk, so plan for about 6 GB of free RAM.
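The rule of thumb is easy to turn into a quick check before you pick a model. The sketch below is only an approximation: the 1.5× factor comes from this guide, and the function names are made up for illustration.

```python
def estimate_ram_gb(model_file_gb: float, overhead_factor: float = 1.5) -> float:
    """Approximate RAM needed to run a model: file size times an overhead factor."""
    if model_file_gb <= 0:
        raise ValueError("model_file_gb must be positive")
    return model_file_gb * overhead_factor


def fits(model_file_gb: float, available_ram_gb: float) -> bool:
    """True if the estimated requirement fits in the RAM that is actually free."""
    return estimate_ram_gb(model_file_gb) <= available_ram_gb


# Model sizes below are the download sizes quoted in this guide.
for name, size_gb in [("llama3.2:3b", 2.0), ("phi3:mini", 2.3), ("7B at Q4", 4.0)]:
    print(f"{name}: roughly {estimate_ram_gb(size_gb):.1f} GB RAM, fits in 8 GB: {fits(size_gb, 8.0)}")
```

Remember that the operating system and other services also use memory, so compare against free RAM rather than the instance's nominal size.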


Choosing a VPS

For CPU-only Ollama, the sweet spot is a 4 vCPU / 8 GB RAM instance. This handles 7B models (slowly) and smaller models (adequately).

Hetzner Cloud (recommended)

Hetzner offers the best price-to-performance for this use case:

| Instance | vCPU | RAM | Monthly | Best for |
|---|---|---|---|---|
| CX22 | 2 AMD | 4 GB | €4.85 (~$5.20) | Llama 3.2 3B, Phi-3 Mini |
| CX32 | 4 AMD | 8 GB | €9.68 (~$10.40) | Mistral 7B, Qwen 7B |
| CX42 | 8 AMD | 16 GB | €19.35 (~$21) | Larger models |

Hetzner’s network locations: Nuremberg, Falkenstein (Germany), Helsinki (Finland), Hillsboro (US). Choose the one closest to your users.

DigitalOcean

| Droplet | vCPU | RAM | Monthly |
|---|---|---|---|
| Basic 4 GB | 2 | 4 GB | $24 |
| Basic 8 GB | 4 | 8 GB | $48 |

DigitalOcean is more expensive than Hetzner for equivalent specs but has a more beginner-friendly dashboard and more global datacenter locations.

What to avoid

  • The smallest instances (Hetzner CX11, DigitalOcean Basic 1 GB): insufficient RAM for any meaningful model
  • ARM instances: Ollama runs on ARM, but binary availability varies; x86 is safer for an initial setup

Step 1: Provision your VPS

This guide uses a Hetzner CX32 (4 vCPU, 8 GB RAM, Ubuntu 22.04).

After creating the server in the Hetzner Cloud console:

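# SSH in with your key
ssh root@your-server-ip

# Update the system
apt update && apt upgrade -y

# Create a non-root sudo user (recommended). "deploy" is a placeholder name;
# avoid "ollama", which the install script in Step 2 uses for its service account.
useradd -m -s /bin/bash deploy
passwd deploy
usermod -aG sudo deploy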

Step 2: Install Ollama

The official install script handles the binary, systemd service, and user setup:

curl -fsSL https://ollama.com/install.sh | sh

This installs Ollama as a systemd service that starts automatically. Verify:

systemctl status ollama
# Should show: active (running)

ollama --version

By default, Ollama listens on localhost:11434 and is not externally accessible. This is intentional — the API has no built-in authentication.


Step 3: Pull your first model

# Run as any user on the server
ollama pull llama3.2:3b       # 2 GB download, good baseline
ollama pull phi3:mini         # 2.3 GB, fast on CPU
ollama pull mistral:7b        # 4.1 GB (4-bit quantized), better quality, slower on CPU

# List downloaded models
ollama list

Test locally:

ollama run phi3:mini "What is the capital of Japan?"

For API usage (while on the server):

curl http://localhost:11434/api/generate \
  -d '{"model":"phi3:mini","prompt":"What is the MCP protocol?","stream":false}'
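The same call can be made from Python with only the standard library, which is handy for scripts running on the server itself. This is a minimal sketch: the helper names are hypothetical, and it assumes the default localhost:11434 address and a non-streaming request, as in the curl example.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434"  # default local address used in this guide


def build_generate_request(model: str, prompt: str, base_url: str = OLLAMA_URL) -> urllib.request.Request:
    """Build a non-streaming POST to /api/generate without sending it."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        f"{base_url}/api/generate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(model: str, prompt: str, timeout: float = 300.0) -> str:
    """Send the request and return the generated text from the "response" field."""
    request = build_generate_request(model, prompt)
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        return json.loads(resp.read())["response"]


# Example (requires a running Ollama with the model pulled):
# print(generate("phi3:mini", "What is the capital of Japan?"))
```

The generous timeout is deliberate, since CPU inference can take a while on longer prompts.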

Step 4: Expose the API securely

By default, Ollama only listens on localhost. To expose it externally, you have two options.

Option A: nginx reverse proxy with bearer token auth (recommended)

Install nginx and configure a reverse proxy with token authentication:

apt install nginx -y

Create /etc/nginx/sites-available/ollama:

server {
    listen 443 ssl;
    server_name ollama.yourdomain.com;

    ssl_certificate /etc/letsencrypt/live/ollama.yourdomain.com/fullchain.pem;
    ssl_certificate_key /etc/letsencrypt/live/ollama.yourdomain.com/privkey.pem;

    location / {
        # Simple bearer token auth
        set $auth_token "Bearer YOUR_STRONG_TOKEN_HERE";
        if ($http_authorization != $auth_token) {
            return 401 "Unauthorized\n";
        }

        proxy_pass http://localhost:11434;
        # Ollama rejects unfamiliar Host headers when bound to localhost
        proxy_set_header Host localhost:11434;
        proxy_set_header X-Real-IP $remote_addr;

        # Required for streaming responses
        proxy_buffering off;
        proxy_read_timeout 300s;
        proxy_connect_timeout 300s;
    }
}
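The YOUR_STRONG_TOKEN_HERE placeholder should be replaced with a long random value. One way to generate it is Python's secrets module, sketched below; the function names are made up for illustration, and 32 bytes is an assumed length rather than a requirement.

```python
import secrets


def new_api_token(num_bytes: int = 32) -> str:
    """Return a URL-safe random token; 32 bytes gives 256 bits of randomness."""
    return secrets.token_urlsafe(num_bytes)


def nginx_auth_value(token: str) -> str:
    """Format the token as the full Authorization header value nginx compares against."""
    return f"Bearer {token}"


token = new_api_token()
print("Token for clients:", token)
print("Value for the nginx $auth_token variable:", nginx_auth_value(token))
```

URL-safe tokens contain only letters, digits, hyphens, and underscores, so they can be pasted between the double quotes in the nginx config without escaping.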

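Get a TLS certificate with Certbot before enabling the site, because nginx refuses to load a config whose certificate files do not exist yet. The certonly subcommand obtains the certificate without editing your nginx config:

apt install certbot python3-certbot-nginx -y
certbot certonly --nginx -d ollama.yourdomain.com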

Enable the site and reload nginx:

ln -s /etc/nginx/sites-available/ollama /etc/nginx/sites-enabled/
nginx -t && systemctl reload nginx
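Once nginx is reloaded, it is worth confirming that the token check behaves as intended. The sketch below probes /api/tags (Ollama's model list endpoint) with and without a token; the helper names are hypothetical, and the domain is the placeholder used throughout this guide.

```python
import urllib.error
import urllib.request
from typing import Optional


def build_probe(base_url: str, token: Optional[str] = None) -> urllib.request.Request:
    """Build a GET to /api/tags, optionally carrying a bearer token."""
    req = urllib.request.Request(f"{base_url.rstrip('/')}/api/tags", method="GET")
    if token is not None:
        req.add_header("Authorization", f"Bearer {token}")
    return req


def status_of(req: urllib.request.Request, timeout: float = 10.0) -> int:
    """Return the HTTP status code, including error statuses such as 401."""
    try:
        with urllib.request.urlopen(req, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        return err.code


# Example (run from your own machine, not the server):
# print(status_of(build_probe("https://ollama.yourdomain.com")))                # expect 401
# print(status_of(build_probe("https://ollama.yourdomain.com", "YOUR_TOKEN")))  # expect 200
```

If the unauthenticated probe returns anything other than 401, recheck the nginx config before sharing the URL.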

Option B: Bind Ollama to all interfaces with firewall rules

Edit Ollama’s systemd service to bind to 0.0.0.0:

systemctl edit ollama

Add:

[Service]
Environment="OLLAMA_HOST=0.0.0.0"

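Restart Ollama to apply the change (systemctl restart ollama), then restrict access via UFW to specific IP addresses. Allow SSH first so that enabling the firewall does not lock you out:

ufw allow OpenSSH
ufw allow from YOUR_IP to any port 11434 proto tcp
ufw enable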

Note: Option A with nginx is more robust — it gives you TLS, proper auth, and easy future expansion (rate limiting, multiple upstreams).


Step 5: Configure for production use

Set model memory limits

Add these lines to Ollama's systemd override to control memory usage, then apply them with systemctl restart ollama:

systemctl edit ollama

[Service]
Environment="OLLAMA_MAX_LOADED_MODELS=1"
Environment="OLLAMA_NUM_PARALLEL=2"

OLLAMA_MAX_LOADED_MODELS=1 ensures only one model is resident in RAM at once — critical on 8 GB RAM. OLLAMA_NUM_PARALLEL=2 allows two concurrent requests to the same model.
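On the client side, one option is to cap in-flight requests at the same number, so a batch job does not pile up more work than the server will process at once. This is a sketch under that assumption; run_capped is a hypothetical helper, and task stands for whatever function sends a single prompt to your server.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Iterable, List

NUM_PARALLEL = 2  # keep in step with OLLAMA_NUM_PARALLEL on the server


def run_capped(task: Callable[[str], str], prompts: Iterable[str], limit: int = NUM_PARALLEL) -> List[str]:
    """Run task over prompts with at most `limit` calls in flight.

    Results are returned in the same order as the input prompts.
    """
    with ThreadPoolExecutor(max_workers=limit) as pool:
        return list(pool.map(task, prompts))


# Stand-in task for demonstration; replace it with a real call to your Ollama endpoint.
print(run_capped(str.upper, ["first prompt", "second prompt", "third prompt"]))
```

A thread pool is enough here because each worker spends its time waiting on the network, not computing.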

Set up log rotation

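When installed as a systemd service, Ollama writes its logs to the systemd journal (view them with journalctl -u ollama), and journald handles rotation itself. If you redirect Ollama's output to files under /var/log/ollama, configure rotation for them:

cat > /etc/logrotate.d/ollama << 'EOF'
/var/log/ollama/*.log {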
    daily
    rotate 7
    compress
    delaycompress
    missingok
    notifempty
}
EOF

Enable automatic restarts

Ollama’s default systemd unit includes Restart=always. Verify:

systemctl cat ollama | grep Restart

If it’s not set, add it via systemctl edit ollama.


Step 6: Use with your applications

OpenAI-compatible API

Ollama exposes an OpenAI-compatible API at /v1/. Applications using the OpenAI Python or JavaScript SDK can talk to Ollama with a base URL override:

from openai import OpenAI

client = OpenAI(
    base_url="https://ollama.yourdomain.com/v1",
    # Pass the bare token: the SDK adds the "Bearer " prefix itself
    api_key="YOUR_STRONG_TOKEN_HERE",
)

response = client.chat.completions.create(
    model="phi3:mini",
    messages=[{"role": "user", "content": "Explain Docker volumes briefly."}]
)
print(response.choices[0].message.content)
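If you would rather not install the SDK, the same endpoint can be called with the standard library. The sketch below builds the request an OpenAI-style client would send to /v1/chat/completions and extracts the reply; the function names are hypothetical, and the token is the one configured in nginx in Step 4.

```python
import json
import urllib.request


def build_chat_request(base_url: str, token: str, model: str, user_message: str) -> urllib.request.Request:
    """Build the POST for /v1/chat/completions without sending it."""
    payload = {"model": model, "messages": [{"role": "user", "content": user_message}]}
    return urllib.request.Request(
        f"{base_url.rstrip('/')}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            # The "Bearer " prefix is added here, once; the token itself stays bare.
            "Authorization": f"Bearer {token}",
        },
        method="POST",
    )


def extract_reply(response_body: bytes) -> str:
    """Pull the assistant text out of a chat-completions response body."""
    return json.loads(response_body)["choices"][0]["message"]["content"]


# Example (requires the server from this guide):
# req = build_chat_request("https://ollama.yourdomain.com", "YOUR_TOKEN", "phi3:mini", "Hello")
# with urllib.request.urlopen(req, timeout=300) as resp:
#     print(extract_reply(resp.read()))
```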

LangChain integration

from langchain_community.llms import Ollama

llm = Ollama(
    base_url="https://ollama.yourdomain.com",
    model="phi3:mini",
    headers={"Authorization": "Bearer YOUR_STRONG_TOKEN_HERE"}
)

response = llm.invoke("What is the difference between Railway and Fly.io?")

Claude Code agent fallback

In a Claude Code agent, configure Ollama as a fallback for tasks where Claude isn’t needed:

import os
import anthropic
from openai import OpenAI

# Use Claude for complex reasoning
claude = anthropic.Anthropic()

# Use local Ollama for simple classification/routing
local_llm = OpenAI(
    base_url=os.environ["OLLAMA_URL"] + "/v1",
    api_key=os.environ["OLLAMA_TOKEN"],  # bare token, without the "Bearer " prefix
)

def classify_intent(text: str) -> str:
    """Simple classification — Ollama is fast enough for this."""
    response = local_llm.chat.completions.create(
        model="phi3:mini",
        messages=[{"role": "user", "content": f"Classify as 'question', 'command', or 'other': {text}"}]
    )
    return response.choices[0].message.content.strip()
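The outage fallback mentioned at the start of this guide follows a similar pattern: try the primary provider, and route to the local model only if that call fails. Here is a minimal sketch with hypothetical names, where primary and fallback are any functions that take a prompt and return text.

```python
from typing import Callable


def with_fallback(primary: Callable[[str], str], fallback: Callable[[str], str]) -> Callable[[str], str]:
    """Return a function that tries `primary` and uses `fallback` if it raises."""
    def call(prompt: str) -> str:
        try:
            return primary(prompt)
        except Exception:  # broad on purpose for a sketch; narrow this in real code
            return fallback(prompt)
    return call


def flaky_primary(prompt: str) -> str:
    """Stand-in for a hosted provider that is currently down."""
    raise ConnectionError("primary provider unavailable")


def local_model(prompt: str) -> str:
    """Stand-in for a call to the self-hosted Ollama instance."""
    return f"[local] {prompt}"


ask = with_fallback(flaky_primary, local_model)
print(ask("Classify this ticket"))
```

In practice you would probably catch only network and rate-limit errors, so that genuine bugs in the primary path still surface.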

Cost comparison: self-hosted vs. API

Running phi3:mini on a Hetzner CX32 ($10.40/month):

| Scenario | Self-hosted (Hetzner) | Anthropic Claude Haiku | Notes |
|---|---|---|---|
| 100k tokens/month | ~$10.40 flat | ~$0.10 | API is far cheaper at low volume |
| 1M tokens/month | ~$10.40 flat | ~$1 | API is still cheaper |
| 10M tokens/month | ~$10.40 flat | ~$10 | Break-even zone; self-hosted is cheaper above this |

The real advantage of self-hosted is not cost for typical workloads — it’s data privacy and no per-token anxiety. If you need to process sensitive documents, experiment freely, or run high-volume batch jobs, the flat monthly rate makes sense.
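The break-even point is simple arithmetic, sketched below so you can plug in your own numbers. The $1 per million tokens figure is only the rate implied by the comparison above, not official pricing, and the function names are made up for illustration.

```python
def api_cost_usd(tokens_per_month: float, usd_per_million_tokens: float) -> float:
    """Monthly API cost for a given token volume."""
    return tokens_per_month / 1_000_000 * usd_per_million_tokens


def break_even_tokens(flat_usd_per_month: float, usd_per_million_tokens: float) -> float:
    """Token volume at which a flat-rate server costs the same as the API."""
    if usd_per_million_tokens <= 0:
        raise ValueError("usd_per_million_tokens must be positive")
    return flat_usd_per_month / usd_per_million_tokens * 1_000_000


FLAT_USD = 10.40         # Hetzner CX32 price quoted in this guide
ASSUMED_API_RATE = 1.0   # assumption: roughly $1 per million tokens

for volume in (100_000, 1_000_000, 10_000_000):
    print(f"{volume:>10,} tokens: API about ${api_cost_usd(volume, ASSUMED_API_RATE):.2f} vs ${FLAT_USD:.2f} flat")
print(f"Break-even at about {break_even_tokens(FLAT_USD, ASSUMED_API_RATE):,.0f} tokens/month")
```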


Common issues

OOM kills. If your server runs out of RAM mid-inference, reduce OLLAMA_MAX_LOADED_MODELS to 1 and consider a smaller quantization level (e.g., q4_K_S instead of q8_0).

Slow inference on CPU. Expected. A 7B model on a 4-core CPU generates ~2–5 tokens/second. For interactive use, prefer models under 4B. For batch use, the speed is fine.
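To judge whether that speed is acceptable for your workload, it helps to convert tokens per second into wall-clock time. The sketch below uses the 2–5 tokens/second range quoted above; the function names and the 500-token reply length are assumptions for illustration.

```python
def seconds_for(tokens: int, tokens_per_second: float) -> float:
    """Time to generate a given number of tokens at a steady rate."""
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    return tokens / tokens_per_second


def monthly_capacity(tokens_per_second: float, utilization: float = 1.0, days: int = 30) -> int:
    """Upper bound on tokens generated per month at the given utilization (0 to 1)."""
    return int(tokens_per_second * utilization * 86_400 * days)


for rate in (2.0, 5.0):
    print(f"{rate} tok/s: a 500-token reply takes about {seconds_for(500, rate):.0f} s, "
          f"capacity up to {monthly_capacity(rate):,} tokens/month")
```

At 2 tokens/second, a server that is busy around the clock tops out at roughly 5.2 million tokens per month, which is worth comparing against the volumes in the cost section.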

Port 11434 not accessible. Check that UFW or Hetzner’s firewall rules allow the port, or use the nginx reverse proxy approach.

Model download fails. Ensure the server has at least 2× the model size in free disk space during download (the download and the final model both take space temporarily). Hetzner CX32 includes 40 GB disk — sufficient.
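A quick pre-flight check can catch a full disk before a multi-gigabyte pull starts. This sketch applies the 2× guideline above using the standard library; the function names are hypothetical, and the default path "/" is an assumption, so point it at the filesystem that holds your Ollama models.

```python
import shutil


def required_free_gb(model_gb: float, factor: float = 2.0) -> float:
    """Free space to have on hand before pulling a model of the given size."""
    return model_gb * factor


def has_room(model_gb: float, path: str = "/") -> bool:
    """True if the filesystem containing `path` has enough free space for the pull."""
    free_gb = shutil.disk_usage(path).free / 1e9
    return free_gb >= required_free_gb(model_gb)


# 4.1 GB is the Mistral 7B download size quoted in Step 3.
print(f"Need about {required_free_gb(4.1):.1f} GB free; enough room: {has_room(4.1)}")
```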


Next steps

If you want GPU inference instead of CPU:

  • [Modal vs Replicate vs RunPod for GPU Inference](https://hostingpundit.com/modal-vs-replicate-vs-runpod/)
  • Hetzner dedicated GPU servers (GEX line), available in EU regions

If you want to use Ollama as an MCP server backend:

  • Build an MCP server that wraps the Ollama API
  • Deploy the MCP server to Railway or Fly.io
  • Connect from Claude Code or Claude Desktop

Prices verified May 2026. Check official documentation before provisioning.

