Self-Host Your AI: Ollama + Open WebUI 2026
ChatGPT and Claude are excellent — but every prompt you send is logged, used to improve their models, and processed on someone else's infrastructure. Ollama makes it trivial to run powerful open-weight LLMs locally. Open WebUI wraps it in a polished ChatGPT-like interface. Together, you get a fully private AI assistant that runs on your own hardware.
Ollama has over 165,000 GitHub stars. Open WebUI has over 60,000. Both are MIT-licensed and genuinely production-quality.
Quick Verdict
If you have 16GB+ RAM and want a private AI assistant, this setup takes under 10 minutes and delivers a ChatGPT-quality experience entirely offline. The best models for most users in 2026: Llama 3.3 70B for quality (16GB+ VRAM or 32GB RAM), Gemma 3 12B or Mistral Small 3 for balanced performance on consumer hardware.
What Is Ollama?
Ollama is a tool for running large language models locally. It handles:
- Model downloading and management — one command to pull any supported model
- Inference serving — runs a local API server at localhost:11434
- Hardware acceleration — automatic GPU detection for NVIDIA, AMD, and Apple Silicon
- Multi-model support — switch between models without restarting
The API is OpenAI-compatible, meaning any app built for ChatGPT can be pointed at Ollama instead.
What Is Open WebUI?
Open WebUI is a self-hosted web interface for Ollama (and OpenAI-compatible APIs). It adds:
- ChatGPT-style conversation UI with history
- Model switching from the sidebar
- Document upload for Q&A (RAG)
- Image generation integration
- Multi-user support with separate accounts
- Voice input/output
- Web search integration
- Prompt templates and system prompts
System Requirements
| Setup | Minimum | Recommended |
|---|---|---|
| 7B models | 8GB RAM | 16GB RAM |
| 13B models | 16GB RAM | 32GB RAM |
| 70B models | 32GB RAM | 64GB RAM or 16GB VRAM |
| GPU | Any modern NVIDIA/AMD | NVIDIA RTX 3080 or better |
| Apple Silicon | M1 (8GB) | M2/M3 (16GB+) |
Apple Silicon Macs use unified memory efficiently — an M2 MacBook Pro with 16GB handles 13B models well in real-time.
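These requirements follow from a back-of-the-envelope rule: a model needs roughly parameter-count × bytes-per-weight of memory, plus headroom for the KV cache and runtime buffers. A rough sketch (the 20% overhead factor here is an assumption for illustration, not an official Ollama figure):

```python
def model_memory_gb(params_billions: float, bits_per_weight: int = 4,
                    overhead: float = 1.2) -> float:
    """Rough memory estimate: parameters x bytes-per-weight, plus ~20%
    headroom for the KV cache and runtime buffers (ballpark only)."""
    weight_bytes = params_billions * 1e9 * (bits_per_weight / 8)
    return round(weight_bytes * overhead / 1e9, 1)

# A 7B model at 4-bit quantization (Ollama's default quant level):
print(model_memory_gb(7))    # → 4.2 (GB)
# A 70B model at 4-bit:
print(model_memory_gb(70))   # → 42.0 (GB)
```

This is why 7B models run comfortably in 8GB of RAM while 70B models push past 32GB even when quantized.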
Option A: Docker Compose (Recommended)
This is the production-grade setup: Ollama and Open WebUI as separate services with shared networking.
```yaml
# docker-compose.yml
version: "3.8"
services:
  ollama:
    image: ollama/ollama:latest
    container_name: ollama
    volumes:
      - ollama:/root/.ollama
    ports:
      - "11434:11434"
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
    restart: unless-stopped

  open-webui:
    image: ghcr.io/open-webui/open-webui:main
    container_name: open-webui
    volumes:
      - open-webui:/app/backend/data
    ports:
      - "3000:8080"
    environment:
      - OLLAMA_BASE_URL=http://ollama:11434
    depends_on:
      - ollama
    restart: unless-stopped

volumes:
  ollama:
  open-webui:
```
```bash
# Start
docker compose up -d

# Check status
docker compose ps
```
Open WebUI will be available at http://localhost:3000.
Remove the deploy.resources block if you don't have an NVIDIA GPU — CPU inference works fine for smaller models.
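Before opening the browser, you can confirm both services answer from the host. A stdlib-only sketch (ports assume the compose file above; `/api/version` is Ollama's version endpoint):

```python
import urllib.request

def check(url: str) -> str:
    """Return the response body, or an error string if unreachable."""
    try:
        with urllib.request.urlopen(url, timeout=5) as r:
            return r.read().decode()
    except OSError as e:
        return f"unreachable: {e}"

# Ollama's version endpoint confirms the API server is up
print("ollama:", check("http://localhost:11434/api/version"))
# Open WebUI serves its frontend on the mapped port
print("webui: ", check("http://localhost:3000")[:80])
```

If either line prints "unreachable", check `docker compose ps` and the port mappings before debugging anything inside the containers.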
Option B: All-in-One Docker (Fastest Start)
One command for the complete stack with GPU:
```bash
docker run -d \
  -p 3000:8080 \
  --gpus all \
  -v ollama:/root/.ollama \
  -v open-webui:/app/backend/data \
  --name open-webui \
  --restart always \
  ghcr.io/open-webui/open-webui:ollama
```
Without GPU:
```bash
docker run -d \
  -p 3000:8080 \
  -v ollama:/root/.ollama \
  -v open-webui:/app/backend/data \
  --name open-webui \
  --restart always \
  ghcr.io/open-webui/open-webui:ollama
```
Option C: Native Install (Apple Silicon)
For Mac users who want maximum performance without Docker overhead:
```bash
# Install Ollama (macOS: download the app from ollama.com, or use Homebrew;
# the install.sh script is for Linux only)
brew install ollama

# Ollama starts automatically as a background service

# Pull a model
ollama pull llama3.3

# Install Open WebUI separately (requires Python or Docker)
pip install open-webui
open-webui serve
```
Apple Silicon uses Metal GPU acceleration automatically — no configuration needed.
Downloading Models
Once Ollama is running, pull models with:
```bash
# Via Docker
docker exec ollama ollama pull <model-name>

# Or via native Ollama CLI
ollama pull <model-name>
```
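To see what is already pulled, Ollama exposes a `GET /api/tags` endpoint listing local models with their names and sizes. A small sketch (the formatting helper is illustrative, not part of Ollama):

```python
import json
import urllib.request

def format_models(payload: dict) -> list[str]:
    """Format the JSON from Ollama's GET /api/tags endpoint
    into "name: size" lines (size is reported in bytes)."""
    return [f'{m["name"]}: {m["size"] / 1e9:.1f} GB' for m in payload["models"]]

if __name__ == "__main__":
    # Assumes Ollama is listening on its default port
    with urllib.request.urlopen("http://localhost:11434/api/tags") as r:
        print("\n".join(format_models(json.load(r))))
```

The same endpoint is what Open WebUI uses to populate its model dropdown, so if a model shows up here it should appear in the UI too.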
Recommended Models by Hardware
| Hardware | Recommended Model | Why |
|---|---|---|
| 8GB RAM (CPU only) | gemma3:4b | Good quality, fast on CPU |
| 16GB RAM | mistral-small3:latest | Strong at 12B, reasonable speed |
| 16GB VRAM | llama3.3:70b-instruct-q4 | Near frontier quality |
| 32GB RAM | llama3.3:70b | Best local model for most tasks |
| M3 Ultra (192GB unified) | llama3.1:405b (2–3-bit quant) | Frontier-class; a q4 405B alone is ~230GB |
Key Models Available in 2026
```bash
ollama pull llama3.3        # Meta Llama 3.3 70B — best all-around
ollama pull mistral-small3  # Mistral Small 3 12B — fast, multilingual
ollama pull gemma3          # Google Gemma 3 27B
ollama pull gemma3:4b       # Google Gemma 3 4B — lightweight
ollama pull deepseek-r1     # DeepSeek R1 — strong reasoning
ollama pull codestral       # Mistral Codestral — code specialist
ollama pull phi4            # Microsoft Phi-4 14B — efficient
ollama pull qwen2.5-coder   # Alibaba Qwen2.5 Coder
```
Explore all models at ollama.com/library.
First Login and Setup
- Open http://localhost:3000
- Create an admin account (the first account becomes admin)
- Select a model from the dropdown at the top of the chat
- Start chatting
Configure a System Prompt
Settings → Models → Edit → System Prompt:
```
You are a helpful assistant. Be concise. When writing code, always include comments explaining what the code does.
```
Enable Web Search
Settings → Web → Enable Web Search → choose a search engine (DuckDuckGo works without an API key).
Upload Documents (RAG)
Click the paperclip icon in any chat to upload PDFs, text files, or web URLs. Open WebUI will use them as context for your questions — basic RAG without any additional configuration.
Multi-User Setup
Open WebUI supports multiple users with separate conversation histories:
- Admin → Settings → Users → Create new user
- Each user gets their own login and private chat history
- Admins can restrict which models each user can access
- Rate limiting available per user
This makes it viable to share your instance with a small team or family.
Performance Expectations
| Model | Hardware | Tokens/second |
|---|---|---|
| Llama 3.3 70B | RTX 4090 | ~60–80 t/s |
| Llama 3.3 70B | M3 Max (64GB) | ~30–40 t/s |
| Llama 3.3 70B | CPU (32GB RAM) | ~3–8 t/s |
| Mistral Small 3 12B | RTX 3060 12GB | ~80–100 t/s |
| Gemma 3 4B | CPU only (8GB) | ~15–25 t/s |
CPU inference is slow but usable for short prompts. Exact figures vary widely with quantization level and context length. For conversational use, aim for 20+ t/s, which generally requires a GPU or Apple Silicon for anything above 7B parameters.
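Rather than relying on benchmark tables, you can measure your own hardware: a non-streaming `/api/generate` response includes `eval_count` (tokens generated) and `eval_duration` (nanoseconds), from which tokens/second follows directly. A sketch assuming `llama3.3` is pulled:

```python
import json
import urllib.request

def tokens_per_second(resp: dict) -> float:
    """Ollama reports eval_count (tokens generated) and eval_duration
    (nanoseconds), so t/s = count / duration * 1e9."""
    return resp["eval_count"] / resp["eval_duration"] * 1e9

if __name__ == "__main__":
    req = urllib.request.Request(
        "http://localhost:11434/api/generate",
        data=json.dumps({"model": "llama3.3",
                         "prompt": "Count to ten.",
                         "stream": False}).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as r:
        print(f"{tokens_per_second(json.load(r)):.1f} t/s")
```

Run it a few times: the first call includes model load time, so later runs give a more honest generation-speed number.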
Connecting Other Apps to Your Local Ollama
Ollama's OpenAI-compatible API means any app that supports OpenAI can use your local models:
```python
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:11434/v1",
    api_key="ollama",  # required by the client, but ignored by Ollama
)

response = client.chat.completions.create(
    model="llama3.3",
    messages=[{"role": "user", "content": "Explain Docker volumes in one sentence."}],
)
print(response.choices[0].message.content)
```
Apps with native Ollama support: Continue (VS Code extension), Cursor (via API config), LibreChat, Obsidian Smart Connections, LM Studio (alternative frontend), AnythingLLM, Chatbox.
Keeping Everything Updated
```bash
# Update Open WebUI
docker pull ghcr.io/open-webui/open-webui:main
docker compose up -d

# Update Ollama (if using Docker)
docker pull ollama/ollama:latest
docker compose up -d

# Update a model to the latest version
docker exec ollama ollama pull llama3.3
```
Models don't update automatically — pull a new version when you want the latest weights.
Remote Access
For access outside your home network:
Option 1: Tailscale (easiest)
```bash
curl -fsSL https://tailscale.com/install.sh | sh
tailscale up
# Access Open WebUI at your Tailscale IP: http://100.x.x.x:3000
```
Option 2: Nginx reverse proxy with SSL
Point your domain at the server, use Let's Encrypt for SSL, proxy to localhost:3000. Standard Nginx Proxy Manager or Caddy configuration.
Option 3: Cloudflare Tunnel
Zero-config HTTPS without opening firewall ports. Free for personal use.
Vs. Paying for ChatGPT/Claude
| | Ollama + Open WebUI | ChatGPT Plus | Claude Pro |
|---|---|---|---|
| Cost | Hardware only | $20/mo | $20/mo |
| Privacy | 100% local | Logs prompts | Logs prompts |
| Model quality | Near-frontier | Frontier | Frontier |
| Context length | 128K+ (Llama 3.3) | 128K | 200K |
| Image input | Via LLaVA/Gemma 3 vision models | ✅ | ✅ |
| Internet access | Via plugin | ✅ | ✅ |
| Uptime | Your hardware | 99.9%+ | 99.9%+ |
The quality gap between local 70B models and GPT-4-class commercial models has narrowed dramatically in 2026. For most writing, coding, and Q&A tasks, Llama 3.3 70B is competitive. For frontier reasoning or cutting-edge tasks, commercial APIs still win.
Troubleshooting Common Issues
"Cannot connect to Ollama" in Open WebUI
Check that Ollama is running and accessible from the Open WebUI container:
```bash
# Test Ollama API from within Open WebUI container
docker exec open-webui curl http://ollama:11434/api/version

# If using host networking instead of Docker network
docker exec open-webui curl http://host.docker.internal:11434/api/version
```
If you get a connection error, verify the OLLAMA_BASE_URL environment variable matches your actual Ollama address.
Models Download Slowly or Stall
Ollama downloads model files in chunks. Large models (70B = ~40GB) take time on slow connections. Monitor progress:
```bash
docker exec ollama ollama pull llama3.3
# Shows download progress with transfer speed
```
If a download stalls, re-run the same command — Ollama resumes from where it left off.
GPU Not Detected
Verify NVIDIA container toolkit is installed:
```bash
docker run --rm --gpus all nvidia/cuda:12.2.0-base-ubuntu22.04 nvidia-smi
```
If this fails, install the NVIDIA Container Toolkit:
```bash
distribution=$(. /etc/os-release; echo $ID$VERSION_ID)
curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | \
  sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg
curl -s -L https://nvidia.github.io/libnvidia-container/$distribution/libnvidia-container.list | \
  sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
  sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list
sudo apt update && sudo apt install -y nvidia-container-toolkit
sudo systemctl restart docker
```
Use Cases Where Local AI Excels
Code review with sensitive codebases. Send proprietary code to a local model without any risk of it appearing in training data or being logged by a third party.
Document Q&A on confidential files. Upload legal documents, financial reports, or internal specs to Open WebUI's RAG feature. Nothing leaves your server.
Always-on assistant without rate limits. Commercial APIs have rate limits and occasional outages. Your local Ollama instance is always available, no subscription required, no context limit throttling.
Air-gapped environments. Factories, government, healthcare, finance — environments where internet access is restricted or regulated. Ollama works entirely offline after the initial model download.
Experimentation without billing anxiety. Run thousands of inference requests, test different models, build automation — all at zero marginal cost.
Building Automations with Ollama's API
The API is OpenAI-compatible, making it easy to build local automation:
# Test from command line
curl http://localhost:11434/api/generate \
-d '{
"model": "llama3.3",
"prompt": "Summarize this in one sentence: [your text]",
"stream": false
}'
```python
# Simple summarization script
import requests

def summarize(text: str, model: str = "llama3.3") -> str:
    response = requests.post(
        "http://localhost:11434/api/generate",
        json={
            "model": model,
            "prompt": f"Summarize in 2 sentences: {text}",
            "stream": False,
        },
    )
    return response.json()["response"]
```
Use this pattern to build local AI pipelines: summarize emails, classify support tickets, extract structured data from documents, generate commit messages — without any API costs or external dependencies.
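As a sketch of the ticket-classification idea, constraining the model to a fixed label set and normalizing its reply keeps a pipeline robust against rambling answers (the labels and helper functions here are illustrative, not from any library):

```python
import json
import urllib.request

LABELS = ["bug", "billing", "feature-request", "other"]

def classify_prompt(ticket: str) -> str:
    """Build a constrained classification prompt; asking for exactly
    one label makes the model's reply easy to parse."""
    return (f"Classify this support ticket as one of: {', '.join(LABELS)}. "
            f"Reply with the label only.\n\nTicket: {ticket}")

def parse_label(raw: str) -> str:
    """Normalize the model's reply; fall back to 'other' if it
    returns anything outside the allowed label set."""
    cleaned = raw.strip().lower().rstrip(".")
    return cleaned if cleaned in LABELS else "other"

if __name__ == "__main__":
    # Assumes a local Ollama with llama3.3 pulled
    req = urllib.request.Request(
        "http://localhost:11434/api/generate",
        data=json.dumps({"model": "llama3.3",
                         "prompt": classify_prompt("I was charged twice this month."),
                         "stream": False}).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as r:
        print(parse_label(json.load(r)["response"]))
```

The fallback to "other" matters in practice: smaller local models occasionally add explanations despite the "label only" instruction.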
See all self-hosted AI tools at OSSAlt.
Related: 10 Open-Source Tools to Replace SaaS in 2026 · Coolify vs Vercel: Cost Comparison 2026