Self-Host Your AI: Ollama + Open WebUI 2026
ChatGPT and Claude are excellent — but every prompt you send is logged, used to improve their models, and processed on someone else's infrastructure. Ollama makes it trivial to run powerful open-weight LLMs locally. Open WebUI wraps it in a polished ChatGPT-like interface. Together, you get a fully private AI assistant that runs on your own hardware.
Ollama has over 165,000 GitHub stars. Open WebUI has over 60,000. Both are MIT-licensed and genuinely production-quality.
Quick Verdict
If you have 16GB+ RAM and want a private AI assistant, this setup takes under 10 minutes and delivers a ChatGPT-quality experience entirely offline. The best models for most users in 2026: Llama 3.3 70B for quality (16GB+ VRAM or 32GB RAM), Gemma 3 12B or Mistral Small 3 for balanced performance on consumer hardware.
What Is Ollama?
Ollama is a tool for running large language models locally. It handles:
- Model downloading and management — one command to pull any supported model
- Inference serving — runs a local API server at localhost:11434
- Hardware acceleration — automatic GPU detection for NVIDIA, AMD, and Apple Silicon
- Multi-model support — switch between models without restarting
The API is OpenAI-compatible, meaning any app built for ChatGPT can be pointed at Ollama instead.
What Is Open WebUI?
Open WebUI is a self-hosted web interface for Ollama (and OpenAI-compatible APIs). It adds:
- ChatGPT-style conversation UI with history
- Model switching from the sidebar
- Document upload for Q&A (RAG)
- Image generation integration
- Multi-user support with separate accounts
- Voice input/output
- Web search integration
- Prompt templates and system prompts
System Requirements
| Setup | Minimum | Recommended |
|---|---|---|
| 7B models | 8GB RAM | 16GB RAM |
| 13B models | 16GB RAM | 32GB RAM |
| 70B models | 32GB RAM | 64GB RAM or 16GB VRAM |
| GPU | Any modern NVIDIA/AMD | NVIDIA RTX 3080 or better |
| Apple Silicon | M1 (8GB) | M2/M3 (16GB+) |
Apple Silicon Macs use unified memory efficiently — an M2 MacBook Pro with 16GB handles 13B models well in real-time.
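These requirements follow from a back-of-the-envelope rule: a model needs roughly parameter-count × bytes-per-weight of memory, plus headroom for the KV cache and runtime buffers. A rough sketch (the 20% overhead factor here is an assumption for illustration, not an official Ollama figure):

```python
def model_memory_gb(params_billions: float, bits_per_weight: int = 4,
                    overhead: float = 1.2) -> float:
    """Rough memory estimate: parameters x bytes-per-weight, plus ~20%
    headroom for the KV cache and runtime buffers (ballpark only)."""
    weight_bytes = params_billions * 1e9 * (bits_per_weight / 8)
    return round(weight_bytes * overhead / 1e9, 1)

# A 7B model at 4-bit quantization (Ollama's default quant level):
print(model_memory_gb(7))    # → 4.2 (GB)
# A 70B model at 4-bit:
print(model_memory_gb(70))   # → 42.0 (GB)
```

This is why 7B models run comfortably in 8GB of RAM while 70B models push past 32GB even when quantized.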
Option A: Docker Compose (Recommended)
This is the production-grade setup: Ollama and Open WebUI as separate services with shared networking.
```yaml
# docker-compose.yml
version: "3.8"
services:
  ollama:
    image: ollama/ollama:latest
    container_name: ollama
    volumes:
      - ollama:/root/.ollama
    ports:
      - "11434:11434"
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
    restart: unless-stopped

  open-webui:
    image: ghcr.io/open-webui/open-webui:main
    container_name: open-webui
    volumes:
      - open-webui:/app/backend/data
    ports:
      - "3000:8080"
    environment:
      - OLLAMA_BASE_URL=http://ollama:11434
    depends_on:
      - ollama
    restart: unless-stopped

volumes:
  ollama:
  open-webui:
```
```bash
# Start
docker compose up -d

# Check status
docker compose ps
```
Open WebUI will be available at http://localhost:3000.
Remove the deploy.resources block if you don't have an NVIDIA GPU — CPU inference works fine for smaller models.
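Before opening the browser, you can confirm both services answer from the host. A stdlib-only sketch (ports assume the compose file above; `/api/version` is Ollama's version endpoint):

```python
import urllib.request

def check(url: str) -> str:
    """Return the response body, or an error string if unreachable."""
    try:
        with urllib.request.urlopen(url, timeout=5) as r:
            return r.read().decode()
    except OSError as e:
        return f"unreachable: {e}"

# Ollama's version endpoint confirms the API server is up
print("ollama:", check("http://localhost:11434/api/version"))
# Open WebUI serves its frontend on the mapped port
print("webui: ", check("http://localhost:3000")[:80])
```

If either line prints "unreachable", check `docker compose ps` and the port mappings before debugging anything inside the containers.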
Option B: All-in-One Docker (Fastest Start)
One command for the complete stack with GPU:
```bash
docker run -d \
  -p 3000:8080 \
  --gpus all \
  -v ollama:/root/.ollama \
  -v open-webui:/app/backend/data \
  --name open-webui \
  --restart always \
  ghcr.io/open-webui/open-webui:ollama
```
Without GPU:
```bash
docker run -d \
  -p 3000:8080 \
  -v ollama:/root/.ollama \
  -v open-webui:/app/backend/data \
  --name open-webui \
  --restart always \
  ghcr.io/open-webui/open-webui:ollama
```
Option C: Native Install (Apple Silicon)
For Mac users who want maximum performance without Docker overhead:
```bash
# Install Ollama (macOS: download the app from ollama.com, or use Homebrew;
# the install.sh script is for Linux only)
brew install ollama

# Ollama starts automatically as a background service

# Pull a model
ollama pull llama3.3

# Install Open WebUI separately (requires Python or Docker)
pip install open-webui
open-webui serve
```
Apple Silicon uses Metal GPU acceleration automatically — no configuration needed.
Downloading Models
Once Ollama is running, pull models with:
```bash
# Via Docker
docker exec ollama ollama pull <model-name>

# Or via native Ollama CLI
ollama pull <model-name>
```
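To see what is already pulled, Ollama exposes a `GET /api/tags` endpoint listing local models with their names and sizes. A small sketch (the formatting helper is illustrative, not part of Ollama):

```python
import json
import urllib.request

def format_models(payload: dict) -> list[str]:
    """Format the JSON from Ollama's GET /api/tags endpoint
    into "name: size" lines (size is reported in bytes)."""
    return [f'{m["name"]}: {m["size"] / 1e9:.1f} GB' for m in payload["models"]]

if __name__ == "__main__":
    # Assumes Ollama is listening on its default port
    with urllib.request.urlopen("http://localhost:11434/api/tags") as r:
        print("\n".join(format_models(json.load(r))))
```

The same endpoint is what Open WebUI uses to populate its model dropdown, so if a model shows up here it should appear in the UI too.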
Recommended Models by Hardware
| Hardware | Recommended Model | Why |
|---|---|---|
| 8GB RAM (CPU only) | gemma3:4b | Good quality, fast on CPU |
| 16GB RAM | mistral-small3:latest | Strong at 12B, reasonable speed |
| 16GB VRAM | llama3.3:70b-instruct-q4 | Near frontier quality |
| 32GB RAM | llama3.3:70b | Best local model for most tasks |
| M3 Ultra (192GB unified) | llama3.1:405b (2–3-bit quant) | Frontier-class; a q4 405B alone is ~230GB |
Key Models Available in 2026
```bash
ollama pull llama3.3        # Meta Llama 3.3 70B — best all-around
ollama pull mistral-small3  # Mistral Small 3 12B — fast, multilingual
ollama pull gemma3          # Google Gemma 3 27B
ollama pull gemma3:4b       # Google Gemma 3 4B — lightweight
ollama pull deepseek-r1     # DeepSeek R1 — strong reasoning
ollama pull codestral       # Mistral Codestral — code specialist
ollama pull phi4            # Microsoft Phi-4 14B — efficient
ollama pull qwen2.5-coder   # Alibaba Qwen2.5 Coder
```
Explore all models at ollama.com/library.
First Login and Setup
- Open http://localhost:3000
- Create an admin account (the first account becomes admin)
- Select a model from the dropdown at the top of the chat
- Start chatting
Configure a System Prompt
Settings → Models → Edit → System Prompt:
```
You are a helpful assistant. Be concise. When writing code, always include comments explaining what the code does.
```
Enable Web Search
Settings → Web → Enable Web Search → choose a search engine (DuckDuckGo works without an API key).
Upload Documents (RAG)
Click the paperclip icon in any chat to upload PDFs, text files, or web URLs. Open WebUI will use them as context for your questions — basic RAG without any additional configuration.
Multi-User Setup
Open WebUI supports multiple users with separate conversation histories:
- Admin → Settings → Users → Create new user
- Each user gets their own login and private chat history
- Admins can restrict which models each user can access
- Rate limiting available per user
This makes it viable to share your instance with a small team or family.
Performance Expectations
| Model | Hardware | Tokens/second |
|---|---|---|
| Llama 3.3 70B | RTX 4090 | ~60–80 t/s |
| Llama 3.3 70B | M3 Max (64GB) | ~30–40 t/s |
| Llama 3.3 70B | CPU (32GB RAM) | ~3–8 t/s |
| Mistral Small 3 12B | RTX 3060 12GB | ~80–100 t/s |
| Gemma 3 4B | CPU only (8GB) | ~15–25 t/s |
CPU inference is slow but usable for short prompts. Exact figures vary widely with quantization level and context length. For conversational use, aim for 20+ t/s, which generally requires a GPU or Apple Silicon for anything above 7B parameters.
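Rather than relying on benchmark tables, you can measure your own hardware: a non-streaming `/api/generate` response includes `eval_count` (tokens generated) and `eval_duration` (nanoseconds), from which tokens/second follows directly. A sketch assuming `llama3.3` is pulled:

```python
import json
import urllib.request

def tokens_per_second(resp: dict) -> float:
    """Ollama reports eval_count (tokens generated) and eval_duration
    (nanoseconds), so t/s = count / duration * 1e9."""
    return resp["eval_count"] / resp["eval_duration"] * 1e9

if __name__ == "__main__":
    req = urllib.request.Request(
        "http://localhost:11434/api/generate",
        data=json.dumps({"model": "llama3.3",
                         "prompt": "Count to ten.",
                         "stream": False}).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as r:
        print(f"{tokens_per_second(json.load(r)):.1f} t/s")
```

Run it a few times: the first call includes model load time, so later runs give a more honest generation-speed number.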
Connecting Other Apps to Your Local Ollama
Ollama's OpenAI-compatible API means any app that supports OpenAI can use your local models:
```python
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:11434/v1",
    api_key="ollama",  # required by the client, but ignored by Ollama
)

response = client.chat.completions.create(
    model="llama3.3",
    messages=[{"role": "user", "content": "Explain Docker volumes in one sentence."}],
)
print(response.choices[0].message.content)
```
Apps with native Ollama support: Continue (VS Code extension), Cursor (via API config), LibreChat, Obsidian Smart Connections, LM Studio (alternative frontend), AnythingLLM, Chatbox.
Keeping Everything Updated
```bash
# Update Open WebUI
docker pull ghcr.io/open-webui/open-webui:main
docker compose up -d

# Update Ollama (if using Docker)
docker pull ollama/ollama:latest
docker compose up -d

# Update a model to the latest version
docker exec ollama ollama pull llama3.3
```
Models don't update automatically — pull a new version when you want the latest weights.
Remote Access
For access outside your home network:
Option 1: Tailscale (easiest)
```bash
curl -fsSL https://tailscale.com/install.sh | sh
tailscale up
# Access Open WebUI at your Tailscale IP: http://100.x.x.x:3000
```
Option 2: Nginx reverse proxy with SSL
Point your domain at the server, use Let's Encrypt for SSL, proxy to localhost:3000. Standard Nginx Proxy Manager or Caddy configuration.
Option 3: Cloudflare Tunnel
Zero-config HTTPS without opening firewall ports. Free for personal use.
Vs. Paying for ChatGPT/Claude
| | Ollama + Open WebUI | ChatGPT Plus | Claude Pro |
|---|---|---|---|
| Cost | Hardware only | $20/mo | $20/mo |
| Privacy | 100% local | Logs prompts | Logs prompts |
| Model quality | Near-frontier | Frontier | Frontier |
| Context length | 128K+ (Llama 3.3) | 128K | 200K |
| Image input | Via LLaVA/Gemma 3 vision models | ✅ | ✅ |
| Internet access | Via plugin | ✅ | ✅ |
| Uptime | Your hardware | 99.9%+ | 99.9%+ |
The quality gap between local 70B models and GPT-4-class commercial models has narrowed dramatically in 2026. For most writing, coding, and Q&A tasks, Llama 3.3 70B is competitive. For frontier reasoning or cutting-edge tasks, commercial APIs still win.
Troubleshooting Common Issues
"Cannot connect to Ollama" in Open WebUI
Check that Ollama is running and accessible from the Open WebUI container:
```bash
# Test Ollama API from within Open WebUI container
docker exec open-webui curl http://ollama:11434/api/version

# If using host networking instead of Docker network
docker exec open-webui curl http://host.docker.internal:11434/api/version
```
If you get a connection error, verify the OLLAMA_BASE_URL environment variable matches your actual Ollama address.
Models Download Slowly or Stall
Ollama downloads model files in chunks. Large models (70B = ~40GB) take time on slow connections. Monitor progress:
```bash
docker exec ollama ollama pull llama3.3
# Shows download progress with transfer speed
```
If a download stalls, re-run the same command — Ollama resumes from where it left off.
GPU Not Detected
Verify NVIDIA container toolkit is installed:
```bash
docker run --rm --gpus all nvidia/cuda:12.2.0-base-ubuntu22.04 nvidia-smi
```
If this fails, install the NVIDIA Container Toolkit:
```bash
distribution=$(. /etc/os-release; echo $ID$VERSION_ID)
curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | \
  sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg
curl -s -L https://nvidia.github.io/libnvidia-container/$distribution/libnvidia-container.list | \
  sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
  sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list
sudo apt update && sudo apt install -y nvidia-container-toolkit
sudo systemctl restart docker
```
Use Cases Where Local AI Excels
Code review with sensitive codebases. Send proprietary code to a local model without any risk of it appearing in training data or being logged by a third party.
Document Q&A on confidential files. Upload legal documents, financial reports, or internal specs to Open WebUI's RAG feature. Nothing leaves your server.
Always-on assistant without rate limits. Commercial APIs have rate limits and occasional outages. Your local Ollama instance is always available, no subscription required, no context limit throttling.
Air-gapped environments. Factories, government, healthcare, finance — environments where internet access is restricted or regulated. Ollama works entirely offline after the initial model download.
Experimentation without billing anxiety. Run thousands of inference requests, test different models, build automation — all at zero marginal cost.
Building Automations with Ollama's API
The API is OpenAI-compatible, making it easy to build local automation:
# Test from command line
curl http://localhost:11434/api/generate \
-d '{
"model": "llama3.3",
"prompt": "Summarize this in one sentence: [your text]",
"stream": false
}'
```python
# Simple summarization script
import requests

def summarize(text: str, model: str = "llama3.3") -> str:
    response = requests.post(
        "http://localhost:11434/api/generate",
        json={
            "model": model,
            "prompt": f"Summarize in 2 sentences: {text}",
            "stream": False,
        },
    )
    return response.json()["response"]
```
Use this pattern to build local AI pipelines: summarize emails, classify support tickets, extract structured data from documents, generate commit messages — without any API costs or external dependencies.
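As a sketch of the ticket-classification idea, constraining the model to a fixed label set and normalizing its reply keeps a pipeline robust against rambling answers (the labels and helper functions here are illustrative, not from any library):

```python
import json
import urllib.request

LABELS = ["bug", "billing", "feature-request", "other"]

def classify_prompt(ticket: str) -> str:
    """Build a constrained classification prompt; asking for exactly
    one label makes the model's reply easy to parse."""
    return (f"Classify this support ticket as one of: {', '.join(LABELS)}. "
            f"Reply with the label only.\n\nTicket: {ticket}")

def parse_label(raw: str) -> str:
    """Normalize the model's reply; fall back to 'other' if it
    returns anything outside the allowed label set."""
    cleaned = raw.strip().lower().rstrip(".")
    return cleaned if cleaned in LABELS else "other"

if __name__ == "__main__":
    # Assumes a local Ollama with llama3.3 pulled
    req = urllib.request.Request(
        "http://localhost:11434/api/generate",
        data=json.dumps({"model": "llama3.3",
                         "prompt": classify_prompt("I was charged twice this month."),
                         "stream": False}).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as r:
        print(parse_label(json.load(r)["response"]))
```

The fallback to "other" matters in practice: smaller local models occasionally add explanations despite the "label only" instruction.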
See all self-hosted AI tools at OSSAlt.
Related: 10 Open-Source Tools to Replace SaaS in 2026 · Coolify vs Vercel: Cost Comparison 2026