Open-source alternatives guide
Self-Host Your AI: Ollama + Open WebUI 2026
Run Llama 3.3, Mistral, Gemma, and DeepSeek locally with Ollama and Open WebUI. Full Docker setup, model guide, and GPU acceleration in under 10 minutes.
ChatGPT and Claude are excellent — but every prompt you send is logged, used to improve their models, and processed on someone else's infrastructure. Ollama makes it trivial to run powerful open-weight LLMs locally. Open WebUI wraps it in a polished ChatGPT-like interface. Together, you get a fully private AI assistant that runs on your own hardware.
Ollama has over 165,000 GitHub stars. Open WebUI has over 60,000. Both are MIT-licensed and genuinely production-quality.
Quick Verdict
If you have 16GB+ RAM and want a private AI assistant, this setup takes under 10 minutes and delivers a ChatGPT-quality experience entirely offline. The best models for most users in 2026: Llama 3.3 70B for quality (its q4 weights alone need roughly 40GB of RAM or VRAM), and Gemma 3 12B or Mistral Small 3 for balanced performance on consumer hardware.
What Is Ollama?
Ollama is a tool for running large language models locally. It handles:
- Model downloading and management — one command to pull any supported model
- Inference serving — runs a local API server at localhost:11434
- Hardware acceleration — automatic GPU detection for NVIDIA, AMD, and Apple Silicon
- Multi-model support — switch between models without restarting
The API is OpenAI-compatible, meaning any app built for ChatGPT can be pointed at Ollama instead.
What Is Open WebUI?
Open WebUI is a self-hosted web interface for Ollama (and OpenAI-compatible APIs). It adds:
- ChatGPT-style conversation UI with history
- Model switching from the sidebar
- Document upload for Q&A (RAG)
- Image generation integration
- Multi-user support with separate accounts
- Voice input/output
- Web search integration
- Prompt templates and system prompts
System Requirements
| Setup | Minimum | Recommended |
|---|---|---|
| 7B models | 8GB RAM | 16GB RAM |
| 13B models | 16GB RAM | 32GB RAM |
| 70B models (q4) | 48GB RAM | 64GB RAM, or 24GB+ VRAM with partial CPU offload |
| GPU | Any modern NVIDIA/AMD | NVIDIA RTX 3080+ or better |
| Apple Silicon | M1 (8GB) | M2/M3 (16GB+) |
Apple Silicon Macs use unified memory efficiently — an M2 MacBook Pro with 16GB handles 13B models well in real-time.
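A quick way to sanity-check the table above: quantized weights take roughly (parameters × bits per weight) / 8 bytes, plus runtime overhead for the KV cache and buffers. The sketch below uses a 20% overhead factor, which is our own rough assumption, not an official Ollama figure:

```python
def estimate_model_gb(params_billion: float, bits_per_weight: int = 4) -> float:
    """Rough memory footprint of a quantized model: weights plus ~20%
    for KV cache and runtime buffers. A heuristic, not an exact figure."""
    weights_gb = params_billion * bits_per_weight / 8  # GB per billion params
    return round(weights_gb * 1.2, 1)

print(estimate_model_gb(7))    # 7B at q4  -> 4.2 GB
print(estimate_model_gb(70))   # 70B at q4 -> 42.0 GB
```

This is why 7B models run comfortably on an 8GB machine while 70B models need workstation-class memory even at aggressive quantization.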
Option A: Docker Compose (Recommended)
This is the production-grade setup: Ollama and Open WebUI as separate services with shared networking.
```yaml
# docker-compose.yml
version: "3.8"

services:
  ollama:
    image: ollama/ollama:latest
    container_name: ollama
    volumes:
      - ollama:/root/.ollama
    ports:
      - "11434:11434"
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
    restart: unless-stopped

  open-webui:
    image: ghcr.io/open-webui/open-webui:main
    container_name: open-webui
    volumes:
      - open-webui:/app/backend/data
    ports:
      - "3000:8080"
    environment:
      - OLLAMA_BASE_URL=http://ollama:11434
    depends_on:
      - ollama
    restart: unless-stopped

volumes:
  ollama:
  open-webui:
```
```bash
# Start
docker compose up -d

# Check status
docker compose ps
```
Open WebUI will be available at http://localhost:3000.
Remove the deploy.resources block if you don't have an NVIDIA GPU — CPU inference works fine for smaller models.
Option B: All-in-One Docker (Fastest Start)
One command for the complete stack with GPU:
```bash
docker run -d \
  -p 3000:8080 \
  --gpus all \
  -v ollama:/root/.ollama \
  -v open-webui:/app/backend/data \
  --name open-webui \
  --restart always \
  ghcr.io/open-webui/open-webui:ollama
```
Without GPU:
```bash
docker run -d \
  -p 3000:8080 \
  -v ollama:/root/.ollama \
  -v open-webui:/app/backend/data \
  --name open-webui \
  --restart always \
  ghcr.io/open-webui/open-webui:ollama
```
Option C: Native Install (Apple Silicon)
For Mac users who want maximum performance without Docker overhead:
```bash
# Install Ollama (the curl installer is Linux-only; on macOS use Homebrew
# or download the app from ollama.com/download)
brew install ollama

# The desktop app runs Ollama automatically; with Homebrew, start it yourself
ollama serve &

# Pull a model
ollama pull llama3.3

# Install Open WebUI separately (requires Python 3.11 or Docker)
pip install open-webui
open-webui serve
```
Apple Silicon uses Metal GPU acceleration automatically — no configuration needed.
Downloading Models
Once Ollama is running, pull models with:
```bash
# Via Docker
docker exec ollama ollama pull <model-name>

# Or via the native Ollama CLI
ollama pull <model-name>
```
Recommended Models by Hardware
| Hardware | Recommended Model | Why |
|---|---|---|
| 8GB RAM (CPU only) | gemma3:4b | Good quality, fast on CPU |
| 16GB RAM | mistral-small3:latest | Strong at 12B, reasonable speed |
| 16GB VRAM | gemma3:27b (q4) | Fits fully in VRAM; 70B models need partial CPU offload |
| 48GB+ RAM | llama3.3:70b (q4) | Best local model for most tasks; q4 weights alone are ~40GB |
| 128GB+ unified memory (M3 Ultra) | llama3.1:405b (heavily quantized) | Frontier-class, but even aggressive quants exceed 100GB |
Key Models Available in 2026
```bash
ollama pull llama3.3        # Meta Llama 3.3 70B — best all-around
ollama pull mistral-small3  # Mistral Small 3 12B — fast, multilingual
ollama pull gemma3          # Google Gemma 3 27B
ollama pull gemma3:4b       # Google Gemma 3 4B — lightweight
ollama pull deepseek-r1     # DeepSeek R1 — strong reasoning
ollama pull codestral       # Mistral Codestral — code specialist
ollama pull phi4            # Microsoft Phi-4 14B — efficient
ollama pull qwen2.5-coder   # Alibaba Qwen2.5 Coder
```
Explore all models at ollama.com/library.
First Login and Setup
- Open http://localhost:3000
- Create an admin account (the first account becomes admin)
- Select a model from the dropdown at the top of the chat
- Start chatting
Configure a System Prompt
Settings → Models → Edit → System Prompt:
You are a helpful assistant. Be concise. When writing code, always include comments explaining what the code does.
Enable Web Search
Settings → Web → Enable Web Search → choose a search engine (DuckDuckGo works without an API key).
Upload Documents (RAG)
Click the paperclip icon in any chat to upload PDFs, text files, or web URLs. Open WebUI will use them as context for your questions — basic RAG without any additional configuration.
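Under the hood, RAG follows a simple pattern: split documents into chunks, retrieve the chunks most relevant to the question, and prepend them to the prompt. Open WebUI does the retrieval step with embeddings and a vector store; the sketch below substitutes naive keyword overlap purely to illustrate the flow, and every name in it is illustrative:

```python
def chunk(text: str, size: int = 200) -> list[str]:
    """Split a document into fixed-size word chunks (real systems
    use token-aware splitting)."""
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]

def retrieve(chunks: list[str], question: str, k: int = 2) -> list[str]:
    """Rank chunks by keyword overlap with the question; Open WebUI
    ranks by embedding similarity instead."""
    q = set(question.lower().split())
    return sorted(chunks, key=lambda c: len(q & set(c.lower().split())), reverse=True)[:k]

doc = "Ollama serves models on port 11434. Open WebUI connects to it over Docker networking."
context = retrieve(chunk(doc, size=8), "What port does Ollama use?")
prompt = "Context:\n" + "\n".join(context) + "\n\nQuestion: What port does Ollama use?"
print(context[0])  # the chunk mentioning port 11434 ranks first
```

The model then answers from the supplied context rather than its training data, which is why uploaded documents stay on your server end to end.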
Multi-User Setup
Open WebUI supports multiple users with separate conversation histories:
- Admin → Settings → Users → Create new user
- Each user gets their own login and private chat history
- Admins can restrict which models each user can access
- Rate limiting available per user
This makes it viable to share your instance with a small team or family.
Performance Expectations
| Model | Hardware | Tokens/second (approx., q4 quantization) |
|---|---|---|
| Llama 3.3 70B | RTX 4090 | ~60–80 t/s |
| Llama 3.3 70B | M3 Max (64GB) | ~30–40 t/s |
| Llama 3.3 70B | CPU (32GB RAM) | ~3–8 t/s |
| Mistral Small 3 12B | RTX 3060 12GB | ~80–100 t/s |
| Gemma 3 4B | CPU only (8GB) | ~15–25 t/s |
CPU inference is slow but usable for short prompts. For conversational use, aim for 20+ t/s — which requires GPU or Apple Silicon for anything above 7B parameters.
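To translate tokens per second into felt latency, divide the expected answer length by throughput. A minimal helper, where the 400-token answer length is just our assumption for a typical paragraph-long reply:

```python
def response_seconds(answer_tokens: int, tokens_per_sec: float) -> float:
    """Generation time only; prompt processing adds more,
    especially for long contexts."""
    return round(answer_tokens / tokens_per_sec, 1)

print(response_seconds(400, 60))  # GPU-class throughput: 6.7 s
print(response_seconds(400, 5))   # CPU-class throughput: 80.0 s
```

At 5 t/s a single answer takes over a minute, which is why CPU-only setups are best reserved for short prompts and small models.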
Connecting Other Apps to Your Local Ollama
Ollama's OpenAI-compatible API means any app that supports OpenAI can use your local models:
```python
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:11434/v1",
    api_key="ollama",  # required by the client, ignored by Ollama
)

response = client.chat.completions.create(
    model="llama3.3",
    messages=[{"role": "user", "content": "Explain Docker volumes in one sentence."}],
)
print(response.choices[0].message.content)
```
Apps with native Ollama support: Continue (VS Code extension), Cursor (via API config), LibreChat, Obsidian Smart Connections, LM Studio (alternative frontend), AnythingLLM, Chatbox.
Keeping Everything Updated
```bash
# Update Open WebUI
docker pull ghcr.io/open-webui/open-webui:main
docker compose up -d

# Update Ollama (if using Docker)
docker pull ollama/ollama:latest
docker compose up -d

# Update a model to the latest version
docker exec ollama ollama pull llama3.3
```
Models don't update automatically — pull a new version when you want the latest weights.
Remote Access
For access outside your home network:
Option 1: Tailscale (easiest)
```bash
curl -fsSL https://tailscale.com/install.sh | sh
tailscale up

# Access Open WebUI at your Tailscale IP: http://100.x.x.x:3000
```
Option 2: Nginx reverse proxy with SSL
Point your domain at the server, use Let's Encrypt for SSL, proxy to localhost:3000. Standard Nginx Proxy Manager or Caddy configuration.
Option 3: Cloudflare Tunnel
Zero-config HTTPS without opening firewall ports. Free for personal use.
Vs. Paying for ChatGPT/Claude
| | Ollama + Open WebUI | ChatGPT Plus | Claude Pro |
|---|---|---|---|
| Cost | Hardware only | $20/mo | $20/mo |
| Privacy | 100% local | Logs prompts | Logs prompts |
| Model quality | Near-frontier | Frontier | Frontier |
| Context length | 128K+ (Llama 3.3) | 128K | 200K |
| Image input | LLaVA / Gemma 3 | ✅ | ✅ |
| Internet access | Via plugin | ✅ | ✅ |
| Uptime | Your hardware | 99.9%+ | 99.9%+ |
The quality gap between local 70B models and frontier commercial models has narrowed dramatically by 2026. For most writing, coding, and Q&A tasks, Llama 3.3 70B is competitive. For frontier reasoning or cutting-edge multimodal tasks, commercial APIs still win.
Troubleshooting Common Issues
"Cannot connect to Ollama" in Open WebUI
Check that Ollama is running and accessible from the Open WebUI container:
```bash
# Test the Ollama API from within the Open WebUI container
docker exec open-webui curl http://ollama:11434/api/version

# If using host networking instead of a Docker network
docker exec open-webui curl http://host.docker.internal:11434/api/version
```
If you get a connection error, verify the OLLAMA_BASE_URL environment variable matches your actual Ollama address.
Models Download Slowly or Stall
Ollama downloads model files in chunks. Large models (70B = ~40GB) take time on slow connections. Monitor progress:
```bash
docker exec ollama ollama pull llama3.3
# Shows download progress with transfer speed
```
If a download stalls, re-run the same command — Ollama resumes from where it left off.
GPU Not Detected
Verify NVIDIA container toolkit is installed:
```bash
docker run --rm --gpus all ubuntu nvidia-smi
```
If this fails, install the NVIDIA Container Toolkit:
```bash
distribution=$(. /etc/os-release; echo $ID$VERSION_ID)
curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg
curl -s -L https://nvidia.github.io/libnvidia-container/$distribution/libnvidia-container.list | \
  sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
  sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list
sudo apt update && sudo apt install -y nvidia-container-toolkit
sudo systemctl restart docker
```
Use Cases Where Local AI Excels
Code review with sensitive codebases. Send proprietary code to a local model without any risk of it appearing in training data or being logged by a third party.
Document Q&A on confidential files. Upload legal documents, financial reports, or internal specs to Open WebUI's RAG feature. Nothing leaves your server.
Always-on assistant without rate limits. Commercial APIs have rate limits and occasional outages. Your local Ollama instance is always available, no subscription required, no context limit throttling.
Air-gapped environments. Factories, government, healthcare, finance — environments where internet access is restricted or regulated. Ollama works entirely offline after the initial model download.
Experimentation without billing anxiety. Run thousands of inference requests, test different models, build automation — all at zero marginal cost.
Building Automations with Ollama's API
The API is OpenAI-compatible, making it easy to build local automation:
# Test from command line
```bash
# Test from the command line
curl http://localhost:11434/api/generate \
  -d '{
    "model": "llama3.3",
    "prompt": "Summarize this in one sentence: [your text]",
    "stream": false
  }'
```
```python
# Simple summarization script
import requests

def summarize(text: str, model: str = "llama3.3") -> str:
    response = requests.post(
        "http://localhost:11434/api/generate",
        json={
            "model": model,
            "prompt": f"Summarize in 2 sentences: {text}",
            "stream": False,
        },
        timeout=120,  # local 70B models can take a while
    )
    response.raise_for_status()
    return response.json()["response"]
```
Use this pattern to build local AI pipelines: summarize emails, classify support tickets, extract structured data from documents, generate commit messages — without any API costs or external dependencies.
See all self-hosted AI tools at OSSAlt.
Related: 10 Open-Source Tools to Replace SaaS in 2026 · Coolify vs Vercel: Cost Comparison 2026
How to Keep a Private AI Stack Useful After Launch
The hard part of a self-hosted AI stack is not getting the first model to answer a prompt. The hard part is building a system people continue to trust after the novelty fades. That means choosing a narrow set of approved models, documenting which one is the default for chat, extraction, and coding, and instrumenting latency so users know whether a bad answer came from the model itself or from an overloaded GPU. Teams that skip this governance stage often end up with a chaotic playground: five half-configured models, two abandoned vector stores, and nobody certain which workflow should be used for production tasks. A better pattern is to define tiers. Use a fast local model for internal drafting, a stronger model for longer-form reasoning, and a deterministic workflow layer for retrieval, approvals, and handoff.
This is also why adjacent tooling matters more than model benchmarks suggest. Our Dify guide is useful when you need repeatable workflows, prompt versioning, and API exposure rather than just a chat box. Our n8n guide matters because many valuable AI automations are not conversational at all; they are document triage, summarization, enrichment, and notification chains triggered by ordinary business events. And our Authentik guide closes a gap that many AI teams ignore: once the stack contains internal docs, tickets, and customer data, you need role-aware access and auditability instead of a shared admin password on a sidecar dashboard.
Where Self-Hosted AI Wins and Where It Still Does Not
Self-hosted AI clearly wins when privacy, marginal cost, and workflow control dominate the decision. It is hard to justify sending internal runbooks, legal drafts, or product strategy documents to a third-party model API if a competent local setup handles the workload acceptably. The economics are also favorable for high-volume teams. Once the hardware is purchased or rented, the per-query cost becomes predictable, and experimentation becomes cheaper because nobody is afraid of API burn from testing prompts and embeddings. That changes behavior. Teams iterate more, keep more institutional knowledge in retrieval systems, and are more willing to build automations around routine analysis.
Where self-hosted AI still loses is turnkey convenience at the very top end of model quality. Frontier hosted models remain easier to access and often stronger for ambiguous reasoning, multimodal synthesis, and long-context work. The mature way to handle this is not ideology. It is workload routing. Keep sensitive, repetitive, and operationally embedded tasks on your infrastructure. Reserve external APIs for the few cases where a measurable quality gap justifies the trade-off. Articles on self-hosted AI are stronger when they acknowledge that split, because that is how experienced teams actually deploy these systems.
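That routing split can be made explicit in code rather than left to individual judgment. A minimal sketch, where the marker list and model names are illustrative:

```python
SENSITIVE_MARKERS = {"confidential", "internal", "customer", "salary", "contract"}

def route(task: str, needs_frontier: bool = False) -> str:
    """Sensitive work always stays local; only non-sensitive tasks
    that genuinely need frontier quality go to a hosted API."""
    if any(marker in task.lower() for marker in SENSITIVE_MARKERS):
        return "local:llama3.3"
    return "hosted:frontier-api" if needs_frontier else "local:mistral-small3"

print(route("Summarize this confidential contract"))             # local:llama3.3
print(route("Open-ended market research", needs_frontier=True))  # hosted:frontier-api
print(route("Draft a blog outline"))                             # local:mistral-small3
```

Encoding the policy this way makes the privacy boundary auditable: sensitive prompts cannot reach an external API by accident.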
Related Reading
The SaaS-to-Self-Hosted Migration Guide (Free PDF)
Step-by-step: infrastructure setup, data migration, backups, and security for 15+ common SaaS replacements. Used by 300+ developers.
Join 300+ self-hosters. Unsubscribe in one click.