
Self-Host Your AI: Ollama + Open WebUI 2026

Run Llama 3.3, Mistral, Gemma, and DeepSeek locally with Ollama and Open WebUI. Full Docker setup, model guide, and GPU acceleration in under 10 minutes.

OSSAlt Team


ChatGPT and Claude are excellent — but every prompt you send is logged, used to improve their models, and processed on someone else's infrastructure. Ollama makes it trivial to run powerful open-weight LLMs locally. Open WebUI wraps it in a polished ChatGPT-like interface. Together, you get a fully private AI assistant that runs on your own hardware.

Ollama has over 165,000 GitHub stars. Open WebUI has over 60,000. Both are MIT-licensed and genuinely production-quality.

Quick Verdict

If you have 16GB+ RAM and want a private AI assistant, this setup takes under 10 minutes and delivers a ChatGPT-quality experience entirely offline. The best models for most users in 2026: Llama 3.3 70B for quality (16GB+ VRAM or 32GB RAM), Gemma 3 12B or Mistral Small 3 for balanced performance on consumer hardware.


What Is Ollama?

Ollama is a tool for running large language models locally. It handles:

  • Model downloading and management — one command to pull any supported model
  • Inference serving — runs a local API server at localhost:11434
  • Hardware acceleration — automatic GPU detection for NVIDIA, AMD, and Apple Silicon
  • Multi-model support — switch between models without restarting

The API is OpenAI-compatible, meaning any app built for ChatGPT can be pointed at Ollama instead.

What Is Open WebUI?

Open WebUI is a self-hosted web interface for Ollama (and OpenAI-compatible APIs). It adds:

  • ChatGPT-style conversation UI with history
  • Model switching from the sidebar
  • Document upload for Q&A (RAG)
  • Image generation integration
  • Multi-user support with separate accounts
  • Voice input/output
  • Web search integration
  • Prompt templates and system prompts

System Requirements

| Setup         | Minimum               | Recommended                |
|---------------|-----------------------|----------------------------|
| 7B models     | 8GB RAM               | 16GB RAM                   |
| 13B models    | 16GB RAM              | 32GB RAM                   |
| 70B models    | 32GB RAM              | 64GB RAM or 16GB VRAM      |
| GPU           | Any modern NVIDIA/AMD | NVIDIA RTX 3080 or better  |
| Apple Silicon | M1 (8GB)              | M2/M3 (16GB+)              |

Apple Silicon Macs use unified memory efficiently — an M2 MacBook Pro with 16GB handles 13B models well in real-time.
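To estimate whether a model fits before downloading it, multiply parameter count by bytes per parameter at the chosen quantization and add some runtime overhead. A back-of-the-envelope sketch (the 20% overhead factor is an assumption for KV cache and buffers, not an official Ollama figure):

```python
def model_memory_gb(params_billion: float, bits_per_param: float = 4.0,
                    overhead: float = 1.2) -> float:
    """Rough memory estimate: weights at the given quantization,
    plus an assumed ~20% overhead for KV cache and runtime buffers."""
    weight_bytes = params_billion * 1e9 * (bits_per_param / 8)
    return round(weight_bytes * overhead / 1e9, 1)

# A 7B model at 4-bit fits comfortably in 8GB; a 70B model needs ~40GB+.
print(model_memory_gb(7))    # 4.2
print(model_memory_gb(70))   # 42.0
```

This is why the table above pairs 7B models with 8GB machines and 70B models with 64GB RAM or a large-VRAM GPU.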


Option A: Docker Compose (Recommended)

This is the production-grade setup: Ollama and Open WebUI run as separate services with shared networking.

# docker-compose.yml
version: "3.8"

services:
  ollama:
    image: ollama/ollama:latest
    container_name: ollama
    volumes:
      - ollama:/root/.ollama
    ports:
      - "11434:11434"
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
    restart: unless-stopped

  open-webui:
    image: ghcr.io/open-webui/open-webui:main
    container_name: open-webui
    volumes:
      - open-webui:/app/backend/data
    ports:
      - "3000:8080"
    environment:
      - OLLAMA_BASE_URL=http://ollama:11434
    depends_on:
      - ollama
    restart: unless-stopped

volumes:
  ollama:
  open-webui:

# Start
docker compose up -d

# Check status
docker compose ps

Open WebUI will be available at http://localhost:3000.

Remove the deploy.resources block if you don't have an NVIDIA GPU — CPU inference works fine for smaller models.


Option B: All-in-One Docker (Fastest Start)

One command for the complete stack with GPU:

docker run -d \
  -p 3000:8080 \
  --gpus all \
  -v ollama:/root/.ollama \
  -v open-webui:/app/backend/data \
  --name open-webui \
  --restart always \
  ghcr.io/open-webui/open-webui:ollama

Without GPU:

docker run -d \
  -p 3000:8080 \
  -v ollama:/root/.ollama \
  -v open-webui:/app/backend/data \
  --name open-webui \
  --restart always \
  ghcr.io/open-webui/open-webui:ollama

Option C: Native Install (Apple Silicon)

For Mac users who want maximum performance without Docker overhead:

# Install Ollama
curl -fsSL https://ollama.com/install.sh | sh

# Ollama starts automatically as a background service
# Pull a model
ollama pull llama3.3

# Install Open WebUI separately (requires Python 3.11, or use Docker)
pip install open-webui
open-webui serve

Apple Silicon uses Metal GPU acceleration automatically — no configuration needed.


Downloading Models

Once Ollama is running, pull models with:

# Via Docker
docker exec ollama ollama pull <model-name>

# Or via native Ollama CLI
ollama pull <model-name>

| Hardware | Recommended Model | Why |
|---|---|---|
| 8GB RAM (CPU only) | gemma3:4b | Good quality, fast on CPU |
| 16GB RAM | mistral-small3:latest | Strong at 12B, reasonable speed |
| 16GB VRAM | llama3.3:70b-instruct-q4 | Near-frontier quality |
| 32GB RAM | llama3.3:70b | Best local model for most tasks |
| 192GB+ (M3 Ultra or multi-GPU) | llama3.1:405b-q4 | Frontier-class performance; a 4-bit 405B needs roughly 230GB of memory |

Key Models Available in 2026

ollama pull llama3.3          # Meta Llama 3.3 70B — best all-around
ollama pull mistral-small3    # Mistral Small 3 12B — fast, multilingual
ollama pull gemma3            # Google Gemma 3 27B
ollama pull gemma3:4b         # Google Gemma 3 4B — lightweight
ollama pull deepseek-r1       # DeepSeek R1 — strong reasoning
ollama pull codestral         # Mistral Codestral — code specialist
ollama pull phi4              # Microsoft Phi-4 14B — efficient
ollama pull qwen2.5-coder     # Alibaba Qwen2.5 Coder

Explore all models at ollama.com/library.


First Login and Setup

  1. Open http://localhost:3000
  2. Create admin account (first account becomes admin)
  3. Select a model from the dropdown (top of chat)
  4. Start chatting

Configure a System Prompt

Settings → Models → Edit → System Prompt:

You are a helpful assistant. Be concise. When writing code, always include comments explaining what the code does.

Enable Web Search

Settings → Web → Enable Web Search → choose a search engine (DuckDuckGo works without an API key).

Upload Documents (RAG)

Click the paperclip icon in any chat to upload PDFs, text files, or web URLs. Open WebUI will use them as context for your questions — basic RAG without any additional configuration.
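Under the hood, this works by embedding document chunks and ranking them by similarity to your question. If you ever build retrieval directly against Ollama's embeddings endpoint instead of relying on Open WebUI, the ranking step is plain cosine similarity — a minimal self-contained sketch:

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b))
    return dot / norm

def top_chunks(query_vec: list[float],
               chunk_vecs: list[list[float]], k: int = 3) -> list[int]:
    """Return indices of the k chunks most similar to the query."""
    ranked = sorted(range(len(chunk_vecs)),
                    key=lambda i: cosine_similarity(query_vec, chunk_vecs[i]),
                    reverse=True)
    return ranked[:k]

# Identical vectors score 1.0; orthogonal vectors score 0.0.
print(cosine_similarity([1.0, 0.0], [1.0, 0.0]))  # 1.0
```

In a real pipeline the vectors would come from an embedding model pulled into Ollama (for example nomic-embed-text), with the top-ranked chunks pasted into the prompt as context.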


Multi-User Setup

Open WebUI supports multiple users with separate conversation histories:

  1. Admin → Settings → Users → Create new user
  2. Each user gets their own login and private chat history
  3. Admins can restrict which models each user can access
  4. Rate limiting available per user

This makes it viable to share your instance with a small team or family.


Performance Expectations

| Model | Hardware | Tokens/second |
|---|---|---|
| Llama 3.3 70B | RTX 4090 | ~60–80 t/s |
| Llama 3.3 70B | M3 Max (64GB) | ~30–40 t/s |
| Llama 3.3 70B | CPU (32GB RAM) | ~3–8 t/s |
| Mistral Small 3 12B | RTX 3060 12GB | ~80–100 t/s |
| Gemma 3 4B | CPU only (8GB) | ~15–25 t/s |

CPU inference is slow but usable for short prompts. For conversational use, aim for 20+ t/s — which requires GPU or Apple Silicon for anything above 7B parameters.
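You can measure throughput on your own hardware: with "stream": false, Ollama's /api/generate response includes eval_count (tokens generated) and eval_duration (in nanoseconds). A small script to compute tokens per second from those fields (the commented example assumes a local Ollama with llama3.3 pulled):

```python
import json
from urllib.request import Request, urlopen

def tokens_per_second(eval_count: int, eval_duration_ns: int) -> float:
    """Throughput from Ollama's eval_count / eval_duration (ns) fields."""
    return eval_count / (eval_duration_ns / 1e9)

def benchmark(model: str, prompt: str = "Write a haiku about Docker.") -> float:
    """Run one non-streaming generation and compute tokens/second."""
    req = Request(
        "http://localhost:11434/api/generate",
        data=json.dumps({"model": model, "prompt": prompt,
                         "stream": False}).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urlopen(req) as resp:
        data = json.load(resp)
    return tokens_per_second(data["eval_count"], data["eval_duration"])

# Example (requires a running Ollama with the model pulled):
# print(f"llama3.3: {benchmark('llama3.3'):.1f} tokens/sec")
```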


Connecting Other Apps to Your Local Ollama

Ollama's OpenAI-compatible API means any app that supports OpenAI can use your local models:

from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:11434/v1",
    api_key="ollama",  # required but ignored
)

response = client.chat.completions.create(
    model="llama3.3",
    messages=[{"role": "user", "content": "Explain Docker volumes in one sentence."}],
)
print(response.choices[0].message.content)

Apps with native Ollama support: Continue (VS Code extension), Cursor (via API config), LibreChat, Obsidian Smart Connections, LM Studio (alternative frontend), AnythingLLM, Chatbox.


Keeping Everything Updated

# Update Open WebUI
docker pull ghcr.io/open-webui/open-webui:main
docker compose up -d

# Update Ollama (if using Docker)
docker pull ollama/ollama:latest
docker compose up -d

# Update a model to latest version
docker exec ollama ollama pull llama3.3

Models don't update automatically — pull a new version when you want the latest weights.
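ollama list prints one model per row with the name in the first column, so a small helper can feed that output back into ollama pull to refresh everything at once. A sketch, assuming the Docker setup above:

```shell
# Print model names from `ollama list` output (skips the header row)
list_models() {
  awk 'NR > 1 { print $1 }'
}

# Re-pull every installed model (assumes the docker-compose setup above)
update_all_models() {
  docker exec ollama ollama list | list_models |
    while read -r model; do
      docker exec ollama ollama pull "$model"
    done
}
```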


Remote Access

For access outside your home network:

Option 1: Tailscale (easiest)

curl -fsSL https://tailscale.com/install.sh | sh
tailscale up
# Access Open WebUI at your Tailscale IP: http://100.x.x.x:3000

Option 2: Nginx reverse proxy with SSL

Point your domain at the server, use Let's Encrypt for SSL, and proxy to localhost:3000. A standard Nginx Proxy Manager or Caddy configuration works.
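With Caddy, for instance, the whole reverse proxy with automatic Let's Encrypt certificates is a couple of lines (chat.example.com is a placeholder for your own domain):

```
chat.example.com {
    reverse_proxy localhost:3000
}
```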

Option 3: Cloudflare Tunnel

Zero-config HTTPS without opening firewall ports. Free for personal use.


Vs. Paying for ChatGPT/Claude

| | Ollama + Open WebUI | ChatGPT Plus | Claude Pro |
|---|---|---|---|
| Cost | Hardware only | $20/mo | $20/mo |
| Privacy | 100% local | Logs prompts | Logs prompts |
| Model quality | Near-frontier | Frontier | Frontier |
| Context length | 128K+ (Llama 3.3) | 128K | 200K |
| Image input | Yes (Llava/Gemma) | Yes | Yes |
| Internet access | Via plugin | Yes | Yes |
| Uptime | Your hardware | 99.9%+ | 99.9%+ |

The quality gap between local 70B models and GPT-4-class hosted models has narrowed dramatically in 2026. For most writing, coding, and Q&A tasks, Llama 3.3 70B is competitive. For frontier reasoning or cutting-edge tasks, commercial APIs still win.


Troubleshooting Common Issues

"Cannot connect to Ollama" in Open WebUI

Check that Ollama is running and accessible from the Open WebUI container:

# Test Ollama API from within Open WebUI container
docker exec open-webui curl http://ollama:11434/api/version

# If using host networking instead of Docker network
docker exec open-webui curl http://host.docker.internal:11434/api/version

If you get a connection error, verify the OLLAMA_BASE_URL environment variable matches your actual Ollama address.

Models Download Slowly or Stall

Ollama downloads model files in chunks. Large models (70B = ~40GB) take time on slow connections. Monitor progress:

docker exec ollama ollama pull llama3.3
# Shows download progress with transfer speed

If a download stalls, re-run the same command — Ollama resumes from where it left off.

GPU Not Detected

Verify NVIDIA container toolkit is installed:

docker run --rm --gpus all nvidia/cuda:12.4.1-base-ubuntu22.04 nvidia-smi

If this fails, install the NVIDIA Container Toolkit:

distribution=$(. /etc/os-release; echo $ID$VERSION_ID)
curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg
curl -s -L https://nvidia.github.io/libnvidia-container/$distribution/libnvidia-container.list | \
  sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
  sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list
sudo apt update && sudo apt install -y nvidia-container-toolkit
sudo systemctl restart docker

Use Cases Where Local AI Excels

Code review with sensitive codebases. Send proprietary code to a local model without any risk of it appearing in training data or being logged by a third party.

Document Q&A on confidential files. Upload legal documents, financial reports, or internal specs to Open WebUI's RAG feature. Nothing leaves your server.

Always-on assistant without rate limits. Commercial APIs have rate limits and occasional outages. Your local Ollama instance is always available, no subscription required, no context limit throttling.

Air-gapped environments. Factories, government, healthcare, finance — environments where internet access is restricted or regulated. Ollama works entirely offline after the initial model download.

Experimentation without billing anxiety. Run thousands of inference requests, test different models, build automation — all at zero marginal cost.


Building Automations with Ollama's API

The API is OpenAI-compatible, making it easy to build local automation:

# Test from command line
curl http://localhost:11434/api/generate \
  -d '{
    "model": "llama3.3",
    "prompt": "Summarize this in one sentence: [your text]",
    "stream": false
  }'

# summarize.py — a simple summarization script
import requests

def summarize(text: str, model: str = "llama3.3") -> str:
    response = requests.post(
        "http://localhost:11434/api/generate",
        json={
            "model": model,
            "prompt": f"Summarize in 2 sentences: {text}",
            "stream": False,
        },
    )
    return response.json()["response"]

Use this pattern to build local AI pipelines: summarize emails, classify support tickets, extract structured data from documents, generate commit messages — without any API costs or external dependencies.
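As a concrete example, a commit-message generator is just the staged diff wrapped in a prompt; the prompt wording and model choice below are illustrative, not a fixed convention:

```python
import json
import subprocess
from urllib.request import Request, urlopen

def build_prompt(diff: str) -> str:
    """Wrap a staged git diff in an instruction for the model."""
    return ("Write a one-line conventional commit message for this diff:\n\n"
            + diff)

def generate_commit_message(model: str = "llama3.3") -> str:
    """Ask a local Ollama model to summarize the staged changes."""
    diff = subprocess.run(
        ["git", "diff", "--cached"], capture_output=True, text=True
    ).stdout
    req = Request(
        "http://localhost:11434/api/generate",
        data=json.dumps({"model": model, "prompt": build_prompt(diff),
                         "stream": False}).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urlopen(req) as resp:
        return json.load(resp)["response"].strip()

# Example (requires staged changes and a running Ollama):
# print(generate_commit_message())
```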


See all self-hosted AI tools at OSSAlt.

Related: 10 Open-Source Tools to Replace SaaS in 2026 · Coolify vs Vercel: Cost Comparison 2026

How to Keep a Private AI Stack Useful After Launch

The hard part of a self-hosted AI stack is not getting the first model to answer a prompt. The hard part is building a system people continue to trust after the novelty fades. That means choosing a narrow set of approved models, documenting which one is the default for chat, extraction, and coding, and instrumenting latency so users know whether a bad answer came from the model itself or from an overloaded GPU. Teams that skip this governance stage often end up with a chaotic playground: five half-configured models, two abandoned vector stores, and nobody certain which workflow should be used for production tasks. A better pattern is to define tiers. Use a fast local model for internal drafting, a stronger model for longer-form reasoning, and a deterministic workflow layer for retrieval, approvals, and handoff.

This is also why adjacent tooling matters more than model benchmarks suggest. Our Dify guide is useful when you need repeatable workflows, prompt versioning, and API exposure rather than just a chat box. Our n8n guide matters because many valuable AI automations are not conversational at all; they are document triage, summarization, enrichment, and notification chains triggered by ordinary business events. And our Authentik guide closes a gap that many AI teams ignore: once the stack contains internal docs, tickets, and customer data, you need role-aware access and auditability instead of a shared admin password on a sidecar dashboard.

Where Self-Hosted AI Wins and Where It Still Does Not

Self-hosted AI clearly wins when privacy, marginal cost, and workflow control dominate the decision. It is hard to justify sending internal runbooks, legal drafts, or product strategy documents to a third-party model API if a competent local setup handles the workload acceptably. The economics are also favorable for high-volume teams. Once the hardware is purchased or rented, the per-query cost becomes predictable, and experimentation becomes cheaper because nobody is afraid of API burn from testing prompts and embeddings. That changes behavior. Teams iterate more, keep more institutional knowledge in retrieval systems, and are more willing to build automations around routine analysis.

Where self-hosted AI still loses is turnkey convenience at the very top end of model quality. Frontier hosted models remain easier to access and often stronger for ambiguous reasoning, multimodal synthesis, and long-context work. The mature way to handle this is not ideology. It is workload routing. Keep sensitive, repetitive, and operationally embedded tasks on your infrastructure. Reserve external APIs for the few cases where a measurable quality gap justifies the trade-off. Articles on self-hosted AI are stronger when they acknowledge that split, because that is how experienced teams actually deploy these systems.

