Open-source alternatives guide
Self-Hosted LLM: DeepSeek and Qwen 2026
Run DeepSeek R1 and Qwen 2.5 locally with Ollama. Hardware requirements by model size, cost vs. cloud APIs, and a GPU setup guide, updated for 2026.
Self-Hosted LLM Guide: Run DeepSeek and Qwen Locally
DeepSeek R1 shocked the AI world in early 2025 by matching o1-level reasoning at a fraction of the training cost. Qwen 2.5 and Qwen3 from Alibaba brought frontier-class coding ability to open-weight models. Both are available under permissive licenses. Both run locally via Ollama. And both eliminate the API costs and privacy concerns of using cloud LLM services.
This guide covers hardware requirements for every model size, exact Ollama commands, cost comparison vs. cloud APIs, and what to actually expect from local inference in 2026.
Quick Verdict
- For reasoning tasks (math, coding, logic): DeepSeek R1 8B or 32B, depending on your hardware.
- For coding specifically: Qwen2.5-Coder-32B on a 24GB GPU is currently the best local coding model — matches GPT-4o-mini on most benchmarks.
- Budget hardware: DeepSeek R1 Distill 7B or Qwen3 8B run on consumer GPUs from 2019–2020.
- No GPU: 7B models on Apple Silicon (M1+) are fully usable for real work.
Why Run Models Locally in 2026?
Privacy: No data leaves your machine. Prompts containing code, customer data, medical records, or proprietary information stay local.
Cost at scale: GPT-4o at $2.50/1M input tokens adds up. 1M tokens/day = $75/month. Self-hosted: $0 marginal cost after hardware.
No rate limits: Commercial APIs throttle requests. Local inference is limited only by your GPU/CPU.
Latency: With a good GPU, local 7B models respond in <200ms for short prompts.
Offline capability: Works without internet. Useful in air-gapped environments, travel, or unreliable connections.
Experimentation: Try 20 different models in an afternoon without billing anxiety.
Understanding Quantization
Model files are distributed in quantized formats that trade quality for size and speed. The key formats you'll encounter:
| Format | Size reduction | Quality loss | Best for |
|---|---|---|---|
| Q4_K_M | ~75% vs FP16 | Minimal | Default choice; best quality-per-GB |
| Q4_0 | ~75% vs FP16 | Slight | Faster than Q4_K_M, marginally lower quality |
| Q8_0 | ~50% vs FP16 | Negligible | When you have VRAM to spare |
| FP16 | No compression | None | Full quality; requires large VRAM |
| GGUF | Varies | Varies | Container format used by Ollama/llama.cpp; holds any of the above |
For daily use, Q4_K_M is the right default. Quality is nearly indistinguishable from full precision for conversational and coding tasks.
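The size reductions in the table follow from bits per weight. A rough sketch, using approximate effective bit widths (the constants here are ballpark assumptions, not exact GGUF numbers):

```python
# Approximate effective bits per weight, including scales/metadata.
# These are assumptions for estimation, not exact GGUF figures.
BITS_PER_WEIGHT = {
    "Q4_K_M": 4.8,
    "Q4_0": 4.5,
    "Q8_0": 8.5,
    "FP16": 16.0,
}

def model_size_gb(params_billion: float, fmt: str) -> float:
    """Estimate on-disk size of a quantized model in GB."""
    bits = BITS_PER_WEIGHT[fmt]
    return params_billion * 1e9 * bits / 8 / 1e9  # bits -> bytes -> GB

# A 7B model at Q4_K_M lands around 4 GB; the same model at FP16 is 14 GB
print(round(model_size_gb(7, "Q4_K_M"), 1))  # 4.2
print(model_size_gb(7, "FP16"))              # 14.0
```

This is why a 7B download is ~4–5 GB rather than the 14 GB a full-precision copy would need.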
Hardware Requirements by Model Size
7B–8B Models (Entry Level)
Requirement: 6–8GB VRAM or 8–16GB RAM (CPU)
GPU options: RTX 3060 12GB, RTX 4060 8GB, RX 6700 XT, M1/M2/M3 MacBook
ollama pull deepseek-r1:7b # DeepSeek R1 Distill 7B
ollama pull qwen2.5-coder:7b # Qwen 2.5 Coder 7B
ollama pull qwen3:8b # Qwen3 8B
Expected speed: 30–80 t/s on GPU, 5–15 t/s on CPU (M2 MacBook gets ~25–40 t/s)
Real-world capability: Solid for Q&A, summarization, basic coding, classification. Reasoning quality is noticeably below GPT-4o for complex multi-step problems.
14B–32B Models (Mid-Range)
Requirement: 12–24GB VRAM or 32GB RAM
GPU options: RTX 3090 (24GB), RTX 4090 (24GB), RTX 4080 (16GB for 14B), M2/M3 Max/Ultra
ollama pull deepseek-r1:14b # DeepSeek R1 Distill 14B
ollama pull deepseek-r1:32b # DeepSeek R1 Distill 32B (needs 24GB VRAM)
ollama pull qwen2.5-coder:32b # Qwen 2.5 Coder 32B — best local coding model
ollama pull qwen2.5:14b # Qwen 2.5 14B general
Expected speed: 20–60 t/s on RTX 4090
Real-world capability: 32B models are where local LLMs become genuinely impressive. Qwen2.5-Coder-32B benchmarks at GPT-4o-mini level on coding tasks. DeepSeek R1 32B handles multi-step reasoning that smaller models struggle with.
70B Models (High-End)
Requirement: 2× 24GB VRAM (2× RTX 3090/4090) or 64GB+ unified memory (M2 Ultra, M3 Ultra)
GPU options: Dual RTX 4090 (NVLink not required), M2/M3 Ultra Mac Studio
ollama pull deepseek-r1:70b # DeepSeek R1 70B
ollama pull qwen2.5:72b # Qwen 2.5 72B
ollama pull llama3.3:70b # Meta Llama 3.3 70B (general)
Expected speed: 20–40 t/s on dual RTX 4090 (offloads layers across GPUs automatically)
Real-world capability: Near-frontier reasoning quality. The quality gap vs. GPT-4o is minimal for most tasks.
671B Models (Requires Server Hardware)
DeepSeek V3/R1 full 671B models require 8× H100 80GB or equivalent — not consumer hardware. The distilled models above (7B–70B) use DeepSeek's knowledge in smaller architectures and are the practical choice for local deployment.
VRAM Quick Reference
| GPU | VRAM | Max Model (Q4_K_M) |
|---|---|---|
| RTX 4060 | 8GB | 7B |
| RTX 3060 12GB | 12GB | 13B |
| RTX 4080 | 16GB | 13–14B |
| RTX 3090 / 4090 | 24GB | 32B |
| 2× RTX 4090 | 48GB | 70B |
| M2 Max (32GB unified) | 32GB | 32B |
| M3 Ultra (192GB unified) | 192GB | 671B (quantized) |
Apple Silicon's unified memory architecture means the GPU and CPU share the same memory pool — an M2 Max with 32GB can run 30B models that would need a dedicated 24GB VRAM GPU on a Windows machine.
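The pairings in the table follow a simple rule of thumb: Q4_K_M weights take roughly 0.6 GB per billion parameters, plus a fixed allowance for the KV cache and runtime overhead. The constants below are planning-level assumptions, not measured values:

```python
def fits_in_vram(params_billion: float, vram_gb: float,
                 gb_per_b_params: float = 0.6, overhead_gb: float = 1.5) -> bool:
    """Rule-of-thumb check: Q4_K_M weights at ~0.6 GB per B params,
    plus a flat allowance for KV cache and runtime overhead.
    Both constants are assumptions for rough planning only."""
    return params_billion * gb_per_b_params + overhead_gb <= vram_gb

print(fits_in_vram(32, 24))  # True  - 32B fits a 24GB RTX 3090/4090
print(fits_in_vram(70, 24))  # False - 70B needs dual GPUs or unified memory
```

Large context windows eat into the overhead allowance quickly, so treat a borderline "fits" as a signal to reduce `num_ctx` or step down a quantization level.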
Setting Up DeepSeek R1 with Ollama
# Install Ollama (Linux/macOS)
curl -fsSL https://ollama.com/install.sh | sh
# Pull and run DeepSeek R1
ollama run deepseek-r1:14b
# With explicit thinking visible
# DeepSeek R1 shows its chain-of-thought in <think> tags
DeepSeek R1's reasoning chains are displayed by default — you see the model "thinking" through problems step by step. This is useful for debugging and understanding the model's approach, not just the final answer.
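When calling R1 programmatically, you usually want the final answer without the reasoning trace. A small helper that splits the `<think>` block from the rest of the response (the sample string is illustrative, not real model output):

```python
import re

def split_reasoning(text: str):
    """Separate DeepSeek R1's <think>...</think> chain-of-thought
    from the final answer in a response string."""
    thinking = "\n".join(re.findall(r"<think>(.*?)</think>", text, re.DOTALL)).strip()
    answer = re.sub(r"<think>.*?</think>", "", text, flags=re.DOTALL).strip()
    return thinking, answer

# Illustrative sample, not actual model output
sample = "<think>2 + 2 is 4; no edge cases here.</think>The answer is 4."
thoughts, answer = split_reasoning(sample)
print(answer)  # The answer is 4.
```

Log `thoughts` for debugging and show only `answer` to end users.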
Running via API
import ollama
# Chat with DeepSeek R1
response = ollama.chat(
model='deepseek-r1:14b',
messages=[
{
'role': 'user',
'content': 'Implement a binary search tree in Python with insert, search, and delete methods.'
}
]
)
print(response['message']['content'])
# Or with curl (OpenAI-compatible API)
curl http://localhost:11434/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "deepseek-r1:14b",
"messages": [{"role": "user", "content": "Explain recursion with a simple example"}]
}'
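Because the endpoint is OpenAI-compatible, any HTTP client works. A stdlib-only sketch that builds the same request as the curl example (send it once Ollama is running locally):

```python
import json
import urllib.request

def chat_request(model: str, prompt: str,
                 base: str = "http://localhost:11434/v1") -> urllib.request.Request:
    """Build a request for Ollama's OpenAI-compatible chat endpoint."""
    body = {"model": model, "messages": [{"role": "user", "content": prompt}]}
    return urllib.request.Request(
        base + "/chat/completions",
        data=json.dumps(body).encode(),
        headers={"Content-Type": "application/json"},
    )

req = chat_request("deepseek-r1:14b", "Explain recursion with a simple example")
# Send with: json.load(urllib.request.urlopen(req)) while Ollama is running
print(req.full_url)
```

The same shape works with the official `openai` client by pointing `base_url` at `http://localhost:11434/v1`.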
Setting Up Qwen2.5-Coder for Development
Qwen2.5-Coder-32B is the recommended local model for software development in 2026. It supports 100+ programming languages, fill-in-the-middle completion, and long-context code understanding.
# Pull the coding model
ollama pull qwen2.5-coder:32b
# Or the 7B version for constrained hardware
ollama pull qwen2.5-coder:7b
Integrate with VS Code via Continue
Continue is a VS Code/JetBrains extension that connects to Ollama for local AI coding assistance:
- Install Continue extension
- Open Continue settings → add model:
{
"models": [
{
"title": "Qwen2.5-Coder 32B",
"provider": "ollama",
"model": "qwen2.5-coder:32b",
"apiBase": "http://localhost:11434"
}
],
"tabAutocompleteModel": {
"title": "Qwen2.5-Coder 7B (fast)",
"provider": "ollama",
"model": "qwen2.5-coder:7b"
}
}
Use the 32B model for chat/explanations and the 7B for tab autocomplete (autocomplete needs to be fast — 7B delivers better latency).
Cost Comparison: Local vs. Cloud
Scenario: Developer using AI coding assistant
Assumptions: 500K tokens/day, 50% input / 50% output, 22 working days/month = 11M tokens/month
| Provider | Price per 1M (input/output) | Monthly cost (11M tokens) |
|---|---|---|
| GPT-4o | $2.50 / $10 | ~$69 |
| Claude Sonnet 4.6 | $3 / $15 | ~$99 |
| Groq (Llama 3.3 70B) | $0.59 / $0.79 | ~$7.60 |
| Qwen2.5-Coder 32B local | — | $0 marginal (hardware amortized) |
Hardware payback period:
- RTX 4090 (24GB): ~$1,600–1,800 new
- At ~$69/month savings vs. GPT-4o: payback in roughly 23–26 months
- At ~$7.60/month savings vs. Groq: payback in 18+ years (Groq wins for low volume)
The math favors local when: You use AI heavily (>500K tokens/day), have an existing GPU, or already have a desktop workstation where GPU cost is shared with gaming/other work.
Cloud wins for low-volume: if you're doing 50K tokens/day, Groq at roughly $0.76/month is unbeatable.
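The payback arithmetic is easy to rerun for your own volumes. A small calculator using the per-token prices above (the hardware price is the RTX 4090 figure from this section):

```python
def monthly_cost(tokens_per_day: int, in_price: float, out_price: float,
                 days: int = 22, input_share: float = 0.5) -> float:
    """Monthly API cost in dollars; prices are per 1M tokens."""
    m_tokens = tokens_per_day * days / 1e6
    return m_tokens * (input_share * in_price + (1 - input_share) * out_price)

def payback_months(hardware_cost: float, monthly_savings: float) -> float:
    return hardware_cost / monthly_savings

gpt4o = monthly_cost(500_000, 2.50, 10.00)  # ~$68.75/month at 11M tokens
groq = monthly_cost(500_000, 0.59, 0.79)    # ~$7.59/month
print(round(payback_months(1700, gpt4o)))   # RTX 4090 payback vs GPT-4o, in months
```

Swap in your own token volume and working days; the break-even point moves linearly with usage.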
Model Quality Reality Check
Don't believe the hype or the FUD. Realistic assessment for 2026:
| Task | Local 7B | Local 32B | GPT-4o | Claude Opus 4.6 |
|---|---|---|---|---|
| Simple Q&A | ✅ Good | ✅ Great | ✅ Great | ✅ Great |
| Code generation (common patterns) | ✅ Good | ✅ Great | ✅ Great | ✅ Great |
| Complex reasoning | ⚠️ Mediocre | ✅ Good | ✅ Great | ✅ Best |
| Math (competition level) | ❌ Poor | ✅ Good (R1) | ✅ Good | ✅ Great |
| Long document analysis | ❌ Limited | ✅ Good | ✅ Great | ✅ Best |
| Creative writing | ✅ Good | ✅ Great | ✅ Great | ✅ Best |
| Multilingual | ✅ Good (Qwen) | ✅ Great | ✅ Great | ✅ Great |
The 32B tier is genuinely competitive for day-to-day development work. The gap vs. frontier models shows most clearly in complex multi-step reasoning, mathematical proofs, and tasks requiring judgment about nuanced tradeoffs.
Recommended Setups by Budget
Budget: No Dedicated GPU ($0 extra)
- Hardware: M1/M2/M3 MacBook (8–16GB), or any modern CPU with 16GB RAM
- Model: deepseek-r1:7b or qwen3:8b
- Speed: 15–30 t/s on Apple Silicon, 5–10 t/s on CPU-only x86
- Best for: Light coding assistance, Q&A, summarization
Mid-Range: ~$400–800
- Hardware: Used RTX 3090 (~$500 used) or new RTX 4070 Ti Super (~$800)
- Model: qwen2.5-coder:32b or deepseek-r1:32b
- Speed: 30–50 t/s
- Best for: Full-time development assistant, replaces GitHub Copilot + ChatGPT
High-End: ~$1,600–3,600
- Hardware: RTX 4090 (24GB) or 2× RTX 3090
- Model: qwen2.5-coder:32b (single 4090) or llama3.3:70b (dual 3090)
- Speed: 40–80 t/s
- Best for: Production inference server, team-shared AI endpoint, agentic workflows
Qwen3 vs Qwen2.5: What Changed
Qwen3, released mid-2025, introduced a "thinking mode" similar to DeepSeek R1's chain-of-thought reasoning. Qwen3 models support two modes:
- Thinking mode (/think): Extended reasoning with visible thought process. Use for complex coding, math, and multi-step problems.
- Non-thinking mode (/no_think): Fast conversational responses. Use for Q&A, summarization, and simple code completions.
# Qwen3 with thinking mode
ollama run qwen3:8b
>>> /think Write a recursive function to flatten nested lists in Python.
# Qwen3 without thinking (faster)
>>> /no_think What does the zip() function do?
For teams that previously used separate models for "fast chat" and "deep reasoning," Qwen3 consolidates this into a single model. Run the 8B for solo developers; the 30B (requires 24GB VRAM) for team-shared endpoints.
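The soft switches also work when calling Qwen3 over the API, since they travel inside the user message itself. A minimal helper (suffix placement is an assumption; the REPL examples above use a prefix):

```python
def qwen3_prompt(prompt: str, thinking: bool) -> str:
    """Append Qwen3's soft switch to a prompt for API use.
    Suffix placement is an assumption based on the soft-switch
    convention; adjust if your Qwen3 build expects a prefix."""
    return prompt + (" /think" if thinking else " /no_think")

# Usable with ollama.chat or the OpenAI-compatible endpoint
messages = [{"role": "user",
             "content": qwen3_prompt("Flatten nested lists in Python.", thinking=True)}]
print(messages[0]["content"])
```

This lets one deployed model serve both "fast chat" and "deep reasoning" traffic without a second model tag.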
Running a Shared Team Endpoint
With Ollama + Open WebUI, you can run a local inference server that your whole team connects to:
# docker-compose.yml — team AI server
services:
ollama:
image: ollama/ollama:latest
volumes:
- ollama:/root/.ollama
ports:
- "11434:11434"
deploy:
resources:
reservations:
devices:
- driver: nvidia
count: all
capabilities: [gpu]
open-webui:
image: ghcr.io/open-webui/open-webui:main
volumes:
- open-webui:/app/backend/data
ports:
- "3000:8080"
environment:
- OLLAMA_BASE_URL=http://ollama:11434
depends_on:
- ollama
A single RTX 4090 machine serves 3–5 concurrent users on 32B models. Five developers sharing a $1,800 GPU pay $360 each; against $20/month ChatGPT Plus subscriptions, the machine pays for itself in about 18 months.
Choosing Models for Specific Tasks
Not every use case needs the same model. Here's how to match model to task:
| Use case | Recommended model | Why |
|---|---|---|
| Code autocomplete | qwen2.5-coder:7b | Fast response time matters more than depth |
| Code review / refactoring | qwen2.5-coder:32b | Needs broad context and reasoning |
| Math / logic problems | deepseek-r1:14b | Chain-of-thought reasoning |
| Document summarization | qwen2.5:14b | Strong instruction following, context length |
| Multilingual content | qwen2.5:7b | Qwen models excel at non-English languages |
| Agentic workflows | deepseek-r1:32b | Better multi-step planning |
The biggest mistake new local LLM users make is running a single large model for everything. Use 7B models where speed matters (autocomplete, quick lookups) and 14B–32B models where quality matters (code review, complex reasoning). Ollama handles multiple loaded models with separate GPU memory allocation.
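The task-to-model table above can be encoded as a trivial router in whatever glue code sits in front of Ollama. A sketch (the task names are illustrative labels, not an Ollama API):

```python
# Model tags follow the Ollama names used throughout this guide.
ROUTES = {
    "autocomplete": "qwen2.5-coder:7b",   # latency matters most
    "code_review":  "qwen2.5-coder:32b",  # quality matters most
    "math":         "deepseek-r1:14b",
    "summarize":    "qwen2.5:14b",
    "agentic":      "deepseek-r1:32b",
}

def pick_model(task: str) -> str:
    # Unknown tasks fall back to the small, fast model
    return ROUTES.get(task, "qwen2.5-coder:7b")

print(pick_model("math"))  # deepseek-r1:14b
```

Pass the chosen tag as the `model` field in your chat request; Ollama loads and keeps each model resident as it is first used.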
Troubleshooting Common Issues
Model loads but responses are slow (CPU offloading)
If Ollama outputs llm_load_tensors: offloaded X/Y layers to GPU, some model layers are running on CPU because they don't fit in VRAM. Options: use a smaller model, use a more aggressive quantization (Q4_0 instead of Q4_K_M), or add more VRAM.
# Check what's happening during load
OLLAMA_DEBUG=1 ollama run deepseek-r1:14b
Out of memory errors
# Reduce the context window in the Ollama REPL (default is model-max, often 4096–32768)
ollama run deepseek-r1:14b
>>> /set parameter num_ctx 2048
Reducing context from 32K to 4K can cut VRAM usage by 30–50% with minimal impact for most conversational use.
Multiple users hitting the same endpoint
Ollama processes one request at a time by default. For concurrent users, set:
OLLAMA_NUM_PARALLEL=4 # Process up to 4 requests simultaneously
This enables batching but increases per-request VRAM usage.
Browse all AI self-hosting guides at OSSAlt.
Related: Self-Host Your AI: Ollama + Open WebUI 2026 · 10 Open-Source Tools to Replace SaaS in 2026
How to Keep a Private AI Stack Useful After Launch
The hard part of a self-hosted AI stack is not getting the first model to answer a prompt. The hard part is building a system people continue to trust after the novelty fades. That means choosing a narrow set of approved models, documenting which one is the default for chat, extraction, and coding, and instrumenting latency so users know whether a bad answer came from the model itself or from an overloaded GPU. Teams that skip this governance stage often end up with a chaotic playground: five half-configured models, two abandoned vector stores, and nobody certain which workflow should be used for production tasks. A better pattern is to define tiers. Use a fast local model for internal drafting, a stronger model for longer-form reasoning, and a deterministic workflow layer for retrieval, approvals, and handoff.
This is also why adjacent tooling matters more than model benchmarks suggest. Dify is useful when you need repeatable workflows, prompt versioning, and API exposure rather than just a chat box. n8n matters because many valuable AI automations are not conversational at all; they are document triage, summarization, enrichment, and notification chains triggered by ordinary business events. And Authentik closes a gap that many AI teams ignore: once the stack contains internal docs, tickets, and customer data, you need role-aware access and auditability instead of a shared admin password on a sidecar dashboard.
Where Self-Hosted AI Wins and Where It Still Does Not
Self-hosted AI clearly wins when privacy, marginal cost, and workflow control dominate the decision. It is hard to justify sending internal runbooks, legal drafts, or product strategy documents to a third-party model API if a competent local setup handles the workload acceptably. The economics are also favorable for high-volume teams. Once the hardware is purchased or rented, the per-query cost becomes predictable, and experimentation becomes cheaper because nobody is afraid of API burn from testing prompts and embeddings. That changes behavior. Teams iterate more, keep more institutional knowledge in retrieval systems, and are more willing to build automations around routine analysis.
Where self-hosted AI still loses is turnkey convenience at the very top end of model quality. Frontier hosted models remain easier to access and often stronger for ambiguous reasoning, multimodal synthesis, and long-context work. The mature way to handle this is not ideology. It is workload routing. Keep sensitive, repetitive, and operationally embedded tasks on your infrastructure. Reserve external APIs for the few cases where a measurable quality gap justifies the trade-off. Articles on self-hosted AI are stronger when they acknowledge that split, because that is how experienced teams actually deploy these systems.