Open-source alternatives guide
Self-Hosted LLM: DeepSeek and Qwen 2026
Run DeepSeek R1 and Qwen 2.5 locally with Ollama. Hardware requirements by model size, cost vs. cloud APIs, and a GPU setup guide, updated for 2026.
Self-Hosted LLM Guide: Run DeepSeek and Qwen Locally
DeepSeek R1 shocked the AI world in early 2025 by matching o1-level reasoning at a fraction of the training cost. Qwen 2.5 and Qwen3 from Alibaba brought frontier-class coding ability to open-weight models. Both are available under permissive licenses. Both run locally via Ollama. And both eliminate the API costs and privacy concerns of using cloud LLM services.
This guide covers hardware requirements for every model size, exact Ollama commands, cost comparison vs. cloud APIs, and what to actually expect from local inference in 2026.
Quick Verdict
- For reasoning tasks (math, coding, logic): DeepSeek R1 8B or 32B, depending on your hardware.
- For coding specifically: Qwen2.5-Coder-32B on a 24GB GPU is currently the best local coding model — matches GPT-4o-mini on most benchmarks.
- Budget hardware: DeepSeek R1 Distill 7B or Qwen3 8B run on consumer GPUs from 2019–2020.
- No GPU: 7B models on Apple Silicon (M1+) are fully usable for real work.
Why Run Models Locally in 2026?
Privacy: No data leaves your machine. Prompts containing code, customer data, medical records, or proprietary information stay local.
Cost at scale: GPT-4o at $2.50/1M input tokens adds up. 1M tokens/day = $75/month. Self-hosted: $0 marginal cost after hardware.
No rate limits: Commercial APIs throttle requests. Local inference is limited only by your GPU/CPU.
Latency: With a good GPU, local 7B models respond in <200ms for short prompts.
Offline capability: Works without internet. Useful in air-gapped environments, travel, or unreliable connections.
Experimentation: Try 20 different models in an afternoon without billing anxiety.
Understanding Quantization
Model files are distributed in quantized formats that trade quality for size and speed. The key formats you'll encounter:
| Format | Size reduction | Quality loss | Best for |
|---|---|---|---|
| Q4_K_M | ~75% vs FP16 | Minimal | Default choice; best quality-per-GB |
| Q4_0 | ~75% vs FP16 | Slight | Faster than Q4_K_M, marginally lower quality |
| Q8_0 | ~50% vs FP16 | Negligible | When you have VRAM to spare |
| FP16 | No compression | None | Full quality; requires large VRAM |
| GGUF | Varies | Varies | Container format used by Ollama/llama.cpp; holds any of the above |
For daily use, Q4_K_M is the right default. Quality is nearly indistinguishable from full precision for conversational and coding tasks.
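The size reductions in the table follow from bits per weight. A rough sketch, using approximate effective bit widths (the constants here are ballpark assumptions, not exact GGUF numbers):

```python
# Approximate effective bits per weight, including scales/metadata.
# These are assumptions for estimation, not exact GGUF figures.
BITS_PER_WEIGHT = {
    "Q4_K_M": 4.8,
    "Q4_0": 4.5,
    "Q8_0": 8.5,
    "FP16": 16.0,
}

def model_size_gb(params_billion: float, fmt: str) -> float:
    """Estimate on-disk size of a quantized model in GB."""
    bits = BITS_PER_WEIGHT[fmt]
    return params_billion * 1e9 * bits / 8 / 1e9  # bits -> bytes -> GB

# A 7B model at Q4_K_M lands around 4 GB; the same model at FP16 is 14 GB
print(round(model_size_gb(7, "Q4_K_M"), 1))  # 4.2
print(model_size_gb(7, "FP16"))              # 14.0
```

This is why a 7B download is ~4–5 GB rather than the 14 GB a full-precision copy would need.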
Hardware Requirements by Model Size
7B–8B Models (Entry Level)
Requirement: 6–8GB VRAM or 8–16GB RAM (CPU)
GPU options: RTX 3060 12GB, RTX 4060 8GB, RX 6700 XT, M1/M2/M3 MacBook
ollama pull deepseek-r1:7b # DeepSeek R1 Distill 7B
ollama pull qwen2.5-coder:7b # Qwen 2.5 Coder 7B
ollama pull qwen3:8b # Qwen3 8B
Expected speed: 30–80 t/s on GPU, 5–15 t/s on CPU (M2 MacBook gets ~25–40 t/s)
Real-world capability: Solid for Q&A, summarization, basic coding, classification. Reasoning quality is noticeably below GPT-4o for complex multi-step problems.
14B–32B Models (Mid-Range)
Requirement: 12–24GB VRAM or 32GB RAM
GPU options: RTX 3090 (24GB), RTX 4090 (24GB), RTX 4080 (16GB for 14B), M2/M3 Max/Ultra
ollama pull deepseek-r1:14b # DeepSeek R1 Distill 14B
ollama pull deepseek-r1:32b # DeepSeek R1 Distill 32B (needs 24GB VRAM)
ollama pull qwen2.5-coder:32b # Qwen 2.5 Coder 32B — best local coding model
ollama pull qwen2.5:14b # Qwen 2.5 14B general
Expected speed: 20–60 t/s on RTX 4090
Real-world capability: 32B models are where local LLMs become genuinely impressive. Qwen2.5-Coder-32B benchmarks at GPT-4o-mini level on coding tasks. DeepSeek R1 32B handles multi-step reasoning that smaller models struggle with.
70B Models (High-End)
Requirement: 2× 24GB VRAM (2× RTX 3090/4090) or 64GB+ unified memory (M2 Ultra, M3 Ultra)
GPU options: Dual RTX 4090 (NVLink not required), M2/M3 Ultra Mac Studio
ollama pull deepseek-r1:70b # DeepSeek R1 70B
ollama pull qwen2.5:72b # Qwen 2.5 72B
ollama pull llama3.3:70b # Meta Llama 3.3 70B (general)
Expected speed: 20–40 t/s on dual RTX 4090 (offloads layers across GPUs automatically)
Real-world capability: Near-frontier reasoning quality. The quality gap vs. GPT-4o is minimal for most tasks.
671B Models (Requires Server Hardware)
DeepSeek V3/R1 full 671B models require 8× H100 80GB or equivalent — not consumer hardware. The distilled models above (7B–70B) use DeepSeek's knowledge in smaller architectures and are the practical choice for local deployment.
VRAM Quick Reference
| GPU | VRAM | Max Model (Q4_K_M) |
|---|---|---|
| RTX 4060 | 8GB | 7B |
| RTX 3060 12GB | 12GB | 13B |
| RTX 4080 | 16GB | 13–14B |
| RTX 3090 / 4090 | 24GB | 32B |
| 2× RTX 4090 | 48GB | 70B |
| M2 Max (32GB unified) | 32GB | 32B |
| M3 Ultra (192GB unified) | 192GB | 671B (quantized) |
Apple Silicon's unified memory architecture means the GPU and CPU share the same memory pool — an M2 Max with 32GB can run 30B models that would need a dedicated 24GB VRAM GPU on a Windows machine.
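The pairings in the table follow a simple rule of thumb: Q4_K_M weights take roughly 0.6 GB per billion parameters, plus a fixed allowance for the KV cache and runtime overhead. The constants below are planning-level assumptions, not measured values:

```python
def fits_in_vram(params_billion: float, vram_gb: float,
                 gb_per_b_params: float = 0.6, overhead_gb: float = 1.5) -> bool:
    """Rule-of-thumb check: Q4_K_M weights at ~0.6 GB per B params,
    plus a flat allowance for KV cache and runtime overhead.
    Both constants are assumptions for rough planning only."""
    return params_billion * gb_per_b_params + overhead_gb <= vram_gb

print(fits_in_vram(32, 24))  # True  - 32B fits a 24GB RTX 3090/4090
print(fits_in_vram(70, 24))  # False - 70B needs dual GPUs or unified memory
```

Large context windows eat into the overhead allowance quickly, so treat a borderline "fits" as a signal to reduce `num_ctx` or step down a quantization level.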
Setting Up DeepSeek R1 with Ollama
# Install Ollama (Linux/macOS)
curl -fsSL https://ollama.com/install.sh | sh
# Pull and run DeepSeek R1
ollama run deepseek-r1:14b
# With explicit thinking visible
# DeepSeek R1 shows its chain-of-thought in <think> tags
DeepSeek R1's reasoning chains are displayed by default — you see the model "thinking" through problems step by step. This is useful for debugging and understanding the model's approach, not just the final answer.
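When calling R1 programmatically, you usually want the final answer without the reasoning trace. A small helper that splits the `<think>` block from the rest of the response (the sample string is illustrative, not real model output):

```python
import re

def split_reasoning(text: str):
    """Separate DeepSeek R1's <think>...</think> chain-of-thought
    from the final answer in a response string."""
    thinking = "\n".join(re.findall(r"<think>(.*?)</think>", text, re.DOTALL)).strip()
    answer = re.sub(r"<think>.*?</think>", "", text, flags=re.DOTALL).strip()
    return thinking, answer

# Illustrative sample, not actual model output
sample = "<think>2 + 2 is 4; no edge cases here.</think>The answer is 4."
thoughts, answer = split_reasoning(sample)
print(answer)  # The answer is 4.
```

Log `thoughts` for debugging and show only `answer` to end users.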
Running via API
import ollama
# Chat with DeepSeek R1
response = ollama.chat(
model='deepseek-r1:14b',
messages=[
{
'role': 'user',
'content': 'Implement a binary search tree in Python with insert, search, and delete methods.'
}
]
)
print(response['message']['content'])
# Or with curl (OpenAI-compatible API)
curl http://localhost:11434/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "deepseek-r1:14b",
"messages": [{"role": "user", "content": "Explain recursion with a simple example"}]
}'
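Because the endpoint is OpenAI-compatible, any HTTP client works. A stdlib-only sketch that builds the same request as the curl example (send it once Ollama is running locally):

```python
import json
import urllib.request

def chat_request(model: str, prompt: str,
                 base: str = "http://localhost:11434/v1") -> urllib.request.Request:
    """Build a request for Ollama's OpenAI-compatible chat endpoint."""
    body = {"model": model, "messages": [{"role": "user", "content": prompt}]}
    return urllib.request.Request(
        base + "/chat/completions",
        data=json.dumps(body).encode(),
        headers={"Content-Type": "application/json"},
    )

req = chat_request("deepseek-r1:14b", "Explain recursion with a simple example")
# Send with: json.load(urllib.request.urlopen(req)) while Ollama is running
print(req.full_url)
```

The same shape works with the official `openai` client by pointing `base_url` at `http://localhost:11434/v1`.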
Setting Up Qwen2.5-Coder for Development
Qwen2.5-Coder-32B is the recommended local model for software development in 2026. It supports 100+ programming languages, fill-in-the-middle completion, and long-context code understanding.
# Pull the coding model
ollama pull qwen2.5-coder:32b
# Or the 7B version for constrained hardware
ollama pull qwen2.5-coder:7b
Integrate with VS Code via Continue
Continue is a VS Code/JetBrains extension that connects to Ollama for local AI coding assistance:
- Install Continue extension
- Open Continue settings → add model:
{
"models": [
{
"title": "Qwen2.5-Coder 32B",
"provider": "ollama",
"model": "qwen2.5-coder:32b",
"apiBase": "http://localhost:11434"
}
],
"tabAutocompleteModel": {
"title": "Qwen2.5-Coder 7B (fast)",
"provider": "ollama",
"model": "qwen2.5-coder:7b"
}
}
Use the 32B model for chat/explanations and the 7B for tab autocomplete (autocomplete needs to be fast — 7B delivers better latency).
Cost Comparison: Local vs. Cloud
Scenario: Developer using AI coding assistant
Assumptions: 500K tokens/day, 50% input / 50% output, 22 working days/month = 11M tokens/month
| Provider | Price per 1M (input/output) | Monthly cost (11M tokens) |
|---|---|---|
| GPT-4o | $2.50 / $10 | ~$69 |
| Claude Sonnet 4.6 | $3 / $15 | ~$99 |
| Groq (Llama 3.3 70B) | $0.59 / $0.79 | ~$7.60 |
| Qwen2.5-Coder 32B local | — | $0 marginal (hardware amortized) |
Hardware payback period:
- RTX 4090 (24GB): ~$1,600–1,800 new
- At ~$69/month savings vs. GPT-4o: payback in roughly 23–26 months
- At ~$7.60/month savings vs. Groq: payback in 18+ years (Groq wins for low volume)
The math favors local when: You use AI heavily (>500K tokens/day), have an existing GPU, or already have a desktop workstation where GPU cost is shared with gaming/other work.
Cloud wins for low-volume: if you're doing 50K tokens/day, Groq at roughly $0.76/month is unbeatable.
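The payback arithmetic is easy to rerun for your own volumes. A small calculator using the per-token prices above (the hardware price is the RTX 4090 figure from this section):

```python
def monthly_cost(tokens_per_day: int, in_price: float, out_price: float,
                 days: int = 22, input_share: float = 0.5) -> float:
    """Monthly API cost in dollars; prices are per 1M tokens."""
    m_tokens = tokens_per_day * days / 1e6
    return m_tokens * (input_share * in_price + (1 - input_share) * out_price)

def payback_months(hardware_cost: float, monthly_savings: float) -> float:
    return hardware_cost / monthly_savings

gpt4o = monthly_cost(500_000, 2.50, 10.00)  # ~$68.75/month at 11M tokens
groq = monthly_cost(500_000, 0.59, 0.79)    # ~$7.59/month
print(round(payback_months(1700, gpt4o)))   # RTX 4090 payback vs GPT-4o, in months
```

Swap in your own token volume and working days; the break-even point moves linearly with usage.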
Model Quality Reality Check
Don't believe the hype or the FUD. Realistic assessment for 2026:
| Task | Local 7B | Local 32B | GPT-4o | Claude Opus 4.6 |
|---|---|---|---|---|
| Simple Q&A | ✅ Good | ✅ Great | ✅ Great | ✅ Great |
| Code generation (common patterns) | ✅ Good | ✅ Great | ✅ Great | ✅ Great |
| Complex reasoning | ⚠️ Mediocre | ✅ Good | ✅ Great | ✅ Best |
| Math (competition level) | ❌ Poor | ✅ Good (R1) | ✅ Good | ✅ Great |
| Long document analysis | ❌ Limited | ✅ Good | ✅ Great | ✅ Best |
| Creative writing | ✅ Good | ✅ Great | ✅ Great | ✅ Best |
| Multilingual | ✅ Good (Qwen) | ✅ Great | ✅ Great | ✅ Great |
The 32B tier is genuinely competitive for day-to-day development work. The gap vs. frontier models shows most clearly in complex multi-step reasoning, mathematical proofs, and tasks requiring judgment about nuanced tradeoffs.
Recommended Setups by Budget
Budget: No Dedicated GPU ($0 extra)
- Hardware: M1/M2/M3 MacBook (8–16GB), or any modern CPU with 16GB RAM
- Model: deepseek-r1:7b or qwen3:8b
- Speed: 15–30 t/s on Apple Silicon, 5–10 t/s on CPU-only x86
- Best for: Light coding assistance, Q&A, summarization
Mid-Range: ~$400–800
- Hardware: Used RTX 3090 (~$500 used) or new RTX 4070 Ti Super (~$800)
- Model: qwen2.5-coder:32b or deepseek-r1:32b
- Speed: 30–50 t/s
- Best for: Full-time development assistant, replaces GitHub Copilot + ChatGPT
High-End: ~$1,600–3,600
- Hardware: RTX 4090 (24GB) or 2× RTX 3090
- Model: qwen2.5-coder:32b (single 4090) or llama3.3:70b (dual 3090)
- Speed: 40–80 t/s
- Best for: Production inference server, team-shared AI endpoint, agentic workflows
Qwen3 vs Qwen2.5: What Changed
Qwen3, released mid-2025, introduced a "thinking mode" similar to DeepSeek R1's chain-of-thought reasoning. Qwen3 models support two modes:
- Thinking mode (/think): Extended reasoning with visible thought process. Use for complex coding, math, and multi-step problems.
- Non-thinking mode (/no_think): Fast conversational responses. Use for Q&A, summarization, and simple code completions.
# Qwen3 with thinking mode
ollama run qwen3:8b
>>> /think Write a recursive function to flatten nested lists in Python.
# Qwen3 without thinking (faster)
>>> /no_think What does the zip() function do?
For teams that previously used separate models for "fast chat" and "deep reasoning," Qwen3 consolidates this into a single model. Run the 8B for solo developers; the 30B (requires 24GB VRAM) for team-shared endpoints.
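The soft switches also work when calling Qwen3 over the API, since they travel inside the user message itself. A minimal helper (suffix placement is an assumption; the REPL examples above use a prefix):

```python
def qwen3_prompt(prompt: str, thinking: bool) -> str:
    """Append Qwen3's soft switch to a prompt for API use.
    Suffix placement is an assumption based on the soft-switch
    convention; adjust if your Qwen3 build expects a prefix."""
    return prompt + (" /think" if thinking else " /no_think")

# Usable with ollama.chat or the OpenAI-compatible endpoint
messages = [{"role": "user",
             "content": qwen3_prompt("Flatten nested lists in Python.", thinking=True)}]
print(messages[0]["content"])
```

This lets one deployed model serve both "fast chat" and "deep reasoning" traffic without a second model tag.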
Running a Shared Team Endpoint
With Ollama + Open WebUI, you can run a local inference server that your whole team connects to:
# docker-compose.yml — team AI server
services:
ollama:
image: ollama/ollama:latest
volumes:
- ollama:/root/.ollama
ports:
- "11434:11434"
deploy:
resources:
reservations:
devices:
- driver: nvidia
count: all
capabilities: [gpu]
open-webui:
image: ghcr.io/open-webui/open-webui:main
volumes:
- open-webui:/app/backend/data
ports:
- "3000:8080"
environment:
- OLLAMA_BASE_URL=http://ollama:11434
depends_on:
- ollama
A single RTX 4090 machine serves 3–5 concurrent users on 32B models. Five developers sharing a $1,800 GPU pay $360 each; against $20/month ChatGPT Plus subscriptions, the machine pays for itself in about 18 months.
Choosing Models for Specific Tasks
Not every use case needs the same model. Here's how to match model to task:
| Use case | Recommended model | Why |
|---|---|---|
| Code autocomplete | qwen2.5-coder:7b | Fast response time matters more than depth |
| Code review / refactoring | qwen2.5-coder:32b | Needs broad context and reasoning |
| Math / logic problems | deepseek-r1:14b | Chain-of-thought reasoning |
| Document summarization | qwen2.5:14b | Strong instruction following, context length |
| Multilingual content | qwen2.5:7b | Qwen models excel at non-English languages |
| Agentic workflows | deepseek-r1:32b | Better multi-step planning |
The biggest mistake new local LLM users make is running a single large model for everything. Use 7B models where speed matters (autocomplete, quick lookups) and 14B–32B models where quality matters (code review, complex reasoning). Ollama handles multiple loaded models with separate GPU memory allocation.
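The task-to-model table above can be encoded as a trivial router in whatever glue code sits in front of Ollama. A sketch (the task names are illustrative labels, not an Ollama API):

```python
# Model tags follow the Ollama names used throughout this guide.
ROUTES = {
    "autocomplete": "qwen2.5-coder:7b",   # latency matters most
    "code_review":  "qwen2.5-coder:32b",  # quality matters most
    "math":         "deepseek-r1:14b",
    "summarize":    "qwen2.5:14b",
    "agentic":      "deepseek-r1:32b",
}

def pick_model(task: str) -> str:
    # Unknown tasks fall back to the small, fast model
    return ROUTES.get(task, "qwen2.5-coder:7b")

print(pick_model("math"))  # deepseek-r1:14b
```

Pass the chosen tag as the `model` field in your chat request; Ollama loads and keeps each model resident as it is first used.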
Troubleshooting Common Issues
Model loads but responses are slow (CPU offloading)
If Ollama outputs llm_load_tensors: offloaded X/Y layers to GPU, some model layers are running on CPU because they don't fit in VRAM. Options: use a smaller model, use a more aggressive quantization (Q4_0 instead of Q4_K_M), or add more VRAM.
# Check what's happening during load
OLLAMA_DEBUG=1 ollama run deepseek-r1:14b
Out of memory errors
# Reduce the context window in the Ollama REPL (default is model-max, often 4096–32768)
ollama run deepseek-r1:14b
>>> /set parameter num_ctx 2048
Reducing context from 32K to 4K can cut VRAM usage by 30–50% with minimal impact for most conversational use.
Multiple users hitting the same endpoint
Ollama processes one request at a time by default. For concurrent users, set:
OLLAMA_NUM_PARALLEL=4 # Process up to 4 requests simultaneously
This enables batching but increases per-request VRAM usage.
Browse all AI self-hosting guides at OSSAlt.
Related: Self-Host Your AI: Ollama + Open WebUI 2026 · 10 Open-Source Tools to Replace SaaS in 2026
How to Keep a Private AI Stack Useful After Launch
The hard part of a self-hosted AI stack is not getting the first model to answer a prompt. The hard part is building a system people continue to trust after the novelty fades. That means choosing a narrow set of approved models, documenting which one is the default for chat, extraction, and coding, and instrumenting latency so users know whether a bad answer came from the model itself or from an overloaded GPU. Teams that skip this governance stage often end up with a chaotic playground: five half-configured models, two abandoned vector stores, and nobody certain which workflow should be used for production tasks. A better pattern is to define tiers. Use a fast local model for internal drafting, a stronger model for longer-form reasoning, and a deterministic workflow layer for retrieval, approvals, and handoff.
This is also why adjacent tooling matters more than model benchmarks suggest. Dify is useful when you need repeatable workflows, prompt versioning, and API exposure rather than just a chat box. n8n matters because many valuable AI automations are not conversational at all; they are document triage, summarization, enrichment, and notification chains triggered by ordinary business events. And Authentik closes a gap that many AI teams ignore: once the stack contains internal docs, tickets, and customer data, you need role-aware access and auditability instead of a shared admin password on a sidecar dashboard.
Where Self-Hosted AI Wins and Where It Still Does Not
Self-hosted AI clearly wins when privacy, marginal cost, and workflow control dominate the decision. It is hard to justify sending internal runbooks, legal drafts, or product strategy documents to a third-party model API if a competent local setup handles the workload acceptably. The economics are also favorable for high-volume teams. Once the hardware is purchased or rented, the per-query cost becomes predictable, and experimentation becomes cheaper because nobody is afraid of API burn from testing prompts and embeddings. That changes behavior. Teams iterate more, keep more institutional knowledge in retrieval systems, and are more willing to build automations around routine analysis.
Where self-hosted AI still loses is turnkey convenience at the very top end of model quality. Frontier hosted models remain easier to access and often stronger for ambiguous reasoning, multimodal synthesis, and long-context work. The mature way to handle this is not ideology. It is workload routing. Keep sensitive, repetitive, and operationally embedded tasks on your infrastructure. Reserve external APIs for the few cases where a measurable quality gap justifies the trade-off. Articles on self-hosted AI are stronger when they acknowledge that split, because that is how experienced teams actually deploy these systems.