
Self-Host Paperless-ngx: Document Management 2026

OSSAlt Team
paperless-ngx, document-management, ocr, self-hosting, docker, 2026

TL;DR

Paperless-ngx (GPL 3.0, ~20K GitHub stars, Python/TypeScript) replaces the physical filing cabinet. Scan documents or drop PDFs into a watched folder, and Paperless-ngx automatically OCRs them, suggests tags and correspondents, and makes them full-text searchable. Adobe Acrobat charges $12.99/month for OCR and PDF management; Paperless-ngx is free and stores everything locally. After setup, every receipt, tax document, medical form, and letter is searchable in under 2 seconds.

Key Takeaways

  • Paperless-ngx: GPL 3.0, ~20K stars — OCR + full-text search + tag-based document organization
  • Auto-tagging: ML-based classifier learns from your manual tags and auto-suggests on new documents
  • Consume folder: Drop files in a folder → Paperless automatically imports and OCRs them
  • Full-text search: OCR makes every word in every PDF searchable
  • Correspondents: Track who documents are from (IRS, Bank of America, doctor's office, etc.)
  • Document types: Categorize by type (Invoice, Receipt, Medical, Tax, Contract, etc.)

Part 1: Docker Setup

# docker-compose.yml
services:
  broker:
    image: redis:7-alpine
    restart: unless-stopped

  db:
    image: postgres:15-alpine
    restart: unless-stopped
    environment:
      POSTGRES_DB: paperless
      POSTGRES_USER: paperless
      POSTGRES_PASSWORD: "${POSTGRES_PASSWORD}"
    volumes:
      - db_data:/var/lib/postgresql/data

  webserver:
    image: ghcr.io/paperless-ngx/paperless-ngx:latest
    container_name: paperless
    restart: unless-stopped
    ports:
      - "8000:8000"
    volumes:
      - paperless_data:/usr/src/paperless/data
      - paperless_media:/usr/src/paperless/media
      - /path/to/consume:/usr/src/paperless/consume  # Watch this folder
      - /path/to/export:/usr/src/paperless/export    # Export goes here
    environment:
      PAPERLESS_REDIS: redis://broker:6379
      PAPERLESS_DBHOST: db
      PAPERLESS_DBUSER: paperless
      PAPERLESS_DBPASS: "${POSTGRES_PASSWORD}"
      PAPERLESS_DBNAME: paperless
      PAPERLESS_SECRET_KEY: "${SECRET_KEY}"
      PAPERLESS_URL: "https://docs.yourdomain.com"
      PAPERLESS_ADMIN_USER: admin
      PAPERLESS_ADMIN_PASSWORD: "${ADMIN_PASSWORD}"
      PAPERLESS_ADMIN_MAIL: admin@yourdomain.com
      PAPERLESS_TIME_ZONE: America/Los_Angeles
      PAPERLESS_OCR_LANGUAGE: eng
      PAPERLESS_TIKA_ENABLED: 1  # Enable for Office docs (DOCX, XLSX, etc.)
      PAPERLESS_TIKA_GOTENBERG_ENDPOINT: http://gotenberg:3000
      PAPERLESS_TIKA_ENDPOINT: http://tika:9998
    depends_on:
      - broker
      - db

  # Required for Office document conversion:
  gotenberg:
    image: docker.io/gotenberg/gotenberg:7.10
    restart: unless-stopped
    command:
      - "gotenberg"
      - "--chromium-disable-javascript=true"
      - "--chromium-allow-list=file:///tmp/.*"

  tika:
    image: ghcr.io/paperless-ngx/tika:latest
    restart: unless-stopped

volumes:
  db_data:
  paperless_data:
  paperless_media:

# .env (same directory as docker-compose.yml)
POSTGRES_PASSWORD=your-db-password
SECRET_KEY=your-50-char-secret-key
ADMIN_PASSWORD=your-admin-password

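# Create the consume and export folders, then replace /path/to/consume and
# /path/to/export in docker-compose.yml with their absolute paths:
mkdir -p ~/paperless/consume ~/paperless/export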

docker compose up -d
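
The values in .env above are placeholders and should be replaced with random strings before the first start. A minimal sketch of one way to generate them follows; gen_secret and write_env are hypothetical helper names, and the lengths (32, 50, 24) are arbitrary choices apart from the 50-character secret key mentioned above.

```shell
# Sketch: generate random values for the three variables used in .env.
# gen_secret and write_env are illustrative helpers, not part of Paperless-ngx.

# Print N random alphanumeric characters read from /dev/urandom.
gen_secret() {
  local length="$1"
  LC_ALL=C tr -dc 'A-Za-z0-9' < /dev/urandom | head -c "$length" || true
  echo
}

# Write a .env file to the given path, refusing to overwrite an existing one.
write_env() {
  local target="$1"
  if [ -e "$target" ]; then
    echo "refusing to overwrite $target" >&2
    return 1
  fi
  (
    umask 077   # keep the file readable only by the current user
    {
      printf 'POSTGRES_PASSWORD=%s\n' "$(gen_secret 32)"
      printf 'SECRET_KEY=%s\n' "$(gen_secret 50)"
      printf 'ADMIN_PASSWORD=%s\n' "$(gen_secret 24)"
    } > "$target"
  )
}

# Usage (run once, next to docker-compose.yml):
#   write_env .env
```

Alphanumeric-only values avoid quoting problems when Compose substitutes them into the YAML file.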

Part 2: HTTPS with Caddy

docs.yourdomain.com {
    reverse_proxy localhost:8000
}

Part 3: Import Documents

Method 1: Consume folder (automated)

Drop any file into the consume folder:

# Any of these formats work:
cp ~/Downloads/tax-return-2025.pdf ~/paperless/consume/
cp ~/Downloads/bank-statement.pdf ~/paperless/consume/
cp ~/Desktop/receipt.jpg ~/paperless/consume/

# Paperless watches the folder and automatically:
# 1. Moves file to media storage
# 2. Runs OCR (Tesseract)
# 3. Extracts text
# 4. Suggests tags/correspondent/document type via ML classifier
# 5. Makes it searchable
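
If you copy files in bulk, a small wrapper can filter out files Paperless is unlikely to import. The sketch below is an assumption-laden helper, not a Paperless feature: drop_into_consume and is_importable are hypothetical names, and the extension list is a conservative guess at common formats (Office formats are left out because they depend on the Tika setup from Part 1).

```shell
# Sketch: copy only likely-importable files into the consume folder.
# The extension list is an assumption; adjust it to match your setup.

# Return 0 if the file name ends in an extension this sketch treats as importable.
is_importable() {
  case "$(printf '%s' "$1" | tr '[:upper:]' '[:lower:]')" in
    *.pdf|*.png|*.jpg|*.jpeg|*.tif|*.tiff|*.gif|*.webp|*.txt) return 0 ;;
    *) return 1 ;;
  esac
}

# Copy each existing, importable file into the consume folder; report the rest.
drop_into_consume() {
  local consume_dir="$1"; shift
  local file
  for file in "$@"; do
    if [ -f "$file" ] && is_importable "$file"; then
      cp -- "$file" "$consume_dir/" && echo "queued: $file"
    else
      echo "skipped: $file" >&2
    fi
  done
}

# Usage:
#   drop_into_consume ~/paperless/consume ~/Downloads/*.pdf ~/Desktop/receipt.jpg
```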

Method 2: Upload via web UI

  1. Documents → Upload → drag and drop files
  2. Select multiple files to upload them in one batch

Method 3: Email ingestion

# Add to the webserver environment in docker-compose.yml to set how often mail is checked:
PAPERLESS_EMAIL_TASK_CRON: "*/10 * * * *"

# Mail accounts are configured in the Paperless web UI, not through environment variables.
# Settings → Mail → Add mail account:
IMAP server: mail.yourdomain.com
Username: paperless@yourdomain.com
Password: your-email-password

Then add a mail rule for the account. Emails that match a rule are automatically imported as documents.


Part 4: Organization System

Correspondents

Track who documents are from:

  • Settings → Correspondents → Add:
    • IRS, Bank of America, Blue Cross, Employer, Landlord

Paperless auto-assigns based on patterns you define, or learns from your corrections.

Document types

Categorize by type:

  • Invoice, Receipt, Tax Return, Medical Record, Insurance, Contract, Letter

Tags

Tag freely:

  • 2025-taxes, medical-2025, car, home, reimbursable

Tags are the primary organization tool — a document can have multiple tags.

Date extraction

Paperless extracts dates from document content automatically. For receipts or letters, it finds the date in the text.
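
To get a feel for the idea, here is a deliberately simplified sketch that pulls the first date-like string out of OCR text. It is not how Paperless itself parses dates (its own logic is more elaborate); first_date is a hypothetical helper, and it only recognizes YYYY-MM-DD and N/N/YYYY patterns.

```shell
# Sketch: print the first date-like token found in text read from stdin.
# Recognizes only YYYY-MM-DD and N/N/YYYY; prints nothing if no date is found.
first_date() {
  grep -oE '[0-9]{4}-[0-9]{2}-[0-9]{2}|[0-9]{1,2}/[0-9]{1,2}/[0-9]{4}' | head -n 1
}

# Usage:
#   pdftotext receipt.pdf - | first_date
```

If the detected date is wrong, you can always correct the document's creation date by hand in the web UI.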


Part 5: Full-Text Search

# Search examples in the web UI:
"electric bill"           → finds documents containing the phrase "electric bill"
correspondent:IRS         → all IRS documents
tag:2025-taxes            → all 2025 tax documents
type:Invoice              → all invoices
created:[2025-01-01 TO 2025-12-31]  → documents from 2025
content:"account number"  → documents containing that phrase

Combine filters:

correspondent:IRS tag:2025-taxes type:"Tax Return"
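
The same queries can be sent to the Paperless REST API, which is handy for scripts. The sketch below assumes you have created an API token in the web UI and exported PAPERLESS_URL and PAPERLESS_TOKEN yourself; filter_term and paperless_search are hypothetical helper names.

```shell
# Sketch: build a combined search query and send it to the documents API.
# PAPERLESS_URL (e.g. https://docs.yourdomain.com) and PAPERLESS_TOKEN are
# assumed to be set in the environment by you.

# Print one filter term, quoting the value when it contains whitespace.
filter_term() {
  local field="$1" value="$2"
  case "$value" in
    *[[:space:]]*) printf '%s:"%s"\n' "$field" "$value" ;;
    *)             printf '%s:%s\n' "$field" "$value" ;;
  esac
}

# Join all arguments with spaces and send them as the `query` parameter.
paperless_search() {
  if [ -z "${PAPERLESS_URL:-}" ] || [ -z "${PAPERLESS_TOKEN:-}" ]; then
    echo "set PAPERLESS_URL and PAPERLESS_TOKEN first" >&2
    return 1
  fi
  local query="$*"
  curl --silent --show-error --fail --get \
    --header "Authorization: Token ${PAPERLESS_TOKEN}" \
    --data-urlencode "query=${query}" \
    "${PAPERLESS_URL}/api/documents/"
}

# Usage:
#   paperless_search "$(filter_term correspondent IRS)" \
#                    "$(filter_term tag 2025-taxes)" \
#                    "$(filter_term type 'Tax Return')"
```

Letting curl URL-encode the query avoids hand-escaping quotes, colons, and brackets.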

Part 6: Scanner Integration

Network scanners (SANE)

# Scan directly to consume folder via command line:
scanimage --device="brother5:net1;dev0" \
  --format=pdf \
  --resolution=300 \
  --mode=Color \
  > ~/paperless/consume/scan-$(date +%Y%m%d-%H%M%S).pdf
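
Redirecting straight into the consume folder leaves a partial file behind if the scan fails. One possible variation, sketched below, scans to a temporary file first and moves it into place only on success. scan_to_consume is a hypothetical helper; the scanimage options are the same ones used above.

```shell
# Sketch: scan to a temporary file, then move it into the consume folder
# only if scanimage succeeded. Prints the final path on success.
scan_to_consume() {
  local consume_dir="$1" device="$2"
  local stamp tmp_file target
  stamp="$(date +%Y%m%d-%H%M%S)"
  tmp_file="$(mktemp)" || return 1
  target="$consume_dir/scan-$stamp.pdf"
  if scanimage --device="$device" --format=pdf --resolution=300 --mode=Color > "$tmp_file"; then
    chmod 644 "$tmp_file"   # mktemp creates 0600 files; let the container user read it
    mv "$tmp_file" "$target"
    echo "$target"
  else
    rm -f "$tmp_file"       # failed scan: leave nothing behind
    return 1
  fi
}

# Usage:
#   scan_to_consume ~/paperless/consume "brother5:net1;dev0"
```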

iOS/Android scanning

Use a scanning app that saves directly to your consume folder:

  • iOS: Scanner Pro, Microsoft Lens → save to Nextcloud → watched by Paperless
  • Android: Adobe Scan, Microsoft Lens → save to synced folder

Automatic scan workflow

Scanner app (iOS/Android)
  → Saves to Nextcloud folder (auto-sync)
  → Nextcloud folder is also your Paperless consume path
  → Paperless auto-imports and OCRs
  → Document searchable within 60 seconds

Part 7: ML Auto-Classifier

Paperless learns from your tagging behavior:

# Train the classifier manually:
docker exec paperless python manage.py document_create_classifier

# After training, Paperless suggests:
# - Correspondent (who it's from)
# - Document type
# - Tags
# - Storage path

# The more you correct suggestions, the better it gets.

Custom matching rules

# Settings → Tags → Edit tag → Add matching rule:
Tag: "medical"
Algorithm: "Any word"
Pattern: "physician diagnosis prescription copay deductible"
Case insensitive: Yes
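
To preview what a rule like this would catch before saving it, you can approximate the "Any word" behavior on the command line. This is an approximation with grep, not Paperless's own matching code; matches_any_word is a hypothetical helper that treats the pattern as a space-separated word list and matches whole words case-insensitively.

```shell
# Sketch: succeed if the text on stdin contains at least one whole word
# from the space-separated pattern (case-insensitive).
# The function body runs in a subshell so `set -f` does not leak out.
matches_any_word() (
  set -f                      # keep words such as "co*" literal while splitting
  text="$(cat)"
  for word in $1; do
    if printf '%s\n' "$text" | grep -qiwF -e "$word"; then
      exit 0
    fi
  done
  exit 1
)

# Usage:
#   pdftotext statement.pdf - \
#     | matches_any_word "physician diagnosis prescription copay deductible" \
#     && echo "would be tagged: medical"
```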

Part 8: Export and Backup

# Export all documents (preserves metadata):
docker exec paperless document_exporter /usr/src/paperless/export

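# This creates:
# export/
# ├── manifest.json             ← metadata for all documents (tags, dates, correspondents)
# ├── <document files>          ← originals, plus archived versions and thumbnails
# └── ...
#
# Pass --split-manifest to write a separate JSON file per document instead.

# The JSON metadata lets you re-import to a fresh Paperless instance with document_importer.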

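# Database backup (run from the directory containing docker-compose.yml;
# "db" is the service name, so this works regardless of the container name):
docker compose exec -T db pg_dump -U paperless paperless \
  | gzip > paperless-db-$(date +%Y%m%d).sql.gz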

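# Media backup (the volume name is prefixed with your Compose project name,
# "paperless" here; reading the volume's mountpoint usually requires root):
sudo tar -czf paperless-media-$(date +%Y%m%d).tar.gz \
  "$(docker volume inspect paperless_paperless_media --format '{{.Mountpoint}}')"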
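
Both backups above are gzip-compressed, so a quick integrity check after each run is cheap. The sketch below uses a hypothetical verify_backup helper. It catches empty, truncated, or corrupted archives only: a pg_dump that failed immediately can still leave a small but valid gzip file, so an occasional test restore is still worthwhile.

```shell
# Sketch: basic sanity check for a .sql.gz or .tar.gz backup file.
verify_backup() {
  local archive="$1"
  if [ ! -s "$archive" ]; then
    echo "missing or empty: $archive" >&2
    return 1
  fi
  if ! gzip -t "$archive" 2>/dev/null; then
    echo "corrupt gzip stream: $archive" >&2
    return 1
  fi
  echo "ok: $archive"
}

# Usage:
#   verify_backup "paperless-db-$(date +%Y%m%d).sql.gz"
#   verify_backup "paperless-media-$(date +%Y%m%d).tar.gz"
```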

Maintenance

# Update:
docker compose pull
docker compose up -d

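# Check the document archive for missing or corrupted files:
docker exec paperless python manage.py document_sanity_checker

# Redo OCR for a single document (e.g., if OCR failed); 42 is an example document ID:
docker exec paperless python manage.py document_archiver --overwrite --document 42

# Retrain the classifier on your current tags:
docker exec paperless python manage.py document_create_classifier

# Re-apply matching rules and classifier suggestions to existing documents:
docker exec paperless python manage.py document_retagger -c -T -t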

# Logs:
docker compose logs -f webserver

See all open source document management tools at OSSAlt.com/categories/productivity.
