How to Install Paperless-ngx with Docker (Go Paperless)

I have a filing cabinet in the spare room that I opened twice in the last five years. Once to file an insurance renewal, once to find a gas safety certificate that the letting agent needed in 48 hours and that took me two hours to locate. That was the day I set up Paperless-ngx. Every document I own is now scanned, OCR’d, tagged, and searchable in seconds. The filing cabinet is still there, but I cannot tell you what is in it any more because I stopped caring.

From the homelab: Paperless-ngx transformed how I handle documents. Every bill, receipt, and letter gets scanned and OCR’d automatically. I can find any document in seconds instead of digging through filing cabinets. It is one of those services that makes you wonder how you managed without it.

[Image: organised document management interface showing tagged and categorised scanned documents]

What Paperless-ngx Does

Paperless-ngx is a self-hosted document management system. You feed it documents — scanned paper, PDFs, images, office files — and it processes them through OCR (optical character recognition), extracts the text, and makes everything full-text searchable. Then you tag them, assign correspondents (who the document is from or to), and categorise them by document type.

The result: instead of rifling through a filing cabinet or searching a folder full of scan_2024_03_14.pdf files, you type “boiler service” and get every document related to your boiler, sorted by date, with the relevant text highlighted.

What sets Paperless-ngx apart from just dumping everything in a folder:

  • Full-text search via OCR. Even scanned paper documents become searchable text. The OCR is good: it handles printed text reliably. Neat handwriting sometimes comes through, but do not count on it.
  • Automatic tagging rules. Set up rules once and new documents are tagged automatically. Any document containing “British Gas” gets tagged “Utilities” and assigned the correspondent “British Gas.” You never manually sort again.
  • Consumption directory. Drop files into a folder (locally or via network share) and Paperless processes them automatically. Scan with your phone, save to the folder, done.
  • Document types and correspondents. Organise by what the document is (invoice, certificate, contract) and who it is from. Combined with tags, this gives you a structured archive without rigid folder hierarchies.
  • Originals preserved. Paperless keeps the original file alongside the OCR’d version. Nothing is ever overwritten.

Career Context: Document management, data classification, and search indexing are enterprise problems. Understanding how OCR, full-text indexing, tagging taxonomies, and consumption pipelines work gives you vocabulary and concepts that translate to enterprise content management (SharePoint, Documentum, Alfresco) and data governance roles. The principles are identical — the scale is different.

Prerequisites

You will need:

  • Docker and Docker Compose installed. See the Docker installation guide.
  • A machine with at least 2 GB of RAM. Paperless-ngx runs PostgreSQL, Redis, and the main application. OCR processing is CPU-intensive during document ingestion but idle the rest of the time. A mini PC or VM with 4 GB is comfortable.
  • Some disk space. Your documents plus metadata. Most household document collections are small (a few gigabytes), but if you are scanning years of accumulated paperwork, plan accordingly. 20 GB is generous for most people.
  • Documents to scan. Start with the important ones: insurance policies, property documents, vehicle documents, medical records, certificates, warranties, contracts. You do not need to scan everything on day one.

Step 1: Create the Docker Compose Stack

Paperless-ngx runs three containers: the application itself, PostgreSQL for the database, and Redis for task queuing. Create a directory and the compose file:

mkdir -p ~/paperless-ngx && cd ~/paperless-ngx

Create docker-compose.yml:

services:
  broker:
    image: redis:7
    container_name: paperless-redis
    restart: unless-stopped
    volumes:
      - redis_data:/data

  db:
    image: postgres:16
    container_name: paperless-db
    restart: unless-stopped
    volumes:
      - postgres_data:/var/lib/postgresql/data
    environment:
      POSTGRES_DB: paperless
      POSTGRES_USER: paperless
      POSTGRES_PASSWORD: ${POSTGRES_PASSWORD}

  webserver:
    image: ghcr.io/paperless-ngx/paperless-ngx:latest
    container_name: paperless-ngx
    restart: unless-stopped
    depends_on:
      - db
      - broker
    ports:
      - "8000:8000"
    volumes:
      - ./data:/usr/src/paperless/data
      - ./media:/usr/src/paperless/media
      - ./export:/usr/src/paperless/export
      - ./consume:/usr/src/paperless/consume
    environment:
      PAPERLESS_REDIS: redis://broker:6379
      PAPERLESS_DBHOST: db
      PAPERLESS_DBNAME: paperless
      PAPERLESS_DBUSER: paperless
      PAPERLESS_DBPASS: ${POSTGRES_PASSWORD}
      PAPERLESS_SECRET_KEY: ${PAPERLESS_SECRET_KEY}
      PAPERLESS_ADMIN_USER: ${PAPERLESS_ADMIN_USER}
      PAPERLESS_ADMIN_PASSWORD: ${PAPERLESS_ADMIN_PASSWORD}
      PAPERLESS_OCR_LANGUAGE: eng
      PAPERLESS_TIME_ZONE: Europe/London
      PAPERLESS_CONSUMER_POLLING: 30
      PAPERLESS_CONSUMER_RECURSIVE: "true"
      PAPERLESS_FILENAME_FORMAT: "{created_year}/{correspondent}/{title}"

volumes:
  redis_data:
  postgres_data:

Create the .env file:

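# Docker Compose reads .env values literally; it does not run shell commands.
# Generate each secret in a terminal and paste the output below:
#   openssl rand -base64 24   (for POSTGRES_PASSWORD)
#   openssl rand -base64 48   (for PAPERLESS_SECRET_KEY)
POSTGRES_PASSWORD=paste-generated-value-here
PAPERLESS_SECRET_KEY=paste-generated-value-here
PAPERLESS_ADMIN_USER=admin
PAPERLESS_ADMIN_PASSWORD=your-strong-password-here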

There are several things worth explaining in this configuration. The consume volume is where you drop files for automatic processing. The media volume stores the original documents and their OCR’d versions. The export volume is where Paperless puts its export archives for backup. The PAPERLESS_FILENAME_FORMAT organises the stored files by year and correspondent, which means the underlying file system is browsable even without Paperless running.

Pro tip: Generate the PAPERLESS_SECRET_KEY properly with openssl rand -base64 48 and save it somewhere safe. If you lose this key and need to rebuild, your session tokens and some encrypted data will be invalid. Treat it like any other secret.
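
If you would rather not paste secrets by hand, here is a minimal sketch that writes the .env file for you. The function name write_env is my own invention (it is not part of Paperless-ngx), and it assumes openssl is installed on the host. The shell runs the openssl commands and only the resulting values end up in the file.

```shell
# write_env is a hypothetical helper, not part of Paperless-ngx.
# Usage: write_env <path-to-env-file> <admin-password>
write_env() {
  env_file=$1
  admin_password=$2
  (
    umask 077   # new file is readable by the owner only
    {
      # The shell expands $(...) here, so the file receives the literal values.
      printf 'POSTGRES_PASSWORD=%s\n' "$(openssl rand -base64 24)"
      printf 'PAPERLESS_SECRET_KEY=%s\n' "$(openssl rand -base64 48)"
      printf 'PAPERLESS_ADMIN_USER=admin\n'
      printf 'PAPERLESS_ADMIN_PASSWORD=%s\n' "$admin_password"
    } > "$env_file"
  )
}
```

Run it once from the project directory, for example write_env ~/paperless-ngx/.env 'your-strong-password-here', then copy the generated secret key into your password manager. Running it again overwrites the file with new secrets, which you do not want once the database has been initialised.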

Step 2: Start the Stack

docker compose up -d

The first startup takes a minute or two as PostgreSQL initialises and Paperless runs its migrations. Check the logs to confirm everything came up cleanly:

docker logs paperless-ngx -f

Once you see “Listening on 0.0.0.0:8000” (or the Gunicorn workers starting), open http://your-server-ip:8000 in your browser. Log in with the admin credentials from your .env file.

Step 3: Configure OCR

Paperless-ngx uses Tesseract for OCR, and it works well out of the box for English. If you need additional languages, add them to the PAPERLESS_OCR_LANGUAGE variable. For multiple languages:

PAPERLESS_OCR_LANGUAGE: eng+fra+deu

OCR settings in the admin panel (under Settings) let you control behaviour:

  • OCR mode: “Skip” skips OCR if the PDF already has embedded text (most digitally-created PDFs do). “Redo” forces OCR on everything. “Skip” is the sensible default — it saves processing time and avoids creating worse text from re-OCR’ing already-embedded text.
  • Output type: “PDF/A” is the archival format that embeds the OCR text alongside the original scan. This is what you want. The resulting PDF is searchable in any PDF viewer, not just Paperless.
  • Clean, deskew, rotate: Enable all three for scanned documents. Paperless will straighten crooked scans, clean up noise, and auto-rotate pages. These make a noticeable difference to OCR accuracy on imperfect scans.
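
If you prefer to keep these settings in version-controlled config rather than the web interface, the same behaviour can be pinned with environment variables. The sketch below writes a docker-compose.override.yml, which Compose merges with docker-compose.yml when both sit in the same directory. The helper name write_ocr_override is hypothetical, and the variable names reflect my understanding of the Paperless-ngx configuration options, so check them against the documentation for your version.

```shell
# write_ocr_override is a hypothetical helper, not part of Paperless-ngx.
# Usage: write_ocr_override [path]   (defaults to docker-compose.override.yml)
write_ocr_override() {
  override_file=${1:-docker-compose.override.yml}
  # Quoted 'EOF' means nothing inside the heredoc is expanded by the shell.
  cat > "$override_file" <<'EOF'
services:
  webserver:
    environment:
      PAPERLESS_OCR_MODE: skip
      PAPERLESS_OCR_OUTPUT_TYPE: pdfa
      PAPERLESS_OCR_CLEAN: clean
      PAPERLESS_OCR_DESKEW: "true"
      PAPERLESS_OCR_ROTATE_PAGES: "true"
EOF
}
```

After writing the file, docker compose up -d recreates the webserver container with the extra variables.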

OCR is CPU-intensive. When you first bulk-import a stack of documents, expect your CPU to run hard for a while. Paperless processes documents sequentially by default. You can increase parallelism with PAPERLESS_TASK_WORKERS (e.g., set to 2 or 4), but only if your machine has the cores and RAM to handle it. On a dual-core machine, leave it at 1. On a quad-core with 8 GB RAM, 2 workers is comfortable.
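
The rule of thumb above can be written down as a small calculation. This is only my own heuristic, chosen so that it reproduces the two data points in the paragraph (dual-core gives 1, quad-core with 8 GB gives 2): half the cores, capped at one worker per 4 GB of RAM, never below 1. The function name suggest_task_workers is hypothetical.

```shell
# suggest_task_workers is a hypothetical helper encoding a rough heuristic.
# Usage: suggest_task_workers <cpu-cores> <ram-in-gb>
suggest_task_workers() {
  cores=$1
  ram_gb=$2
  by_cpu=$((cores / 2))    # leave half the cores for everything else
  by_ram=$((ram_gb / 4))   # assume roughly 4 GB of RAM per worker
  workers=$by_cpu
  if [ "$by_ram" -lt "$workers" ]; then
    workers=$by_ram
  fi
  if [ "$workers" -lt 1 ]; then
    workers=1
  fi
  echo "$workers"
}
```

On Linux, suggest_task_workers "$(nproc)" 8 prints a starting value for PAPERLESS_TASK_WORKERS on an 8 GB machine. Treat the result as a first guess and watch CPU and memory during a bulk import before raising it.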

Step 4: The Consumption Directory

The consume directory is where the magic happens. Any file placed in this folder is automatically picked up by Paperless, OCR’d, and added to your archive. The original file is removed from the consumption directory after processing.

This is mapped to ~/paperless-ngx/consume on the host. You can feed it in several ways:

Network Share (Samba/NFS)

Share the consume directory over your network with Samba or NFS. Then any device on your LAN can drop files into it. This is especially useful if you have a flatbed scanner connected to another machine.

Mobile Scanning (The Killer Workflow)

This is the workflow that made Paperless indispensable for me. Install a scanning app on your phone — I use Microsoft Lens (free, surprisingly good edge detection) or Adobe Scan. Scan the document, which converts the phone photo into a flat, clean PDF. Save it to a cloud sync folder (Nextcloud, Syncthing, or even a shared SMB folder if you are on the LAN), which syncs to the Paperless consumption directory.

The flow: receive letter in post, scan it with the phone, put the phone down, letter goes in the recycling. By the time I have made a cup of tea, the document is OCR’d, tagged, and searchable. The paper is irrelevant.

Email Consumption

Paperless can monitor an email inbox and consume attachments. Configure it with:

PAPERLESS_EMAIL_TASK_CRON: "*/5 * * * *"

Then set up mail accounts in the admin panel under Mail > Mail Accounts. This is useful for documents you receive digitally — forward invoices, receipts, and statements to a dedicated email address and they are automatically ingested.

Pro tip: Create subfolders inside the consumption directory for different document types or sources. With PAPERLESS_CONSUMER_RECURSIVE set to true (which we did in the compose file), Paperless scans subdirectories too. I have consume/receipts/, consume/post/, and consume/work/. Matching rules (below) work on document content, not on the folder, so to have documents tagged by the subfolder they arrived in, also set PAPERLESS_CONSUMER_SUBDIRS_AS_TAGS to "true" in the compose file.
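
Creating the subfolders is a one-liner, but here it is as a tiny reusable sketch. The function name make_consume_dirs is hypothetical, and the folder names are just the ones from my own setup.

```shell
# make_consume_dirs is a hypothetical helper, not part of Paperless-ngx.
# Usage: make_consume_dirs <consume-dir> <subfolder>...
make_consume_dirs() {
  base=$1
  shift
  for name in "$@"; do
    mkdir -p "$base/$name"   # -p: no error if the folder already exists
  done
}
```

For example: make_consume_dirs ~/paperless-ngx/consume receipts post work. The new folders need the same ownership as the consume directory itself, otherwise the container cannot remove files after processing them.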

Step 5: Tags, Correspondents, and Document Types

This is where Paperless goes from “a folder of PDFs” to “a genuine document management system.” Set up your taxonomy before bulk-importing. Planning the structure now saves massive reorganisation later.

Suggested Starting Tags

  • Finance: Invoice, Receipt, Bank Statement, Tax
  • Property: Mortgage, Insurance, Utilities, Council Tax, Maintenance
  • Vehicle: MOT, Insurance, Service Record, Registration
  • Medical: Prescription, Referral, Results, Insurance
  • Legal: Contract, Agreement, Will, Identification
  • Work: Payslip, P60, Contract, Reference
  • Certificates: Qualifications, Training, Professional

Correspondents

A correspondent is who the document is from or to. “British Gas,” “HMRC,” “Nationwide,” “NHS.” Create these as you go — you do not need an exhaustive list upfront. Paperless can learn to assign correspondents automatically based on document content.

Document Types

Types describe what the document is: Invoice, Certificate, Contract, Letter, Statement, Receipt. Keep this list short and general. The specificity comes from tags and correspondents, not document types.

Step 6: Matching Rules (Automatic Classification)

This is the feature that makes Paperless worth the setup time. Matching rules automatically assign tags, correspondents, and document types based on the content of incoming documents.

Go to any tag, correspondent, or document type and click Edit. Set the Matching algorithm:

  • Exact: Document must contain the exact phrase
  • Any word: Document contains any of the specified words
  • All words: Document contains all of the specified words
  • Regular expression: For complex patterns
  • Auto: Paperless uses machine learning to classify based on your existing documents

Examples:

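  • Correspondent “British Gas”: matching algorithm “Any word,” match text "british gas" centrica. The double quotes keep the two-word phrase together, so the rule fires on “british gas” or “centrica” but not on “gas” alone.
  • Tag “Invoice”: matching algorithm “Any word,” match text invoice "amount due" "payment due"
  • Tag “Council Tax”: matching algorithm “All words,” match text council tax. Both words must appear somewhere in the document.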
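
To make the difference between “Any word” and “All words” concrete, here is a rough illustration using grep against a plain text file. It is not how Paperless implements matching internally; the function names matches_any and matches_all are hypothetical and only mimic the behaviour described above (case-insensitive, whole words, a quoted argument treated as one phrase).

```shell
# Hypothetical helpers that imitate the two keyword algorithms with grep.
# Usage: matches_any <text-file> <term>...   succeeds if at least one term occurs
matches_any() {
  file=$1
  shift
  for term in "$@"; do
    if grep -qiwF -- "$term" "$file"; then
      return 0
    fi
  done
  return 1
}

# Usage: matches_all <text-file> <term>...   succeeds only if every term occurs
matches_all() {
  file=$1
  shift
  for term in "$@"; do
    grep -qiwF -- "$term" "$file" || return 1
  done
  return 0
}
```

With a file containing the text of a gas bill, matches_any bill.txt "british gas" centrica succeeds as soon as either term appears, while matches_all bill.txt council tax fails unless both words are present.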

Once you have 20 or 30 documents classified, switch to the “Auto” matching algorithm. Paperless trains a classifier on your existing documents and starts predicting tags and correspondents for new ones. It gets surprisingly accurate after a modest training set. You will still occasionally need to correct it, and each correction improves the model.

Pro tip: Start with manual and keyword matching for the first month. Get a solid baseline of correctly classified documents. Then enable auto-matching and let the machine learning take over. If you enable auto too early with too few examples, it makes bad predictions that you then have to correct, which is more work than just doing it manually from the start.

Step 7: Backup Strategy

Your Paperless archive contains documents that might be irreplaceable — property deeds, insurance policies, identification documents, medical records. Backups are not optional here.

Paperless has a built-in export function:

docker exec paperless-ngx document_exporter ../export --zip

This creates a zip file in the export directory containing all documents, metadata, and database content. You can use this to restore a complete Paperless instance from scratch.

For automated nightly backups:

#!/bin/bash
# paperless-backup.sh
set -euo pipefail

BACKUP_DIR="/home/teky/backups/paperless"
TIMESTAMP=$(date +%Y%m%d_%H%M%S)
RETENTION_DAYS=30

mkdir -p "${BACKUP_DIR}"

# Export using Paperless's built-in exporter
docker exec paperless-ngx document_exporter ../export

# Archive the export
tar czf "${BACKUP_DIR}/paperless_${TIMESTAMP}.tar.gz" \
  -C /home/teky/paperless-ngx export/

# Also backup the database directly
docker exec paperless-db pg_dump -U paperless paperless | \
  gzip > "${BACKUP_DIR}/db_${TIMESTAMP}.sql.gz"

# Retention
find "${BACKUP_DIR}" -name "*.tar.gz" -mtime +${RETENTION_DAYS} -delete
find "${BACKUP_DIR}" -name "*.sql.gz" -mtime +${RETENTION_DAYS} -delete

echo "[$(date)] Paperless backup completed"
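
Make the script executable and schedule it with cron. The log goes into the project directory because a regular user's crontab cannot normally write to /var/log:

chmod +x ~/paperless-ngx/paperless-backup.sh
crontab -e
# Add: 0 3 * * * /home/teky/paperless-ngx/paperless-backup.sh >> /home/teky/paperless-ngx/paperless-backup.log 2>&1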
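
A backup you have never checked is a guess. Here is a minimal sketch of a sanity check for the archives the script produces: it confirms that the gzip stream is intact and that the archive contains a manifest.json, which is the metadata file the exporter writes as far as I am aware. The function name verify_backup is hypothetical.

```shell
# verify_backup is a hypothetical helper, not part of Paperless-ngx.
# Usage: verify_backup <archive.tar.gz>   succeeds if the archive looks usable
verify_backup() {
  archive=$1
  # gzip -t tests the compressed stream without extracting anything
  gzip -t "$archive" 2>/dev/null || return 1
  # list the contents and look for the exporter's manifest file
  tar tzf "$archive" 2>/dev/null | grep -q 'manifest\.json$'
}
```

For example: verify_backup /home/teky/backups/paperless/paperless_20240314_030000.tar.gz && echo OK (the filename here is made up). This only proves the archive is readable. The real test is an occasional restore onto a scratch machine.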

Store backups on a different physical device. The same advice as Vaultwarden: a backup on the same disk as the data it protects only guards against software corruption, not hardware failure. Rsync to a NAS, copy to an external drive, or push an encrypted archive offsite. These are your important life documents. Treat the backup accordingly.

Troubleshooting

Documents Stuck in Consumption Directory

Check the Paperless logs: docker logs paperless-ngx. Common causes: the file format is not supported (Paperless handles PDF, PNG, JPG, TIFF, and common office formats), the file is still being written (the consumer waits for the file to stop changing before processing), or there is a permissions issue. The consumption directory needs to be writable by the container user (UID 1000 by default).
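
To rule out the permissions cause quickly, compare the numeric owner of the consume directory with the UID the container runs as. A small sketch, with hypothetical function names and 1000 assumed as the expected UID unless you say otherwise:

```shell
# Hypothetical helpers for checking directory ownership.
# Usage: dir_owner_uid <directory>   prints the numeric owner UID
dir_owner_uid() {
  # ls -n prints numeric IDs; the owner UID is the third field
  ls -ldn "$1" | awk '{print $3}'
}

# Usage: consume_owner_ok <directory> [expected-uid]   (default 1000)
consume_owner_ok() {
  expected=${2:-1000}
  [ "$(dir_owner_uid "$1")" = "$expected" ]
}
```

For example: consume_owner_ok ~/paperless-ngx/consume || echo "fix ownership with chown". If your host user has a different UID, either chown the directory or set the container's user mapping to match.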

OCR Quality Is Poor

Scan quality matters. Phone scans taken in bad lighting with motion blur will produce poor OCR results regardless of software. Use a dedicated scanning app that does edge detection and perspective correction (Microsoft Lens is excellent for this). For paper documents, a flatbed scanner at 300 DPI produces consistently better results than phone cameras. Enable the clean, deskew, and rotate options in Paperless settings.

Database Connection Errors on Startup

If Paperless fails to connect to PostgreSQL on first start, it is usually a race condition: the web application starts before PostgreSQL is ready. By default, depends_on in Docker Compose only waits for the container to start, not for the database to be ready. Restart the stack with docker compose restart webserver after a minute, or add a healthcheck to the database service and tell the webserver to wait for it with condition: service_healthy:

  db:
    # ... existing config ...
    healthcheck:
      test: ["CMD-SHELL", "pg_isready -U paperless"]
      interval: 10s
      timeout: 5s
      retries: 5
  webserver:
    # ... existing config ...
    depends_on:
      db:
        condition: service_healthy
      broker:
        condition: service_started
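
If you would rather handle the race from the shell (for example in a deploy script), a small retry loop does the job. This is a generic sketch; the function name wait_for is hypothetical and it simply re-runs whatever command you give it until the command succeeds or the attempts run out.

```shell
# wait_for is a hypothetical helper.
# Usage: wait_for <max-attempts> <delay-seconds> <command> [args...]
wait_for() {
  tries=$1
  delay=$2
  shift 2
  attempt=1
  while [ "$attempt" -le "$tries" ]; do
    # Run the probe command quietly; success ends the wait
    if "$@" >/dev/null 2>&1; then
      return 0
    fi
    sleep "$delay"
    attempt=$((attempt + 1))
  done
  return 1   # gave up after the last attempt
}
```

For example: wait_for 30 2 docker exec paperless-db pg_isready -U paperless && docker compose restart webserver. That polls the database for up to a minute and restarts the web application only once PostgreSQL accepts connections.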

Search Returns No Results for Scanned Documents

Check that OCR is actually running. In the document details view, look at the “Content” field. If it is empty, OCR did not process the document. This can happen if the OCR mode is set to “Skip” and the PDF already had (empty or garbled) embedded text. Try reprocessing the document or switching OCR mode to “Redo” for problem documents.

Running Out of Disk Space

Paperless stores both the original document and the archived (OCR’d) version, which roughly doubles your storage needs. Also, the OCR process creates temporary files. If you are bulk-importing hundreds of documents, make sure you have headroom. Monitor disk usage with df -h during large imports.
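
Before a big import, it is worth checking headroom in a script rather than by eye. A minimal sketch, with hypothetical function names, that reads the available space from df and compares it with a threshold you choose:

```shell
# Hypothetical helpers for a pre-import disk space check.
# Usage: free_kb <path>   prints available space in kilobytes
free_kb() {
  # -P forces one line per filesystem; column 4 is "Available"
  df -Pk "$1" | awk 'NR == 2 {print $4}'
}

# Usage: has_headroom <path> <required-kb>
has_headroom() {
  [ "$(free_kb "$1")" -ge "$2" ]
}
```

For example: has_headroom ~/paperless-ngx 10485760 || echo "less than 10 GB free, clear some space first". The 10 GB figure is arbitrary; as a rough guide, allow at least twice the size of the files you are about to import.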

What to Tackle Next

You have a working document management system. Here is what complements it:

  • Vaultwarden Password Manager — your documents are now digitally archived. Your passwords should be too. Self-hosted, encrypted, and under your control.
  • Uptime Kuma Monitoring — monitor Paperless alongside your other services. If the ingestion pipeline stops working, you want to know before important documents pile up unprocessed.
  • Tailscale VPN — access your document archive from anywhere. Search for an insurance policy number from your phone while standing in a car park. That is the workflow that justifies the setup.
  • Build Your First Homelab — if Paperless is your first self-hosted service, plan the infrastructure to support it and more
  • Install Docker on Ubuntu 24.04 — the container platform underneath everything
  • Grafana and Prometheus Monitoring — track your server’s resource usage. OCR processing during bulk imports hits the CPU hard and you will want visibility.

Going paperless is one of those changes where you do not appreciate the value until you need a document urgently. The first time you search “MOT certificate” and have it on screen in three seconds instead of spending 20 minutes digging through a drawer, you will understand why this was worth the setup time. The second time it happens, you will start scanning everything.

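Watch out for this: The Docker image does not ship OCR language packs for every language. If you are processing documents in a language beyond the bundled ones, list the extra Tesseract language codes in the PAPERLESS_OCR_LANGUAGES environment variable so they are installed when the container starts, then reference them in PAPERLESS_OCR_LANGUAGE. Otherwise you will get blank or garbled text extraction.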

Key Takeaways

  • Paperless-ngx turns your paper documents into a searchable, tagged, OCR’d digital archive. Drop a file in the consumption folder and it is processed automatically.
  • The stack runs three containers: Paperless, PostgreSQL, and Redis. Resource requirements are modest for normal use but OCR is CPU-intensive during bulk imports.
  • Set up matching rules for automatic tagging and correspondent assignment. Start with keyword matching, then switch to auto-matching once you have a training set of 20-30 classified documents.
  • The mobile scanning workflow is the killer feature: scan with your phone, save to a synced folder, document appears in Paperless within minutes, tagged and searchable.
  • Backups are critical. Your Paperless archive contains important life documents. Use the built-in exporter plus database dumps, and store copies on separate hardware.
  • Start with the important documents: insurance, property, vehicle, medical, legal. You do not need to scan everything at once. Build the habit gradually.
