OCR Document Intelligence: Classifying Vendor Docs at Scale
How I built an OCR, classification, and auto-routing pipeline for hundreds of incoming vendor documents, bank statements, and cost models, turning a folder of PDFs into a searchable operational feed.
The problem
Every week, 50+ vendor invoices, bank statements, and internal cost models land in a shared drive, all as PDFs. I needed a pipeline that reads, classifies, and routes them without a human doing it by hand.
What you'll build
- An OCR layer that handles scanned PDFs, faxes, and mixed-quality docs
- A classifier that tells invoice from statement from cost model
- Auto-routing rules that drop documents into the right folder + surface metadata
- An audit log so every auto-decision is reversible
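To make "reversible" concrete, here is a minimal sketch of an append-only audit log around a file move. The function names (`route_with_audit`, `undo`), the JSON Lines format, and the field names are assumptions for illustration, not the pipeline's actual implementation.

```python
import json
import shutil
from datetime import datetime, timezone
from pathlib import Path


def _append(log_path: Path, entry: dict) -> None:
    """Append one JSON object per line; earlier lines are never rewritten."""
    with log_path.open("a", encoding="utf-8") as fh:
        fh.write(json.dumps(entry) + "\n")


def route_with_audit(src: Path, dest_dir: Path, label: str, log_path: Path) -> dict:
    """Move src into dest_dir and record where it came from and why."""
    dest_dir.mkdir(parents=True, exist_ok=True)
    dest = dest_dir / src.name  # assumption: no filename-collision handling
    shutil.move(str(src), str(dest))
    entry = {
        "action": "route",
        "ts": datetime.now(timezone.utc).isoformat(),
        "src": str(src),
        "dest": str(dest),
        "label": label,
    }
    _append(log_path, entry)
    return entry


def undo(entry: dict, log_path: Path) -> Path:
    """Move a routed file back to its original path and log the reversal."""
    src, dest = Path(entry["src"]), Path(entry["dest"])
    src.parent.mkdir(parents=True, exist_ok=True)
    shutil.move(str(dest), str(src))
    _append(log_path, {
        "action": "undo",
        "ts": datetime.now(timezone.utc).isoformat(),
        "src": str(dest),
        "dest": str(src),
        "label": entry["label"],
    })
    return src
```

Logging the reversal as its own entry, rather than deleting the original line, keeps the full history of what the pipeline did and what a human overrode.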
1. Running OCR that doesn't blow up on bad scans
Coming soon — EasyOCR vs. Tesseract vs. cloud, and when I use which.
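One pattern that keeps a pipeline from blowing up on bad scans is a confidence-gated fallback chain: try the cheapest engine first, and escalate only when it fails or returns low confidence. The sketch below is an assumption about how that could look, not a description of the setup used here. The engines are plain callables returning `(text, confidence)`, so no real OCR library API is shown; wrapping EasyOCR, Tesseract, or a cloud service behind that shape is left to the reader.

```python
from typing import Callable, List, Optional, Tuple

# Hypothetical engine interface: page bytes in, (text, confidence in 0..1) out.
OcrResult = Tuple[str, float]
OcrEngine = Callable[[bytes], OcrResult]


def ocr_with_fallback(
    page: bytes,
    engines: List[Tuple[str, OcrEngine]],
    min_conf: float = 0.8,
) -> Optional[Tuple[str, str, float]]:
    """Try engines in order; stop at the first result with conf >= min_conf.

    Returns (engine_name, text, confidence) for the best result seen,
    or None if every engine raised.
    """
    best: Optional[Tuple[str, str, float]] = None
    for name, engine in engines:
        try:
            text, conf = engine(page)
        except Exception:
            # A crash on one bad scan should not stop the batch.
            continue
        if best is None or conf > best[2]:
            best = (name, text, conf)
        if conf >= min_conf:
            break  # good enough, skip the more expensive engines
    return best
```

Ordering the list from cheapest to most expensive means most clean pages never reach the costly engine, and a `None` result is a natural signal to send the page to manual review.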
2. The classification layer
Coming soon — rule-based vs. LLM-assisted, and when I combine them.
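As a rough illustration of combining the two approaches, the sketch below scores OCR text against keyword rules and only calls a fallback (for example, an LLM) when the rules are not decisive. The keyword lists, the `min_hits` threshold, and the `fallback` callable are all hypothetical and would need tuning against real documents.

```python
import re
from typing import Callable, Dict, List, Optional, Tuple

# Hypothetical keyword rules, one list of regex patterns per document type.
RULES: Dict[str, List[str]] = {
    "invoice": [r"\binvoice\b", r"\bamount due\b", r"\bbill to\b"],
    "bank_statement": [r"\bstatement period\b", r"\bopening balance\b", r"\bclosing balance\b"],
    "cost_model": [r"\bcost model\b", r"\bunit cost\b", r"\bassumptions\b"],
}


def rule_scores(text: str) -> Dict[str, int]:
    """Count how many patterns of each label appear in the text."""
    lowered = text.lower()
    return {
        label: sum(1 for pattern in patterns if re.search(pattern, lowered))
        for label, patterns in RULES.items()
    }


def classify(
    text: str,
    fallback: Optional[Callable[[str], str]] = None,
    min_hits: int = 2,
) -> Tuple[str, str]:
    """Return (label, decided_by), where decided_by is 'rules' or 'fallback'."""
    ranked = sorted(rule_scores(text).items(), key=lambda kv: kv[1], reverse=True)
    (best, best_hits), (_, runner_up_hits) = ranked[0], ranked[1]
    # Rules win only with enough hits and a clear lead over the runner-up.
    if best_hits >= min_hits and best_hits > runner_up_hits:
        return best, "rules"
    if fallback is not None:
        return fallback(text), "fallback"
    return "unknown", "rules"
```

Returning which path made the decision makes it easy to log how often the fallback fires, which is a useful signal for when the rules need updating.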
3. Routing + metadata extraction
Coming soon — the minimal metadata schema that made downstream queries simple.
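For a sense of what a minimal schema plus routing rule could look like, here is a sketch under stated assumptions: the field names, the folder names, and the year/month layout are illustrative guesses, not the schema used in this pipeline.

```python
from dataclasses import asdict, dataclass
from datetime import date
from pathlib import PurePosixPath
from typing import Dict, Optional


@dataclass(frozen=True)
class DocMetadata:
    """Hypothetical minimal metadata record for one routed document."""
    doc_type: str                       # e.g. "invoice", "bank_statement", "cost_model"
    source_file: str                    # original filename in the shared drive
    doc_date: Optional[date] = None     # date printed on the document, if found
    counterparty: Optional[str] = None  # vendor or bank name, if found
    total: Optional[str] = None         # kept as text to avoid float rounding


# Hypothetical mapping from document type to top-level folder.
FOLDERS: Dict[str, str] = {
    "invoice": "invoices",
    "bank_statement": "statements",
    "cost_model": "cost-models",
}


def target_folder(meta: DocMetadata) -> PurePosixPath:
    """Pick a destination folder: <type>/<YYYY>/<MM>, with safe defaults."""
    base = PurePosixPath(FOLDERS.get(meta.doc_type, "needs-review"))
    if meta.doc_date is None:
        return base / "undated"
    return base / f"{meta.doc_date:%Y}" / f"{meta.doc_date:%m}"


def to_record(meta: DocMetadata) -> dict:
    """Flatten to a JSON-friendly dict for indexing or search."""
    record = asdict(meta)
    if meta.doc_date is not None:
        record["doc_date"] = meta.doc_date.isoformat()
    return record
```

Keeping every field except the type and source optional means a document with a failed extraction still gets routed somewhere predictable instead of being dropped.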
Wrap-up
Coming soon — failure modes and what I watch for.