How to Get Better OCR Results from Scanned PDFs
A practical workflow to improve OCR accuracy on scanned PDFs — better inputs, correct language selection, and realistic expectations.
February 10, 2026 | 8 min read
Why OCR quality depends on the input
OCR accuracy is primarily an input problem. A blurry, tilted, or low-contrast scan gives the OCR engine less usable information to recognize characters correctly.
The good news: most OCR failures are fixable before you run the tool.
Step 1 — Start with a readable scan
Before running OCR, check these basics:
- Resolution: 300 DPI minimum. A phone photo usually exceeds this when the page fills the frame, but cropped or heavily compressed photos may not.
- Alignment: Keep text straight. A 5° tilt can noticeably reduce accuracy on dense text.
- Contrast: Dark text on a white background. Avoid shadows from handheld scanning.
- Lighting: Even, diffuse light. Avoid harsh direct light that creates hotspots.
If a flatbed scanner is not available, many phone camera apps include a document mode that auto-corrects perspective and contrast.
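The resolution check in the list above can be estimated from an image's pixel width and the physical width of the page. The sketch below is a minimal illustration, not part of any PDFHarbor tool: the function names `effectiveDpi` and `meetsOcrMinimum` are hypothetical, and it assumes the page fills the frame edge to edge.

```typescript
// Estimate the effective resolution of a page image, in pixels per inch.
// Assumption: the page fills the frame, so image width maps to paper width.
function effectiveDpi(pixelWidth: number, pageWidthInches: number): number {
  if (pixelWidth <= 0 || pageWidthInches <= 0) {
    throw new RangeError("pixelWidth and pageWidthInches must be positive");
  }
  return pixelWidth / pageWidthInches;
}

// 300 DPI is the minimum suggested in the checklist; callers may raise it
// for small or dense text.
function meetsOcrMinimum(
  pixelWidth: number,
  pageWidthInches: number,
  minDpi: number = 300
): boolean {
  return effectiveDpi(pixelWidth, pageWidthInches) >= minDpi;
}

// US Letter paper is 8.5 inches wide, so 2550 px across is exactly 300 DPI.
const letterScanOk = meetsOcrMinimum(2550, 8.5); // true
// A 1280 px wide photo of the same page is only about 151 DPI.
const smallPhotoOk = meetsOcrMinimum(1280, 8.5); // false
```

If the page occupies only part of the photo, use the pixel width of the page region rather than the full image width, otherwise the estimate will be too optimistic.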
Step 2 — Choose the correct language
Language is often the single most impactful setting in an OCR tool. Running English OCR on a Hindi or Arabic document will produce near-random output.
In the Basic OCR tool, select the primary language of the document before processing. For multilingual documents, choose the dominant language.
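Tesseract-based engines identify languages by short codes such as `eng`, `hin`, and `ara`. The sketch below shows one way a tool might resolve the dominant language to such a code. The helper `resolveOcrLanguage` and the lookup table are hypothetical and deliberately tiny, and this is not a description of how the Basic OCR tool works internally.

```typescript
// Illustrative lookup from a language name to a Tesseract language code.
// Only three entries are shown; a real table would cover many more.
const LANGUAGE_CODES = new Map<string, string>([
  ["english", "eng"],
  ["hindi", "hin"],
  ["arabic", "ara"],
]);

// Hypothetical helper: resolve the document's dominant language to a code.
// It throws instead of silently falling back to English, because a wrong
// language produces unusable output.
function resolveOcrLanguage(dominantLanguage: string): string {
  const key = dominantLanguage.trim().toLowerCase();
  const code = LANGUAGE_CODES.get(key);
  if (code === undefined) {
    throw new Error(`No OCR language code configured for "${dominantLanguage}"`);
  }
  return code;
}

const hindiCode = resolveOcrLanguage("Hindi"); // "hin"
```

In Tesseract.js, a code like this is supplied when the OCR worker is set up. The exact call differs between library versions, so check the documentation for the version in use.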
Step 3 — Compress images carefully
If source images are very large (10+ MB per page), you can use Compress Image to reduce file size before converting to PDF. Use a high-quality setting, because aggressive compression blurs fine characters and can reduce OCR accuracy significantly.
Only compress when file size is causing problems. When in doubt, keep the original.
Step 4 — Validate output page by page
After OCR, verify critical values:
- Invoice numbers, dates, and monetary amounts
- Names and addresses
- Easily confused characters: O and 0, l and 1, rn and m
Rescan individual failed pages rather than reprocessing the entire document.
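Part of the character check above can be automated as a first pass. The sketch below flags tokens where a digit sits directly next to a letter that OCR often confuses with a digit (O or o with 0, l or I with 1). The function name `flagSuspiciousTokens` is hypothetical, and the heuristic only surfaces candidates for manual review. It does not detect rn and m confusion, which needs a dictionary or a human reader.

```typescript
// Matches a digit directly next to a lookalike letter, in either order.
// Lookalikes covered: O and o (read as 0), l and I (read as 1).
const DIGIT_NEXT_TO_LOOKALIKE = /\d[OolI]|[OolI]\d/;

// Return whitespace-separated tokens that contain such a pair, for example
// "2O24" or "1l5". These are review candidates, not proven errors.
function flagSuspiciousTokens(text: string): string[] {
  return text
    .split(/\s+/)
    .filter((token) => token.length > 0)
    .filter((token) => DIGIT_NEXT_TO_LOOKALIKE.test(token));
}

const flagged = flagSuspiciousTokens(
  "Invoice INV-2O24 total 1l5.00 due 2026-03-01"
);
// flagged is ["INV-2O24", "1l5.00"]
```

Requiring the lookalike to touch a digit keeps ordinary identifiers such as INV-2024 from being flagged just because they contain the letter I.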
Realistic limitations
Browser-based OCR using Tesseract.js works well for clear typed text. It handles these cases poorly:
- Cursive or handwritten text
- Dense multi-column tables
- Decorative fonts and logos
- Very small text (below 9pt equivalent)
- Stamps or text printed over background images
For these cases, manual entry or a dedicated desktop OCR application will produce better results.
Suggested workflow on PDFHarbor
- Prepare images → Compress Image if needed
- Bundle pages → Image to PDF
- Extract text → Basic OCR
- Assemble final document → Merge PDF
Common questions
What resolution should scans be for OCR?
300 DPI is the standard minimum. Higher is better for small or dense text.
Should I compress images before OCR?
Only if the file is too large to work with. Over-compression blurs characters and reduces accuracy.
Can browser OCR handle handwriting?
Not reliably. Typed text performs well; handwriting and cursive do not.
Why does OCR language selection matter so much?
OCR engines use language-specific character models. Wrong language means wrong character assumptions and poor output.