How to Get Better OCR Results from Scanned PDFs

A practical workflow to improve OCR accuracy on scanned PDFs — better inputs, correct language selection, and realistic expectations.

February 10, 2026 | 8 min read

Why OCR quality depends on the input

OCR accuracy is primarily an input problem. A blurry, tilted, or low-contrast scan gives the OCR engine less usable information to recognize characters correctly.

The good news: most OCR failures are fixable before you run the tool.

Step 1 — Start with a readable scan

Before running OCR, check these basics:

  • Resolution: 300 DPI minimum. Most phone cameras capture enough pixels to exceed this on a full page, but heavily compressed photos, or photos taken from too far away, may fall below it.
  • Alignment: Keep text straight. A 5° tilt can noticeably reduce accuracy on dense text.
  • Contrast: Dark text on a white background. Avoid shadows from handheld scanning.
  • Lighting: Even, diffuse light. Avoid harsh direct light that creates hotspots.

If a flatbed scanner is not available, most phone camera apps include a document mode that auto-corrects perspective and contrast.
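The resolution check above can be done with simple arithmetic: effective DPI is the number of pixels across the page divided by the page width in inches. The sketch below is a minimal illustration, not part of any tool; the function names and the US Letter example are assumptions chosen for clarity.

```typescript
// Estimate the effective scan resolution from the image's pixel width and
// the physical page width. The 300 DPI threshold comes from the checklist
// above; the function and constant names are illustrative.
const MIN_OCR_DPI = 300;

function effectiveDpi(pixelWidth: number, pageWidthInches: number): number {
  if (pixelWidth <= 0 || pageWidthInches <= 0) {
    throw new RangeError("pixelWidth and pageWidthInches must be positive");
  }
  return pixelWidth / pageWidthInches;
}

function meetsOcrMinimum(pixelWidth: number, pageWidthInches: number): boolean {
  return effectiveDpi(pixelWidth, pageWidthInches) >= MIN_OCR_DPI;
}

// A US Letter page is 8.5 inches wide.
console.log(meetsOcrMinimum(2550, 8.5)); // 2550 / 8.5 = 300 DPI, so true
console.log(meetsOcrMinimum(1275, 8.5)); // 1275 / 8.5 = 150 DPI, so false
```

This only measures pixel density. A photo can pass the check and still be blurry or badly lit, so the other items in the list still apply.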

Step 2 — Choose the correct language

This is the single most impactful setting in any OCR tool. Running English OCR on a Hindi or Arabic document will produce near-random output.

In the Basic OCR tool, select the primary language of the document before processing. For multilingual documents, choose the dominant language.
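For a multilingual document, "dominant language" can be made concrete by counting which language covers the most pages. The sketch below assumes you already know each page's language and maps the winner to a Tesseract language code (eng, hin, and ara are standard Tesseract traineddata names). Whether the Basic OCR tool accepts these codes directly is an assumption, and the helper names are hypothetical.

```typescript
// Tesseract traineddata codes for a few languages. How the Basic OCR tool
// exposes language selection internally is an assumption.
const LANGUAGE_CODES: Record<string, string> = {
  English: "eng",
  Hindi: "hin",
  Arabic: "ara",
};

// Return the language that covers the most pages.
// On a tie, the language that reached the top count first wins.
function dominantLanguage(pageLanguages: string[]): string {
  if (pageLanguages.length === 0) {
    throw new Error("at least one page language is required");
  }
  const counts: Record<string, number> = Object.create(null);
  let best = "";
  let bestCount = 0;
  for (const lang of pageLanguages) {
    const n = (counts[lang] || 0) + 1;
    counts[lang] = n;
    if (n > bestCount) {
      best = lang;
      bestCount = n;
    }
  }
  return best;
}

// Look up the code for a language name, failing loudly if it is not listed.
function tesseractCodeFor(language: string): string {
  if (!Object.prototype.hasOwnProperty.call(LANGUAGE_CODES, language)) {
    throw new Error(`no code listed for ${language}`);
  }
  return LANGUAGE_CODES[language];
}

const pages = ["English", "Hindi", "Hindi"];
console.log(tesseractCodeFor(dominantLanguage(pages))); // "hin"
```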

Step 3 — Compress images carefully

If source images are very large (10+ MB per page), you can use Compress Image to reduce file size before converting to PDF. Use a high-quality setting, because aggressive compression blurs fine characters and can significantly reduce OCR accuracy.

Only compress when file size is causing problems. When in doubt, keep the original.
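The rule above can be written as a small decision function. This is a sketch under stated assumptions: the 10 MB threshold comes from this guide, but the 0.9 quality value (on a 0 to 1 scale) is an illustrative guess at "high quality" and not a documented setting of Compress Image.

```typescript
// "Very large" threshold taken from the guide: 10+ MB per page.
const LARGE_PAGE_BYTES = 10 * 1024 * 1024;
// Assumed "high quality" value on a 0 to 1 scale; not a documented setting.
const ASSUMED_HIGH_QUALITY = 0.9;

interface CompressionPlan {
  compress: boolean;
  quality: number | null; // null means keep the original untouched
}

// Compress only when the page is very large AND its size is actually
// causing problems; otherwise keep the original.
function planCompression(pageBytes: number, sizeCausesProblems: boolean): CompressionPlan {
  const isVeryLarge = pageBytes >= LARGE_PAGE_BYTES;
  if (isVeryLarge && sizeCausesProblems) {
    return { compress: true, quality: ASSUMED_HIGH_QUALITY };
  }
  return { compress: false, quality: null };
}

console.log(planCompression(14 * 1024 * 1024, true));  // compress at 0.9
console.log(planCompression(3 * 1024 * 1024, true));   // keep original
console.log(planCompression(14 * 1024 * 1024, false)); // keep original
```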

Step 4 — Validate output page by page

After OCR, verify critical values:

  • Invoice numbers, dates, and monetary amounts
  • Names and addresses
  • Easily confused characters: O and 0, l and 1, rn and m

Rescan only the pages that failed rather than reprocessing the entire document.
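Part of this check can be narrowed down automatically by flagging tokens that contain the confusable patterns listed above. The sketch below is a rough review aid, not a validator: the patterns and function name are illustrative, and the "rn" rule will also flag many correctly recognized words, so treat the result as a list of places to look.

```typescript
// Tokens where a digit sits next to the letter O or l, e.g. "2O24" or "l50".
const DIGIT_LETTER_MIX = /(?:\d[Ol]|[Ol]\d)/;
// Tokens containing "rn", which OCR can confuse with "m" (and vice versa).
const RN_PAIR = /rn/;

// Return the whitespace-separated tokens that deserve a manual look.
function tokensToReview(ocrText: string): string[] {
  const tokens = ocrText.split(/\s+/).filter((t) => t.length > 0);
  return tokens.filter((t) => DIGIT_LETTER_MIX.test(t) || RN_PAIR.test(t));
}

console.log(tokensToReview("Invoice INV-2O24 total l50.00 due to Bernard"));
// flags "INV-2O24", "l50.00", and "Bernard"
```

Compare each flagged token against the original scan, starting with invoice numbers and amounts.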

Realistic limitations

Browser-based OCR using Tesseract.js works well for clear typed text. It handles these cases poorly:

  • Cursive or handwritten text
  • Dense multi-column tables
  • Decorative fonts and logos
  • Very small text (below 9pt equivalent)
  • Stamps or text printed over background images

For these cases, manual entry or a dedicated desktop OCR application will produce better results.

Suggested workflow on PDFHarbor

  1. Prepare images → Compress Image if needed
  2. Bundle pages → Image to PDF
  3. Extract text → Basic OCR
  4. Assemble final document → Merge PDF

Common questions

What resolution should scans be for OCR?

300 DPI is the standard minimum. Higher is better for small or dense text.

Should I compress images before OCR?

Only if the file is too large to work with. Over-compression blurs characters and reduces accuracy.

Can browser OCR handle handwriting?

Not reliably. Typed text performs well; handwriting and cursive do not.

Why does OCR language selection matter so much?

OCR engines use language-specific character models. Selecting the wrong language makes the engine apply the wrong character assumptions, which produces poor output.
