Why OCR Errors Matter: Building Robust Document AI Pipelines

Downstream Document AI — QA, summarization, retrieval — is built on text. But that text usually comes from an OCR engine, and OCR engines make mistakes. The question is not whether errors exist; it is what kind, how often, and how much they hurt. The answer, at least in real-world document corpora, is: a lot.

This post is about what we learned building REVISE, a framework for OCR error correction in practical information systems, which appeared at ACL 2025.

The problem is systematic, not random

Random noise is comparatively easy to tolerate: errors scattered evenly across a passage tend to wash out when the text is embedded. OCR errors are not random. They are systematic and correlated with document properties:

Font degradation. Old scanned documents with worn ink produce consistent character-level substitutions: rn → m, l → 1, O → 0. These are predictable per-font, not random.
Multi-column layout errors. OCR engines optimized for single-column text misread reading order in two-column layouts, producing interleaved paragraphs that are syntactically malformed.
Table cell confusion. Cell boundary detection fails on borderless or lightly-bordered tables, merging adjacent cells or splitting single cells across rows.
Structural markers. Section headings, footnote markers, and figure captions are often dropped or merged into body text.

Each error type propagates differently. Character-level errors corrupt embeddings locally but can be smoothed by robust tokenizers. Layout errors corrupt paragraph structure, which breaks chunking. Structural marker loss corrupts the document hierarchy, which breaks retrieval routing.

Building a taxonomy first

Before building a correction model, we needed to know what we were correcting. We constructed a hierarchical taxonomy of OCR errors organized at three linguistic granularities: character level, word level, and column (layout) level. Each granularity has distinct causes and requires distinct correction strategies.

Character Level — Single-character errors

These are the most granular errors: individual characters are wrong, missing, or swapped.

Insertion — Addition of spurious characters from document noise, physical artifacts, or scanner interference. apple → applee
Deletion — Omission of legitimate characters due to poor contrast or faded ink. clamp → lamp, filter → filer
Substitution — Replacing a character with a visually similar alternative due to font peculiarities or resolution limitations. O → 0, é → e, blue → b1ue
Transposition — Character position swapping caused by bounding box coordinate miscalculations. Gauge → Guage
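To make the four categories concrete, here is a minimal sketch that labels which single-character edit turns a clean word into its OCR output. The function name classify_char_error is hypothetical and is not part of REVISE. It assumes the two words differ by exactly one edit and returns "other" for anything more complex.

```python
def classify_char_error(clean: str, noisy: str) -> str:
    """Label the single-character edit that turns `clean` into `noisy`."""
    if clean == noisy:
        return "none"

    # p = index of the first position where the two strings disagree.
    p = 0
    while p < min(len(clean), len(noisy)) and clean[p] == noisy[p]:
        p += 1

    if len(noisy) == len(clean) + 1:
        # Dropping the extra character from the OCR output must restore the word.
        return "insertion" if noisy[:p] + noisy[p + 1:] == clean else "other"

    if len(noisy) == len(clean) - 1:
        # Dropping one character from the clean word must give the OCR output.
        return "deletion" if clean[:p] + clean[p + 1:] == noisy else "other"

    if len(noisy) == len(clean):
        if clean[p + 1:] == noisy[p + 1:]:
            return "substitution"
        swapped = (
            p + 1 < len(clean)
            and clean[p] == noisy[p + 1]
            and clean[p + 1] == noisy[p]
            and clean[p + 2:] == noisy[p + 2:]
        )
        if swapped:
            return "transposition"

    return "other"


# The examples from the list above.
for clean, noisy in [("apple", "applee"), ("clamp", "lamp"),
                     ("blue", "b1ue"), ("Gauge", "Guage")]:
    print(f"{clean} -> {noisy}: {classify_char_error(clean, noisy)}")
```

A labeller like this is only useful on aligned word pairs; real OCR output usually needs an alignment step first, which is out of scope for this sketch.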

Word Level — Word-segmentation errors

These errors occur at token boundaries — the OCR engine misidentifies where one word ends and the next begins.

Over-segmentation — Fragmenting a cohesive compound word by inserting an incorrect space. greenhouse → green house
Under-segmentation — Erroneously merging two distinct words into one unit due to spacing misinterpretation or layout analysis failure. Not able → Notable

Column Level — Layout-reading errors

This is the most damaging error type for downstream tasks. An OCR engine processing a two-column document in left-to-right, top-to-bottom order will interleave the text from both columns rather than reading each column sequentially. The resulting output is syntactically and semantically incoherent: words from a sentence in column A are mixed with words from an unrelated sentence in column B. No character-level correction can fix this; restoring reading order requires structural understanding of the page layout.
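A toy illustration of the failure, assuming a page already split into two lists of lines (one per column). The function names and the sample sentences are made up for this sketch; a real OCR engine works on bounding boxes, not pre-split lines.

```python
from itertools import zip_longest


def read_row_by_row(left_col, right_col):
    """Naive scan: each output line glues the left and right column text together."""
    return [
        f"{left} {right}".strip()
        for left, right in zip_longest(left_col, right_col, fillvalue="")
    ]


def read_column_by_column(left_col, right_col):
    """Correct reading order: finish the left column, then read the right one."""
    return list(left_col) + list(right_col)


left = ["The invoice was issued", "on the first of March."]
right = ["Payment is due within", "thirty days of receipt."]

print(read_row_by_row(left, right))
print(read_column_by_column(left, right))
```

Every character in the naive output is correct, which is why character-level metrics can look healthy while the sentences themselves are unreadable.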

Category	Name	Example
Character Level (Single-character)	Insertion	apple → applee
Character Level (Single-character)	Deletion	clamp → lamp · filter → filer
Character Level (Single-character)	Substitution	O → 0 · é → e · blue → b1ue
Character Level (Single-character)	Transposition	Gauge → Guage
Word Level (Word-segmentation)	Over-segmentation	greenhouse → green house
Word Level (Word-segmentation)	Under-segmentation	Not able → Notable
Column Level (Layout-reading)	Column Reading Order	Multi-column interleaving

This taxonomy matters because correction strategies differ by level. Character errors can be addressed with a seq2seq model trained on character perturbations. Structural errors require layout-aware modeling that understands spatial position, not just token sequence.

The synthetic data contamination strategy

Collecting real OCR errors with ground-truth corrections is expensive — you need the original clean document and its OCR output, annotated pair-by-pair. We bypassed this with a synthetic contamination strategy: take a clean document, inject OCR-like noise programmatically, and train the correction model on (noisy, clean) pairs.

The injection pipeline follows three steps:

Format raw text into structured templates — normalize clean text into fixed-length lines and single-column layout, establishing a clean reference for each document.
Simulate column reading order errors — reorder text segments to mimic the interleaved output an OCR engine produces when it naively scans a multi-column page left-to-right.
Inject character- and word-level errors probabilistically — apply each error type (insertion, deletion, substitution, transposition, over-/under-segmentation) with configurable per-type ratios, drawing from distributions estimated on a small set of real OCR outputs.
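The three steps can be sketched roughly as follows. This is not the REVISE implementation: every function name, the CONFUSIONS table, the line width, and the ratios are assumptions chosen for illustration, and step 2 simply alternates lines from the two halves of the document as a stand-in for a real multi-column scan.

```python
import random
import textwrap

# Illustrative look-alike substitutions taken from the examples in this post.
CONFUSIONS = {"O": "0", "l": "1", "é": "e"}
CHAR_KINDS = ("insertion", "deletion", "substitution",
              "transposition", "over_segmentation")


def format_lines(text, width=40):
    """Step 1: normalize clean text into fixed-width, single-column lines."""
    return textwrap.wrap(" ".join(text.split()), width=width)


def interleave_columns(lines):
    """Step 2: treat the two halves as columns and alternate their lines."""
    half = (len(lines) + 1) // 2
    left, right = lines[:half], lines[half:]
    out = []
    for i, line in enumerate(left):
        out.append(line)
        if i < len(right):
            out.append(right[i])
    return out


def corrupt_word(word, ratios, rng):
    """Step 3a: apply at most one error to a word, chosen by per-type ratio."""
    if len(word) < 2:
        return word
    i = rng.randrange(len(word) - 1)
    roll, acc = rng.random(), 0.0
    for kind in CHAR_KINDS:
        acc += ratios.get(kind, 0.0)
        if roll < acc:
            break
    else:
        return word  # no error drawn for this word
    if kind == "insertion":
        return word[:i] + word[i] + word[i:]  # duplicate one character
    if kind == "deletion":
        return word[:i] + word[i + 1:]
    if kind == "substitution":
        # Leaves the word unchanged if the character has no look-alike.
        return word[:i] + CONFUSIONS.get(word[i], word[i]) + word[i + 1:]
    if kind == "transposition":
        return word[:i] + word[i + 1] + word[i] + word[i + 2:]
    return word[:i + 1] + " " + word[i + 1:]  # over-segmentation


def contaminate(text, ratios, seed=0, width=40):
    """Run all three steps and return the noisy text."""
    rng = random.Random(seed)
    noisy_lines = []
    for line in interleave_columns(format_lines(text, width)):
        pieces = []
        for word in line.split():
            word = corrupt_word(word, ratios, rng)
            # Step 3b, under-segmentation: glue this word onto the previous one.
            if pieces and rng.random() < ratios.get("under_segmentation", 0.0):
                pieces[-1] += word
            else:
                pieces.append(word)
        noisy_lines.append(" ".join(pieces))
    return "\n".join(noisy_lines)


# Made-up ratios; in practice they would be estimated from real OCR output.
ratios = {"insertion": 0.02, "deletion": 0.03, "substitution": 0.05,
          "transposition": 0.01, "over_segmentation": 0.02,
          "under_segmentation": 0.02}
clean = ("The greenhouse gauge was not able to filter the blue light. "
         "Old scanned pages with worn ink are harder to read than new ones.")
training_pair = (contaminate(clean, ratios, seed=42),
                 "\n".join(format_lines(clean)))
```

Setting a single ratio to a nonzero value and the rest to zero gives the training data for an error-type-specific model, while a mixed profile like the one above corresponds to the generalist setting.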

The contaminated datasets are then used to train two families of models: a generalist model, REVISE-meta, that handles all error types simultaneously, and a suite of error-type-specific models (e.g., only-Segmentation, only-Column) that each specialize in a single correction task. This gives practitioners the flexibility to apply targeted correction when the dominant error type in their corpus is known.

The key insight: you do not need to know which characters the OCR got wrong. You need to match the statistical profile of errors that OCR produces on documents of a given type. Synthetic contamination is a tractable proxy for this.

Experiments

We evaluated REVISE across four experiment types designed to measure different aspects of OCR correction quality, using four embedding models (bge-large-en-v1.5, e5-large-v2, jina-embeddings-v2-base, gte-base-en-v1.5) and two reader LLMs (Gemma-2-9b-it, Llama-3.1-8B).

1. Retrieval — VisualMRC and DUDE

We measured Recall@1, @3, and @5 on two retrieval benchmarks: VisualMRC (visually-rich documents) and DUDE (a dataset with highly regular column-based layouts).

Method	VisualMRC avg Recall	DUDE avg Recall
Baseline (raw OCR)	0.6578	0.2534
REVISE-meta	0.6666 (+1.3%)	0.2975 (+17.3%)
only Column	0.6612	0.3076 (best single)
only Segmentation	0.6665	0.2987

The 17.3% relative gain in average recall on DUDE comes from REVISE-meta (0.2534 to 0.2975), and the cause is clear: DUDE's documents have highly regular multi-column layouts. Column reading order correction directly fixes the dominant error type, and the only-Column model alone lifts average recall from 0.2534 to 0.3076, about 21% relative and more than the generalist achieves. On VisualMRC, the improvement is smaller because the error distribution is more diverse and no single error type dominates. Notably, some single-type corrections occasionally underperform the baseline. This is a symptom of over-correction, where a correction model modifies text that was actually read correctly and introduces new errors.
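For readers who want to reproduce this kind of number on their own corpus, here is a minimal sketch of Recall@k. It assumes exactly one relevant document per query and assumes that "avg Recall" means the unweighted mean of Recall@1, @3, and @5; both are assumptions made for this sketch, and the function names are hypothetical.

```python
def recall_at_k(ranked_ids, relevant_id, k):
    """1.0 if the relevant document appears in the top k results, else 0.0."""
    return 1.0 if relevant_id in ranked_ids[:k] else 0.0


def average_recall(results, ks=(1, 3, 5)):
    """results: list of (ranked_ids, relevant_id), one entry per query.

    Returns the per-k recall and its unweighted mean over `ks`.
    """
    per_k = {
        k: sum(recall_at_k(ranked, gold, k) for ranked, gold in results) / len(results)
        for k in ks
    }
    return per_k, sum(per_k.values()) / len(ks)


# Toy run with three queries and made-up document IDs.
results = [
    (["d1", "d2", "d3"], "d1"),                    # hit at rank 1
    (["d4", "d2", "d9"], "d2"),                    # hit at rank 2
    (["d7", "d8", "d5", "d6", "d0", "d3"], "d3"),  # hit at rank 6, outside top 5
]
per_k, avg = average_recall(results)
print(per_k, round(avg, 4))
```

Running the same queries against an index built from raw OCR and one built from corrected text gives the baseline and corrected rows of a table like the one above.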

2. Similarity Assessment — DocVQA, CORD, FUNSD

We measured BERTScore between REVISE-corrected text and the ground-truth clean document text, for three datasets covering free-form queries (DocVQA) and structured forms (CORD, FUNSD).

Method	DocVQA	CORD	FUNSD
Baseline	0.4959	0.5390	0.5577
REVISE-meta	0.5137	0.5443	0.5647
only Segmentation	0.5096	0.5408	0.5601

REVISE-meta ranks first on all three datasets. On DocVQA (free-form queries), segmentation correction alone recovers most of that gain (0.5096 vs. 0.5137), while on the structured CORD and FUNSD datasets it recovers only about a third, and combining all error corrections is what delivers the improvement. Because BERTScore compares contextual embeddings rather than surface strings, these gains suggest that REVISE improves semantic consistency and not just surface-level text quality: the corrected text is more aligned with the meaning of the original clean document.

3. Question Answering — VisualMRC and CORD

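We measured downstream QA performance using REVISE-corrected text as input to two reader LLMs. The two dataset columns are reported on different scales, so compare values within a column rather than across columns. REVISE-meta consistently outperforms raw OCR across both models and both datasets: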

Reader Model	Method	VisualMRC	CORD
Gemma-2-9b-it	Baseline	320.9	0.367
Gemma-2-9b-it	REVISE-meta	329.2 (+2.6%)	0.372 (+1.4%)
Llama-3.1-8B	Baseline	290.7	0.448
Llama-3.1-8B	REVISE-meta	293.1 (+0.8%)	0.450 (+0.4%)

The gains are consistent but moderate in QA — OCR correction improves the quality of the input text, which in turn helps the LLM read the document more accurately. The larger improvements on VisualMRC reflect the document type: VisualMRC documents have richer visual layouts (more OCR errors) than CORD's structured receipts, where OCR is already more reliable.

4. Qualitative Assessment — GPT-4o-mini Win Rate

We used GPT-4o-mini as a judge to compare REVISE-corrected text against raw OCR output on the same documents. The judge was shown both versions and asked to pick the higher-quality text for document understanding purposes.

Method	VisualMRC Win Rate	DUDE Win Rate
REVISE-meta	0.94	0.86
only Segmentation	0.92	0.92
only Column	0.74	0.89
only Deletion	0.84	0.64
only Insertion	0.61	0.59

REVISE-meta achieves a 0.94 win rate on VisualMRC: in 94% of comparisons, the LLM judge preferred the corrected text over raw OCR. On DUDE, only-Segmentation (0.92) and only-Column (0.89) both edge out REVISE-meta (0.86), reinforcing that for corpora where a single dominant error type is known, a specialized model can outperform the generalist. Insertion-only correction has the lowest win rate on both datasets, suggesting that spurious character additions are less damaging to document quality than segmentation or structural errors.

Limitations and future directions

REVISE has two significant current limitations. First, it operates on text-only documents — documents containing tables, charts, or embedded images are out of scope. When the document structure is primarily visual rather than textual, REVISE cannot help; a VLM-based approach is needed instead. Second, the contamination strategy relies on empirical error definitions and hand-tuned injection ratios. Without a comprehensive statistical analysis of OCR error distributions across domains and document types, the synthetic data may not precisely match real-world error profiles, which can limit generalization to new document categories.

The planned next steps address both: extending REVISE to handle documents with mixed text and visual elements (enabling correction of tables, charts, and image-embedded text), and conducting a large-scale statistical analysis of OCR error distributions across diverse domains to ground the contamination ratios in empirical measurements.

Practical lessons

Run OCR quality diagnostics before building. Measure error rates by document type in your corpus. You may find that 20% of documents account for 80% of errors.
Correct before chunking, not after. Chunking a structurally corrupted document produces corrupted chunks that no amount of retrieval tuning can fix.
Synthetic contamination is a viable alternative to annotation if you can characterize your OCR error distribution from a small sample of real documents.
Structural errors are the silent killer. Character errors are visible and easy to notice; structural errors are invisible until your retrieval pipeline mysteriously underperforms.
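As a starting point for the first lesson, here is a minimal sketch of a per-document-type diagnostic. It assumes you have a small sample of documents with ground-truth transcriptions, and it uses character error rate (Levenshtein distance divided by reference length) as the metric. The function names and the sample data are hypothetical.

```python
from collections import defaultdict


def edit_distance(a, b):
    """Levenshtein distance between two strings (row-by-row dynamic programming)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(
                prev[j] + 1,               # delete ca
                cur[j - 1] + 1,            # insert cb
                prev[j - 1] + (ca != cb),  # substitute, or match at no cost
            ))
        prev = cur
    return prev[-1]


def cer_by_type(samples):
    """samples: iterable of (doc_type, reference_text, ocr_text).

    Returns the character error rate per document type.
    """
    edits, chars = defaultdict(int), defaultdict(int)
    for doc_type, ref, ocr in samples:
        edits[doc_type] += edit_distance(ref, ocr)
        chars[doc_type] += len(ref)
    return {t: edits[t] / chars[t] for t in edits if chars[t]}


# Made-up sample using error examples from this post.
samples = [
    ("receipt", "blue", "b1ue"),
    ("report", "greenhouse", "green house"),
]
print(cer_by_type(samples))
```

Character error rate will not expose reading-order problems on its own, so it is worth pairing a check like this with a manual look at a few multi-column pages.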