Retrieval in Vision Space

Classic document retrieval treats every page as a bag of words. Run OCR, chunk into paragraphs, embed the text, and retrieve. This pipeline is well understood, and it quietly throws away much of the information on a typical page. Column alignment, table borders, font weight, spatial proximity of numbers to their labels: none of it survives OCR into a text string.

A newer line of work asks a different question: what if the retrieval model read the page as an image, the way a human does? This post covers two papers from my summer 2025 seminar that make this concrete — one that proposes the vision-based retrieval architecture, and one that stress-tests it against OCR-based alternatives under degraded scan conditions.

1. ColPali: Efficient Document Retrieval with Vision Language Models (ICLR 2025)

1.1 The Standard Pipeline and Its Failure Mode

The standard RAG pipeline for document retrieval is:

PDF → OCR → text chunks → text embedder → dense index → nearest-neighbor search

On clean, purely textual documents this works fine. But real-world document collections — financial filings, academic posters, product manuals — are visually rich. A figure caption next to a bar chart, a table where column headers span two rows, a flowchart with directional arrows: OCR extracts the characters and loses everything else. The retrieved chunk says "Q3 Revenue: 4.2B" with no indication that the number appeared in a row labeled "Asia Pacific" inside a table titled "Regional Breakdown."
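The loss can be reproduced with a toy table. The sketch below is illustrative only: the cell layout, the "Europe" row, and the helper names `flatten_like_ocr` and `chunk` are made up for this example and do not model any particular OCR engine.

```python
# A tiny table as it appears on the page: each cell carries a row and column position.
cells = [
    {"row": 0, "col": 0, "text": "Region"},
    {"row": 0, "col": 1, "text": "Q3 Revenue"},
    {"row": 1, "col": 0, "text": "Asia Pacific"},
    {"row": 1, "col": 1, "text": "4.2B"},
    {"row": 2, "col": 0, "text": "Europe"},   # hypothetical extra row
    {"row": 2, "col": 1, "text": "3.1B"},     # hypothetical value
]


def flatten_like_ocr(cells):
    """Emit reading-order text only; row/column positions are dropped."""
    ordered = sorted(cells, key=lambda c: (c["row"], c["col"]))
    return " ".join(c["text"] for c in ordered)


def chunk(text, words_per_chunk):
    """Split text into fixed-size word windows, as a naive chunker would."""
    words = text.split()
    return [
        " ".join(words[i:i + words_per_chunk])
        for i in range(0, len(words), words_per_chunk)
    ]


flat = flatten_like_ocr(cells)
chunks = chunk(flat, words_per_chunk=5)
for c in chunks:
    print(repr(c))
```

With a five-word window, "4.2B" ends up in the second chunk next to "Europe", separated from the "Asia Pacific" label it belonged to on the page.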

The alternative — rendering each page as an image and using a vision-language model to understand it — was dismissed for years as too slow for indexing at scale. ColPali changes that calculus by satisfying three requirements simultaneously:

Prior VLM-based approaches failed R3 — re-encoding documents against each new query is prohibitively slow. ColPali achieves all three by separating the heavy VLM computation into offline indexing (once per document) and keeping query-time work to a lightweight MaxSim lookup.

1.2 Background: ColBERT and Late Interaction

Understanding ColPali requires understanding ColBERT, the retrieval architecture it builds on.

Classical dense retrieval is bi-encoder retrieval: a query encoder and a document encoder each produce a single vector; retrieval is nearest-neighbor search over the document index. This satisfies R2 and R3 but compresses all document semantics into a single point — fine-grained token-level signals are lost.
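As a rough sketch of single-vector scoring, the snippet below ranks three pages by cosine similarity to a query. The vectors are hand-written stand-ins for encoder outputs, and `cosine` and `top_k` are illustrative helper names, not part of any library.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def top_k(query_vec, doc_vecs, k=1):
    """Return ids of the k documents whose single vector is closest to the query."""
    ranked = sorted(
        doc_vecs,
        key=lambda doc_id: cosine(query_vec, doc_vecs[doc_id]),
        reverse=True,
    )
    return ranked[:k]


# Toy index: exactly one vector per document (stand-ins for document-encoder outputs).
doc_vecs = {
    "page_a": [0.9, 0.1, 0.0],
    "page_b": [0.1, 0.8, 0.3],
    "page_c": [0.0, 0.2, 0.9],
}
query_vec = [0.2, 0.9, 0.2]  # stand-in for a query-encoder output

best = top_k(query_vec, doc_vecs, k=2)
print(best)
```

Everything the retriever knows about a page has to fit in that one vector, which is the compression the paragraph above describes.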

Cross-encoders are at the other extreme: query and document are fed together through a full transformer, enabling rich token-level interaction (R1). But they require a full forward pass per query-document pair — scoring a 1M-document corpus at query time is not feasible (fails R2).

ColBERT (Contextualized Late Interaction over BERT) bridges both paradigms. Instead of one vector per document, ColBERT produces one vector per token — a multi-vector representation. Document token embeddings are computed offline and stored. At query time, only the query token embeddings are computed, and relevance is scored via MaxSim:

LI(q, d) = Σᵢ max_j <E_q^(i) | E_d^(j)>

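For each query token i, find its highest similarity across all document token embeddings j, then sum over query tokens. The bracket is a dot product, which equals cosine similarity when the embeddings are L2-normalized. The document side is fully precomputed, so querying reduces to a matrix multiplication followed by a max and a sum, which is how late interaction satisfies all three requirements at once.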
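The MaxSim formula can be written out directly for a single query-document pair. This is a minimal pure-Python sketch with toy 2-D unit vectors standing in for real token embeddings; `late_interaction` is an illustrative name.

```python
def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def late_interaction(query_embs, doc_embs):
    """LI(q, d): per query vector, take the max dot product over document vectors, then sum."""
    return sum(max(dot(q, d) for d in doc_embs) for q in query_embs)


# Toy unit vectors standing in for L2-normalized token embeddings.
query_embs = [[1.0, 0.0], [0.0, 1.0]]
doc_a = [[1.0, 0.0], [0.0, 1.0], [0.6, 0.8]]  # contains an exact match for each query token
doc_b = [[0.6, 0.8], [0.8, 0.6]]              # only partial matches

score_a = late_interaction(query_embs, doc_a)
score_b = late_interaction(query_embs, doc_b)
print(score_a, score_b)
```

Each query token contributes only its best match, so `doc_a` scores higher even though it also contains a vector that matches neither token perfectly.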

1.3 Architecture: ColBERT × PaliGemma

ColPali applies ColBERT's late interaction principle to document images by replacing BERT with PaliGemma-3B as the document encoder.

PaliGemma as a Prefix-LM. PaliGemma processes image tokens and text with full bidirectional attention over the combined prefix. Rather than using the model autoregressively for generation, ColPali extracts the final hidden states from Gemma-2B's last transformer layer — one contextualized embedding per input token (image patch + query text token). These are the "bag of patch embeddings" that ColBERT's late interaction operates on.

Projection layer. The final hidden states (dimensionality of the language model) are projected down to D=128 via a learned linear layer, producing lightweight embeddings for large-scale nearest-neighbor indexing — a "bag of embeddings" analogous to ColBERT's bag of token vectors.

The full processing pipeline for a document page:

Document Image
  → SiglipImageProcessor
  → Pixel Values
  → PaliGemma (SigLIP ViT + Multi-modal Projector + Gemma LM)
  → Hidden States  (last transformer layer, one per patch/token)
  → Embedding Projection  (→ D=128)
  → L2 Normalization
  → Multi-vector Embeddings  (stored in index)
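The last two stages of that pipeline (projection and L2 normalization) can be sketched without any model. The sizes here are toy values, the random matrices are stand-ins for the learned projection and the last-layer hidden states, and the projection is written without a bias term for brevity, which is an assumption of this sketch.

```python
import math
import random


def project(hidden, weight):
    """Bias-free linear map: hidden (H,) times weight (H x D) gives a (D,) vector."""
    dim_out = len(weight[0])
    return [sum(h * row[j] for h, row in zip(hidden, weight)) for j in range(dim_out)]


def l2_normalize(vec, eps=1e-12):
    """Scale a vector to unit length so dot products behave like cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / max(norm, eps) for x in vec]


random.seed(0)
H, D, NUM_TOKENS = 16, 4, 3  # toy sizes; the post's real output dimension is D=128

# Random stand-ins for the learned projection and for last-layer hidden states.
weight = [[random.gauss(0, 1) for _ in range(D)] for _ in range(H)]
hidden_states = [[random.gauss(0, 1) for _ in range(H)] for _ in range(NUM_TOKENS)]

# One unit-length D-dimensional vector per token: the multi-vector page representation.
page_embeddings = [l2_normalize(project(h, weight)) for h in hidden_states]
print(len(page_embeddings), len(page_embeddings[0]))
```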

Module tree:

ColPali
├── ColPaliProcessor
│   ├── SiglipImageProcessor
│   └── GemmaTokenizer
└── ColPaliForRetrieval
    └── PaliGemma VLM
        ├── SigLIP Vision Encoder
        ├── Multi-modal Projector
        └── Gemma Language Model

Combined, ColPali encodes a query into a sequence of token embeddings E_q and each document page into a sequence of 784 patch embeddings E_d (448×448 input → 28×28 patch grid), then scores them with MaxSim.

1.4 ViDoRe: Visual Document Retrieval Benchmark

The paper introduces ViDoRe (Visual Document Retrieval Benchmark), a benchmark of 10 retrieval datasets spanning diverse document types: scientific figures, medical scans, administrative documents, and more. The primary metric is NDCG@5 — normalized discounted cumulative gain at rank 5.
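For readers unfamiliar with the metric, here is a small sketch of NDCG@5 with linear gain. It makes one simplifying assumption: the ideal ordering is computed from the retrieved list itself, whereas benchmark implementations use the full set of relevant documents. The function names are illustrative.

```python
import math


def dcg_at_k(relevances, k):
    """Discounted cumulative gain over the top-k relevance grades, in ranked order."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))


def ndcg_at_k(relevances, k=5):
    """DCG@k divided by the DCG@k of the ideal (descending) ordering."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0


# Binary relevance of the pages a retriever returned, best-ranked first.
hit_at_rank_1 = ndcg_at_k([1, 0, 0, 0, 0])
hit_at_rank_3 = ndcg_at_k([0, 0, 1, 0, 0])
miss = ndcg_at_k([0, 0, 0, 0, 0])
print(hit_at_rank_1, hit_at_rank_3, miss)
```

A relevant page at rank 1 scores 1.0, the same page at rank 3 scores 0.5, and a miss scores 0.0, so the metric rewards placing the right page near the top.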

ColPali's results on ViDoRe:

| System | NDCG@5 | Indexing time | Retrieval granularity |
|---|---|---|---|
| Standard OCR pipeline | 0.66 | 7.22 s/page | Chunk-level (~2.1 chunks/page) |
| ColPali (PaliGemma-3B) | 0.81 | 0.39 s/page | Page-level (image) |

ColPali gains 15 NDCG@5 points over the OCR baseline (0.66 → 0.81) while indexing roughly 18× faster (7.22 s vs. 0.39 s per page). The speed advantage comes from eliminating the OCR step entirely: a single VLM forward pass over the page image replaces the full OCR + chunking + embedding pipeline. Both systems output IDs pointing to stored content; the 18× figure covers the indexing phase only.

The granularity difference matters downstream: OCR pipelines retrieve at chunk level (a page split into ~2.1 text segments on average), handing the downstream reader a focused text span. ColPali's returned unit is the whole page image — the downstream MLLM or human must scan the full page rather than a targeted excerpt.

1.5 Why Late Interaction Works for Pages

A page image at 448×448 pixels is divided into 28×28 = 784 patches. Each patch becomes one entry in E_d. The query "What was Q3 revenue in Asia Pacific?" might have 9 tokens. MaxSim finds, for each query token, the single most similar patch: the tokens for "Asia Pacific" should match the patches covering the relevant table row, not patches covering an unrelated figure on the same page.

The single-vector alternative must compress all 784 patches into one vector, so spatial layout is irrecoverably mixed together. Late interaction preserves the spatial decomposition all the way through scoring.
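A toy comparison makes the difference concrete. In this sketch, mean pooling stands in for single-vector compression (an assumption; real bi-encoders learn their pooling), and the 2-D patch vectors are invented for illustration.

```python
import math


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def mean_pool(vectors):
    """Collapse a set of patch vectors into a single L2-normalized vector."""
    dim = len(vectors[0])
    mean = [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]
    norm = math.sqrt(dot(mean, mean))
    return [x / norm for x in mean]


def maxsim(query_vecs, patch_vecs):
    """Late-interaction score: best patch per query vector, summed."""
    return sum(max(dot(q, p) for p in patch_vecs) for q in query_vecs)


query = [[1.0, 0.0]]  # a single query token
pages = {
    "a": [[1.0, 0.0], [0.0, 1.0], [0.0, 1.0], [0.0, 1.0]],  # one exact-match patch, three unrelated
    "b": [[0.6, 0.8]] * 4,                                  # every patch is vaguely related
}

pooled = {name: dot(query[0], mean_pool(p)) for name, p in pages.items()}
multi = {name: maxsim(query, p) for name, p in pages.items()}
print(pooled, multi)
```

Pooling ranks page "b" first because the one exact-match patch on page "a" is averaged away, while MaxSim ranks page "a" first because that patch is scored on its own.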

In implementation, the batch MaxSim is computed as:

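import torch

# Toy sizes: B queries, C pages, N query tokens, S patch tokens, D dims
B, C, N, S, D = 2, 3, 9, 784, 128
batch_queries = torch.randn(B, N, D)    # (B, N, D), random stand-ins for query embeddings
batch_passages = torch.randn(C, S, D)   # (C, S, D), random stand-ins for page embeddings

# einsum builds every token-patch dot product, shape (B, C, N, S)
scores = (
    torch.einsum("bnd,csd->bcns", batch_queries, batch_passages)
    .max(dim=3)[0]   # for each (query, page, query-token): max over patches
    .sum(dim=2)      # sum over query tokens → final (B, C) score matrix
)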

The max(dim=3) over the patch dimension implements MaxSim — each query token finds its best-matching patch. The sum(dim=2) aggregates across all query tokens to produce the final page score.

1.6 Limitations

ColPali's main cost is storage: 784 vectors per page instead of 1. For a 1M-page corpus, a 128-dim index stores 1M × 784 × 128 × 4 bytes ≈ 400 GB. Quantization (binary or int8) and Matryoshka dimension reduction are the standard mitigations.
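The storage arithmetic, and the effect of the quantization options, can be checked with a few lines. This sketch uses the figures quoted above and counts raw embedding bytes only; index overhead and Matryoshka-style dimension reduction are left out, and `index_size_gb` is an illustrative helper name.

```python
def index_size_gb(num_pages, vectors_per_page, dim, bits_per_value):
    """Raw embedding storage in decimal gigabytes, ignoring index overhead."""
    total_bits = num_pages * vectors_per_page * dim * bits_per_value
    return total_bits / 8 / 1e9


PAGES, VECTORS, DIM = 1_000_000, 784, 128  # figures quoted in the text

sizes = {
    "float32": index_size_gb(PAGES, VECTORS, DIM, 32),
    "int8": index_size_gb(PAGES, VECTORS, DIM, 8),
    "binary": index_size_gb(PAGES, VECTORS, DIM, 1),
}
for name, gb in sizes.items():
    print(f"{name:>8}: {gb:,.1f} GB")
```

Under these assumptions, float32 comes to about 401 GB, int8 to about 100 GB, and binary to about 12.5 GB for the same corpus.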

A subtler limitation is retrieval granularity. ColPali returns a page; it has no mechanism to pinpoint a sub-page region. For tasks where localization matters — "what number appears in cell (3,2) of this table?" — the downstream reader must process the whole page. OCR chunk-level retrieval delivers a focused text span instead.

Finally, ColPali's page-level MaxSim is inherently localized: it scores one page against one query independently. For multi-hop or cross-page questions ("what is the revenue trend across Q1–Q4 in the annual report?"), the retriever returns pages in isolation, and the downstream MLLM must synthesize evidence across them. This is a general RAG limitation, but it is amplified when each retrieved unit is a full page rather than a targeted excerpt — the model must read more to extract the same information.

2. Lost in OCR Translation? (ACM DocEng 2025)

2.1 Motivation: Do VLMs Actually Beat OCR Under Scan Degradation?

ColPali's benchmark results come from digitally-born PDFs. Real enterprise document archives also contain scanned physical documents, sometimes scanned at low DPI, with ink bleed, skew, or physical damage. Under these conditions, does a VLM-based retriever still dominate, or does a robust OCR pipeline with explicit degradation handling regain the lead?

"Lost in OCR Translation?" is a systematic evaluation paper that constructs a controlled degradation benchmark and measures both families of systems under each condition.

2.2 DocDeg: The Degradation Benchmark

The paper introduces DocDeg, a benchmark of document pages at four degradation levels:

The progression simulates the actual distribution of documents in large enterprise archives, where levels 0–1 represent recently-digitized material and levels 2–3 represent older physical records.

2.3 Systems Compared

| System | Type | Notes |
|---|---|---|
| ColQwen2 7B | VLM-based retriever | ColPali variant on Qwen2-VL 7B backbone |
| Nougat | OCR-based (specialized) | Scientific document OCR + text embedder |
| Llama 3.2 OCR | OCR-based (LLM) | LLM-powered OCR + text embedder |

2.4 Findings

The results are counterintuitive at level 0 but consistent across levels 1–3:

2.5 Interpretation

This finding has a clear reading: VLM-based retrieval is not a universal improvement over OCR-based retrieval. ColPali/ColQwen2's strength is on visually complex documents where layout carries meaning that OCR discards. When the bottleneck is scan quality rather than layout complexity, a capable LLM doing OCR — which has been trained on noisy text from the web and is inherently robust to character-level noise — is a strong baseline.

The practical implication for system design: document archive composition matters. If the corpus is primarily scanned physical documents at variable quality, LLM-powered OCR + text retrieval is the pragmatic choice. If the corpus is visually-rich digitally-born PDFs, VLM-based retrieval offers meaningful gains. Most real archives contain both.

3. Takeaways