Multimodal RAG in Long-Context DocVQA

Standard RAG assumes the relevant context fits in a single retrieved chunk. For questions that span multiple pages, multiple documents, or require reasoning across figures and tables in different locations, single-chunk retrieval is fundamentally insufficient. You need a system that can retrieve the right pages from a large corpus, aggregate evidence across them, and generate a coherent answer from multimodal inputs.

This post covers two papers from my winter 2025 seminar that tackle this problem from different angles: M3DocRAG addresses the multi-document multi-page retrieval pipeline end-to-end, and MoLoRAG introduces a graph-based approach to logical page ordering for multi-hop retrieval.

1. M3DocRAG: Multi-Modal, Multi-Page, Multi-Document RAG (ICCV Workshop 2025)

1.1 The Problem Space

Consider a question like: "How does the investment strategy described in Company A's 2022 annual report compare to the risk factors disclosed in Company B's 2023 10-K?" This requires:

Identifying relevant pages from two different documents
Reading tables, charts, and prose across multiple pages
Synthesizing information in a single coherent answer

Text-only RAG collapses visual structure. Single-document systems can't cross document boundaries. Context-window stuffing fails at scale. M3DocRAG is designed to handle all three constraints simultaneously.

1.2 M3DocVQA: The Benchmark

The paper introduces M3DocVQA, a large-scale benchmark for this task:

Statistic	Value
Multi-hop questions	2,441
Total PDFs in corpus	3,368
Total pages	41,005
Documents per question (avg)	2–3

Questions require aggregating evidence from pages scattered across multiple documents — no single-document system can answer them correctly. All pages are treated as images throughout the pipeline.

1.3 Pipeline Architecture

M3DocRAG's pipeline has three stages:

1. Page Retrieval
   Query → ColPali → Top-K page images (from 41,005 pages across 3,368 PDFs)
   ↓
2. Context Selection
   VLM scores each retrieved page for relevance
   Select top-N pages as multimodal context
   ↓
3. Answer Generation
   Qwen2-VL 7B reads [query + N page images] → generates answer

The key design decision: treat every page as an image throughout, with no OCR at any stage. ColPali's visual embeddings retrieve candidate pages; Qwen2-VL reads the page images directly for both scoring and answer generation.
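
To make the retrieval stage concrete, here is a minimal sketch of ColBERT-style late-interaction (MaxSim) scoring, the multi-vector scheme ColPali uses to compare a query against page images. The vectors are toy values and the names (`maxsim_score`, `retrieve_top_k`, the page ids) are my own placeholders, not the paper's code or the ColPali library API.

```python
from typing import Dict, List, Sequence

Vector = Sequence[float]


def dot(a: Vector, b: Vector) -> float:
    """Plain dot product between two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def maxsim_score(query_vecs: List[Vector], page_vecs: List[Vector]) -> float:
    """Late interaction: each query token keeps its best-matching page patch,
    and the per-token maxima are summed into one page score."""
    return sum(max(dot(q, p) for p in page_vecs) for q in query_vecs)


def retrieve_top_k(query_vecs: List[Vector],
                   page_index: Dict[str, List[Vector]],
                   k: int) -> List[str]:
    """Rank every indexed page by MaxSim score and keep the k best page ids."""
    ranked = sorted(page_index,
                    key=lambda pid: maxsim_score(query_vecs, page_index[pid]),
                    reverse=True)
    return ranked[:k]


# Toy index: three pages, two patch vectors each (hypothetical values).
query = [[1.0, 0.0], [0.0, 1.0]]
index = {
    "docA_p3": [[0.9, 0.1], [0.1, 0.8]],
    "docB_p7": [[0.2, 0.2], [0.3, 0.1]],
    "docA_p1": [[1.0, 0.0], [0.0, 0.2]],
}
top_pages = retrieve_top_k(query, index, k=2)
print(top_pages)
```

In a real system the exhaustive loop over 41,005 pages would be replaced by an approximate index; the scoring rule itself stays the same.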

1.4 Results

The primary comparison is against a text-only RAG baseline using the same corpus:

System	Retriever	Reader	EM	F1
Text RAG baseline	ColBERT	Llama 3.1 8B	17.8	23.7
M3DocRAG	ColPali	Qwen2-VL 7B (4 pages)	31.4	36.5

M3DocRAG improves EM by +13.6 and F1 by +12.8 over the text RAG baseline. Much of the gain plausibly comes from visual understanding: questions that require reading a chart, table, or diagram are hard to answer from OCR output alone, which tends to lose layout and graphical content, while Qwen2-VL reading the actual page image can recover it. Note that the two rows differ in both retriever and reader, so this table alone does not isolate how much each component contributes.
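
For readers less familiar with the two metrics, the sketch below computes exact match and token-level F1 for one prediction. It assumes SQuAD-style normalization (lowercasing, dropping punctuation and articles); the paper's exact normalization may differ, so read this as an illustration of the metric family rather than the benchmark's scorer.

```python
import re
import string
from collections import Counter


def normalize(text: str) -> str:
    """Lowercase, drop punctuation and articles, collapse whitespace
    (an assumed SQuAD-style normalization)."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction: str, reference: str) -> float:
    """1.0 if the normalized strings are identical, else 0.0."""
    return float(normalize(prediction) == normalize(reference))


def token_f1(prediction: str, reference: str) -> float:
    """Harmonic mean of token precision and recall after normalization."""
    pred = normalize(prediction).split()
    ref = normalize(reference).split()
    if not pred or not ref:
        return float(pred == ref)
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


em = exact_match("The Eiffel Tower", "eiffel tower")
f1 = token_f1("in Paris France", "Paris")
print(em, f1)
```

F1 gives partial credit when the prediction contains the answer plus extra tokens, which is why the F1 column sits above the EM column for both systems.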

1.5 Tradeoffs

The pipeline has two main limitations. First, Qwen2-VL reading 4 page images per query is significantly slower than a text LLM reading 4 text chunks, because each page image expands into hundreds of visual tokens. Second, answer quality is capped by retrieval recall: the reader only sees the top-K pages returned by ColPali, so if the retriever misses a key page, no amount of sophistication in the reader recovers it.

2. MoLoRAG: Multi-modal Logic-aware Retrieval (EMNLP 2025)

2.1 The Multi-Hop Retrieval Problem

In many long-document questions, relevant information is not co-located — it spans multiple pages that are connected by logical references: "see Figure 3" on page 12 pointing to a figure on page 8; a table summary on page 20 that expands data first presented on page 5. Standard embedding-based retrieval treats pages independently. A query about the relationship between two pieces of information may score neither page highly on its own, even though both are necessary together.

MoLoRAG addresses this by building a page graph over the document corpus and using it to propagate relevance through logical connections.

2.2 Page Graph Construction

MoLoRAG represents a document as a graph G = (V, E) where:

V = one node per page
E = undirected edges between pairs of pages whose embedding similarity exceeds a threshold τ

For pages pᵢ, pⱼ:
  edge(pᵢ, pⱼ) = 1  if cosine_sim(embed(pᵢ), embed(pⱼ)) > τ
               = 0  otherwise

Edges capture semantic proximity — pages that talk about the same topic or entity tend to have high embedding similarity, even if they appear far apart in the document. The graph is pre-computed at index time.
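
A minimal sketch of the thresholding rule above, assuming one embedding vector per page and a hand-picked τ (both are simplifications of mine; the embedding model and threshold value are not specified here). The function name `build_page_graph` is hypothetical.

```python
import math
from typing import Dict, List, Sequence, Set


def cosine_sim(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def build_page_graph(page_embeddings: List[Sequence[float]],
                     tau: float) -> Dict[int, Set[int]]:
    """Undirected adjacency sets: pages i and j are linked when their
    cosine similarity is strictly greater than tau."""
    n = len(page_embeddings)
    graph: Dict[int, Set[int]] = {i: set() for i in range(n)}
    for i in range(n):
        for j in range(i + 1, n):
            if cosine_sim(page_embeddings[i], page_embeddings[j]) > tau:
                graph[i].add(j)
                graph[j].add(i)
    return graph


# Four toy pages: pages 0 and 1 share a topic, as do pages 2 and 3.
embeddings = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
graph = build_page_graph(embeddings, tau=0.8)
print(graph)
```

The pairwise loop is quadratic in the number of pages, which is acceptable per document at index time but would need a nearest-neighbor index for a large corpus.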

2.3 Scoring: Semantic + Logical Relevance

Given a query, MoLoRAG scores each page with a composite score combining two signals:

score(page) = α · semantic_score(query, page)
            + β · logical_relevance_score(query, page)

Semantic score is the standard embedding similarity between the query and the page.

Logical relevance score is computed by a VLM: the model reads the page image and the query and produces a relevance judgment. This is a binary or graded score based on the model's understanding of whether the page is logically connected to the question — even if not directly answering it.
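
The weighted sum is simple enough to show directly. In this sketch both scores are assumed to be already normalized to [0, 1], and α = β = 0.5 is an arbitrary choice of mine, not a value reported by the authors.

```python
from typing import Dict, List, Tuple


def composite_score(semantic: float, logical: float,
                    alpha: float = 0.5, beta: float = 0.5) -> float:
    """score = alpha * semantic + beta * logical (weights are assumptions)."""
    return alpha * semantic + beta * logical


def rank_pages(page_scores: Dict[int, Tuple[float, float]],
               alpha: float = 0.5, beta: float = 0.5) -> List[int]:
    """Sort page numbers by composite score, highest first."""
    return sorted(
        page_scores,
        key=lambda p: composite_score(*page_scores[p], alpha, beta),
        reverse=True,
    )


# page number -> (semantic score, VLM logical score), toy values
page_scores = {12: (0.8, 0.4), 8: (0.3, 1.0), 20: (0.5, 0.2)}

ranking = rank_pages(page_scores)
semantic_only = rank_pages(page_scores, alpha=1.0, beta=0.0)
print(ranking, semantic_only)
```

In the toy data, page 8 ranks last on embedding similarity alone but first once the logical score is mixed in, which is the behavior the composite score is meant to produce.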

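After initial scoring, MoLoRAG performs graph traversal: pages adjacent to high-scoring pages in the graph are given a boost. The intuition is that if page 12 is highly relevant, a graph neighbor such as page 8 (for example, the page holding a figure that page 12 refers to) is more likely to be relevant than an unconnected page at the same embedding distance from the query. Note that the edges come from embedding similarity, so the graph captures a cross-reference like this only to the extent that the two pages also embed close together.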
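
One way to picture the boost is a single-hop update in which each page receives a fraction of its best neighbor's score. This update rule and the value of `gamma` are my own illustrative assumptions; the actual traversal procedure may differ.

```python
from typing import Dict, List, Set


def boost_with_neighbors(scores: Dict[int, float],
                         graph: Dict[int, Set[int]],
                         gamma: float = 0.5) -> Dict[int, float]:
    """Add gamma times the best neighbor's initial score to each page.
    Pages without scored neighbors keep their initial score."""
    boosted = {}
    for page, score in scores.items():
        neighbor_scores = [scores[n] for n in graph.get(page, ()) if n in scores]
        boosted[page] = score + gamma * max(neighbor_scores, default=0.0)
    return boosted


def rank(scores: Dict[int, float]) -> List[int]:
    """Page numbers sorted by score, highest first."""
    return sorted(scores, key=scores.get, reverse=True)


# Page 12 scores highly; page 8 is its graph neighbor; page 30 is isolated.
initial = {12: 0.9, 8: 0.3, 30: 0.35}
page_graph = {12: {8}, 8: {12}, 30: set()}

before = rank(initial)
after = rank(boost_with_neighbors(initial, page_graph))
print(before, after)
```

Before the boost, the isolated page 30 outranks page 8; afterwards page 8 moves ahead because it is connected to the top-scoring page.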

2.4 Results on MMLongBench

MoLoRAG is evaluated on MMLongBench, a long-document multimodal benchmark. The key metrics are Recall@1 and NDCG@1: is the most relevant page ranked first?

System	Recall@1	NDCG@1
Embedding-only baseline	—	—
MoLoRAG+	51.32%	66.86%

MoLoRAG+ (the full system with both semantic and logical scores + graph traversal) achieves Recall 51.32% and NDCG 66.86% at Top-1 on MMLongBench, outperforming systems that treat pages independently.
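
For reference, here is a small sketch of the two metrics under binary relevance (a page either is or is not an evidence page). The benchmark's exact definitions may differ; this is the textbook form.

```python
import math
from typing import List, Set


def recall_at_k(ranked: List[int], relevant: Set[int], k: int) -> float:
    """Fraction of all relevant pages that appear in the top k."""
    if not relevant:
        return 0.0
    hits = len(set(ranked[:k]) & relevant)
    return hits / len(relevant)


def ndcg_at_k(ranked: List[int], relevant: Set[int], k: int) -> float:
    """Binary-relevance NDCG: discounted gain of the top k, divided by
    the gain of an ideal ranking that puts relevant pages first."""
    dcg = sum(1.0 / math.log2(rank + 2)
              for rank, page in enumerate(ranked[:k]) if page in relevant)
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(rank + 2) for rank in range(ideal_hits))
    return dcg / idcg if idcg > 0 else 0.0


# A question with two evidence pages (8 and 5); the system ranks page 8 first.
ranked_pages = [8, 12, 5]
evidence = {8, 5}

r1 = recall_at_k(ranked_pages, evidence, k=1)
n1 = ndcg_at_k(ranked_pages, evidence, k=1)
print(r1, n1)
```

Under these definitions, a question with several evidence pages can reach at most 1/|evidence| recall at k = 1 while still getting full NDCG@1 when the top page is relevant. That may be part of why the NDCG figure sits above the Recall figure in the table.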

2.5 When Graph Structure Matters

The graph boost is most valuable for questions where:

The answer page contains no direct query keywords but is referenced from a page that does
Evidence is split across a section header page and a data page: the two are highly similar to each other, but only one of them scores well against the query directly
Multi-hop questions require first finding an intermediate fact, then finding a second page that uses that fact

The cost is the VLM logical relevance scoring step — running a forward pass per candidate page is expensive at scale. MoLoRAG applies this only to a shortlist of candidates identified by the semantic score, making the overall system tractable.
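
The shortlisting idea can be sketched as a two-stage rerank: cheap semantic scores pick the candidates, and the expensive VLM call runs only on those. The `vlm_score` callable below is a stand-in for a real model call, and the shortlist size and weights are assumptions of mine.

```python
from typing import Callable, Dict, List


def shortlist_then_rescore(semantic_scores: Dict[int, float],
                           vlm_score: Callable[[int], float],
                           shortlist_size: int,
                           alpha: float = 0.5,
                           beta: float = 0.5) -> List[int]:
    """Keep the top pages by semantic score, then call the VLM scorer
    once per shortlisted page and rank by the weighted sum."""
    shortlist = sorted(semantic_scores, key=semantic_scores.get,
                       reverse=True)[:shortlist_size]
    combined = {
        page: alpha * semantic_scores[page] + beta * vlm_score(page)
        for page in shortlist
    }
    return sorted(combined, key=combined.get, reverse=True)


# Stub scorer that records how many "VLM forward passes" were made.
calls: List[int] = []
fake_logical = {1: 0.1, 2: 0.9, 3: 1.0, 4: 1.0}


def fake_vlm_score(page: int) -> float:
    calls.append(page)
    return fake_logical[page]


semantic = {1: 0.9, 2: 0.8, 3: 0.2, 4: 0.1}
reranked = shortlist_then_rescore(semantic, fake_vlm_score, shortlist_size=2)
print(reranked, len(calls))
```

The stub shows the tradeoff: pages 3 and 4 would have scored highest on the logical signal, but they never reach the VLM because the semantic stage filtered them out. Shortlist size therefore trades cost against the chance of dropping a page that only the logical score would have surfaced.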

3. Synthesis: Where Do These Two Systems Fit?

M3DocRAG and MoLoRAG both extend RAG to multi-page multi-document settings, but they target different bottlenecks:

Aspect	M3DocRAG	MoLoRAG
Primary bottleneck addressed	Multi-document aggregation + visual understanding	Logical page connectivity for multi-hop retrieval
Corpus scope	Multi-document (3,368 PDFs)	Long single/multi-document
Retriever	ColPali (vision-based)	Embedding + graph traversal
Reader	Qwen2-VL 7B (multimodal)	VLM for logical scoring + generation

The two systems are architecturally complementary: M3DocRAG's retriever (ColPali) could be enhanced with MoLoRAG's graph-based re-ranking, and MoLoRAG's VLM scoring step could be extended to work across multiple documents. A combined system would address both the visual understanding and the logical connectivity problems simultaneously.

4. Takeaways

Multi-document multi-page RAG is a distinct problem from single-document QA. M3DocVQA's scale (41,005 pages, 3,368 PDFs) is closer to real enterprise corpora than single-document benchmarks, which underrepresent this setting.
Visual retrieval outperforms text retrieval on visually rich document corpora. ColPali + Qwen2-VL achieves +13.6 EM over ColBERT + Llama on M3DocVQA, plausibly because tables and figures rarely survive OCR with their structure intact.
Page graphs capture structure that embeddings miss. MoLoRAG's logical connectivity through graph traversal addresses the multi-hop retrieval problem that pure embedding similarity cannot resolve.
VLM-in-the-loop retrieval is feasible but expensive. Using a VLM for page relevance scoring is powerful but requires careful shortlisting to remain tractable at corpus scale.