Multimodal RAG in Long-Context DocVQA

Standard RAG assumes the relevant context fits in a single retrieved chunk. For questions that span multiple pages, multiple documents, or require reasoning across figures and tables in different locations, single-chunk retrieval is fundamentally insufficient. You need a system that can retrieve the right pages from a large corpus, aggregate evidence across them, and generate a coherent answer from multimodal inputs.

This post covers two papers from my winter 2025 seminar that tackle this problem from different angles: M3DocRAG addresses the multi-document multi-page retrieval pipeline end-to-end, and MoLoRAG introduces a graph-based approach to logical page ordering for multi-hop retrieval.

1. M3DocRAG: Multi-Modal, Multi-Page, Multi-Document RAG (ICCV Workshop 2025)

1.1 The Problem Space

Consider a question like: "How does the investment strategy described in Company A's 2022 annual report compare to the risk factors disclosed in Company B's 2023 10-K?" Answering it requires reading visually structured pages, crossing document boundaries, and searching a corpus far too large to fit in one context window.

Each requirement breaks a common approach: text-only RAG collapses visual structure, single-document systems can't cross document boundaries, and context-window stuffing fails at scale. M3DocRAG is designed to handle all three constraints simultaneously.

1.2 M3DocVQA: The Benchmark

The paper introduces M3DocVQA, a large-scale benchmark for this task:

| Statistic | Value |
| --- | --- |
| Multi-hop questions | 2,441 |
| Total PDFs in corpus | 3,368 |
| Total pages | 41,005 |
| Documents per question (avg) | 2–3 |

Questions require aggregating evidence from pages scattered across multiple documents — no single-document system can answer them correctly. All pages are treated as images throughout the pipeline.

1.3 Pipeline Architecture

M3DocRAG's pipeline has three stages:

1. Page Retrieval
   Query → ColPali → Top-K page images (from 41,005 pages across 3,368 PDFs)
   ↓
2. Context Selection
   VLM scores each retrieved page for relevance
   Select top-N pages as multimodal context
   ↓
3. Answer Generation
   Qwen2-VL 7B reads [query + N page images] → generates answer

The key design decision: treat every page as an image throughout, with no OCR at any stage. ColPali's visual embeddings retrieve candidate pages; Qwen2-VL reads the page images directly for both scoring and answer generation.
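The three stages above can be sketched as a single function. This is a minimal illustration, not the paper's implementation: `run_pipeline`, the three callables, and the toy scorers are hypothetical stand-ins for ColPali retrieval, VLM relevance scoring, and Qwen2-VL generation, and `top_k=20` is an assumed default (`top_n=4` mirrors the 4-page setting reported in the results).

```python
from typing import Callable, List, Sequence, Tuple

Page = str  # hypothetical stand-in for a page image, e.g. a file path


def run_pipeline(
    query: str,
    pages: Sequence[Page],
    retrieve_score: Callable[[str, Page], float],   # stand-in for the page retriever
    relevance_score: Callable[[str, Page], float],  # stand-in for VLM page scoring
    generate: Callable[[str, List[Page]], str],     # stand-in for the VLM reader
    top_k: int = 20,  # assumed value, not taken from the post
    top_n: int = 4,
) -> Tuple[str, List[Page]]:
    # Stage 1: rank every page in the corpus and keep the top-K candidates.
    candidates = sorted(pages, key=lambda p: retrieve_score(query, p), reverse=True)[:top_k]
    # Stage 2: re-score only the candidates and keep the top-N as context.
    context = sorted(candidates, key=lambda p: relevance_score(query, p), reverse=True)[:top_n]
    # Stage 3: generate an answer from the query plus the selected pages.
    return generate(query, context), context


# Toy scorers so the sketch runs end to end (purely illustrative).
def toy_retrieve(query: str, page: Page) -> float:
    return 1.0 if page[0] in query.split() else 0.0


def toy_relevance(query: str, page: Page) -> float:
    return 1.0 if page.endswith("p1") else 0.5


def toy_generate(query: str, context: List[Page]) -> str:
    return f"answer from {len(context)} pages"


corpus = ["a_p1", "a_p2", "b_p1", "b_p2", "c_p1"]
answer, context = run_pipeline(
    "compare a and b", corpus, toy_retrieve, toy_relevance, toy_generate, top_k=4, top_n=2
)
print(context, answer)
```

Keeping the stages as injected callables makes the retrieval ceiling visible: stage 2 and stage 3 only ever see what stage 1 returned.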

1.4 Results

The primary comparison is against a text-only RAG baseline using the same corpus:

| System | Retriever | Reader | EM | F1 |
| --- | --- | --- | --- | --- |
| Text RAG baseline | ColBERT | Llama 3.1 8B | 17.8 | 23.7 |
| M3DocRAG | ColPali | Qwen2-VL 7B (4 pages) | 31.4 | 36.5 |

M3DocRAG improves EM by 13.6 points and F1 by 12.8 points over the text RAG baseline. The gain is attributed to visual understanding: questions that require reading a chart, table, or diagram in the source document cannot be answered by a text system that only has OCR output. Qwen2-VL reading the actual page image recovers this information.
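For readers unfamiliar with the two metrics, here is a minimal sketch of exact match and token-level F1. It assumes the common SQuAD-style answer normalization (lowercasing, stripping punctuation and English articles); the post does not state which normalization the benchmark uses, so treat the details as an assumption.

```python
import re
import string
from collections import Counter

PUNCTUATION = set(string.punctuation)


def normalize(text: str) -> str:
    # Assumed SQuAD-style normalization: lowercase, drop punctuation and articles.
    text = text.lower()
    text = "".join(ch for ch in text if ch not in PUNCTUATION)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction: str, gold: str) -> float:
    return float(normalize(prediction) == normalize(gold))


def token_f1(prediction: str, gold: str) -> float:
    pred_tokens = normalize(prediction).split()
    gold_tokens = normalize(gold).split()
    if not pred_tokens or not gold_tokens:
        # Both empty counts as a match; one empty counts as a miss.
        return float(pred_tokens == gold_tokens)
    # Multiset intersection counts each shared token at most as often as it appears in both.
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)


em_example = exact_match("The Eiffel Tower.", "eiffel tower")
f1_example = token_f1("in Paris France", "Paris")
print(em_example, f1_example)
```

EM gives no partial credit, while F1 rewards an answer that contains the gold tokens alongside extra words, which is why F1 is the higher number in both rows of the table.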

1.5 Tradeoffs

The pipeline has two main tradeoffs. First, Qwen2-VL reading 4 page images per query is significantly slower than a text LLM reading 4 text chunks, because each page image expands into hundreds of visual tokens. Second, answer quality is capped by retrieval: the reader can only use pages that ColPali places in the top-K. If the retriever misses a key page, no amount of sophistication in the reader recovers it.

2. MoLoRAG: Modality-Logical Retrieval Augmented Generation (EMNLP 2025)

2.1 The Multi-Hop Retrieval Problem

In many long-document questions, relevant information is not co-located — it spans multiple pages that are connected by logical references: "see Figure 3" on page 12 pointing to a figure on page 8; a table summary on page 20 that expands data first presented on page 5. Standard embedding-based retrieval treats pages independently. A query about the relationship between two pieces of information may score neither page highly on its own, even though both are necessary together.

MoLoRAG addresses this by building a page graph over the document corpus and using it to propagate relevance through logical connections.

2.2 Page Graph Construction

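MoLoRAG represents a document as a graph G = (V, E), where each node in V is a page and E contains an edge between two pages whose embeddings are sufficiently similar, with τ a similarity threshold: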

For pages pᵢ, pⱼ:
  edge(pᵢ, pⱼ) = 1  if cosine_sim(embed(pᵢ), embed(pⱼ)) > τ
               = 0  otherwise

Edges capture semantic proximity — pages that talk about the same topic or entity tend to have high embedding similarity, even if they appear far apart in the document. The graph is pre-computed at index time.
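The thresholding rule above can be sketched in a few lines of plain Python. The function names, the toy 2-dimensional embeddings, and the value `tau=0.8` are illustrative assumptions; a real index would use the page encoder's embeddings and a vectorized similarity computation.

```python
import math
from typing import Dict, Sequence, Set


def cosine_sim(u: Sequence[float], v: Sequence[float]) -> float:
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # treat zero vectors as dissimilar to everything
    return dot / (norm_u * norm_v)


def build_page_graph(embeddings: Sequence[Sequence[float]], tau: float) -> Dict[int, Set[int]]:
    # Undirected adjacency sets keyed by page index.
    graph: Dict[int, Set[int]] = {i: set() for i in range(len(embeddings))}
    for i in range(len(embeddings)):
        for j in range(i + 1, len(embeddings)):
            # Add an edge only when similarity strictly exceeds the threshold.
            if cosine_sim(embeddings[i], embeddings[j]) > tau:
                graph[i].add(j)
                graph[j].add(i)
    return graph


# Toy embeddings: pages 0 and 1 point in nearly the same direction, page 2 does not.
toy_embeddings = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
page_graph = build_page_graph(toy_embeddings, tau=0.8)
print(page_graph)
```

The pairwise loop is quadratic in the number of pages, which is acceptable here because the graph is built once at index time rather than per query.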

2.3 Scoring: Semantic + Logical Relevance

Given a query, MoLoRAG scores each page with a composite score combining two signals:

score(page) = α · semantic_score(query, page)
            + β · logical_relevance_score(query, page)

Semantic score is the standard embedding similarity between the query and the page.

Logical relevance score is computed by a VLM: the model reads the page image and the query and produces a relevance judgment. This is a binary or graded score based on the model's understanding of whether the page is logically connected to the question — even if not directly answering it.

After initial scoring, MoLoRAG performs graph traversal: pages adjacent to high-scoring pages in the graph are given a boost. The intuition is that if page 12 is highly relevant, page 8 (which page 12 references) is more likely to be relevant than a random page with the same embedding distance from the query.
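A minimal sketch of the composite score followed by a neighbor boost is shown below. The post does not give the exact boost rule, so the formula here (add `gamma` times the best neighbor's score) is an assumption chosen for illustration, as are the weights and the toy scores. Page numbers 8, 12, and 30 echo the example in the paragraph above.

```python
from typing import Dict, Hashable, Set


def composite_scores(
    semantic: Dict[Hashable, float],
    logical: Dict[Hashable, float],
    alpha: float = 0.5,
    beta: float = 0.5,
) -> Dict[Hashable, float]:
    # score(page) = alpha * semantic_score + beta * logical_relevance_score
    return {page: alpha * semantic[page] + beta * logical[page] for page in semantic}


def boost_neighbors(
    scores: Dict[Hashable, float],
    graph: Dict[Hashable, Set[Hashable]],
    gamma: float = 0.2,  # hypothetical boost weight, not from the post
) -> Dict[Hashable, float]:
    boosted = {}
    for page, score in scores.items():
        # Assumed rule: a page inherits a fraction of its best-scoring neighbor's score.
        best_neighbor = max((scores[n] for n in graph.get(page, ())), default=0.0)
        boosted[page] = score + gamma * best_neighbor
    return boosted


# Page 12 is clearly relevant; page 8 is linked to it; page 30 is unrelated.
semantic = {8: 0.2, 12: 0.9, 30: 0.3}
logical = {8: 0.2, 12: 0.9, 30: 0.3}
graph = {8: {12}, 12: {8}, 30: set()}

initial = composite_scores(semantic, logical)
boosted = boost_neighbors(initial, graph)
ranking_before = sorted(initial, key=initial.get, reverse=True)
ranking_after = sorted(boosted, key=boosted.get, reverse=True)
print(ranking_before, ranking_after)
```

In this toy case page 8 starts below page 30 and moves above it after the boost, which is the behavior the intuition describes.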

2.4 Results on MMLongBench

MoLoRAG is evaluated on MMLongBench, a long-document multimodal benchmark. The key metrics are Recall@1 and NDCG@1: can the most relevant page be ranked first?

| System | Recall@1 | NDCG@1 |
| --- | --- | --- |
| Embedding-only baseline | | |
| MoLoRAG+ | 51.32% | 66.86% |

MoLoRAG+ (the full system with both semantic and logical scores + graph traversal) achieves Recall 51.32% and NDCG 66.86% at Top-1 on MMLongBench, outperforming systems that treat pages independently.
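To make the two metrics concrete, here is a minimal sketch using the standard definitions with binary relevance (a page is either a gold evidence page or not). The benchmark's evaluation script may differ in details, so this is an assumption rather than a reproduction.

```python
import math
from typing import Hashable, Iterable, Sequence


def recall_at_k(ranked: Sequence[Hashable], relevant: Iterable[Hashable], k: int) -> float:
    gold = set(relevant)
    if not gold:
        return 0.0
    # Fraction of gold pages that appear in the top-k.
    return len(set(ranked[:k]) & gold) / len(gold)


def ndcg_at_k(ranked: Sequence[Hashable], relevant: Iterable[Hashable], k: int) -> float:
    gold = set(relevant)
    # DCG with binary gains: a hit at rank r (1-based) contributes 1 / log2(r + 1).
    dcg = sum(1.0 / math.log2(i + 2) for i, page in enumerate(ranked[:k]) if page in gold)
    # Ideal DCG places all gold pages at the top of the ranking.
    ideal = sum(1.0 / math.log2(i + 2) for i in range(min(k, len(gold))))
    return dcg / ideal if ideal > 0 else 0.0


# A question with two gold pages, where the top-ranked page is one of them.
ranked_pages = [12, 8, 30]
gold_pages = {12, 8}
r1 = recall_at_k(ranked_pages, gold_pages, k=1)
n1 = ndcg_at_k(ranked_pages, gold_pages, k=1)
print(r1, n1)
```

Under these definitions a question with several gold pages can score a perfect NDCG@1 while Recall@1 stays below 1, which may be one reason the NDCG figure in the table is higher than the recall figure.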

2.5 When Graph Structure Matters

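The graph boost is most valuable for questions where the evidence spans pages that reference or elaborate on each other, as in Section 2.1, and where no single page scores highly against the query on its own.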

The cost is the VLM logical relevance scoring step — running a forward pass per candidate page is expensive at scale. MoLoRAG applies this only to a shortlist of candidates identified by the semantic score, making the overall system tractable.
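The shortlist idea can be sketched as a two-stage scorer: a cheap semantic score over every page, then the expensive VLM score over only the best few. The function name, the shortlist size, and the toy scorers are hypothetical; the call counter exists only to show how many expensive calls are made.

```python
from typing import Callable, Dict, Hashable, Sequence


def score_with_shortlist(
    query: str,
    pages: Sequence[Hashable],
    semantic_score: Callable[[str, Hashable], float],  # cheap, run on every page
    vlm_score: Callable[[str, Hashable], float],       # expensive, run on the shortlist only
    shortlist_size: int,
    alpha: float = 0.5,
    beta: float = 0.5,
) -> Dict[Hashable, float]:
    semantic = {page: semantic_score(query, page) for page in pages}
    shortlist = sorted(pages, key=semantic.get, reverse=True)[:shortlist_size]
    # Only shortlisted pages pay for a VLM forward pass.
    return {page: alpha * semantic[page] + beta * vlm_score(query, page) for page in shortlist}


vlm_calls = 0


def toy_semantic(query: str, page: int) -> float:
    return page / 10.0  # illustrative: higher page index means more similar


def toy_vlm(query: str, page: int) -> float:
    global vlm_calls
    vlm_calls += 1  # count expensive calls
    return 1.0


shortlisted = score_with_shortlist("q", list(range(10)), toy_semantic, toy_vlm, shortlist_size=3)
print(sorted(shortlisted), vlm_calls)
```

The cost of the expensive step now scales with the shortlist size rather than the corpus size, at the price of never rescuing a page that the semantic score ranked too low.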

3. Synthesis: Where Do These Two Systems Fit?

M3DocRAG and MoLoRAG both extend RAG to multi-page multi-document settings, but they target different bottlenecks:

| Aspect | M3DocRAG | MoLoRAG |
| --- | --- | --- |
| Primary bottleneck addressed | Multi-document aggregation + visual understanding | Logical page connectivity for multi-hop retrieval |
| Corpus scope | Multi-document (3,368 PDFs) | Long single/multi-document |
| Retriever | ColPali (vision-based) | Embedding + graph traversal |
| Reader | Qwen2-VL 7B (multimodal) | VLM for logical scoring + generation |

The two systems are architecturally complementary: M3DocRAG's retriever (ColPali) could be enhanced with MoLoRAG's graph-based re-ranking, and MoLoRAG's VLM scoring step could be extended to work across multiple documents. A combined system would address both the visual understanding and the logical connectivity problems simultaneously.

4. Takeaways