Standard RAG assumes the relevant context fits in a single retrieved chunk. For questions that span multiple pages, multiple documents, or require reasoning across figures and tables in different locations, single-chunk retrieval is fundamentally insufficient. You need a system that can retrieve the right pages from a large corpus, aggregate evidence across them, and generate a coherent answer from multimodal inputs.
This post covers two papers from my winter 2025 seminar that tackle this problem from different angles: M3DocRAG addresses the multi-document multi-page retrieval pipeline end-to-end, and MoLoRAG introduces a graph-based approach to logical page ordering for multi-hop retrieval.
1. M3DocRAG: Multi-Modal, Multi-Page, Multi-Document RAG (ICCV Workshop 2025)
1.1 The Problem Space
Consider a question like: "How does the investment strategy described in Company A's 2022 annual report compare to the risk factors disclosed in Company B's 2023 10-K?" This requires:
- Identifying relevant pages from two different documents
- Reading tables, charts, and prose across multiple pages
- Synthesizing information in a single coherent answer
Text-only RAG collapses visual structure. Single-document systems can't cross document boundaries. Context-window stuffing fails at scale. M3DocRAG is designed to address all three limitations at once.
1.2 M3DocVQA: The Benchmark
The paper introduces M3DocVQA, a large-scale benchmark for this task:
| Statistic | Value |
|---|---|
| Multi-hop questions | 2,441 |
| Total PDFs in corpus | 3,368 |
| Total pages | 41,005 |
| Documents per question (avg) | 2–3 |
Questions require aggregating evidence from pages scattered across multiple documents — no single-document system can answer them correctly. All pages are treated as images throughout the pipeline.
1.3 Pipeline Architecture
M3DocRAG's pipeline has three stages:
1. Page Retrieval
Query → ColPali → Top-K page images (from 41,005 pages across 3,368 PDFs)
↓
2. Context Selection
VLM scores each retrieved page for relevance
Select top-N pages as multimodal context
↓
3. Answer Generation
Qwen2-VL 7B reads [query + N page images] → generates answer
The key design decision: treat every page as an image throughout, with no OCR at any stage. ColPali's visual embeddings retrieve candidate pages; Qwen2-VL reads the page images directly for both scoring and answer generation.
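To make the three stages concrete, here is a minimal sketch in plain Python. It assumes ColPali-style late-interaction scoring (each query vector is matched to its best page-patch vector and the maxima are summed), and it stubs out both VLM calls as plain callables. All function names (`maxsim_score`, `retrieve_top_k`, `select_context`, `answer`) are hypothetical and not taken from the paper's code.

```python
from typing import Callable, Dict, List, Sequence

Vector = Sequence[float]
MultiVector = Sequence[Vector]  # one vector per query token or page patch


def maxsim_score(query_vecs: MultiVector, page_vecs: MultiVector) -> float:
    """Late interaction: sum, over query vectors, of the best dot product
    against any vector of the page."""
    total = 0.0
    for q in query_vecs:
        total += max(sum(a * b for a, b in zip(q, p)) for p in page_vecs)
    return total


def retrieve_top_k(query_vecs: MultiVector,
                   page_index: Dict[str, MultiVector],
                   k: int) -> List[str]:
    """Stage 1: rank every indexed page image and keep the k best ids."""
    scored = [(pid, maxsim_score(query_vecs, vecs))
              for pid, vecs in page_index.items()]
    scored.sort(key=lambda item: item[1], reverse=True)
    return [pid for pid, _ in scored[:k]]


def select_context(candidates: List[str],
                   relevance_fn: Callable[[str], float],
                   n: int) -> List[str]:
    """Stage 2: rescore candidates (a VLM in the real system, a stub here)
    and keep the n most relevant pages."""
    return sorted(candidates, key=relevance_fn, reverse=True)[:n]


def answer(query: str,
           query_vecs: MultiVector,
           page_index: Dict[str, MultiVector],
           relevance_fn: Callable[[str], float],
           generate_fn: Callable[[str, List[str]], str],
           k: int,
           n: int) -> str:
    """Stage 3: pass the query plus the selected pages to the reader."""
    candidates = retrieve_top_k(query_vecs, page_index, k)
    context = select_context(candidates, relevance_fn, n)
    return generate_fn(query, context)
```

In a real deployment `page_index` would hold precomputed patch embeddings for every page image, and `relevance_fn` and `generate_fn` would wrap calls to the multimodal reader.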
1.4 Results
The primary comparison is against a text-only RAG baseline using the same corpus:
| System | Retriever | Reader | EM | F1 |
|---|---|---|---|---|
| Text RAG baseline | ColBERT | Llama 3.1 8B | 17.8 | 23.7 |
| M3DocRAG | ColPali | Qwen2-VL 7B (4 pages) | 31.4 | 36.5 |
M3DocRAG improves EM by 13.6 points and F1 by 12.8 points over the text RAG baseline. The gain is attributed primarily to visual understanding: questions that require reading a chart, table, or diagram in the source document are hard to answer for a text system that only sees OCR output, whereas Qwen2-VL reading the actual page image can recover this information. Note that the two rows differ in both retriever and reader, so the table alone does not isolate how much each component contributes.
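For readers unfamiliar with the two metrics in the table, the sketch below shows how EM and F1 are commonly computed for short-answer QA. It assumes SQuAD-style token-overlap F1 with a deliberately simple normalization (lowercasing and whitespace splitting); the paper's exact normalization may differ, and the function names are my own.

```python
from collections import Counter
from typing import List


def normalize(text: str) -> List[str]:
    """Assumed normalization: lowercase and split on whitespace."""
    return text.lower().split()


def exact_match(prediction: str, reference: str) -> float:
    """1.0 if the normalized token sequences are identical, else 0.0."""
    return 1.0 if normalize(prediction) == normalize(reference) else 0.0


def token_f1(prediction: str, reference: str) -> float:
    """Harmonic mean of token-level precision and recall."""
    pred, ref = normalize(prediction), normalize(reference)
    if not pred or not ref:
        return 1.0 if pred == ref else 0.0
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)
```

F1 gives partial credit where EM gives none: a prediction of "the red car" against the reference "red car" scores 0 on EM but 0.8 on F1, which is why F1 is the higher number in both rows of the table.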
1.5 Tradeoffs
The pipeline has two costs. First, Qwen2-VL reading 4 page images per query is significantly slower than a text LLM reading 4 text chunks — each image input generates hundreds of visual tokens. Second, the system works best when the correct pages are within the top-K retrieved by ColPali. If the retriever misses a key page, no amount of sophistication in the reader recovers it.
2. MoLoRAG: Modality-Logical Retrieval Augmented Generation (EMNLP 2025)
2.1 The Multi-Hop Retrieval Problem
In many long-document questions, relevant information is not co-located — it spans multiple pages that are connected by logical references: "see Figure 3" on page 12 pointing to a figure on page 8; a table summary on page 20 that expands data first presented on page 5. Standard embedding-based retrieval treats pages independently. A query about the relationship between two pieces of information may score neither page highly on its own, even though both are necessary together.
MoLoRAG addresses this by building a page graph over the document corpus and using it to propagate relevance through logical connections.
2.2 Page Graph Construction
MoLoRAG represents a document as a graph G = (V, E) where:
- V = one node per page
- E = undirected edges between pages whose embedding similarity exceeds a threshold τ
For pages pᵢ, pⱼ:
edge(pᵢ, pⱼ) = 1 if cosine_sim(embed(pᵢ), embed(pⱼ)) > τ, and 0 otherwise
Edges capture semantic proximity — pages that talk about the same topic or entity tend to have high embedding similarity, even if they appear far apart in the document. The graph is pre-computed at index time.
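The edge rule above translates almost directly into code. The following is a minimal sketch, assuming each page is already represented by a single dense embedding vector; `build_page_graph` and the adjacency-set representation are illustrative choices of mine, not the paper's implementation.

```python
import math
from typing import Dict, List, Sequence, Set


def cosine_sim(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def build_page_graph(page_embeddings: List[Sequence[float]],
                     tau: float) -> Dict[int, Set[int]]:
    """Undirected graph: pages i and j are linked when their cosine
    similarity is strictly greater than tau."""
    n = len(page_embeddings)
    graph: Dict[int, Set[int]] = {i: set() for i in range(n)}
    for i in range(n):
        for j in range(i + 1, n):
            if cosine_sim(page_embeddings[i], page_embeddings[j]) > tau:
                graph[i].add(j)
                graph[j].add(i)
    return graph


# Toy usage: pages 0 and 1 point in nearly the same direction, page 2 does not.
toy_graph = build_page_graph([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]], tau=0.8)
```

The pairwise loop is quadratic in the number of pages, which is acceptable when done once per document at index time but would need an approximate nearest-neighbor search for very large corpora.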
2.3 Scoring: Semantic + Logical Relevance
Given a query, MoLoRAG scores each page with a composite score combining two signals:
score(page) = α · semantic_score(query, page)
+ β · logical_relevance_score(query, page)
Semantic score is the standard embedding similarity between the query and the page.
Logical relevance score is computed by a VLM: the model reads the page image and the query and produces a relevance judgment. This is a binary or graded score based on the model's understanding of whether the page is logically connected to the question — even if not directly answering it.
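As a small illustration of the composite score, the sketch below combines the two signals and ranks pages. It assumes both scores have already been scaled to a comparable range such as [0, 1]; the weights α and β are passed in explicitly because the post does not state their values, and `rank_pages` is a hypothetical helper.

```python
from typing import Dict, List


def composite_score(semantic: float, logical: float,
                    alpha: float, beta: float) -> float:
    """score = alpha * semantic_score + beta * logical_relevance_score."""
    return alpha * semantic + beta * logical


def rank_pages(semantic_scores: Dict[str, float],
               logical_scores: Dict[str, float],
               alpha: float, beta: float) -> List[str]:
    """Return page ids sorted by composite score, best first."""
    combined = {
        page: composite_score(semantic_scores[page], logical_scores[page],
                              alpha, beta)
        for page in semantic_scores
    }
    return sorted(combined, key=combined.get, reverse=True)


# Toy usage: p8 has low embedding similarity but a high VLM judgment.
semantic = {"p5": 0.9, "p8": 0.3, "p12": 0.6}
logical = {"p5": 0.1, "p8": 1.0, "p12": 0.8}
ranking = rank_pages(semantic, logical, alpha=0.5, beta=0.5)
```

With equal weights, page p8 moves ahead of p5 despite its much lower semantic score, which is the behavior the logical signal is meant to produce.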
After initial scoring, MoLoRAG performs graph traversal: pages adjacent to high-scoring pages in the graph are given a boost. The intuition is that if page 12 is highly relevant, page 8 (a graph neighbor of page 12) is more likely to be relevant than an unconnected page with the same embedding distance from the query.
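One simple way to realize this neighbor boost is sketched below. The specific rule (a neighbor's score becomes its own score plus a fraction `gamma` of the strongest high-scoring neighbor's score) and the `threshold` and `gamma` parameters are assumptions made for illustration; the post does not specify the paper's exact propagation rule.

```python
from typing import Dict, Set


def boost_neighbors(scores: Dict[int, float],
                    graph: Dict[int, Set[int]],
                    threshold: float,
                    gamma: float) -> Dict[int, float]:
    """One-hop propagation: every page whose score is at least `threshold`
    lifts each graph neighbor to (neighbor score + gamma * its own score),
    keeping the largest boosted value when several pages contribute."""
    boosted = dict(scores)
    for page, score in scores.items():
        if score < threshold:
            continue
        for neighbor in graph.get(page, set()):
            candidate = scores.get(neighbor, 0.0) + gamma * score
            boosted[neighbor] = max(boosted.get(neighbor, 0.0), candidate)
    return boosted


# Toy usage: pages 8 and 30 start with the same score, but only 8 is
# connected to the highly relevant page 12.
scores = {12: 0.9, 8: 0.3, 30: 0.3}
graph = {12: {8}, 8: {12}, 30: set()}
boosted = boost_neighbors(scores, graph, threshold=0.8, gamma=0.5)
```

After the boost, page 8 outranks page 30 even though both had identical initial scores, matching the intuition described above.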
2.4 Results on MMLongBench
MoLoRAG is evaluated on MMLongBench, a long-document multimodal benchmark. The key metrics are Recall@1 and NDCG@1: can the most relevant page be ranked first?
| System | Recall@1 | NDCG@1 |
|---|---|---|
| Embedding-only baseline | — | — |
| MoLoRAG+ | 51.32% | 66.86% |
MoLoRAG+ (the full system, combining semantic and logical scores with graph traversal) reaches 51.32% Recall@1 and 66.86% NDCG@1 on MMLongBench, outperforming systems that treat pages independently.
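For reference, here is a sketch of how the two retrieval metrics can be computed for a single question. It assumes binary relevance (a page is either a gold evidence page or not) and the standard log2 discount for NDCG; the benchmark's exact evaluation script may differ in details.

```python
import math
from typing import Iterable, List


def recall_at_k(ranked: List[int], gold: Iterable[int], k: int) -> float:
    """Fraction of gold pages that appear in the top-k ranking."""
    gold_set = set(gold)
    if not gold_set:
        return 0.0
    hits = len(set(ranked[:k]) & gold_set)
    return hits / len(gold_set)


def ndcg_at_k(ranked: List[int], gold: Iterable[int], k: int) -> float:
    """Binary-relevance NDCG: DCG of the ranking divided by the DCG of an
    ideal ranking that places gold pages first."""
    gold_set = set(gold)
    dcg = sum(1.0 / math.log2(i + 2)
              for i, page in enumerate(ranked[:k]) if page in gold_set)
    ideal_hits = min(k, len(gold_set))
    idcg = sum(1.0 / math.log2(i + 2) for i in range(ideal_hits))
    return dcg / idcg if idcg > 0 else 0.0
```

Under these definitions, a question with two gold pages whose top-ranked page is correct scores 1.0 on NDCG@1 but only 0.5 on Recall@1. Recall@1 can therefore never exceed NDCG@1, which is consistent with the ordering of the two numbers in the table.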
2.5 When Graph Structure Matters
The graph boost is most valuable for questions where:
- The answer page contains no direct query keywords but is referenced from a page that does
- Evidence is split across a section header page and a data page — high semantic similarity to each other, but one is more directly query-relevant
- Multi-hop questions require first finding an intermediate fact, then finding a second page that uses that fact
The cost is the VLM logical relevance scoring step — running a forward pass per candidate page is expensive at scale. MoLoRAG applies this only to a shortlist of candidates identified by the semantic score, making the overall system tractable.
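The shortlisting idea can be sketched as a two-stage ranker: the cheap semantic score picks a small candidate set, and the expensive VLM call runs only on those candidates. The names `shortlist_then_rescore` and `vlm_score_fn` and the `shortlist_size` parameter are hypothetical, and the VLM is represented by an arbitrary callable.

```python
from typing import Callable, Dict, List


def shortlist_then_rescore(semantic_scores: Dict[str, float],
                           vlm_score_fn: Callable[[str], float],
                           shortlist_size: int,
                           alpha: float,
                           beta: float) -> List[str]:
    """Rank pages by composite score, calling the VLM scorer only on the
    `shortlist_size` pages with the highest semantic score."""
    shortlist = sorted(semantic_scores, key=semantic_scores.get,
                       reverse=True)[:shortlist_size]
    final = {}
    for page in shortlist:
        # One expensive forward pass per shortlisted page, none for the rest.
        final[page] = alpha * semantic_scores[page] + beta * vlm_score_fn(page)
    return sorted(final, key=final.get, reverse=True)
```

The number of VLM forward passes is bounded by `shortlist_size` regardless of corpus size. The tradeoff is that a page excluded by the semantic stage can never be recovered by the logical score alone.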
3. Synthesis: Where Do These Two Systems Fit?
M3DocRAG and MoLoRAG both extend RAG to multi-page multi-document settings, but they target different bottlenecks:
| Aspect | M3DocRAG | MoLoRAG |
|---|---|---|
| Primary bottleneck addressed | Multi-document aggregation + visual understanding | Logical page connectivity for multi-hop retrieval |
| Corpus scope | Multi-document (3,368 PDFs) | Long single/multi-document |
| Retriever | ColPali (vision-based) | Embedding + graph traversal |
| Reader | Qwen2-VL 7B (multimodal) | VLM for logical scoring + generation |
The two systems are architecturally complementary: M3DocRAG's retriever (ColPali) could be enhanced with MoLoRAG's graph-based re-ranking, and MoLoRAG's VLM scoring step could be extended to work across multiple documents. A combined system would address both the visual understanding and the logical connectivity problems simultaneously.
4. Takeaways
- Multi-document multi-page RAG is a distinct problem from single-document QA. M3DocVQA's scale (41,005 pages, 3,368 PDFs) is closer to real enterprise corpora than single-document benchmarks, which underrepresent this setting.
- Visual retrieval outperforms text retrieval for document-rich corpora. ColPali + Qwen2-VL achieves +13.6 EM over ColBERT + Llama on M3DocVQA, primarily because tables and figures cannot survive OCR with their structure intact.
- Page graphs capture structure that embeddings miss. MoLoRAG's logical connectivity through graph traversal addresses the multi-hop retrieval problem that pure embedding similarity cannot resolve.
- VLM-in-the-loop retrieval is feasible but expensive. Using a VLM for page relevance scoring is powerful but requires careful shortlisting to remain tractable at corpus scale.