Understanding Hierarchical Document Parsing for RAG Systems

Most RAG pipelines treat a document as a bag of fixed-size text chunks. Chunk the PDF into 512-token windows, embed them, and retrieve the top-k. Simple — and brittle. The moment you face a 50-page technical manual, a legal contract with cross-referencing clauses, or an annual report mixing tables and prose, flat chunking falls apart. Evidence that spans sections gets split at an arbitrary boundary; structural cues like headings, captions, and footnotes disappear.

This post walks through why hierarchy matters for document parsing and retrieval, drawing on lessons from building HiKEY and the Document-level LVLM-Parser.

What flat chunking loses

A document is not a sequence of tokens — it is a tree. A section has subsections; a subsection has paragraphs; a paragraph may refer to a figure that lives in a different section. When you split at the token level, you destroy that structure. Three concrete problems follow:

Evidence fragmentation. An answer may require combining a table header (Section 2), a data row (Section 4), and a footnote disclaimer (Section 4, footer). No single chunk contains all three, so retrieval fails.
Context loss. "It" in chunk #37 refers to a concept introduced in chunk #12. The LLM sees #37 in isolation and hallucinates the referent.
Retrieval noise. Boilerplate — repeated headers, page numbers, legal disclaimers — scores highly against certain queries because it is dense with common terms. It pollutes the top-k.
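
To make evidence fragmentation concrete, here is a minimal sketch of fixed-size chunking on a toy document. Whitespace-separated words stand in for tokens, and the window is 8 tokens rather than 512 so the effect is visible; the text and the helper names (chunk_fixed, chunks_containing) are invented for illustration.

```python
def chunk_fixed(tokens, size):
    """Split a token list into consecutive windows of `size` tokens."""
    return [tokens[i:i + size] for i in range(0, len(tokens), size)]


# Toy document: the unit ("millions of USD") and the value ("412")
# live in different parts of the text.
doc = (
    "Table 1 reports revenue in millions of USD . "
    "Filler prose about methodology goes here . "
    "Row 2023 : revenue 412 ."
)
tokens = doc.split()
chunks = chunk_fixed(tokens, size=8)


def chunks_containing(word):
    """Return the indices of all chunks that contain `word`."""
    return [i for i, chunk in enumerate(chunks) if word in chunk]


unit_chunks = chunks_containing("millions")
value_chunks = chunks_containing("412")

# True when no single chunk carries both the unit and the value.
fragmented = not set(unit_chunks) & set(value_chunks)
print(unit_chunks, value_chunks, fragmented)
```

The unit and the value land in different chunks, so a retriever that returns only the chunk matching "revenue 412" hands the LLM a number with no unit attached.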

The hierarchical alternative

Instead of treating parsing as "split into chunks," treat it as "recover the document's logical tree." A hierarchy-aware parser produces something like:

Document
├── Section 1: Introduction
│   ├── Paragraph 1.1
│   └── Paragraph 1.2
├── Section 2: Methodology
│   ├── Subsection 2.1: Data
│   │   ├── Table 2.1 (caption + cells)
│   │   └── Paragraph 2.1.1
│   └── Subsection 2.2: Model
└── ...

Each node carries its content and its ancestry. Retrieval can then operate at multiple granularities: coarse routing to the right section, fine retrieval to the right paragraph.
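
One way to represent such a node is sketched below. This is an illustrative data structure, not the schema of any particular parser: the class name Node, its fields, and the add and ancestry methods are all assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import List, Optional


# eq=False keeps identity-based comparison, which avoids recursive
# comparisons through the parent/children links.
@dataclass(eq=False)
class Node:
    kind: str                      # e.g. "document", "section", "table"
    title: str
    text: str = ""
    parent: Optional["Node"] = field(default=None, repr=False)
    children: List["Node"] = field(default_factory=list, repr=False)

    def add(self, child: "Node") -> "Node":
        """Attach `child` under this node and return the child."""
        child.parent = self
        self.children.append(child)
        return child

    def ancestry(self) -> List[str]:
        """Titles from the root down to this node."""
        path, node = [], self
        while node is not None:
            path.append(node.title)
            node = node.parent
        return path[::-1]


doc = Node("document", "Document")
sec2 = doc.add(Node("section", "Section 2: Methodology"))
sub21 = sec2.add(Node("subsection", "Subsection 2.1: Data"))
table = sub21.add(Node("table", "Table 2.1", text="caption + cells"))

print(" > ".join(table.ancestry()))
```

Storing the parent link on every node is what makes ancestry cheap to recover at retrieval time: any matched leaf can walk back up to the root without a separate lookup.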

HiKEY: coarse-to-fine retrieval over a document graph

HiKEY builds this structure offline as a heterogeneous graph. Nodes are document elements (sections, paragraphs, tables, figures). Edges encode parent-child hierarchy and cross-references. At query time, retrieval is two-stage:

Coarse routing. A lightweight encoder scores each section-level node. This is cheap and narrows the search space from the whole document to a handful of sections.
Fine retrieval. Within the shortlisted sections, a stronger encoder rescores paragraph- and table-level nodes. Candidate assembly then uses ancestry paths — not just the matched node, but its ancestors and siblings — to pass coherent context to the LLM.
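
The two stages can be sketched in a few lines. This is a toy illustration of the coarse-to-fine idea, not HiKEY's implementation: a word-overlap count stands in for both the lightweight and the stronger encoder, and the SECTIONS index, its contents, and the function names are hypothetical.

```python
def overlap(query, text):
    """Stand-in scorer: count distinct query words that appear in the text."""
    return len(set(query.lower().split()) & set(text.lower().split()))


# Hypothetical toy index: section-level summaries plus leaf nodes.
SECTIONS = {
    "2 Methodology": {
        "summary": "data collection and model training setup",
        "leaves": {
            "2.1 Data": "the training data contains 10k annotated pages",
            "2.2 Model": "the model is a two stage retriever",
        },
    },
    "3 Results": {
        "summary": "accuracy results and ablation",
        "leaves": {
            "3.1 Main": "accuracy improves on multi page questions",
        },
    },
}


def retrieve(query, top_sections=1, top_leaves=1):
    # Stage 1: coarse routing over section-level nodes only.
    ranked = sorted(
        SECTIONS,
        key=lambda sec: overlap(query, SECTIONS[sec]["summary"]),
        reverse=True,
    )
    shortlist = ranked[:top_sections]

    # Stage 2: rescore leaf nodes, but only inside shortlisted sections.
    candidates = [
        (overlap(query, text), sec, leaf, text)
        for sec in shortlist
        for leaf, text in SECTIONS[sec]["leaves"].items()
    ]
    candidates.sort(key=lambda c: c[0], reverse=True)

    # Candidate assembly: return each hit with its ancestry path and siblings.
    results = []
    for _, sec, leaf, text in candidates[:top_leaves]:
        siblings = [name for name in SECTIONS[sec]["leaves"] if name != leaf]
        results.append({"path": f"{sec} > {leaf}", "text": text, "siblings": siblings})
    return results


hits = retrieve("how much training data")
print(hits[0]["path"], hits[0]["siblings"])
```

In a real system the two scorers would differ in cost and quality, which is the point of the split: the expensive model only ever sees the leaves of the few sections that survive routing.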

The result on multi-page ODQA benchmarks: up to +4.5% over text-based RAG and +6.8% over full-page RAG in EM/ANLS. The gain comes almost entirely from queries that require multi-section evidence synthesis.

What makes parsing hard in practice

Recovering the logical tree is harder than it looks. Real PDFs are rendered as a stream of glyphs with x/y coordinates. There is no semantic tag saying "this is a section heading." You have to infer hierarchy from font size, indentation, whitespace, reading order, and visual layout — simultaneously.
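
To see why a single cue is not enough, consider the simplest possible heuristic: infer heading levels from font size alone. The sketch below assumes the input is already a list of (text, font_size) lines in reading order, that the most frequent size is body text, and that larger sizes are headings; the function name and the sample lines are made up for the example.

```python
from collections import Counter


def assign_levels(lines):
    """Map each (text, font_size) line to a heading level, or None for body text.

    Assumption: the most frequent font size is body text, and every larger
    size is a heading, with the largest size treated as level 1.
    """
    body_size = Counter(size for _, size in lines).most_common(1)[0][0]
    heading_sizes = sorted({s for _, s in lines if s > body_size}, reverse=True)
    level_of = {size: i + 1 for i, size in enumerate(heading_sizes)}
    return [(text, level_of.get(size)) for text, size in lines]


lines = [
    ("Methodology", 18.0),
    ("Data", 14.0),
    ("We collected pages from public reports.", 10.0),
    ("Each page was annotated twice.", 10.0),
    ("Model", 14.0),
    ("The model has two stages.", 10.0),
]
levels = assign_levels(lines)
for text, level in levels:
    print(level, text)
```

This works on a clean single-column page and fails quickly elsewhere: a large-font pull quote or a bold table header is misread as a heading, and the heuristic has nothing to say about reading order at all.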

Multi-column layouts, spanning table cells, embedded figures, and scanned pages each require a different heuristic. No single parser gets all of them right.

In the Document-level LVLM-Parser project, we sidestep the heuristic problem by using a vision-language model (Qwen3-VL-4B) to jointly understand page images and text. A 50M-parameter cross-attention decoder regresses bounding boxes from pooled LLM hidden states at special layout-query tokens. The layout grounding module is decoupled from content generation, which lets us train them with different objectives and freeze the LLM once it reaches acceptable NLU quality.

Takeaways

Flat chunking is a reasonable default for simple single-page documents. It breaks on long, multi-section documents where evidence spans sections or depends on tables, captions, and footnotes.
Hierarchy recovery is a first-class parsing problem, not a post-processing step.
Retrieval and parsing are coupled — the granularity your parser produces should match the granularity your retriever is designed to score.
Ancestry-aware context assembly (giving the LLM the matched chunk plus its structural context) is cheap and meaningfully improves answer quality.