Most RAG pipelines treat a document as a bag of fixed-size text chunks. Chunk the PDF into 512-token windows, embed them, and retrieve the top-k. Simple — and brittle. The moment you face a 50-page technical manual, a legal contract with cross-referencing clauses, or an annual report mixing tables and prose, flat chunking falls apart. Evidence that spans sections gets split at an arbitrary boundary; structural cues like headings, captions, and footnotes disappear.
This post walks through why hierarchy matters for document parsing and retrieval, drawing on lessons from building HiKEY and the Document-level LVLM-Parser.
What flat chunking loses
A document is not a sequence of tokens — it is a tree. A section has subsections; a subsection has paragraphs; a paragraph may refer to a figure that lives in a different section. When you split at the token level, you destroy that structure. Three concrete problems follow:
- Evidence fragmentation. An answer may require combining a table header (Section 2) with a data row (Section 4) with a footnote disclaimer (Section 4, footer). No single chunk contains this; retrieval fails.
- Context loss. "It" in chunk #37 refers to a concept introduced in chunk #12. The LLM sees #37 in isolation and hallucinates the referent.
- Retrieval noise. Boilerplate — repeated headers, page numbers, legal disclaimers — scores highly against certain queries because it is dense with common terms. It pollutes the top-k.
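Evidence fragmentation is easy to reproduce on a toy input. The sketch below is illustrative only: it assumes whitespace-separated words stand in for model tokens, uses an unrealistically small 8-token window, and the sample text and helper names are made up for this post.

```python
def chunk_fixed(tokens, size):
    """Split a token list into consecutive fixed-size windows."""
    return [tokens[i:i + size] for i in range(0, len(tokens), size)]


def chunks_containing(chunks, word):
    """Return the indices of chunks whose tokens include the given word."""
    return [i for i, chunk in enumerate(chunks) if word in chunk]


# Toy document (hypothetical): a table header, some filler prose,
# then the data row and its footnote.
doc = (
    "Table 2 columns: region, revenue in USD millions. "
    "Filler prose about methodology goes here for a while. "
    "EMEA 412. Footnote: figures are unaudited."
)

tokens = doc.split()  # assumption: whitespace tokens approximate model tokens
chunks = chunk_fixed(tokens, size=8)

header_chunks = chunks_containing(chunks, "revenue")  # where the header landed
row_chunks = chunks_containing(chunks, "EMEA")        # where the data row landed

print(header_chunks, row_chunks)
```

The header and the row it describes end up in different chunks, so a retriever that returns only the best-matching chunk cannot hand the LLM both.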
The hierarchical alternative
Instead of treating parsing as "split into chunks," treat it as "recover the document's logical tree." A hierarchy-aware parser produces something like:
Document
├── Section 1: Introduction
│   ├── Paragraph 1.1
│   └── Paragraph 1.2
├── Section 2: Methodology
│   ├── Subsection 2.1: Data
│   │   ├── Table 2.1 (caption + cells)
│   │   └── Paragraph 2.1.1
│   └── Subsection 2.2: Model
└── ...
Each node carries its content and its ancestry. Retrieval can then operate at multiple granularities: coarse routing to the right section, fine retrieval to the right paragraph.
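One possible in-memory representation is sketched below. The `Node` class, its field names, and the `ancestry` helper are assumptions made for illustration; they are not taken from HiKEY or the LVLM-Parser.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import List, Optional


@dataclass(eq=False)  # identity comparison avoids recursing through parent links
class Node:
    kind: str                      # e.g. "document", "section", "paragraph", "table"
    title: str
    text: str = ""
    parent: Optional[Node] = field(default=None, repr=False)
    children: List[Node] = field(default_factory=list, repr=False)

    def add(self, child: Node) -> Node:
        """Attach a child node and return it, so calls can be chained."""
        child.parent = self
        self.children.append(child)
        return child

    def ancestry(self) -> List[str]:
        """Titles from the root down to this node."""
        path = []
        node: Optional[Node] = self
        while node is not None:
            path.append(node.title)
            node = node.parent
        return path[::-1]


doc = Node("document", "Document")
sec2 = doc.add(Node("section", "Section 2: Methodology"))
sub21 = sec2.add(Node("subsection", "Subsection 2.1: Data"))
table = sub21.add(Node("table", "Table 2.1", text="caption + cells"))
para = sub21.add(Node("paragraph", "Paragraph 2.1.1"))

print(" > ".join(table.ancestry()))
# Document > Section 2: Methodology > Subsection 2.1: Data > Table 2.1
```

Storing the parent link on every node makes the ancestry path a short walk to the root, which is what multi-granularity retrieval needs when it assembles context.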
HiKEY: coarse-to-fine retrieval over a document graph
HiKEY builds this structure offline as a heterogeneous graph. Nodes are document elements (sections, paragraphs, tables, figures). Edges encode parent-child hierarchy and cross-references. At query time, retrieval is two-stage:
- Coarse routing. A lightweight encoder scores each section-level node. This is cheap and narrows the search space from the whole document to a handful of sections.
- Fine retrieval. Within the shortlisted sections, a stronger encoder rescores paragraph- and table-level nodes. Candidate assembly then uses ancestry paths — not just the matched node, but its ancestors and siblings — to pass coherent context to the LLM.
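The two stages can be sketched in a few lines. This is a toy under stated assumptions, not HiKEY's implementation: a single word-overlap scorer stands in for both the lightweight and the stronger encoder, the document is a hypothetical two-level dictionary rather than a graph, and only the parent section and sibling paragraphs are attached as context.

```python
def overlap(query, text):
    """Toy relevance score: number of shared lowercase words."""
    return len(set(query.lower().split()) & set(text.lower().split()))


# Hypothetical document: section name -> summary and paragraphs.
sections = {
    "Methodology": {
        "summary": "data collection and model training",
        "paragraphs": [
            "we collect data from annual reports",
            "the model is trained on parsed pages",
        ],
    },
    "Results": {
        "summary": "accuracy and revenue tables",
        "paragraphs": [
            "revenue grew in the EMEA region",
            "accuracy improves with hierarchy",
        ],
    },
}


def retrieve(query, sections, top_sections=1, top_nodes=1):
    # Stage 1 (coarse): score only section-level nodes.
    ranked = sorted(
        sections,
        key=lambda name: overlap(query, name + " " + sections[name]["summary"]),
        reverse=True,
    )
    shortlist = ranked[:top_sections]

    # Stage 2 (fine): rescore paragraphs inside the shortlisted sections.
    candidates = [
        (overlap(query, paragraph), name, index)
        for name in shortlist
        for index, paragraph in enumerate(sections[name]["paragraphs"])
    ]
    candidates.sort(key=lambda candidate: candidate[0], reverse=True)

    # Candidate assembly: the match plus its parent section and siblings.
    results = []
    for _score, name, index in candidates[:top_nodes]:
        paragraphs = sections[name]["paragraphs"]
        results.append({
            "section": name,
            "match": paragraphs[index],
            "siblings": [p for j, p in enumerate(paragraphs) if j != index],
        })
    return results


hits = retrieve("revenue tables for EMEA", sections)
print(hits[0]["section"], "|", hits[0]["match"])
```

In a real system the coarse scorer would be chosen for speed and the fine scorer for accuracy. The point of the sketch is only the control flow: narrow first, rescore second, then widen the context along the hierarchy.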
The result on multi-page ODQA benchmarks: up to +4.5% over text-based RAG and +6.8% over full-page RAG in EM/ANLS. The gain comes almost entirely from queries that require multi-section evidence synthesis.
What makes parsing hard in practice
Recovering the logical tree is harder than it looks. Real PDFs are rendered as a stream of glyphs with x/y coordinates. There is no semantic tag saying "this is a section heading." You have to infer hierarchy from font size, indentation, whitespace, reading order, and visual layout — simultaneously.
Multi-column layouts, spanning table cells, embedded figures, and scanned pages each require a different heuristic. No single parser gets all of them right.
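To make the heuristic approach concrete, here is a minimal sketch that uses font size alone. It assumes the most frequent size is body text and that larger sizes are headings, with bigger meaning higher in the tree. The input format and function names are hypothetical, and the sketch ignores every other signal listed above, which is exactly why heuristics like this are fragile.

```python
from collections import Counter

# Hypothetical extracted lines: (text, font_size_pt). Real extractors also
# provide coordinates, font names, and reading order, all ignored here.
lines = [
    ("2 Methodology", 16.0),
    ("2.1 Data", 13.0),
    ("We collect reports from public filings.", 10.0),
    ("Each report is rendered to page images.", 10.0),
    ("2.2 Model", 13.0),
    ("The model reads page images and text.", 10.0),
]


def infer_levels(lines):
    """Label each line with a heading level (1 = top) or 0 for body text."""
    sizes = Counter(size for _, size in lines)
    body_size = sizes.most_common(1)[0][0]  # assumption: most frequent = body
    heading_sizes = sorted((s for s in sizes if s > body_size), reverse=True)
    level_of = {size: rank + 1 for rank, size in enumerate(heading_sizes)}
    return [(text, level_of.get(size, 0)) for text, size in lines]


def outline(lines):
    """Render the inferred headings as an indented outline, dropping body text."""
    return [
        "  " * (level - 1) + text
        for text, level in infer_levels(lines)
        if level > 0
    ]


print("\n".join(outline(lines)))
```

A document that sets headings in bold at body size, or that uses a large font for pull quotes, breaks this immediately.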
In the Document-level LVLM-Parser project, we sidestep the heuristic problem by using a vision-language model (Qwen3-VL-4B) to jointly understand page images and text. A 50M-parameter cross-attention decoder regresses bounding boxes from pooled LLM hidden states at special layout-query tokens. The layout grounding module is decoupled from content generation, which lets us train them with different objectives and freeze the LLM once it reaches acceptable NLU quality.
Takeaways
- Flat chunking is a reasonable default for simple single-page documents. It breaks down on long documents with cross-references, mixed tables and prose, or answers that span sections.
- Hierarchy recovery is a first-class parsing problem, not a post-processing step.
- Retrieval and parsing are coupled — the granularity your parser produces should match the granularity your retriever is designed to score.
- Ancestry-aware context assembly (giving the LLM the matched chunk plus its structural context) is cheap and meaningfully improves answer quality.