Understanding Hierarchical Document Parsing for RAG Systems

Most RAG pipelines treat a document as a bag of fixed-size text chunks. Chunk the PDF into 512-token windows, embed them, and retrieve the top-k. Simple — and brittle. The moment you face a 50-page technical manual, a legal contract with cross-referencing clauses, or an annual report mixing tables and prose, flat chunking falls apart. Evidence that spans sections gets split at an arbitrary boundary; structural cues like headings, captions, and footnotes disappear.
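
To make the failure mode concrete, here is a minimal sketch of fixed-size chunking. Whitespace-split words stand in for real tokens, and the window is shrunk from 512 to 6 so the effect is visible on a toy string; both are assumptions for illustration.

```python
def chunk_fixed(tokens, window=512):
    """Split a token list into consecutive windows of at most `window` tokens."""
    return [tokens[i:i + window] for i in range(0, len(tokens), window)]

# Toy document: two subsections, each a heading followed by one sentence.
doc = "2.1 Data The corpus has 10k pages . 2.2 Model We fine-tune an encoder ."
tokens = doc.split()  # whitespace split is a stand-in for a real tokenizer

chunks = chunk_fixed(tokens, window=6)
for chunk in chunks:
    print(" ".join(chunk))
```

The second window mixes the tail of 2.1 with the "2.2 Model" heading, and the sentence that heading introduces is cut in half across two windows. Nothing in the chunker knows a section boundary was crossed.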

This post walks through why hierarchy matters for document parsing and retrieval, drawing on lessons from building HiKEY and the Document-level LVLM-Parser.

What flat chunking loses

A document is not a sequence of tokens — it is a tree. A section has subsections; a subsection has paragraphs; a paragraph may refer to a figure that lives in a different section. When you split at the token level, you destroy that structure. Three concrete problems follow:

The hierarchical alternative

Instead of treating parsing as "split into chunks," treat it as "recover the document's logical tree." A hierarchy-aware parser produces something like:

Document
├── Section 1: Introduction
│   ├── Paragraph 1.1
│   └── Paragraph 1.2
├── Section 2: Methodology
│   ├── Subsection 2.1: Data
│   │   ├── Table 2.1 (caption + cells)
│   │   └── Paragraph 2.1.1
│   └── Subsection 2.2: Model
└── ...

Each node carries its content and its ancestry. Retrieval can then operate at multiple granularities: coarse routing to the right section, fine retrieval to the right paragraph.
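
One way to represent "content plus ancestry" is a small tree node that keeps a pointer to its parent. The `Node` class, its field names, and the `ancestry` helper below are hypothetical, chosen for illustration rather than taken from any particular parser.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass(eq=False)  # identity comparison avoids recursing through parent links
class Node:
    kind: str    # e.g. "document", "section", "paragraph", "table"
    title: str
    text: str = ""
    parent: Optional["Node"] = field(default=None, repr=False)
    children: List["Node"] = field(default_factory=list, repr=False)

    def add(self, child: "Node") -> "Node":
        """Attach `child` under this node and return it for chaining."""
        child.parent = self
        self.children.append(child)
        return child

    def ancestry(self) -> List[str]:
        """Titles from the root down to this node."""
        path, node = [], self
        while node is not None:
            path.append(node.title)
            node = node.parent
        return path[::-1]

doc = Node("document", "Document")
sec2 = doc.add(Node("section", "Section 2: Methodology"))
sub21 = sec2.add(Node("section", "Subsection 2.1: Data"))
tab = sub21.add(Node("table", "Table 2.1", text="caption + cells"))

print(" > ".join(tab.ancestry()))
# Document > Section 2: Methodology > Subsection 2.1: Data > Table 2.1
```

Storing the ancestry path alongside each embedded node is what lets a retriever return a table together with the section it belongs to.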

HiKEY: coarse-to-fine retrieval over a document graph

HiKEY builds this structure offline, as a heterogeneous graph. Nodes are document elements (sections, paragraphs, tables, figures). Edges encode parent-child hierarchy and cross-references. At query time, retrieval is two-stage:

  1. Coarse routing. A lightweight encoder scores each section-level node. This is cheap and narrows the search space from the whole document to a handful of sections.
  2. Fine retrieval. Within the shortlisted sections, a stronger encoder rescores paragraph- and table-level nodes. Candidate assembly then uses ancestry paths — not just the matched node, but its ancestors and siblings — to pass coherent context to the LLM.
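
The two stages above can be sketched in a few lines. This is not HiKEY's implementation: the `overlap` word-count score stands in for both the lightweight and the stronger encoder, and the `sections` dictionary is a hypothetical toy graph with only parent-child edges.

```python
def overlap(query, text):
    """Toy relevance score: shared lowercase words (stand-in for an encoder)."""
    return len(set(query.lower().split()) & set(text.lower().split()))

# Hypothetical mini-graph: section title -> list of (node_id, text) children.
sections = {
    "1 Introduction": [("p1.1", "We study retrieval over long documents")],
    "2.1 Data": [("t2.1", "Table of dataset sizes per split"),
                 ("p2.1.1", "The training split has ten thousand pages")],
    "2.2 Model": [("p2.2.1", "The model is a dual encoder")],
}

def retrieve(query, sections, top_sections=2, top_nodes=1):
    # Stage 1: coarse routing. Score each section-level node and keep a shortlist.
    sec_scores = {
        title: overlap(query, title + " " + " ".join(t for _, t in kids))
        for title, kids in sections.items()
    }
    shortlist = sorted(sec_scores, key=sec_scores.get, reverse=True)[:top_sections]

    # Stage 2: fine retrieval. Rescore only the children of shortlisted sections.
    candidates = [
        (overlap(query, text), title, node_id)
        for title in shortlist
        for node_id, text in sections[title]
    ]
    candidates.sort(key=lambda c: c[0], reverse=True)

    # Candidate assembly: return each hit with its ancestor and its siblings.
    results = []
    for _, title, node_id in candidates[:top_nodes]:
        siblings = [nid for nid, _ in sections[title] if nid != node_id]
        results.append({"node": node_id, "ancestor": title, "siblings": siblings})
    return results

hits = retrieve("how many pages are in the training split", sections)
print(hits)
# [{'node': 'p2.1.1', 'ancestor': '2.1 Data', 'siblings': ['t2.1']}]
```

The matched paragraph comes back with its section title and the neighbouring table, so the LLM sees the paragraph in context rather than as an isolated span.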

The result on multi-page ODQA benchmarks: up to +4.5% over text-based RAG and +6.8% over full-page RAG in EM/ANLS. The gain comes almost entirely from queries that require multi-section evidence synthesis.
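
For readers unfamiliar with the two metrics, here is a sketch of how EM and ANLS are commonly computed for a single question. The 0.5 threshold and the lowercase-and-strip normalisation follow the usual convention and are assumptions here; this post does not specify the exact evaluation script.

```python
def levenshtein(a, b):
    """Edit distance between strings a and b (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def exact_match(pred, golds):
    """1.0 if the normalised prediction equals any gold answer, else 0.0."""
    return float(any(pred.strip().lower() == g.strip().lower() for g in golds))

def anls_single(pred, golds, tau=0.5):
    """Best normalised Levenshtein similarity over gold answers, zeroed at tau."""
    p = pred.strip().lower()
    best = 0.0
    for g in golds:
        g = g.strip().lower()
        denom = max(len(p), len(g))
        nl = levenshtein(p, g) / denom if denom else 0.0
        best = max(best, 1 - nl if nl < tau else 0.0)
    return best

pred, golds = "10,000 pages", ["10000 pages"]
print(exact_match(pred, golds))  # 0.0
print(anls_single(pred, golds))  # 11/12, one edit over 12 characters
```

EM gives no credit for the stray comma, while ANLS gives partial credit, which is why the two are often reported together. The dataset-level score is the mean over questions.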

What makes parsing hard in practice

Recovering the logical tree is harder than it looks. Real PDFs store text as a stream of glyphs with x/y coordinates. There is usually no semantic tag saying "this is a section heading." You have to infer hierarchy from font size, indentation, whitespace, reading order, and visual layout, all at once.

Multi-column layouts, spanning table cells, embedded figures, and scanned pages each require a different heuristic. No single parser gets all of them right.
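
To show what one such heuristic looks like, the sketch below infers heading levels from font size alone. The `spans` tuples are hypothetical extractor output, and the rule (most frequent size is body text, larger sizes are headings) is deliberately naive.

```python
from collections import Counter

# Hypothetical spans as a PDF extractor might emit them: (text, font_size, x, y).
spans = [
    ("2 Methodology", 16.0, 72, 700),
    ("We describe the data and the model.", 10.0, 72, 680),
    ("2.1 Data", 13.0, 72, 650),
    ("The corpus has ten thousand pages.", 10.0, 72, 630),
    ("See Table 2.1 for the splits.", 10.0, 72, 618),
]

def assign_levels(spans):
    """Label each span with a heading level (1 = top) or 0 for body text."""
    sizes = [size for _, size, _, _ in spans]
    body_size = Counter(sizes).most_common(1)[0][0]  # most frequent size = body
    heading_sizes = sorted({s for s in sizes if s > body_size}, reverse=True)
    level_of = {s: i + 1 for i, s in enumerate(heading_sizes)}
    return [(text, level_of.get(size, 0)) for text, size, _, _ in spans]

levels = assign_levels(spans)
for text, level in levels:
    print(level, text)
```

This works on the toy input and breaks quickly on real ones: a bold heading at body size, a large pull quote, or a scanned page with no font metadata all defeat it, which is the point of the paragraph above.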

In the Document-level LVLM-Parser project, we sidestep the heuristic problem by using a vision-language model (Qwen3-VL-4B) to jointly understand page images and text. A 50M-parameter cross-attention decoder regresses bounding boxes from pooled LLM hidden states at special layout-query tokens. The layout grounding module is decoupled from content generation, which lets us train them with different objectives and freeze the LLM once it reaches acceptable NLU quality.
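
The shape of that layout head can be illustrated with a toy, pure-Python sketch. Everything specific here is an assumption for illustration: the hidden size of 8, the random untrained weights, single-head attention, attending over image-patch features, and the sigmoid-normalised (cx, cy, w, h) output. It shows the data flow only, not the project's actual decoder.

```python
import math
import random

random.seed(0)
D = 8  # toy hidden size (assumption; real hidden states are far larger)

def rand_matrix(rows, cols):
    return [[random.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]

def matvec(m, v):
    return [sum(w * x for w, x in zip(row, v)) for row in m]

def softmax(xs):
    mx = max(xs)
    exps = [math.exp(x - mx) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def mean_pool(vectors):
    return [sum(col) / len(vectors) for col in zip(*vectors)]

# Randomly initialised projections; a trained decoder would learn these.
W_q, W_k, W_v = rand_matrix(D, D), rand_matrix(D, D), rand_matrix(D, D)
W_box = rand_matrix(4, D)

def predict_box(query_states, patch_features):
    """Cross-attend from a pooled layout query to page patches, then regress a box."""
    q = matvec(W_q, mean_pool(query_states))  # pooled hidden states -> query
    keys = [matvec(W_k, p) for p in patch_features]
    values = [matvec(W_v, p) for p in patch_features]
    weights = softmax([sum(a * b for a, b in zip(q, k)) / math.sqrt(D) for k in keys])
    attended = [sum(w * v[i] for w, v in zip(weights, values)) for i in range(D)]
    # Sigmoid keeps (cx, cy, w, h) inside (0, 1), i.e. normalised page coordinates.
    return [1 / (1 + math.exp(-z)) for z in matvec(W_box, attended)]

query_states = rand_matrix(2, D)    # hidden states at two layout-query tokens
patch_features = rand_matrix(6, D)  # features for six image patches
box = predict_box(query_states, patch_features)
print([round(v, 3) for v in box])
```

Because the box head only reads hidden states and never writes back into the LLM, it can be trained with a regression loss while the language model stays frozen.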

Takeaways