M.S. Candidate · Korea University NLP&AI Lab

Hi, I'm Gyuho Shim 👋

Multimodal/LLM AI Engineer specializing in document intelligence — building AI systems that are robust, practical, and reliable.

Master's student at Korea University NLP&AI Lab, advised by Prof. Heuiseok Lim.
Previously BS (Triple Major: CS, Math, Statistics) at University of Wisconsin–Madison.
Focused on Document AI, Multimodal RAG, LLM Training, and LLM Evaluation.
Seeking a 전문연구요원 (Technical Research Personnel, 신규편입) position as an AI Engineer / Research Scientist.

CV (PDF) Google Scholar GitHub LinkedIn

Publications

ACL 2025 · Oral

REVISE: A Framework for Revising OCRed text in Practical Information Systems with Data Contamination Strategy

Shim, G., Hong, S., & Lim, H.

OCR errors — from font degradation to complex multi-column layouts — fundamentally compromise downstream Document AI. REVISE addresses this by introducing a comprehensive hierarchical taxonomy of common OCR errors and a synthetic data contamination strategy that injects realistic OCR-like noise at the character, word, and structural level. Trained on these synthetic datasets, the model learns to robustly reconstruct original document structure, significantly improving QA and retrieval accuracy without requiring costly real-error annotations.

arXiv GitHub

EMNLP 2025 · Oral

Benchmark Profiling: Mechanistic Diagnosis of LLM Benchmarks

Kim, D., Shim, G., Chun, Y. C., Kim, M., Park, C., & Lim, H.

Standard benchmark scores mask what abilities a model actually uses. Benchmark Profiling decomposes performance into 10 cognitively grounded abilities — from contextual recall to multi-step reasoning — using gradient-based importance scoring and targeted parameter ablation. The resulting Ability Impact Score (AIS) reveals that most benchmarks require a mixture of abilities, similarly-labeled datasets often rely on distinct ability profiles, and narrow domain fine-tuning yields only modest gains on code benchmarks. Analyzed across three instruction-tuned models and ten benchmarks.

arXiv GitHub

ACL 2026 · Main Oral

HiKEY: Hierarchical Multimodal Retrieval for Open-Domain Document Question Answering

Shin, J., Shim, G., Park, J., Seo, J., & Lim, H.

Flat text chunking loses the structural context that makes complex documents interpretable. HiKEY constructs an offline heterogeneous graph from DHP-parsed document hierarchies, then performs (1) hierarchical coarse-to-fine retrieval that rapidly narrows from global routing to local section-level candidates, and (2) an ancestry-aware subgraph assembly that captures cross-section dependencies. Across multi-page ODQA benchmarks, HiKEY outperforms text-based RAG by up to 4.5% and full-page RAG by up to 6.8%, with strong end-to-end EM/ANLS gains.

Project Page arXiv PDF

HCLT 2025 · Oral

KULLM-R: Efficient Korean Reasoning Model

Lee, S., Kim, D., Kim, M., Shim, G., Park, C., So, A., & Lim, H.

KULLM-R is a Korean reasoning model built on GRPO reinforcement learning in VERL. The work introduces verifiable reward functions for correctness and Korean answer consistency, along with an adaptive length penalty that reduces verbosity on easy problems while preserving reasoning depth on hard ones.

Model

HCLT 2024 · Oral

Mixture of Models: Towards Effective Domain Expert Ensemble of Large Language Models

Shim, G., Eo, S., Kim, J., Lee, J., & Lim, H.

Proposes a query-aware LLM routing framework that dynamically selects the best specialist model from a pool of 17 expert LLMs per query. Using confidence-based labeling, the routing classifier removes dependency on manual domain annotation and beats the domain-routing baseline on 12 out of 14 MMLU-Pro sub-domains, achieving +10pp over any single specialist on out-of-distribution benchmarks.

Technical Report · Jan 2026

VAETKI Technical Report

NC-AI Consortium (NC AI · ETRI · Korea University)

Technical report introducing the VAETKI series of large language models at 100B, 20B, and 7B scales, developed for national data sovereignty and industrial AI transformation. The models use a decoder-only Mixture-of-Experts (MoE) architecture with Multi-Latent Attention (MLA) and Local-Global Interleaving, achieving an 83% reduction in KV cache size and a 40% gain in training efficiency over conventional MLA. Pre-trained on a curated 5-trillion-token corpus and refined through a seven-stage pipeline including Reasoning-centric SFT, High-Quality SFT, and DPO on 8M instruction-tuning triplets. Evaluation on Korean benchmarks (CLIcK, KoBALT) and IFEval demonstrates strong multilingual and instruction-following capabilities.

PDF

Projects

WBL Independent AI Foundation Model (국가대표 AI, VAETKI)

NC-AI-consortium-VAETKI/VAETKI · Collaboration with NC AI & ETRI

Designed the Query Filtering Criterion adopted in the VAETKI Technical Report: a 6-dimensional clarity rubric (1–10, deterministic tie-breaking) filtering the 2M high-quality SFT subset, and trained the backing Qwen3-4B clarity tagger with per-domain thresholds.
Contributed to Correctness-Guaranteed Reasoning Synthesis for Math/Code, verifying synthesized reasoning chains against known answers.
Built a 200K-pair DPO set pairing verified syntheses against failure modes (wrong answers, repetition, verbosity) and long-context data.

Ko-VDR: Korean Vision-Language Document Retriever

johnandru/ko-vdr-preview · NLP&AI Lab, Korea University

Fine-tuned Qwen3-VL-Embedding-2B into a Korean retriever via LoRA (1.5% params, frozen vision encoder) on 325K pairs with FAISS hard-negative mining.
Designed a 3-layer nested loss (Matryoshka × SelfGuide × Cached InfoNCE) supporting 7 dims (128–2048) for retraining-free accuracy/size trade-offs.

Blog

Document-level LVLM-Parser

NLP&AI Lab, Korea University

Built a multimodal document parser on Qwen3-VL-4B with a 50M-param cross-attention decoder regressing bounding boxes from pooled LLM hidden states at layout-query tokens, decoupling content generation from layout grounding.
Designed a 3-stage curriculum (LM-only → joint → frozen-LLM grounding); gradient-masking hook trains only the 4 added layout-query embedding rows when the LLM is frozen.
Extending to document-level hierarchy modeling; work in progress toward conference submission.

KULLM Reasoning Model Training

nobrand/KULLM-R · NLP&AI Lab, Korea University

Implemented GRPO reinforcement learning in VERL for Korean reasoning tasks.
Designed verifiable reward functions for correctness, Korean answer consistency, and an adaptive length penalty that reduces verbosity on easy problems while preserving depth on hard ones.

DocGraph Copilot: Document AI Service

NLP&AI Lab, Korea University

Built end-to-end Document AI system (FastAPI + React) parsing PDF/DOCX/XLSX/PPTX/TXT via a unified pipeline orchestrating 6 layout backends (DETR, VGT, DHP, MinerU2.5, DocLayout-YOLO, heuristic) with per-model tuning and auto-generated registry.
Delivered evidence-first features: structure-aware search, evidence-packaged QA, hierarchical summarization, and multi-doc comparison.
Engineered a hierarchical document parser that infers logical document structure and organization, powering downstream chunking and analysis.

Synapse: AI Knowledge Graph Search Platform

NLP&AI Lab, Korea University

Architected enterprise GraphRAG platform unifying Slack/Jira/Drive into one knowledge graph with hybrid BM25 + Vector (HNSW) + Graph traversal.
Designed 5-stage confidence-gated extraction (≥0.7 threshold) prioritizing structured metadata over LLM calls to cut hallucinations and embedding cost.

KT–Korea University Collaborative Research

NLP&AI Lab, Korea University

Co-led 18-month KT Corporation collaboration on multi-LLM systems, authoring 2 papers (HCLT 2024 + ACL under review).
Led LLM-routing framework that selects the best of 17 expert models per query, beating any single specialist by +10pp on out-of-distribution benchmarks (MMLU-Pro, AGIEval).
Designed confidence-based labeling for the routing classifier, removing domain-annotation dependency; beat domain-routing baseline on 12/14 MMLU-Pro sub-domains.

Experience

NLP & AI Researcher

NLP&AI Lab, Korea University

Seoul, South Korea

May 2024 – Aug 2026

Research: Co-authored 10+ papers at top-tier venues (ACL, EMNLP), spanning document understanding, multimodal retrieval, LLM orchestration, reasoning, and mechanistic interpretability.
LLM/LVLM Training: Led training across LLMs, LVLMs, and vision-language retrievers; contributed to academic models and consortium-scale foundation models.
Document AI: Built multimodal document parsers and vision-language retrievers covering layout grounding, cross-page hierarchy, and visual document retrieval.
Multimodal RAG: Architected and implemented multimodal RAG frameworks end-to-end, covering multi-document reasoning and hierarchical knowledge integration.
Evaluation: Designed a multi-document QA benchmark and established reusable evaluation recipes and diagnostic workflows for model iterations.
Data: Built multi-million-scale data curation pipelines spanning clarity-based query filtering, correctness-verified reasoning synthesis, document image-text curation, and FAISS hard-negative mining.

Co-Founder & Co-Lead

KUDoc, Document AI Research Group, Korea University

Seoul, South Korea

May 2024 – Aug 2026

Co-founded and led KUDoc to 12 top-tier publications (ACL, CVPR, EMNLP), 4 industry–academia projects, and 3 technology transfers.

NLP & AI Research Assistant

Karumbaiah Lab, University of Wisconsin–Madison

Madison, WI

Sep 2023 – May 2024

Researched LLM-based automated essay feedback in Shamya Karumbaiah's Human-AI Lab, identifying model limitations and improvement areas.
Performed error analysis with Errudite and behavioral tests on linguistic attributes (negation, essay length, adjective count, location entities) to identify LLM failure patterns.

Data Scientist

Aivelabs

Seoul, South Korea

Aug 2021 – Jan 2022

Automated client reporting in Python/SQL/Excel, cutting generation time by 30%.
Collaborated with Amore Pacific marketing on SQL and Google Analytics data requests in weekly syncs.
Analyzed ad performance and app customer data (churn, ad placement value, conversions); applied NLTK sentiment analysis on reviews to inform promotional messaging.

Academic Coordinator

KSEA, Korean American Scientists and Engineers Association

Madison, WI

Jul 2023 – Jan 2024

Structured the semester's academic programming as a key executive, aligning projects with club objectives.
Led a 6-person data science team on a 'Zero Hunger' research project covering route optimization, supply chain, and food security analysis.

Software Coordinator

KCU, Korean Undergraduate Computer Science Union

Madison, WI

Sep 2022 – Jul 2023

Directed organizational planning through meeting agendas, goal-setting, and 1:1s with group leads.

Blog

Apr 2026 2026년 4월

Training a Korean Visual Document Retriever Model
A full walkthrough of Ko-VDR — the dual-encoder architecture, 325k-pair data pipeline, GradCache, Matryoshka loss, self-guide filtering, and 4-GPU DDP training.

Ko-VDR 전체 파이프라인 — 듀얼 인코더 구조, 32만5천 쌍 데이터 파이프라인, GradCache, Matryoshka 손실, Self-Guide 필터링, 4-GPU DDP 학습까지 상세 설명.
Apr 2026 2026년 4월

Understanding Hierarchical Document Parsing for RAG Systems
Why flat chunking breaks complex documents — and how hierarchy-aware parsing unlocks better retrieval and reasoning.

플랫 청킹이 복잡한 문서를 제대로 처리하지 못하는 이유와, 계층 인식 파싱이 검색·추론 성능을 높이는 방법.
Mar 2026 2026년 3월

Why OCR Errors Matter: Building Robust Document AI Pipelines
A look at how OCR noise propagates through Document AI pipelines and what systematic error correction can do about it.

OCR 노이즈가 Document AI 파이프라인에 어떻게 전파되는지, 그리고 체계적인 오류 수정이 이를 어떻게 해결하는지 살펴봅니다.
Dec 2025 2025년 12월

Multimodal RAG in Long-Context DocVQA
How M3DocRAG and MoLoRAG extend retrieval-augmented generation to multi-page, multi-document settings — visual retrieval with ColPali, page graph traversal, and VLM-based logical scoring.

M3DocRAG와 MoLoRAG가 멀티 페이지·멀티 문서 환경으로 RAG를 확장하는 방법 — ColPali 시각 검색, 페이지 그래프 탐색, VLM 기반 관련성 점수.
Oct 2025 2025년 10월

Document AI with LLM
A tour of two paradigms for layout-aware document understanding — DocLLM's Disentangled Spatial Attention and LayTextLLM's Spatial Layout Projector with Partial LoRA and Shuffled-OCR SFT.

레이아웃 인식 문서 이해를 위한 두 패러다임 비교 — DocLLM의 Disentangled Spatial Attention과 LayTextLLM의 Spatial Layout Projector, Partial LoRA, Shuffled-OCR SFT.
Sep 2025 2025년 9월

Retrieval in Vision Space
ColPali's Late Interaction over visual patches achieves +15 NDCG@5 over OCR baselines at 18× faster indexing — and why VLM-based retrieval still loses to LLM OCR on degraded scans.

ColPali의 시각 패치 Late Interaction이 OCR 기준 대비 NDCG@5 +15 달성, 18배 빠른 인덱싱 — 그리고 VLM 기반 검색이 품질 저하 스캔에서 LLM OCR에 여전히 뒤지는 이유.
Jan 2025 2025년 1월

Unveiling Document AI #1: Paper Review of DocLLM
A deep dive into DocLLM (ACL 2024) — how its Disentangled Spatial Attention fuses text and layout, and why it achieves best performance on 12 out of 16 document AI benchmarks.

DocLLM(ACL 2024) 심층 리뷰 — Disentangled Spatial Attention이 텍스트와 레이아웃을 결합하는 방식과 16개 Document AI 벤치마크 중 12개에서 최고 성능을 달성하는 이유.

Hi, I'm Gyuho Shim 👋

안녕하세요, 심규호입니다 👋

Education

학력

Technical Skills

기술 스택

Publications

논문

Projects

프로젝트

Experience

경력

Blog

블로그

Contact

연락처