Gyuho Shim

Hi, I'm Gyuho Shim 👋

I build LLM systems, from evaluation pipelines to reasoning models.

Master's student at Korea University NLP&AI Lab, advised by Prof. Heuiseok Lim.
Previously B.S. (Triple Major: CS, Math, Statistics) at the University of Wisconsin–Madison.
Interested in Document AI, Multimodal RAG, LLM Evaluation, and Mechanistic Interpretability.

Education

M.S. in Computer Science
Korea University
Advisor: Prof. Heuiseok Lim · GPA: 4.33 / 4.5
Document AI · Multimodal RAG · LLM Agents
Expected Aug 2026 Seoul, South Korea
B.S., Triple Major: CS, Mathematics, Statistics
University of Wisconsin–Madison
GPA: 3.55 / 4.0
May 2024 Madison, WI

Technical Skills

Core ML: PyTorch · JAX · Hugging Face · Polars
LLM Training: DeepSpeed · FSDP · Megatron-LM · PEFT (LoRA, QLoRA)
RL & Evaluation: GRPO · GSPO · RLHF (TRL) · LM-Eval-Harness · EvalChemy
Inference: vLLM · SGLang · TensorRT-LLM · KV Cache · Quantization
MLOps: CUDA · Triton · Docker · Kubernetes · AWS · W&B

Publications

REVISE figure
ACL 2025 · Oral
REVISE: A Framework for Revising OCRed text in Practical Information Systems with Data Contamination Strategy
Shim, G., Hong, S., & Lim, H.
OCR errors, from font degradation to complex multi-column layouts, fundamentally compromise downstream Document AI. REVISE addresses this by introducing a comprehensive hierarchical taxonomy of common OCR errors and a synthetic data contamination strategy that injects realistic OCR-like noise at the character, word, and structural levels. Trained on these synthetic datasets, the model learns to robustly reconstruct the original document structure, significantly improving QA and retrieval accuracy without requiring costly real-error annotations.
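The character-level stage of such a contamination strategy can be sketched in a few lines; the confusion table and noise rate below are illustrative stand-ins, not REVISE's actual taxonomy or configuration:

```python
import random

# Illustrative table of visually confusable glyph pairs; a real taxonomy
# would be hierarchical and cover word- and structure-level errors too.
CONFUSIONS = {"l": "1", "O": "0", "rn": "m", "S": "5", "B": "8"}

def contaminate(text: str, rate: float = 0.1, seed: int = 0) -> str:
    """Inject OCR-like substitutions into clean text at a given rate."""
    rng = random.Random(seed)
    out = text
    for src, tgt in CONFUSIONS.items():
        # Replace each occurrence independently with probability `rate`.
        pieces = out.split(src)
        rebuilt = pieces[0]
        for piece in pieces[1:]:
            rebuilt += (tgt if rng.random() < rate else src) + piece
        out = rebuilt
    return out
```

Pairing each contaminated string with its clean original yields supervised (noisy, clean) training examples without annotating any real OCR output.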
Benchmark Profiling figure
EMNLP 2025 · Oral
Benchmark Profiling: Mechanistic Diagnosis of LLM Benchmarks
Kim, D., Shim, G., Chun, Y. C., Kim, M., Park, C., & Lim, H.
Standard benchmark scores mask which abilities a model actually uses. Benchmark Profiling decomposes performance into 10 cognitively grounded abilities, from contextual recall to multi-step reasoning, using gradient-based importance scoring and targeted parameter ablation. The resulting Ability Impact Score (AIS) reveals that most benchmarks require a mixture of abilities, that similarly labeled datasets often rely on distinct ability profiles, and that narrow domain fine-tuning yields only modest gains on code benchmarks. Analyzed across three instruction-tuned models and ten benchmarks.
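The gradient-based importance idea can be illustrated on a toy linear model; this is a first-order sketch of the general technique, not the paper's actual AIS computation:

```python
import numpy as np

def importance_scores(W, X, y):
    """First-order importance of each weight of a linear model
    y_hat = X @ W under mean squared error: |w * dL/dw| estimates
    how much the loss would change if that weight were ablated."""
    resid = X @ W - y
    grad = (2.0 / len(y)) * X.T @ resid   # analytic MSE gradient
    return np.abs(W * grad)

def ablate_top(W, scores):
    """Zero out the single most 'important' weight (targeted ablation)."""
    W = W.copy()
    W[np.argmax(scores)] = 0.0
    return W
```

Repeating the score-then-ablate loop per ability-linked parameter group, and measuring the benchmark score drop, is the shape of the diagnosis the paper performs at scale.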
HiKEY figure
ACL 2026 · Main Oral
HiKEY: Hierarchical Multimodal Retrieval for Open-Domain Document Question Answering
Shin, J., Shim, G., Park, J., Seo, J., & Lim, H.
Flat text chunking loses the structural context that makes complex documents interpretable. HiKEY constructs an offline heterogeneous graph from DHP-parsed document hierarchies, then performs (1) hierarchical coarse-to-fine retrieval that rapidly narrows from global routing to local section-level candidates, and (2) an ancestry-aware subgraph assembly that captures cross-section dependencies. Across multi-page ODQA benchmarks, HiKEY outperforms text-based RAG by up to 4.5% and full-page RAG by up to 6.8%, with strong end-to-end EM/ANLS gains.
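The coarse-to-fine retrieval idea can be sketched with a toy scorer; the Jaccard word overlap and flat section dicts below stand in for HiKEY's actual graph construction and retrievers:

```python
def overlap(a: str, b: str) -> float:
    """Toy relevance score: Jaccard overlap of word sets."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / max(len(sa | sb), 1)

def coarse_to_fine(query, sections, top_sections=1, top_chunks=2):
    """Route globally to the most relevant sections first, then rank only
    the chunks inside them, instead of scoring every chunk in the corpus."""
    # Coarse stage: score each section by its concatenated text.
    ranked = sorted(
        sections,
        key=lambda s: overlap(query, " ".join(s["chunks"])),
        reverse=True,
    )
    # Fine stage: rank chunks within the surviving sections only.
    candidates = [c for s in ranked[:top_sections] for c in s["chunks"]]
    return sorted(candidates, key=lambda c: overlap(query, c), reverse=True)[:top_chunks]
```

The payoff of the two-stage design is that the expensive fine-grained scoring runs only on the small candidate set the coarse router admits.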
Technical Report · Jan 2026
VAETKI Technical Report
NC-AI Consortium (NC AI · ETRI · Korea University)
Technical report introducing the VAETKI series of large language models at 100B, 20B, and 7B scales, developed for national data sovereignty and industrial AI transformation. The models use a decoder-only Mixture-of-Experts (MoE) architecture with Multi-Latent Attention (MLA) and Local-Global Interleaving, achieving an 83% reduction in KV cache size and a 40% gain in training efficiency over conventional MLA. Pre-trained on a curated 5-trillion-token corpus and refined through a seven-stage pipeline including Reasoning-centric SFT, High-Quality SFT, and DPO on 8M instruction-tuning triplets. Evaluation on Korean benchmarks (CLIcK, KoBALT) and IFEval demonstrates strong multilingual and instruction-following capabilities.
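The KV-cache saving from caching a compressed latent instead of full per-head keys and values can be sanity-checked with back-of-the-envelope arithmetic; the dimensions below are illustrative, not VAETKI's:

```python
def kv_cache_bytes(layers, tokens, kv_heads, head_dim, bytes_per=2):
    """Per-sequence KV cache for standard attention: both keys and values
    are cached for every layer, head, and token (fp16 = 2 bytes)."""
    return 2 * layers * tokens * kv_heads * head_dim * bytes_per

def latent_cache_bytes(layers, tokens, latent_dim, bytes_per=2):
    """With a latent-attention scheme, only one compressed latent vector
    per token is cached instead of full per-head keys and values."""
    return layers * tokens * latent_dim * bytes_per
```

With these toy numbers (a latent the size of a single head), the cache shrinks by 87.5%; the report's 83% figure would follow from the model's actual head count and latent width.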

Projects

WBL Independent AI Foundation Model (VAETKI)
Collaboration with NC AI & ETRI
  • Owned engineering of a large-scale evaluation pipeline standardizing 50+ benchmarks into a single reproducible runtime with automated regression tracking.
  • Implemented end-to-end checkpoint-to-evaluation automation and integrated vLLM to accelerate inference throughput.
  • Built W&B Weave trace analysis dashboards to debug reasoning failures and support slice-level comparisons.
  • Implemented data-mixture contribution analysis to quantify which dataset combinations drove metric gains.
  • Introduced contamination defenses including deduplication and overlap scans for fair, comparable evaluations.
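The overlap scan in the last bullet can be sketched as an n-gram membership check between training text and evaluation items; the n-gram size and one-hit flagging rule here are illustrative choices, not the pipeline's actual configuration:

```python
def ngrams(text: str, n: int = 8):
    """All word n-grams of a text, as a set for fast intersection."""
    toks = text.lower().split()
    return {tuple(toks[i:i + n]) for i in range(len(toks) - n + 1)}

def contamination_rate(train_doc: str, eval_items, n: int = 8) -> float:
    """Fraction of eval items sharing at least one n-gram with training text."""
    train = ngrams(train_doc, n)
    flagged = sum(1 for item in eval_items if ngrams(item, n) & train)
    return flagged / max(len(eval_items), 1)
```

Flagged items are then deduplicated out of the training mix (or reported alongside scores) so benchmark comparisons stay fair.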
KULLM Reasoning Model Training
nobrand/KULLM-R ยท NLP&AI Lab, Korea University
  • Implemented GRPO reinforcement learning in VERL with multi-rollout group scoring for Korean reasoning tasks.
  • Designed verifiable custom reward functions with an adaptive length penalty, reducing verbosity on easy problems while preserving depth on hard ones.
  • Tuned reward weights and RL hyperparameters (KL coefficient, rollout settings, max response length) for accuracy-compute-quality balance.
  • Ran iterative evaluations on Korean math and reasoning benchmarks (Pass@1 and length diagnostics).
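The adaptive length penalty above can be sketched as a verifiable reward shaped by the rollout group's pass rate; the function names and penalty scale are illustrative, not the KULLM-R implementation:

```python
def grpo_rewards(answers, gold, lengths, penalty=0.001):
    """Verifiable reward with an adaptive length penalty: correct answers
    earn 1.0, and the per-token penalty is scaled by how easy the problem
    looks (fraction of the rollout group that solved it), so verbosity is
    punished mostly on easy problems and tolerated on hard ones."""
    correct = [a == gold for a in answers]
    ease = sum(correct) / len(answers)   # group pass rate as difficulty proxy
    return [
        (1.0 if c else 0.0) - penalty * ease * length
        for c, length in zip(correct, lengths)
    ]

def group_advantages(rewards):
    """GRPO-style advantage: reward minus the group mean (no learned critic)."""
    mean = sum(rewards) / len(rewards)
    return [r - mean for r in rewards]
```

Because advantages are centered within each multi-rollout group, only relative differences (shorter correct beats longer correct) drive the policy update.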
Document AI Service
NLP&AI Lab, Korea University
  • Built end-to-end Document AI MVP: FastAPI + React app parsing enterprise documents (PDF, DOCX, XLSX, PPTX) supporting 6+ parsing backends.
  • Integrated multi-model parser orchestration for 6 document layout models (DETR, VGT, DHP, MinerU2.5, DocLayout-YOLO) with NMS optimization.
  • Implemented evidence-first AI features: structure-aware search, QA with evidence packaging, hierarchical summarization, and multi-document comparison.
  • Engineered document hierarchy parsing with token-cache-based M3DocDep model and automated page rendering.
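The NMS step used to reconcile overlapping detections from multiple layout models follows the standard greedy algorithm; this sketch uses plain (x1, y1, x2, y2) tuples rather than any of the parsers' actual box formats:

```python
def iou(a, b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union else 0.0

def nms(boxes, scores, thresh=0.5):
    """Greedy non-maximum suppression: keep the highest-scoring boxes,
    dropping any later box that overlaps a kept one above `thresh`, e.g.
    when several layout detectors fire on the same page region."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        if all(iou(boxes[i], boxes[j]) < thresh for j in keep):
            keep.append(i)
    return keep
```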

Experience

NLP & AI Researcher
NLP&AI Lab, Korea University
Seoul, South Korea
Jan 2024 – Feb 2026
  • Co-authored 10+ papers at top-tier NLP venues (ACL, EMNLP, NAACL).
  • Drove benchmark design, evaluation recipes, regression policies, and dataset versioning across lab and consortium model iterations.
  • Supported Korean LLM post-training: GRPO in VERL for KULLM Reasoning, instruction tuning for KULLM3 and Ko Gemma.
  • Developed SAE-based feature discovery, steering, and causal intervention workflows for mechanistic interpretability.
  • Operated shared GPU infrastructure (40+ A100/H100) for 30+ researchers: Kubernetes, Docker, and experiment tracking.
  • Built safety audit harnesses with jailbreak/refusal checks, PII screening, and toxicity regression monitoring.
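The steering part of that interpretability workflow can be sketched in a few lines; treating one SAE decoder column as a feature direction is the standard recipe, but the vectors and scale below are arbitrary stand-ins:

```python
import numpy as np

def steer(resid, feature_dir, alpha=4.0):
    """Activation steering: add a scaled unit-norm feature direction
    (e.g. one column of an SAE decoder) to residual-stream activations,
    nudging the model toward the behavior that feature encodes."""
    d = np.asarray(feature_dir, dtype=float)
    d = d / np.linalg.norm(d)
    return np.asarray(resid, dtype=float) + alpha * d
```

Running the model with and without the added direction, and comparing outputs, is the causal-intervention half of the workflow.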
NLP & AI Research Assistant
Karumbaiah Lab, UWโ€“Madison
Madison, WI
Sep 2023 – May 2024
  • One of 7 RAs in Shamya Karumbaiah's Human–AI Research Lab, focusing on NLP model enhancement.
  • Performed error analysis with Errudite to assess LLMs that provide automated feedback on student essays.
  • Ran behavioral tests analyzing linguistic attributes (negation, length, adjectives, entities) in LLMs.
Data Scientist
Aivelabs
Seoul, South Korea
Aug 2021 – Jan 2022
  • Created report automation using Python, SQL, and Excel, improving report generation speed by 30%.
  • Collaborated with Amore Pacific marketing teams to fulfill data requests via SQL and Google Analytics.
  • Applied sentiment analysis with NLTK on customer reviews to inform promotional message creation.

Contact

Feel free to reach out for research collaborations, opportunities, or just to say hi.