Constructing Korean Benchmark Suite for Reliable Evaluation of Foundation Models
Abstract
Reliable evaluation of foundation models in Korean requires benchmarks that measure the intended capabilities rather than artifacts introduced by translation, localization, or the evaluation protocol. In practice, Korean evaluation often adapts established English benchmarks, but literal translation can alter task difficulty, reduce prompt naturalness, or change what the task is intended to evaluate. We present a Korean Benchmark Suite comprising Ko-ARC, Ko-GSM8K, Ko-EQ-Bench, Ko-WinoGrande, Ko-LAMBADA, and Ko-IFEval, covering six capabilities across 9,396 items. Rather than treating translation as a single preprocessing step, we construct each subset using one of three routes: expert-reviewed translation and localization, direct Korean construction, or a hybrid of localized adaptation and Korean-specific redesign. For multiple-choice subsets, we also report NPSQ-based accuracy to assess whether models rely on evidence in the question rather than on a superficial preference among answer choices. Evaluation results show that model strengths differ across tasks and that larger models are not always the best performers. We further find that, depending on the task, different scoring methods can lead to different interpretations of the same results, highlighting the need to report benchmark scores together with their evaluation protocol.