Poster in Workshop: DataWorld: Unifying data curation frameworks across domains
EvalX: A Platform for Code LLM Evaluation in the Wild
Wayne Chi · Valerie Chen · Anastasios Angelopoulos · Wei-Lin Chiang · Aditya Mittal · Naman Jain · Tianjun Zhang · Ion Stoica · Chris Donahue · Ameet Talwalkar
Keywords: [ evaluation ] [ llm ] [ code ]
Evaluating the in-the-wild coding capabilities of large language models (LLMs) is a challenging endeavor with no existing solution. We introduce EvalX, a platform that collects user preferences through native integration into a developer's working environment. EvalX comprises a novel interface for comparing pairs of model outputs, a sampling strategy to reduce experienced latency, and a prompting scheme to enable code completion functionality. EvalX has served over 4.5 million suggestions from 10 models and collected over 11k pairwise judgements. Our results highlight the importance of evaluating models in integrated settings. We find that model rankings from EvalX differ from those of existing evaluations, which we attribute to the unique distribution of data and tasks in EvalX. We also surface novel insights into human preferences on code, such as consistent user preferences across programming languages but significant variation in preference across task categories. We will open-source EvalX and release its data to enable human-centric evaluations and improve understanding of coding assistants.
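The abstract does not specify how the 11k pairwise judgements are aggregated into model rankings. As a minimal sketch, assuming a standard Bradley-Terry-style aggregation (a common choice for pairwise preference data, not confirmed by this abstract), the computation might look like the following; the data schema and model names are purely illustrative.

# Hypothetical sketch: aggregate pairwise preferences into a model ranking
# using the Bradley-Terry model fit with the classic MM update.
from collections import defaultdict

def bradley_terry(battles, iters=200, tol=1e-8):
    """Estimate Bradley-Terry strengths from (winner, loser) pairs."""
    models = sorted({m for pair in battles for m in pair})
    wins = defaultdict(float)   # total wins per model
    counts = defaultdict(float) # number of comparisons per unordered pair
    for winner, loser in battles:
        wins[winner] += 1.0
        counts[frozenset((winner, loser))] += 1.0
    p = {m: 1.0 for m in models}  # initial strengths
    for _ in range(iters):
        new_p = {}
        for i in models:
            # MM update: p_i <- W_i / sum_j n_ij / (p_i + p_j)
            denom = sum(
                counts[frozenset((i, j))] / (p[i] + p[j])
                for j in models
                if j != i and counts[frozenset((i, j))] > 0
            )
            new_p[i] = wins[i] / denom if denom > 0 else p[i]
        total = sum(new_p.values())
        new_p = {m: v / total for m, v in new_p.items()}  # fix the scale
        if max(abs(new_p[m] - p[m]) for m in models) < tol:
            p = new_p
            break
        p = new_p
    return sorted(p.items(), key=lambda kv: kv[1], reverse=True)

# Illustrative judgements: each tuple is (preferred model, other model).
battles = [
    ("model-a", "model-b"),
    ("model-b", "model-c"),
    ("model-a", "model-c"),
    ("model-c", "model-b"),
    ("model-a", "model-b"),
]
for rank, (model, strength) in enumerate(bradley_terry(battles), start=1):
    print(rank, model, round(strength, 3))

The normalization step only fixes the scale invariance of Bradley-Terry strengths; the ordering of models is what yields a ranking comparable across evaluation platforms.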