Position: Improved Documentation Is Necessary for Benchmarking AI Systems in Geometry
Anna Genevaux ⋅ Simon Frieder
Abstract
This position paper argues that documentation is infrastructure for reproducible geometry reasoning: a benchmark of formal geometry problems for testing AI systems is not usable in research unless its documented vocabulary is matched by executable, versioned behavior and minimal runnable examples. We use JGEX (as implemented by Newclid) as a case study of how documentation--implementation gaps and missing examples can silently constrain expressivity, fragment tool interoperability, and bias benchmark construction. To substantiate this position, we introduce "A JGEX Dataset", a curated collection of $78$ Euclidean geometry problems with (i) original natural-language statements and sources, (ii) a JGEX-oriented rewrite that makes the formalization steps explicit, (iii) executable JGEX code validated under a pinned solver version, and (iv) rich metadata. To make the target language auditable, we also provide a predicate-level support matrix for the $33$ documented predicates, generated from minimal test instances, and categorize each predicate as supported, unsupported, or unstable, the latter typically because accessible examples are missing. Finally, we release validation scripts and a concise tutorial with worked walk-throughs. Our broader claim is that benchmark authors, tool maintainers, and reviewers should treat language documentation and conformance evidence as first-class artifacts, on par with datasets and evaluation code, if cross-tool, cross-version reproducibility is the goal.
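For concreteness, the following is a minimal sketch of the problem format, assuming the semicolon-separated JGEX construction syntax shared by the AlphaGeometry and Newclid tool line; the orthocenter statement shown is a standard illustrative example and is not drawn from our dataset:

    a b c = triangle a b c; d = on_tline d b a c, on_tline d c a b ? perp a d b c

Read left to right: given a triangle a b c, the construction on_tline d b a c places d on the line through b perpendicular to line a c, and the second clause likewise places d on the perpendicular from c to a b, so d is the orthocenter; the clause after the question mark states the goal, that a d is perpendicular to b c.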