

Poster in Workshop: Data-centric Machine Learning Research (DMLR): Datasets for Foundation Models

Is Your Model Really A Good Math Reasoner? Evaluating Mathematical Reasoning with Checklist

Zihao Zhou · Shudong Liu · Maizhen Ning · Wei Liu · Derek Wong · Jindong Wang · Qiufeng Wang · Kaizhu Huang


Abstract:

Exceptional mathematical reasoning ability is one of the key features that demonstrate the power of large language models (LLMs). How to comprehensively define and evaluate the true mathematical abilities of LLMs, and even reflect the user experience in real-world usage, has become an increasingly important question. Most existing benchmarks focus solely on problem solving, which poses a significant risk of overfitting and fails to capture genuine mathematical reasoning ability. In this paper, we argue that if a model really understands a problem, it should be able to apply that understanding robustly across a variety of tasks. Motivated by this, we introduce MathCheck, a well-designed checklist for testing task generalization and reasoning robustness, together with an automatic tool for generating checklists efficiently. MathCheck covers multiple mathematical reasoning tasks and robustness test types, facilitating a comprehensive evaluation of both mathematical reasoning ability and reasoning behavior. Using MathCheck, we develop MathCheck-GSM and MathCheck-GEO to assess mathematical textual reasoning and multimodal reasoning, respectively, serving as upgraded versions of benchmarks including GSM8K, GeoQA, UniGeo, and Geometry3K. Our extensive experiments on 19 LLMs and 11 MLLMs show that GPT-4o achieves the best performance on both MathCheck-GSM and MathCheck-GEO. Further experiments reveal that (1) compared with the mainstream benchmarking paradigm, MathCheck more closely reflects true mathematical reasoning ability; and (2) in terms of reasoning consistency, most models achieve similar scores across the cells of the checklist, except for task-specific models and those likely decorated for the benchmark, which exposes their artificially high performance on the original benchmarks. We hope our practice and observations can serve as an important step towards a more comprehensive evaluation of reasoning ability.
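The checklist described above is essentially a grid of reasoning tasks crossed with robustness test types, with models compared cell by cell. As a rough illustration of that evaluation layout only (not the authors' implementation), the sketch below uses hypothetical task and robustness names and a placeholder `model` callable to compute per-cell accuracy and a simple consistency gap.

```python
# Illustrative sketch of a checklist-style evaluation grid.
# Task names, robustness types, and the `model` callable are hypothetical
# placeholders; the actual MathCheck checklist defines its own dimensions.
from typing import Callable, Dict, List, Tuple

# Illustrative checklist dimensions: reasoning tasks x robustness variants.
TASKS = ["problem_solving", "answer_judging", "process_judging"]
ROBUSTNESS = ["original", "rephrased", "irrelevant_context"]

Cell = Tuple[str, str]  # (task, robustness type)

def evaluate_checklist(
    model: Callable[[str], str],
    problems: Dict[Cell, List[Tuple[str, str]]],
) -> Dict[Cell, float]:
    """Return per-cell accuracy for every (task, robustness) cell.

    `problems` maps each cell to a list of (prompt, gold_answer) pairs.
    """
    scores: Dict[Cell, float] = {}
    for cell, items in problems.items():
        correct = sum(model(prompt).strip() == gold for prompt, gold in items)
        scores[cell] = correct / len(items) if items else 0.0
    return scores

def consistency_gap(scores: Dict[Cell, float]) -> float:
    """Spread between the best and worst cell: a small gap suggests reasoning
    that transfers across tasks; a large gap suggests overfitting to one cell."""
    values = list(scores.values())
    return max(values) - min(values)
```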
