Poster in Affinity Event: The 6th Muslims in ML (MusIML) Workshop

Semantic Contracts as the Missing Middle Layer for Reliable AI Mathematics

ZhangHao

Abstract

A language model solves a math problem step by step. The reasoning reads fluently, but should we trust it? Today's dominant strategy -- generating Python code for execution -- handles arithmetic reliably yet is structurally blind to semantic unit errors: apples + dollars compiles and runs without complaint. We argue that the missing ingredient is not a better generator or a stronger verifier, but an explicit semantic middle layer that makes the structure of a solution machine-checkable before any numbers are computed. We propose SC-IR (Semantic Contract Intermediate Representation), a typed contract language whose type system tracks ontology-aware quantity kinds -- Count[apple], Rate[km,litre], Frac -- and enforces kind consistency via six division-aware typing rules. SC-IR maps a reasoning trace to a typed contract and either accepts it (with discharged proof obligations) or rejects it with one of three failure labels: reduction, typing, or verification failure. Each label implies a distinct repair strategy, making the pipeline's failures operationally actionable. We evaluate SC-IR on the full GSM8K test set (1,319 problems) under a true blind protocol -- the model sees only the problem text, no answer hints -- and compare against Program-of-Thought (PoT). When SC-IR accepts a contract, it is correct 87.0% of the time, compared with PoT's 77.0% overall accuracy, showing the precision of selective typed acceptance. With DeepSeek-V4-Pro, SC-IR reaches 61.8% coverage, 59.8% overall accuracy, and 96.8% accepted-contract precision; PoT with the same generator remains the stronger accuracy baseline at 95.0%. Critically, SC-IR catches 21 semantic unit errors that PoT silently executes. An agentic repair loop adds +8.5% cumulative accuracy, with verification failures showing the highest repair rate (30.4%) -- confirming that structured failure attribution enables targeted correction. Removing ontology typing raises the false-accept rate from 2.1% to 3.6%.
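To make the kind-consistency idea concrete, here is a minimal Python sketch of a checker that rejects apples + dollars before any number is computed. It is an illustration only and not the SC-IR implementation: the class names `Count`, `Rate`, and `Frac` mirror the kinds named in the abstract, but the functions `add_kind`, `div_kind`, and `mul_kind`, the exception `ContractTypeError`, and the three rules shown are assumptions made for this example. The abstract does not spell out the six division-aware typing rules, so they are not reproduced here.

```python
from dataclasses import dataclass


class ContractTypeError(Exception):
    """Hypothetical error standing in for a 'typing failure' label."""


@dataclass(frozen=True)
class Count:
    """A quantity of some entity, e.g. Count[apple]."""
    entity: str


@dataclass(frozen=True)
class Rate:
    """A ratio between two different entities, e.g. Rate[km, litre]."""
    numerator: str
    denominator: str


@dataclass(frozen=True)
class Frac:
    """A dimensionless fraction."""


def add_kind(left, right):
    # Assumed rule: addition is only defined between identical kinds.
    if left != right:
        raise ContractTypeError(f"cannot add {left} and {right}")
    return left


def div_kind(left, right):
    # Assumed rule: Count[a] / Count[a] is a Frac,
    # while Count[a] / Count[b] is a Rate[a, b].
    if isinstance(left, Count) and isinstance(right, Count):
        if left.entity == right.entity:
            return Frac()
        return Rate(left.entity, right.entity)
    raise ContractTypeError(f"cannot divide {left} by {right}")


def mul_kind(left, right):
    # Assumed rule: Rate[a, b] * Count[b] cancels b and yields Count[a].
    if (isinstance(left, Rate) and isinstance(right, Count)
            and left.denominator == right.entity):
        return Count(left.numerator)
    raise ContractTypeError(f"cannot multiply {left} by {right}")


if __name__ == "__main__":
    price = div_kind(Count("dollar"), Count("apple"))  # a Rate[dollar, apple]
    total = mul_kind(price, Count("apple"))            # a Count[dollar]
    print(price, total)
    try:
        add_kind(Count("apple"), Count("dollar"))
    except ContractTypeError as err:
        print("rejected:", err)
```

Because the check operates on kinds rather than values, the mismatch is detected without executing any arithmetic, which is the property that plain generated Python lacks.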