How Powerful Are LLMs in Generating Program Specifications?
Abstract
Formal verification provides strong guarantees of software correctness, but its adoption is limited by the high cost of writing precise formal specifications. While recent large language models (LLMs) have demonstrated impressive capabilities in theorem proving and verified code generation, how powerful they truly are at generating program specifications remains unclear. Existing evaluations require either verifying implementation conformance or proving semantic equivalence between specifications, both of which are formidably difficult, yielding sparse and often inconclusive results about specification quality. To address this problem, we introduce Coins, a Coq-based evaluation framework that assesses specification quality through provable behavioral correctness on instantiated test cases. This design aligns the evaluation with the asymmetric nature of formal reasoning, in which successful proofs provide reliable evidence while proof failures are inherently ambiguous. Using Coins, we conduct a large-scale empirical study of specification generation on HumanEval, supported by a curated set of human-written Coq specifications. Our results show that specification generation itself remains a formidable challenge, and that verification complexity substantially obscures genuine differences in specification quality. Overall, we find that accurately evaluating specifications, rather than increasing model capacity alone, is the central challenge in understanding the power of LLMs for specification synthesis, and that test-case-based formal reasoning offers a more faithful and discriminative measure of progress.
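To make the evaluation idea concrete, the following is a minimal Coq sketch, with a hypothetical specification and names rather than the actual Coins harness: a candidate specification is judged by whether its behavioral claims can be proved on concrete input/output pairs, instead of against a full implementation or a reference specification.

(* Minimal sketch (hypothetical spec and names, not the actual
   Coins harness): a candidate specification is scored by proving
   that it holds on concrete test cases. *)
Require Import ZArith Lia.
Open Scope Z_scope.

(* Candidate specification relating input x to output y of an
   absolute-value function. *)
Definition abs_spec (x y : Z) : Prop :=
  0 <= y /\ (y = x \/ y = - x).

(* Test-case instantiation: the spec provably holds on the pair
   (-3, 3); per the asymmetry noted above, a successful proof is
   reliable positive evidence, while a failed proof attempt would
   remain ambiguous. *)
Lemma abs_spec_test : abs_spec (-3) 3.
Proof. unfold abs_spec. lia. Qed.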