Oral in Workshop: Agentic Markets Workshop
Consistency Checks for Language Model Forecasters
Abhimanyu Pallavi Sudhir · Alejandro Alvarez · Adam Shen · Daniel Paleka
Forecasting is a task that is difficult to evaluate: the ground truth can only be known in the future. Recent work showing LLM forecasters rapidly approaching human-level performance raises the question: how can we benchmark and evaluate these forecasters instantaneously, without waiting for the ground truth? Following the consistency check framework, we measure forecasting performance on a given topic by how consistent the predictions on different, logically related questions are. Our main consistency metric is based on arbitrage: for example, if a forecasting AI predicts a 60% probability that the Democratic party wins the 2024 US presidential election and a 60% probability that the Republican party wins, an arbitrageur could trade against the forecaster's predictions and make a guaranteed profit. We build an automated evaluation system: starting from the instruction "query the forecaster's predictions on the topic of X," our system generates a set of base questions, instantiates consistency checks from these questions, elicits the forecaster's predictions, and measures their consistency. We conclude by discussing possible applications of our work in steering and evaluating superhuman AI oracle systems.
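For concreteness, the following is a minimal sketch of the arbitrage idea in the abstract, restricted to the special case of mutually exclusive and exhaustive outcomes; the function name and the per-unit-stake normalization are illustrative assumptions, not the paper's actual metric or implementation.

```python
def arbitrage_violation(probs: list[float]) -> float:
    """Guaranteed profit per unit stake against a forecaster whose
    probabilities for mutually exclusive, exhaustive outcomes are `probs`.

    If the probabilities sum to s != 1, an arbitrageur can buy (when s < 1)
    or sell (when s > 1) one contract on every outcome and lock in |s - 1|,
    since exactly one contract pays out 1.
    """
    if not probs or any(p < 0 or p > 1 for p in probs):
        raise ValueError("probs must be probabilities in [0, 1]")
    return abs(sum(probs) - 1.0)


# Example from the abstract: 60% for the Democratic and 60% for the
# Republican candidate (treated here as exhaustive outcomes) leaves a
# guaranteed profit of about 0.2 per unit traded.
print(arbitrage_violation([0.6, 0.6]))  # ~0.2
```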