

Poster in Workshop: Next Generation of AI Safety

Consistency Checks for Language Model Forecasters

Abhimanyu Pallavi Sudhir · Alejandro Alvarez · Adam Shen · Daniel Paleka

Keywords: [ evaluation ] [ forecasting ] [ markets ] [ trading ] [ eval ] [ robustness ] [ LLM ]


Abstract:

Forecasting is a task that is difficult to evaluate: the ground truth can only be known in the future. Recent work showing LLM forecasters rapidly approaching human-level performance raises the question: how do we even evaluate these predictions? Following the consistency check framework, we measure a forecaster's ability on a given topic by how consistent its predictions are across logically related questions. The main consistency metric we use is one of arbitrage: for example, if a forecasting AI predicts a 60% probability for both the Democratic and Republican parties to win the 2024 US presidential election, an arbitrageur could trade against the forecaster's predictions and make a profit. We build an automated evaluation system: starting from the instruction "query the forecaster's predictions on the topic of X," our system generates a set of base questions, instantiates consistency checks from these questions, elicits the forecaster's predictions, and measures their consistency. We conclude with possible applications of our work to steering and evaluating superhuman AI oracle systems.
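
To make the arbitrage metric concrete, here is a minimal sketch of the check described in the abstract, assuming predictions on mutually exclusive and exhaustive outcomes; the function and variable names are illustrative, not taken from the authors' implementation.

```python
# Hypothetical sketch of an arbitrage-style consistency check for forecasts
# on mutually exclusive, exhaustive outcomes (e.g. which party wins the
# 2024 US presidential election). Names are illustrative only.

def arbitrage_violation(probs):
    """Measure how far the forecasts are from a coherent distribution.

    For mutually exclusive, exhaustive outcomes the probabilities should
    sum to 1; any deviation lets an arbitrageur lock in a risk-free profit
    by trading against the forecaster.
    """
    total = sum(probs)
    return abs(total - 1.0)

# Example from the abstract: 60% assigned to each of two parties.
forecasts = {"Democratic": 0.60, "Republican": 0.60}
violation = arbitrage_violation(forecasts.values())
print(f"Probabilities sum to {sum(forecasts.values()):.2f}; "
      f"arbitrage violation = {violation:.2f}")
# A consistent forecaster would yield a violation of 0.
```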
