Poster
in
Workshop: AI for Science: Scaling in AI for Scientific Discovery
Language Models as Tools for Research Synthesis and Evaluation
Robin Na · Abdullah Almaatouq
Keywords: [ large language models ] [ Integrative Experiment Design ] [ Behavioral Science ] [ Metascience ] [ Research Synthesis ]
Is the academic literature building cumulative knowledge that improves our ability to make predictions under interventions? This question touches not only on the internal validity of individual findings, but also on their external validity and on whether science is a cumulative enterprise that generates collectively more accurate representations of the world. Such synthesis and evaluation face significant challenges, especially in the social and behavioral sciences, due to the complexity of the systems studied and the less structured nature of research outputs. Motivated by these challenges, we propose a novel method that uses large language models (LLMs) and retrieval-augmented generation (RAG) to measure how various sets of academic papers affect the accuracy of predictive models. We elicit LLMs' predictions of the treatment effect of introducing punishment in public goods games (PGG) across 20 varying dimensions of the game design space that exhibit high heterogeneity. We demonstrate the LLM's ability to retrieve academic papers and shift its distribution of predictions in directions expected from the documents' contents. However, we find little evidence that such updates improve the model's predictive accuracy. The framework introduces a method for evaluating the potential contribution and informativeness of scientific literature in prediction tasks, while also contributing a new human behavior dataset of PGG outcomes, carefully collected through an integrative experiment design, that can serve as a benchmark for LLMs' performance in predicting complex human behavior.
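The sketch below illustrates, under stated assumptions, how such a retrieval-augmented elicitation could be set up: a retriever selects paper excerpts relevant to one point in the game design space, and an LLM is prompted for a treatment-effect prediction with or without that retrieved context. The corpus, design-space dimensions, and the `query_llm` client are illustrative placeholders, not the authors' implementation; a TF-IDF retriever stands in for whatever embedding-based retrieval the actual pipeline uses.

```python
# Minimal sketch of a retrieval-augmented prediction loop for treatment-effect
# elicitation. All names and data here are illustrative placeholders.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Toy corpus standing in for abstracts/excerpts of public goods game papers.
papers = [
    "Punishment increases contributions in repeated public goods games ...",
    "Costly punishment has weak effects when group size is large ...",
    "Communication combined with sanctions sustains cooperation ...",
]

def retrieve(query: str, corpus: list[str], k: int = 2) -> list[str]:
    """Return the k corpus documents most similar to the query (TF-IDF cosine)."""
    vec = TfidfVectorizer().fit(corpus + [query])
    doc_matrix = vec.transform(corpus)
    query_vec = vec.transform([query])
    sims = cosine_similarity(query_vec, doc_matrix)[0]
    top = sims.argsort()[::-1][:k]
    return [corpus[i] for i in top]

def query_llm(prompt: str) -> float:
    """Placeholder for an LLM call returning a point prediction of the treatment
    effect (e.g., change in mean contribution). Replace with a real model client;
    repeated sampling would yield a predictive distribution rather than a point."""
    raise NotImplementedError

def predict_effect(design_point: dict, use_rag: bool) -> float:
    """Elicit a treatment-effect prediction for one cell of the design space,
    optionally conditioning the prompt on retrieved literature."""
    query = (
        "Predict the treatment effect of introducing punishment in a public "
        f"goods game with design parameters: {design_point}."
    )
    context = "\n".join(retrieve(query, papers)) if use_rag else ""
    prompt = f"{context}\n\n{query}\nAnswer with a single number."
    return query_llm(prompt)

# One cell of the design space (dimension names are hypothetical).
design_point = {"group_size": 4, "rounds": 10, "punishment_cost_ratio": 3}
# Compare predictions with and without retrieved papers, then score both
# against the observed effect in the integrative-experiment dataset.
```

Comparing the with-retrieval and without-retrieval predictions against observed effects across all design cells is what operationalizes the question of whether a given set of papers improves predictive accuracy.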