Towards Diverse Scientific Hypothesis Search with Large Language Models
Abstract
Large language models are increasingly used to accelerate scientific discovery, especially in the iterative search for scientific hypotheses. Yet in many discovery settings the goal is not to identify a single ``best'' hypothesis: validation is noisy and expensive, multiple hypotheses can remain plausible, and scientists benefit from a set of high-quality but meaningfully diverse hypotheses that hedge against downstream uncertainty. Nevertheless, commonly used evolutionary search recipes tend to underemphasize this requirement, implicitly prioritizing optimization over exploration, and the resulting selection pressure during search leads to diversity collapse. Motivated by these limitations, we formulate hypothesis search as a sampling problem, where the objective is to efficiently produce diverse, high-quality hypotheses under a fixed validation budget. Building on this perspective, we propose EvoDiverse, an evolutionary framework inspired by the classical parallel tempering algorithm that searches for hypotheses at multiple temperature levels and enables principled information exchange across temperatures, improving exploration without disrupting convergence. Across domains including molecular discovery, equation discovery, and algorithm discovery, our approach consistently improves both hypothesis quality and diversity under the same validation budget, and produces candidate sets that remain robust under more expensive downstream computational validations.
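To make the multi-temperature mechanism concrete, the following is a minimal sketch of classical parallel tempering on a toy objective. It is not the paper's EvoDiverse implementation (which operates over LLM-generated hypotheses); all function names here are illustrative. Hotter replicas (low beta) explore freely, colder replicas (high beta) exploit, and periodic Metropolis swaps between adjacent temperatures exchange information across levels:

```python
import math
import random

def metropolis_swap(energy_i, energy_j, beta_i, beta_j, rng):
    """Standard parallel-tempering swap criterion:
    accept with probability min(1, exp((beta_i - beta_j) * (E_i - E_j)))."""
    delta = (beta_i - beta_j) * (energy_i - energy_j)
    return math.log(rng.random()) < delta

def parallel_tempering(energy, propose, init, betas, steps, seed=0):
    """Run one replica per inverse temperature in `betas`; lower beta
    means a hotter, more exploratory chain. `energy` is minimized."""
    rng = random.Random(seed)
    states = [init for _ in betas]
    energies = [energy(s) for s in states]
    for _ in range(steps):
        # Within-temperature Metropolis moves.
        for k, beta in enumerate(betas):
            cand = propose(states[k], rng)
            e = energy(cand)
            if math.log(rng.random()) < -beta * (e - energies[k]):
                states[k], energies[k] = cand, e
        # Attempt state swaps between adjacent temperature levels.
        for k in range(len(betas) - 1):
            if metropolis_swap(energies[k], energies[k + 1],
                               betas[k], betas[k + 1], rng):
                states[k], states[k + 1] = states[k + 1], states[k]
                energies[k], energies[k + 1] = energies[k + 1], energies[k]
    return states

# Toy usage: minimize x^2 starting far from the optimum.
final_states = parallel_tempering(
    energy=lambda x: x * x,
    propose=lambda x, rng: x + rng.gauss(0, 1),
    init=5.0,
    betas=[0.1, 0.5, 2.0],  # hot -> cold
    steps=500,
)
```

The swap step is what distinguishes this from running independent chains: a good state found by a hot, exploratory replica can migrate to a cold replica for refinement, which is the kind of cross-temperature information exchange the abstract refers to.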