Poster in Workshop: Data-centric Machine Learning Research (DMLR): Datasets for Foundation Models
BioinformaticsBench: A collaboratively built large language model benchmark for Bioinformatics reasoning
Varuni Sarwal · Seungmo Lee · Rosemary He · Aingela Kattapuram · Xiaoxuan Wang · Yijia Xiao · Serghei Mangul · Wei Wang
Most existing Large Language Model (LLM) benchmarks for bioinformatics reasoning focus on problems grounded in niche research domains, where datasets contain only a small number of samples and are therefore not representative of the broad field of bioinformatics. To systematically examine the reasoning capabilities required to solve complex bioinformatics problems, we introduce BioinformaticsBench, an expansive benchmark suite for LLMs. BioinformaticsBench contains a carefully curated dataset of collegiate-level scientific problems spanning several bioinformatics domains, including genetics, genomics, single-cell analysis, proteomics, and metagenomics. Using this dataset, we conduct an in-depth benchmarking study of representative open-source and proprietary LLMs under various prompting strategies. The results reveal that current LLMs can deliver satisfactory performance, with an overall best score of 74%. Furthermore, through a detailed user study, we categorize the errors made by LLMs into ten problem-solving abilities. Our analysis indicates that while different models have different domains of expertise, GPT-4o is the best-performing model overall. We envision that BioinformaticsBench will catalyze further development of the reasoning abilities of LLMs, thereby ultimately contributing to scientific research and discovery.
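The abstract does not describe the evaluation harness itself, but the prompting-strategy comparison it mentions could look roughly like the minimal sketch below. The JSON-style question format, the exact-match `grade` helper, and the use of the OpenAI chat API with GPT-4o are illustrative assumptions, not the authors' actual pipeline.

```python
# Minimal sketch (assumption): scoring an LLM on benchmark-style bioinformatics
# questions under two prompting strategies (zero-shot vs. chain-of-thought).
# The question format, grading rule, and model choice are illustrative only.
from openai import OpenAI

client = OpenAI()  # requires OPENAI_API_KEY in the environment

# Hypothetical benchmark items: each has a question and a short reference answer.
QUESTIONS = [
    {"question": "A diploid organism has 2n = 16 chromosomes. "
                 "How many chromosomes does one of its gametes contain?",
     "answer": "8"},
]

PROMPTS = {
    "zero_shot": "Answer with only the final answer.\n\nQ: {question}\nA:",
    "chain_of_thought": ("Think step by step, then give the final answer "
                         "on a new line starting with 'Answer:'.\n\nQ: {question}"),
}

def ask(model: str, prompt: str) -> str:
    """Query a chat model and return its text response."""
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return resp.choices[0].message.content

def grade(response: str, reference: str) -> bool:
    """Toy grader (assumption): substring match against the reference answer."""
    return reference.strip().lower() in response.strip().lower()

for strategy, template in PROMPTS.items():
    correct = sum(
        grade(ask("gpt-4o", template.format(question=q["question"])), q["answer"])
        for q in QUESTIONS
    )
    print(f"{strategy}: {correct}/{len(QUESTIONS)} correct")
```

In practice a benchmark of this kind would swap the toy grader for domain-appropriate answer checking and iterate over the full curated question set; the sketch only illustrates how per-strategy accuracy could be tallied.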