Poster
in
Workshop: Data-centric Machine Learning Research (DMLR): Datasets for Foundation Models

BioinformaticsBench: A collaboratively built large language model benchmark for Bioinformatics reasoning

Varuni Sarwal · Seungmo Lee · Rosemary He · Aingela Kattapuram · Xiaoxuan Wang · Yijia Xiao · Serghei Mangul · Wei Wang


Abstract:

Most existing Large Language Model (LLM) benchmarks for bioinformatics reasoning focus on problems grounded in niche research domains, where datasets contain a small number of samples and are therefore not truly representative of the broad field of bioinformatics. To systematically examine the reasoning capabilities required for solving complex bioinformatics problems, we introduce BioinformaticsBench, an expansive benchmark suite for LLMs. BioinformaticsBench contains a carefully curated dataset featuring a range of collegiate-level scientific problems from several bioinformatics domains, including genetics, genomics, single-cell analysis, proteomics, and metagenomics. Based on this dataset, we conduct an in-depth benchmarking study of representative open-source and proprietary LLMs under various prompting strategies. The results reveal that current LLMs deliver satisfactory performance, with an overall best score of 74%. Furthermore, through a detailed user study, we categorize the errors made by LLMs according to ten problem-solving abilities. Our analysis indicates that while different models have different domains of expertise, GPT-4o is the best-performing model overall. We envision that BioinformaticsBench will catalyze further developments in the reasoning abilities of LLMs, ultimately contributing to scientific research and discovery.