

Poster in Workshop: Accessible and Efficient Foundation Models for Biological Discovery

BioinformaticsBench: A collaboratively built large language model benchmark for Bioinformatics reasoning

Varuni Sarwal · Seungmo Lee · Rosemary He · Aingela Kattapuram · Mandy Wang · Eleazar Eskin · Wei Wang · Serghei Mangul

Keywords: [ dataset ] [ LLMs ] [ benchmarking ]


Abstract:

Most existing Large Language Model (LLM) benchmarks for bioinformatics reasoning focus on problems grounded in niche research domains, where datasets contain a small number of samples and are therefore not truly representative of the broad field of bioinformatics. To systematically examine the reasoning capabilities required to solve complex bioinformatics problems, we introduce BioinformaticsBench, an expansive benchmark suite for LLMs. BioinformaticsBench contains a carefully curated dataset featuring a range of collegiate-level scientific problems from several bioinformatics domains, such as genetics, genomics, single-cell analysis, proteomics, and metagenomics. Based on this dataset, we conduct an in-depth benchmarking study of representative open-source and proprietary LLMs with various prompting strategies. The results reveal that current LLMs deliver satisfactory performance, with an overall best score of 74%. Furthermore, through a detailed user study, we categorize the errors made by LLMs into ten problem-solving abilities. Our analysis indicates that while different models have different domains of expertise, GPT-4o is the best-performing model overall. We envision that BioinformaticsBench will catalyze further developments in the reasoning abilities of LLMs, ultimately contributing to scientific research and discovery.
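As an illustration of the kind of benchmarking loop the abstract describes, the sketch below scores a chat model on a toy multiple-choice item under two prompting strategies (zero-shot and chain-of-thought). The dataset item, prompt templates, and answer-extraction rule are hypothetical stand-ins, not the authors' released code; only the GPT-4o model name comes from the abstract.

```python
# Minimal sketch of an LLM benchmarking loop, assuming an OpenAI-style
# chat API. The question item, prompts, and scoring rule are illustrative
# assumptions, not BioinformaticsBench itself.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Toy stand-in for a BioinformaticsBench-style multiple-choice item.
questions = [
    {
        "question": "Which file format stores aligned sequencing reads?",
        "choices": {"A": "FASTA", "B": "BAM", "C": "BED", "D": "VCF"},
        "answer": "B",
    },
]

PROMPTS = {
    "zero_shot": "Answer with the letter of the correct choice only.",
    "chain_of_thought": "Think step by step, then end with 'Answer: <letter>'.",
}

def ask(model: str, strategy: str, item: dict) -> str:
    """Query the model with one question under the given prompting strategy."""
    choices = "\n".join(f"{k}. {v}" for k, v in item["choices"].items())
    response = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": PROMPTS[strategy]},
            {"role": "user", "content": f"{item['question']}\n{choices}"},
        ],
    )
    return response.choices[0].message.content

def accuracy(model: str, strategy: str) -> float:
    """Naive scoring: take the last A-D letter in the reply as the answer."""
    correct = 0
    for item in questions:
        reply = ask(model, strategy, item)
        letters = [c for c in reply if c in item["choices"]]
        if letters and letters[-1] == item["answer"]:
            correct += 1
    return correct / len(questions)

if __name__ == "__main__":
    for strategy in PROMPTS:
        print(strategy, accuracy("gpt-4o", strategy))
```

In practice, answer extraction for chain-of-thought outputs typically needs a more robust parser than the last-letter heuristic used here.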
