

Poster in Workshop: Knowledge and Logical Reasoning in the Era of Data-driven Learning

Evaluating the Causal Reasoning Abilities of Large Language Models

Isha Puri · Hima Lakkaraju


Abstract:

Large language models have developed at a breathtaking pace, quickly advancing in their ability to generate, summarize, and work with long and short-form text. As these advances become further integrated into society, however, it becomes necessary to question and evaluate whether these models are actually capable of true reasoning, rather than simply mimicking their large training corpora. We argue that eliciting reasoning from language models is the new "explainability method" and introduce CReDETS, a first-of-its-kind causal reasoning dataset with annotated, hand-written explanations. We benchmark the latest generation of transformer-based language models, GPT-3, GPT-3.5 (ChatGPT), and GPT-4, and discuss their accuracy, coherence, and consistency. Our results show that even the most recent LLMs have stark weaknesses in causal reasoning ability that must be addressed before they can be integrated into public-facing applications.
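To make the benchmarking setup concrete, below is a minimal sketch of how one might query a GPT-family model on a causal reasoning item and compare its output against an annotated explanation. The item fields, prompt wording, and scoring step are illustrative assumptions, not the paper's actual CReDETS schema or evaluation protocol; only the OpenAI chat-completions call reflects a real API.

```python
# Hypothetical sketch: querying an LLM with a causal reasoning item.
# The dataset fields and prompt below are assumptions for illustration,
# not the CReDETS format described in the paper.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

item = {
    "question": "The grass was wet this morning, but it did not rain. "
                "What is a plausible cause?",
    "reference": "The sprinklers ran overnight.",  # annotated explanation
}

response = client.chat.completions.create(
    model="gpt-4",  # swap in "gpt-3.5-turbo" to compare model generations
    messages=[
        {"role": "system",
         "content": "Answer the causal question and briefly explain your reasoning."},
        {"role": "user", "content": item["question"]},
    ],
)

prediction = response.choices[0].message.content
# A real evaluation would score `prediction` against the annotated
# explanation (e.g., with human judges or a semantic-similarity metric);
# here we simply print both for inspection.
print("Model:", prediction)
print("Reference:", item["reference"])
```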
