Poster in Workshop: Knowledge and Logical Reasoning in the Era of Data-driven Learning
Evaluating the Causal Reasoning Abilities of Large Language Models
Isha Puri · Hima Lakkaraju
Large language models have developed at a breathtaking pace, quickly advancing in their ability to generate, summarize, and work with long- and short-form text. As these models become further integrated into society, however, it becomes necessary to question and evaluate how well they are actually capable of true reasoning, rather than simply mimicking their large training corpora. We argue that eliciting reasoning from language models is the new "explainability method" and introduce CReDETS, a novel and first-of-its-kind causal reasoning dataset with annotated, hand-written explanations. We benchmark the latest and most powerful generation of transformer-based models, GPT-3, GPT-3.5 (ChatGPT), and GPT-4, and discuss their accuracy, coherence, and consistency. Our results reveal that even the most recent LLMs have stark weaknesses in reasoning ability that must be addressed before they can be integrated into public-facing applications worldwide.