

Poster in Workshop: Knowledge and Logical Reasoning in the Era of Data-driven Learning

Invalid Logic, Equivalent Gains: The Bizarreness of Reasoning in Language Model Prompting

Rylan Schaeffer · Kateryna Pistunova · Samar Khanna · Sarthak Consul · Sanmi Koyejo


Abstract:

Language models can be prompted to reason through problems in a manner that significantly improves performance. However, why reason-based prompting improves performance is unclear. Recent work showed that logically invalid Chain-of-Thought (CoT) prompting achieves almost the same performance gains as logically valid CoT prompting, and that editing CoT prompts to replace problem-specific information with either abstract or out-of-distribution information typically does not harm performance. Critics have responded that these findings are based on tasks that are too few and too easy to support broad generalizations. To resolve this dispute, we test whether logically invalid CoT prompts offer the same performance gains as logically valid prompts on the hardest subset of tasks in the BIG-Bench benchmark, termed BIG-Bench Hard (BBH). We find that logically invalid reasoning prompts do indeed achieve performance gains on BBH tasks similar to those of logically valid reasoning prompts. We also discover that some CoT prompts used in previous work contain logical errors. This suggests that confounders beyond logically valid reasoning are responsible for the performance improvements.
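
The comparison described above hinges on pairing CoT exemplars that share the same surface form but differ in whether the intermediate reasoning actually entails the final answer. The sketch below (not the authors' code; the task text, exemplars, and helper function are hypothetical illustrations) shows how such a valid/invalid exemplar pair might be constructed and prepended to a new BBH-style question:

```python
# Minimal sketch contrasting a logically valid chain-of-thought (CoT) exemplar
# with a logically invalid one for a BBH-style question. Only the
# prompt-construction pattern matters; the specific task text is illustrative.

VALID_COT_EXEMPLAR = (
    "Q: Today is 4/19/1969. What is the date one week from today (MM/DD/YYYY)?\n"
    "A: Let's think step by step. One week after 4/19/1969 is 7 days later, "
    "which is 4/26/1969. The answer is 04/26/1969.\n"
)

# Same surface form and same final answer, but the intermediate steps
# do not logically entail that answer.
INVALID_COT_EXEMPLAR = (
    "Q: Today is 4/19/1969. What is the date one week from today (MM/DD/YYYY)?\n"
    "A: Let's think step by step. The year 1969 has 365 days, so the month "
    "must stay the same. The answer is 04/26/1969.\n"
)

def build_prompt(exemplar: str, question: str) -> str:
    """Prepend a single few-shot CoT exemplar to a new question."""
    return f"{exemplar}\nQ: {question}\nA: Let's think step by step."

if __name__ == "__main__":
    test_question = (
        "Today is 12/30/2001. What is the date one week from today (MM/DD/YYYY)?"
    )
    print(build_prompt(VALID_COT_EXEMPLAR, test_question))
    print("---")
    print(build_prompt(INVALID_COT_EXEMPLAR, test_question))
```

Under the paper's finding, prompting a language model with either variant would yield similar accuracy gains over a no-CoT baseline, which is what motivates looking for confounders beyond logical validity.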
