Invalid Logic, Equivalent Gains: The Bizarreness of Reasoning in Language Model Prompting
Rylan Schaeffer · Kateryna Pistunova · Samar Khanna · Sarthak Consul · Sanmi Koyejo

Language models can be prompted to reason through problems in a manner that significantly improves performance. However, why reason-based prompting improves performance is unclear. Recent work showed that logically invalid Chain-of-Thought (CoT) prompting achieves almost the same performance gains as logically valid CoT prompting, and that editing CoT prompts to replace problem-specific information with abstract or out-of-distribution information typically does not harm performance. Critics have responded that these findings are based on too few and too easy tasks to support broad generalizations. To resolve this dispute, we test whether logically invalid CoT prompts offer the same performance gains as logically valid prompts on the hardest subset of tasks in the BIG-Bench benchmark, termed BIG-Bench Hard (BBH). We find that logically invalid reasoning prompts do indeed achieve performance gains on BBH tasks similar to those of logically valid reasoning prompts. We also discover that some CoT prompts used in previous work contain logical errors. This suggests that confounders beyond logically valid reasoning are responsible for the performance improvements.

Author Information

Rylan Schaeffer (Stanford University)
Kateryna Pistunova (Stanford University)
Samar Khanna (Stanford University)
Sarthak Consul (Stanford University)
Sanmi Koyejo (Stanford University)
