Chain-of-thought (CoT) prompting allows language models to verbalise multi-step rationales before producing their final answer. While this technique often boosts task performance and offers an impression of transparency into the model's reasoning, we argue that rationales generated by current CoT techniques can be misleading and are neither necessary nor sufficient for trustworthy interpretability.
By analysing faithfulness in terms of whether CoTs are not only human-interpretable but also reflect the model's underlying reasoning in a way that supports responsible use, we synthesise evidence from previous studies. We show that verbalised chains are frequently unfaithful, diverging from the hidden computations that actually drive a model's predictions and giving an incorrect picture of how models arrive at their conclusions. Despite this, CoT is increasingly relied upon in high-stakes domains such as medicine, law, and autonomous systems: our analysis of 1,000 recent CoT-centric papers finds that ~25% explicitly treat CoT as an interpretability technique, and among them, papers in high-stakes domains lean especially heavily on this interpretability claim.
Building on prior work in interpretability, we make three proposals: (i) avoid treating CoT as sufficient for interpretability without additional verification, while continuing to use CoT for its communicative benefits; (ii) adopt rigorous methods that assess faithfulness for downstream decision-making; and (iii) develop causal validation methods (e.g., activation patching, counterfactual interventions, verified models) to ground explanations in model internals.
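To make proposal (iii) concrete, the following is a minimal sketch of what activation patching can look like in practice; it is not the paper's implementation, and it assumes a PyTorch transformer with hookable submodules, with the module name and input dictionaries as hypothetical placeholders.

```python
# Minimal activation-patching sketch (hedged illustration, not a reference
# implementation). Assumes `model` is a PyTorch module and that names like
# "transformer.h.6.mlp" and the input dicts are hypothetical placeholders.
import torch


def get_activation(model, module_name, inputs):
    """Run `model` on `inputs` and cache one submodule's output."""
    cache = {}
    module = dict(model.named_modules())[module_name]

    def hook(_module, _inp, out):
        cache["act"] = out.detach()

    handle = module.register_forward_hook(hook)
    with torch.no_grad():
        model(**inputs)
    handle.remove()
    return cache["act"]


def run_with_patch(model, module_name, inputs, patched_act):
    """Re-run `model` on `inputs`, overwriting one submodule's output."""
    module = dict(model.named_modules())[module_name]

    def hook(_module, _inp, _out):
        # Returning a value from a forward hook replaces the module output.
        return patched_act

    handle = module.register_forward_hook(hook)
    with torch.no_grad():
        out = model(**inputs)
    handle.remove()
    return out


# Hypothetical usage: cache an activation from a "clean" prompt, patch it
# into a "corrupted" prompt, and check whether the prediction moves with it.
# clean_act = get_activation(model, "transformer.h.6.mlp", clean_inputs)
# patched_out = run_with_patch(model, "transformer.h.6.mlp", corrupt_inputs, clean_act)
```

If the patched activation causally shifts the output toward the clean-prompt answer, that component carries information the verbalised chain may or may not reflect, which is the kind of internal grounding the proposal calls for.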