Chain-of-thought (CoT) prompting allows language models to verbalise multi-step rationales before producing their final answer. While this technique often boosts task performance and offers an impression of transparency into the model's reasoning, we argue that rationales generated by current CoT techniques can be misleading and are neither necessary nor sufficient for trustworthy interpretability.
By analysing faithfulness in terms of whether CoTs are not only human-interpretable but also reflect the model's underlying reasoning in a way that supports responsible use, we synthesise evidence from previous studies. We show that verbalised chains are frequently unfaithful, diverging from the hidden computations that actually drive a model's predictions and giving an incorrect picture of how models arrive at their conclusions. Despite this, CoT is increasingly relied upon in high-stakes domains such as medicine, law, and autonomous systems: our analysis of 1,000 recent CoT-centric papers finds that ~25% explicitly treat CoT as an interpretability technique, and among them, papers in high-stakes domains lean especially heavily on this interpretability claim.
Building on prior work in interpretability, we make three proposals: (i) avoid treating CoT as sufficient for interpretability without additional verification, while continuing to use CoT for its communicative benefits; (ii) adopt rigorous methods that assess faithfulness for downstream decision-making; and (iii) develop causal validation methods (e.g., activation patching, counterfactual interventions, verified models) to ground explanations in model internals.
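To make proposal (iii) concrete, the following is a minimal sketch of what activation patching can look like in practice; it is not the paper's implementation, and it assumes a PyTorch transformer with hookable submodules, with the module name and input dictionaries as hypothetical placeholders.

```python
# Minimal activation-patching sketch (hedged illustration, not a reference
# implementation). Assumes `model` is a PyTorch module and that names like
# "transformer.h.6.mlp" and the input dicts are hypothetical placeholders.
import torch


def get_activation(model, module_name, inputs):
    """Run `model` on `inputs` and cache one submodule's output."""
    cache = {}
    module = dict(model.named_modules())[module_name]

    def hook(_module, _inp, out):
        cache["act"] = out.detach()

    handle = module.register_forward_hook(hook)
    with torch.no_grad():
        model(**inputs)
    handle.remove()
    return cache["act"]


def run_with_patch(model, module_name, inputs, patched_act):
    """Re-run `model` on `inputs`, overwriting one submodule's output."""
    module = dict(model.named_modules())[module_name]

    def hook(_module, _inp, _out):
        # Returning a value from a forward hook replaces the module output.
        return patched_act

    handle = module.register_forward_hook(hook)
    with torch.no_grad():
        out = model(**inputs)
    handle.remove()
    return out


# Hypothetical usage: cache an activation from a "clean" prompt, patch it
# into a "corrupted" prompt, and check whether the prediction moves with it.
# clean_act = get_activation(model, "transformer.h.6.mlp", clean_inputs)
# patched_out = run_with_patch(model, "transformer.h.6.mlp", corrupt_inputs, clean_act)
```

If the patched activation causally shifts the output toward the clean-prompt answer, that component carries information the verbalised chain may or may not reflect, which is the kind of internal grounding the proposal calls for.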