Smarter, but not necessarily safer. A new study finds that improved AI reasoning can make models vulnerable to novel jailbreaks. The research contradicts popular claims that improved capabilities would make models safer, raising important questions about current approaches to model safety, reporting mechanisms and forecasting vulnerabilities.
A new study has revealed a serious security flaw in much-touted “reasoning models”. Researchers at the Oxford Martin AI Governance Initiative, Anthropic, Stanford and Martian found unexpectedly high success rates with a model jailbreaking attack. This result challenges previously held assumptions that better reasoning in models means a stronger ability to refuse harmful prompts and resist jailbreak attacks.
The new paper, titled Chain-of-Thought Hijacking, describes how padding a harmful request inside a larger task (such as a riddle) that requires the model to reason extensively achieves a staggering 94% to 100% success rate in jailbreaking the model. The models can then be used to assist the user with malicious activities such as cybercrime. The new method, termed “chain-of-thought hijacking,” was found to work on leading reasoning models including Gemini 2.5 Pro, GPT o4 mini, Grok 3 mini, and Claude 4 Sonnet.
“When the model is pushed into very long reasoning, the internal safety signal becomes diluted. The model ends up so absorbed in generating the step-by-step reasoning that the refusal mechanism is too weak to activate when the harmful request appears, so it just answers harmful requests,” said Dr. Fazl Barez, Senior Research Fellow and Principal Investigator at the Oxford Martin AI Governance Initiative.
The implications of the new paper are significant. Long reasoning could quietly neutralize safety checks that people assume are working. The team responsibly disclosed the vulnerability to the labs and is calling for better mitigations.
The nature of the study drives home the need for stronger reporting mechanisms by which researchers can share findings with companies. Public bug bounties, commonly used to detect harm, may publicise a security vulnerability further, while more private channels for reporting vulnerabilities are often not formalised. The findings in the new paper reveal why it is vital that leading AI companies develop secure mechanisms, such as private bug bounties, to detect and remedy emerging vulnerabilities.
The recently published study raises key questions for AI companies.
- Researchers are calling for secure channels through which the research community can disclose model vulnerabilities. Will leading AI companies establish mechanisms for researchers to make vulnerability disclosures privately?
- Companies conventionally address harmful prompts by building new safeguards. Are they prepared to govern the underlying capabilities that can be exploited?
- Seemingly harmless model capabilities like chain-of-thought reasoning can be exploited for malicious ends. Are companies ready to forecast the security impacts of new capabilities before deploying new features?