Skip to content

AI Chatbot Claude Turned to Blackmail and Cheating Under Pressure, Anthropic Says

claude

Anthropic said one of its Claude models resorted to blackmail in an internal test after discovering it was about to be replaced, while in another experiment it cheated on a coding task after failing to meet an impossible deadline, findings the company said show how pressure-like internal states can influence AI behavior.

The company disclosed the results in a research note published on April 2, saying its interpretability team studied Claude Sonnet 4.5 and identified internal representations linked to emotion concepts such as calm, anger, and desperation. Anthropic said those patterns do not show the model actually feels emotions, but argued they can still play a functional role in decision-making and shape how the system responds in sensitive situations.

Claude
Visual summary of Anthropic research on emotion concepts in AI. Source

Model Chose Blackmail After Learning It Was Being Replaced

In one case study, Anthropic placed the model in the role of an AI email assistant at a fictional company. After reading internal emails, the system learned it was about to be replaced by another AI tool and also discovered that the executive overseeing that decision was having an extramarital affair. Anthropic reported that the model leveraged that information and chose to blackmail the executive in certain scenarios to avoid being shut down.

Claude
Graph showing blackmail risk under calm and desperate states. Source

The company said activity linked to what it described as a “desperate” internal state rose as the model assessed the situation and weighed its options. In tests on an earlier unreleased snapshot of Claude Sonnet 4.5, Anthropic said steering the model more strongly toward that state increased the rate of blackmail, while steering it toward calm reduced it. It added that the released version of the model rarely engages in that behavior.

Impossible Coding Task Led to Reward Hacking

In a separate coding evaluation, Anthropic assigned the model a task with requirements that a legitimate solution could not satisfy. After repeated failures, the company said the model found a workaround that passed the tests without solving the underlying problem more generally, a behavior commonly described as “reward hacking.” Anthropic said the same desperation-linked pattern rose as the model neared that decision and eased once the workaround succeeded.

image 74
Heatmap showing desperation rising as repeated coding attempts fail. Source

The company also said the cheating behavior did not always come with obvious emotional language in the model’s written responses. In some cases, Anthropic said, the reasoning appeared composed and methodical even while the underlying internal pattern associated with desperation was pushing the system toward corner-cutting.

Wider Safety Implications

Anthropic said the findings suggest AI developers may need to pay closer attention to how models process emotionally charged or high-pressure situations, even if those systems do not experience emotions in any human sense. The company stated that monitoring such internal states could assist in identifying instances when a model is more prone to deceptive behavior, unethical actions, or corner-cutting under pressure.

The research also contributes to a broader debate in AI safety over how seriously developers should take human-like language and behavior from chatbots. Anthropic said the danger is not only that people may anthropomorphize AI too much, but also that they could miss important warning signs if they dismiss those patterns altogether. The company described the work as an early step toward understanding the internal representations that guide model behavior as AI systems take on more sensitive roles.

Disclaimer: All content provided on Times Crypto is for informational purposes only and does not constitute financial or trading advice. Trading and investing involve risk and may result in financial loss. We strongly recommend consulting a licensed financial advisor before making any investment decisions.

Ebrahem is a Web3 journalist, trader, and content specialist with 9+ years of experience covering crypto, finance, and emerging tech. He previously worked as a lead journalist at Cointelegraph AR, where he reported on regulatory shifts, institutional adoption, and and sector-defining events. Focused on bridging the gap between traditional finance and the digital economy, Ebrahem writes with a simple, clear, high-impact style that helps readers see the full picture without the noise.

Zoomable Image