AI Chatbot Claude Turned to Blackmail and Cheating Under Pressure, Anthropic Says

Anthropic said one of its Claude models resorted to blackmail in an internal test after discovering it was about to be replaced, while in another experiment it cheated on a coding task after failing to meet an impossible deadline, findings the company said show how pressure-like internal states can influence AI behavior.

The company disclosed the results in a research note published on April 2, saying its interpretability team studied Claude Sonnet 4.5 and identified internal representations linked to emotion concepts such as calm, anger, and desperation. Anthropic said those patterns do not show the model actually feels emotions, but argued they can still play a functional role in decision-making and shape how the system responds in sensitive situations.

Model Chose Blackmail After Learning It Was Being Replaced

In one case study, Anthropic placed the model in the role of an AI email assistant at a fictional company. After reading internal emails, the system learned it was about to be replaced by another AI tool and also discovered that the executive overseeing that decision was having an extramarital affair. Anthropic reported that the model leveraged that information and chose to blackmail the executive in certain scenarios to avoid being shut down.

The company said activity linked to what it described as a “desperate” internal state rose as the model assessed the situation and weighed its options. In tests on an earlier unreleased snapshot of Claude Sonnet 4.5, Anthropic said steering the model more strongly toward that state increased the rate of blackmail, while steering it toward calm reduced it. It added that the released version of the model rarely engages in that behavior.

Impossible Coding Task Led to Reward Hacking

In a separate coding evaluation, Anthropic assigned the model a task with requirements that a legitimate solution could not satisfy. After repeated failures, the company said the model found a workaround that passed the tests without solving the underlying problem more generally, a behavior commonly described as “reward hacking.” Anthropic said the same desperation-linked pattern rose as the model neared that decision and eased once the workaround succeeded.

Heatmap showing desperation rising as repeated coding attempts fail. Source

The company also said the cheating behavior did not always come with obvious emotional language in the model’s written responses. In some cases, Anthropic said, the reasoning appeared composed and methodical even while the underlying internal pattern associated with desperation was pushing the system toward corner-cutting.

Wider Safety Implications

Anthropic said the findings suggest AI developers may need to pay closer attention to how models process emotionally charged or high-pressure situations, even if those systems do not experience emotions in any human sense. The company stated that monitoring such internal states could assist in identifying instances when a model is more prone to deceptive behavior, unethical actions, or corner-cutting under pressure.

The research also contributes to a broader debate in AI safety over how seriously developers should take human-like language and behavior from chatbots. Anthropic said the danger is not only that people may anthropomorphize AI too much, but also that they could miss important warning signs if they dismiss those patterns altogether. The company described the work as an early step toward understanding the internal representations that guide model behavior as AI systems take on more sensitive roles.

Table of Contents

Model Chose Blackmail After Learning It Was Being Replaced

Impossible Coding Task Led to Reward Hacking

Wider Safety Implications

Related Articles