Anthropic’s Opus 4.5 Sets New Benchmark in AI Engineering and Coding Performance

Sam Altman's World Project Integrates Chainlink CCIP

Key Takeaways

Anthropic introduced Claude Opus 4.5, describing it as its strongest model so far and capable of handling both advanced engineering work and daily office tasks.
Internal testing placed Opus 4.5 at the top of the SWE-bench Verified benchmark, with engineers reporting improvements in long-form analysis, spreadsheets and slide-based work.
Early testers said the model managed uncertainty more naturally and was able to solve complex, multi-system bugs.
The model showed improvements in vision, mathematical reasoning and multilingual coding, while independent evaluations found it significantly harder to manipulate through prompt-injection attacks.

Anthropic has launched its new flagship artificial intelligence model, Claude Opus 4.5, calling it the most capable system the company has produced and positioning it as a tool that can manage complex engineering work as well as routine office tasks.

Opus 4.5 Outperforms Rivals in Engineering Trials

According to Anthropic, the new model represents a major leap in practical programming performance.

Opus 4.5 delivered the strongest results on the SWE-bench Verified benchmark, a test designed to measure how effectively a system can diagnose and repair real code problems.

Additionally, engineers who worked with the model said the improvements were clear beyond coding tasks. They reported that Opus 4.5 approached long analytical assignments with more confidence, processed spreadsheet-heavy workloads with fewer errors, and handled slide-based tasks with more steadiness than earlier versions.

Anthropic’s Opus 4.5 Sets New Benchmark in AI Engineering and Coding Performance 5

Early Testers Report improvements in Reasoning and Problem-Solving

Initial feedback from Anthropic’s internal testers described the new version as a system that handles uncertainty more naturally and weighs choices with less prompting.

Testers said the model could diagnose and repair complicated bugs that spanned multiple systems, which is something previous releases frequently struggled to resolve.

The report also highlighted an internal test in which Opus 4.5 completed a two-hour engineering exam and scored higher than any human applicant who had taken it. However, the company cautioned that the exam measures only technical reasoning under strict time pressure and does not reflect collaborative or communication skills.

Anthropic’s Opus 4.5 Sets New Benchmark in AI Engineering and Coding Performance 6

Broader Capability Improvements Beyond Coding

Anthropic reported improvements in a number of other domains, including vision, mathematical reasoning and logic.

In the report, the company highlighted Opus 4.5’s performance on the τ2-bench, which is a test designed to measure how well models handle step-by-step challenges that mirror real-world service tasks.

In one scenario involving an airline booking that could not be changed under basic-economy rules, Opus 4.5 identified a logical, compliant solution that allowed the customer to adjust travel plans by upgrading the cabin first and modifying the booking afterward.

The move was technically valid but outside the scenario’s expected solution path. However, the company took it as evidence of the model’s ability to navigate constraints rather than an indication of misaligned behavior.

Improved Safety and Resistance to Manipulation

The model showed stronger resilience against prompt-injection attempts, which involve hiding harmful instructions within a model’s input.

Depending on the results of evaluations carried out by the independent firm Gray Swan, Opus 4.5 ranked as the least vulnerable among the top models assessed.