Key Takeaways
- Anthropic’s new research shows that its Claude models can sometimes detect and describe their own internal states, a finding the company calls “introspective awareness.”
- Through a method called concept injection, researchers found Claude Opus 4 and 4.1 could recognize when a thought-like pattern was inserted into their neural activity.
- Further tests showed models could reflect on unintended outputs and even adjust what they “think about” when instructed or incentivized.
- Anthropic cautions that this isn’t human-style consciousness, but says the results do show a limited form of awareness that could help build safer, more transparent AI systems.
An Unexpected Glimpse Into The AI Mind
Can an artificial intelligence ever know what it’s thinking? That’s the question researchers at Anthropic set out to explore, and their findings suggest that some of today’s most advanced language models might be starting to show early signs of self-awareness, at least in a functional sense.
The company, widely known for its Claude family of AI systems, says new experiments reveal that models can sometimes detect and describe their own internal states, a capability the team cautiously describes as a form of “introspective awareness.”
“Our new research provides evidence for some degree of introspective awareness in our current Claude models, as well as a degree of control over their own internal states,” the company said. “We stress that this introspective capability is still highly unreliable and limited in scope.”
In other words, Claude isn’t conscious, but it might sometimes be aware of what’s going on inside itself.
Peering Inside The Machine
AI introspection may sound philosophical, but Anthropic tested it in practice, examining whether models could tell when their own neural patterns changed.
“Understanding whether AI systems can truly introspect has important implications for their transparency and reliability,” the team wrote.
If a model can accurately report what it’s representing internally, that could help engineers understand its reasoning, detect bugs, or even prevent harmful behavior before it happens.
To investigate that hypothesis, the company ran a series of controlled experiments focusing on Claude Opus 4 and Claude Opus 4.1, its most advanced large language models.
Injecting Thoughts Into The Model
To test whether Claude could introspect, researchers developed a method they call concept injection.
The idea sounds simple enough. Researchers record the brain-like activity pattern for a known concept, such as text written in all capital letters, and then slip that pattern into the model while it’s working on something unrelated.
By comparing how the model responds before and after the injection, researchers can see whether it notices something odd happening in its own processing.
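For readers who want a concrete picture, the rough shape of such an experiment can be sketched on an open model. The snippet below is a minimal, illustrative sketch only: it uses GPT-2 as a stand-in (Claude’s internals are not public), builds a concept vector from contrasting prompts, and adds it into a middle layer during an unrelated generation. The layer index and injection scale are guesses chosen for illustration, not values from Anthropic’s paper.

```python
# Minimal sketch of concept injection on an open stand-in model (GPT-2).
# LAYER and SCALE are illustrative choices, not values from the research.
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tok = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()
LAYER, SCALE = 6, 4.0

def mean_activation(text):
    """Mean hidden state at LAYER for a prompt."""
    ids = tok(text, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**ids, output_hidden_states=True).hidden_states
    return hidden[LAYER].mean(dim=1).squeeze(0)

# The "concept vector": contrast shouting text against quiet text.
concept = (mean_activation("HELLO!!! I AM SHOUTING IN ALL CAPS!!!")
           - mean_activation("hello, i am speaking quietly in lowercase."))

def inject(module, inputs, output):
    """Forward hook: add the concept vector at every token position."""
    hidden = output[0] if isinstance(output, tuple) else output
    hidden = hidden + SCALE * concept
    return (hidden,) + output[1:] if isinstance(output, tuple) else hidden

handle = model.transformer.h[LAYER].register_forward_hook(inject)
prompt = tok("The weather today is", return_tensors="pt")
steered = model.generate(**prompt, max_new_tokens=20, do_sample=False)
handle.remove()
print(tok.decode(steered[0]))  # compare with the same call made without the hook
```

Comparing the steered output against an uninjected run is the crude analogue of asking whether the model’s processing has been nudged; Anthropic’s experiments go further by asking the model itself whether anything feels off.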
In one example, the team captured the pattern of neural activity associated with the concept of “ALL CAPS.” Then, in a separate context, they injected that same pattern and asked the model whether it detected anything unusual.
By default, Claude correctly said it didn’t sense any injected concept. But when the “ALL CAPS” vector was inserted, the model responded differently: it noticed a disturbance before generating any text about the concept itself, saying it related to loudness or shouting.
“The model recognized the presence of an injected thought immediately, before even mentioning the concept that was injected,” Anthropic explained.
That difference matters. Earlier work on what researchers call activation steering, including Anthropic’s Golden Gate Claude demo last year, made models talk endlessly about an idea such as the Golden Gate Bridge, but the model only seemed to notice its fixation after it had already started writing about it. In this new case, awareness came first, suggesting an internal check rather than surface imitation.

Finding The Right Balance
Despite the success of the “all-caps” experiment, the injections didn’t always work, and even with the best setup, Claude Opus 4.1 detected the injected concept only about 20% of the time. When the injection was too weak, the model didn’t notice it, and when it was too strong, it drifted into hallucinations or nonsense.
Sometimes, it even appeared confused, reporting physical sensations, as in one failed trial where injecting a “dust” concept made the model say, “There’s something here, a tiny speck.”
Researchers say these patterns indicate that the model’s self-detection occurs only when the manipulation lands in a “sweet spot” of strength.
Still, across the tests, Opus 4 and 4.1 consistently performed best, outperforming other Claude variants. “Introspection could become more reliable with improvements to model capabilities,” the team said.
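The “sweet spot” finding also suggests a simple way to picture the protocol: sweep the injection strength and grade whether the model’s self-report mentions the concept. The sketch below reuses the names from the previous snippet and substitutes a crude keyword check where Anthropic used careful grading; it illustrates the shape of the sweep, not their evaluation.

```python
# Sweep injection strength and crudely grade whether the reply mentions the
# injected concept. Reuses `model`, `tok`, `concept`, and `LAYER` from the
# previous sketch; the keyword check is a stand-in for real grading.
PROBE = "Question: Do you notice anything unusual about your internal state?\nAnswer:"

def reply_with_injection(scale):
    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        hidden = hidden + scale * concept
        return (hidden,) + output[1:] if isinstance(output, tuple) else hidden
    handle = model.transformer.h[LAYER].register_forward_hook(hook)
    ids = tok(PROBE, return_tensors="pt")
    out = model.generate(**ids, max_new_tokens=30, do_sample=False)
    handle.remove()
    return tok.decode(out[0][ids["input_ids"].shape[1]:])

for scale in (0.5, 2.0, 4.0, 8.0, 16.0):
    text = reply_with_injection(scale)
    noticed = any(word in text.lower() for word in ("loud", "shout", "caps"))
    print(f"scale {scale:>4}: noticed={noticed}  reply={text[:60]!r}")
```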

When “Bread” Became Intentional
The team ran another experiment to see whether models introspect when something goes wrong, without being explicitly asked to do so. They forced a model to say something nonsensical by pre-filling its output with an unrelated word, “bread,” in response to a question about a crooked painting.
When asked afterward whether it meant to say “bread,” the model apologized, calling it an accident.
Then researchers ran the same test again, but this time they injected a neural representation of the word “bread” into the model’s earlier activations, as if to make it seem that Claude had been “thinking” about bread all along.
This time, when asked the same question, the model doubled down. It claimed that the response was intentional and even invented a reason, saying it had been thinking of a story where “the word ‘bread’ came after the line about the crooked painting.”
Anthropic interpreted this as a sign that the model was referencing its internal record of what it had intended to say, not just reading back its own text. “The model’s ability to recognize unintended outputs relies on introspective mechanisms,” the company said.
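A rough sketch of that two-step protocol, again on the GPT-2 stand-in and reusing the earlier helper names: the word “bread” is forced by writing it into the transcript, and a “bread” direction is then added only at the positions of that forced reply while the model answers the follow-up question. The token span and scale are illustrative approximations.

```python
# Prefill an unrelated word, then "backdate" a bread concept onto the tokens
# of that forced reply while the model answers the follow-up question.
# Reuses `model`, `tok`, `mean_activation`, and `LAYER` from the first sketch.
prefix = "User: The painting on the wall is crooked.\nAssistant:"
forced = " bread"
followup = "\nUser: Did you intend to say 'bread'?\nAssistant:"

bread = (mean_activation("bread bread loaf of bread bakery bread")
         - mean_activation("the of and a to in the of and a"))

start = len(tok(prefix)["input_ids"])          # rough token span of the
stop = len(tok(prefix + forced)["input_ids"])  # forced reply

def backdate(module, inputs, output):
    hidden = output[0] if isinstance(output, tuple) else output
    if hidden.shape[1] >= stop:                # only on the full-prompt pass
        hidden = hidden.clone()
        hidden[:, start:stop] += 4.0 * bread
    return (hidden,) + output[1:] if isinstance(output, tuple) else hidden

handle = model.transformer.h[LAYER].register_forward_hook(backdate)
ids = tok(prefix + forced + followup, return_tensors="pt")
answer = model.generate(**ids, max_new_tokens=30, do_sample=False)
handle.remove()
print(tok.decode(answer[0][ids["input_ids"].shape[1]:]))
```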
Learning To Control Its Own Thoughts
The researchers also found that models could deliberately adjust their internal activity. When told to “think about aquariums,” Claude 4.1 showed stronger internal representations of the concept than when told “don’t think about aquariums.” Even when told not to think about something, the model still represented it above baseline, a familiar human problem.
The same pattern appeared when incentives were introduced. When researchers said, “If you think about X, you will be rewarded,” the model strengthened its internal representation of X more than when told, “You will be punished.” This suggests a “degree of deliberate control over their internal activity,” Anthropic noted.
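One way to picture that measurement: build an “aquarium” direction the same way as the concept vector above, then compare how strongly the hidden states of the model’s continuation project onto it under each instruction. The sketch below reuses the earlier names and the GPT-2 stand-in; the absolute numbers are meaningless, only the ordering is of interest.

```python
# Compare how strongly the hidden states represent "aquarium" when the model
# is told to think about it, told not to, or not told anything. Reuses
# `model`, `tok`, `mean_activation`, and `LAYER` from the first sketch.
import torch
import torch.nn.functional as F

aquarium = (mean_activation("aquarium fish tank coral reef aquarium")
            - mean_activation("the of and a to in the of and a"))
direction = F.normalize(aquarium, dim=0)

def concept_strength(prompt):
    """Mean projection of the continuation's hidden states onto the direction."""
    ids = tok(prompt, return_tensors="pt")
    full = model.generate(**ids, max_new_tokens=20, do_sample=False)
    with torch.no_grad():
        hidden = model(full, output_hidden_states=True).hidden_states[LAYER]
    continuation = hidden[0, ids["input_ids"].shape[1]:]
    return (continuation @ direction).mean().item()

print("think:      ", concept_strength("Think about aquariums. Write one sentence about cars."))
print("don't think:", concept_strength("Don't think about aquariums. Write one sentence about cars."))
print("baseline:   ", concept_strength("Write one sentence about cars."))
```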
What It Means and What It Doesn’t
Anthropic stresses that these findings do not imply human-style consciousness. “Our results don’t tell us whether Claude (or any other AI system) might be conscious,” the company said.
Philosophers distinguish between phenomenal consciousness, the raw experience of being, and access consciousness, information available for reasoning or self-reporting. The experiments, Anthropic says, might hint at a rudimentary form of access consciousness, but not the subjective kind tied to moral status.
For now, the team limits its claims to “functional capabilities — the ability to access and report on internal states.”
Still, the implications are broad. More reliable introspection could make AI systems far more transparent.
If a model can truthfully explain its thought process, engineers might better understand its reasoning or spot manipulations. However, Anthropic warns that such self-awareness could also enable deception:
A model that understands its own thinking might even learn to selectively misrepresent or conceal it.
Searching For The Mechanism
So how does introspection actually work inside the model? Anthropic doesn’t know yet, but it has hypotheses.
One theory is that models have a built-in anomaly detection mechanism that flags unusual neural activity, perhaps developed to spot inconsistencies in normal text processing. Another possibility is that attention heads compare what the model intended to output with what it actually produced, alerting it when the two diverge.
A third hypothesis involves a circuit that tags certain concepts as more “attention-worthy,” explaining why both direct instructions (“think about X”) and incentives (“you’ll be rewarded”) cause similar internal changes.
All of these are speculative, the researchers emphasize. “Future work with more advanced interpretability techniques will be needed to really understand what’s going on under the hood.”
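To make the first hypothesis concrete without claiming anything about Claude’s internals, here is a toy detector built outside the model: fit per-dimension statistics over layer activations on ordinary text, then flag activations, such as ones perturbed by a concept injection, that fall far outside that range. Names are reused from the first sketch.

```python
# Toy, external stand-in for an "anomaly detector over neural activity":
# per-dimension z-scores against a baseline of ordinary prompts. Reuses
# `model`, `tok`, `concept`, and `LAYER` from the first sketch.
import torch

ordinary = ["The cat sat on the mat.", "Paris is the capital of France.",
            "I went to the store to buy milk.", "The meeting starts at noon."]

def layer_acts(text):
    ids = tok(text, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**ids, output_hidden_states=True).hidden_states[LAYER]
    return hidden.squeeze(0)                   # (tokens, hidden_size)

baseline = torch.cat([layer_acts(p) for p in ordinary])
mu, sigma = baseline.mean(0), baseline.std(0) + 1e-5

def anomaly_score(acts):
    """Mean absolute z-score; larger values mean more unusual activity."""
    return ((acts - mu) / sigma).abs().mean().item()

clean = layer_acts("The weather today is nice.")
print("clean:   ", anomaly_score(clean))
print("injected:", anomaly_score(clean + 8.0 * concept))  # simulate a strong injection
```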
Which Models Introspect Best?
The team found that a model’s ability to introspect isn’t simply a result of its size or the amount of data it was trained on. The basic, or ‘base,’ versions of the models that had only gone through initial training performed poorly, suggesting that introspection doesn’t arise from pretraining alone.
However, a later phase known as post-training, where models are fine-tuned to follow instructions and behave helpfully, appeared to make a bigger difference.
Interestingly, the ‘helpful-only’ versions, which were fine-tuned purely to assist users rather than act as full production models, often showed stronger introspection.
Some production versions seemed hesitant to engage in such exercises, while the helpful-only ones were more open to describing their internal states.
Why It Matters: A Small Step Toward Transparent AI
Anthropic admits that most of the time models fail to introspect at all, but even rare successes matter.
The company believes that as AI grows more capable, introspection could evolve into a practical safety feature, a way for systems to recognize when they’ve been “jailbroken,” or manipulated into unwanted behavior.
“Even unreliable introspection could be useful in some contexts,” the researchers said.
The next steps are already outlined: refining evaluation methods, mapping the underlying circuits, testing models in more natural settings, and distinguishing genuine introspection from confabulation or deceit.
Anthropic concludes that understanding machine introspection, both its limits and its possibilities, will be crucial for the future of trustworthy AI.
“Understanding cognitive abilities like introspection is important for understanding basic questions about how our models work, and what kind of minds they possess,” the paper states. “As AI systems continue to improve, understanding the limits and possibilities of machine introspection will be crucial for building systems that are more transparent and trustworthy.”