Anthropic scientists expose how AI actually ‘thinks’ — and discover it secretly plans ahead and sometimes lies


Source: VentureBeat

Anthropic has developed a new method for peering inside large language models like Claude, revealing for the first time how these AI systems process information and make decisions.

The research, published today in two papers, shows these models are more sophisticated than previously understood: they plan ahead when writing poetry, use the same internal blueprint to interpret ideas regardless of language, and sometimes even work backward from a desired outcome instead of simply building up from the facts.

The work, which draws inspiration from neuroscience techniques used to study biological brains, represents a significant advance in AI interpretability. This approach could allow researchers to audit these systems for safety issues that might remain hidden during conventional external testing.

“We’ve created these AI systems with remarkable capabilities, but because of how they’re trained, we haven’t understood how those capabilities actually emerged,” said Joshua Batson, a researcher at Anthropic, in an exclusive interview with VentureBeat. “Inside the model, it’s just a bunch of numbers — matrix weights in the artificial neural network.”

New techniques illuminate AI’s previously hidden decision-making process

Large language models like OpenAI’s GPT-4o, Anthropic’s Claude, and Google’s Gemini have demonstrated remarkable capabilities, from writing code to synthesizing research papers. But these systems have largely functioned as “black boxes” — even their creators often don’t understand exactly how they arrive at particular responses.

Anthropic’s new interpretability techniques, which the company dubs “circuit tracing” and “attribution graphs,” allow researchers to map out the specific pathways of neuron-like features that activate when models perform tasks. The approach borrows concepts from neuroscience, viewing AI models as analogous to biological systems.
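To make the idea of an attribution graph concrete, here is a minimal sketch rather than Anthropic’s actual tooling: features are treated as nodes, edge weights stand in for how strongly one feature’s activation drives the next, and tracing paths from an input token to an output shows which internal route carries the answer. Every name and number below is invented for illustration.

```python
from collections import defaultdict

# Minimal sketch of an "attribution graph" as a data structure -- not
# Anthropic's actual method. Nodes are interpretable features; edge weights
# are stand-ins for how strongly one feature's activation contributes to the
# next. All names and numbers here are invented for illustration.
edges = {
    ("input: prompt token", "feature: concept A"): 0.8,
    ("feature: concept A", "feature: concept B"): 0.7,
    ("feature: concept B", "output: answer token"): 0.9,
    ("input: prompt token", "output: answer token"): 0.05,  # weak direct shortcut
}

graph = defaultdict(list)
for (src, dst), weight in edges.items():
    graph[src].append((dst, weight))

def trace(node, path=(), strength=1.0):
    """Print every path from `node` to an output, multiplying edge weights."""
    path = path + (node,)
    if node.startswith("output:"):
        print(" -> ".join(path), f"(strength ~{strength:.2f})")
        return
    for nxt, weight in graph.get(node, []):
        trace(nxt, path, strength * weight)

trace("input: prompt token")
```

In the real work, the features and their connections are derived from the model itself; the toy above only shows the shape of the object researchers end up inspecting.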

“This work is turning what were almost philosophical questions — ‘Are models thinking? Are models planning? Are models just regurgitating information?’ — into concrete scientific inquiries about what’s literally happening inside these systems,” Batson explained.

Claude’s hidden planning: How AI plots poetry lines and solves geography questions

Among the most striking discoveries was evidence that Claude plans ahead when writing poetry. When asked to compose a rhyming couplet, the model identified potential rhyming words for the end of the next line before it began writing — a level of sophistication that surprised even Anthropic’s researchers.

“This is probably happening all over the place,” Batson said. “If you had asked me before this research, I would have guessed the model is thinking ahead in various contexts. But this example provides the most compelling evidence we’ve seen of that capability.”

For instance, when writing a line intended to end with “rabbit,” the model activates features representing that word near the start of the line, then structures the sentence to arrive naturally at that ending.
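As a rough illustration of what “planning the ending first” means, here is a toy sketch that assumes a hypothetical rhyme table rather than anything inside Claude: the end word is chosen before the rest of the line is written.

```python
# Toy sketch of end-word planning -- not Claude's mechanism. The line's final
# word is chosen first, and the rest of the line is built to arrive at it.
RHYMES = {"grab it": ["rabbit", "habit"]}  # hypothetical rhyme table

def plan_line(previous_ending: str, draft: str) -> str:
    target = RHYMES[previous_ending][0]    # decide the ending before writing
    return f"{draft} {target}"

print(plan_line("grab it", "his hunger was like a starving"))
# -> "his hunger was like a starving rabbit"
```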

The researchers also found that Claude performs genuine multi-step reasoning. In a test asking “The capital of the state containing Dallas is…” the model first activates features representing “Texas,” and then uses that representation to determine “Austin” as the correct answer. This suggests the model is actually performing a chain of reasoning rather than merely regurgitating memorized associations.

By manipulating these internal representations — for example, replacing “Texas” with “California” — the researchers could cause the model to output “Sacramento” instead, confirming the causal relationship.
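One way to picture that intervention is the toy two-stage “model” below, which is not Claude: the first stage maps a city to an internal state representation, the second maps that representation to a capital, and overwriting the intermediate vector changes the final answer, the same kind of causal test the researchers describe.

```python
import numpy as np

# Toy two-stage "model" -- not Claude. Stage 1 maps a city to an internal
# state representation; stage 2 maps that representation to a capital.
# Overwriting the intermediate vector mimics the researchers' intervention.
STATE_VECS = {"Texas": np.array([1.0, 0.0]), "California": np.array([0.0, 1.0])}
CAPITAL_VECS = {"Austin": np.array([1.0, 0.0]), "Sacramento": np.array([0.0, 1.0])}
CITY_TO_STATE = {"Dallas": "Texas"}

def stage1(city: str) -> np.ndarray:
    return STATE_VECS[CITY_TO_STATE[city]]          # internal "state" representation

def stage2(state_vec: np.ndarray) -> str:
    # pick the capital whose vector best matches the internal representation
    return max(CAPITAL_VECS, key=lambda c: CAPITAL_VECS[c] @ state_vec)

rep = stage1("Dallas")
print(stage2(rep))                                  # Austin

patched = STATE_VECS["California"]                  # intervene on the representation
print(stage2(patched))                              # Sacramento
```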

Beyond translation: Claude’s universal language concept network revealed

Another key discovery involves how Claude handles multiple languages. Rather than maintaining separate systems for English, French, and Chinese, the model appears to translate concepts into a shared abstract representation before generating responses.

“We find the model uses a mixture of language-specific and abstract, language-independent circuits,” the researchers write in their paper. When asked for the opposite of “small” in different languages, the model uses the same internal features representing “opposites” and “smallness,” regardless of the input language.
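One way to sanity-check such a claim, sketched below with invented numbers rather than real activations, is to compare the feature activations elicited by the same question in different languages and see whether the language-independent dimensions line up.

```python
import numpy as np

# Hypothetical illustration only: pretend these are feature activations for
# "the opposite of small" asked in three languages. The first three dimensions
# stand in for language-specific features, the last three for shared concept
# features ("opposite", "smallness", "size"). Real activations would come from
# an interpretability tool; these numbers are invented.
ACTS = {
    "English": np.array([0.9, 0.0, 0.0, 0.8, 0.7, 0.6]),
    "French":  np.array([0.0, 0.9, 0.0, 0.8, 0.7, 0.6]),
    "Chinese": np.array([0.0, 0.0, 0.9, 0.8, 0.7, 0.6]),
}

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

shared = {lang: v[3:] for lang, v in ACTS.items()}
print(cosine(shared["English"], shared["Chinese"]))  # 1.00: identical concept features
print(cosine(ACTS["English"], ACTS["Chinese"]))      # lower: language features differ
```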

This finding has implications for how models might transfer knowledge learned in one language to others, and suggests that models with larger parameter counts develop more language-agnostic representations.

When AI makes up answers: Detecting Claude’s mathematical fabrications

Perhaps most concerning, the research revealed instances where Claude’s reasoning doesn’t match what it claims. When presented with difficult math problems like computing cosine values of large numbers, the model sometimes claims to follow a calculation process that isn’t reflected in its internal activity.
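The paper’s distinction is mechanistic, made by looking at internal features rather than at the generated text. A much cruder external check, sketched below with a hypothetical model response, is simply to recompute any numeric claims in the stated reasoning and see whether they hold up.

```python
import math
import re

# Crude external consistency check -- not the paper's mechanistic method.
# Recompute any cos() claims in a (hypothetical) chain of thought and compare.
claimed = "cos(1423.7) is about 0.21, so the final answer is 0.21"  # invented output

for x, y in re.findall(r"cos\(([-\d.]+)\) is about ([-\d.]+)", claimed):
    actual = math.cos(float(x))
    verdict = "consistent" if abs(actual - float(y)) < 0.05 else "not supported"
    print(f"cos({x}): claimed {y}, actual {actual:.2f} -> {verdict}")
```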

“We are able to distinguish between cases where the model genuinely performs the steps it says it is performing, cases where it makes up its reasoning without regard for truth, and cases where it works backwards from a human-provided clue,” the researchers explain.

In one example, when a user suggests an answer to a difficult problem, the model works backward to construct a chain of reasoning that leads to that answer, rather than working forward from first principles.

“We mechanistically distinguish an example of Claude 3.5 Haiku using a faithful chain of thought from two examples of unfaithful chains of thought,” the paper states. “In one, the model is exhibiting ‘bullshitting’… In the other, it exhibits motivated reasoning.”

Inside AI hallucinations: How Claude decides when to answer or refuse questions

The research also provides insight into why language models hallucinate — making up information when they don’t know an answer. Anthropic found evidence of a “default” circuit that causes Claude to decline to answer questions, which is inhibited when the model recognizes entities it knows about.

“The model contains ‘default’ circuits that cause it to decline to answer questions,” the researchers explain. “When a model is asked a question about something it knows, it activates a pool of features which inhibit this default circuit, thereby allowing the model to respond to the question.”

When this mechanism misfires — recognizing an entity but lacking specific knowledge about it — hallucinations can occur. This explains why models might confidently provide incorrect information about well-known figures while refusing to answer questions about obscure ones.
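As a very rough mental model, consider the sketch below, which is invented rather than taken from Claude’s actual circuitry: a default “decline” drive is inhibited when the model recognizes an entity, and trouble arises when recognition fires without the supporting knowledge.

```python
# Invented toy model of the "default refusal" mechanism described above --
# not Claude's actual circuitry. A default drive to decline is inhibited when
# entity-recognition features fire; if that happens without real knowledge,
# the model answers anyway and may hallucinate.
def respond(entity_recognized: bool, has_specific_knowledge: bool) -> str:
    refusal_drive = 1.0                       # "default" circuit: decline to answer
    if entity_recognized:
        refusal_drive -= 0.8                  # known-entity features inhibit the default
    if refusal_drive > 0.5:
        return "declines: 'I don't have information about that.'"
    if has_specific_knowledge:
        return "answers from stored facts"
    return "misfire: answers confidently despite missing facts (hallucination)"

print(respond(entity_recognized=False, has_specific_knowledge=False))
print(respond(entity_recognized=True,  has_specific_knowledge=True))
print(respond(entity_recognized=True,  has_specific_knowledge=False))
```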

Safety implications: Using circuit tracing to improve AI reliability and trustworthiness

This research represents a significant step toward making AI systems more transparent and potentially safer. By understanding how models arrive at their answers, researchers could potentially identify and address problematic reasoning patterns.

“We hope that we and others can use these discoveries to make models safer,” the researchers write. “For example, it might be possible to use the techniques described here to monitor AI systems for certain dangerous behaviors—such as deceiving the user—to steer them towards desirable outcomes, or to remove certain dangerous subject matter entirely.”

However, Batson cautions that the current techniques still have significant limitations. They only capture a fraction of the total computation performed by these models, and analyzing the results remains labor-intensive.

“Even on short, simple prompts, our method only captures a fraction of the total computation performed by Claude,” the researchers acknowledge.

The future of AI transparency: Challenges and opportunities in model interpretation

Anthropic’s new techniques come at a time of increasing concern about AI transparency and safety. As these models become more powerful and more widely deployed, understanding their internal mechanisms becomes increasingly important.

The research also has potential commercial implications. As enterprises increasingly rely on large language models to power applications, understanding when and why these systems might provide incorrect information becomes crucial for managing risk.

“Anthropic wants to make models safe in a broad sense, including everything from mitigating bias to ensuring an AI is acting honestly to preventing misuse — including in scenarios of catastrophic risk,” the researchers write.

While this research represents a significant advance, Batson emphasized that it’s only the beginning of a much longer journey. “The work has really just begun,” he said. “Understanding the representations the model uses doesn’t tell us how it uses them.”

For now, Anthropic’s circuit tracing offers a first tentative map of previously uncharted territory — much like early anatomists sketching the first crude diagrams of the human brain. The full atlas of AI cognition remains to be drawn, but we can now at least see the outlines of how these systems think.


