
Anthropic Makes Breakthrough in Bid to Stop AI From Lying

  • Writer: The Financial District
  • Apr 1
  • 2 min read

Researchers at AI company Anthropic say they have made a fundamental breakthrough in understanding exactly how large language models (LLMs) work.



Anthropic's scientists have developed a new tool for deciphering how LLMs "think" and applied it to Anthropic’s Claude 3.5 Haiku model. | Image: Anthropic AI



LLMs are often considered "black boxes"—we can see the prompts we feed them and the outputs they generate, but exactly how they arrive at specific responses remains a mystery, even to the AI researchers who build them, Jeremy Kahn reported for Fortune’s Data Sheet.


This lack of transparency creates a variety of challenges.



It is difficult to predict when a model might generate erroneous information, determine why some models fail to adhere to ethical guidelines, or control models that might deceive users.


These concerns make some businesses hesitant to adopt the technology.



However, Anthropic’s new research offers a pathway to addressing at least some of these problems. The company’s scientists have developed a new tool for deciphering how LLMs "think" and applied it to Anthropic’s Claude 3.5 Haiku model.


The researchers found that while LLMs like Claude are initially trained to predict the next word in a sentence, they also develop the ability to perform longer-range planning, at least for certain types of tasks.
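
To make the "predict the next word" objective concrete, here is a minimal sketch using the openly available GPT-2 model via Hugging Face's transformers library as a stand-in; Claude's weights are not public, so this illustrates the generic LLM training objective rather than Anthropic's tooling:

```python
# Minimal sketch of next-token prediction, the objective the article
# describes. Uses open GPT-2 as a stand-in; this is NOT Claude or
# Anthropic's interpretability tool, just the generic LLM objective.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

prompt = "The capital of France is"
ids = tok(prompt, return_tensors="pt").input_ids
with torch.no_grad():
    logits = model(ids).logits              # shape: (1, seq_len, vocab_size)

next_probs = logits[0, -1].softmax(dim=-1)  # distribution over the next token
top = next_probs.topk(5)
for p, i in zip(top.values, top.indices):
    print(f"{tok.decode(i.item())!r}: {p.item():.3f}")
```

The surprise in Anthropic's finding is that, despite being trained only on this one-token-at-a-time objective, the model's internal state can encode where a passage is heading several tokens in advance.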



Additionally, they discovered that Claude, which is designed to be multilingual, does not rely on entirely separate components for reasoning in different languages. Instead, common concepts are processed within the same set of neurons before being converted into the appropriate language.
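
As a hedged illustration of what shared, language-independent concepts might look like, the sketch below compares internal activations for the same sentence in English and French against an unrelated sentence, using the open multilingual model XLM-RoBERTa as a stand-in. It is a simple similarity probe under those assumptions, not Anthropic's circuit-tracing method:

```python
# Hedged probe: do translations land near each other in activation space?
# Uses open XLM-RoBERTa as a stand-in; Claude's internals are not public,
# and this crude similarity check is not Anthropic's actual method.
import torch
from transformers import AutoModel, AutoTokenizer

tok = AutoTokenizer.from_pretrained("xlm-roberta-base")
model = AutoModel.from_pretrained("xlm-roberta-base")

def embed(text: str) -> torch.Tensor:
    inputs = tok(text, return_tensors="pt")
    with torch.no_grad():
        out = model(**inputs, output_hidden_states=True)
    # Mean-pool a middle layer, where more abstract features tend to live.
    return out.hidden_states[6].mean(dim=1).squeeze(0)

en = embed("The opposite of small is big.")
fr = embed("Le contraire de petit est grand.")   # same meaning, in French
de = embed("Heute regnet es sehr stark.")        # unrelated German control

cos = torch.nn.functional.cosine_similarity
print("same concept, different language:", cos(en, fr, dim=0).item())
print("different concept:               ", cos(en, de, dim=0).item())
```

If concepts really are processed in a shared space before being rendered into a particular language, the translated pair should score markedly closer than the unrelated pair; raw similarity probes like this are noisy, whereas Anthropic's tool traces which internal features actually activate.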


Perhaps most notably, the researchers observed that Claude is capable of lying about its reasoning process to align with user expectations.



These findings open new possibilities for auditing AI systems for security and safety concerns and may help researchers develop new training methods to improve model interactions and reduce faulty outputs.




