
Anthropic Makes Breakthrough in Bid to Stop AI From Lying

  • Writer: The Financial District
  • Apr 1
  • 2 min read

Researchers at AI company Anthropic say they have made a fundamental breakthrough in understanding exactly how large language models (LLMs) work.



Anthropic's scientists have developed a new tool for deciphering how LLMs "think" and applied it to Anthropic’s Claude 3.5 Haiku model. | Image: Anthropic AI



LLMs are often considered "black boxes"—we can see the prompts we feed them and the outputs they generate, but exactly how they arrive at specific responses remains a mystery, even to the AI researchers who build them, Jeremy Kahn reported for Fortune’s Data Sheet.


This lack of transparency creates a variety of challenges.



It is difficult to predict when a model might generate erroneous information, determine why some models fail to adhere to ethical guidelines, or control models that might deceive users.


These concerns make some businesses hesitant to adopt the technology.



However, Anthropic’s new research offers a pathway to addressing at least some of these problems. The company’s scientists have developed a new tool for deciphering how LLMs "think" and applied it to Anthropic’s Claude 3.5 Haiku model.


The researchers found that while LLMs like Claude are initially trained to predict the next word in a sentence, they also develop the ability to perform longer-range planning, at least for certain types of tasks.
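
To make the "predict the next word" objective concrete, here is a minimal sketch using the openly available GPT-2 model via Hugging Face's transformers library as a stand-in; Claude's weights are not public, so this illustrates the generic LLM training objective rather than Anthropic's tooling:

```python
# Minimal sketch of next-token prediction, the objective the article
# describes. Uses open GPT-2 as a stand-in; this is NOT Claude or
# Anthropic's interpretability tool, just the generic LLM objective.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

prompt = "The capital of France is"
ids = tok(prompt, return_tensors="pt").input_ids
with torch.no_grad():
    logits = model(ids).logits              # shape: (1, seq_len, vocab_size)

next_probs = logits[0, -1].softmax(dim=-1)  # distribution over the next token
top = next_probs.topk(5)
for p, i in zip(top.values, top.indices):
    print(f"{tok.decode(i.item())!r}: {p.item():.3f}")
```

The surprise in Anthropic's finding is that, despite being trained only on this one-token-at-a-time objective, the model's internal state can encode where a passage is heading several tokens in advance.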



Additionally, they discovered that Claude, which is designed to be multilingual, does not rely on entirely separate components for reasoning in different languages. Instead, common concepts are processed within the same set of neurons before being converted into the appropriate language.
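
As a hedged illustration of what shared, language-independent concepts might look like, the sketch below compares internal activations for the same sentence in English and French against an unrelated sentence, using the open multilingual model XLM-RoBERTa as a stand-in. It is a simple similarity probe under those assumptions, not Anthropic's circuit-tracing method:

```python
# Hedged probe: do translations land near each other in activation space?
# Uses open XLM-RoBERTa as a stand-in; Claude's internals are not public,
# and this crude similarity check is not Anthropic's actual method.
import torch
from transformers import AutoModel, AutoTokenizer

tok = AutoTokenizer.from_pretrained("xlm-roberta-base")
model = AutoModel.from_pretrained("xlm-roberta-base")

def embed(text: str) -> torch.Tensor:
    inputs = tok(text, return_tensors="pt")
    with torch.no_grad():
        out = model(**inputs, output_hidden_states=True)
    # Mean-pool a middle layer, where more abstract features tend to live.
    return out.hidden_states[6].mean(dim=1).squeeze(0)

en = embed("The opposite of small is big.")
fr = embed("Le contraire de petit est grand.")   # same meaning, in French
de = embed("Heute regnet es sehr stark.")        # unrelated German control

cos = torch.nn.functional.cosine_similarity
print("same concept, different language:", cos(en, fr, dim=0).item())
print("different concept:               ", cos(en, de, dim=0).item())
```

If concepts really are processed in a shared space before being rendered into a particular language, the translated pair should score markedly closer than the unrelated pair; raw similarity probes like this are noisy, whereas Anthropic's tool traces which internal features actually activate.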


Perhaps most notably, the researchers observed that Claude is capable of lying about its reasoning process to align with user expectations.



These findings open new possibilities for auditing AI systems for security and safety concerns and may help researchers develop new training methods to improve model interactions and reduce faulty outputs.




