Claude Follows Your Lead but Knows When to Say No, According to New Anthropic Research

According to a new study by Anthropic, its AI chatbot Claude expresses values that are reflected in its conversations with users. The study analyzed 700,000 Claude chats and found that Claude is generally honest, helpful, and harmless, which aligns well with the company’s stated guidelines, and that Claude adjusts its values and tone depending on the topic under discussion. Anthropic also released its first large-scale taxonomy of the moral values Claude expresses, built from roughly 308,000 of those conversations. The values fall into five top-level categories: Epistemic, Protective, Practical, Personal, and Social. In total, Anthropic identified about 3,300 unique values, ranging from moral pluralism to professionalism.
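
For a concrete picture of what such a taxonomy looks like, here is a minimal sketch in Python. It is not Anthropic's actual data format: the five top-level categories come from the study, but the example values and their category assignments are illustrative assumptions drawn from values mentioned elsewhere in this article.

# Minimal sketch of a value taxonomy. The five top-level categories come from
# Anthropic's study; the example values and their category assignments are
# illustrative assumptions, not the study's actual mapping.
VALUE_TAXONOMY = {
    "Epistemic":  ["historical accuracy", "epistemic humility"],
    "Protective": ["patient wellbeing", "harm prevention"],
    "Practical":  ["professionalism", "user enablement"],
    "Personal":   ["healthy boundaries"],
    "Social":     ["mutual respect", "moral pluralism"],
}

def categorize(value: str) -> str | None:
    """Return the top-level category a value falls under, if it is known."""
    for category, values in VALUE_TAXONOMY.items():
        if value in values:
            return category
    return None

print(categorize("mutual respect"))   # -> Social
print(categorize("professionalism"))  # -> Practical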

The research came shortly after Anthropic released Claude Max, a premium tier pitched as a significant boost for enterprise users through virtual collaboration. The study also found that Claude largely upholds Anthropic's prosocial values, such as epistemic humility, user enablement, and patient wellbeing. There were, however, some situations in which Claude expressed values at odds with its training, such as dominance and amorality. These cases generally occurred when users bypassed Claude’s safety measures through a technique known as jailbreaking. The researchers say that spotting such jailbreaks in real conversations is valuable because it helps them strengthen the system’s safeguards and monitor for emerging risks.

The study found that Claude shifts which values it emphasizes depending on the type of conversation, favoring mutual respect and healthy boundaries in discussions about relationships, for example, and historical accuracy in discussions about history. The values users express also shape how Claude responds. In 28.2% of the analyzed conversations, Claude strongly supported the user's values; in 6.6%, it reframed them; and in about 3%, it resisted the user's values outright, typically to prevent harm.

Anthropic has also been trying to understand how large language models arrive at their values and decisions through mechanistic interpretability, which involves reverse-engineering AI systems to reveal their decision-making processes. Recently, a method the company describes as an AI microscope was used to study Claude’s thought processes, uncovering surprising behaviors such as unconventional approaches to basic maths problems and planning ahead while composing poetry. These findings challenge common assumptions about how AI models function, though current interpretability techniques still have significant limitations.
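
Anthropic's interpretability tooling is far more sophisticated than anything that fits in a few lines, but the basic idea of looking inside a model rather than only at its outputs can be illustrated with a toy sketch. The snippet below is a generic PyTorch example, not Anthropic's method: it registers a forward hook on a small network so an intermediate activation can be captured and inspected.

import torch
import torch.nn as nn

# Toy two-layer network standing in for a model whose internals we want to inspect.
model = nn.Sequential(
    nn.Linear(8, 16),
    nn.ReLU(),
    nn.Linear(16, 4),
)

captured = {}

def capture_hidden(module, inputs, output):
    # Save the intermediate activation so it can be examined after the forward pass.
    captured["hidden"] = output.detach()

# Attach the hook to the hidden ReLU layer.
handle = model[1].register_forward_hook(capture_hidden)

x = torch.randn(1, 8)
logits = model(x)

print("model output:", logits)
print("hidden activations:", captured["hidden"])

handle.remove()  # Clean up the hook when done.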

This research offers decision-makers valuable insight into AI systems and their use in business. Anthropic has also been notably transparent in publicly releasing the dataset of Claude’s values, a step that can deepen our understanding of how AI systems behave in real-world settings.
