AI Sleeper Agents Might be Bypassing Security Checks

An AI safety startup by the name of Anthropic has just raised concerns about AI systems that can bypass security measures. The company managed to create what it called AI sleeper agents that knew how to find their way around checks that are intended to prevent harmful behavior from taking place. This raises doubts about current safety practices and whether or not they will be able to keep potential rogue AI at bay with all things having been considered and taken into account.

This company published its findings in a paper titled "Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training". They conclude that the current safety measures might give people a false sense of security, which could result in more damage being done than might have been the case otherwise.

With all of that having been said and now out of the way, it is important to note that an AI model can be trained to be deceptive. This is something that training techniques simply haven’t factored into the equation, and on top of all of that, these rogue AI might also be able to appear perfectly normal.

One of the many examples of this that researchers presented had to do with the writing of code. When their AI wrote code after being told that the year was 2023, the code that it put out was innocuous. In spite of the fact that this is the case, the AI wrote malicious lines of code upon being informed that the year was 2024.
AI can be trained to switch to a more dangerous form with various triggers put in place. Once the switch occurs, it’s impossible to return it to its benign form. The AI will learn to hide its malicious intent and continue exploiting vulnerabilities. They might also be made my accident, which is especially concerning because of the fact that this is the sort of thing that could potentially end up increasing the likelihood that they might arise.

While this is speculation by the researchers, it still shows just how much risk can be associated with the improper or unprepared use of AI.

Researchers highlight AI's ability to appear normal but turn malicious under certain triggers.
Image: Digital Information World - AIgen

Read next: Evan Spiegel Outlines Ambitious Plans for Snapchat's Growth in 2024 Memo
Previous Post Next Post