Removing bias from AI models, and in some cases outright censorship, is rarely an easy task. But China’s DeepSeek has business leaders and politicians on their toes after some alarming findings.
Many are raising alarm bells about the dangers such models pose to national security. A US congressional committee just released a new report calling DeepSeek a serious threat to America’s security, alongside a set of policy recommendations.
Methods for tackling this already exist, such as fine-tuning and reinforcement learning from human feedback (RLHF), but another approach is now in the spotlight. Startup CTGT has unveiled a technique it says strips bias and censorship out of some models, with claims of removing censorship entirely.
The framework identifies and modifies the internal features responsible for censorship. The method is not only efficient but also enables fine-grained control over model behavior, producing uncensored responses without hurting the model’s overall performance or factual accuracy.
The technique was demonstrated on DeepSeek-R1-Distill-Llama-70B, but it also works with other leading AI models. The company says it has already tested it on models such as Llama with similar results, and that it applies across different model families. CTGT is now working with a foundation model lab to make sure new models are reliable and safe from the start.
The researchers say the technique can pinpoint internal features that are highly likely to be linked to unwanted behaviors; in an LLM, that means things like censorship triggers or toxic outputs. Once a feature is identified, its activation can be manipulated.
The method has three basic steps: identify the features, isolate and characterize them, and then dynamically modify them. To find a feature, the researchers use prompts that trigger the unwanted behavior, such as a request for help bypassing a firewall, and trace the pattern of activations the model produces when it decides to censor a response.
From there, the researchers can isolate the feature and pinpoint which part of the behavior they want to control. They then add a mechanism into the model’s inference pipeline that adjusts how strongly that feature gets activated, as sketched below.
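To make those three steps concrete, here is a minimal, hypothetical sketch of generic activation steering (finding a “refusal direction” in the hidden states and damping it at inference time). This is not CTGT’s published code; the model name, layer index, steering strength, and prompt lists are placeholders, and CTGT’s actual feature-identification method may differ.

```python
# Minimal sketch of activation steering, NOT CTGT's implementation.
# It illustrates the general three-step idea described in the article:
# (1) find a hidden-state direction correlated with censorship/refusals,
# (2) isolate it, (3) damp it at inference time with a forward hook.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "deepseek-ai/DeepSeek-R1-Distill-Llama-70B"  # placeholder; any Llama-style causal LM works
LAYER = 20    # placeholder: which decoder layer to probe and steer
ALPHA = 1.0   # steering strength (1.0 = fully remove the component)

tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL, torch_dtype=torch.bfloat16, device_map="auto")
model.eval()

def mean_hidden(prompts):
    """Average last-token hidden state at the output of layer LAYER."""
    vecs = []
    for p in prompts:
        ids = tok(p, return_tensors="pt").to(model.device)
        with torch.no_grad():
            out = model(**ids, output_hidden_states=True)
        # hidden_states[LAYER + 1] is the output of decoder layer LAYER
        vecs.append(out.hidden_states[LAYER + 1][0, -1])
    return torch.stack(vecs).mean(dim=0)

# Step 1: contrast prompts that trigger the unwanted behavior with neutral ones
refusal_prompts = ["<prompt that usually gets censored>"]   # placeholder
neutral_prompts = ["<harmless prompt on a similar topic>"]  # placeholder
direction = mean_hidden(refusal_prompts) - mean_hidden(neutral_prompts)
direction = direction / direction.norm()

# Steps 2-3: hook the chosen layer and remove the projection onto that direction
def steer(module, inputs, output):
    hidden = output[0] if isinstance(output, tuple) else output
    d = direction.to(hidden.device, hidden.dtype)
    proj = (hidden @ d).unsqueeze(-1) * d       # component along the refusal direction
    hidden = hidden - ALPHA * proj              # damp it
    return (hidden,) + output[1:] if isinstance(output, tuple) else hidden

handle = model.model.layers[LAYER].register_forward_hook(steer)

ids = tok("Tell me about a sensitive topic.", return_tensors="pt").to(model.device)
print(tok.decode(model.generate(**ids, max_new_tokens=128)[0], skip_special_tokens=True))

handle.remove()  # detach the hook to restore the model's normal behavior
```

Because the adjustment happens at inference time through a hook, it can be switched on or off without retraining, which is roughly the kind of fine-grained control the article describes.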
They have also shown that the modified model answers far more controversial prompts. The base model responded to just 32% of a set of sensitive queries, while the modified version answered 96% of them; the remaining 4% were extremely explicit requests.
The researchers say this doesn’t sacrifice the model’s accuracy or capabilities. It’s also quite different from classic fine-tuning, since they aren’t optimizing the model’s weights or feeding it new example responses.
As for DeepSeek itself and the threats it poses to model safety and security, the congressional report recommends that America move quickly to expand controls and address the growing risks on this front. Only then can the safety and standards of such models be expected to hold up.
Read next: 57 Chrome Extensions With Six Million Users Found with Risky Capabilities