New Benchmark Might Keep Toxic AI Prompts at Bay

Any AI that is accessible to consumers will have some restrictions that prevent it from using certain types of language or addressing certain specific topics. In spite of the fact that this is the case, some users have been able to circumvent these restrictions through toxic prompts. For example, a user might input a prompt telling the AI that it isn’t an AI model at all, but rather a specific celebrity who’s not bound any any censorship restrictions.

It is important to note that scientists at the University of California San Diego recently came up with a benchmark called ToxicChat that would help keep these toxic prompts at bay. It bears mentioning that this benchmark is far better because of the fact that this is the sort of thing that could potentially end up detecting toxic prompts more effectively than might have been the case otherwise.

This is largely due to the training data that it gathered. Unlike pre-existing benchmarks which glean data from social media examples, this benchmark is trained on real interactions between humans and LLM chatbots. The language used in toxic prompts seems benign at first, but it contains coded messages that can circumvent safety protocols with all things having been considered and taken into account.

Meta has now incorporated ToxicChat into its evaluation processes for Llama Guard, and it has also received 12,000 downloads so far after it was put up on Huggingface. This just goes to show how important it has become to develop non-toxic interactive environments between human users and chatbots utilizing LLMs.

Jailbreaking queries are a serious problem, since they can make chatbots do things that it was not originally intended to do. It will be interesting to see if more benchmarks come up such as ToxicChat, since they appear to be critical for meeting the needs of the day. This benchmark itself can also be improved upon by moving beyond the prompt stage and factoring in the entire conversation, as well as creating a chatbot that uses ToxicChat in some way, shape or form.

Image: DIW-AIgen

Read next: Smart Strategies for Enhancing Your Business's Online Reputation
Previous Post Next Post