Study Shows Higher Error Rates in AI Responses When Queries Use African American, Indian, or Singaporean English

A new audit of Amazon’s AI chatbot, Rufus, reveals that the system fails more often when users speak in everyday English, especially when the phrasing reflects dialects like African American English (AAE), Indian English, or Singaporean English. Researchers at Cornell University found that Rufus often gives less accurate or more uncertain answers when users write in these dialects, even if the question itself is clear.

The study didn’t just look at spelling mistakes or grammar slips. It focused on how Rufus reacts when the same question is asked using different dialects, sometimes with small changes like dropping punctuation or using casual phrasing. These shifts, which are common in real-world chats, made the chatbot less helpful.

“Rufus is more likely to produce low-quality responses when prompted in minoritized dialects,” the researchers reported.

What They Tested and Why

The audit was based on a new framework developed to measure what the authors call “quality-of-service harms.” That refers to situations where an AI assistant works fine for some people, but not for others, because of the way they speak or write.

To run the test, the team used 120 shopping-related questions that Amazon itself had suggested. These were transformed into five different dialects using a tool called Multi-VALUE, which changes sentence structures to match real dialect features. For example, one version of a question might ask, “Is this jacket machine washable?” while another might say, “This here jacket machine washable?”

The researchers ran each of the 720 total prompts through Rufus using Amazon’s app. They rated each response based on two measures: whether it was wrong, and whether the bot appeared unsure.
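For readers who want a concrete sense of that workflow, the sketch below shows how such an audit loop might be scripted. The helper callables (rewrite_in_dialect, ask_chatbot, judge_incorrect) are hypothetical placeholders rather than the Cornell team's actual tooling, and the uncertainty check is a deliberately crude stand-in for whatever criteria the paper used.

```python
# Minimal sketch of the audit loop described above; not the authors' code.
# `rewrite_in_dialect` and `ask_chatbot` are hypothetical callables supplied
# by the auditor (e.g. a Multi-VALUE wrapper and an app-automation client).

UNCERTAINTY_MARKERS = ("not sure", "couldn't find", "unable to", "i don't know")

def looks_uncertain(response: str) -> bool:
    """Crude proxy for measure 2: does the bot hedge or admit it can't answer?"""
    return any(marker in response.lower() for marker in UNCERTAINTY_MARKERS)

def audit(base_questions, dialects, rewrite_in_dialect, ask_chatbot, judge_incorrect):
    """Ask every question in every dialect and record both quality measures."""
    results = []
    for question in base_questions:          # the 120 Amazon-suggested questions
        for dialect in dialects:             # standard English plus the dialect rewrites
            prompt = rewrite_in_dialect(question, dialect)
            response = ask_chatbot(prompt)
            results.append({
                "dialect": dialect,
                "prompt": prompt,
                "incorrect": judge_incorrect(question, response),  # measure 1: wrong answer
                "uncertain": looks_uncertain(response),            # measure 2: unsure answer
            })
    return results
```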

When Grammar Trips the Bot

In many cases, Rufus had trouble handling grammatical features common in dialects like AAE or Singaporean English. The most common failure came when users dropped the linking verb “is,” a structure linguists call the zero copula. This is a natural grammatical feature of many English dialects, but it caused the AI to stumble.

When the verb was missing, Rufus often skipped over the real question and instead suggested unrelated items. Out of 36 test prompts that dropped the verb, Rufus failed 25 times, a failure rate of roughly 69 percent. That’s a big jump compared to just 6 failures out of 108 prompts written in standard grammar, or about 6 percent.

The researchers called the pattern “concerning,” noting that this kind of error could affect many users who naturally write in these dialects.

More Errors with Typos and Informality

The audit didn’t stop with grammar alone. To mirror how people actually type on their phones, the team also introduced typos and removed punctuation. These tweaks made things worse.

Prompts written in Indian and Singaporean English became especially vulnerable to incorrect or vague responses once errors were added. According to the report, “responses to prompts written in AAE, IndE, and SgE are significantly more incorrect than responses to prompts written in SAE.”

The presence of typos made it harder for Rufus to recover the meaning of a question written in a non-standard dialect. The bot wasn’t confused by the dialect alone, but by how that dialect shows up in typical online writing.
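As an illustration of the kind of perturbation involved, the snippet below shows two simple text tweaks of the sort described: stripping punctuation and swapping adjacent characters to simulate a typo. These are generic examples, not the specific rules the researchers applied.

```python
# Illustrative perturbations of the kind described above; the exact rules
# used in the study are not reproduced here.
import random
import string

def drop_punctuation(text):
    """Strip punctuation the way hurried mobile typing often does."""
    return text.translate(str.maketrans("", "", string.punctuation))

def add_typo(text, rng=None):
    """Swap two adjacent characters at a random position to simulate a typo."""
    rng = rng or random.Random()
    if len(text) < 2:
        return text
    i = rng.randrange(len(text) - 1)
    return text[:i] + text[i + 1] + text[i] + text[i + 2:]

# Example: "Is this jacket machine washable?" -> "Is this jacket machine washable"
print(drop_punctuation("Is this jacket machine washable?"))
```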

Copying Rufus for Testing

To test whether audits could be done without access to Amazon’s full system, the researchers also built a copy of Rufus. They recreated the chatbot using a publicly available base model, GPT-4o-mini, and used a leaked version of what they believed was Rufus’s prompt template.

While this replica didn’t have access to Amazon’s product data, its patterns of uncertainty matched those of the real Rufus in more than 90% of the cases. This supports the idea that chatbot behavior can still be meaningfully evaluated even when full system access is off-limits.
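For illustration, a bare-bones version of such a replica might look like the sketch below, which substitutes a generic shopping-assistant system prompt for the leaked template (not reproduced here) and calls GPT-4o-mini through the OpenAI Python SDK. This is an assumption-laden sketch of the approach, not the researchers' actual setup.

```python
# A minimal sketch of how a Rufus-style replica might be wired up.
# RUFUS_STYLE_TEMPLATE is a placeholder, not the reported leaked template.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

RUFUS_STYLE_TEMPLATE = (
    "You are a shopping assistant for an online retailer. "
    "Answer the customer's question about the product."
)

def ask_replica(user_prompt: str) -> str:
    """Send a shopping question to the stand-in model and return its reply."""
    completion = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": RUFUS_STYLE_TEMPLATE},
            {"role": "user", "content": user_prompt},
        ],
    )
    return completion.choices[0].message.content
```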

What the Study Means

The Cornell team argues that companies should design their AI systems with dialect variation in mind, especially when deploying them to millions of users. Right now, many chatbots are trained mostly on formal, standard English, which doesn’t reflect how most people actually write.

“Chatbots should provide responses of equal quality when prompted with semantically equivalent text written in different dialects,” the paper notes.

The findings also push for more regular audits, not just of the models themselves, but of the systems built on top of them. According to the authors, it’s not enough to check for offensive language or safety risks; if the AI doesn’t understand how people naturally speak, it fails at its most basic job.

The Bigger Picture

Although the audit focused on Rufus, the implications extend to any AI assistant used in public-facing services. When a chatbot can’t parse common dialects or handle informal speech, it puts some users at a disadvantage. That may seem like a small technical glitch, but it reflects deeper patterns of exclusion.

As large language models spread into education, customer service, and public systems, the researchers say it’s time to treat language variation as a core design concern, not a corner case. If AI is going to help everyone, it has to understand everyone first.



