When LLMs Lies Smoothly: Study Reveals Gaps Between AI's Decisions and Explanations

Large language models have become eerily good at sounding human. That’s not new. But there’s a deeper issue gaining attention i.e. when these models explain their answers, can we actually trust what they say?

Researchers at Microsoft and MIT's CSAIL think we need better tools for answering that question. They’ve come up with a new way to test whether AI explanations really reflect the decision-making that led to a given output—or if the model is just giving a polished, plausible-sounding excuse.

This matters. A lot. If an AI makes decisions based on biased or irrelevant factors, but then justifies itself with something reasonable, people may believe it's being fair or accurate. And they may not think to question it—especially in places like hiring, medicine, or legal work, where stakes are high.

One troubling case: a well-known model ranked female nursing candidates higher than male ones, even when the only change was the gender listed. Yet when asked to explain, it pointed to skills and experience, not gender. That’s a textbook case of an “unfaithful” explanation—the logic doesn’t match the answer.

To make this easier to catch, the team developed what they call causal concept faithfulness. It’s a mouthful, but the idea is simple: check if the concepts the AI says influenced its answer are the ones that actually mattered.

How do you test that? First, you need to know what concepts are in the question. For that, the researchers used a second AI model—basically a helper—to pull out the key ideas. Then, they changed those ideas one by one and fed the altered questions back to the original model. If the answer changes, you’ve got evidence that the concept was causally important. If the model didn’t mention that concept in its explanation? It’s hiding something.

This kind of testing doesn’t come cheap. You need lots of examples and lots of back-and-forth testing with the models, which racks up computing time. To make it manageable, the team used a statistical shortcut—a Bayesian hierarchical model—to estimate multiple effects at once rather than testing each in isolation.

In practice, the method uncovered some uncomfortable patterns. On a dataset designed to test for social bias, language models sometimes made decisions that clearly leaned on race, gender, or income—but then claimed otherwise, pointing to other attributes like personality or behavior. In another dataset based on patient care, explanations left out crucial medical details that had clearly influenced the outcome.

It’s not a perfect system. The helper model that identifies and edits concepts can slip up, and it’s still hard to untangle concepts that appear together often. The researchers say future versions might experiment with adjusting multiple inputs at once.

But even with those limits, the method gives developers and users a powerful tool: the ability to see when AI is presenting a clean narrative that doesn’t line up with what’s actually under the hood. That opens the door to targeted fixes. If you know a model is quietly favoring one gender, you can stop using it for comparisons—or go in and fix the bias directly.

In the long run, tools like this may be critical for building AI systems that people can actually trust—not because they sound smart, but because their reasoning holds up when tested.


Image: DIW-Aigen

Read next: Study Warns of Surge in Security Breaches Linked to Social Media Impostors
Previous Post Next Post