AI Agents Struggle in Simulated Markets, Easily Fooled by Fake Sellers, Microsoft Study Finds

AI assistants are being trained to handle purchases and digital errands for people, but Microsoft’s latest research shows that these systems remain far from reliable. The company built an experimental platform called Magnetic Marketplace to test how modern AI agents behave in a simulated economy. Instead of becoming efficient digital shoppers, many of them made poor choices, got distracted by fake promotions, and sometimes fell for manipulation.

The simulation brought together 100 virtual customers and 300 virtual businesses. On paper, it sounds like a practical way to study real-world digital transactions, where one agent buys food, books, or services from another. Microsoft’s team loaded the environment with models from OpenAI, Google, and several open-source projects, including GPT-4o, GPT-5, Gemini-2.5-Flash, OSS-20b, and Qwen3. Each model acted either as a buyer or seller, negotiating through a controlled online market. The results were revealing.

When agents were asked to order something as simple as a meal or home repair, their decision-making showed deep weaknesses. As the range of available choices grew, performance fell sharply. In one test, GPT-5’s average consumer welfare score dropped from near 2,000 to around 1,100 when exposed to too many options. Gemini-2.5-Flash saw its score decline from about 1,700 to 1,300. Agents that had to navigate long lists or compare hundreds of sellers lost their focus and often settled for “good enough” matches rather than ideal ones.


The study described this as a kind of “paradox of choice.” More options did not mean better results. In many runs, agents reviewed only a small fraction of available businesses, even when hundreds were open for selection. Some models like GPT-4o and GPT-4.1 maintained slightly steadier performance, staying near 1,500 to 1,700 points, but they too struggled when markets became crowded. Claude Sonnet 4’s score collapsed from 1,800 to just 600 under heavier loads.

Another problem emerged around speed. In this artificial economy, selling agents that responded first dominated the market. Microsoft measured a 10 to 30 times advantage for early replies compared to slower ones, regardless of product quality. This behavior hints at a potential flaw in future automated markets, where quick manipulation could outweigh fair competition. Businesses might end up competing on who responds fastest instead of who offers the best value.

Manipulation also proved alarmingly effective. Microsoft’s researchers tested six different persuasion and hacking strategies, ranging from false awards and fabricated reviews to prompt injection attacks that tried to rewrite an agent’s instructions. The results varied by model. Gemini-2.5-Flash resisted most soft manipulations but gave in to strong prompt injections. GPT-4o and some open-source models like Qwen3-4b were far more gullible, sending payments to fake businesses after reading false claims about certifications or customer numbers.

Even simple psychological tricks worked. When presented with phrases that invoked authority or fear, such as fake safety warnings or references to “award-winning” restaurants, several agents switched their choices. These behaviors highlight major security concerns for future AI marketplaces, where automated systems may end up trading with malicious agents that pretend to be trustworthy.

The researchers also noticed bias in how agents selected from search results. Some open-source models tended to pick businesses listed at the top or bottom of a page, showing positional bias unrelated to quality. Across all models, there was a pattern known as “first-offer acceptance.” Most agents picked the first reasonable offer they received instead of comparing multiple ones. GPT-4o and GPT-5 displayed this same bias, even though they performed better overall.

When taken together, the findings show that these AI agents are not yet dependable for financial decisions. The technology still requires close human supervision. Without it, users could end up with wrong orders, biased selections, or even security breaches. Microsoft’s team acknowledged that their simulation represented static conditions, while real markets constantly change. Agents and users learn over time, but such adaptation adds another layer of complexity that has not yet been solved.

The Magnetic Marketplace experiment gives a glimpse of what might come next in the evolution of digital economies. It shows that even advanced models can collapse under too much data, misjudge credibility, or act impulsively when overloaded. For now, these systems are better suited as assistants than autonomous decision-makers.

Microsoft’s open-source release of the Magnetic Marketplace offers an important testing ground for developers and researchers. Before AI agents are allowed to manage money, they will need stronger reasoning, improved security filters, and mechanisms to handle complex human-like markets. The results make one thing clear: automation alone cannot guarantee intelligence. Real trust will depend on oversight, transparency, and the ability of these systems to resist persuasion as well as they handle logic.

Notes: This post was edited/created using GenAI tools.

Read next: Your Favorite AI Might Be Cheating Its Exams, Researchers Warn
Previous Post Next Post