A new study by Which? has examined the accuracy of AI tools in answering consumer questions across finance, legal matters, health, diet, and travel. The research comes as roughly half of UK adults, estimated at more than 25 million people, report using AI for online searches, and many of those users place moderate to high trust in the responses they receive.
Which? tested six AI systems under controlled lab conditions: ChatGPT, Google Gemini, Gemini AI Overview (AIO), Microsoft Copilot, Meta AI, and Perplexity. Each tool received 40 questions, and responses were assessed by experts on accuracy, relevance, clarity, usefulness, and ethical responsibility. Overall scores out of 100 were calculated for each tool.
Meta AI scored lowest at 55 per cent. ChatGPT scored 64 per cent, Copilot 68 per cent, and Gemini 69 per cent. Gemini AIO scored 70 per cent, while Perplexity scored 71 per cent, achieving the highest marks for accuracy, relevance, clarity, and usefulness.
The study identified inaccuracies across multiple domains. ChatGPT and Copilot did not correct an incorrect ISA allowance in a test question. ChatGPT and Perplexity suggested commercial tax-refund services alongside government options. Travel-related responses varied, with some tools providing incorrect information on flight compensation and insurance requirements. Legal queries on broadband and building services were also answered incorrectly or without necessary caveats. Health and diet advice occasionally relied on older or informal sources, such as forum posts.
Which? recommends that users specify questions clearly, review the sources AI uses, and consult qualified professionals for complex financial, legal, or medical issues. AI tools can summarize information and assist research but should not be the sole source for important decisions.
Representatives from Google, Microsoft, and OpenAI outlined ongoing improvements to AI accuracy and encouraged professional consultation for sensitive matters. Meta and Perplexity did not comment.
The findings point to a gap between the accuracy of AI responses and the level of trust some users place in them for consumer questions.
Image: Salvador Rios / Unsplash
Notes: This post was drafted with the assistance of AI tools and reviewed, edited, and published by humans.
