Sally’s Siblings Stump GPT-4: Why Hinton Says AI Isn’t Ready for Real Reasoning

Geoffrey Hinton, awarded the 2024 Nobel Prize in Physics for foundational work enabling machine learning with artificial neural networks, recently admitted he had placed a little too much faith in OpenAI's GPT-4, the AI system he frequently relies on. In a televised interview with CBS, he described an exchange in which GPT-4 fumbled a simple logic puzzle, a reminder, he suggested, that today's most advanced systems still struggle with basic reasoning.

The example he chose was a classic riddle: Sally has three brothers, and each brother has two sisters. How many sisters does Sally have? The correct answer is one, since each brother's two sisters must be Sally plus one other girl, making two girls in the family in total. GPT-4 nonetheless answered two, in effect counting Sally among her own sisters.
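For readers who want the counting spelled out, here is a minimal Python sketch of that reasoning. The sibling placeholders are hypothetical; only the counts come from the riddle itself:

    # Model the family the riddle implies: Sally, her three brothers,
    # and however many girls make "each brother has two sisters" true.
    girls = {"Sally", "other_sister"}               # two girls total satisfies the riddle
    boys = {"brother_1", "brother_2", "brother_3"}  # Sally has three brothers

    # Each brother's sisters are all the girls in the family.
    assert len(boys) == 3                           # "Sally has three brothers"
    assert len(girls) == 2                          # "each brother has two sisters"

    # Sally's sisters are the girls other than Sally herself.
    sallys_sisters = girls - {"Sally"}
    print(len(sallys_sisters))                      # 1, the answer GPT-4 missed

The common mistake, by people and models alike, is to add the brothers' two sisters to Sally instead of recognizing that Sally is one of them.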

Hinton, who contributed extensively to the foundations of modern AI, seemed surprised that a model capable of acing law and college admission exams could stumble on such a basic problem. For him, the incident served as a quiet warning that even widely trusted systems remain far from infallible.

While GPT-4 failed the riddle, other versions, such as GPT-4o and GPT-4.1, reportedly answered correctly when users tried the puzzle after the interview aired. These updates, which OpenAI launched after GPT-4’s 2023 debut, aim to improve speed, accuracy, and cross-modal capabilities. GPT-4o, now the default model in ChatGPT, promises smarter voice, text, and vision processing. GPT-4.5 and GPT-4.1 further build on that, though OpenAI hasn't officially addressed the riddle issue.

According to AI model rankings from Chatbot Arena, a crowd-sourced performance leaderboard, Google's Gemini 2.5 Pro currently holds the top position, with OpenAI's GPT-4 variants trailing closely. The standings reflect the intense rivalry among companies racing to refine general-purpose AI systems.

Elsewhere in the AI research space, findings by model evaluation firm Giskard suggest that when users ask chatbots to respond briefly, those bots tend to generate more false or fabricated statements. The effect appeared across major models, including GPT-4o, Claude, and Mistral, highlighting a trade-off between response length and factual reliability.

Hinton, while optimistic about future improvements, acknowledged that AI’s progress remains uneven. Despite his admiration for the tools he helped inspire, he reminded audiences that intelligence in machines doesn’t always equal sound judgment.

Image: DIW-Aigen

Read next: How to View Any Website’s Past Versions Using the Wayback Machine