Study Reveals ChatGPT-4's Remarkable 'Theory of Mind' Abilities, Outperforming Previous Models

A new study published in Proceedings of the National Academy of Sciences reports that large language models (LLMs) such as ChatGPT are showing “theory of mind” abilities of the kind seen in humans. When testing ChatGPT-4, the researcher found that it solved 75% of the tasks, performing on par with a six-year-old child. This suggests that the reasoning abilities of LLMs are improving. Theory of mind refers to the human ability to understand the beliefs, emotions and mental states of other people and to interact with them on that basis. In humans, this ability emerges in early childhood and continues to develop throughout life.

The researcher, Michal Kosinski, said that LLMs can predict users' preferences based on the websites they visit, the products they purchase, their music choices and other behavioral data. When predicting behavior, it is also important to understand the psychological processes of the individuals involved. For the study, the researcher used false-belief tasks, a type of psychological test, to assess the ability of LLMs to predict how a person would respond.

Two types of tasks, the Unexpected Contents task and the Unexpected Transfer task, were used for the false-belief test. In the Unexpected Contents task, a subject sees a container with a misleading label and assumes the label to be accurate. In the Unexpected Transfer task, an object is moved without the subject knowing, and the subject searches for the object in its original location. The LLMs tested had to predict what a human would do in these two situations. Kosinski evaluated 11 LLMs and created 40 false-belief tasks to test them. Each false-belief scenario targeted the model's comprehension and understanding of the real world.

The results showed that GPT-1 and GPT-2 were unable to solve the false-belief tasks, indicating that these earlier models lack this ability. ChatGPT-3, by contrast, solved 20% of the tasks accurately, a level comparable to that of a three-year-old. The best-performing LLM was ChatGPT-4, which completed 75% of the tasks accurately. It solved 90% of the Unexpected Contents tasks but only 60% of the Unexpected Transfer tasks. The results also showed that ChatGPT-4 was able to adjust its predictions based on context and reasoning rather than simple patterns.
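
The overall 75% figure can be checked against the two per-task figures. The sketch below is only an illustration: it assumes the 40 tasks were split evenly, 20 per task type, which the article does not state, and the function and variable names are hypothetical.

```python
def overall_accuracy(results):
    """Combine per-task-type results into one overall accuracy.

    results maps a task type to (fraction solved, number of tasks).
    """
    total_tasks = sum(count for _, count in results.values())
    total_solved = sum(fraction * count for fraction, count in results.values())
    return total_solved / total_tasks


# Assumption: 20 tasks of each type (the article only gives the total of 40).
gpt4_results = {
    "unexpected_contents": (0.90, 20),
    "unexpected_transfer": (0.60, 20),
}

print(f"{overall_accuracy(gpt4_results):.0%}")  # prints 75%
```

Under that even split, 90% and 60% correspond to 18 and 12 solved tasks, or 30 of 40 in total, which matches the reported 75%.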

