As artificial intelligence becomes more embedded in everyday communication, a new study from researchers in Munich reveals that the energy costs of interacting with large language models may be far greater than most people realise. By measuring the actual emissions generated when AI systems respond to user questions, the findings bring new clarity to the environmental consequences of automated reasoning, particularly when answers become long-winded or elaborate.
The researchers tested fourteen different AI models, each built with billions of parameters, and subjected them to 1,000 questions drawn from academic fields ranging from world history to abstract algebra. These questions were designed to simulate both standard text completion and more complex reasoning tasks, and each model's answers were analysed not only for accuracy but also for how much electricity they consumed and how much carbon dioxide equivalent they emitted in the process.
Results showed that as the models became more capable and accurate, their energy requirements rose sharply. The largest models in the sample, especially those programmed to explain their thinking in detail, were found to emit well over a thousand grams of CO2 equivalent when completing the full test. At the most extreme end, one model produced over two kilograms of CO2 equivalent to respond to the 1,000-question set, roughly two grams per question on average, a figure that highlights the growing carbon intensity of advanced AI services.
Although these models consistently achieved higher test scores, their environmental costs varied significantly depending on how they generated responses. Those that were set to "reason out" their answers before finalising them often produced not just longer outputs but also intermediary text, something akin to thinking aloud. This process, while helpful in boosting performance, led to much higher token counts, which in turn translated into more energy use and higher emissions. In several cases, a single reasoning-enabled model required up to 6,700 tokens to answer just one question in mathematics, with some responses stretching far beyond what might be considered necessary or efficient.
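The chain from verbosity to emissions is essentially multiplicative: more tokens mean more GPU time, more electricity, and more carbon. The study measured real energy draw; the sketch below only illustrates the relationship, and the per-token energy figure and grid carbon intensity are placeholder assumptions, not values from the research.

```python
# Illustrative sketch only: the per-token energy and grid-intensity
# constants below are assumed for demonstration, not taken from the study.

ENERGY_PER_TOKEN_WH = 0.002   # assumed Wh of GPU energy per generated token
GRID_G_CO2_PER_KWH = 400      # assumed grid carbon intensity in g CO2e/kWh

def estimated_emissions_g(tokens: int) -> float:
    """Rough grams of CO2 equivalent for generating `tokens` output tokens."""
    energy_kwh = tokens * ENERGY_PER_TOKEN_WH / 1000  # Wh -> kWh
    return energy_kwh * GRID_G_CO2_PER_KWH

# Compare a concise answer with a reasoning-style answer to one question.
# 6,700 is the token count the article reports for a single maths question.
concise = estimated_emissions_g(40)
verbose = estimated_emissions_g(6_700)
print(f"concise: {concise:.3f} g CO2e, verbose: {verbose:.2f} g CO2e")
```

Whatever the true constants are for a given deployment, the ratio between the two answers is driven entirely by the token counts, which is why "thinking aloud" is so costly.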
In contrast, smaller models consumed far less energy and remained relatively concise but struggled to deliver accurate or useful answers. The lightest model in the group, containing only seven billion parameters, emitted under 30 grams of CO2 equivalent over the full test but managed an accuracy score of just 32.9 percent. While this kept its environmental cost low, it failed to match the capabilities of the larger systems that are increasingly being used to power tools like AI writing assistants, chatbots, and search enhancements.
One of the more telling findings concerned the tension between energy efficiency and reasoning ability. The study highlighted cases where certain models, like a 72-billion-parameter system developed by Alibaba, managed to maintain a relatively low emission profile while achieving respectable accuracy. This suggests that it may be possible to design AI systems that strike a balance between environmental impact and performance, though doing so might require careful restrictions on how these models approach complex problems.
Across the different subject areas tested, the models encountered varying degrees of difficulty. Questions in world history and high school mathematics were handled with relative ease, both in terms of emissions and accuracy. However, areas involving symbolic reasoning, such as abstract algebra, consistently led to both higher error rates and greater energy demands. This indicates that some disciplines inherently require more computational effort, particularly when models are encouraged to work through problems step by step.
Perhaps most notably, the study makes clear that verbosity remains a serious challenge when it comes to sustainability. Even in scenarios that called for short answers, such as multiple-choice questions with only four options, several models returned responses that stretched into thousands of words. One such output, recorded in response to a single algebra question, spanned over 37,000 words. These kinds of responses place a heavy load on processing infrastructure and help explain the steep energy demands observed.
Despite the scale of the issue, the researchers noted that very few academic papers in the AI field take emissions into account. Even as language models expand in both capability and scope, environmental reporting remains rare. Fewer than two percent of recent publications explicitly reference carbon outputs or sustainability metrics, and even fewer provide real-world measurements as opposed to estimates.
The researchers argue that transparency around the environmental footprint of AI is urgently needed. While the deployment of increasingly large and articulate models continues to accelerate, little attention has been paid to what that growth means in terms of energy use. This is especially critical given that global electricity consumption by generative AI now rivals the total electricity use of some developed countries.
The study did not aim to discourage the use of AI altogether, but rather to highlight the unseen trade-offs that come with pursuing more intelligent, more human-like responses. It also raised the possibility of future systems being designed more efficiently, not by shrinking their parameter count, but by optimising how they reason and how much text they produce in the process.
While the findings cannot be easily applied to every AI model, particularly those with several hundred billion parameters or those trained on different hardware, the research sets a precedent for more rigorous environmental assessment. Given the fast pace of AI adoption and the increasing reliance on these tools in daily life, decisions about their design and use may soon need to account not just for how well they perform, but also for how much they cost the planet.