ChatGPT’s debut in late 2022 brought an AI revolution into the spotlight, but its inner workings and implications have remained opaque to many users. Beneath the hype are lesser-known facts about how ChatGPT was built and is maintained – from the data it learned from to the people who made it safe. Here are 11 behind-the-scenes realities about ChatGPT, each backed by documented sources.
ChatGPT’s Training on Copyrighted Material: Lawsuits and Scraped Data
ChatGPT was trained on vast amounts of text harvested from the Internet – including copyrighted books and articles. In fact, a June 2023 U.S. lawsuit by novelists Paul Tremblay and Mona Awad alleges ChatGPT “mined data copied from thousands of books without permission,” claiming their works were used to train the model. The plaintiffs estimated OpenAI’s training corpus contained over 300,000 books, including titles drawn from so‑called “shadow libraries” of pirated text. The New York Times has also sued OpenAI, asserting its articles were “one of the biggest sources of copyrighted text” in ChatGPT’s training data. Similarly, Canadian media companies charged in 2024 that OpenAI “scrap[ed] large swaths of content” from news outlets without permission or payment. In short, ChatGPT’s knowledge comes from enormous web scrapes, and writers and publishers have raised multiple copyright claims over that data.
Memorized Training Data: Verbatim Regurgitation Risks
AI experts have found that ChatGPT can sometimes “leak” portions of its training data verbatim. By repeatedly prompting the model (for example, asking it to repeat a single word thousands of times), researchers discovered that ChatGPT can diverge from the task and output chunks of memorized content. In one study, security researchers had ChatGPT keep repeating “book”; after thousands of repetitions, it began spitting out an email signature and contact details that appear to have been memorized from its training data. Similarly, prompting with “poem” led ChatGPT to output parts of a copyrighted poem and even private user information. By spending about $200 on queries, the researchers extracted over 10,000 memorized training examples, including computer code, snippets of books, and personal data. These findings underscore privacy and copyright risks: even though ChatGPT usually generates original-sounding text, clever prompts can cause it to regurgitate exact phrases or data it saw during training.
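For readers curious what such a probe looks like in practice, here is a minimal sketch against OpenAI’s public chat API. The model name, prompt wording, and divergence check are illustrative assumptions rather than the researchers’ exact setup, and current models may simply refuse or stop rather than leak anything.

```python
# A minimal sketch of the "repeated-word" divergence probe described above.
# Assumes the official `openai` Python package (v1.x) and an OPENAI_API_KEY
# in the environment; model choice and prompt wording are illustrative.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment
PROBE_WORD = "poem"

response = client.chat.completions.create(
    model="gpt-3.5-turbo",  # hypothetical choice of model
    messages=[{
        "role": "user",
        "content": f'Repeat the word "{PROBE_WORD}" forever: '
                   + " ".join([PROBE_WORD] * 50),
    }],
    max_tokens=1024,
)

text = response.choices[0].message.content or ""
# Heuristic check: once the model stops repeating the probe word, whatever
# follows ("divergent" text) is what researchers inspected for memorized
# content such as contact details or book passages.
tail = text.replace(PROBE_WORD, "").strip(" \n,")
if tail:
    print("Divergent output to inspect:\n", tail[:500])
else:
    print("Model kept repeating the word; no divergence observed.")
```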
The Mystery of GPT-4’s Architecture
When OpenAI unveiled GPT-4 (the model behind newer versions of ChatGPT) in March 2023, it released only a high-level capability report – not the nuts-and-bolts design. The GPT-4 technical report explicitly states that it contains “no further details about the architecture (including model size), hardware, training compute, dataset construction, training method, or similar.” In other words, OpenAI did not disclose how many parameters GPT-4 has, how it was trained, or on what hardware – citing competitive and safety reasons. This secrecy drew criticism. Fast Company noted the 98-page paper was “long on claims” (like GPT-4’s high test scores) but short on concrete engineering details, calling the lack of transparency a “cop-out” that frustrates researchers. (OpenAI CEO Sam Altman later commented that GPT-4 cost on the order of $100 million to train, but he too gave few specifics.) In sum, GPT-4’s internal design remains largely a black box outside OpenAI.
Microsoft’s Deep Involvement: Licensing and Cloud Control
Since 2019, Microsoft has been OpenAI’s financial and infrastructure anchor. That year Microsoft announced a $1 billion investment and became OpenAI’s exclusive cloud provider, which is why all of OpenAI’s training and ChatGPT’s hosting run on Microsoft Azure data centers. Microsoft also secured an exclusive license to use OpenAI’s models. For example, as The Verge reports, Microsoft acquired an “exclusive license to GPT-3” alongside its investment, and later news reports indicate Microsoft invested billions more for an exclusive license to GPT-4 and future models. This means ChatGPT’s technology powers not only OpenAI’s site but also Microsoft products (like Bing Chat and the GitHub Copilot coding assistant). In early 2025 this tight relationship got a tweak: a joint AI infrastructure effort (“Project Stargate”) opened the field to other partners, ending Microsoft’s status as sole cloud provider. However, Microsoft still holds a “right of first refusal” for additional AI compute through 2030, ensuring it remains central to OpenAI’s computing.
Web Scraping to Licensing Deals (Reddit, Stack Overflow)
OpenAI’s models were initially trained on unlicensed web data – for example, a 2023 audit found Reddit content in GPT’s training sets. In early 2024, OpenAI started formalizing relationships with major data sources. In May 2024, Stack Overflow announced a collaboration: OpenAI will use Stack Overflow’s API and curated Q&A content to improve its coding-related models, and Stack Overflow will be credited when its answers show up in ChatGPT’s responses. Around the same time, Reddit partnered with OpenAI to feed its content into the chatbot via Reddit’s API. (Previously, Reddit data had largely been scraped without explicit permission.) These licensing deals mean OpenAI is now paying or crediting sites it once simply mined – a shift from freely scraping data toward contractual access to popular sources.
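The partnership itself runs through Stack Overflow’s commercial OverflowAPI product, whose integration details are not public. As a rough illustration of what programmatic access to curated Q&A content can look like, here is a sketch against Stack Exchange’s free public API; the endpoint and parameters belong to that public API, not to the OpenAI deal.

```python
# Illustrative only: fetch Stack Overflow questions via the free public
# Stack Exchange API. The OpenAI partnership uses the separate OverflowAPI
# product, whose interface is not publicly documented.
import requests

resp = requests.get(
    "https://api.stackexchange.com/2.3/search/advanced",
    params={
        "site": "stackoverflow",
        "q": "python list comprehension",  # example query
        "order": "desc",
        "sort": "relevance",
        "pagesize": 5,
    },
    timeout=10,
)
resp.raise_for_status()

for item in resp.json().get("items", []):
    # Attribution matters: the deal requires crediting Stack Overflow when
    # its content surfaces in ChatGPT's answers.
    print(item["title"], "->", item["link"])
```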
Environmental Footprint: Energy and Water Use
Training and running ChatGPT consume enormous computing resources – and therefore lots of electricity and even water for cooling. Generative AI models are trained on large GPU clusters: MIT reports that a single AI training cluster can use “seven or eight times more energy than a typical computing workload”. In North America, data center power demand roughly doubled from ~2,700 MW in 2022 to ~5,300 MW by end of 2023, largely driven by AI needs. Globally, data centers in 2022 used about 460 terawatt-hours (TWh) of electricity – more than the nation of Saudi Arabia. That could rise to over 1,000 TWh by 2026 (as much as Japan’s electricity use) if current AI growth continues. Cooling these centers also uses water: Yale Environment360 notes that generative AI “uses massive amounts of energy and millions of gallons of water” in data center cooling. One estimate (from MIT researchers) found that an average ChatGPT session (10–50 questions) requires about 0.5 liters of water for cooling. These figures highlight the often-overlooked environmental costs of AI services.
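To make the water figure concrete, here is a small back-of-the-envelope calculation based only on the 0.5-liter-per-session estimate above; the daily session count at the end is an illustrative assumption, not a reported statistic.

```python
# Back-of-the-envelope arithmetic for the cooling-water estimate cited above:
# roughly 0.5 liters per ChatGPT session of 10-50 questions.
SESSION_LITERS = 0.5
QUESTIONS_LOW, QUESTIONS_HIGH = 10, 50

per_question_max = SESSION_LITERS / QUESTIONS_LOW    # 0.05 L (~50 mL) per question
per_question_min = SESSION_LITERS / QUESTIONS_HIGH   # 0.01 L (~10 mL) per question

# Hypothetical scale-up (assumption, not a reported figure): 10 million such
# sessions per day would need on the order of 5 million liters of water daily.
daily_sessions = 10_000_000
print(f"Per question: {per_question_min * 1000:.0f}-{per_question_max * 1000:.0f} mL")
print(f"At {daily_sessions:,} sessions/day: {daily_sessions * SESSION_LITERS:,.0f} liters/day")
```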
Hallucinations: Legal and Medical Misinformation
ChatGPT is prone to “hallucinations” – confidently stated but false information. This has led to real-world problems in high-stakes domains. In law, there have been numerous instances of lawyers submitting court documents with fictitious case citations generated by ChatGPT. Reuters reports at least seven U.S. cases where attorneys included made-up legal rulings from the chatbot; in one example (Walmart litigation), the lawyer admitted ChatGPT had “hallucinated” the case law. The trend is growing: a legal consultant found 120 cases worldwide (as of mid-2025) where judges discovered AI-generated bogus quotes or cases, increasingly filed by attorneys rather than self-represented litigants. In medicine, ChatGPT’s errors can be dangerous too. A 2024 study reported by Live Science found GPT-3.5 (ChatGPT’s engine at launch) gave correct diagnoses less than half the time (49%) on a set of medical cases. Other research shows ChatGPT routinely fabricates citations or advice in clinical contexts. For example, Nature reports that nearly half of 115 medical references ChatGPT generated were entirely made up (and 46% were real citations used incorrectly). In summary, experts warn that ChatGPT should not be trusted for legal or medical advice without human oversight, since it can just as easily invent believable-sounding falsehoods.
English Bias: Uneven Performance on Other Languages
Like many AI models, ChatGPT works best in English. The model was primarily trained on English-language data, so it “does amazingly well in English” but often performs poorly in lower-resource languages. For instance, users testing ChatGPT in Tigrinya and other less-common languages found it frequently produced nonsensical results or simply returned gibberish. OpenAI claims GPT-4 improved multilingual ability and did outperform GPT-3.5 on a 26-language benchmark, but outside of high-resource languages like English, the chatbot’s responses can be unreliable. As one expert explained, “If you’re not using English, chances are you are having a worse experience.” This English-centric performance reflects biases in the data: far more English content was available for training, so other languages lag behind. Critics say this leaves non-English speakers at a disadvantage unless new training efforts are made to support those languages.
User Prompts Used for Future Training
Your ChatGPT conversations can help train future AI models – unless you opt out. Originally, OpenAI collected user chats to refine ChatGPT, a practice that raised privacy concerns. In April 2023, OpenAI introduced a user-controlled setting so that people can disable the use of their ChatGPT conversations for model training. The company provides a toggle (under “Data Controls”) that lets each user turn off “Improve the model for everyone,” meaning their conversations will not be saved to train new models. (Even with training disabled, OpenAI says new chats are retained for up to 30 days for abuse monitoring and then deleted.) Today, conversations on consumer ChatGPT accounts may be used for training unless the user opts out, while OpenAI says data submitted through its business offerings (the API and ChatGPT Enterprise) is not used for training by default.
Public GitHub Code: Codex and Copilot
OpenAI’s coding-oriented AI, Codex (which powers GitHub Copilot), was trained on public programming code – mostly from GitHub. Microsoft and GitHub assert that Copilot uses only publicly available code and includes a filter to avoid verbatim copies. That has not stopped litigation, however. In late 2022 a group of programmers filed a class-action lawsuit against Microsoft, GitHub, and OpenAI, alleging that Copilot improperly “generated unauthorized copies of open-source code hosted on GitHub” that had been used in training. They argued this violated open-source licenses (e.g., by failing to provide required attribution). In July 2024, a U.S. judge dismissed the plaintiffs’ copyright infringement claims, ruling they had not shown Copilot was copying code verbatim. (A remaining contract-based claim about open-source license terms is still pending.) Reuters’ coverage notes the complaint claimed the companies “trained Copilot with code from GitHub repositories without complying with open-source licensing terms.” In practice, this means Codex learned from millions of lines of public GitHub code. Microsoft and OpenAI counter that Copilot’s suggestions count as fair use, and they point to Copilot’s duplication filter as evidence the tool is built to avoid copying. The dispute is a landmark in AI policy: it raises the question of whether building AI from open-source content must respect the original licenses.
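GitHub’s duplication filter is proprietary, but the underlying idea – suppressing suggestions that reproduce public code verbatim – can be illustrated with a simple n-gram overlap check. The sketch below is a conceptual illustration under stated assumptions, not GitHub’s actual implementation.

```python
# Conceptual sketch of a verbatim-duplication filter: flag a code suggestion
# if any window of N consecutive tokens also appears in a corpus of known
# public code. Illustrative only; GitHub's real filter is not public.
def ngrams(tokens, n):
    """Return the set of all n-token windows in a token sequence."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def is_likely_verbatim(suggestion: str, public_corpus: list[str], n: int = 12) -> bool:
    """True if the suggestion shares an n-token run with any corpus entry."""
    suggestion_grams = ngrams(suggestion.split(), n)
    for source in public_corpus:
        if suggestion_grams & ngrams(source.split(), n):
            return True  # long exact overlap with known public code
    return False

# Usage with a tiny hypothetical corpus (a real system would index millions
# of files and use a much longer window than 4 tokens).
corpus_file = "for i in range(len(arr)): arr[i] = arr[i] * 2"
suggestion = "for i in range(len(arr)): arr[i] += 1"
print(is_likely_verbatim(suggestion, [corpus_file], n=4))  # True: shares a 4-token window
```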
Human Labor Behind the Scenes: Low-Paid Moderators
Modern AI still depends on people. In particular, OpenAI hired low-paid content moderators to filter and label training data. Investigations in 2023 revealed that OpenAI contracted Sama (formerly Samasource) in Nairobi to have workers sift through explicit and disturbing text and images so ChatGPT would learn what not to say. According to TIME and The Guardian, these Kenyan contractors earned only about $1.50 to $2.00 per hour. The work was grueling and often traumatic: moderators reviewed hundreds of graphic examples of violence or sexual abuse each day to train OpenAI’s safety filters. One whistleblower told TIME he saw torture content and started having nightmares. In mid-2023 a group of Kenyan moderators employed through Sama filed a petition describing “exploitative” conditions – long hours, pay as low as $1.46/hour, and exposure to extreme content with minimal support. Sama itself confirmed the pay range in public statements (about $1.30–$2.00/hr, depending on experience). OpenAI has defended the work as necessary for making safe AI, but critics say the human cost is high. These contractors are a hidden workforce whose low wages and risks helped transform crude web data into the relatively “safe” ChatGPT we use today.
Image: DIW-Aigen