Report Highlights Costly AI Model Training, Potential Data Shortage Impacting Future Development

Stanford University's latest AI Index Report covers many AI trends. Among its findings: training AI models is very expensive, and a shortage of data could slow AI development in the future. Many AI companies do not disclose how much it costs to train and develop a model, but the figure can clearly run into millions of dollars. The report provides estimates of these training costs. For instance, the original Transformer model, the foundation of most modern large language models (LLMs), cost only about $930 to train.

RoBERTa Large, released in 2019, cost about $160,000 to train. OpenAI's GPT-4 and Google's Gemini Ultra cost around $78 million and $191 million respectively. These figures for models that have already been released suggest that models still in development could cost far more, possibly billions of dollars.

AI developers face another challenge: a lack of data to train their models. Many AI models have been built on data drawn from books, articles and other sources, and they depend heavily on that supply. In the coming years, data scientists may struggle to find new data with which to improve these models. Epoch AI published a report in 2022 predicting that high-quality language data would run out by 2024, that low-quality language data would be used up within the next two decades, and that image data would be exhausted between 2030 and 2040.

One proposed solution is to train LLMs on synthetic data, meaning data that the models generate themselves. Researchers from Stanford say this approach can help avoid data depletion and can also supply data where naturally occurring data is sparse. But training on synthetic data has limitations. Models trained this way can lose track of the true data distribution, so the range of what they produce narrows and the quality of their outputs declines.
