AI’s Limits Exposed: New Study Finds Machines Struggle With Real Remote Work

For years, discussion around artificial intelligence has centered on whether machines could eventually replace human jobs. That question has become sharper with the growth of remote work, where tasks are handled entirely online and often require a mix of technical and creative ability. A new study from the Center for AI Safety and Scale AI now provides a clearer picture of what AI can actually do in those settings. The findings show that, despite steady progress in reasoning and automation tools, today’s AI systems can complete only a small fraction of real freelance projects at human quality levels.

The study, called the Remote Labor Index (RLI), represents one of the most detailed attempts so far to measure AI’s performance on practical digital work. It focuses on tasks that mirror real online freelancing jobs rather than theoretical tests or benchmark problems. Researchers collected 240 completed projects from professional freelancers working through platforms such as Upwork. Each project included the original brief, all input materials, and the final deliverable that a client had accepted. These projects came from 23 categories of work, including product design, animation, architecture, game development, and data analysis. Together they covered more than 6,000 hours of paid labor valued at about $140,000.

Six advanced AI agents were then tested on the same projects: Manus, Grok 4, Sonnet 4.5, GPT-5, ChatGPT agent, and Gemini 2.5 Pro. Human evaluators compared each AI deliverable against the professional standard set by the original. The key measure was the automation rate, defined as the percentage of projects an AI completed to a standard that a reasonable client would accept.

The overall results placed current AI performance close to the bottom of the scale. Manus achieved the best outcome, with a 2.5 percent automation rate. Grok 4 and Sonnet 4.5 followed at 2.1 percent, while GPT-5 and ChatGPT agent reached 1.7 and 1.3 percent, respectively. Gemini 2.5 Pro finished last at 0.8 percent. In effect, even the strongest model could complete only two or three projects out of every hundred. These numbers confirm that most paid remote work remains well beyond the reach of today’s AI systems.

To understand why, the study reviewed where and how the models failed. Nearly half of the AI outputs were judged to be of poor quality. About 36 percent were incomplete, and 18 percent contained technical errors such as corrupted or unusable files. Many tasks broke down before completion, with missing visuals, truncated videos, or unfinished code. Others showed inconsistency between design elements, such as an object changing shape between different 3D views. These errors highlight that even powerful models lack the internal verification ability that human workers apply when checking and refining their own results.

The researchers also noted that remote projects typically combine several layers of skill. A single job might involve writing, coding, design choices, and client-level presentation. While current AI models can produce functional text, basic graphics, or snippets of code, they often fail to align all these elements into a coherent, professional output. The lack of integrated quality control leads to results that are close to correct in parts but unsatisfactory as complete deliverables.

Some narrow areas showed stronger AI performance. Tasks involving short audio clips, simple image generation, or data visualization were occasionally completed at human level. In those cases, the systems benefited from established generative tools that already handle single-format media. The study also used an Elo score, a relative rating built from head-to-head comparisons, to track progress between models. Although none matched the human baseline, newer models showed measurable improvement over earlier versions, suggesting steady if limited advancement.

Economically, the gap between potential and reality remains wide. When translated into market value, the highest-earning model, Manus, produced accepted work worth only $1,720 out of a total pool of nearly $144,000, or roughly 1.2 percent of the available value. This indicates that the contribution of current AI tools to freelance productivity is still marginal. The data also show that AI has not yet achieved meaningful cost deflation in remote labor markets, since most tasks still require full human oversight or rework.

For professionals who depend on online freelance income, the study’s conclusions provide some reassurance. Remote workers, especially in design, architecture, and multimedia fields, remain largely irreplaceable at present. The same applies to roles that involve judgment, error correction, and visual or interactive quality checks. However, the results also point to a gradual path of improvement. As AI models gain better multimodal reasoning and tool-use capacity, they may begin to handle larger portions of complex tasks under supervision.

The authors acknowledge that the benchmark does not cover all types of remote jobs. Work involving direct client communication, teamwork, or long-term project management was excluded. Even so, the Remote Labor Index represents the broadest test so far of AI’s real automation capacity in economically meaningful work. Its value lies in offering empirical measurement rather than assumption. By grounding AI evaluation in actual freelance projects, it shifts the conversation from hypothetical capabilities to demonstrated performance.

The findings suggest that the path to full automation of digital labor remains long. While AI can now assist with smaller creative or technical steps, it still struggles with the coordination, judgment, and quality assurance that professional work requires. Future updates to the RLI may help track whether ongoing model improvements translate into practical economic performance. For now, the study confirms that artificial intelligence, though advancing quickly, has yet to match the reliability and completeness of human remote workers.


Image: Yasmina H / Unsplash

Notes: This post was edited/created using GenAI tools.
