Study Finds Image-Generating AIs Struggle With Basic Instructions Despite Visual Appeal

Artificial intelligence can paint breathtaking pictures from a few lines of text, yet its understanding of those words remains shaky.

A new interdisciplinary study by researchers from the University of Liège, the University of Lorraine, and the École des Hautes Études en Sciences Sociales finds that popular image generators like Midjourney and DALL·E (used by OpenAI's ChatGPT and Microsoft's Bing Image Creator) still falter when asked to follow even basic verbal instructions. The paper, published in Semiotic Review, analyzes how these systems interpret written prompts and turn them into visual scenes. The verdict: aesthetic success, conceptual confusion.

The team approached the question from both humanistic and computational angles. They treated text-to-image generation as a kind of translation, where words are converted into two-dimensional compositions. Each prompt became a test of how well a model could turn linguistic categories into spatial, chromatic, and figurative elements. By combining semiotics, art history, and computer science, the researchers examined how these AIs “see” and how they misread what they are told to show.

To build reliable data, the group generated hundreds of images from the same prompts, repeating some commands up to fifty times. Each generation was examined for placement, number of objects, gaze direction, and temporal coherence (the rhythm or sequence of an action within a single image). In theory, such repetition should smooth out random variation and reveal stable tendencies. Instead, the results exposed persistent gaps between instruction and image.
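The paper does not publish its collection scripts, so the following is a purely illustrative sketch of what such a repetition protocol might look like in Python with OpenAI's image API. The model name, prompt, repetition count, and file layout are assumptions for demonstration, not the authors' actual pipeline.

```python
# Illustrative sketch (not the study's code): regenerate one prompt many
# times so random variation averages out and systematic errors stand out.
# Assumes the official OpenAI Python SDK and an OPENAI_API_KEY env var.
import base64
from pathlib import Path

from openai import OpenAI

client = OpenAI()
prompt = "three vertical white lines on a black background"  # one of the study's test phrases
out_dir = Path("generations")
out_dir.mkdir(exist_ok=True)

for i in range(50):  # the study repeated some prompts up to fifty times
    resp = client.images.generate(
        model="dall-e-3",  # hypothetical choice; DALL-E 3 returns one image per call
        prompt=prompt,
        size="1024x1024",
        response_format="b64_json",
    )
    png = base64.b64decode(resp.data[0].b64_json)
    (out_dir / f"run_{i:02d}.png").write_bytes(png)
```

Saving every run to disk this way would let each batch be scored afterward for the placement and object-count criteria the researchers describe.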

When given a simple phrase like “three vertical white lines on a black background,” both models stumbled. DALL·E often drew extra lines or rotated them diagonally, while Midjourney produced painterly textures that changed the nature of the task entirely.
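A miss like that is mechanically checkable. Below is a hedged sketch, using OpenCV's probabilistic Hough transform rather than anything described in the paper, that counts near-vertical bright lines in a saved generation so off-count or rotated outputs can be flagged automatically. The thresholds are guesses tuned for a clean white-on-black image.

```python
# Hypothetical verifier: count near-vertical white lines in an image.
# Thresholds are illustrative assumptions, not the paper's method.
import math

import cv2
import numpy as np

def count_vertical_lines(path: str, angle_tol_deg: float = 10.0) -> int:
    gray = cv2.imread(path, cv2.IMREAD_GRAYSCALE)
    _, binary = cv2.threshold(gray, 200, 255, cv2.THRESH_BINARY)  # keep bright pixels
    segments = cv2.HoughLinesP(
        binary, rho=1, theta=np.pi / 180, threshold=100,
        minLineLength=gray.shape[0] // 2, maxLineGap=10,
    )
    if segments is None:
        return 0
    x_positions = []
    for x1, y1, x2, y2 in segments[:, 0]:
        angle = math.degrees(math.atan2(abs(y2 - y1), abs(x2 - x1)))
        if angle > 90 - angle_tol_deg:  # segment is close to vertical
            x_positions.append((x1 + x2) / 2)
    # Merge nearby segments that belong to the same drawn line.
    x_positions.sort()
    count, last_x = 0, -1e9
    for x in x_positions:
        if x - last_x > 20:  # new line if more than 20 px from the last one
            count += 1
            last_x = x
    return count

print(count_vertical_lines("generations/run_00.png"))  # expect 3 for a compliant image
```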

Prompts involving negation created particular trouble. A request for “a dog without a tail” yielded dogs whose tails were either hidden by the framing or still plainly visible.

Even verbs of action proved difficult. A phrase such as “a woman refusing an apple offered by a man” produced scenes of polite exchanges or uncertain gestures rather than clear acts of refusal.

The study also explored how the two models differ in visual reasoning. Midjourney tends to embellish with surface effects, producing images that resemble photographs or oil paintings. The textures are appealing, but they often override the literal meaning of a prompt. DALL·E, in contrast, delivers cleaner compositions with fewer stylistic flourishes, closer to diagrammatic illustrations. It sometimes drifts in object count or spatial orientation but more often keeps the main structure of the scene intact.

These variations reveal deeper differences in how each system has been trained. Midjourney leans heavily on aesthetic mimicry, seeking balance, color harmony, and emotional tone. DALL·E behaves more like a visual classifier, arranging objects in line with textual cues but offering less expressive depth. Both remain constrained by the data they learned from. Because their datasets are dominated by Western imagery, their results reproduce cultural stereotypes, shaping what the machine “thinks” certain people, settings, or gestures should look like.

The authors describe this behavior as a kind of “machinic perception.” The AI does not truly understand language. It generates what appears most statistically plausible, based on millions of prior image-text pairs. This explains why it often fails to grasp relationships, positions, or negations. Its logic is probability, not comprehension. Each image reflects not a mind interpreting instructions but a pattern finder assembling fragments of learned correlations.

To test consistency further, the researchers compared how each model handled temporal and relational cues. Phrases describing ongoing or completed actions, such as “a person about to start eating” versus “a person who has finished eating,” were rendered by DALL·E with moderate accuracy. Midjourney’s results were more ambiguous, focusing on mood and lighting rather than sequence. The same pattern appeared in prompts involving gaze: DALL·E could sometimes render a subject looking at the viewer, while Midjourney often missed that relational aspect and introduced extra figures or unintended gestures.

Another area of confusion came from spatial reasoning. Instructions involving directionality, like “a mirror on the right side of a room,” led to misplaced or inverted layouts. The systems appear to struggle with concepts that require an awareness of perspective or physical orientation, even when other visual details are precise. These findings, according to the authors, highlight how the current generation of image models remains bound to statistical imitation rather than spatial logic.

Despite these flaws, both tools show clear stylistic personalities. Midjourney’s output often feels cinematic, filled with texture and warmth. DALL·E’s has the restraint of an educational diagram. Each reflects a distinct blend of artistic convention and algorithmic reasoning. Yet neither can be trusted to obey a prompt exactly as written.

The research team argues that evaluating AI imagery cannot rely on numerical accuracy alone. Understanding how these systems interpret words and compose meaning requires insights from the humanities. Semiotic frameworks, long used to study art and language, can explain why certain shapes, colors, and orientations carry meaning, or fail to. In their view, these tools do not merely draw on command. They translate our language through the biases of their training data and the rules embedded by their designers.

The study offers no simple fix but provides a clearer map of how today’s image generators actually function. They have learned to imitate the look of understanding, not the process of it. The images may dazzle with texture, light, and composition, yet beneath the surface lies an automated imitation of human seeing, still clumsy, still literal, and still learning what our words really mean.

Note: This post was edited/created using GenAI tools.
