Large language models, often praised for mimicking human speech, appear to rely more on memory-based comparisons than grammatical logic, according to recent research from Oxford University and the Allen Institute for AI. Rather than extracting symbolic rules, these AI systems seem to reach language decisions through analogy, matching new inputs to known word patterns embedded in training data.
The peer-reviewed findings, published in the Proceedings of the National Academy of Sciences, examined how models like GPT-J handle derivational morphology, the domain in which English words change class through affixes like “-ness” and “-ity.” While prior studies focused on regular sentence structures, this research tested the models with made-up adjectives that do not appear in any training data.
Researchers crafted 200 invented adjectives such as friquish and cormasive, then asked GPT-J to form a noun from each by appending either “-ness” or “-ity.” Its choices weren’t random; they tracked how similar existing words behave. It selected friquishness, for instance, on the strength of parallels with words like selfish, and preferred cormasivity where existing examples like sensitivity dominated its training data.
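The paper’s exact prompting setup isn’t reproduced here, but the core probe can be sketched with the Hugging Face transformers library: score both candidate noun forms with GPT-J and keep the one the model assigns higher probability. The carrier sentence and the crude “drop a final e” stemming below are illustrative assumptions, not the authors’ protocol.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Public GPT-J checkpoint; swap in a smaller model (e.g. "gpt2") to try the sketch cheaply.
MODEL_NAME = "EleutherAI/gpt-j-6B"
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)
model.eval()

def sequence_logprob(text: str) -> float:
    """Total log-probability the model assigns to the token sequence `text`."""
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        loss = model(ids, labels=ids).loss  # mean negative log-likelihood per predicted token
    return -loss.item() * (ids.shape[1] - 1)

def candidate_nouns(adjective: str) -> list[str]:
    # Simplified stemming: drops a final "e" so cormasive -> cormasivity.
    stem = adjective[:-1] if adjective.endswith("e") else adjective
    return [adjective + "ness", stem + "ity"]

def preferred_nominalization(adjective: str) -> str:
    """Pick whichever candidate noun the model scores higher in a carrier sentence."""
    prompt = "The quality of being {adj} is called {noun}."  # illustrative wording, not the paper's
    scores = {noun: sequence_logprob(prompt.format(adj=adjective, noun=noun))
              for noun in candidate_nouns(adjective)}
    return max(scores, key=scores.get)

print(preferred_nominalization("friquish"))   # parallels with words like selfish favour -ness
print(preferred_nominalization("cormasive"))  # parallels with words like sensitive favour -ity
```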
These outputs were compared directly against two established cognitive models: one that generalizes through abstract rules, and another that reasons by analogy over stored exemplars. For adjectives with consistent patterns, all systems performed similarly. But when confronted with irregular forms, those that vary between taking “-ness” or “-ity,” GPT-J aligned more closely with the analogy-based model, especially when word token frequency was factored in.
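The study’s baselines were formal cognitive models; the toy classifiers below are deliberate caricatures meant only to show how the two styles of prediction differ, with a tiny hand-picked lexicon standing in for the study’s real data.

```python
from difflib import SequenceMatcher

# Tiny hand-picked lexicon of adjectives and the suffix their usual noun form takes.
# (The study used large corpus-derived datasets and formal cognitive models, not this.)
EXEMPLARS = {
    "selfish": "ness", "foolish": "ness", "squeamish": "ness",
    "sensitive": "ity", "active": "ity", "productive": "ity",
    "agreeable": "ness", "capable": "ity",
}

def rule_model(adjective: str) -> str:
    """Caricature of a rule-based account: the ending alone determines the suffix."""
    return "ity" if adjective.endswith("ive") else "ness"

def analogy_model(adjective: str) -> str:
    """Caricature of an exemplar account: stored words vote, weighted by string similarity."""
    votes = {"ness": 0.0, "ity": 0.0}
    for word, suffix in EXEMPLARS.items():
        votes[suffix] += SequenceMatcher(None, adjective, word).ratio()
    return max(votes, key=votes.get)

for nonce in ("friquish", "cormasive"):
    print(nonce, "| rule:", rule_model(nonce), "| analogy:", analogy_model(nonce))
```

On regular endings like these, the two caricatures agree, which mirrors the article’s point that the approaches only pull apart on variable, irregular patterns.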
In a deeper test, the study expanded the dataset to 48,995 real adjectives drawn from a public corpus. GPT-J’s noun form predictions closely mirrored the frequency patterns found in this training set, achieving nearly 90% accuracy in matching preferred suffixes. Its predictions weren’t guided by grammatical classification alone but leaned heavily on how often specific word forms had occurred during training.
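Neither the corpus nor the evaluation pipeline is reproduced here, but the kind of agreement check described can be sketched as follows, assuming a plain-text corpus file and a list of adjectives; the tokenization and stemming shortcuts are illustrative simplifications.

```python
import re
from collections import Counter

def corpus_suffix_preferences(corpus_path: str, adjectives: list[str]) -> dict[str, str]:
    """For each adjective, record whether its -ness or -ity noun occurs more often in the corpus.
    The path, tokenization, and stemming rule here are illustrative simplifications."""
    with open(corpus_path, encoding="utf-8") as f:
        tokens = Counter(re.findall(r"[a-z]+", f.read().lower()))
    preferences = {}
    for adj in adjectives:
        stem = adj[:-1] if adj.endswith("e") else adj  # handles -ive -> -ivity, not -able -> -ability
        ness, ity = tokens[adj + "ness"], tokens[stem + "ity"]
        if ness or ity:
            preferences[adj] = "ness" if ness >= ity else "ity"
    return preferences

def agreement(model_choices: dict[str, str], corpus_choices: dict[str, str]) -> float:
    """Share of adjectives where the model's preferred suffix matches the corpus-preferred one."""
    shared = [a for a in model_choices if a in corpus_choices]
    return sum(model_choices[a] == corpus_choices[a] for a in shared) / len(shared) if shared else 0.0
```

Feeding GPT-J’s choices from the earlier probe into agreement() is the shape of comparison in which the study reports a match rate of nearly 90%.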
Even for words with highly regular endings like “-ish” or “-able,” where rule-based behavior might be expected, GPT-J’s output still showed sensitivity to how frequently related forms appeared. This undermines the theory that language models apply hard-coded grammatical templates. Instead, the model appeared to act like a search engine, retrieving the most familiar-sounding variant from memory based on surface resemblance and token frequency.
This behavior points to a structural gap between human linguistic intuition and AI-based text generation. While language models produce outputs that may sound fluent, they build generalizations by scanning massive volumes of examples rather than abstracting concepts or forming general rules as people do. Their heavy dependence on how frequently words appear also suggests they memorize far more than they actually understand.
Even larger models like GPT-4 showed weaker alignment with human choices than simpler analogy-based cognitive systems. In tests involving less predictable word patterns, GPT-4 performed worse than GPT-J, showing greater bias toward high-frequency forms and less flexibility in applying analogy when variability was high.
Researchers traced this effect further by analyzing how each adjective’s training frequency influenced model confidence. GPT-J consistently leaned harder on high-frequency forms, revealing a lack of abstraction—contrary to how humans can generalize confidently even from sparse examples, as long as word structure or meaning is clear.
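The article doesn’t detail how that analysis was run; one plausible way to frame it is to correlate each adjective’s corpus frequency imbalance with how strongly the model prefers one form. In the sketch below, the counts and scores dictionaries are assumed inputs, not artifacts of the study, and scipy supplies the rank correlation.

```python
from math import log
from scipy.stats import spearmanr

def frequency_bias(counts: dict[str, tuple[int, int]],
                   scores: dict[str, tuple[float, float]]) -> float:
    """Spearman correlation between corpus log-frequency ratios and the model's log-odds.
    counts[adj] = (ness_count, ity_count) from the corpus; scores[adj] = (logprob_ness,
    logprob_ity) from the model. Both inputs are assumptions of this sketch."""
    shared = [a for a in counts if a in scores]
    freq_ratio = [log(counts[a][0] + 1) - log(counts[a][1] + 1) for a in shared]
    model_odds = [scores[a][0] - scores[a][1] for a in shared]
    rho, _ = spearmanr(freq_ratio, model_odds)
    return rho
```

A strongly positive correlation would reflect the frequency-driven behavior described above, whereas a type-based generalizer of the kind attributed to humans would be expected to show a weaker dependence.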
These outcomes challenge a long-standing belief that LLMs internalize generalized grammar rules. The findings instead suggest that models, even advanced ones, operate primarily through memorized associations—effectively conducting a massive similarity search rather than any symbolic computation.
The work also raises implications beyond linguistics. Since AI systems often support applications like writing assistants, educational tools, and dialogue engines, understanding the basis of their language decisions becomes essential for ensuring reliability and transparency. If AI draws on memory traces rather than structured rules, then explainability must adapt accordingly.
While the researchers focused narrowly on adjective nominalization, their methods could extend to broader areas, including syntactic variability and compound formation. The study reinforces the need for grounded evaluation methods that probe the how behind AI outputs—not just the what.
In sum, as language models grow more fluent, their resemblance to human reasoning remains partial at best. They may sound articulate, but beneath the surface lies a patchwork of remembered patterns, shaped by data volume—not human-like abstraction or understanding.
Image: DIW-Aigen
Read next: Advanced Reasoning AI May Face Limits by 2026 Warns Epoch Report