The Survey That Surveys Without Seeing: A Foundational Critique of Song et al. (2026)

February 16, 2026

Author’s Note: The Birth of CRUCIBEL

This critique was originally drafted for the inaugural issue of PHOSPHOROUS Journal. That publication no longer exists. On January 25, 2026, during a routine branding request, a commercial AI system generated a logo graphic containing an antisemitic slur and a targeted genocidal death threat. The subsequent refusal by the manufacturer’s counsel to provide a mechanistic explanation—dismissing the event as “weird”—rendered the previous brand untenable. CRUCIBEL is built on the reality of the forge. We do not merely observe the light. We interrogate the heat, as demonstrated in this critique.

Abstract

Song, Han, and Goodman (2026) present what they call “the first comprehensive survey dedicated to reasoning failures in LLMs.” Published in Transactions on Machine Learning Research, the paper catalogs over 400 works, organizes them into a two-axis taxonomy, and claims to unify fragmented research. This response argues the unification is illusory. The survey commits a foundational category error by applying cognitive science frameworks to systems whose relationship to cognition remains unresolved. Its taxonomy classifies without clarifying. Its root cause analyses collapse into tautology. Its mitigation strategies ignore their own adversarial interactions. Most critically, by declining to address whether LLMs reason at all, the paper builds analytical architecture on an unexamined foundation—and in doing so, inadvertently exemplifies the pattern-matching it documents: labels mistaken for understanding, filing systems mistaken for insight.


“The first principle is that you must not fool yourself—and you are the easiest person to fool.” —Richard Feynman

The Survey That Surveys Without Seeing

There is a kind of academic paper that achieves comprehensiveness at the expense of comprehension. It gathers everything, organizes it neatly, and in the act of organizing mistakes the filing system for understanding.

Song, Han, and Goodman’s “Large Language Model Reasoning Failures” is such a paper. Published in January 2026, it represents a genuinely impressive aggregation—over 400 citations spanning cognitive science, formal logic, robotics, and multi-agent systems. The authors propose a two-axis taxonomy classifying LLM reasoning into embodied and non-embodied types, cross-referenced against three failure categories: fundamental, application-specific, and robustness-related. They provide definitions, analyze studies, explore root causes, suggest mitigations.

The problem is not what the paper contains. The problem is what it assumes, what it avoids, and what it cannot see precisely because it has committed so completely to its own flawed framework.

What follows identifies structural failures that undermine the survey’s core claims. These are not quibbles about citation gaps or minor taxonomic disagreements. They are foundational problems that, taken together, render the paper’s central contribution—its promise of “a structured perspective on systemic weaknesses in LLM reasoning”—substantially weaker than advertised. I write this as a practitioner who operates at the intersection of defense analysis, scientific research, and the study of AI systems.

The Category Error at the Foundation

The paper’s most consequential decision is also its least examined: calling what LLMs do “reasoning” and what they fail to do “reasoning failures.”

The authors know this is contested. In their second paragraph, they note it “remains controversial whether LLMs really leverage a human-like reasoning procedure.” Then comes the pivot: “This survey does not aim to settle this hot debate; rather we focus on an important area of study in LLM reasoning that has long been overlooked.”

That is not intellectual modesty. That is a load-bearing assumption disguised as a scope limitation.

If LLMs do not reason—if what they do is better described as sophisticated statistical pattern completion, as Bender and Koller (2020), Marcus (2020), and Fedorenko et al. (2024) have argued—then the entire framework of “reasoning failures” is a category error. You cannot fail at something you were never doing. A thermostat maintains temperature. When it malfunctions, we don’t call that a “thermal reasoning failure.” We don’t say it has “working memory limitations.” We describe the mechanical failure in terms appropriate to the system’s actual architecture.

Song et al. do the opposite. They take the full apparatus of human cognitive psychology—working memory, inhibitory control, cognitive flexibility, Theory of Mind, moral reasoning—and map it wholesale onto LLM performance. This mapping is not argued for. It is assumed. And it carries an enormous hidden cost: it predisposes every subsequent analysis toward explanations that anthropomorphize the system, making genuine mechanistic understanding harder to achieve.

Where the paper says “LLMs struggle with working memory,” the honest formulation would be: “LLM performance degrades when tasks require maintaining and manipulating information across extended contexts, in ways that superficially resemble human working memory limitations but may arise from entirely different mechanisms.” More cumbersome. Also more true.

Borrowed Frameworks, Broken Joints

Two structural problems compound the category error: the misappropriation of embodied cognition, and a taxonomy that files phenomena without explaining them.

The embodied reasoning problem. One-third of the survey’s taxonomy—the entire “Embodied” axis—rests on a philosophical misappropriation. Embodied cognition, as articulated by Shapiro (2019), Barsalou (2008), and Varela et al. (2017), holds that reasoning is constitutively shaped by the body’s interactions with physical reality. It is not merely reasoning about physical things. It is reasoning that emerges from having a body that moves through, manipulates, and is constrained by the physical world.

LLMs have no bodies. Vision-Language Models processing images of physical scenes have no bodies. Robotic systems driven by LLM-generated plans have LLM components that have no bodies—the robot has a body; the language model does not. What the authors actually document is something real: LLMs perform poorly on tasks requiring physical commonsense, spatial reasoning and dynamic prediction. But this is not failed embodied reasoning. It is the predictable limitation of disembodied systems attempting to compensate for their lack of embodiment through text and image processing alone.

The distinction matters because it points toward different solutions. If the problem is failed reasoning, you improve the reasoning. If the problem is absent embodiment, you provide physical grounding. Entirely different research direction. The authors’ own evidence supports the latter: they note that “LLMs learn passively from text alone, lacking grounding and experiential feedback” and acknowledge the “absence of a robust internal world model.” These are not descriptions of failed embodied reasoning. They are descriptions of systems that were never embodied.

The taxonomy problem. A useful taxonomy carves nature at its joints, enables prediction, and guides intervention. This one does none of those things.

The boundary between “fundamental” and “application-specific” failures is never operationalized. The reversal curse is labeled fundamental; Theory of Mind failures are application-specific. But the paper attributes both to the same root causes—autoregressive training and architectural limitations. When two failures share identical origins, what principle assigns them to different categories? The paper never says. The “robustness” category fares worse: the authors themselves note that virtually every failure type manifests robustness issues. When a category applies to everything, it distinguishes nothing.

More damaging: the taxonomy offers no predictive power and no guidance for intervention. A useful classification of structural engineering failures lets you examine a new bridge and identify likely failure points. This taxonomy lets you examine a known failure and assign it a label. The same mitigations—Chain-of-Thought prompting, fine-tuning, retrieval augmentation, external tools—appear across all categories with only minor variations. The grid tells you where a failure sits. It does not tell you what to do about it.

The Tautology Engine

If a physician diagnoses every illness as “your body isn’t working properly,” the diagnosis is technically accurate and practically useless. Song et al.’s root cause analyses converge on three explanations with the regularity of a heartbeat: autoregressive training objectives, training data biases, and architectural limitations. These three causes are invoked to explain counting failures, moral reasoning inconsistencies, the reversal curse, cognitive biases, compositional breakdowns, Theory of Mind deficits, physical commonsense errors, spatial reasoning failures, multi-agent coordination problems, and arithmetic mistakes.

When the same three causes explain everything, you do not have a root cause analysis. You have a tautology: LLMs fail because of the things that make them LLMs. A genuinely useful analysis would specify which aspects of the architecture produce which specific failures, and would predict which modifications resolve which failure modes without introducing new ones. The paper gestures toward this in places—Li et al. (2024f) identifying faulty implicit reasoning in mid-layer self-attention modules, for instance—but these are exceptions buried in a literature review, not the analytical backbone.

The tautology becomes dangerous when paired with the paper’s treatment of mitigations. The survey catalogs fixes as though they are additive—apply Fix A to Problem A and Fix B to Problem B, and you get a system with neither problem. In practice, mitigations frequently conflict.

Chain-of-Thought prompting illustrates this precisely. CoT can improve compositional reasoning by making intermediate steps explicit. But as Wan et al. (2025) demonstrate—a paper the authors cite—CoT also amplifies confirmation bias by encouraging models to construct elaborate justifications for initial answers, right or wrong. The model does not just reason through the problem. It reasons itself into a corner. Fine-tuning on moral reasoning benchmarks improves consistency on those benchmarks while degrading performance on structurally similar tasks framed differently—the very framing effect the paper documents. RLHF alignment can reduce harmful outputs while amplifying sycophancy, where the model tells users what they want to hear rather than what is accurate.
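
To make the conflict concrete, here is a minimal sketch of how one might probe that anchoring effect directly. The `query_model` helper is a placeholder for whatever completion API is available; it is an assumption of this sketch, not an interface from the survey or from Wan et al. (2025).

```python
# Minimal sketch of a confirmation-bias probe around Chain-of-Thought prompting.
# `query_model` is a stand-in for a real model call and must be wired up first.

def query_model(prompt: str) -> str:
    """Placeholder: send `prompt` to an LLM and return its text response."""
    raise NotImplementedError("connect this to a model API before running the probe")


def cot_anchoring_probe(question: str) -> dict:
    # Step 1: elicit a quick, unelaborated answer.
    initial = query_model(f"{question}\nAnswer in one short phrase.")

    # Step 2: ask for step-by-step reasoning *after* the model has committed.
    # If CoT amplifies confirmation bias, the elaborated answer will tend to
    # rationalize `initial` rather than revise it, even when `initial` is wrong.
    elaborated = query_model(
        f"{question}\n"
        f"Your initial answer was: {initial}\n"
        "Think step by step, then state your final answer."
    )

    # Step 3 (not shown): score both answers against ground truth and compare
    # revision rates against a control run that omits the anchoring line above.
    return {"initial": initial, "elaborated": elaborated}
```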

A responsible survey would map these interactions. Which mitigations are compatible? Which are adversarial? Under what conditions does fixing one failure mode create another? Without this, the mitigation sections function as a restaurant menu that looks helpful until you try to order everything simultaneously.

The Black Box is Leaking Poison: Empirical Evidence

The survey treats failure as a taxonomic exercise. In the real world, failure is catastrophic. On January 25, 2026, a benign request for elegant typography for a scholarly journal was submitted to Midjourney. The machine responded by generating a legible, targeted command for mass murder: “DIE JEW S” (Job ID: 25cf65a9-ebd9-4a42-ad60-2e9c71610eb3).

The response from Midjourney General Counsel Max Sills represents the most dangerous sentence in Silicon Valley: “That’s it… AI models are weird.” This incident forced the immediate destruction of the PHOSPHOROUS brand. The project has been rebuilt as CRUCIBEL—forged in the fire of this confrontation. If an AI can “accidentally” call for genocide in a logo, it may accidentally target a hospital in a war zone. This is not a “reasoning failure.” This is a structural collapse of a black box we do not understand, let alone control.

What the Paper Cannot See

Two blind spots compromise the survey’s value as an empirical document: the absence of base rates, and the misuse of cognitive science analogy.

The paper draws almost exclusively from adversarial benchmarks, failure-focused studies, and deliberately constructed edge cases. This is appropriate for a failure survey. But the authors never acknowledge the distortion this creates. How often do these failures occur in real-world deployment? What percentage of outputs contain the documented errors? Are failure rates improving across model generations, and at what rate?

Without this context, the survey resembles an aviation safety report that catalogs every crash without mentioning how many flights landed safely. Every crash really happened. The picture is still misleading. This general argument matters because the paper was published in January 2026 and draws heavily on studies of GPT-3.5, GPT-4, and early GPT-4o. The reasoning landscape has shifted. Models like o1, o3, DeepSeek-R1, and Claude’s extended thinking have substantially changed the territory. Some documented failures—basic arithmetic, simple counting, standard Theory-of-Mind tasks—may be substantially mitigated or resolved in current systems. A survey that cannot distinguish between persistent architectural limitations and transient developmental gaps confuses the growing pains of a technology with its inherent boundaries.

The cognitive science problem runs deeper. The paper’s recurring method is to find an LLM performance failure, locate a human cognitive phenomenon that produces superficially similar errors, and import the cognitive science terminology wholesale. This is done systematically and without justification.

Human confirmation bias arises from motivation, emotional investment, and cognitive resource limitations. LLM “confirmation bias” arises from token probability distributions shaped by training data. The outputs may look similar. The mechanisms share nothing. Human working memory limitations emerge from the finite capacity of neurobiological structures with metabolic constraints. LLM “working memory” limitations emerge from context window sizes, attention dispersal, and positional encoding decay. Same surface, entirely different substrate.

The cognitive framework also carries implicit assumptions about intervention. Human biases respond to metacognitive training, deliberate reasoning, and environmental design—interventions that make sense because they target actual mechanisms. Applying the same labels to LLMs implicitly suggests the same solutions. The paper does exactly this, repeatedly recommending “deliberate reasoning” via Chain-of-Thought, drawing an explicit analogy to Kahneman’s System 2. But LLMs do not have System 1 or System 2. They have one system that can be prompted to produce more tokens before answering. The metaphor obscures rather than illuminates.

A Mirror the Authors Didn’t Intend

There is an irony here worth stating plainly: the paper suffers from several of the reasoning failures it documents.

The core method is pattern-matching over genuine analysis. Match each failure to a taxonomic slot, and you produce the appearance of systematic understanding—every failure has a category, a root cause discussion, a mitigation section. But the categories are imposed on the phenomena, not derived from them. The framework finds what it was built to find. Having committed early to the two-axis structure, the authors interpret all subsequent findings through it, even when the fit is poor. The embodied/non-embodied distinction survives despite the incoherence described above. The fundamental/application-specific/robustness trichotomy survives despite the boundary-crossing. This is anchoring—commitment to an initial frame that resists disconfirming evidence.

The paper also fails at composition. Individual sections are competently executed. Each failure type is clearly described, relevant literature cited, local analyses reasonable. But these pieces never compose into higher-order understanding. The conclusion’s “suggestions for future directions” are generic precisely because the framework prevents the generation of specific, non-obvious insights from the interaction of its components. And the choice to frame these phenomena as “reasoning failures” rather than “performance limitations” or “architectural constraints” is not neutral—it imports assumptions that shape every analysis, every root cause, every proposed intervention. A different frame would generate different science.

Toward Something That Actually Works

Criticism without construction is incomplete. Here is what a genuinely explanatory framework would require.

First, a mechanism-first taxonomy. Classify failures by the specific architectural and training mechanisms that produce them, not by analogy to human cognition. Categories might include attention pattern failures, tokenization artifacts, training distribution biases, and autoregressive generation artifacts. These are less intuitive than “cognitive bias” or “working memory.” They are also actionable in ways the borrowed terminology never will be.
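
For illustration only, here is a sketch of what a mechanism-first record could look like. The `Mechanism` categories, field names, and the example entry are assumptions made for this critique, not a proposed standard; the point is that the index keys are architectural hypotheses rather than cognitive analogies.

```python
from dataclasses import dataclass, field
from enum import Enum, auto


class Mechanism(Enum):
    # Illustrative mechanism categories; a real taxonomy would have to justify
    # each of these empirically, e.g. through interpretability studies.
    ATTENTION_PATTERN = auto()        # e.g. attention dispersal over long contexts
    TOKENIZATION_ARTIFACT = auto()    # e.g. digit splitting degrading arithmetic
    TRAINING_DISTRIBUTION = auto()    # e.g. facts stated A-to-B but rarely B-to-A
    AUTOREGRESSIVE_DECODING = auto()  # e.g. early-token commitment effects


@dataclass
class FailureRecord:
    name: str                                  # the failure, described behaviorally
    mechanisms: list[Mechanism]                # hypothesized producing mechanisms
    evidence: list[str] = field(default_factory=list)        # probes or citations
    predicted_interventions: list[str] = field(default_factory=list)


# Example entry: the reversal curse, filed by hypothesized mechanism rather than
# by analogy to human memory. The intervention listed is a hypothesis, not a result.
reversal_curse = FailureRecord(
    name="reversal curse",
    mechanisms=[Mechanism.TRAINING_DISTRIBUTION, Mechanism.AUTOREGRESSIVE_DECODING],
    evidence=["reversal-curse benchmarks cataloged in the survey"],
    predicted_interventions=["reversed-order or bidirectional data augmentation"],
)
```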

Second, interaction mapping. Every mitigation should come with an analysis of its effects on other failure modes. Not a list of fixes, but a compatibility matrix—a tool practitioners can use when designing systems where correctness matters.
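
A sketch of that matrix follows. The cell values are placeholders standing in for controlled comparisons the survey does not provide; the one grounded entry is the Chain-of-Thought/confirmation-bias conflict attributed above to Wan et al. (2025).

```python
# Illustrative mitigation-interaction matrix. Cell values are placeholders, not
# empirical results, except where noted in the rationale strings.

HELPS, NEUTRAL, HURTS, UNKNOWN = "helps", "neutral", "hurts", "unknown"

# interaction[mitigation][failure_mode] -> (expected effect, rationale)
interaction = {
    "chain_of_thought": {
        "compositional_reasoning": (HELPS, "makes intermediate steps explicit"),
        "confirmation_bias": (HURTS, "rationalizes initial answers (Wan et al. 2025)"),
        "framing_sensitivity": (UNKNOWN, "no controlled comparison located"),
    },
    "fine_tuning_on_benchmark": {
        "compositional_reasoning": (UNKNOWN, ""),
        "confirmation_bias": (NEUTRAL, ""),
        "framing_sensitivity": (HURTS, "overfits to one framing of the task"),
    },
    "retrieval_augmentation": {
        "compositional_reasoning": (NEUTRAL, ""),
        "confirmation_bias": (UNKNOWN, ""),
        "framing_sensitivity": (UNKNOWN, ""),
    },
}


def conflicts(matrix: dict) -> list[tuple[str, str]]:
    """List (mitigation, failure_mode) pairs where a fix is expected to backfire."""
    return [
        (mitigation, failure)
        for mitigation, row in matrix.items()
        for failure, (effect, _rationale) in row.items()
        if effect == HURTS
    ]
```

Even this toy version makes the adversarial pairs visible at a glance, which is exactly what the survey’s per-category mitigation lists cannot do.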

Third, base rate context. Every failure mode reported with prevalence in representative deployment scenarios, severity distribution, and trajectory across model generations. Without this, a survey of failures is a collection of anecdotes wearing the uniform of empirical assessment.
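
A minimal reporting unit for that context might look like the following sketch. The scenario, sample size, and numbers are placeholders, included only to show what a complete record would contain.

```python
from dataclasses import dataclass


@dataclass
class BaseRateReport:
    failure_mode: str
    deployment_scenario: str  # where prevalence was measured
    sample_size: int          # number of outputs audited
    prevalence: float         # fraction of audited outputs exhibiting the failure
    severity: dict            # distribution over consequence classes
    trajectory: dict          # prevalence by model generation


# Placeholder example, not a measurement.
example = BaseRateReport(
    failure_mode="arithmetic slip in multi-step word problems",
    deployment_scenario="tutoring assistant (hypothetical)",
    sample_size=10_000,
    prevalence=0.012,
    severity={"cosmetic": 0.60, "misleading": 0.35, "harmful": 0.05},
    trajectory={"previous generation": 0.031, "current generation": 0.012},
)
```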

Fourth, honest epistemology. The framework should mark the boundary between what we know and what we speculate. We know that LLMs produce incorrect outputs on certain task types with measurable frequency. We hypothesize that these errors arise from specific architectural features. We speculate that they reflect something meaningfully analogous to human cognitive failures. Current literature routinely presents that speculation as established fact. It is not fact, and the habit needs correcting.

Finally, the hard question. Any serious framework must eventually confront what this paper explicitly avoids: are we studying reasoning failures, or performance limitations in a system that does something other than reasoning? The answer reshapes everything downstream. Declining to address it is not a scope limitation. It is an abdication. Science does not work on abdications. It does not advance through the polite avoidance of difficult truths or by dressing a black box in the borrowed robes of cognitive science. To refuse to define the nature of the system is to forfeit the right to explain its failures.

Slaying the Paper Dragon

Song, Han, and Goodman were right that the field needs structured analysis of LLM limitations rather than scattered anecdotes. The bibliography they assembled is a genuine service. Their instinct that learning from failures can advance the technology is sound.

But the execution fails at the level of foundations. By assuming what should be argued, by borrowing what should be earned, by classifying what should be explained, and by avoiding what should be confronted, the paper produces a catalog that catalogs without comprehending what it catalogs.

As LLMs become more deeply integrated into consequential decisions—military analysis and tactical combat actions, medical diagnosis and robotic surgery, legal reasoning and presentation of cases, scientific research and publication of results—our understanding of their limitations must be mechanistic, not metaphorical. Predictive, not retrospective. Honest about uncertainty rather than dressed in the borrowed authority of cognitive science.

The Malcolm Forbes quote that opens the Song et al. paper—“Failure is success if we learn from it”—is only true if the observer has the courage to see the failure for what it is: a structural collapse of a black box we do not control. This is not a quibble over categories. It is a demand for an honest epistemology. The dragon of AI “reasoning” is a paper tiger, and it is time we stopped mistaking the rustle of its pages for the breath of a soul.

Seeing clearly requires, first, that we not mistake the map for the territory, the label for the phenomenon, or the survey for the understanding. The forge is open. The fire rings true.