What LLMs Can't Do (And Why That Matters)

Contributor
Oct 27, 2025

Every new technology gets oversold. LLMs are no exception. The marketing pitches them as reasoning engines, knowledge bases, and general-purpose problem solvers. They are none of these things. They are statistical pattern completion systems that produce remarkably useful text — and that distinction isn't pedantic. It determines whether the systems you build with them will work or fail.

This post is about the failure modes. Not because LLMs aren't useful — they are — but because understanding limitations is how you make good decisions about where and how to deploy them.

They Don't Reason. They Simulate Reasoning.

When an LLM walks through a logical argument step by step, it looks like reasoning. It follows the form of reasoning. But what's actually happening is pattern completion. The model learned that when text looks like a problem statement, certain patterns of analysis tend to follow. It reproduces those patterns.

This works surprisingly well for common problem types that appeared frequently in training data. It falls apart on novel problems, edge cases, and any logical deduction for which the model has seen no close analog.

The practical consequence: LLMs can get multi-step reasoning wrong in ways that look right. The structure of the argument is coherent. The conclusion is wrong. And unlike a human who can stop and check a suspicious step, the model has no built-in mechanism that reliably catches its own mistakes. It doesn't evaluate whether its output is true. It evaluates whether its output is probable.

If your system depends on reliable reasoning — legal analysis, medical diagnosis, financial modeling — the LLM cannot be the final authority. It can draft. It can suggest. It cannot decide.

They Can't Verify Facts

Hallucination isn't a bug. It's what the architecture does.

An LLM generates text by predicting the most likely next token given everything that came before. If the most statistically plausible continuation is factually wrong, the model produces it with the same confidence as a correct statement. There is no internal fact-checking mechanism. There is no database it queries to confirm claims. There is no uncertainty signal that reliably fires when the model is making things up.
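A toy sketch of that selection step may help. The candidate tokens and their scores below are invented for illustration and do not come from any real model; the point is only that the choice is made by ranking scores, with no step that asks whether the winner is true.

```python
import math

def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    peak = max(logits.values())  # subtract the max for numerical stability
    exps = {tok: math.exp(score - peak) for tok, score in logits.items()}
    total = sum(exps.values())
    return {tok: value / total for tok, value in exps.items()}

# Hypothetical scores for the token after "The capital of Australia is".
# The values are made up: they stand in for a model that saw "Sydney"
# near "Australia" more often than "Canberra".
toy_logits = {"Sydney": 2.0, "Canberra": 1.5, "Melbourne": 0.5}

probs = softmax(toy_logits)
choice = max(probs, key=probs.get)  # greedy decoding: take the most probable token
print(choice, round(probs[choice], 2))
```

With these invented scores the wrong answer wins with a probability of roughly 0.55. Nothing in the code distinguishes it from a correct answer that happened to score the same.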

This is fundamental to the architecture. You cannot fix hallucination through better training alone. You can reduce it. You cannot eliminate it. Any system that presents LLM output as verified fact without external validation is broken by design.

The models will cite papers that don't exist, invent statistics, attribute quotes to people who never said them, and describe events that never happened — all in fluent, confident prose. The fluency is the danger. Bad information that reads well is worse than bad information that reads poorly, because it's more likely to be believed.
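One form external validation can take is checking every cited identifier against a source you trust before the text reaches a reader. The sketch below is a minimal illustration under stated assumptions: `KNOWN_SOURCES` is a hypothetical allowlist standing in for a real bibliographic lookup, the DOI-style identifiers are placeholders, and the pattern is deliberately simple rather than a complete DOI parser.

```python
import re

# Hypothetical allowlist. In a real system this would be a lookup against
# a bibliographic database or your own document store.
KNOWN_SOURCES = {"doi:10.1000/example.001", "doi:10.1000/example.002"}

# Simplified pattern for "doi:10.xxxx/suffix"; not a full DOI grammar.
DOI_PATTERN = re.compile(r"doi:10\.\d{4,9}/[^\s,;)]+")

def unverified_citations(text, known=KNOWN_SOURCES):
    """Return cited identifiers that are not in the trusted set."""
    cited = DOI_PATTERN.findall(text)
    return [c for c in cited if c not in known]

draft = (
    "Prior work supports this (doi:10.1000/example.001), "
    "and a later study confirms it (doi:10.1000/made-up.999)."
)
flagged = unverified_citations(draft)
print(flagged)  # identifiers that need a human or a database to confirm
```

A check like this only confirms that a cited source exists. Whether the source actually supports the claim is a separate question that still needs review.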

They Can't Learn From Your Conversation

When you have a long conversation with an LLM, it feels like the model is learning about you. It references things you said earlier. It adapts to your preferences. But this is context window processing, not learning. The moment that conversation ends, everything is gone. The next conversation starts from zero.

The model's weights, where its learned knowledge is encoded, don't change from your interactions. Fine-tuning (retraining the model on new data) is the way to change them. RAG (retrieval-augmented generation) takes a different route: it feeds the model relevant documents at query time rather than changing the model. Without one of these, an LLM has no mechanism to carry new information beyond the current conversation.

This matters for anyone building applications. If you need the system to remember user preferences, accumulate knowledge over time, or improve based on feedback, the LLM alone cannot do this. You need infrastructure around it — databases, retrieval systems, memory layers. The LLM is the generation engine, not the knowledge store.
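A minimal sketch of what such a memory layer might look like, assuming a small JSON file stands in for a real database. `PreferenceStore` and `build_prompt` are hypothetical names invented for this example, and the model call itself is left out. The idea is that anything the system should "remember" is written to storage you control and placed back into the prompt on the next conversation.

```python
import json
import os
import tempfile

class PreferenceStore:
    """Tiny file-backed store; a stand-in for a real database."""

    def __init__(self, path):
        self.path = path

    def load(self):
        if not os.path.exists(self.path):
            return {}
        with open(self.path, encoding="utf-8") as f:
            return json.load(f)

    def remember(self, key, value):
        data = self.load()
        data[key] = value
        with open(self.path, "w", encoding="utf-8") as f:
            json.dump(data, f)

def build_prompt(store, user_message):
    """Prepend stored preferences, since the model keeps nothing between calls."""
    prefs = store.load()
    if prefs:
        lines = "\n".join(f"- {k}: {v}" for k, v in sorted(prefs.items()))
        header = "Known user preferences:\n" + lines
    else:
        header = "No stored preferences."
    return f"{header}\n\nUser: {user_message}"

store = PreferenceStore(os.path.join(tempfile.mkdtemp(), "prefs.json"))
store.remember("units", "metric")                       # written during conversation 1
prompt = build_prompt(store, "How far is a marathon?")  # used in conversation 2
print(prompt)
```

The persistence lives entirely in the store. The model only ever sees whatever the prompt builder chooses to include.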

They Can't Do Math

LLMs handle arithmetic through pattern matching, not calculation. They've seen enough math in training data to get simple problems right most of the time. But "most of the time" isn't good enough when the answer matters.

Ask an LLM to multiply two four-digit numbers and you'll get confident answers that are frequently wrong. Ask it to perform multi-step calculations and errors compound. It's not computing. It's predicting what the answer probably looks like based on patterns in text.
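The compounding can be made concrete with a back-of-the-envelope calculation. This sketch assumes each step is correct independently with the same probability, and the 95% figure is an illustrative assumption, not a measured accuracy for any model.

```python
def chain_success_probability(per_step_accuracy, steps):
    """Probability that every step is right, assuming independent steps."""
    return per_step_accuracy ** steps

# Illustrative only: 0.95 is an assumed per-step accuracy.
for steps in (1, 5, 10, 20):
    print(steps, round(chain_success_probability(0.95, steps), 3))
```

Under those assumptions the chance of a fully correct chain is about 0.774 for 5 steps, 0.599 for 10, and 0.358 for 20. Real errors are not independent, so treat this as intuition rather than a prediction.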

Tool use changes this significantly — models can now call calculators, run code, and use external computation. But the base model, on its own, is not a reliable calculator. If your workflow involves numerical accuracy, the LLM needs to be connected to actual computation tools, not trusted to do the math itself.
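As a rough illustration of what "connected to actual computation" can mean, here is a small calculator function a system could expose as a tool. It is a sketch, not any vendor's tool-calling API: the function name `calculate` and the set of supported operators are choices made for this example. It parses the expression with Python's `ast` module and computes the result instead of predicting it.

```python
import ast
import operator

# Only these operators are allowed; anything else is rejected.
_BINARY = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
}

def _is_number(node):
    return (isinstance(node, ast.Constant)
            and isinstance(node.value, (int, float))
            and not isinstance(node.value, bool))

def calculate(expression):
    """Evaluate a plain arithmetic expression without using eval()."""
    def walk(node):
        if _is_number(node):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in _BINARY:
            return _BINARY[type(node.op)](walk(node.left), walk(node.right))
        if isinstance(node, ast.UnaryOp) and isinstance(node.op, ast.USub):
            return -walk(node.operand)
        raise ValueError("unsupported expression")
    return walk(ast.parse(expression, mode="eval").body)

result = calculate("4821 * 7359")
print(result)
```

The model's job shrinks to producing the expression. The number itself comes from arithmetic, and anything that is not plain arithmetic is refused rather than guessed at.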

They Don't Understand

This is the claim that generates the most debate, and it matters most for safety-critical applications.

LLMs process tokens and produce tokens. They have learned extraordinarily sophisticated statistical relationships between those tokens. Whether this constitutes "understanding" is a philosophy question. But the practical implications are clear: the model has no world model, no causal reasoning, no sense of physical reality, and no way to verify its outputs against the real world.

A model can describe how a bridge bears load without understanding physics. It can explain drug interactions without understanding biochemistry. It produces text that matches the patterns of expert knowledge. This is useful for generating first drafts, suggesting possibilities, and accelerating research. It is dangerous when the distinction between "sounds like expertise" and "is expertise" matters — which is any situation where being wrong has consequences.

They Can't Access Real-Time Information

An LLM's knowledge has a training cutoff date. It doesn't know what happened after that date. It can't check current stock prices, read today's news, or verify whether a company still exists. Without external tools — web search, API access, database queries — it operates on a frozen snapshot of the world.

Models increasingly have tool access that addresses this. But the base capability doesn't include real-time awareness. Systems built on LLMs need to account for this: either provide the model with current information at query time, or clearly scope its use to tasks that don't require current data.
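Providing current information at query time can be as simple as assembling it into the prompt. The sketch below assumes the documents have already been fetched by some retrieval step that is not shown. `build_grounded_prompt` is a hypothetical helper, and the instruction wording is one possible phrasing rather than a tested prompt.

```python
from datetime import date

def build_grounded_prompt(question, documents, today=None):
    """Assemble a prompt that carries today's date and retrieved text."""
    today = today or date.today()
    if not documents:
        # Nothing retrieved: tell the caller instead of letting the model guess.
        raise ValueError("no current documents supplied for this question")
    sources = "\n".join(f"[{i}] {doc}" for i, doc in enumerate(documents, start=1))
    return (
        f"Today's date: {today.isoformat()}\n"
        "Answer using only the sources below. "
        "If they do not contain the answer, say so.\n"
        f"Sources:\n{sources}\n\n"
        f"Question: {question}"
    )

prompt = build_grounded_prompt(
    "Is the store open on Saturday?",
    ["Store hours: 9-5 on weekdays, closed weekends."],
    today=date(2025, 10, 27),
)
print(prompt)
```

Refusing to build a prompt when nothing was retrieved is a design choice: it keeps the system from silently falling back on the model's frozen snapshot.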

They Can't Replace Domain Expertise

An LLM can produce text that reads like it was written by an expert. It cannot replace the judgment that comes from years of experience in a field. The model doesn't know which details matter in context. It doesn't recognize when a situation is unusual. It can't tell when its own output is subtly wrong in ways that only a domain expert would catch.

The right framing is augmentation, not replacement. An LLM can help an experienced engineer draft documentation faster. It cannot substitute for the engineer's knowledge of which edge cases will break the system. It can help a lawyer research precedents. It cannot substitute for the lawyer's judgment about which precedents actually apply.

The gap between "plausible output" and "correct output" is where domain expertise lives. LLMs close that gap partially. They don't eliminate it.

For building real depth in AI and machine learning — the kind that lets you evaluate what these systems can actually do — a technical reference library with current, expert-reviewed content is worth more than any amount of model-generated explanation. The irony of learning about LLM limitations from an LLM is not lost on anyone.

Why This All Matters

Systems built on false assumptions about LLM capabilities fail. Not sometimes — predictably. Here's what that looks like in practice:

Customer-facing chatbots that hallucinate erode trust faster than they build it. One confidently wrong answer about your product, policy, or pricing does more damage than having no chatbot at all.

Automated decision systems that treat LLM output as fact produce systematic errors. The model is wrong in patterns, not randomly. Those patterns create liability.

Workflows that skip human review because the LLM output "looks good" accumulate errors over time. The fluency of the output masks the frequency of the mistakes.

The solution isn't to avoid LLMs. It's to use them with accurate expectations. Build validation layers. Keep humans in the loop for consequential decisions. Design systems that assume the model will be wrong some percentage of the time and handle that gracefully.
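One way to picture a validation layer with a human in the loop is a routing function that every model-generated draft passes through. The sketch below is illustrative: the `Draft` fields and the three outcomes are assumptions made for this example, and the automated checks themselves are taken as given.

```python
from dataclasses import dataclass

@dataclass
class Draft:
    text: str
    consequential: bool   # e.g. touches pricing, policy, or legal wording
    checks_passed: bool   # result of automated validation (assumed to exist)

def route(draft):
    """Decide what happens to a model-generated draft."""
    if not draft.checks_passed:
        return "reject"         # failed validation never ships
    if draft.consequential:
        return "human_review"   # a person signs off on high-stakes output
    return "auto_send"

print(route(Draft("Your refund policy is...", consequential=True, checks_passed=True)))
```

The ordering matters: a draft that fails validation is rejected even if a human would later review it, so reviewers only spend time on output that has already cleared the automated checks.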

The Right Mental Model

LLMs are powerful text generation tools with known, predictable limitations. They are best used where:

Speed matters more than perfection. First drafts, brainstorming, summarization.
A human reviews the output. Code suggestions, research assistance, writing support.
The cost of being wrong is low. Internal communication, idea generation, exploratory analysis.
They're augmented with tools. Search, calculation, database access, retrieval systems.

They are poorly suited for situations where:

Accuracy is non-negotiable. Medical, legal, financial decisions.
There is no human verification step. Fully automated pipelines producing customer-facing content.
Real-time or proprietary information is required and the model has no RAG or tool access.
The stakes of being wrong are high. Safety-critical systems, compliance, regulatory filings.

Knowing this boundary is the difference between building something useful and building something that fails expensively.

The Takeaway

This is the final post in the Intro to LLMs learning path. If you've followed along, you now have a grounded understanding of what large language models are, how they work, how they're trained, and — critically — where they break down. That puts you ahead of most people making decisions about AI right now, including many of the people selling it.

The next question is: what do you do with this understanding?

Three paths forward, depending on what you're building:

RAG and Knowledge Systems — Learn how to give LLMs access to your own data, overcoming the knowledge cutoff and hallucination problems through retrieval-augmented generation.
Prompt Engineering — Go deeper on how to communicate with LLMs effectively, getting consistent results for specific tasks.
Building with LLMs — Move from understanding to implementation. APIs, tool integration, system design for AI-augmented workflows.

The hype cycle will do what hype cycles do. Understanding the actual technology — including its limits — is what lets you build things that work after the hype moves on.

ShiftQuality