
Emergent AI: When Models Surprise Creators

Why large AI models develop surprising capabilities like arithmetic and reasoning that smaller models lack. Emergent behaviors explained.

S5 Labs Team, February 4, 2026

In 2020, researchers at OpenAI trained GPT-3, a language model with 175 billion parameters. They expected it to be better at predicting text than its predecessors. What they didn’t fully expect was that it would suddenly be able to do things that smaller models couldn’t do at all—translating between languages it wasn’t explicitly trained to translate, writing functional code, and solving problems it had never seen before.

This phenomenon, where capabilities appear in larger models that are completely absent in smaller ones, is called emergence. It’s one of the most fascinating and debated topics in AI research, with implications for how we understand, predict, and control AI systems.

What Emergence Looks Like

The striking feature of emergence isn’t that bigger models perform better—that’s expected. It’s that certain capabilities seem to switch on abruptly, like a light, rather than improving gradually.

Imagine plotting a model’s ability to solve three-digit addition problems against model size. For small models, performance is essentially zero—no better than random guessing. As models grow, performance stays near zero. Then, at some critical size, it suddenly jumps to high accuracy. There’s no gradual improvement; the capability is either absent or present.
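The thought experiment above can be sketched as a toy curve. The critical scale and steepness below are invented for illustration, not measured from any real model:

```python
import math

def exact_match_accuracy(log_params: float) -> float:
    """Hypothetical accuracy on three-digit addition as a function
    of model size (log10 of parameter count): essentially zero below
    a critical scale, then a sharp jump. Modeled as a steep logistic;
    both constants are assumptions for the sketch."""
    critical = 10.5   # assumed critical scale: 10^10.5 parameters
    steepness = 8.0   # assumed sharpness of the transition
    return 1 / (1 + math.exp(-steepness * (log_params - critical)))

for log_p in [8, 9, 10, 10.5, 11, 12]:
    print(f"10^{log_p:<4} params -> accuracy {exact_match_accuracy(log_p):.3f}")
```

Run it and the printed column stays near 0.000 for small scales, then snaps to near 1.000 past the critical point, which is the qualitative shape researchers report.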

This pattern appears across many different capabilities, which is what makes it scientifically interesting—and practically important. If capabilities emerge unpredictably, it becomes harder to anticipate what future AI systems will be able to do.

Documented Examples of Emergent Capabilities

Researchers have catalogued numerous examples of emergent behavior. Here are some of the most significant:

Few-Shot Learning

Brown et al. (2020) demonstrated that GPT-3 could learn new tasks from just a few examples provided in the prompt—without any retraining. Show it three examples of translating English to French, and it continues the pattern. Show it examples of a new format, and it follows that format.

Smaller models can’t do this. They might produce related-looking text, but they don’t actually extract and apply the pattern. This capability emerged specifically in large language models and fundamentally changed how people interact with AI—it’s the foundation of modern prompt engineering.
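A few-shot prompt is just examples concatenated ahead of the query, leaving the final answer slot blank for the model to fill. A minimal sketch, with an invented helper and word pairs chosen for illustration:

```python
def few_shot_prompt(examples: list[tuple[str, str]], query: str) -> str:
    """Build an in-context learning prompt: labeled translation
    examples followed by an unlabeled query for the model to
    complete by continuing the pattern."""
    blocks = [f"English: {en}\nFrench: {fr}" for en, fr in examples]
    blocks.append(f"English: {query}\nFrench:")
    return "\n\n".join(blocks)

prompt = few_shot_prompt(
    [("sea", "mer"), ("cheese", "fromage"), ("hello", "bonjour")],
    "bread",
)
print(prompt)
```

No weights change; the "learning" happens entirely inside the forward pass as the model infers the pattern from the prompt.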

Chain-of-Thought Reasoning

Wei et al. (2022) at Google showed that when prompted to “think step by step,” large language models could solve complex reasoning problems that they had previously failed at. Critically, this only works above a certain model scale.

For example, when solving math word problems:

  • Models under ~10 billion parameters show no improvement with chain-of-thought prompts
  • Models over ~100 billion parameters show dramatic improvements, sometimes doubling accuracy

The technique doesn’t help small models at all—it specifically unlocks capability that only exists in larger models.
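In practice, chain-of-thought prompting means including a worked exemplar whose answer spells out the intermediate steps. The exemplar text below is illustrative, written in the style of Wei et al. (2022), not copied from the paper:

```python
# One exemplar whose answer shows its reasoning, then a new question.
exemplar = (
    "Q: A farm has 15 trees. Workers plant more until there are 21 trees. "
    "How many trees did they plant?\n"
    "A: There were 15 trees originally. Now there are 21. "
    "21 - 15 = 6. The answer is 6.\n\n"
)
question = (
    "Q: Roger has 5 tennis balls. He buys 2 cans of 3 tennis balls each. "
    "How many tennis balls does he have now?\n"
    "A:"
)
prompt = exemplar + question
print(prompt)
```

A sufficiently large model, seeing the reasoning pattern, tends to produce its own step-by-step derivation before the final answer; a small model given the identical prompt does not.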

Multi-Digit Arithmetic

The BIG-bench benchmark systematically tested models across hundreds of tasks. Multi-digit arithmetic showed clear emergence: models couldn’t reliably add three-digit numbers until they reached a certain scale, at which point accuracy jumped from near-zero to near-perfect.

This is notable because arithmetic follows simple, deterministic rules. The model isn’t learning heuristics or approximate solutions—it’s somehow acquiring the algorithm itself, but only at sufficient scale.

Word Unscrambling

Given jumbled letters like “ehlol,” humans can quickly recognize “hello.” Small language models fail completely at this task. Large models succeed reliably. The capability appears suddenly rather than gradually improving.

What’s interesting is that this task wasn’t explicitly trained—models learn it as a byproduct of general language training on text that happens to include some examples of letter manipulation.
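The task itself has a mechanical correctness check, even though solving it requires the model to search the space of letter orderings for a real word. A minimal verifier sketch:

```python
from collections import Counter

def is_valid_unscramble(jumbled: str, candidate: str) -> bool:
    """True if `candidate` uses exactly the letters of `jumbled`
    (a multiset comparison via Counter)."""
    return Counter(jumbled) == Counter(candidate)

print(is_valid_unscramble("ehlol", "hello"))  # True: same letters
print(is_valid_unscramble("ehlol", "hell"))   # False: missing an 'o'
```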

Instruction Following

Following complex, multi-step instructions—“Write a haiku about the ocean, make it melancholy, and don’t use the word ‘water’”—requires tracking multiple constraints simultaneously. This capability emerges at scale; smaller models typically satisfy some constraints while ignoring others.

This emergence is practically important because instruction-following is what makes AI assistants useful. The difference between a model that follows instructions and one that doesn’t is qualitative, not just quantitative.
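The all-or-nothing flavor of instruction following is easy to see if you score a response constraint by constraint. A toy checker, covering only the mechanically checkable constraints (topic and mood need semantic judgment):

```python
def check_constraints(poem: str) -> dict[str, bool]:
    """Score a response against the explicit constraints from the
    example instruction: three-line haiku form, and the banned word.
    A model that satisfies only some constraints fails the task."""
    lines = [ln for ln in poem.strip().splitlines() if ln.strip()]
    return {
        "three_lines": len(lines) == 3,
        "avoids_water": "water" not in poem.lower(),
    }

response = (
    "Grey tide pulls away\n"
    "Gulls cry over empty sand\n"
    "No one waits for me"
)
print(check_constraints(response))  # both True for this response
```

Partial satisfaction, say a lovely haiku that mentions water, still counts as a failure, which is why the gap between models below and above the threshold feels qualitative.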

Theory of Mind Tasks

Some research suggests that large language models develop something resembling “theory of mind”—the ability to reason about what other people know or believe. Given scenarios like:

“Sarah put her chocolate in the cupboard and left. While she was gone, her brother moved it to the refrigerator. When Sarah returns, where will she look for her chocolate?”

Larger models correctly predict Sarah will look in the cupboard (where she left it), while smaller models often say the refrigerator (where it actually is). This requires modeling Sarah’s beliefs as distinct from reality—a capability that emerges at scale.

This example is particularly provocative because theory of mind is considered a hallmark of human social cognition. Whether models truly have it or merely pattern-match to the right answer remains debated.

The Emergence Debate: Is It Real?

Not everyone agrees that emergence is what it appears to be. Schaeffer et al. (2023) published an influential paper arguing that apparent emergence might be an artifact of how we measure performance.

Their key insight: the metrics we use matter enormously.

Consider accuracy on arithmetic problems. If we score each problem as either correct (1) or incorrect (0), we get a sharp transition—the model either solves problems or it doesn’t. But if we use a continuous metric that gives partial credit for getting most digits right, the improvement looks gradual rather than sudden.

According to this view, what we call “emergence” might simply be the point where gradual improvements in underlying capability finally cross a threshold where discrete metrics start registering success.
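The measurement argument has a simple quantitative core. Suppose per-digit accuracy improves smoothly with scale; exact-match accuracy on a four-digit answer is roughly that probability to the fourth power, which hugs zero until per-digit accuracy is already high. The numbers below are invented to illustrate the shape:

```python
# Toy model of the Schaeffer et al. (2023) argument: a smooth
# per-digit improvement produces a sharp-looking exact-match curve.
per_digit = [0.20, 0.40, 0.60, 0.80, 0.90, 0.97]  # gradual improvement
exact_match = [p ** 4 for p in per_digit]          # all 4 digits correct

for p, e in zip(per_digit, exact_match):
    print(f"per-digit accuracy {p:.2f} -> exact-match accuracy {e:.3f}")
```

Plot the first column and progress looks gradual; plot the second and it looks emergent. Same underlying model, different metric.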

What the Debate Means

Both perspectives contain truth:

For researchers, the measurement question is crucial. Understanding whether emergence is a fundamental property of scaling or an artifact of evaluation helps predict future capabilities and build better benchmarks.

For practitioners, the distinction may matter less. Whether or not the underlying improvement is gradual, the practical reality is that certain useful capabilities exist only in larger models. A model that gets 90% of arithmetic digits right but never produces a fully correct answer isn’t useful for calculations—you need models above the threshold where complete answers become reliable.

The most balanced view: some capabilities genuinely require sufficient model capacity and exhibit threshold behavior, while others improve gradually but appear sudden due to how we measure them. Distinguishing these cases is an active research question.

Why Does Emergence Happen?

We don’t fully understand why emergence occurs, but several theories offer partial explanations:

The Capacity Hypothesis

Some capabilities require representing complex patterns that simply don’t fit in smaller models. Below a certain capacity, the model lacks the “room” to learn the relevant structure. Above that capacity, it can.

This is analogous to how you can’t fit a meaningful novel into a tweet—some structures require minimum capacity.

Skill Composition

Arora & Goyal (2023) proposed that emergence occurs when models acquire enough component skills that combine to enable more complex tasks. Addition might require understanding digits, place value, carrying, and how to track multiple operations. Below some scale, models learn some components but not all. Above that scale, all components are present, and they compose into functional arithmetic.

This explains why emergence can be sudden: capability requires all components, so partial progress doesn’t register until the final piece is in place.

Superposition and Feature Release

Research on model interpretability has shown that neural networks compress more features into their representations than they have dimensions for, through a phenomenon called “superposition.” When features interfere with each other, some capabilities may be suppressed.

Emergence might correspond to when model capacity becomes sufficient that task-relevant features can be represented without destructive interference—they’re “released” from competition with other features.

Grokking and Phase Transitions

Power et al. (2022) discovered “grokking”—a phenomenon where models suddenly generalize after extended training, well past memorizing the training data. This suggests that learning itself involves phase transitions, where qualitative changes happen suddenly rather than gradually.

If learning has phase transitions, emergence during scaling might reflect similar dynamics: gradual changes in some underlying property eventually trigger a sudden qualitative shift.

Implications for AI Development

Emergence has significant implications for how we develop and deploy AI:

Predictability Challenges

If capabilities emerge unpredictably, it becomes hard to anticipate what larger models will be able to do. This is both an opportunity (new valuable capabilities) and a risk (new dangerous capabilities). AI labs increasingly test for potentially harmful emergent capabilities before releasing new models.

Evaluation Matters

How we measure AI capabilities affects what we observe. Coarse metrics that don’t give partial credit may miss gradual progress and overstate how “sudden” emergence is. Better benchmarks help us understand what’s really happening as models scale.

Threshold Effects in Products

For AI products and applications, emergence means that moving to a more capable model isn’t just “better”—it may unlock entirely new use cases that were impossible before. Image generation is a recent example: Google’s Nano Banana 2 Flash crossed a quality threshold where AI-generated images became viable for production use cases that earlier models couldn’t serve. This affects build-vs-buy decisions and timing of AI integration.

Safety Considerations

Some researchers worry about emergent capabilities related to deception, manipulation, or autonomy. If such capabilities emerge suddenly, they might be difficult to catch before deployment. This motivates research on AI alignment and evaluation of potentially dangerous capabilities.

What This Means for Using AI

Understanding emergence offers practical guidance:

Don’t assume incremental improvement. When new model generations are released, capabilities may not just be “better”—entirely new things may become possible. Re-evaluate tasks that didn’t work with previous models.

Match model size to task complexity. Emergent capabilities cluster in larger models. For tasks requiring reasoning, instruction following, or novel problem-solving, smaller models may not just be worse—they may be incapable. For simple tasks, smaller models work fine and are more efficient.

Recognize capability boundaries. When a model fails at a task, it may not be a matter of poor prompting—the capability may genuinely not exist at that scale. Trying a larger model class (not just better prompts) may be necessary.

Expect surprises. The history of AI development is one of repeatedly underestimating what larger models can do. Capabilities that seem clearly beyond AI often turn out to be emergent properties of scale. Conversely, capabilities that seem simple sometimes remain stubbornly difficult.

The Bigger Picture

Emergence challenges our intuitions about learning and capability. We’re accustomed to progress being gradual—practice something a little, get a little better. But neural networks apparently don’t work that way. They accumulate gradual improvements in internal representations until, suddenly, a capability crystallizes.

This has deep implications for how we think about AI development. We’re not just building better tools; we’re creating systems that can surprise us with capabilities we didn’t explicitly train. The question of what will emerge next—and whether we’ll be ready for it—is one of the defining challenges of AI research.

The next generation of models will be larger, trained on more data, with more sophisticated architectures. What new capabilities will emerge? We don’t know. But the pattern so far suggests we should expect to be surprised.


For the mathematical foundations of transformer capabilities and a technical treatment of emergence in the context of scaling laws, see our comprehensive paper on transformer reasoning. For practical advice on getting better results from current AI systems, check out our guides on prompt engineering and why scaling works.
