As large language models (LLMs) and image generators grow in complexity, a new challenge has emerged. It isn't the hardware, the algorithms, or even the ethics: we're simply running out of real data.

According to Stanford’s 2025 HAI AI Index Report, researchers project that by 2026, we’ll run out of the high-quality, human-created data that’s been used to train the most advanced models so far. When the internet can’t give you anything new to learn, the answer could be to generate your own information.

The rise of AI trained on AI

Synthetic data — data generated by other models — is positioned as a solution to this scarcity. Instead of scraping Reddit or transcribing books, future models might train on text written by GPT-4 or images created by Midjourney. In other words, AI will start learning from its own output.

This raises serious philosophical and technical questions:

• What happens when a model only sees data that was created by another model?

• Does it lose touch with human nuance, context, ambiguity?

• Will it start to drift into a synthetic echo chamber, amplifying its own quirks?

Model collapse

The term “model collapse” describes what happens when generations of models are trained on the output of previous models. Like photocopying a photocopy, the fidelity degrades.

The report cites research showing that when models rely too heavily on synthetic data, they start forgetting how to “think” clearly. They become overconfident, less diverse in their outputs, and increasingly unaware of edge cases — the messy, nuanced situations that humans deal with every day.

This means:

• AI becomes less accurate

• Errors get baked into future generations

• And, ironically, models trained on more data might perform worse if much of that data is synthetic

The upshot: AI might get dumber before it gets smarter.
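To see the photocopy effect in miniature, here is a toy simulation (my sketch, not something from the report). The "model" is nothing more than a Gaussian fit, and each new generation is trained only on samples drawn from the previous generation's fit:

```python
import numpy as np

rng = np.random.default_rng(0)

# Generation 0: the "human" corpus, drawn from a wide distribution.
data = rng.normal(loc=0.0, scale=1.0, size=100)

for generation in range(1, 31):
    # The "model" is just a Gaussian fit to whatever it was trained on.
    mu, sigma = data.mean(), data.std()
    # The next generation trains only on samples from that fitted model,
    # i.e. purely synthetic data, with no fresh human input.
    data = rng.normal(loc=mu, scale=sigma, size=100)
    if generation % 5 == 0:
        print(f"generation {generation:2d}: spread (std) = {sigma:.3f}")

# Typical runs show the spread drifting downward over the generations:
# rare, tail-ish values stop being sampled, so later models never see them.
```

The spread of the data tends to shrink as the generations pass, which is the statistical version of losing the edge cases.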

Layering: One path forward

Researchers are already experimenting with ways to make synthetic data complement human data rather than replace it entirely.

One promising approach is layering. Think of it like composting: mixing fresh organic material (human data) with reused scraps (synthetic outputs) to grow something richer.

Results suggest that blending synthetic and real data — carefully, with proper filtering — can improve performance without sacrificing reliability, especially when synthetic data is used to augment underrepresented domains.
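In code, that blending might look something like the sketch below. The helper names, the quality filter, and the 30% synthetic share are illustrative assumptions on my part, not recommendations from the research:

```python
import random

def build_training_mix(human_examples, synthetic_examples,
                       quality_filter, synthetic_share=0.3, seed=0):
    """Blend human and synthetic examples, keeping synthetic data a minority.

    quality_filter is any callable that returns True for synthetic samples
    worth keeping (e.g. a fidelity score above some threshold). The 30%
    synthetic share is an illustrative default, not a finding from the report.
    """
    rng = random.Random(seed)

    # Filter out low-quality synthetic samples before they enter the mix.
    usable = [ex for ex in synthetic_examples if quality_filter(ex)]

    # Cap synthetic data so it makes up at most `synthetic_share` of the blend.
    cap = int(len(human_examples) * synthetic_share / (1 - synthetic_share))
    chosen = rng.sample(usable, min(cap, len(usable)))

    mix = list(human_examples) + chosen
    rng.shuffle(mix)
    return mix
```

The design choice that matters is the cap: human data stays in the majority, and synthetic data is used to fill gaps in underrepresented domains rather than to dominate the mix.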

There’s also growing interest in:

• Data deduplication (removing repeated patterns; a rough sketch follows below)

• Fidelity scoring (rating how “real” synthetic samples feel)

• Feedback loops where humans remain part of the training process

All this to keep AI connected to the complexity of the real world, even when it’s dreaming in vectors.
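Here is roughly what the first two of those ideas can look like in their simplest form. The hashing trick and the realism_model callable are assumptions for illustration; production pipelines use fuzzier matching and far more careful judges:

```python
import hashlib

def dedupe(texts):
    """Drop verbatim or near-verbatim repeats by hashing a normalized form."""
    seen, unique = set(), []
    for text in texts:
        key = hashlib.sha256(" ".join(text.lower().split()).encode("utf-8")).hexdigest()
        if key not in seen:
            seen.add(key)
            unique.append(text)
    return unique

def filter_synthetic(samples, realism_model, cutoff=0.8):
    """Keep deduplicated synthetic samples that a judge scores as realistic.

    realism_model is a stand-in for whatever judge you trust: a classifier
    trained to tell human from synthetic text, a perplexity check, or a human
    rater. Here it is assumed to be a callable returning a score in [0, 1].
    """
    return [s for s in dedupe(samples) if realism_model(s) >= cutoff]
```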

Implications for builders and leaders

If you’re building or deploying AI, this shift means:

• Data sourcing becomes strategic. You'll need pipelines that mix human and synthetic inputs in ways that don't break the model.

• Transparency matters more than ever. If a model was trained mostly on its own output, that should be known and measured (a simple sketch of provenance tracking follows below).

• Evaluation becomes non-negotiable. Traditional benchmarks may not catch synthetic drift. New ones will need to test for originality, adaptability, and human alignment.
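One low-tech way to make that transparency concrete is to tag every training example with its provenance and report the mix. The class and field names below are hypothetical; the point is simply that the synthetic share should be a number you can print, not a guess:

```python
from collections import Counter
from dataclasses import dataclass

@dataclass
class TrainingExample:
    text: str
    source: str  # e.g. "human", "synthetic:gpt-4", "synthetic:self"

def provenance_report(examples):
    """Report what fraction of the corpus came from each broad source."""
    counts = Counter(ex.source.split(":")[0] for ex in examples)
    total = sum(counts.values())
    return {source: count / total for source, count in counts.items()}

corpus = [
    TrainingExample("a human-written paragraph", "human"),
    TrainingExample("another human-written paragraph", "human"),
    TrainingExample("a model-generated paragraph", "synthetic:gpt-4"),
]
print(provenance_report(corpus))  # roughly {'human': 0.67, 'synthetic': 0.33}
```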

Perhaps the biggest takeaway is this: We’re entering the first era where machines might be influenced more by each other than by us. That should make us ask not just what the models know, but where they learned it from.