What the Dead Internet Theory gets right about AI’s data problem
You might not have heard of “dead internet theory,” but growing concerns surround the idea that the web is increasingly dominated by bots and recycled AI content. As a theory, it’s mostly speculative, but the concern it raises is real: it’s harder than ever to trust content on the internet. The web is being flooded with duplication, misinformation, and synthetically generated material, often with little transparency into its origin.
This shift doesn’t just affect how we consume information; it also creates a significant problem for the development of artificial intelligence (AI) – especially large language models (LLMs), which rely on massive volumes of data to learn and improve. When that data is rich in real human input, the resulting models produce meaningful output. When the pool is polluted, the output loses quality and relevance.
That’s why there’s growing urgency around one existential question: how do we ensure that AI systems continue learning from the best of human knowledge, and not from a diluted version of it?
The internet isn’t the training ground it once was
For years, the open web was treated as a near-limitless source of training data. Developers scraped blogs, forums, documentation sites, and Q&A platforms to build up a vast corpus of human-created content. However, the conditions that made this possible are changing.
Many sites are now placing limits on access to their content. Regulations around how data can be collected and used are evolving. Meanwhile, the quality of publicly available information is in decline, with more and more of it churned out by generative AI models that echo existing content rather than contributing new insight.
This dynamic results in a supply crunch – not just in quantity, but in originality. The kind of diverse, human-generated knowledge that fuels useful AI is becoming harder to come by. Models trained on stale or derivative inputs may still sound fluent, but their insights risk becoming shallow, outdated, or unreliable.
Synthetic data can’t close the gap
One proposed fix is synthetic data: material generated by AI to help train other AI. At first glance, this might seem like an efficient workaround. But it introduces serious limitations.
Synthetic data is created by identifying and replicating patterns in existing datasets. This means it’s only as good as the material it’s based on – and often worse. It tends to miss edge cases, overlook nuance, and amplify whatever flaws or biases were present in the original source. When used uncritically, it can lead to a feedback loop of inaccuracy.
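To make that feedback loop concrete, here is a small illustrative sketch – a toy example of my own, not drawn from any specific system. A trivial “model” is repeatedly refit on a finite synthetic corpus sampled from its previous self; rare items that fail to appear in any one generation vanish permanently, so the diversity of the data only ever shrinks.

```python
# Toy illustration (assumed example, not a real training pipeline): refitting
# a model on its own synthetic output steadily loses rare "edge case" tokens.
import numpy as np

rng = np.random.default_rng(42)

vocab = 50                            # distinct tokens/"facts" in the real data
probs = np.full(vocab, 1.0 / vocab)   # real data covers everything evenly

for generation in range(10):
    surviving = int((probs > 0).sum())
    print(f"gen {generation}: {surviving}/{vocab} tokens still represented")
    # Generate a finite synthetic corpus from the current model...
    corpus = rng.choice(vocab, size=200, p=probs)
    # ...then refit the "model" on that synthetic corpus alone. Any token that
    # happened not to appear gets probability zero and can never return.
    counts = np.bincount(corpus, minlength=vocab)
    probs = counts / counts.sum()
```

Run repeatedly, the surviving-token count only declines: a crude stand-in for how uncritical reuse of synthetic data narrows what a model can ever learn.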
That doesn’t mean synthetic data has no value. It’s useful in specific contexts, such as testing or privacy-preserving scenarios, but it’s no substitute for the grounded insight that real experts provide.
The case for a more human approach
The best training data still comes from people solving problems, sharing what they’ve learned, and improving that knowledge over time as technology and processes evolve. This iterative process is what gives content relevance, depth, and real-world context – something no model can generate on its own.
One emerging approach to capturing this value is Knowledge as a Service (KaaS). Unlike traditional scraping methods, KaaS focuses on structured, ongoing contributions from individuals with domain expertise. It builds dynamic systems where knowledge is constantly refined, updated, and validated.
Think of the KaaS approach like an open-source model, but applied to knowledge rather than code. Communities participate not just in the creation of content, but in the governance and evolution of it. The result is a living, high-quality dataset that reflects current thinking and real human experience.
Why KaaS matters for AI’s future
KaaS supports more ethical, effective, and sustainable AI development. It prioritises attribution and transparency, so contributors can see how their inputs are used. It provides a fresh, trusted source of domain-specific data that keeps models relevant in fast-moving fields. Crucially, it ensures that the human perspective remains central to AI systems, especially as those systems grow in scale and complexity.
In fields where accuracy and accountability are critical, such as healthcare and finance, this kind of trusted, sustainable approach is more than a nice-to-have – it’s essential.
A smarter foundation for smarter systems
Building better AI doesn’t mean building bigger models. It means making more intentional decisions about the data that feeds those models: shifting away from dependence on scraped or synthetic material and investing in platforms that encourage real knowledge-sharing.
With the right incentives, structure, and safeguards, these platforms can enable a more human-centered approach to AI – one that’s grounded in real experience, not repetition. It’s not about resisting automation, but ensuring AI reflects the best of what people know.