AI Progress in the Second Half of the 2020s

2024-01-21 · 10 min read · By Patrick Nercessian

AI Progress So Far

The past few years have seen unprecedented advances in artificial intelligence, particularly through the use of transformers in autoregressive and diffusion architectures. Since the release of GPT-2 in 2019, we've witnessed a cascade of increasingly capable AI models from multiple organizations. These models have progressed from basic text completion to complex reasoning, coding, and even multimodal capabilities combining text, images, video, and audio.

Key milestones include:

  • 2019:
    • Demonstrated the success of self-supervised training for decoder-only transformers (GPT-2)
  • 2020:
    • Set new standards for language understanding/generation; demonstrated in-context learning (GPT-3)
    • Demonstrated breakthrough protein structure prediction (AlphaFold 2)
    • Public launch of driverless taxis in geo-fenced areas (Waymo)
  • 2022:
    • Impressive multimodal understanding by processing text and image data jointly (Flamingo)
    • Compute-optimal scaling laws for transformer large language models (Chinchilla)
    • Significant advancement in text-to-image generation (DALL-E 2, Imagen)
    • Brought conversational AI to the mainstream (ChatGPT and GPT-3.5)
  • 2023:
    • Massive improvements in reasoning and multimodal capabilities (GPT-4)
  • 2024:
    • Expanded multimodal capabilities, combining text, image, video, and audio input (and output for text and image) into the same models (GPT-4o)
    • Heavy improvements in reasoning and overall model quality (Claude 3.5 Sonnet, o1)
    • Impressive video generation models (Sora, Veo 2, Runway, Kling)
    • Impressive music generation models (Suno, Udio)
    • Massive improvements in structure prediction of biomolecular interactions (AlphaFold 3)
    • Significant improvements for driverless cars without geo-fencing (Tesla FSD v12 and v13)

Have LLMs hit a wall?

At first glance, after going from GPT-2 -> GPT-3 -> GPT-4, it may feel like progress has stalled. However, we think this is a common pitfall: desensitization, or hedonic adaptation. We believe there are three major sources of this desensitization:

  1. People who were previously paying attention are now paying much closer attention, so AI progress feels slower. Recall that it took almost 3 years from the GPT-3 announcement until the GPT-4 announcement. As of this writing, it has not even been 2 years since the GPT-4 announcement.
  2. Model releases are much, much more frequent. This stems from more frontier AI labs (OpenAI, Anthropic, Google DeepMind, Meta), more secondary AI labs (DeepSeek, Qwen, Mistral), and a shorter release cycle from each. Thus, there were far fewer models between GPT-3 and GPT-4 than between GPT-4 and GPT-4o.
  3. AI models are increasingly saturating both formalized benchmarks and unformalized ones (i.e. "vibes"). As models improve, people ask harder and harder questions, so the rate of "failure" in real-life usage feels roughly the same as it did two years ago.

The result of these sources is that AI progress can feel slower than it did two or four years ago. In reality, however, the models we have access to today are significantly better than models from one or two years ago, such as GPT-4 and GPT-4 Turbo. This is evident in essentially every benchmark, as well as in the anecdotal experience of our team.

Will LLMs hit a wall?

Assuming LLMs have not yet hit a wall: will they do so in the near future? Common arguments for "yes" include running out of data, fundamental limits of autoregressive transformers, and diminishing returns from more/better data and more compute. Let's address each point in turn.

  1. Running out of Data

    • Llama 3, the best public window we have into modern state-of-the-art LLM training, was trained on roughly 15 trillion tokens.
      • Most of this data is from mass internet web crawls.
      • Some estimates put the number of books published each year at ~2.2 million. Assuming an average of 80,000 words per book, that's ~235B tokens (~175B words) published every year, which amounts to roughly 14-15 trillion high-quality tokens since 1960, about the amount Llama 3 was trained on (see the back-of-envelope sketch after this list).
      • Google's YouTube has about 1 million hours of footage uploaded every day. At 30 frames per second, and 258 tokens per frame, that's about 27 trillion tokens of video added every single day, or roughly ten quadrillion tokens per year of uploads. This ignores the audio tracks entirely! The tens of trillions of tokens used to train today's state-of-the-art models pale in comparison.
        • Keep in mind, this is just one product. There are similarly enormous volumes of human-generated tokens on other social media services, such as Facebook, Instagram, LinkedIn, TikTok, Twitter, Reddit, etc.
    • Frontier AI labs are finding surprising success with synthetic data approaches.
      • For example: generating code or math solutions with LLMs, filtering for high-quality responses using automated deterministic tests or even LLM judges, and then performing reinforcement learning, fine-tuning, or even pretraining on the filtered data (a rough sketch of this loop follows the list).
        • OpenAI has indicated that it used this kind of methodology to create o1.
        • Many other labs, large and small, have claimed they're seeing success with synthetic data.
    • In any case, if we really do run out of data, AI scientists have shown that scaling up model size results in better sample efficiency. We are currently in an overtraining paradigm (because inference costs dwarf training costs, it pays to overtrain smaller models), but if we become constrained by data instead, there is a lot of room for improvement by entering an undertrained paradigm, i.e. by massively scaling model sizes (see the tokens-per-parameter sketch after this list).
  2. Fundamental Limits of Autoregressive Transformers

    • Recent research has mathematically proven that autoregressive, decoder-only transformers are Turing-complete under certain assumptions, for example when allowed to generate intermediate tokens as working memory.
    • Time and time again, skeptics of the modern generative AI paradigm have predicted that these models would fail at specific tasks, and time and time again they have quickly improved enough to succeed at those tasks.
  3. Diminishing Returns of Increasing Data Size, Data Quality, and Compute

    • Of the three, this argument has the best chance of being right. We can never be sure that a trend will continue, and this is no different. However, we can take hints from the frontier labs, who have at least a few months' head start on seeing these models in action.
      • Noam Brown (OpenAI multi-agent researcher): "We announced @OpenAI o1 just 3 months ago. Today, we announced o3. We have every reason to believe this trajectory will continue." (link)
      • Dario Amodei (co-founder and CEO of Anthropic) loosely predicts 2026 or 2027 for "very powerful AI" (link)
      • Sam Altman (co-founder and CEO of OpenAI): "We are now confident we know how to build AGI as we have traditionally understood it. We believe that, in 2025, we may see the first AI agents “join the workforce” and materially change the output of companies." (link)
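
For concreteness, here is a back-of-envelope sketch of the book and YouTube arithmetic from point 1. The constants are the same rough assumptions used in the bullets (about 1.3 tokens per word, 258 tokens per video frame, one million upload hours per day), so treat the outputs as order-of-magnitude estimates only.

```python
# Back-of-envelope estimates for the data-availability bullets above.
# Every constant is a rough assumption, not a measurement.

TOKENS_PER_WORD = 235 / 175        # ~1.34, implied by the ~235B-token / ~175B-word figure
BOOKS_PER_YEAR = 2.2e6             # estimated books published per year
WORDS_PER_BOOK = 80_000
YEARS_SINCE_1960 = 2024 - 1960

book_tokens_per_year = BOOKS_PER_YEAR * WORDS_PER_BOOK * TOKENS_PER_WORD
print(f"Book tokens per year:   {book_tokens_per_year:.2e}")                    # ~2.4e11 (~235B)
print(f"Book tokens since 1960: {book_tokens_per_year * YEARS_SINCE_1960:.2e}") # ~1.5e13 (~15T)

YOUTUBE_HOURS_PER_DAY = 1e6        # assumed upload volume
FRAMES_PER_SECOND = 30
TOKENS_PER_FRAME = 258             # per-frame token count assumed in the bullet

video_tokens_per_day = YOUTUBE_HOURS_PER_DAY * 3600 * FRAMES_PER_SECOND * TOKENS_PER_FRAME
print(f"Video tokens per day:  {video_tokens_per_day:.2e}")        # ~2.8e13 (~28T)
print(f"Video tokens per year: {video_tokens_per_day * 365:.2e}")  # ~1.0e16 (~10 quadrillion)
```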
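
The synthetic-data loop described in point 1 can be sketched roughly as follows. The generate_candidates and passes_check functions are hypothetical stand-ins for an LLM sampler and an automated verifier (unit tests, a math checker, or an LLM judge); this shows the shape of the pipeline, not any particular lab's implementation.

```python
import random

def generate_candidates(problem: str, n: int) -> list[str]:
    # Hypothetical stand-in for sampling n candidate solutions from an LLM.
    return [f"candidate solution {i} for: {problem}" for i in range(n)]

def passes_check(problem: str, solution: str) -> bool:
    # Hypothetical stand-in for a deterministic verifier (unit tests, a proof
    # checker) or an LLM judge. Random acceptance here, purely for illustration.
    return random.random() < 0.2

def build_synthetic_dataset(problems: list[str], samples_per_problem: int = 8) -> list[dict]:
    # Generate many candidates, keep only the verified ones, and collect them
    # as prompt/completion pairs for fine-tuning, RL, or pretraining.
    dataset = []
    for problem in problems:
        for solution in generate_candidates(problem, samples_per_problem):
            if passes_check(problem, solution):
                dataset.append({"prompt": problem, "completion": solution})
    return dataset

if __name__ == "__main__":
    data = build_synthetic_dataset(["write a function that sorts a list", "prove that 17 is prime"])
    print(f"kept {len(data)} verified examples out of 16 candidates")
```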
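
To make the overtraining vs. undertraining point concrete, here is a quick sketch using the Chinchilla heuristic of roughly 20 training tokens per parameter as the compute-optimal ratio. The ~15T-token figure and parameter counts are Meta's published numbers for Llama 3; the 20:1 ratio is an approximation, not an exact law.

```python
CHINCHILLA_TOKENS_PER_PARAM = 20   # rough compute-optimal ratio from the Chinchilla paper
TRAINING_TOKENS = 15e12            # Llama 3 was pretrained on roughly 15T tokens

# Published Llama 3 parameter counts.
for name, params in [("Llama 3 8B", 8e9), ("Llama 3 70B", 70e9)]:
    ratio = TRAINING_TOKENS / params
    print(f"{name}: ~{ratio:.0f} tokens per parameter "
          f"(~{ratio / CHINCHILLA_TOKENS_PER_PARAM:.0f}x the compute-optimal ratio)")

# Both models sit far above ~20 tokens per parameter, i.e. heavily overtrained
# relative to compute-optimal. If data ever becomes the bottleneck, scaling
# parameters back toward (or below) that ratio is the remaining lever.
```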

What might 2030 look like?

There are a few major trends to pay attention to:

  1. More data and more high-quality data are being used to pretrain and post-train these models.
  2. Progressively increasing compute.
    • Massively parallelized computer chips are continuing to improve in speed, memory, and other metrics.
    • Larger and larger clusters of these chips are being built constantly: OCI Supercluster from Oracle, Colossus from xAI, and the rumored Stargate from OpenAI and Microsoft.
  3. Increasing focus and effort on post-training, using methods such as RLHF, RLAIF, supervised fine-tuning, and other reinforcement learning and synthetic data methods based on the provable validity of responses.
  4. Improving on the vanilla transformer architecture. Papers like the Byte-Latent Transformer, Transformer^2, Reasoning in Continuous Latent Space, and many others are recent, seemingly successful attempts at this.
  5. Distributed training. As cluster sizes hit physical and infrastructure limits, research is under way on training models in a distributed fashion. Prime Intellect's OpenDiLoCo (building on DeepMind's DiLoCo) and Nous Research's DisTrO are examples of this (a minimal sketch of the idea follows this list).
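
Below is a heavily simplified sketch of the local-update-then-synchronize idea behind DiLoCo-style training: each worker takes many cheap local optimization steps independently, and the replicas are only synchronized occasionally, which slashes communication. Real methods apply an outer optimizer to the averaged update and train neural networks; the toy quadratic loss here is only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
TARGET = np.array([3.0, -2.0])   # minimizer of the toy quadratic loss below

def noisy_grad(params: np.ndarray) -> np.ndarray:
    # Gradient of 0.5 * ||params - TARGET||^2 plus noise, standing in for a
    # worker's minibatch gradient.
    return (params - TARGET) + rng.normal(scale=0.1, size=params.shape)

def train(num_workers: int = 4, outer_rounds: int = 20, local_steps: int = 50, lr: float = 0.05):
    params = np.zeros(2)                      # shared starting point
    for _ in range(outer_rounds):
        replicas = []
        for _ in range(num_workers):
            local = params.copy()
            for _ in range(local_steps):      # many local steps, no communication
                local -= lr * noisy_grad(local)
            replicas.append(local)
        params = np.mean(replicas, axis=0)    # rare synchronization: average the replicas
    return params

print(train())  # converges near [3, -2] despite infrequent synchronization
```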

If we extrapolate these trends, 2029 and 2030 may feature enormous (even by today's standards) distributed training runs for heavily multimodal (text, image, video, and audio input and output) models. These models may be able to autonomously perform long-horizon tasks, such as building complex web applications, creating detailed visual media, managing teams of agents, and writing detailed research on computer science, artificial intelligence, mathematics, physics, and more (chemistry and biology may require more human intervention due to laboratory work).

This would be the beginning of massive job automation, in which AI could fully perform most white-collar jobs. Depending on how quickly general-purpose (e.g. humanoid) robots can be manufactured, we may start to see significant blue-collar job replacement as well.

It would also be the beginning of a sharp reduction in scarcity relative to today's society. Controlling for potential governmental increases of the monetary supply, prices for nearly every good and service should drop significantly as the cost of producing them falls and overall efficiency increases.