#llm #synthetic-data #deepseek #ai-training #reinforcement-learning

How LLMs get smarter from self-generated data

Ever since ChatGPT launched, a contentious question has emerged: what happens when we run out of human-generated data to train LLMs on? Can they continue improving by learning from synthetic data — data they created themselves — or will this lead to degradation, because synthetic data contains no "new knowledge" and is merely an inferior remix of what came before?

Although I find this debate very interesting, I never took the time to make up my own mind about this issue. This changed when I encountered the DeepSeek-V3.2 paper, which stated that they relied significantly on synthetic data for improving upon the previous version of DeepSeek.

This motivated me to read the paper in more detail, with a special focus on what its findings mean for the broader discussion on synthetic data. The paper provides empirical evidence on the use of synthetic data for improving LLMs, and as such lifts the discussion above mere philosophical arguments.

We should be grateful that DeepSeek is willing to share these details about building a model so close to the frontier.

The exercise I set myself was to translate DeepSeek's findings into a down-to-earth argument about the effect of synthetic data: first to convince myself one way or the other, and now to share my thoughts so others can convince me otherwise. So here we go.

We know for a fact that intelligent agents can learn from self-generated data

Why? Because humans can. Take, for example, the world's best professor of number theory, a specialty within mathematics. They don't get smarter by reading textbooks; there is no number theory textbook that contains insights they don't already have. They get better by taking a conjecture and going through the effort of being the first to actually prove it.

In this case, the self-generated data consists of the task specification, the attempts that led to dead ends, and the final successful attempt.

Limited knowledge of LLM training is a common pitfall

The mistaken belief that LLMs can't learn from self-generated data is rooted in a misunderstanding of how LLMs learn. Many people engaging in this discussion hold the simplified view that LLMs learn purely from next-token prediction on all of the internet's text. They are the ones arguing that the number theorist's case can't be compared to an LLM.

Their mistake, however, is that they only consider the pre-training phase and are unaware of the post-training phase. And even those who are aware of the post-training phase often still hold a mental model of "LLM training = next-token prediction".

But it is precisely in this post-training phase that it is possible to learn from self-generated data.

Learning is more than reading

Learning is not just reading text that contains knowledge. Although ingesting knowledge is a necessary first step, real learning happens when the student solves problems that require applying the recently acquired knowledge.

This is where the student truly internalizes the knowledge, becomes smarter and more skilful, and becomes capable of applying that knowledge later to valuable tasks they couldn't have solved without having gone through the exercise.

If we interpret next-token prediction as "reading", it is remarkable that in the early days, when little more than next-token prediction was done, LLMs were already quite capable. That's probably because next-token prediction is not "just reading": it also contains an element of problem solving, namely the problem of predicting the next word.
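To make that "exercise with a built-in grade" framing concrete, here is a minimal sketch of the next-token prediction objective. This is my own toy illustration, assuming a PyTorch-style model that maps token ids to logits; it is not code from the DeepSeek paper.

```python
# Toy illustration (not DeepSeek's code): next-token prediction framed as an
# exercise. The model "solves" the problem of guessing token t+1 from tokens
# 1..t, and the cross-entropy loss grades that attempt.
import torch
import torch.nn.functional as F

def next_token_loss(model, token_ids):
    # token_ids: (batch, seq_len) integer tensor holding a text snippet
    inputs, targets = token_ids[:, :-1], token_ids[:, 1:]
    logits = model(inputs)                      # (batch, seq_len - 1, vocab)
    return F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),    # one prediction per position
        targets.reshape(-1),                    # the "correct answer" to each exercise
    )
```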

A learning coach and a class of students

To understand how an LLM can get smarter from self-generated data, it is helpful to consider the following analogy.

The human researcher is the learning coach. The LLM is a large class of thousands of students. The learning coach (human researcher) and the class of students (LLM) will work together to obtain higher test scores.

The learning coach (human) is in general less knowledgeable than the students (LLM). They are, however, an expert in learning, and will explain to the students how to study, get smarter, and consequently obtain higher test scores.

Exercises over textbooks

The students (LLM) already read the biggest textbook on earth (the internet). Therefore, the learning coach (human) does not possess any extra knowledge they can transfer to the students (LLM).

Instead, the learning coach explains to the students how they can create exercises for themselves that will help them to better internalize the knowledge they already read in the textbook (internet) and as such become more skilful, achieving higher test scores.

Notice how this comes close to the number theorist becoming a better mathematician by proving a conjecture, but is not quite the same. In both cases, they learn from self-created exercises.

However, whereas the students (LLM) need a learning coach (human researcher) to tell them to create the exercises, the number theorist did that on their own.

Even reading the textbook is actually an exercise

We have framed next-token prediction on internet data as "reading a textbook" because it aligns with the widespread view of pre-training as the ingestion of knowledge.

There is, however, an alternative view in which we interpret pre-training itself as an exercise, namely the exercise of predicting the next token, which can be considered a "lower-level" exercise compared to the other exercises we give to the model.

One could argue that it is more like kindergarten, where the model learns to speak, which allows us to give it more difficult exercises later on.

How do the students (LLM) learn from exercises?

The students (LLM) receive the task. The students (LLM) solve the task. The answer of the students (LLM) is verified by a dedicated verifier.

If the answer is correct, the students (LLM) reinforce the neural connections in their brain (weights) that led to the correct result. If the answer is wrong, the students (LLM) weaken the neural connections in their brain (weights) that led to the wrong result.

As you can see, the verification of the answer is a crucial step. Therefore, the exercises need to be hard to solve, but easy to verify.

On the one hand, they should be easy to verify, so that wrong answers can be robustly discarded and the students avoid learning wrong behavior. On the other hand, they should be difficult enough for the students (LLM) to actually become smarter from doing the exercise.
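The loop below is a minimal sketch of this solve-verify-reinforce cycle. The interfaces (llm.generate, llm.reinforce, the verifier callable) are hypothetical placeholders of mine, not DeepSeek's pipeline; a real implementation would apply a proper reinforcement-learning policy update, which I reduce here to a +1/-1 reward.

```python
# Minimal sketch of the solve-verify-reinforce cycle (hypothetical interfaces,
# not DeepSeek's pipeline). A real implementation would use an RL policy
# update; here the verifier's verdict is reduced to a +1 / -1 reward.
def exercise_step(llm, verifier, task):
    attempt = llm.generate(task.prompt)               # the students solve the task
    reward = 1.0 if verifier(task, attempt) else -1.0
    # Strengthen or weaken the weights behind this particular attempt.
    llm.reinforce(prompt=task.prompt, completion=attempt, reward=reward)
    return reward

def train_on_exercises(llm, verifier, tasks, epochs=1):
    for _ in range(epochs):
        for task in tasks:
            exercise_step(llm, verifier, task)
```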

In what follows, we will discuss multiple examples of self-generated data that were used to make DeepSeek-V3.2 perform at a much higher level than its predecessor.

Self-generated data to become a better researcher

If the students (LLM) want to get a job as a researcher, they need to become better at it. The learning coach (human) tells them how to create exercises for themselves that, once done, will make them score higher on researcher exams.

This is what the students (LLM) have to do:

  • Find niche information on the internet.
  • Create questions to which the answer is in the niche information they have found.
  • Ask many other students (LLM) to answer the question off the top of their head, without using the internet.
  • Throw away the question if at least one student (LLM) was capable of answering it correctly. This means the question does not require proper internet research to be solved and is therefore not a good exercise for getting better at internet research.

The resulting question-answer pairs are the self-generated data that the students (LLM) can then use to learn from.
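A rough sketch of this recipe in code is given below. All helpers (write_question_and_answer, answer_without_tools, the grader) are hypothetical names of mine, meant only to show the filtering logic: a question survives only if no student can answer it from memory.

```python
# Sketch of the filtering logic for the research exercises (hypothetical
# helpers, not DeepSeek's actual code). A question is kept only if no student
# can answer it closed-book, so solving it truly requires internet research.
def make_research_exercises(llm, grader, niche_pages, n_answerers=8):
    exercises = []
    for page in niche_pages:
        question, answer = llm.write_question_and_answer(page)
        closed_book = [llm.answer_without_tools(question) for _ in range(n_answerers)]
        if any(grader.is_correct(candidate, answer) for candidate in closed_book):
            continue  # answerable from memory: not a research exercise, discard
        exercises.append({"question": question, "answer": answer, "source": page.url})
    return exercises
```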

To improve DeepSeek, the students (LLM) created 50,275 of these question-answer pairs.

The asymmetry that makes self-improvement possible

Notice the asymmetry in difficulty between creating the exercises and actually solving the exercises.

Creating the exercises can be done by the students while they are still their dumber version. But when they actually try to solve the exercises, they fail a lot in the beginning, learn from their many mistakes and few successes, and as such become better at the job, ending up as better students (LLM) who score higher on the researcher exams.

Self-generated data to become a better software engineer

For a software engineering job, the students (LLM) likewise need to improve. The learning coach (human) tells them how to create exercises for themselves that will make them score higher on software engineering exams.

This is what the students (LLM) have to do:

  • Go to GitHub and examine the historical records of every open source repository.
  • Find issues reported by users of the software.
  • For each issue, find the patch that was implemented to solve the issue.
  • Run the unit tests on the broken version of the open source software.
  • Confirm that at least one test is failing (because of the issue), i.e. something is verifiably broken. If not, throw away this issue, because its fix cannot be verified.
  • Run the unit tests on the patched version of the open source software.
  • Confirm that at least one previously failing test now succeeds, i.e. something got verifiably better. If not, throw away this issue.
  • Confirm that no previously succeeding test now fails, i.e. the patch did not break something else. If not, throw away this issue.

The resulting code-issue-pairs are the self-generated data that the students (LLM) can then use to learn from.

The students (LLM) can now try to fix the issues with the software and check the correctness of their solution by running the unit tests. They will consider their answer correct if their fix decreases the number of failing tests without breaking a previously succeeding test.
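The sketch below puts both filters into code. The repository interface (run_tests, a failed set of test names, and so on) is a hypothetical abstraction of mine; it is only meant to make the keep-or-discard criteria and the grading criterion explicit.

```python
# Sketch of the test-based filter (hypothetical repo abstraction, not
# DeepSeek's tooling): an issue is kept only if the broken commit fails some
# test, the human patch fixes at least one failure, and nothing regresses.
def keep_issue(repo, issue):
    before = repo.run_tests(at=issue.broken_commit)    # .failed: set of test names
    after = repo.run_tests(at=issue.patched_commit)
    if not before.failed:
        return False                  # nothing verifiably broken
    if not (before.failed - after.failed):
        return False                  # patch did not verifiably fix anything
    if after.failed - before.failed:
        return False                  # patch broke previously passing tests
    return True

def verify_fix(repo, issue, candidate_patch):
    # Same criterion used later to grade the students' own fixes.
    baseline = repo.run_tests(at=issue.broken_commit)
    result = repo.run_tests(at=issue.broken_commit, with_patch=candidate_patch)
    return (len(result.failed) < len(baseline.failed)
            and not (result.failed - baseline.failed))
```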

To improve DeepSeek, the students (LLM) created 24,667 of these coding challenges.

Notice again the asymmetry:

  • easy (relatively): construct the data
  • easy (relatively): verify the results
  • hard: implement the actual fix

Although implementing the fix is clearly an order of magnitude harder than creating the data, it doesn't feel entirely correct to call the construction of the data "easy". It requires setting up and running programming environments that can be quite complicated.

We can get two insights from this:

First, that data construction is forgiving. If the students (LLM) fail at setting up an environment properly, they just throw it away. This is one of the reasons we can consider the task to be "easy".

Second, generating this data already requires a certain level of expertise from the students (LLM). The first generation of students (LLM) would not have been capable of doing this.

So we can expect that for the students (LLM) to keep on improving, they will create ever more challenging exercises for themselves that they couldn't have created, let alone solved, a generation ago.

Self-generated data to become a better generalist

To succeed at any job, the students (LLM) must also get better at many things beyond coding and research. The learning coach (human) tells them how to create exercises for themselves that will make them score higher on general task exams.

This is what the students (LLM) have to do:

  • For each category (e.g. planning a travel itinerary) retrieve relevant data from the internet and put it in a database.
  • Program tools that other students can use to interact with the data, e.g. get_all_attractions_by_city(city), get_city_transport(city), etc.
  • Create a (too) simple task.
  • Write a verification function in Python that checks whether a solution to the task is correct. Planning a trip that must comply with many constraints is hard, but writing a Python program that checks whether the constraints are met is much easier (a sketch of such a verifier follows a bit further below).
  • Write a solution function in Python that solves the task by only using the tools you created.
  • If the verification function does not validate the solution function, adjust both until it does.
  • Iteratively increase the difficulty of the task. Of course also adjust verification and solution functions accordingly. If the toolset is not sufficient for solving the more difficult task, program the missing tools.

The resulting environment-tools-task-verifier tuples are the self-generated data the students (LLM) can use to learn from. A task is considered solved if the verifier says so.

Notice that the solution function is then discarded, as finding that solution (or an alternative one) is exactly the exercise the students (LLM) need to do to become more skilful. It was, however, a crucial component for creating a correct verifier for each task.
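Here is what such a generated verifier might look like for the travel-planning example. The constraint names and any tool methods beyond get_all_attractions_by_city and get_city_transport are illustrative assumptions of mine, not from the paper; the reference solution function is used solely to sanity-check this verifier during construction and is then thrown away.

```python
# Illustrative verifier for a travel-itinerary task (constraint names and
# fields are hypothetical). Checking an itinerary against constraints is far
# easier than producing one, which is exactly the asymmetry the recipe exploits.
def verify_itinerary(itinerary, constraints, tools):
    # total cost must stay within budget
    if sum(stop.price for stop in itinerary) > constraints.budget:
        return False
    # enough distinct cities must be visited
    if len({stop.city for stop in itinerary}) < constraints.min_cities:
        return False
    # every attraction must actually exist in the city it is scheduled in
    for stop in itinerary:
        if stop.attraction not in tools.get_all_attractions_by_city(stop.city):
            return False
    # consecutive cities must be connected by some transport option
    for a, b in zip(itinerary, itinerary[1:]):
        connections = tools.get_city_transport(a.city)
        if not any(c.destination == b.city for c in connections):
            return False
    return True
```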

When doing the exercises to improve themselves, if the students (LLM) can't solve a task after 100 attempts, they don't treat it as a failure to learn from, but simply discard the task as "too hard to learn from".

To improve DeepSeek, the students (LLM) generated 4,417 such tasks across 1,827 distinct environments.

I would argue this approach adds another level of sophistication in achieving the asymmetry required for letting "dumber" students (LLM) create exercises that will make them smarter after training on them.

First, creating the task, solution and verifier at the same time feels like a more sophisticated version of first finding the answer and only afterwards coming up with the question, as was done for creating the researcher exercises.

Second, creating sufficiently difficult tasks while making sure they are correctly verified is done by starting with a task that is too easy and iteratively increasing the difficulty. The student (LLM) would not be able to create the difficult task from scratch, but can get there in the end by raising the difficulty gradually.

This shows that when students (LLM) increase their intelligence step by step, the size of each step is limited, and the exercises required to take the next step become ever more complex.

Intelligence explosion?

What light does all of this shine on an intelligence explosion, the idea that LLMs will start improving themselves indefinitely and uncontrollably?

An explosion requires two things: self-sustainability and acceleration.

First, self-sustainability. An explosion implies improvement without needing a human to continue. We started with humans taking action to create a better LLM based on human-generated data. We are now at the stage where a human takes action to create a better LLM based on LLM-generated data.

An "explosion" would require the LLM to take the action to create a better LLM itself. A scenario in which this would be possible is when an LLM solving a task decides it is not capable of solving the task, and first needs to create a smarter LLM to forward the task to.

Second, acceleration. From our discussion, we can learn that improvement steps are inherently small, so they would need to happen very fast. What determines their speed? I would argue it's (1) the number of tokens required for the next learning step, and (2) the speed at which LLMs can generate tokens.

From our discussion it also becomes clear that more difficult exercises contain more tokens, require more tokens to be created, and require more tokens to solve. This is a decelerating force that puts a natural brake on the speed of improvement.
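To make this decelerating force concrete, here is a toy back-of-the-envelope model with made-up numbers (purely illustrative, not derived from the paper): if each improvement step requires a few times more tokens than the previous one while generation speed stays fixed, the time per step grows geometrically instead of shrinking.

```python
# Toy model with made-up numbers: tokens needed per improvement step grow by a
# constant factor while generation speed stays fixed, so each step takes longer.
tokens_per_second = 1e6      # hypothetical aggregate generation speed
tokens_for_step = 1e12       # hypothetical token cost of the first learning step
growth_per_step = 3.0        # assume each step's exercises are ~3x more expensive

for step in range(1, 6):
    days = tokens_for_step / tokens_per_second / 86_400
    print(f"step {step}: ~{days:,.0f} days of token generation")
    tokens_for_step *= growth_per_step
```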

It's like stretching a spring: the further you stretch it, the harder you have to pull, making each step harder than the last.

So if an LLM wants to create an LLM that is much smarter, it might first want to create a faster LLM, such that the learning process goes faster.

And at some point, a limit will be hit in terms of hardware and energy. A model that wants to create a smarter successor might first try to make a more efficient version. Or it might be motivated to engage in the economy to secure the hardware and energy required to achieve its goal.

Why didn't human intelligence explode?

In this context, it is interesting to ask: why did the intelligence of humans not "explode"?

We do not see human intelligence growth as an explosion because the time it takes us to find exercises that make us smarter, and then actually solve them, is long compared to a human lifetime.

This decelerating force is not to be underestimated. And although humanity is to some extent motivated to increase its intelligence, it is also motivated to do many other things and by no means devotes maximal resources to this goal.

In summary

I became convinced that LLMs can create synthetic data to train smarter versions of themselves.

Although the possibility of an intelligence explosion is not to be discarded, there are strong decelerating forces working against it.

I hope this discussion prompts you to engage, so that I can learn and refine my views.