Synthetic Data and LLMs

Synthetic data has become a hot topic in generative AI, where it has come to refer to data generated by one LLM that is then used to train another. This is often done because a smaller, more specialized model is faster and less expensive to operate than the large foundation model that generated the data.

For instance, you may be trying to classify short texts such as tweets but lack enough hand-labeled examples of particular classes, with no easy or inexpensive way to collect more. In that case you may want to use an LLM to generate enough examples of each class of interest.
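As a rough illustration, here's a minimal sketch of that workflow using the OpenAI Python client (any LLM API would work); the model name, label set, and prompt wording are my own illustrative assumptions, not a recipe:

```python
# Sketch: generating synthetic labeled tweets with an LLM.
# Assumes the OpenAI Python client; the model name, labels,
# and prompt wording are illustrative, not prescriptive.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

LABELS = ["customer_satisfaction", "product_improvement", "complaint"]

def generate_examples(label: str, n: int = 20) -> list[str]:
    prompt = (
        f"Write {n} short, realistic tweets a customer might post, each "
        f"expressing '{label.replace('_', ' ')}'. Vary the tone and "
        "vocabulary. Return one tweet per line with no numbering."
    )
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # assumption: any capable teacher model works
        messages=[{"role": "user", "content": prompt}],
    )
    text = response.choices[0].message.content
    return [line.strip() for line in text.splitlines() if line.strip()]

# Build a synthetic training set of (text, label) pairs for each class.
synthetic = [(text, label) for label in LABELS
             for text in generate_examples(label)]
```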

This seems like a great idea, but there are two general groups of concerns. One is the "faithfulness" of the generated data, since it may not have the same distribution as real-world data. That is, the tone and vocabulary of the LLM may not match the tone and vocabulary actually used in real-world text (which is also constantly changing). This will hurt the model's performance today and limit its ability to handle tone and vocabulary shifts in the future.

The other is the limit on available knowledge. That is, if you train on what you already know (in the large model), you'll never learn anything new. There are probably better names, but let's call these two issues faithfulness and the knowledge cap for now.

In computer vision we've practiced image augmentation, where images are randomly manipulated during training, for many years. The manipulations, such as shifting, flipping, or adjusting color and contrast, generate additional training data without changing the basic content. A dark image of a cat looking to the left and a bright version of that image with the cat flipped to look to the right are both still a cat.
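A minimal sketch of what that looks like in practice with torchvision; the particular transforms and parameter values are arbitrary illustrations, not a recommended recipe:

```python
# Sketch: classic image augmentation with torchvision. The specific
# transforms and parameter values here are arbitrary illustrations.
from torchvision import transforms

augment = transforms.Compose([
    transforms.RandomHorizontalFlip(p=0.5),                    # cat looks left or right
    transforms.RandomAffine(degrees=0, translate=(0.1, 0.1)),  # small shifts
    transforms.ColorJitter(brightness=0.4, contrast=0.4),      # dark vs. bright
    transforms.ToTensor(),
])
# Applied per sample during training, so every epoch sees a slightly
# different version of each image -- and it's still a cat.
```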

We've also used physics and game engines to generate images and video that would otherwise be difficult or expensive to capture, such as driving scenes from all over the world in various weather, lighting, and traffic conditions.

In ML there is also the concept of self-play, where a model improves by playing against itself over and over again. Self-play is famously one of the techniques that led to AlphaGo's success.

The use of synthetic data in LLMs described above is a little different from these techniques. The current use is closer to the concept of distillation, where the outputs or predictions of a larger teacher model are used to train a smaller student model. And this seems like a very feasible way to get a specialized smaller model.
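For the curious, the standard distillation loss (Hinton et al., 2015) trains the student to match the teacher's temperature-softened output distribution alongside the true labels. A minimal PyTorch sketch, with typical but arbitrary values for the temperature and weighting:

```python
# Sketch: the standard knowledge-distillation loss, where the student
# is trained to match the teacher's temperature-softened distribution
# in addition to the true labels. Temperature and alpha are typical
# but arbitrary choices.
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels,
                      temperature=2.0, alpha=0.5):
    # Soft targets: KL divergence to the teacher's softened distribution.
    soft = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * temperature ** 2  # standard scale correction for the soft term

    # Hard targets: ordinary cross-entropy on the true labels.
    hard = F.cross_entropy(student_logits, labels)

    return alpha * soft + (1 - alpha) * hard
```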

I'm also starting to see that synthetic data in LLMs can take distillation a step further.

Rather than generating a static training set, you can combine the ideas of distillation with augmentation and active learning and generate synthetic training data on the fly. That is, you may be able to train the smaller model continually in rounds (or epochs), where the training data for each round is adjusted with data generated specifically to address shortcomings in the student model's performance in the previous round.

So in our example, if the student model is doing well with texts that express customer satisfaction but not as well with texts that discuss potential product improvements, the larger model could be used to generate more of that kind of data to learn from.
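Putting the pieces together, a round-based loop might look something like the sketch below. Here generate_examples() is the hypothetical helper from the first sketch, and train_student() and student.predict() stand in for whatever training and evaluation code you already have:

```python
# Sketch: round-based training where each round adds synthetic data
# for whichever class the student currently handles worst.
# generate_examples() is the hypothetical helper from the first sketch;
# train_student() and student.predict() stand in for your own
# training and evaluation code.
from sklearn.metrics import f1_score

def targeted_rounds(student, train_set, val_texts, val_labels,
                    labels, rounds=5, per_round=50):
    for _ in range(rounds):
        student = train_student(student, train_set)   # your training loop
        preds = student.predict(val_texts)            # your inference
        scores = f1_score(val_labels, preds, labels=labels, average=None)

        # Ask the teacher for more data on the weakest class this round.
        weakest = labels[scores.argmin()]
        train_set += [(text, weakest)
                      for text in generate_examples(weakest, n=per_round)]
    return student
```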

This approach still has the concerns mentioned above, but combined with clever prompt engineering it should go a long way toward extracting even more of the available knowledge from the larger model.

However, there is also the theory that if LLMs actually do or can achieve true reasoning abilities, they can start to generate novel information. That truly new information could then be used to push an LLM's abilities (student or teacher) past the knowledge cap mentioned above. The ability to create new information, combined with self-play, may be the key to unlocking much more powerful models.

I'm skeptical we can do that with just LLMs ... but LLMs plus reasoning engines, symbolic methods, etc. may be another story ... we'll just have to wait and see.
