Synthetic Data in 2025: Revolutionizing GenAI Model Performance

Valtteri Karesto · February 20, 2025
How Synthetic Data is Powering the Next Generation of Efficient, Specialized AI Models

Introduction

Synthetic data generation has advanced significantly, with larger language models now being leveraged for knowledge distillation to produce high-quality training examples. This enables the creation of domain-specific datasets that can be used to finetune smaller or specialised models while preserving accuracy, lowering computational costs, and reducing the amount of real data required.

For instance, this approach enables efficient finetuning of models like GPT-4o or Llama for specific use cases with just a few synthetic examples, or training specialized ModernBERT models for classification tasks within your agentic workflow, achieving better performance on downstream tasks while requiring only a fraction of the computing resources. This is particularly valuable for high-throughput, latency-sensitive applications such as LLM routing or real-time classification, and especially in agentic workflows where multiple models run in sequence: each round of model inference compounds the efficiency benefits and reduces overall latency.


Early Days of Data Generation Practices: Naive LLM API Querying

Model Collapse

Generated samples often lacked diversity, with models falling into repetitive patterns even when prompted differently. This similarity between samples limited the training value of the synthetic datasets.

Structured Output Issues

Without proper constraints, output formatting was inconsistent across samples, making dataset integration difficult. What worked in one generation would fail in another, creating a constant need for format verification and correction.

Reliability Problems

Large-scale generation jobs were particularly vulnerable to:

  • API timeouts and rate limiting
  • Network connection interruptions
  • Inconsistent response formats
  • Incomplete generations

These issues often meant paying multiple times for the same data as jobs failed mid-way through hundreds of generations.


The Path to Maturity: From Quick Fixes to Robust Tools

Custom Scripts Evolution

Teams developed increasingly sophisticated solutions:

  • Caching mechanisms to prevent redundant API calls and preserve successful generations
  • Smart prompting systems that:
    • Track and avoid duplicate examples
    • Break down complex queries into manageable chunks
    • Enforce parameter ranges for greater control
  • Post-processing pipelines for consistent data cleaning and validation
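The caching and retry ideas above can be sketched in a few lines. This is a minimal illustration, not any team's actual implementation: `generate` stands in for whatever function calls the LLM API, and the cache is a plain dict here, though a file or database would let successful generations survive across runs.

```python
import hashlib
import json
import time


def cache_key(prompt: str, params: dict) -> str:
    """Deterministic key so identical requests are never paid for twice."""
    payload = json.dumps({"prompt": prompt, "params": params}, sort_keys=True)
    return hashlib.sha256(payload.encode()).hexdigest()


def generate_with_retry(generate, prompt, params, cache, max_retries=3):
    """Call `generate` with caching and exponential-backoff retries.

    `generate` is any callable that hits an LLM API; `cache` is any
    dict-like store mapping cache keys to completed generations.
    """
    key = cache_key(prompt, params)
    if key in cache:  # preserve successful generations, skip redundant calls
        return cache[key]
    for attempt in range(max_retries):
        try:
            result = generate(prompt, **params)
            cache[key] = result
            return result
        except Exception:
            if attempt == max_retries - 1:
                raise
            time.sleep(2 ** attempt)  # back off on timeouts / rate limits
```

A failed job rerun against the same cache then only pays for the generations that never completed.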

Emergence of Specialized Libraries

Tools like Instructor (python.useinstructor.com) emerged to provide:

  • Structured data generation
  • Type validation
  • Automated error handling
  • Built-in best practices for synthetic data creation

Modern Tooling Platforms

Advanced platforms like Curator by Bespoke Labs and Distilabel by Argilla now offer:

  • End-to-end data generation pipelines
  • Quality control mechanisms
  • Scalable infrastructure
  • Integrated monitoring and validation

Building Synthetic Datasets: A Practical Example

We believe that intent recognition is a fundamental component of smart, AI-powered user experiences. That’s why we’re showcasing its potential through synthetic data generation. You can explore a minimal, straightforward example on Google Colab, which demonstrates how synthetic data can be generated for intent recognition.

Tools and libraries make things easier by abstracting away parts of the execution, but at the same time, they can make it harder to understand what happens behind the scenes. For that reason, we wanted to create the simplest and cleanest example possible, making it accessible even to those with less experience in the synthetic data space.

In a nutshell, in this code:

  1. Install the dependencies: instructor for structured generation, openai as the client for generation, and a few other helpers
  2. Define the intents (labels) we want to create queries for
  3. Use a Pydantic class to define our output structure
  4. Create a generate_queries function that takes an intent (str) and outputs 20 queries for it, and a generate_dataset function that takes a list of intents, generates queries for each, and outputs a list of dicts containing the query text and some metadata
  5. Turn the raw generated dataset into a Hugging Face Dataset and push it to the Hugging Face Hub

The output dataset can be found here.

We challenge you to replicate this using Curator or Distilabel and let us know which one you preferred and why.


Looking Forward: The Promise of Small Language Models

A clear trend is emerging: the shift toward smaller, more efficient, specialized models. We believe the future lies in using LLM-generated data to train smaller, focused models that can be deployed efficiently at scale (or even locally).

In our next article, we'll explore how organizations are successfully replacing large, general-purpose models with smaller, specialized ones trained on synthetic data. We'll examine real-world cases where this approach has led to significant improvements in both performance and operational efficiency.

Have you experimented with synthetic data or small language models in your work? We'd love to hear about your experiences. Share your insights in the comments below or join our community discussion about the future of AI model development.

Appendix: Can I Use Commercial Models for Synthetic Data?

In short, if you’re using a commercial LLM like Claude, be aware that while you generally own the text it generates, the Terms of Service often prohibit using that output to build a competing language model. Please consult your company's legal team to avoid potential pitfalls. It’s also worth considering an open-source LLM with a more permissive license that explicitly allows you to reuse outputs for training. It’s always better to play by the rules than to find yourself on the wrong side of a TOS.


Further Reading

For a deeper dive into these topics, check out these excellent resources:

  1. "Fine-tune classifier with ModernBERT in 2025" - Comprehensive guide on utilizing ModernBERT for classification tasks
  2. "Finally, a replacement for BERT: Introducing ModernBERT" - Simon Willison's detailed exploration of ModernBERT
  3. "Fine-tune ModernBERT for RAG with Synthetic Data" - HuggingFace guide on combining RAG and synthetic data
  4. "Synthetic data & Smol models in 2024" - Comprehensive presentation on the evolution of synthetic data and small models
  5. "Synthetic Data Generation with Instructor" - Tutorial on using Instructor library for structured synthetic data generation
  6. "Distilabel: Advanced Data Generation Guide" - Detailed guide on scalable data generation with Distilabel
