Designing the Ideal Synthetic Data Generation Pipeline for LLMs
In this post, I describe what an ideal system for generating synthetic data with and for Large Language Models would look like.
First, I go through the API by developing a simple but common scenario of generating synthetic data for LLM fine tuning.
Then I elaborate on why I believe that this approach is the right one for building realistic synthetic data generation pipelines in production.
The Arguments
Robust, maintainable, expressive and composable pipelines are critical for scaling synthetic data generation for LLMs.
- Good abstractions let users focus on the hard problems (prompt design, QA quality) instead of wrestling with boilerplate.
- Avoid ad-hoc and manual scripts for anything production-like because these projects are highly iterative by nature.
- Leverage dataframe APIs, structured document representations (Markdown), and clear separation of concerns.
The use case
A common scenario in synthetic data generation for fine tuning Large Language Models is to create pairs of questions and answers based on some data.
The idea is that you generate these QAs using a frontier model and then use the output to fine tune a smaller model to make it match the quality of the bigger model while offering faster and cheaper inference.
As a concrete example, we will consider the case of fine tuning a model to answer questions about corporate reports filed with the SEC.
We want to get a few reports, use a strong model to generate questions and answers, and then store the results in a format that can be used for fine tuning.
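To make the target format concrete, here is a small sketch of the kind of output we are aiming for, assuming an OpenAI-style chat fine-tuning JSONL format (the exact schema depends on the fine-tuning stack you use):
import json

# Sketch: one chat-style training example per QA pair (OpenAI-style JSONL assumed).
qa_pairs = [
    {"question": "What was the gross profit for the year ended December 31, 2022?",
     "answer": "The gross profit for the year ended December 31, 2022, was $111.6 million."},
]

with open("qa_finetune.jsonl", "w") as f:
    for qa in qa_pairs:
        example = {
            "messages": [
                {"role": "user", "content": qa["question"]},
                {"role": "assistant", "content": qa["answer"]},
            ]
        }
        f.write(json.dumps(example) + "\n")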
The Data
Although we are creating synthetic data, we still need to start from some source data to generate the questions and answers from, and ideally you want this data to be representative of what your model will actually be asked questions about.
In our case this is easy because the SEC filings are publicly available. The documents are in PDF form, so the first step is to OCR these PDFs and turn them into something we can work with.
Ideally, we want to turn the PDF into some kind of representation that preserves the structure of the document too. So if there are sections and sub-sections we want to be able to work with them.
Turning the PDF into a Markdown representation would be ideal for our case and the good news is that there are many OCR models and tools out there that can do that.
Just for simplicity, I used Mistral's OCR model to turn one document into markdown. Here is a small sample of what the output looks like.
FORM 10-K

(270) 782-2900
(Registrant's telephone number, including area code)
(Former name, former address and former fiscal year, if changed since last report) N/A
Securities registered pursuant to Section 12(b) of the Act:
| Title of each class | Trading <br> symbol(s) | Name of each exchange on which registered |
| :--: | :--: | :--: |
| Common Stock, par value $\$ 0.0001$ | HLLY | New York Stock Exchange |
| Warrants to Purchase Common Stock | HLLY WS | New York Stock Exchange |
As you can see, the generated markdown is pretty rich: it includes placeholders for images that were identified and tables that have been extracted.
But the data does not come out perfect or in the best possible shape to work with. For example, notice the following:
...Indicate by check mark whether any of those error corrections are restatements
that required a recovery analysis of incentive-based compensation received
by any of the registrant's executive officers
during the relevant recovery period pursuant to $\S 240.10 \mathrm{D}-1(\mathrm{~b})$. $\square$
...
There's some LaTeX syntax used there, and for a good reason. The model detected mathematical notation that cannot be represented in plain characters and, instead of losing this information, it used LaTeX syntax to represent exactly what was found on the PDF page.
But this piece of information, although useful for recreating the original document, might not be useful for what we plan to do with it.
The reason I'm raising this issue is that in practice, no matter how good our models get at OCR, there will always be a need to post-process the results before we feed them into an LLM for a specific task.
You can of course rely on ad-hoc Python code to perform this kind of cleaning and wrangling, but in anything close to a production pipeline, where data drifts, models change and edge cases arise every day, ad-hoc code will quickly turn your pipeline into something hard to manage and painful to operate.
Post Processing the Data
Continuing from the previous section, a few more comments on the data we got.
- The OCR model (or service) breaks the document down into pages, so sections or other markdown elements can end up split across pages.
- The document has a rich structure that might be useful to exploit for generating better quality QA pairs.
- There are characters and artifacts, like the LaTeX syntax mentioned earlier, that might not make sense to include in our processing, so we would like to clean them up.
- The document is big, so we will have to chunk it to process it. How do we do that efficiently while maximizing the quality of the QA pairs we generate?
Based on all the above, what I'd like to do with the data is the following:
- Instead of breaking down the document into pages, break it down into sections.
- Chunk the document keeping the integrity of the sections.
- Clean up the data from any artifacts we don't need.
- Maintain a clear connection between the chunks and the original document.
After we are done with the above, we can move on to the actual data generation. Let's start doing this!
Data Clean up
I want to try and perform the following:
- Remove any HTML tags.
- Replace checkbox-like symbols with clear boolean representation.
- Strip embedded images.
I'll be using a dataframe API that will feel very familiar to anyone who has worked with PySpark before.
The first thing I want to do is to load the data into a dataframe.
raw_data = session.ingest_from_directory("path_to_my_md_files").with_column("document", lit(1))
raw_data.show()
And we should get back something looking like the following output:
┌──────────┬───────────┬─────────┐
│ document ┆ filename  ┆ content │
│ 1        ┆ page_1.md ┆ PART I  │
│ ...      ┆ ...       ┆ ...     │
We have a unique id for the whole document, a filename and its content. The problem, though, is that the page ordering is encoded only in the filename string, so we have to take care of this if we want to make sure the document is ordered correctly.
The code below processes the filename column and adds a column of type INTEGER that contains the page number. With that, we can sort the data and make sure we have everything in the right order!
raw_data.with_column("page",text.split_part(text.split_part(col("filename"), ".", 1), "_", 2).cast(PrimitiveType.INTEGER)).sort("page")
Now that we have taken care of importing the data, let's move on and start cleaning it.
Removing HTML tags
First thing we want to do is to remove the HTML tags.
The code below will replace all the HTML tags with an empty string.
raw_data.with_column("content", text.regexp_replace(col("content"), r"<[^>]*>", ""))
Removing the embedded images
Next, we want to strip embedded images.
Similarly with what we did previously, we can use the following code.
raw_data.with_column("content", text.regexp_replace(col("content"), r"!\[[^\]]*\]\([^)]*\)", ""))
Replacing checkbox like symbols
This is an interesting one, because we can't just remove the data without losing some important information. The OCR output encodes, as markdown and text, a choice the author of the document made, so ideally we want to preserve that choice.
Here's how this can be done.
raw_data.with_column("content",
text.regexp_replace(
text.regexp_replace(col("content"), r"Yes \$\\square\$ No \$\\boxtimes\$", "No"),
r"Yes \$\\boxtimes\$ No \$\\square\$", "Yes"
)
)
Now, where we had Yes \$\\square\$ No \$\\boxtimes\$ we will get No, and where we had Yes \$\\boxtimes\$ No \$\\square\$ we will get Yes, preserving the author's choice while removing the odd symbols.
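If you want to sanity-check these substitutions outside the pipeline, the same three cleanup steps are easy to reproduce with plain Python regexes on a small sample string:
import re

# The same three cleanup steps, applied to a small sample of OCR-like text.
sample = '<br> ![img-0.jpeg](img-0.jpeg) Yes $\\square$ No $\\boxtimes$'

sample = re.sub(r"<[^>]*>", "", sample)               # strip HTML tags
sample = re.sub(r"!\[[^\]]*\]\([^)]*\)", "", sample)  # strip embedded images
sample = re.sub(r"Yes \$\\square\$ No \$\\boxtimes\$", "No", sample)
sample = re.sub(r"Yes \$\\boxtimes\$ No \$\\square\$", "Yes", sample)

print(sample.strip())  # -> No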
There are many more things we can do to clean the data but I'll stop here as we will be mainly repeating the above patterns again and again.
The point I want to make though is that having a lazy, composable and expressive dataframe API to work with your data is a must when you want to build a pipeline like this.
The fact that we are using LLMs at some point makes this traditional data cleaning even more important than before. It's not just about the cost of the useless tokens; it's mainly about the noise we add to the LLM's input and how that affects its results in the end.
Now that we have figured out the clean up phase, let's move on to chunking the data appropriately.
Data Chunking
Here we are working with a large document. Even if the model is capable of handling the whole document at once, asking it to generate questions while considering the whole document in its context window will be chaotic at best.
Even if the model generates decent questions and answers, it will be impossible to associate them with specific parts of the document in a consistent way, making any kind of performance evaluation very hard to do properly.
The data we have comes pre-chunked because of how the OCR model works. We have it broken down into pages.
The naive and easiest approach is to just send a page to the model and ask it to generate QA pairs for just that content. If we want to help the model a bit more, we can first generate a summary of the whole document and add the summary to each per-page request too.
This approach can work, and it's definitely within the context window limits of modern models, but the problem is that the page boundary can break apart content that should stay together. For example, a long section can span many pages, and we might even end up with a page that contains just a tiny tail of a section simply because it falls at the end.
It would be amazing if we could chunk our document the way we want, ensuring that the information that should be together would remain together when we send it to the large language model.
How can we do that?
Markdown documents as first class citizens
Cloud services and SaaS platforms turned JSON into a first class citizen of databases and data platforms. Although not as strict as the relational model, the industry made it important for JSON documents to be treated as a datatype of their own.
Similarly, with large language models, Markdown is turning into a new important data type: less strict than JSON, but it still has structure and can be represented as an AST.
Being able to work with markdown with the same ease that you can with JSON and relations, would be great.
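To ground the idea before sketching the API, here is a minimal plain-Python illustration of treating markdown structurally, splitting a document into heading-delimited sections. A real implementation would build a full AST (nested headings, tables, lists), but the intuition is the same.
import re

def split_markdown_sections(md: str) -> list[dict]:
    # Minimal sketch: split a markdown string into top-level sections,
    # keeping each heading title together with the text under it.
    sections = []
    current = {"title": None, "content": []}
    for line in md.splitlines():
        heading = re.match(r"^#\s+(.*)", line)
        if heading:
            if current["title"] is not None or current["content"]:
                sections.append(current)
            current = {"title": heading.group(1).strip(), "content": []}
        else:
            current["content"].append(line)
    sections.append(current)
    return sections

doc = "# Gross Profit and Gross Margin\nGross profit...\n# Another Section\nMore text..."
print([s["title"] for s in split_markdown_sections(doc)])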
Here's how I see this working.
whole_doc = raw_data.group_by("document").agg(collect_list(col("content")).alias("content"))
whole_doc = whole_doc.with_column("content",text.array_join("content", "\n"))
schema = whole_doc.select(text.md_structure("content").alias("md_structure")).collect()[0]["md_structure"][0]
whole_doc.with_column("md", text.md_transform("content", schema))
If you are familiar with how DuckDB handles the JSON data type, what we are doing here might feel familiar. The above code does the following:
- Merges all the pages together into one string representation of the document.
- It extracts the structure or schema of the markdown document that is stored as a string.
- It uses this schema to transform the string representation of the document into a hierarchical column with Structs and Lists where we preserve the structural information of the markdown document.
With that in place now we can do some fun stuff like the following:
structured_doc.explode("md").with_column("section", col("md").get_item("section")).with_column("section_title", col("section").get_item("title")).show()
Or even better do something like the following:
structured_doc.explode("md").unnest("md")
And get the same results but with less code and less control over how columns are named.
Or even better, select the exact parts of the markdown document we want even before we turn it into Struct types.
Let's assume we want only the data of the Gross Profit and Gross Margin section.
The code below does exactly that: it returns only the matching section, assuming it exists.
partial_doc = whole_doc.select(text.md_extract("content", "# Gross Profit and Gross Margin").alias("extracted"))
Now we can repeat the same process to turn this into structured data.
partial_doc_schema = partial_doc.select(text.md_structure("extracted").alias("md_structure")).collect()[0]["md_structure"][0]
partial_doc_structured = partial_doc.with_column("md", text.md_transform("extracted", partial_doc_schema))
And finally do something like:
partial_doc_structured.explode("md").unnest("md")
to get the following result:
┌───────────────────────────────┬─────────────────┐
│ section_title                 ┆ section_content │
│ Gross Profit and Gross Margin ┆ Gross...        │
│ ...                           ┆ ...             │
There's definitely room for improvement in the API for working with this new data type, but the functionality is all there. We can manipulate the raw data to bring it into the structure we want, we can extract its structure, and then we can turn it into a well structured column that maintains all the structure we need.
Now we have enough to move to invoking the LLM and start generating data!
Synthetic QA Data Generation
Now that we have our data cleaned and broken down into sections, the next step is to ask the LLM to generate QA pairs for each section.
Summarization
But first, we want to create a summary for the whole document. As the document is quite big, we want to be a bit smarter about how we generate the summary. The good thing is that we have the document broken down into pages and sections.
There are many different strategies we can follow here.
First, we can create a summary of each section and then a summary of all the summaries to get one for the whole document.
This is great if we care about the summary of each section and we want to keep that too.
If we don't need that, then we can just generate a summary for each page and then summarize them together for a whole document summary.
In both cases, the way we perform the task is similar.
First, if we want to use the extracted sections and their structure.
partial_doc_structured.group_by("document").agg(semantic.reduce("summarize the document with sections titled {section_title} and paragraphs {section}").alias("summary"))
Then, if we want to do it directly using the pages we have:
raw_data.group_by("document").agg(semantic.reduce("summarize the document with pages ({content})").alias("summary")).show()
The beauty of this API is that the syntax stays the same even when the data changes, which is a good thing because it lets us quickly figure out which approach and which data give the best possible performance from our models.
Data Generation
With the summary in place and the data in good shape, generating the synthetic data should be just 2-3 lines away.
We will be asking the language model to generate pairs of questions and answers for each section, and we will also pass the summary of the whole document with each request to expand the context the model has.
Here's the prompt we will be using:
Create 10 question-answer pairs from this text for LLM training.
The text is a section from a bigger document with the following summary {summary}
Rules:
1. Each question must be about an important fact in the text.
2. Each answer must be directly supported by the text.
3. Format each question-answer pair clearly as follows, one pair per line, separated by a newline:
Question: <your question here> | Answer: <your answer here>
Question: <your question here> | Answer: <your answer here>
...
Section:
title: {section_title}
paragraphs: {section}
And here's the code we need to generate the synthetic data (final_df here is the sections dataframe joined with the document summary, as shown in the full script at the end):
prompt = (
"Create 10 question-answer pairs from this text for LLM training.\n"
"The text is a section from a bigger document with the following summary:\n"
"{summary}\n\n"
"Rules:\n\n"
"1. Each question must be about an important fact in the text.\n"
"2. Each answer must be directly supported by the text.\n"
"3. Format each question-answer pair clearly as follows, one pair per line, separated by a newline:\n\n"
"Question: <your question here> | Answer: <your answer here>\n"
"Question: <your question here> | Answer: <your answer here>\n"
"...\n\n"
"Section:\n"
"title: {section_title}\n"
"paragraphs: {section}"
)
generated_df = final_df.with_column("qas", semantic.map(prompt)).cache()
As you can see, the actual generation code is just one line, and this is important because we need to minimize the time we spend on boilerplate and maximize the time we spend refining our prompts.
This is an important characteristic of the API we are considering here, and one that affects its overall design, as can be seen in the code we have written so far.
Now that we have our results back, we can turn the response into a dataframe that we can easily work with, just as we did in the previous steps.
qa_template ="Question: ${question:none} | Answer: ${answer:none}"
generated_df.with_column("qas", text.split(col("qas"), "\n")).explode("qas").with_column("qas", text.extract(qa_template, col("qas"))).with_column("question", col("qas").get_item("question")).with_column("answer", col("qas").get_item("answer")).select("*")
And with that, we will get a result like the following:
| Document | Section Title | Question | Answer |
| --- | --- | --- | --- |
| 1 | Gross Profit and Gross Margin | What was the gross profit for Holley Inc. for the year ended December 31, 2022? | The gross profit for the year ended December 31, 2022, was $111.6 million. |
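For intuition, the template extraction above does roughly what this plain-Python parsing of the model's response would do:
import re

# Plain-Python sketch of parsing the "Question: ... | Answer: ..." lines.
response = (
    "Question: What was the gross profit for the year ended December 31, 2022? "
    "| Answer: The gross profit was $111.6 million."
)

pairs = []
for line in response.splitlines():
    match = re.match(r"Question:\s*(.*?)\s*\|\s*Answer:\s*(.*)", line)
    if match:
        pairs.append({"question": match.group(1), "answer": match.group(2)})

print(pairs[0]["question"])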
And this is it. Now we have a well structured and composable pipeline that can generate pairs of questions and answers from documents as a synthetic data set.
What else is needed
In summary, successful LLM driven projects inherently rely on rapid iteration and experimentation.
The API design outlined in this post, emphasizing composability, expressiveness, structured markdown handling, and concise data operations, significantly reduces friction and accelerates iteration cycles.
This allows teams to efficiently scale high-quality synthetic data generation pipelines, ensuring consistent model performance in production environments.
Looking forward, there are several enhancements that would further enrich this proposed API; here are some of them.
It's important to have some kind of session concept where specific metadata, e.g. which models to use and which limits to observe, is defined.
By doing this, it will also be easy to define multiple models that can be used within a session, or to switch between sessions, depending on the pattern the user prefers.
This is important because these pipelines tend to require different models at different stages. For example, you might generate your data using a smaller, faster model and then use one with reasoning capabilities for the curation and quality stage; later, you might go back to the QA pairs that didn't make it, or to a chapter that isn't represented by enough questions, and use a larger model to regenerate the QA pairs.
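Purely as a hypothetical sketch of what session-level configuration might capture (none of these keys or constructors exist today):
# Hypothetical sketch: metadata a session could hold. Nothing here is a real API.
session_config = {
    "default_model": "small-fast-model",        # bulk QA generation
    "models": {
        "curation": "reasoning-model",          # QA quality / curation passes
        "regeneration": "large-model",          # re-generate under-covered sections
    },
    "limits": {
        "max_requests_per_minute": 100,
        "max_output_tokens": 2048,
    },
}

# session = Session(config=session_config)  # hypothetical constructor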
It's also important to have some kind of lineage concept easily accessible in the API, ideally row-based lineage. When you get QA pairs that fail the curation and quality stage, you need to be able to trace how each QA pair was generated, including all the data, operators and prompts involved.
This can potentially become a very powerful tool for iterating efficiently over your dataset until you get to a point where the quality is good enough for you to train or fine tune your model.
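Again purely as a sketch of the developer experience this could enable (hypothetical API, shown as comments because nothing like it exists yet):
# Hypothetical row-level lineage: trace how a rejected QA pair was produced.
# Assumes a curation step added a quality_score column to qa_df.
#
# failed = qa_df.filter(col("quality_score") < 0.5)
# for row in failed.collect():
#     trace = failed.lineage(row)   # upstream rows, operators and prompts for this row
#     print(trace.source_section, trace.prompt, trace.model)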
Bonus Chapter
Putting everything into one script.
# Ingest OCR'd markdown files as raw data (page-level)
raw_df = session.ingest_from_directory("path_to_my_md_files").with_column("document", lit(1))
# Add page numbers extracted from filenames for correct ordering
raw_df = (
raw_df
.with_column(
"page",
text.split_part(
text.split_part(col("filename"), ".", 1), "_", 2
).cast(PrimitiveType.INTEGER)
)
.sort("page")
)
# Clean HTML tags from content
raw_df = raw_df.with_column(
"content",
text.regexp_replace(col("content"), r"<[^>]*>", "")
)
# Remove embedded images
raw_df = raw_df.with_column(
"content",
text.regexp_replace(col("content"), r"!\[[^\]]*\]\([^)]*\)", "")
)
# Replace checkbox-like symbols with 'Yes'/'No'
raw_df = raw_df.with_column(
"content",
text.regexp_replace(
text.regexp_replace(
col("content"),
r"Yes \$\\square\$ No \$\\boxtimes\$", "No"
),
r"Yes \$\\boxtimes\$ No \$\\square\$", "Yes"
)
)
# Merge all pages into a single document string
doc_df = (
raw_df
.group_by("document")
.agg(collect_list(col("content")).alias("content"))
.with_column("content", text.array_join("content", "\n"))
)
# Extract markdown structure/schema
md_schema = (
doc_df
.select(text.md_structure("content").alias("md_structure"))
.collect()[0]["md_structure"][0]
)
# Transform markdown string to structured hierarchical column (AST-like)
doc_df = doc_df.with_column("md", text.md_transform("content", md_schema))
# Explode sections, extract section titles/content (as example)
section_df = doc_df.explode("md").with_column(
"section", col("md").get_item("section")
).with_column(
"section_title", col("section").get_item("title")
).with_column(
"section_content", col("section").get_item("paragraphs")
)
# Summarize the document (section-based)
summary_df = section_df.group_by("document").agg(
semantic.reduce(
"summarize the document with sections titled {section_title} and paragraphs {section_content}"
).alias("summary")
)
# Join section_df with summary_df to get section + summary per document
section_df = section_df.join(summary_df, on="document", how="left")
# Generate QA pairs for each section, using the document summary for extra context
prompt = (
"Create 10 question-answer pairs from this text for LLM training.\n"
"The text is a section from a bigger document with the following summary:\n"
"{summary}\n\n"
"Rules:\n\n"
"1. Each question must be about an important fact in the text.\n"
"2. Each answer must be directly supported by the text.\n"
"3. Format each question-answer pair clearly as follows, one pair per line, separated by a newline:\n\n"
"Question: <your question here> | Answer: <your answer here>\n"
"Question: <your question here> | Answer: <your answer here>\n"
"...\n\n"
"Section:\n"
"title: {section_title}\n"
"paragraphs: {section_content}"
)
# Generate QA pairs using the LLM for each section
section_df = section_df.with_column("qas", semantic.map(prompt)).cache()
# Extract structured QA pairs from model responses
qa_template = "Question: ${question:none} | Answer: ${answer:none}"
qa_df = (
section_df
.with_column("qas", text.split(col("qas"), "\n"))
.explode("qas")
.with_column("qas_struct", text.extract(qa_template, col("qas")))
.with_column("question", col("qas_struct").get_item("question"))
.with_column("answer", col("qas_struct").get_item("answer"))
.select("document", "section_title", "question", "answer")
)
# Final output: structured synthetic QA pairs per document/section
qa_df.show()