What is Data Augmentation?

Data augmentation is the process of increasing the size and diversity of a dataset through techniques such as synthetic data generation or elaborative augmentation. A larger, more varied dataset improves the performance and robustness of machine learning models trained on it.

Data augmentation is used to refine and expand the datasets used to fine-tune large language models (LLMs).

There are two primary forms of data augmentation: elaborative and synthetic. This page explains each type and how to use it.

Benefits of Augmentation

Enhanced Dataset Size

Synthetic augmentation increases the volume of the dataset, which can be particularly useful in scenarios where acquiring real-world data is challenging or expensive.

Improved Model Training

A larger and more diverse dataset can lead to better model performance, as it provides a broader range of examples for training.

Reduced Bias

By generating diverse data points, synthetic augmentation can help in reducing biases present in the original dataset.

Synthetic

Synthetic augmentation, also known as data synthesis, is a method used to expand an existing dataset by utilizing LLMs. This process involves sampling existing data points and generating new, synthetic data points based on them. The newly created data is referred to as synthesized or synthetic data.

Process

1. Sampling Existing Data Points

The initial step involves selecting data points from the existing dataset. These samples serve as the foundation for generating new data.

2. Generating Synthetic Data

Using an LLM, new data points are generated. The LLM analyzes the sampled data and creates new data that is consistent with the patterns and structures observed in the original dataset.

3. Integration

The synthesized data is then integrated back into the original dataset, effectively expanding it and enhancing its diversity.
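The three steps above can be sketched in plain Python. The `call_llm` function below is a placeholder assumption standing in for whatever LLM client you actually use; everything else is standard library.

```python
import random

def call_llm(prompt: str) -> str:
    # Placeholder for a real LLM call (e.g. an API client).
    # Here it just echoes the sampled example so the sketch stays runnable.
    return prompt.rsplit(":", 1)[-1].strip()

def synthesize(dataset: list[str], n_new: int, seed: int = 0) -> list[str]:
    """Expand `dataset` with `n_new` synthetic data points."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        # 1. Sample an existing data point as the foundation.
        example = rng.choice(dataset)
        # 2. Ask the LLM for a new point consistent with the sample.
        prompt = f"Write a new example similar in style to: {example}"
        synthetic.append(call_llm(prompt))
    # 3. Integrate the synthetic points back into the original dataset.
    return dataset + synthetic

expanded = synthesize(["The invoice total is $42.", "Payment failed twice."], n_new=3)
print(len(expanded))  # original 2 points plus 3 synthetic ones
```

In a real pipeline you would also deduplicate and quality-filter the generated points before integrating them, since LLM outputs can repeat or drift from the source distribution.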

Applications

Healthcare

Generating synthetic patient data to train medical models without compromising patient privacy.

Finance

Simulating transaction data for fraud detection and other financial models.

Elaborative

Elaborative augmentation is a technique used to create a new dataset from uploaded documents. This process involves utilizing a large language model (LLM) to extract data from your documents and generate a series of prompt-completion pairs. These pairs are then stored as data points in a new dataset. Once enough data points have been generated, this new dataset can be used to fine-tune a model.

Process

1. Document Upload

Begin by uploading the documents from which you want to create the new dataset. These documents will serve as the source material for data extraction.

2. Data Extraction

The LLM analyzes the uploaded documents and extracts relevant data. This data forms the basis for generating prompt-completion pairs.

3. Generating Prompt-Completion Pairs

The LLM creates a set of prompt-completion pairs based on the extracted data. Each pair consists of a prompt (input) and a corresponding completion (output), which are stored as individual data points.

4. Dataset Creation

The generated prompt-completion pairs are aggregated to form a new dataset. This dataset can be expanded as more documents are processed and more pairs are generated.

Once a sufficient number of data points have been accumulated, the new dataset can be used to fine-tune a model. This fine-tuning process enhances the model’s ability to understand and generate text related to the specific domain of the uploaded documents.
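The four-step elaborative pipeline can be sketched as follows. Again, `call_llm` is a placeholder assumption for a real LLM client, and the character-based chunking is a deliberately naive stand-in for real document extraction.

```python
import json

def call_llm(instruction: str, passage: str) -> str:
    # Placeholder for a real LLM call; returns the passage's first sentence
    # so the sketch runs without any external service.
    return passage.split(".")[0] + "."

def extract_chunks(document: str, max_chars: int = 200) -> list[str]:
    """Naive extraction: split a document into fixed-size text chunks."""
    return [document[i:i + max_chars] for i in range(0, len(document), max_chars)]

def build_dataset(documents: list[str]) -> list[dict]:
    """Generate prompt-completion pairs from uploaded documents."""
    pairs = []
    for doc in documents:                      # 1. Document upload
        for chunk in extract_chunks(doc):      # 2. Data extraction
            prompt = f"Summarize the key point of this passage:\n{chunk}"
            completion = call_llm("summarize", chunk)   # 3. Pair generation
            pairs.append({"prompt": prompt, "completion": completion})
    return pairs                               # 4. Dataset creation

dataset = build_dataset(["Data augmentation expands a dataset. It adds diversity."])
print(json.dumps(dataset[0]))  # one JSONL-style data point
```

Each dictionary in the result is one data point; serializing them one per line gives the JSONL format commonly expected by fine-tuning APIs.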

Applications

Legal & Regulatory Compliance

Generating datasets from legal documents to train models for legal text analysis.

Customer Support

Creating datasets from customer service logs to improve chatbots and automated response systems.

Academic Research

Utilizing research papers and academic articles to develop models for literature review and summarization.