By following these steps, you can ensure your dataset is well-prepared for synthetic or elaborative augmentation, leading to more effective and accurate AI model training.
The dataset starts with a system prompt, then a series of prompts and responses that the LLM will use to hone itself.
Structure them like this:
| System Prompt | User Prompt | Response |
|---|---|---|
| Hi, how can I help you today? | | |
| | What do these lab results suggest? | These lab results suggest that the patient is healthy, as no anomalous data has been detected. |
| | What is the sentiment of the last 5 customers who came into support chat? | The last five customers have a neutral to positive sentiment. |
| | … | … |
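
As a rough illustration, the same structure can be held in memory and checked before it is written to disk. This is a minimal sketch, assuming the system prompt sits alone on the first row and each later row holds one prompt and response pair; the names `Row` and `find_problems` are hypothetical and not part of any tool.

```python
from typing import NamedTuple


class Row(NamedTuple):
    """One line of the dataset: three fields, any of which may be empty."""
    system_prompt: str
    user_prompt: str
    response: str


# Assumed layout: the system prompt is alone on the first row, and every
# later row carries one prompt/response pair with an empty system prompt.
rows = [
    Row("Hi, how can I help you today?", "", ""),
    Row("", "What do these lab results suggest?",
        "These lab results suggest that the patient is healthy, "
        "as no anomalous data has been detected."),
    Row("", "What is the sentiment of the last 5 customers who came into support chat?",
        "The last five customers have a neutral to positive sentiment."),
]


def find_problems(rows: list[Row]) -> list[str]:
    """Return readable problems; an empty list means the assumed layout holds."""
    problems = []
    if not rows or not rows[0].system_prompt.strip():
        problems.append("row 1: missing system prompt")
    for number, row in enumerate(rows[1:], start=2):
        if not row.user_prompt.strip() or not row.response.strip():
            problems.append(f"row {number}: prompt and response must both be filled")
    return problems


print(find_problems(rows))
```

Running a check like this before training catches half-filled rows early, when they are still cheap to fix.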
Remove Duplicates
Handle Missing Values
Correct Errors
Normalize Data
Categorize Data
Feature Selection
Remove Personal Identifiers
Tokenization
Original Data | Tokenized Data |
---|---|
John Doe, 123-45-6789 | T123, T456 |
Jane Smith, 987-65-4321 | T789, T654 |
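
Several of the steps named above (removing duplicates, handling missing values, normalizing data, and tokenization) can be sketched as a small pipeline. This is only an illustration under stated assumptions: each record is a hypothetical `(name, identifier)` pair, rows with missing values are simply dropped, and the helpers `clean` and `TokenVault` are made-up names. The tokens here are sequential counters chosen for readability, so they differ from the illustrative values in the table.

```python
from itertools import count
from typing import Optional

# Assumed record shape: (name, identifier); either field may be missing.
Record = tuple[Optional[str], Optional[str]]


def clean(records: list[Record]) -> list[tuple[str, str]]:
    """Normalize whitespace, drop rows with missing values, remove duplicates."""
    seen: set[tuple[str, str]] = set()
    cleaned = []
    for name, identifier in records:
        name = " ".join((name or "").split())      # normalize spacing
        identifier = (identifier or "").strip()
        if not name or not identifier:
            continue                               # missing value: drop the row
        row = (name, identifier)
        if row not in seen:                        # keep the first occurrence only
            seen.add(row)
            cleaned.append(row)
    return cleaned


class TokenVault:
    """Maps each distinct sensitive value to a stable placeholder token."""

    def __init__(self) -> None:
        self._by_value: dict[str, str] = {}
        self._ids = count(1)

    def token_for(self, value: str) -> str:
        # The same input always receives the same token within one vault.
        if value not in self._by_value:
            self._by_value[value] = f"T{next(self._ids):03d}"
        return self._by_value[value]


raw = [
    ("John Doe", "123-45-6789"),
    ("John  Doe ", "123-45-6789"),   # duplicate once whitespace is normalized
    ("Jane Smith", None),            # missing identifier
    ("Jane Smith", "987-65-4321"),
]

vault = TokenVault()
tokenized = [(vault.token_for(n), vault.token_for(i)) for n, i in clean(raw)]
print(tokenized)  # [('T001', 'T002'), ('T003', 'T004')]
```

Keeping the value-to-token mapping in one object means a repeated name or identifier always maps to the same token, so relationships in the data survive anonymization. The mapping itself is sensitive and would need to be stored separately from the tokenized dataset.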
Consistent Structure
File Formats
Each row has three fields: system prompt, prompt, and completion. Keep the fields strictly in this order, and do not include a header row.

A properly formatted .csv dataset should look like this (the column headings are shown only for reference and are not part of the file):

| System Prompt | User Prompt | Response |
|---|---|---|
| Hi, how can I help you today? | | |
| | What do these lab results suggest? | These lab results suggest that the patient is healthy, as no anomalous data has been detected. |
| | What is the sentiment of the last 5 customers who came into support chat? | The last five customers have a neutral to positive sentiment. |
| | … | … |
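
One way to produce such a file is with Python's standard `csv` module, which quotes fields that contain commas so the three-column layout survives. The sketch below assumes the row layout shown above; `write_dataset` is a hypothetical helper name, and an in-memory buffer stands in for a real file.

```python
import csv
import io

# Rows in the fixed order: system prompt, prompt, completion. No header row.
rows = [
    ("Hi, how can I help you today?", "", ""),
    ("", "What do these lab results suggest?",
     "These lab results suggest that the patient is healthy, "
     "as no anomalous data has been detected."),
]


def write_dataset(rows, stream) -> None:
    """Write rows as CSV without a header, rejecting rows that lack three fields."""
    writer = csv.writer(stream)
    for number, row in enumerate(rows, start=1):
        if len(row) != 3:
            raise ValueError(f"row {number}: expected 3 fields, got {len(row)}")
        writer.writerow(row)


# newline="" leaves line endings to the csv module, as its documentation advises.
buffer = io.StringIO(newline="")
write_dataset(rows, buffer)
csv_text = buffer.getvalue()

# Read the text back to confirm that commas inside fields did not split columns.
read_back = [tuple(r) for r in csv.reader(io.StringIO(csv_text, newline=""))]
print(read_back == rows)  # True
```

To write to disk instead, the same function can be given a file opened with `open(path, "w", newline="", encoding="utf-8")`.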