Preparing Your Dataset
By following these steps, you can ensure your dataset is well-prepared for synthetic or elaborative augmentation, leading to more effective and accurate AI model training.
Collecting Your Data
Data is the lifeblood of AI. Your organization likely holds valuable data that can be used to train AI models tailored to your domain, for example:
Customer Interactions
Data from interactions with customers through various channels.
User Behavior Logs
Logs tracking user actions and behavior patterns.
CRM Data
Customer relationship management data and contact details.
Transaction Histories
Records of financial transactions and activities.
Market Data
Information on market trends and economic indicators.
Fraud Detection Logs
Logs and records related to fraud detection activities.
Risk Assessment Reports
Evaluations of potential risks and their impacts.
Patient Records
Comprehensive records of patient health information.
Medical Imaging Data
Images from medical scans like X-rays and MRIs.
Lab Results
Data from laboratory tests and analyses.
Clinical Trial Data
Information and results from clinical trials.
Insurance Claims
Records of insurance claims and processing details.
Preparing Your Dataset
It’s essential to prepare your dataset cleanly and systematically. This ensures the quality and relevance of the data, leading to better model performance.
The dataset needs to reflect the prompt-response nature of LLMs, so it needs one system prompt, then a series of prompts and responses that the LLM will use to hone itself.
Structure them like this:
| System Prompt | User Prompt | Response |
|---|---|---|
| Hi, how can I help you today? | | |
| | What do these lab results suggest? | These lab results suggest that the patient is healthy, as no anomalous data has been detected. |
| | What is the sentiment of the last 5 customers who came into support chat? | The last five customers have a neutral to positive sentiment. |
| | … | … |
Collect
Ensure you have collected all relevant data from various sources within your organization. This includes databases, CRM systems, transaction logs, and more.
Clean
Remove Duplicates
Ensure there are no duplicate entries in your dataset to avoid redundancy and bias.
Handle Missing Values
Identify and appropriately handle missing data points. Options include filling in missing values using statistical methods or removing incomplete records.
Correct Errors
Look for and correct any inaccuracies or inconsistencies in the data, such as typographical errors or misformatted entries.
Normalize Data
Standardize the format and scale of your data. This might involve converting all dates to a standard format, ensuring numerical values are on a consistent scale, and normalizing text data.
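The cleaning steps above can be sketched in a few lines of standard-library Python. This is a minimal illustration, not a prescribed pipeline: the record fields (`prompt`, `completion`, `date`), the list of accepted date formats, and the choice to drop incomplete records rather than fill them in are all assumptions made for the example.

```python
import re
from datetime import datetime

# Hypothetical raw records; the field names are illustrative only.
RAW = [
    {"prompt": "  What is my balance? ", "completion": "Your balance is $120.", "date": "03/01/2024"},
    {"prompt": "What is my balance?", "completion": "Your balance is $120.", "date": "2024-03-01"},
    {"prompt": "How do I reset my password?", "completion": "", "date": "2024-03-02"},
    {"prompt": "Where is my order?", "completion": "It ships tomorrow.", "date": "2024/03/05"},
]

# Assumed input formats; slash-separated dates are read as month/day/year.
DATE_FORMATS = ("%Y-%m-%d", "%m/%d/%Y", "%Y/%m/%d")


def normalize_text(text):
    """Collapse runs of whitespace and trim the ends."""
    return re.sub(r"\s+", " ", text).strip()


def normalize_date(value):
    """Convert a date in any accepted format to ISO 8601 (YYYY-MM-DD)."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(value, fmt).date().isoformat()
        except ValueError:
            continue
    raise ValueError(f"unrecognized date format: {value!r}")


def clean(records):
    """Normalize each record, drop incomplete ones, and remove duplicates."""
    seen = set()
    cleaned = []
    for rec in records:
        row = {
            "prompt": normalize_text(rec["prompt"]),
            "completion": normalize_text(rec["completion"]),
            "date": normalize_date(rec["date"]),
        }
        # Handle missing values by dropping records with an empty field.
        if not row["prompt"] or not row["completion"]:
            continue
        # Remove duplicates, comparing on the normalized text.
        key = (row["prompt"].lower(), row["completion"].lower())
        if key in seen:
            continue
        seen.add(key)
        cleaned.append(row)
    return cleaned


CLEANED = clean(RAW)
```

Normalizing before deduplicating matters here: the first two records differ only in whitespace and date format, so they are recognized as duplicates only after both have been standardized.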
Segment
Categorize Data
Segment your data into meaningful categories or classes. This helps in training models more effectively by ensuring relevant data is grouped together.
Feature Selection
Identify and select the most relevant features for your model. This reduces noise and improves model performance.
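As a rough illustration of categorizing data, the sketch below groups prompt-response pairs by keyword. The category names, keyword lists, and sample rows are hypothetical, and keyword matching is only one simple way to segment; it does not cover feature selection.

```python
from collections import defaultdict

# Hypothetical category rules: category name -> keywords that signal it.
CATEGORY_KEYWORDS = {
    "billing": ("invoice", "refund", "charge"),
    "account": ("password", "login", "email"),
}


def categorize(prompt):
    """Return the first category whose keyword appears in the prompt."""
    text = prompt.lower()
    for category, keywords in CATEGORY_KEYWORDS.items():
        if any(word in text for word in keywords):
            return category
    return "other"


def segment(rows):
    """Group (prompt, completion) rows by category."""
    groups = defaultdict(list)
    for prompt, completion in rows:
        groups[categorize(prompt)].append((prompt, completion))
    return dict(groups)


ROWS = [
    ("Why was I charged twice?", "One charge is a pending hold and will drop off."),
    ("How do I reset my password?", "Use the reset link on the sign-in page."),
    ("What are your opening hours?", "We are open 9 to 5 on weekdays."),
]
SEGMENTS = segment(ROWS)
```

Rows that match no rule fall into an `other` group, which is worth reviewing by hand since it often reveals categories the rules have missed.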
Anonymize
Remove Personal Identifiers
Ensure all personal or sensitive information is anonymized or removed to comply with privacy regulations.
Tokenization
Replace sensitive data with tokens to maintain data integrity while protecting privacy. Store the mapping between tokens and the original sensitive data in a secure, separate location.
PII Redaction Example:
Original Data | Tokenized Data |
---|---|
John Doe, 123-45-6789 | T123, T456 |
Jane Smith, 987-65-4321 | T789, T654 |
In this example, “John Doe” is replaced with “T123” and “123-45-6789” with “T456,” ensuring that sensitive information remains secure.
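A minimal sketch of this kind of tokenization follows. The `Tokenizer` class, its `vault` attribute, and the `known_names` parameter are hypothetical names introduced for illustration. The sketch assumes US Social Security numbers in the `123-45-6789` shape and expects the caller to supply the names to redact, since reliably detecting names in free text needs more than a regular expression. Tokens are generated randomly so that a token reveals nothing about the value it replaces.

```python
import re
import secrets

# Matches numbers shaped like a US Social Security number.
SSN_PATTERN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")


class Tokenizer:
    """Replaces sensitive values with opaque tokens and remembers the mapping."""

    def __init__(self):
        # token -> original value. In practice, store this mapping in a
        # secure location that is separate from the tokenized dataset.
        self.vault = {}
        # original value -> token, so a repeated value reuses one token.
        self._by_value = {}

    def token_for(self, value):
        """Return the token for a value, creating a random one if needed."""
        token = self._by_value.get(value)
        if token is None:
            token = "T" + secrets.token_hex(8)
            self._by_value[value] = token
            self.vault[token] = value
        return token

    def redact(self, text, known_names=()):
        """Replace SSN-shaped numbers and the given names with tokens."""
        text = SSN_PATTERN.sub(lambda m: self.token_for(m.group()), text)
        for name in known_names:
            text = text.replace(name, self.token_for(name))
        return text


TOKENIZER = Tokenizer()
REDACTED = TOKENIZER.redact("John Doe, 123-45-6789", known_names=["John Doe"])
```

Because the same value always maps to the same token, relationships in the data (for example, two records about the same customer) survive tokenization even though the identifying value does not.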
Format
Consistent Structure
Ensure all data follows a consistent structure and format, making it easier to process and analyze.
File Formats
Convert data into appropriate file formats (e.g., CSV, JSON) suitable for your training framework.
For Anarchy’s systems to augment your data, it must be in CSV format with three mandatory fields: system prompt, prompt, and completion. Keep the fields in exactly this order and do not include a header row.
A properly formatted .csv dataset should look like this, without the header descriptions:
| System Prompt | User Prompt | Response |
|---|---|---|
| Hi, how can I help you today? | | |
| | What do these lab results suggest? | These lab results suggest that the patient is healthy, as no anomalous data has been detected. |
| | What is the sentiment of the last 5 customers who came into support chat? | The last five customers have a neutral to positive sentiment. |
| | … | … |
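One way to produce such a file is with Python's built-in `csv` module, which takes care of quoting fields that contain commas. The sketch below assumes that the system prompt sits alone on the first row and that later rows leave that column empty, which is one reading of the table above; the `write_dataset` helper is a hypothetical name.

```python
import csv
import io

SYSTEM_PROMPT = "Hi, how can I help you today?"
PAIRS = [
    ("What do these lab results suggest?",
     "These lab results suggest that the patient is healthy, "
     "as no anomalous data has been detected."),
    ("What is the sentiment of the last 5 customers who came into support chat?",
     "The last five customers have a neutral to positive sentiment."),
]


def write_dataset(stream, system_prompt, pairs):
    """Write rows as system prompt, prompt, completion, with no header row."""
    writer = csv.writer(stream)
    # Assumed layout: the system prompt occupies the first row on its own.
    writer.writerow([system_prompt, "", ""])
    for prompt, completion in pairs:
        writer.writerow(["", prompt, completion])


# An in-memory buffer keeps the example self-contained.
buffer = io.StringIO()
write_dataset(buffer, SYSTEM_PROMPT, PAIRS)
CSV_TEXT = buffer.getvalue()
```

To write a real file instead, pass a handle opened with `open("dataset.csv", "w", newline="", encoding="utf-8")`; the `newline=""` argument is what the `csv` module documentation recommends for file objects.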
Validate
Quality Checks
Perform thorough quality checks to ensure the data is accurate, consistent, and complete.
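Some of these checks can be automated. The sketch below flags rows with the wrong number of fields, rows where a prompt lacks a completion (or the reverse), empty rows, and exact duplicates. The `check_rows` function and the specific rules are assumptions for illustration, based on a three-field layout with no header row in which a row may carry only a system prompt.

```python
import csv
import io


def check_rows(rows):
    """Return (row_number, problem) tuples; an empty list means no issues found."""
    problems = []
    seen = set()
    for number, row in enumerate(rows, start=1):
        if len(row) != 3:
            problems.append((number, f"expected 3 fields, found {len(row)}"))
            continue
        system_prompt, prompt, completion = (field.strip() for field in row)
        # A prompt without a completion (or the reverse) is incomplete.
        if bool(prompt) != bool(completion):
            problems.append((number, "prompt and completion must both be present"))
        if not (system_prompt or prompt):
            problems.append((number, "row is empty"))
        key = tuple(row)
        if key in seen:
            problems.append((number, "duplicate row"))
        seen.add(key)
    return problems


# Illustrative input: row 3 repeats row 2, and row 4 is missing a field.
SAMPLE = (
    '"Hi, how can I help you today?",,\n'
    ",What do these lab results suggest?,No anomalies were detected.\n"
    ",What do these lab results suggest?,No anomalies were detected.\n"
    ",Where is my order?\n"
)
PROBLEMS = check_rows(csv.reader(io.StringIO(SAMPLE)))
```

Automated checks like these catch structural problems only; judging whether a completion is actually accurate still needs human or domain-expert review.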
Cross-Validation
Use cross-validation techniques, such as training and evaluating on different splits of the data, to check the reliability and validity of your dataset.
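The text does not say which cross-validation technique to use. As one common option, the sketch below builds k-fold splits: the data is divided into k folds, and each fold in turn is held out for evaluation while the rest is used for training. The `k_fold_indices` function is a hypothetical helper, and the fixed seed is there only to make the shuffle repeatable.

```python
import random


def k_fold_indices(n_items, k, seed=0):
    """Yield (train_indices, validation_indices) for each of k folds."""
    if not 2 <= k <= n_items:
        raise ValueError("k must be between 2 and the number of items")
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)
    # Slicing with a stride of k gives folds whose sizes differ by at most one.
    folds = [indices[i::k] for i in range(k)]
    for i, validation in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, validation


FOLDS = list(k_fold_indices(n_items=10, k=5))
```

If a model's results vary widely from fold to fold, that can be a sign that the dataset is too small or that some segments are inconsistent with the rest.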
Document
Metadata
Document metadata, including data sources, collection methods, and any preprocessing steps applied.
Version Control
Maintain version control to track changes and updates to your dataset over time.
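A lightweight way to combine both practices is to keep a small manifest file next to each dataset snapshot. The sketch below records a version label, the sources, the preprocessing steps, and a SHA-256 checksum of the file contents. The `build_manifest` function, the manifest keys, and the sample values are all hypothetical placeholders.

```python
import hashlib
import json
from datetime import datetime, timezone


def build_manifest(data, sources, preprocessing_steps, version):
    """Describe a dataset snapshot; `data` is the raw bytes of the CSV file."""
    return {
        "version": version,
        "created_at": datetime.now(timezone.utc).isoformat(),
        # The checksum changes whenever the file contents change.
        "sha256": hashlib.sha256(data).hexdigest(),
        "size_bytes": len(data),
        "sources": list(sources),
        "preprocessing": list(preprocessing_steps),
    }


# Placeholder bytes standing in for the contents of a dataset file.
DATA = b",What do these lab results suggest?,No anomalies were detected.\r\n"
MANIFEST = build_manifest(
    DATA,
    sources=["support chat export"],
    preprocessing_steps=["removed duplicates", "tokenized personal identifiers"],
    version="1.0.0",
)
MANIFEST_JSON = json.dumps(MANIFEST, indent=2)
```

Committing the manifest to a version control system alongside the dataset gives a history of what changed and why, and comparing checksums shows quickly whether two snapshots are identical.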