Curating Datasets
Confident AI provides your team with a centralized place to create, upload, and edit evaluation datasets online. Think of it like a Google Sheets or Notion CSV editor, with the difference that each row is already structured as an `LLMTestCase`, and, most importantly, you can use it directly in code when evaluating with `deepeval`.
An evaluation dataset on Confident AI is a collection of goldens, which are extremely similar to test cases. You can learn more about goldens here.
Create Your First Dataset
To create your first dataset, simply navigate to the Datasets page in your project space. There, you'll see a Create Dataset button, and you'll be asked to name your first dataset by giving it an alias. This alias is what identifies the dataset you want to use for evaluation later on.
Populate Your Dataset With Golden(s)
Now that you've created a dataset, you can create "goldens" within your dataset that will later be converted to `LLMTestCase`s at evaluation time (we'll talk more about this later). There are a few ways to populate your dataset with goldens:
- Creating a golden individually using Confident AI's goldens editor.
- Importing a CSV of goldens to Confident AI.
- Uploading a list of `Golden`s to Confident AI via `deepeval`.
If you're not sure what to include in your goldens, simply enter the inputs you're currently prompting your LLM application with when eyeballing outputs. You can then automate this process by creating a list of goldens out of those `input`s, as shown in the sketch below.
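For instance, here's a minimal sketch of that automation, assuming `current_inputs` is a hypothetical list holding the inputs you already prompt your LLM application with:

```python
from deepeval.dataset import EvaluationDataset, Golden

# Hypothetical list of inputs you're already prompting your LLM application with
current_inputs = [
    "I have a persistent cough and fever. Should I be worried?",
    "What should I do if I accidentally cut my finger deeply?",
]

# One golden per input; the other golden fields can be filled in later on Confident AI
goldens = [Golden(input=text) for text in current_inputs]
dataset = EvaluationDataset(goldens=goldens)
```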
A golden is basically a test case that isn't ready for evaluation yet. It holds additional information needed for a better dataset annotation experience, such as the ability to be marked as "ready" for evaluation, an `actual_output` field that can be left empty and populated later at evaluation time, and extra columns and metadata that might be useful to you at evaluation time.
We highly recommend AGAINST running evaluations on pre-computed evaluation datasets, since you'll want to test your LLM application on the latest `actual_output`s generated as a consequence of your iteration. So if you find yourself filling in the `actual_output` field ahead of time, think again.
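To make this concrete, here's a minimal sketch of how a golden typically becomes an `LLMTestCase` at evaluation time, with the `actual_output` generated fresh by your application instead of being stored in the dataset (`your_llm_app` is a hypothetical stand-in for your actual LLM application):

```python
from deepeval.dataset import Golden
from deepeval.test_case import LLMTestCase

def your_llm_app(input: str) -> str:
    # Hypothetical stand-in for your actual LLM application
    return "..."

golden = Golden(
    input="I have a persistent cough and fever. Should I be worried?",
    expected_output="If your cough and fever persist or worsen, consult a healthcare provider.",
    # Note: no actual_output here; it's generated below, at evaluation time
)

# Convert the golden into a test case using your application's latest output
test_case = LLMTestCase(
    input=golden.input,
    actual_output=your_llm_app(golden.input),
    expected_output=golden.expected_output,
)
```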
Create Individual Golden(s)
You can create goldens manually by clicking on the Create Golden button in the Datasets > Dataset Editor page, which will open an editor for you to fill in your golden information.
Import Golden(s) From CSV
Alternatively, you can also choose to upload a list of goldens from a CSV file. Simply click on the Upload Goldens button, and you'll be able to map CSV columns to golden fields when importing, as sketched in code after the list below.
The golden fields include:
- `input`: a string representing the `input` to prompt your LLM application with during evaluation.
- [Optional] `actual_output`: a string representing the generated `actual_output` of your LLM application for the corresponding `input`.
- [Optional] `expected_output`: a string representing the ideal output for the corresponding `input`.
- [Optional] `retrieval_context`: a list of strings representing the retrieved text chunks of your LLM application for the corresponding `input`. This is only relevant for RAG pipelines.
- [Optional] `context`: a list of strings representing the ground truth as supporting context.
- [Optional] `comments`: a string representing whatever comments your data annotators have for this particular golden (e.g. "Watch out for this expected output! It needs more work.").
- [Optional] `additional_metadata`: a freeform JSON object which you can use to include any additional data that you can later make use of in code at evaluation time.
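The Confident AI importer handles this column-to-field mapping for you, but if it helps to see the mapping spelled out in code, here's a minimal sketch that reads a hypothetical `goldens.csv` with Python's standard `csv` module and builds `Golden`s from it (the column names and the semicolon delimiter are illustrative assumptions):

```python
import csv

from deepeval.dataset import EvaluationDataset, Golden

goldens = []
with open("goldens.csv", newline="", encoding="utf-8") as f:
    for row in csv.DictReader(f):
        goldens.append(
            Golden(
                input=row["input"],
                # Optional fields: only include the columns your CSV actually has
                expected_output=row.get("expected_output"),
                comments=row.get("comments"),
                # List fields like retrieval_context are often stored as
                # delimiter-separated strings in CSVs, e.g. "chunk 1;chunk 2"
                retrieval_context=(
                    row["retrieval_context"].split(";")
                    if row.get("retrieval_context")
                    else None
                ),
            )
        )

dataset = EvaluationDataset(goldens=goldens)
```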
Upload Golden(s) Via DeepEval
Pushing an `EvaluationDataset` to Confident AI using `deepeval` simply involves creating an `EvaluationDataset` with a list of `Golden`s, then pushing it by supplying the dataset alias.
Here's the data structure of a `Golden`:
```python
from typing import Optional, List, Dict

class Golden:
    input: str
    actual_output: Optional[str]
    expected_output: Optional[str]
    retrieval_context: Optional[List[str]]
    context: Optional[List[str]]
    comments: Optional[str]
    additional_metadata: Optional[Dict]
```
Here's the fake data we'll be uploading:
```python
fake_data = [
    {
        "input": "I have a persistent cough and fever. Should I be worried?",
        "expected_output": (
            "If your cough and fever persist or worsen, it could indicate a serious condition. "
            "Persistent fevers lasting more than three days or difficulty breathing should prompt immediate medical attention. "
            "Stay hydrated and consider over-the-counter fever reducers, but consult a healthcare provider for proper diagnosis."
        )
    },
    {
        "input": "What should I do if I accidentally cut my finger deeply?",
        "expected_output": (
            "Rinse the cut with soap and water, apply pressure to stop bleeding, and elevate the finger. "
            "Seek medical care if the cut is deep, bleeding persists, or your tetanus shot isn't up to date."
        ),
    }
]
```
And here's a quick example of how to push `Golden`s within an `EvaluationDataset` to Confident AI:
```python
from deepeval.dataset import EvaluationDataset, Golden

# See above for contents of fake_data
fake_data = [...]

goldens = []
for fake_datum in fake_data:
    golden = Golden(
        input=fake_datum["input"],
        expected_output=fake_datum["expected_output"],
    )
    goldens.append(golden)

dataset = EvaluationDataset(goldens=goldens)
```
After creating your `EvaluationDataset`, all you have to do is push it to Confident by providing an `alias` as a unique identifier:
```python
...

# Provide an alias when pushing a dataset
dataset.push(alias="QA Dataset")
```
You can also choose to overwrite or append to an existing dataset if one with the same alias already exists:
```python
...

# Overwrite existing datasets
dataset.push(alias="QA Dataset", overwrite=True)
```
`deepeval` will prompt you in the terminal if no value for `overwrite` is provided.
What is a Golden?
A "golden" is what makes up an evaluation dataset and is very similar to a test case in `deepeval`, but it:
- Does not require an `actual_output`, so while test cases are always ready for evaluation, a golden isn't.
- Only exists within an `EvaluationDataset`, while test cases can be defined anywhere.
- Contains an extra `additional_metadata` field, which is a dictionary you can define on Confident. This allows you to do some extra preprocessing on your dataset (e.g., generating a custom LLM `actual_output` based on some variables in `additional_metadata`) before evaluation, as shown in the sketch below.
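Here's a minimal sketch of that last point, assuming a golden whose `additional_metadata` carries a hypothetical `tone` variable used to customize generation at evaluation time (`your_llm_app` is again a hypothetical stand-in for your LLM application):

```python
from deepeval.dataset import Golden
from deepeval.test_case import LLMTestCase

def your_llm_app(input: str, tone: str) -> str:
    # Hypothetical LLM application that can be steered by a "tone" variable
    return "..."

golden = Golden(
    input="I have a persistent cough and fever. Should I be worried?",
    additional_metadata={"tone": "reassuring"},
)

# Use additional_metadata to preprocess/customize generation at evaluation time
tone = (golden.additional_metadata or {}).get("tone", "neutral")
test_case = LLMTestCase(
    input=golden.input,
    actual_output=your_llm_app(golden.input, tone=tone),
)
```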
We introduced the concept of goldens because it allows you to create evaluation datasets on Confident without needing pre-computed `actual_output`s. This is especially helpful if you are looking to generate responses from your LLM application at evaluation time.