RAGAS RAG Evaluation#
RAGAS (Retrieval Augmented Generation Assessment) is a dedicated framework for evaluating the performance of Retrieval Augmented Generation (RAG) systems. Core evaluation metrics include:
| Metric | Description |
|---|---|
| Faithfulness | Whether the answer is grounded in the retrieved context without hallucination |
| AnswerRelevancy | Whether the answer directly addresses the user’s question |
| ContextPrecision | Proportion of relevant information in the retrieved context |
| ContextRecall | Whether all information needed to answer was successfully retrieved |
| AnswerCorrectness | Degree of agreement between generated answer and ground truth |
Additionally, RAGAS supports automatic test dataset generation from documents and multi-modal text-image RAG evaluation.
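To make the Faithfulness metric concrete: it is commonly described as the fraction of claims in the answer that the retrieved context supports. The sketch below assumes the per-claim verdicts are already available (in a real run the critic LLM produces them). `faithfulness_score` is a hypothetical helper for illustration, not part of RAGAS or EvalScope.

```python
def faithfulness_score(verdicts):
    """Fraction of answer claims judged as supported by the retrieved context.

    `verdicts` is a list of booleans, one per claim extracted from the answer.
    An answer with no claims scores 0.0 here; that convention is an assumption.
    """
    if not verdicts:
        return 0.0
    return sum(1 for supported in verdicts if supported) / len(verdicts)


# Four claims, three of them supported by the context.
verdicts = [True, True, False, True]
print(faithfulness_score(verdicts))  # 0.75
```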
Environment Setup#
pip install "evalscope[rag]" -U
Scenario 1: Evaluate RAG with Existing Dataset (Quick Start)#
Use when: You already have question/answer/context data and want to evaluate RAG system quality.
Data Format#
The evaluation data is a JSON file where each record contains the following fields:
| Field | Description |
|---|---|
| user_input | User question |
| response | Answer generated by the RAG system |
| retrieved_contexts | List of retrieved context passages |
| reference | Ground truth answer |
Example:
[
    {
        "user_input": "When was the first Olympic Games held?",
        "retrieved_contexts": [
            "The first modern Olympic Games were held from April 6 to April 15, 1896, in Athens, Greece."
        ],
        "response": "The first modern Olympic Games were held on April 6, 1896.",
        "reference": "The first modern Olympic Games opened on April 6, 1896, in Athens, Greece."
    }
]
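Before running an evaluation it can help to check that every record carries the fields described above. The following is a minimal sketch; `find_record_problems` is a hypothetical helper, and treating all four fields as mandatory is an assumption (metrics that do not use a ground truth may not need `reference`).

```python
import json

# Field names taken from the data format described above.
REQUIRED_FIELDS = ("user_input", "response", "retrieved_contexts", "reference")


def find_record_problems(records):
    """Return a list of human-readable problems found in the records."""
    problems = []
    for i, record in enumerate(records):
        for field in REQUIRED_FIELDS:
            if field not in record:
                problems.append(f"record {i}: missing '{field}'")
        contexts = record.get("retrieved_contexts")
        if contexts is not None and not isinstance(contexts, list):
            problems.append(f"record {i}: 'retrieved_contexts' must be a list")
    return problems


sample = json.loads("""
[
    {
        "user_input": "When was the first Olympic Games held?",
        "retrieved_contexts": [
            "The first modern Olympic Games were held from April 6 to April 15, 1896, in Athens, Greece."
        ],
        "response": "The first modern Olympic Games were held on April 6, 1896.",
        "reference": "The first modern Olympic Games opened on April 6, 1896, in Athens, Greece."
    }
]
""")

print(find_record_problems(sample))  # []
```

In practice the records would be read with `json.load` from the file that `testset_file` points to.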
Configuration and Execution#
from evalscope.run import run_task

eval_task_cfg = {
    "eval_backend": "RAGEval",
    "eval_config": {
        "tool": "RAGAS",
        "eval": {
            "testset_file": "outputs/testset.json",
            "critic_llm": {
                "model_name": "Qwen/Qwen2.5-72B-Instruct-GPTQ-Int4",
                "api_base": "http://127.0.0.1:8000/v1",
            },
            "embeddings": {
                "model_name_or_path": "AI-ModelScope/m3e-base",
            },
            "metrics": [
                "Faithfulness",
                "AnswerRelevancy",
                "ContextPrecision",
                "AnswerCorrectness",
            ],
            "language": "english",
        },
    },
}

run_task(task_cfg=eval_task_cfg)
Understanding Results#
After evaluation completes, results are displayed as follows:
[Figure: Evaluation Results]
Each data entry will include scores for each metric, and the overall report shows the average across all metrics.
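As a rough illustration of how per-entry scores roll up into the report, the sketch below averages each metric over all entries. The score key names (`faithfulness`, `answer_relevancy`) and the list-of-dicts layout are assumptions made for this example, as is reading "average" as a per-metric mean over entries; `average_per_metric` is a hypothetical helper and the actual EvalScope output layout may differ.

```python
from statistics import mean


def average_per_metric(entries):
    """Average every metric over all scored entries."""
    collected = {}
    for entry in entries:
        for name, score in entry.items():
            collected.setdefault(name, []).append(score)
    return {name: mean(scores) for name, scores in collected.items()}


# Hypothetical per-entry scores; real key names may differ.
scored_entries = [
    {"faithfulness": 1.0, "answer_relevancy": 0.75},
    {"faithfulness": 0.5, "answer_relevancy": 0.25},
]

report = average_per_metric(scored_entries)
print(report)  # {'faithfulness': 0.75, 'answer_relevancy': 0.5}
```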
Key Parameters for This Scenario#
| Parameter | Type | Default | Description |
|---|---|---|---|
| testset_file | | - | Path to the evaluation dataset file |
| critic_llm | | - | LLM configuration for evaluation (see Parameter Reference) |
| embeddings | | - | Embedding model configuration (see Parameter Reference) |
| metrics | | | List of evaluation metrics |
| language | | | Language setting, set to |
| batch_size | | | Batch size |
Scenario 2: Auto-Generate Test Dataset#
Use when: You don’t have existing evaluation data and want to automatically generate a question/answer/context test set from your documents.
RAGAS uses an evolutionary generation paradigm inspired by Evol-Instruct, systematically constructing questions with different characteristics (reasoning, conditional, multi-context, etc.) from the provided documents to ensure comprehensive evaluation coverage.
Prepare Documents#
Have your source documents ready (supports markdown, txt, and other formats). Documents should contain sufficient content (recommended > 100 tokens), otherwise errors may occur.
Configuration and Execution#
from evalscope.run import run_task

generate_testset_task_cfg = {
    "eval_backend": "RAGEval",
    "eval_config": {
        "tool": "RAGAS",
        "testset_generation": {
            "docs": ["README.md"],
            "test_size": 10,
            "output_file": "outputs/testset.json",
            "generator_llm": {
                "model_name": "Qwen/Qwen2.5-72B-Instruct-GPTQ-Int4",
                "api_base": "http://127.0.0.1:8000/v1",
            },
            "embeddings": {
                "model_name_or_path": "AI-ModelScope/m3e-base",
            },
            "language": "english",
        },
    },
}

run_task(task_cfg=generate_testset_task_cfg)
Once generation completes, data is saved to the path specified by output_file, in the same format as Scenario 1’s input.
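Because the generated file uses the Scenario 1 format, its path can be passed straight to an evaluation config. The sketch below shows one way to wire the two together; `build_eval_cfg` is a hypothetical helper that only assembles the dictionary using keys shown in this document, and it does not call EvalScope itself.

```python
def build_eval_cfg(testset_file, critic_llm, embeddings, metrics, language="english"):
    """Assemble a Scenario 1 style evaluation config for a given testset file."""
    return {
        "eval_backend": "RAGEval",
        "eval_config": {
            "tool": "RAGAS",
            "eval": {
                "testset_file": testset_file,
                "critic_llm": critic_llm,
                "embeddings": embeddings,
                "metrics": list(metrics),
                "language": language,
            },
        },
    }


# Reuse the path written by the generation step as the evaluation input.
output_file = "outputs/testset.json"
eval_cfg = build_eval_cfg(
    testset_file=output_file,
    critic_llm={
        "model_name": "Qwen/Qwen2.5-72B-Instruct-GPTQ-Int4",
        "api_base": "http://127.0.0.1:8000/v1",
    },
    embeddings={"model_name_or_path": "AI-ModelScope/m3e-base"},
    metrics=["Faithfulness", "AnswerRelevancy"],
)

print(eval_cfg["eval_config"]["eval"]["testset_file"])  # outputs/testset.json
```

Once the generated file exists, `eval_cfg` can be passed to `run_task(task_cfg=eval_cfg)` as in Scenario 1. Note that a generated test set still needs the RAG system's own `response` values before answer-level metrics are meaningful.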
Key Parameters for This Scenario#
| Parameter | Type | Default | Description |
|---|---|---|---|
| docs | | - | List of source document paths |
| test_size | | | Number of test entries to generate |
| output_file | | | Output file path for generated dataset |
| generator_llm | | - | Generator LLM configuration (see Parameter Reference) |
| embeddings | | - | Embedding model configuration (see Parameter Reference) |
| language | | | Language setting, e.g. |
Troubleshooting#
LLM Output Format Errors
Note: generator_llm requires a model with strong instruction-following capabilities. Models with 7B or fewer parameters may encounter the following error:
ragas.testset.transforms.engine - ERROR - unable to apply transformation: 'Generation' object has no attribute 'message'
This happens because smaller models produce outputs in an unexpected format, causing parsing failures. Solution: use a larger model (e.g. Qwen/Qwen2.5-72B-Instruct-GPTQ-Int4) or a proprietary model (e.g. GPT-4o).
Document Too Short
Tip: If you encounter the following error, the unstructured library found insufficient content in your documents:
ValueError: Documents appear to be too short (i.e., 100 tokens or less). Please provide longer documents.
Solution: ensure your source documents have sufficient content, or preprocess them into plain txt format before retrying.
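A quick pre-check can flag documents that are likely to trigger this error before generation starts. The sketch below counts whitespace-separated words as a rough stand-in for tokens; this is an approximation, since the library's own token count may differ. `find_short_documents` is a hypothetical helper, and the temporary files exist only to make the example self-contained.

```python
import tempfile
from pathlib import Path


def find_short_documents(paths, min_tokens=100):
    """Return paths whose whitespace-separated word count is <= min_tokens."""
    short = []
    for path in paths:
        text = Path(path).read_text(encoding="utf-8", errors="ignore")
        if len(text.split()) <= min_tokens:
            short.append(str(path))
    return short


# Demonstration with two throwaway files: one long enough, one too short.
with tempfile.TemporaryDirectory() as tmp:
    long_doc = Path(tmp) / "long.md"
    short_doc = Path(tmp) / "short.md"
    long_doc.write_text("word " * 500, encoding="utf-8")
    short_doc.write_text("too short", encoding="utf-8")
    flagged = find_short_documents([long_doc, short_doc])
    flagged_names = [Path(p).name for p in flagged]

print(flagged_names)  # ['short.md']
```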
Scenario 3: Multi-Modal Text-Image RAG Evaluation#
Use when: Evaluating RAG systems that involve image understanding, where the context may include images.
Data Format#
Same format as Scenario 1, except that retrieved_contexts can include image paths (local paths or URLs):
[
    {
        "user_input": "What is the brand of the car in the image?",
        "retrieved_contexts": [
            "custom_eval/multimodal/images/tesla.jpg"
        ],
        "response": "Tesla is a car brand.",
        "reference": "The car brand in the image is Tesla."
    }
]
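When preparing multi-modal data it can be useful to see which context entries are images and which are plain text. The sketch below classifies entries by file extension only; `looks_like_image` and the extension list are assumptions for illustration, and EvalScope may use a different rule internally.

```python
from pathlib import PurePosixPath
from urllib.parse import urlparse

# Hypothetical set of extensions treated as images in this sketch.
IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png", ".gif", ".bmp", ".webp"}


def looks_like_image(context):
    """Guess whether a retrieved context is an image path or URL."""
    # For URLs, ignore the query string and inspect only the path component.
    path = urlparse(context).path if "://" in context else context
    return PurePosixPath(path).suffix.lower() in IMAGE_SUFFIXES


contexts = [
    "custom_eval/multimodal/images/tesla.jpg",
    "The first modern Olympic Games were held in Athens, Greece.",
]
print([looks_like_image(c) for c in contexts])  # [True, False]
```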
Configuration and Execution#
from evalscope.run import run_task

multi_modal_task_cfg = {
    "eval_backend": "RAGEval",
    "eval_config": {
        "tool": "RAGAS",
        "eval": {
            "testset_file": "outputs/testset_multi_modal.json",
            "critic_llm": {
                "model_name": "gpt-4o",
                "api_base": "http://127.0.0.1:8088/v1",
                "api_key": "EMPTY",
            },
            "embeddings": {
                "model_name_or_path": "AI-ModelScope/bge-large-zh",
            },
            "metrics": [
                "MultiModalFaithfulness",
                "MultiModalRelevance",
            ],
        },
    },
}

run_task(task_cfg=multi_modal_task_cfg)
Example output:
[
    {
        "user_input": "What is the brand of the car in the image?",
        "retrieved_contexts": ["custom_eval/multimodal/images/tesla.jpg"],
        "response": "Tesla is a car brand.",
        "reference": "The car brand in the image is Tesla.",
        "faithful_rate": true,
        "relevance_rate": true
    }
]
Notes for This Scenario#
- critic_llm must use a model that supports multi-modal interleaved text-image input (e.g. gpt-4o).
- Multi-modal specific metrics are MultiModalFaithfulness and MultiModalRelevance.
- General metrics not involving images (e.g. AnswerCorrectness) can also be used. Refer to the metrics list.
- embeddings is optional and may be omitted.
Complete Parameter Reference#
General Configuration#
task_cfg = {
    "eval_backend": "RAGEval",  # Fixed value
    "eval_config": {
        "tool": "RAGAS",  # Fixed value
        "eval": { ... },  # Evaluation config (Scenario 1/3)
        "testset_generation": { ... }  # Generation config (Scenario 2)
    },
}
eval Parameters (RAGASEvalConfig)#
| Parameter | Type | Default | Description |
|---|---|---|---|
| testset_file | | - | Path to the evaluation dataset file |
| critic_llm | | - | Evaluation LLM configuration (see LLM Config table) |
| embeddings | | - | Embedding model configuration (see Embedding Config table) |
| metrics | | | Evaluation metrics, refer to metrics list |
| language | | | Language setting |
| batch_size | | | Batch size |
| | | | Whether to raise exceptions |
testset_generation Parameters (RAGASTestsetConfig)#
| Parameter | Type | Default | Description |
|---|---|---|---|
| docs | | - | List of source document paths |
| test_size | | | Size of the generated test set |
| output_file | | | Output file path |
| generator_llm | | - | Generator LLM configuration (see LLM Config table) |
| embeddings | | - | Embedding model configuration (see Embedding Config table) |
| language | | | Language setting |
generator_llm / critic_llm Parameters (RAGASLLMConfig)#
| Parameter | Type | Default | Description |
|---|---|---|---|
| model_name | | - | Model name, e.g. |
| | | | Provider |
| api_base | | | API base URL, e.g. |
| api_key | | | API key |
| | | | Generation temperature |
| | | | Maximum number of tokens |
embeddings Parameters (RAGASEmbeddingConfig)#
| Parameter | Type | Default | Description |
|---|---|---|---|
| model_name_or_path | | - | Model name or path, e.g. |
| | | | Provider |
| | | | API base URL (when using API embeddings) |
| | | | API key |
FAQ#
LLM Errors / API Timeout#
- Ensure the LLM service is running and the api_base URL is accessible.
- If using a local model (e.g. deployed with vLLM), confirm the model has finished loading.
- For timeout issues, try increasing the client timeout or reducing batch_size.
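One way to confirm that the service is reachable is to request the model listing before starting an evaluation. The sketch below assumes the server exposes the OpenAI-compatible `/models` route under `api_base` (vLLM's OpenAI-compatible server does); `models_url` and `is_reachable` are hypothetical helpers, and a server that requires an API key would answer with an error status and be reported as unreachable here.

```python
import urllib.error
import urllib.request


def models_url(api_base):
    """Build the model-listing URL for an OpenAI-compatible api_base."""
    return api_base.rstrip("/") + "/models"


def is_reachable(api_base, timeout=5.0):
    """Return True if the model-listing endpoint answers with HTTP 200."""
    try:
        with urllib.request.urlopen(models_url(api_base), timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, DNS failure, or a non-2xx HTTP status.
        return False


print(models_url("http://127.0.0.1:8000/v1"))  # http://127.0.0.1:8000/v1/models
```

Calling `is_reachable("http://127.0.0.1:8000/v1")` before `run_task` gives an early signal that the endpoint is down or the model is still loading.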
Metrics Explained#
| Metric | Evaluation Dimension | Requires reference |
|---|---|---|
| Faithfulness | Whether the answer is grounded in context (no hallucination) | No |
| AnswerRelevancy | Whether the answer is relevant to the question | No |
| ContextPrecision | Proportion of relevant information in context | Yes |
| ContextRecall | Completeness of retrieved information for answering | Yes |
| AnswerCorrectness | Agreement between answer and ground truth | Yes |
| MultiModalFaithfulness | Whether multi-modal answer is grounded in text-image context | No |
| MultiModalRelevance | Whether multi-modal answer is relevant to the question | No |
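The "Requires reference" column can be used to pick metrics for a dataset that has no ground truth answers. The sketch below encodes that column as a lookup, pairing the metric names used elsewhere in this document with the rows in order; `usable_metrics` is a hypothetical helper, and unknown metric names are conservatively treated as needing a reference.

```python
# Whether each metric needs the `reference` field, per the table above.
REQUIRES_REFERENCE = {
    "Faithfulness": False,
    "AnswerRelevancy": False,
    "ContextPrecision": True,
    "ContextRecall": True,
    "AnswerCorrectness": True,
    "MultiModalFaithfulness": False,
    "MultiModalRelevance": False,
}


def usable_metrics(requested, has_reference):
    """Keep only the metrics that can run with the data available."""
    return [
        name
        for name in requested
        if has_reference or not REQUIRES_REFERENCE.get(name, True)
    ]


requested = ["Faithfulness", "AnswerRelevancy", "ContextRecall", "AnswerCorrectness"]
print(usable_metrics(requested, has_reference=False))
# ['Faithfulness', 'AnswerRelevancy']
```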
Model Selection Guide#
- Evaluation model (critic_llm): Instruction-tuned models with 70B+ parameters or proprietary models (e.g. GPT-4o) are recommended for stable and reliable scoring.
- Generator model (generator_llm): 70B+ parameters are also recommended; smaller models are prone to output format errors that cause generation failures.
- Embedding model (embeddings): Lightweight models such as AI-ModelScope/m3e-base or AI-ModelScope/bge-large-zh work well.