RAGAS#

This framework supports RAGAS (Retrieval Augmented Generation Assessment), a dedicated framework for evaluating the performance of Retrieval Augmented Generation (RAG) pipelines. Its core evaluation metrics include:

  • Faithfulness: Measures the factual consistency of the generated answer with the retrieved context, ensuring the truthfulness and reliability of the generated content (a formula sketch follows this list).

  • Answer Relevance: Assesses the relevance of answers to the given questions, verifying whether the response directly addresses the user’s query.

  • Context Precision: Evaluates the precision of the context used to generate answers, ensuring that relevant information is selected from the context.

  • Context Relevancy: Measures the relevance of the chosen context to the question, helping improve context selection and thereby answer accuracy.

  • Context Recall: Evaluates the completeness of retrieved relevant contextual information, ensuring that no key context is omitted.
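
As a rough sketch of how the first of these metrics is scored (assuming the standard RAGAS definition, which this guide does not spell out), faithfulness is the fraction of claims in the generated answer that can be inferred from the retrieved context:

\text{Faithfulness} = \frac{\lvert \text{answer claims supported by the retrieved context} \rvert}{\lvert \text{claims in the answer} \rvert}

A score of 1 means every claim in the answer is grounded in the retrieved context.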

In addition, RAGAS provides tools for automatically generating test data.

Environment Setup#

Install the dependencies:

pip install ragas

RAG Evaluation#

Dataset Preparation#

An example of the evaluation dataset is as follows:

[
    {
        "user_input": "When was the first Olympic Games held?",
        "retrieved_contexts": [
            "The first modern Olympic Games were held from April 6 to April 15, 1896, in Athens, Greece."
        ],
        "response": "The first modern Olympic Games were held on April 6, 1896.",
        "reference": "The first modern Olympic Games opened on April 6, 1896, in Athens, Greece."
    },
    {
        "user_input": "Which athlete has won the most Olympic gold medals?",
        "retrieved_contexts": [
            "Michael Phelps is the athlete with the most Olympic gold medals in history, having won a total of 23 gold medals."
        ],
        "response": "Michael Phelps has won the most Olympic gold medals.",
        "reference": "Michael Phelps is the athlete with the most Olympic gold medals, having won a total of 23 gold medals."
    }
]

The required fields are as follows (a minimal sketch for preparing such a file appears after this list):

  • user_input: User input (the question)

  • response: Model-generated answer

  • retrieved_contexts: List of retrieved contexts

  • reference: Reference (standard) answer

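Below is a minimal sketch, not part of this framework, of how such a file might be assembled and sanity-checked before evaluation; the output path outputs/my_testset.json is illustrative only.

import json
import os

# Fields that every evaluation sample must carry.
REQUIRED_FIELDS = {"user_input", "response", "retrieved_contexts", "reference"}

samples = [
    {
        "user_input": "When was the first Olympic Games held?",
        "retrieved_contexts": [
            "The first modern Olympic Games were held from April 6 to April 15, 1896, in Athens, Greece."
        ],
        "response": "The first modern Olympic Games were held on April 6, 1896.",
        "reference": "The first modern Olympic Games opened on April 6, 1896, in Athens, Greece.",
    },
]

# Fail early if any sample is missing a required field.
for sample in samples:
    missing = REQUIRED_FIELDS - set(sample)
    assert not missing, f"sample is missing fields: {missing}"

os.makedirs("outputs", exist_ok=True)
with open("outputs/my_testset.json", "w", encoding="utf-8") as f:
    json.dump(samples, f, ensure_ascii=False, indent=4)
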
Automatically Generating Datasets#

RAGAS provides functionality for automatically generating test datasets. Users can specify parameters such as the test set size, data distribution, and generator LLM to produce a test dataset automatically. The specific steps are as follows:

[Figure: test set generation process]

Ragas takes a novel approach to generating evaluation data. An ideal evaluation dataset should cover the various types of questions encountered in real-world applications, including questions of different difficulty levels. LLMs on their own struggle to create such diverse samples, since they tend to follow common paths. Inspired by works such as Evol-Instruct, Ragas adopts an evolutionary generation paradigm in which questions with different characteristics (such as reasoning, conditioning, and multiple contexts) are systematically constructed from the provided set of documents. This approach ensures comprehensive coverage of the performance of the various components in your pipeline, making the evaluation process more robust.

Configure the Task

generate_testset_task_cfg = {
    "eval_backend": "RAGEval",
    "eval_config": {
        "tool": "RAGAS",
        "testset_generation": {
            "docs": ["README.md"],
            "test_size": 10,
            "output_file": "outputs/testset.json",
            "knowledge_graph", "outputs/knowledge_graph.json",
            "distribution": {"simple": 0.5, "multi_context": 0.4, "reasoning": 0.1},
            "generator_llm": {
                "model_name_or_path": "Qwen/Qwen2.5-72B-Instruct-GPTQ-Int4",
                "template_type": "qwen",
            },
            "embeddings": {
                "model_name_or_path": "AI-ModelScope/m3e-base",
            },
            "language": "english"
        }
    },
}

Configuration file description:

  • eval_backend: str: The name of the evaluation backend, “RAGEval”.

  • eval_config: dict: Detailed information of the evaluation configuration.

    • tool: str: The name of the evaluation tool, “RAGAS”.

    • testset_generation: dict: Configuration for test set generation.

      • docs: list: List of documents required for test set generation, e.g., [“README.md”].

      • test_size: int: Size of the generated test set, e.g., 5.

      • output_file: str: Path of the generated dataset output file, e.g., “outputs/testset.json”.

      • knowledge_graph: str: File path for the knowledge graph, e.g., “outputs/knowledge_graph.json”. The knowledge graph generated while processing the documents is saved to this path; if a knowledge graph already exists at this path, it is loaded directly and the generation step is skipped.

      • distribution: dict: Configuration of the content distribution in the test set.

        • simple: float: Proportion of simple content, e.g., 0.5.

        • multi_context: float: Proportion of multi-context content, e.g., 0.4.

        • reasoning: float: Proportion of reasoning content, e.g., 0.1.

      • generator_llm: dict: Configuration of the generator LLM:

        • If using a local model, the following parameters are supported:

          • model_name_or_path: str: Name or path of the generator model, e.g., “qwen/Qwen2-7B-Instruct”, which will be downloaded automatically from ModelScope; if a local path is provided, the model is loaded from that path.

          • template_type: str: Template type, e.g., “qwen”.

          • generation_config: dict: Generation configuration, e.g., {"temperature": 0.7}.

        • If using an API model, the following parameters are supported (see the sketch after this list):

          • model_name: str: Name of the custom model.

          • api_base: str: Base URL of the custom API, e.g., “http://127.0.0.1:8000”.

          • api_key: Optional[str]: Your API key, default is “EMPTY”.

      • embeddings: dict: Configuration of the embedding model.

        • model_name_or_path: str: Name or path of the embedding model, e.g., “AI-ModelScope/m3e-base”.

      • language: str: Language; defaults to “english” and can be set to another language such as “chinese”. The generator_llm will automatically translate prompts into the target language. The framework has pre-translated some prompts using Qwen2.5-72B-Instruct.

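For reference, here is a hedged sketch of a generator_llm block backed by an OpenAI-compatible API service, using the API parameters described above; the model name and URL are placeholders rather than values prescribed by this guide.

generator_llm = {
    "model_name": "my-served-model",      # name of the model exposed by your API service (placeholder)
    "api_base": "http://127.0.0.1:8000",  # base URL of the custom API (placeholder)
    "api_key": "EMPTY",                   # defaults to "EMPTY"; set it if your service requires a key
}

This dict can replace the local-model generator_llm settings in the testset_generation configuration above.
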
Note

generator_llm requires a model with strong instruction-following capabilities. Models with 7B parameters or fewer may encounter the following error during execution:

ragas.testset.transforms.engine - ERROR - unable to apply transformation: 'Generation' object has no attribute 'message'

This is because the model output format is incorrect, leading to parsing errors. In this case, please try using a larger model, such as Qwen/Qwen2.5-72B-Instruct-GPTQ-Int4, or proprietary models like GPT-4o.

Execute the Task

from evalscope.run import run_task
from evalscope.utils.logger import get_logger
logger = get_logger()

# Run task
run_task(task_cfg=generate_testset_task_cfg)

The following examples were generated from the project’s README.md:

[
    {
        "user_input":"What are the key sectoins covered in the documet's table of contens?",
        "retrieved_contexts":[
            "## \ud83d\udccb Table of Contents - [Introduction](#introduction) - [News](#News) - [Installation](#installation) - [Quick Start](#quick-start) - [Evaluation Backend](#evaluation-backend) - [Custom Dataset Evaluation](#custom-dataset-evaluation) - [Offline Evaluation](#offline-evaluation) - [Arena Mode](#arena-mode) - [Model Serving Performance Evaluation](#Model-Serving-Performance-Evaluation) - [Leaderboard](#leaderboard) "
        ],
        "response":"The document's table of contents covers sections such as Introduction, News, Installation, Quick Start, Evaluation Backend, Custom Dataset Evaluation, Offline Evaluation, Arena Mode, Model Serving Performance Evaluation, and Leaderboard.",
        "reference":"The key sections covered in the document's table of contents include Introduction, News, Installation, Quick Start, Evaluation Backend, Custom Dataset Evaluation, Offline Evaluation, Arena Mode, Model Serving Performance Evaluation, and Leaderboard."
    },
    {
        "user_input":"What EvalScope mean for model eval?",
        "retrieved_contexts":[
            "## \ud83d\udcdd Introduction EvalScope is the official model evaluation and performance benchmarking framework launched by the [ModelScope](https:\/\/modelscope.cn\/) community. It comes with built-in common benchmarks and evaluation metrics, such as MMLU, CMMLU, C-Eval, GSM8K, ARC, HellaSwag, TruthfulQA, MATH, and HumanEval. EvalScope supports various types of model evaluations, including LLMs, multimodal LLMs, embedding models, and reranker models. It is also applicable to multiple evaluation scenarios, such as end-to-end RAG evaluation, arena mode, and model inference performance stress testing. Moreover, with the seamless integration of the ms-swift training framework, evaluations can be initiated with a single click, providing full end-to-end support from model training to evaluation \ud83d\ude80"
        ],
        "response":"EvalScope is a framework by ModelScope for evaluating and benchmarking model performance, supporting a wide range of models and evaluation scenarios. It integrates common benchmarks and metrics, and offers seamless evaluation processes from training to performance testing.",
        "reference":"EvalScope is a framework for model evaluation and performance benchmarking, supporting various types of models and evaluation scenarios, and providing seamless integration with the ms-swift training framework for easy and comprehensive model evaluation."
    },
    {
        "user_input":"key components architecture contribution evaluation performance measurement models",
        "retrieved_contexts":[
            "The architecture includes the following modules: 1. **Model Adapter**: The model adapter is used to convert the outputs of specific models into the format required by the framework, supporting both API call models and locally run models. 2. **Data Adapter**: The data adapter is responsible for converting and processing input data to meet various evaluation needs and formats. 3. **Evaluation Backend**: - **Native**: EvalScope\u2019s own **default evaluation framework**, supporting various evaluation modes, including single model evaluation, arena mode, baseline model comparison mode, etc. - **OpenCompass**: Supports [OpenCompass](https:\/\/github.com\/open-compass\/opencompass) as the evaluation backend, providing advanced encapsulation and task simplification, allowing you to submit tasks for evaluation more easily. - **VLMEvalKit**: Supports [VLMEvalKit](https:\/\/github.com\/open-compass\/VLMEvalKit) as the evaluation backend, enabling easy initiation of multi-modal evaluation tasks, supporting various multi-modal models and datasets. - **RAGEval**: Supports RAG evaluation, supporting independent evaluation of embedding models and rerankers using [MTEB\/CMTEB](https:\/\/evalscope.readthedocs.io\/en\/latest\/user_guides\/backend\/rageval_backend\/mteb.html), as well as end-to-end evaluation using [RAGAS](https:\/\/evalscope.readthedocs.io\/en\/latest\/user_guides\/backend\/rageval_backend\/ragas.html). - **ThirdParty**: Other third-party evaluation tasks, such as ToolBench. 4. **Performance Evaluator**: Model performance evaluation, responsible for measuring model inference service performance, including performance testing, stress testing, performance report generation, and visualization. 5. **Evaluation Report**: The final generated evaluation report summarizes the model's performance, which can be used for decision-making and further model optimization. 6. **Visualization**: Visualization results help users intuitively understand evaluation results, facilitating analysis and comparison of different model performances. ## "
        ],
        "response":"The key components of the architecture include the Model Adapter, Data Adapter, Evaluation Backend (with options like Native, OpenCompass, VLMEvalKit, RAGEval, and ThirdParty), Performance Evaluator, Evaluation Report, and Visualization. These components collectively support the conversion of model outputs, data processing, performance measurement, and result visualization.",
        "reference":"The key components of the architecture include the Model Adapter, Data Adapter, Evaluation Backend, Performance Evaluator, Evaluation Report, and Visualization. The Model Adapter converts model outputs to the required format, the Data Adapter processes input data, the Evaluation Backend supports various evaluation frameworks, the Performance Evaluator measures model performance, the Evaluation Report summarizes the results, and Visualization helps in understanding and comparing model performances."
    },
    {
        "user_input":"How do the features and capabilities of the EvalScope Framework compare across different reports, particularly in terms of model evaluation, performance benchmarking, and support for various evaluation scenarios such as RAG evaluation and custom dataset evaluations?",
        "retrieved_contexts":[
            "EvalScope Framework is a tool designed to evaluate and improve the scope and effectiveness of AI models and systems.",
            "The provided text appears to be a navigation or menu option, offering documents in both English and Chinese.",
            "EvalScope, launched by the ModelScope community, is a framework for model evaluation and performance benchmarking. It includes common benchmarks and metrics for various models like LLMs, multimodal LLMs, and embedding models. EvalScope supports multiple evaluation scenarios, including RAG evaluation, arena mode, and model inference performance testing. The framework integrates with the ms-swift training framework, allowing for easy one-click evaluations from training to performance testing.",
            "The text provides language options: English and Simplified Chinese.",
            "EvalScope is a comprehensive framework for evaluating AI models, featuring modules like Model Adapter, Data Adapter, and Evaluation Backend. It supports various backends such as Native, OpenCompass, VLMEvalKit, RAGEval, and ThirdParty for different evaluation needs. The framework also includes a Performance Evaluator, Evaluation Report, and Visualization tools. Recent updates include support for RAG evaluation, custom dataset evaluations, and long output generation. Installation can be done via pip or from source, and quick start guides are available for simple and parameterized evaluations."
        ],
        "response":"EvalScope Framework supports a wide range of evaluation scenarios, including RAG evaluation and custom dataset evaluations, with comprehensive features for model evaluation and performance benchmarking. It offers various backends and tools for different needs, ensuring flexibility and ease of use.",
        "reference":"EvalScope Framework is a comprehensive tool for model evaluation and performance benchmarking, supporting various models and multiple evaluation scenarios. It includes features like RAG evaluation, custom dataset evaluations, and performance testing. The framework is integrated with the ms-swift training framework and supports different backends for diverse evaluation needs, including Native, OpenCompass, VLMEvalKit, RAGEval, and ThirdParty. It also provides Performance Evaluator, Evaluation Report, and Visualization tools, making it a versatile solution for evaluating AI models."
    },
    {
        "user_input":"How do the features and functionalities of EvalScope Framework, as described in the provided summaries, compare in terms of model evaluation, performance benchmarking, and support for different languages and evaluation scenarios?",
        "retrieved_contexts":[
            "EvalScope Framework is a tool designed to evaluate and improve the scope and effectiveness of AI models and systems.",
            "The provided text appears to be a navigation or menu option, offering documents in both English and Chinese.",
            "EvalScope, launched by the ModelScope community, is a framework for model evaluation and performance benchmarking. It includes common benchmarks and metrics for various models like LLMs, multimodal LLMs, and embedding models. EvalScope supports multiple evaluation scenarios, including RAG evaluation, arena mode, and model inference performance testing. The framework integrates with the ms-swift training framework, allowing for easy one-click evaluations from training to performance testing.",
            "The text provides language options: English and Simplified Chinese.",
            "EvalScope is a comprehensive framework for evaluating AI models, featuring modules like Model Adapter, Data Adapter, and Evaluation Backend. It supports various backends such as Native, OpenCompass, VLMEvalKit, RAGEval, and ThirdParty for different evaluation needs. The framework also includes a Performance Evaluator, Evaluation Report, and Visualization tools. Recent updates include support for RAG evaluation, custom dataset evaluations, and long output generation. Installation can be done via pip or from source, and quick start guides are available for simple and parameterized evaluations."
        ],
        "response":"EvalScope Framework offers comprehensive model evaluation and performance benchmarking, supporting a wide range of models and evaluation scenarios, including RAG, arena mode, and performance testing, with multi-language support. It integrates multiple backends and tools for diverse evaluation needs, and recent updates enhance its functionality with RAG evaluation, custom datasets, and long output generation.",
        "reference":"EvalScope Framework is a comprehensive tool for model evaluation and performance benchmarking, supporting a wide range of models including LLMs, multimodal LLMs, and embedding models. It offers multiple evaluation scenarios such as RAG evaluation, arena mode, and model inference performance testing. The framework supports various backends and includes features like Model Adapter, Data Adapter, and Evaluation Backend. It also provides Performance Evaluator, Evaluation Report, and Visualization tools, with recent updates enhancing support for RAG evaluation, custom dataset evaluations, and long output generation. EvalScope is available in both English and Simplified Chinese, making it accessible for different language users."
    },
    {
        "user_input":"How do the documentation and community support for EvalScope Framework compare in terms of language options, evaluation features, and community contributions, as described in the provided summaries?",
        "retrieved_contexts":[
            "EvalScope Framework is a tool designed to evaluate and improve the scope and effectiveness of AI models and systems.",
            "The provided text appears to be a navigation or menu option, offering documents in both English and Chinese.",
            "EvalScope, launched by the ModelScope community, is a framework for model evaluation and performance benchmarking. It includes common benchmarks and metrics for various models like LLMs, multimodal LLMs, and embedding models. EvalScope supports multiple evaluation scenarios, including RAG evaluation, arena mode, and model inference performance testing. The framework integrates with the ms-swift training framework, allowing for easy one-click evaluations from training to performance testing.",
            "The text provides language options: English and Simplified Chinese.",
            "EvalScope is a comprehensive framework for evaluating AI models, featuring modules like Model Adapter, Data Adapter, and Evaluation Backend. It supports various backends such as Native, OpenCompass, VLMEvalKit, RAGEval, and ThirdParty for different evaluation needs. The framework also includes a Performance Evaluator, Evaluation Report, and Visualization tools. Recent updates include support for RAG evaluation, custom dataset evaluations, and long output generation. Installation can be done via pip or from source, and quick start guides are available for simple and parameterized evaluations."
        ],
        "response":"EvalScope Framework offers documentation in English and Simplified Chinese, supports a wide range of evaluation features including RAG evaluation and performance testing, and has community contributions through the ModelScope community, including integrations and updates.",
        "reference":"EvalScope Framework offers documentation in both English and Chinese, supporting a wide range of evaluation features including common benchmarks, metrics for various models, and multiple evaluation scenarios. The community contributions are evident through the integration with the ms-swift training framework, recent updates, and the availability of quick start guides. This comprehensive framework supports various backends and includes tools for performance evaluation, reporting, and visualization."
    },
    {
        "user_input":"What is the purpose of offline evaluation in the context of using a local dataset?",
        "retrieved_contexts":[
            "Offline Evaluation You can use local dataset to evaluate the model without internet connection. Refer to: Offline Evaluation [\ud83d\udcd6 User Guide](https:\/\/evalscope.readthedocs.io\/en\/latest\/user_guides\/offline_evaluation.html) ## "
        ],
        "response":"The purpose of offline evaluation with a local dataset is to assess model performance without needing internet access. This method allows for testing in environments where connectivity is limited or restricted.",
        "reference":"The purpose of offline evaluation in the context of using a local dataset is to evaluate the model without an internet connection."
    },
    {
        "user_input":"What is the purpose of Arena mode in evaluating candidate models?",
        "retrieved_contexts":[
            "Arena Mode The Arena mode allows multiple candidate models to be evaluated through pairwise battles, and can choose to use the AI Enhanced Auto-Reviewer (AAR) automatic evaluation process or manual evaluation to obtain the evaluation report. Refer to: Arena Mode [\ud83d\udcd6 User Guide](https:\/\/evalscope.readthedocs.io\/en\/latest\/user_guides\/arena.html) ## "
        ],
        "response":"The purpose of Arena mode is to evaluate multiple candidate models by setting them against each other in pairwise comparisons, using either automatic or manual evaluation methods to generate reports.",
        "reference":"The purpose of Arena mode is to evaluate multiple candidate models through pairwise battles, using either the AI Enhanced Auto-Reviewer (AAR) automatic evaluation process or manual evaluation to obtain the evaluation report."
    },
    {
        "user_input":"What does the LLM Leaderboard aim to evaluate and compare in terms of the performance of models?",
        "retrieved_contexts":[
            "Leaderboard The LLM Leaderboard aims to provide an objective and comprehensive evaluation standard and platform to help researchers and developers understand and compare the performance of models on various tasks on ModelScope. Refer to : [Leaderboard](https:\/\/modelscope.cn\/leaderboard\/58\/ranking?type=free) ## "
        ],
        "response":"The LLM Leaderboard aims to evaluate and compare the performance of language models on various tasks, providing an objective and comprehensive platform for researchers and developers.",
        "reference":"The LLM Leaderboard aims to evaluate and compare the performance of models on various tasks on ModelScope."
    },
    {
        "user_input":"What are the different API protocol formats supported by the stress testing tool for model serving performance evaluation?",
        "retrieved_contexts":[
            "Model Serving Performance Evaluation A stress testing tool that focuses on large language models and can be customized to support various data set formats and different API protocol formats. Refer to : Model Serving Performance Evaluation [\ud83d\udcd6 User Guide](https:\/\/evalscope.readthedocs.io\/en\/latest\/user_guides\/stress_test.html) ## "
        ],
        "response":"The stress testing tool for model serving performance evaluation supports various API protocol formats, but specific formats are not detailed in the provided context. Refer to the user guide for more detailed information.",
        "reference":"The stress testing tool for model serving performance evaluation supports various data set formats and different API protocol formats."
    }
]

Configuration for Evaluation Task#

eval_task_cfg = {
    "eval_backend": "RAGEval",
    "eval_config": {
        "tool": "RAGAS",
        "eval": {
            "testset_file": "outputs/testset_with_answer.json",
            "critic_llm": {
                "model_name_or_path": "Qwen/Qwen2.5-72B-Instruct-GPTQ-Int4",
                "template_type": "qwen",
            },
            "embeddings": {
                "model_name_or_path": "AI-ModelScope/m3e-base",
            },
            "metrics": [
                "Faithfulness",
                "AnswerRelevancy",
                "ContextPrecision",
                "AnswerCorrectness",
            ],
            "language": "english"
        },
    },
}

The basic parameters are consistent with those of the test set generation task above, with the following differences:

  • critic_llm: Configuration for the evaluation (critic) model; the parameters are the same as for generator_llm.

  • testset_file: Specifies the path to the evaluation dataset file; the default is outputs/testset_with_answer.json (a sketch for producing this file from the generated test set follows this list).

  • metrics: Specifies the evaluation metrics; refer to the metrics list.

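As a hedged sketch of how the evaluation file might be produced: run your own RAG system over the questions in the generated outputs/testset.json and fill in the response (and, if desired, retrieved_contexts) fields. The my_rag_pipeline function below is a hypothetical placeholder for your retrieval-and-generation logic, not part of EvalScope or RAGAS.

import json

def my_rag_pipeline(question: str) -> dict:
    # Hypothetical placeholder: replace with your own retrieval + generation.
    return {"contexts": ["...retrieved context..."], "answer": "...generated answer..."}

with open("outputs/testset.json", "r", encoding="utf-8") as f:
    testset = json.load(f)

for sample in testset:
    result = my_rag_pipeline(sample["user_input"])
    sample["retrieved_contexts"] = result["contexts"]  # what your system actually retrieved
    sample["response"] = result["answer"]              # the answer your system generated

with open("outputs/testset_with_answer.json", "w", encoding="utf-8") as f:
    json.dump(testset, f, ensure_ascii=False, indent=4)
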
Execute Evaluation#

from evalscope.run import run_task
from evalscope.utils.logger import get_logger
logger = get_logger()

# Run task
run_task(task_cfg=eval_task_cfg)

Evaluation Results#

The output evaluation results are as follows:

[Figure: evaluation results]

Multi-Modal Text-Image RAG Evaluation#

This framework extends RAG evaluation to the multi-modal text-image setting, where retrieved contexts may include images as well as text.

Dataset Preparation#

[
    {
        "user_input": "What is the brand of the car in the image?",
        "retrieved_contexts": [
            "custom_eval/multimodal/images/tesla.jpg"
        ],
        "response": "Tesla is a car brand.",
        "reference": "The car brand in the image is Tesla."
    },
    {
        "user_input": "What about Tesla Model X?",
        "retrieved_contexts": [
            "custom_eval/multimodal/images/tesla.jpg"
        ],
        "response": "Cats are cute.",
        "reference": "The Tesla Model X is an electric SUV manufactured by Tesla."
    }
]

The required fields are as follows:

  • user_input: User input (the question)

  • response: Model-generated answer

  • retrieved_contexts: List of retrieved contexts, which can be text or image paths (local or URL); see the sketch after this list

  • reference: Reference (standard) answer

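Since retrieved contexts may be text or image paths (local or URL), a single sample’s context list could, under that description, mix both, as in the hedged sketch below; the URL and wording are illustrative only.

retrieved_contexts = [
    "https://example.com/images/tesla_model_x.jpg",                 # image context referenced by URL (placeholder)
    "The Tesla Model X is an electric SUV manufactured by Tesla.",  # plain-text context
]
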
Configuring the Evaluation Task#

multi_modal_task_cfg = {
    "eval_backend": "RAGEval",
    "eval_config": {
        "tool": "RAGAS",
        "eval": {
            "testset_file": "outputs/testset_multi_modal.json",
            "critic_llm": {
                "model_name": "gpt-4o",
                "api_base": "http://127.0.0.1:8088/v1",
                "api_key": "EMPTY",
            },
            "embeddings": {
                "model_name_or_path": "AI-ModelScope/bge-large-zh",
            },
            "metrics": [
                "MultiModalFaithfulness",
                "MultiModalRelevance",
            ],
        },
    },
}

The basic parameters are consistent with the RAG evaluation task configuration. The differences are as follows:

  • critic_llm: Configuration for the evaluation model; the parameters are the same as for generator_llm. The model must support interleaved multi-modal text-image input.

  • metrics: Specifies the evaluation metrics. Currently supports MultiModalFaithfulness and MultiModalRelevance for multi-modal text-image evaluation. For other metrics that do not involve multi-modal data, such as AnswerCorrectness, please refer to the metrics list.

  • embeddings is an optional parameter and may be omitted.

Running the Evaluation#

from evalscope.run import run_task

run_task(task_cfg=multi_modal_task_cfg)

Output results:

[
    {
        "user_input":"What is the brand of the car in the image?",
        "retrieved_contexts":[
            "custom_eval/multimodal/images/tesla.jpg"
        ],
        "response":"Tesla is a car brand.",
        "reference":"The car brand in the image is Tesla.",
        "faithful_rate":true,
        "relevance_rate":true
    },
    {
        "user_input":"What about Tesla Model X?",
        "retrieved_contexts":[
            "custom_eval/multimodal/images/tesla.jpg"
        ],
        "response":"Cats are cute.",
        "reference":"The Tesla Model X is an electric SUV manufactured by Tesla.",
        "faithful_rate":false,
        "relevance_rate":false
    }
]