RAGAS RAG Evaluation#

RAGAS (Retrieval Augmented Generation Assessment) is a framework for evaluating Retrieval Augmented Generation (RAG) systems. Its core evaluation metrics are:

| Metric | Description |
| --- | --- |
| Faithfulness | Whether the answer is grounded in the retrieved context without hallucination |
| AnswerRelevancy | Whether the answer directly addresses the user’s question |
| ContextPrecision | Proportion of relevant information in the retrieved context |
| ContextRecall | Whether all information needed to answer was successfully retrieved |
| AnswerCorrectness | Degree of agreement between generated answer and ground truth |

Additionally, RAGAS supports automatic test dataset generation from documents and multi-modal text-image RAG evaluation.

Environment Setup#

pip install 'evalscope[rag]' -U

Scenario 1: Evaluate RAG with Existing Dataset (Quick Start)#

Use when: You already have question/answer/context data and want to evaluate RAG system quality.

Data Format#

The evaluation data is a JSON file where each record contains the following fields:

| Field | Description |
| --- | --- |
| user_input | User question |
| response | Answer generated by the RAG system |
| retrieved_contexts | List of retrieved context passages |
| reference | Ground truth answer |

Example:

[
    {
        "user_input": "When was the first Olympic Games held?",
        "retrieved_contexts": [
            "The first modern Olympic Games were held from April 6 to April 15, 1896, in Athens, Greece."
        ],
        "response": "The first modern Olympic Games were held on April 6, 1896.",
        "reference": "The first modern Olympic Games opened on April 6, 1896, in Athens, Greece."
    }
]
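Before running an evaluation, it can help to check that every record carries the fields listed above. The following is a minimal sketch, not part of evalscope: the helper names `validate_records` and `validate_file` are hypothetical, and treating all four fields as required for every record is an assumption (some metrics may not need `reference`).

```python
import json

# Field names taken from the data-format table above.
REQUIRED_FIELDS = ("user_input", "response", "retrieved_contexts", "reference")


def validate_records(records):
    """Return a list of readable problems; an empty list means the data looks fine."""
    problems = []
    for i, record in enumerate(records):
        for field in REQUIRED_FIELDS:
            if field not in record:
                problems.append(f"record {i}: missing field '{field}'")
        contexts = record.get("retrieved_contexts")
        if contexts is not None and not (
            isinstance(contexts, list) and all(isinstance(c, str) for c in contexts)
        ):
            problems.append(f"record {i}: 'retrieved_contexts' must be a list of strings")
    return problems


def validate_file(path):
    """Load a JSON testset file and validate every record in it."""
    with open(path, encoding="utf-8") as f:
        return validate_records(json.load(f))


sample = [
    {
        "user_input": "When was the first Olympic Games held?",
        "retrieved_contexts": [
            "The first modern Olympic Games were held from April 6 to April 15, 1896, in Athens, Greece."
        ],
        "response": "The first modern Olympic Games were held on April 6, 1896.",
        "reference": "The first modern Olympic Games opened on April 6, 1896, in Athens, Greece.",
    }
]
print(validate_records(sample))  # prints []
```

Calling `validate_file("outputs/testset.json")` on a real file returns the same kind of problem list.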

Configuration and Execution#

from evalscope.run import run_task

eval_task_cfg = {
    "eval_backend": "RAGEval",
    "eval_config": {
        "tool": "RAGAS",
        "eval": {
            "testset_file": "outputs/testset.json",
            "critic_llm": {
                "model_name": "Qwen/Qwen2.5-72B-Instruct-GPTQ-Int4",
                "api_base": "http://127.0.0.1:8000/v1",
            },
            "embeddings": {
                "model_name_or_path": "AI-ModelScope/m3e-base",
            },
            "metrics": [
                "Faithfulness",
                "AnswerRelevancy",
                "ContextPrecision",
                "AnswerCorrectness",
            ],
            "language": "english",
        },
    },
}

run_task(task_cfg=eval_task_cfg)

Understanding Results#

After evaluation completes, results are displayed as follows:

[Figure: Evaluation Results (eval_result.png)]

Each data entry includes a score for every selected metric, and the overall report shows the average of each metric across all entries.
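To make the aggregation concrete, here is a minimal sketch of how a per-metric average can be derived from per-entry scores. The score values, the lowercase column names, and the choice to skip entries whose score is missing are illustrative assumptions, not the exact behavior of evalscope or RAGAS.

```python
from statistics import mean


def average_scores(entries, metric_names):
    """Average each metric over the entries that have a score for it."""
    averages = {}
    for name in metric_names:
        values = [e[name] for e in entries if e.get(name) is not None]
        # None signals that no entry produced a usable score for this metric.
        averages[name] = mean(values) if values else None
    return averages


# Hypothetical per-entry scores; a failed judgement is represented as None.
entries = [
    {"faithfulness": 1.0, "answer_relevancy": 0.5},
    {"faithfulness": 0.5, "answer_relevancy": None},
]
print(average_scores(entries, ["faithfulness", "answer_relevancy"]))
# prints {'faithfulness': 0.75, 'answer_relevancy': 0.5}
```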

Key Parameters for This Scenario#

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| testset_file | str | - | Path to the evaluation dataset file |
| critic_llm | dict | - | LLM configuration for evaluation (see Parameter Reference) |
| embeddings | dict | - | Embedding model configuration (see Parameter Reference) |
| metrics | List[str] | ["answer_relevancy", "faithfulness"] | List of evaluation metrics |
| language | str | "english" | Language setting; set to "chinese" for Chinese evaluation |
| batch_size | Optional[int] | None | Batch size |

Scenario 2: Auto-Generate Test Dataset#

Use when: You don’t have existing evaluation data and want to automatically generate a question/answer/context test set from your documents.

[Figure: generation_process.png]

RAGAS uses an evolutionary generation paradigm inspired by Evol-Instruct, systematically constructing questions with different characteristics (reasoning, conditional, multi-context, etc.) from the provided documents to ensure comprehensive evaluation coverage.

Prepare Documents#

Prepare your source documents (Markdown, txt, and other formats are supported). Each document should contain sufficient content (more than 100 tokens is recommended); shorter documents may cause errors.

Configuration and Execution#

from evalscope.run import run_task

generate_testset_task_cfg = {
    "eval_backend": "RAGEval",
    "eval_config": {
        "tool": "RAGAS",
        "testset_generation": {
            "docs": ["README.md"],
            "test_size": 10,
            "output_file": "outputs/testset.json",
            "generator_llm": {
                "model_name": "Qwen/Qwen2.5-72B-Instruct-GPTQ-Int4",
                "api_base": "http://127.0.0.1:8000/v1",
            },
            "embeddings": {
                "model_name_or_path": "AI-ModelScope/m3e-base",
            },
            "language": "english",
        },
    },
}

run_task(task_cfg=generate_testset_task_cfg)

Once generation completes, data is saved to the path specified by output_file, in the same format as Scenario 1’s input.
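Because the generated file shares Scenario 1's format, the two scenarios can be chained. The sketch below derives an evaluation config from a generation config so that `testset_file` always points at the generated `output_file`. The helper name `eval_cfg_from_generation` is hypothetical; only dictionary keys shown in this guide are used.

```python
def eval_cfg_from_generation(generation_cfg, critic_llm, metrics):
    """Build a Scenario 1 config that evaluates the file produced by Scenario 2."""
    gen = generation_cfg["eval_config"]["testset_generation"]
    return {
        "eval_backend": "RAGEval",
        "eval_config": {
            "tool": "RAGAS",
            "eval": {
                "testset_file": gen["output_file"],  # reuse the generated dataset
                "critic_llm": critic_llm,
                "embeddings": gen["embeddings"],     # same embedding model as generation
                "metrics": list(metrics),
                "language": gen.get("language", "english"),
            },
        },
    }


generation_cfg = {
    "eval_backend": "RAGEval",
    "eval_config": {
        "tool": "RAGAS",
        "testset_generation": {
            "docs": ["README.md"],
            "test_size": 10,
            "output_file": "outputs/testset.json",
            "embeddings": {"model_name_or_path": "AI-ModelScope/m3e-base"},
            "language": "english",
        },
    },
}
critic_llm = {
    "model_name": "Qwen/Qwen2.5-72B-Instruct-GPTQ-Int4",
    "api_base": "http://127.0.0.1:8000/v1",
}

eval_cfg = eval_cfg_from_generation(generation_cfg, critic_llm, ["Faithfulness", "AnswerRelevancy"])
print(eval_cfg["eval_config"]["eval"]["testset_file"])  # prints outputs/testset.json
```

The resulting dictionary can then be passed to `run_task(task_cfg=eval_cfg)` in the same way as in Scenario 1.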

Key Parameters for This Scenario#

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| docs | List[str] | - | List of source document paths |
| test_size | int | 10 | Number of test entries to generate |
| output_file | str | "outputs/testset.json" | Output file path for generated dataset |
| generator_llm | dict | - | Generator LLM configuration (see Parameter Reference) |
| embeddings | dict | - | Embedding model configuration (see Parameter Reference) |
| language | str | "english" | Language setting, e.g. "chinese" |

Troubleshooting#

LLM Output Format Errors

Note

generator_llm requires a model with strong instruction-following capabilities. Models with 7B or fewer parameters may encounter the following error:

ragas.testset.transforms.engine - ERROR - unable to apply transformation: 'Generation' object has no attribute 'message'

This happens because smaller models produce outputs in an unexpected format, causing parsing failures. Solution: use a larger model (e.g. Qwen/Qwen2.5-72B-Instruct-GPTQ-Int4) or a proprietary model (e.g. GPT-4o).

Document Too Short

Tip

If you encounter the following error, the unstructured library found insufficient content in your documents:

ValueError: Documents appear to be too short (i.e., 100 tokens or less). Please provide longer documents.

Solution: ensure your source documents have sufficient content, or preprocess them into plain txt format before retrying.
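A quick pre-check can flag documents that are likely to trigger this error before generation starts. This is a rough sketch: it counts whitespace-separated words, which only approximates the tokenizer used internally, and the helper names are hypothetical.

```python
from pathlib import Path

MIN_TOKENS = 100  # threshold quoted in the error message above


def rough_token_count(text):
    """Approximate the token count by splitting on whitespace."""
    return len(text.split())


def find_short_documents(paths, min_tokens=MIN_TOKENS):
    """Return the paths whose content has min_tokens words or fewer."""
    short = []
    for path in paths:
        text = Path(path).read_text(encoding="utf-8", errors="ignore")
        if rough_token_count(text) <= min_tokens:
            short.append(path)
    return short
```

For example, `find_short_documents(["README.md"])` returns `["README.md"]` if that file is too short, and an empty list otherwise. Since the count is approximate, documents close to the threshold may still be rejected.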

Scenario 3: Multi-Modal Text-Image RAG Evaluation#

Use when: Evaluating RAG systems that involve image understanding, where the context may include images.

Data Format#

Same format as Scenario 1, except that retrieved_contexts can include image paths (local paths or URLs):

[
    {
        "user_input": "What is the brand of the car in the image?",
        "retrieved_contexts": [
            "custom_eval/multimodal/images/tesla.jpg"
        ],
        "response": "Tesla is a car brand.",
        "reference": "The car brand in the image is Tesla."
    }
]
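Since image contexts are given as paths or URLs, a missing local file is an easy mistake to make. The sketch below flags local image paths that do not exist. Detecting images by file extension, the extension list, and both helper names are assumptions made for illustration; they do not describe how RAGAS itself distinguishes images from text.

```python
import os
from urllib.parse import urlparse

# Assumed set of image extensions; adjust to match your data.
IMAGE_EXTENSIONS = {".jpg", ".jpeg", ".png", ".gif", ".bmp", ".webp"}


def is_image_reference(context):
    """Guess whether a context string is an image path or URL, based on its extension."""
    path = urlparse(context).path if "://" in context else context
    return os.path.splitext(path)[1].lower() in IMAGE_EXTENSIONS


def missing_local_images(records):
    """Return local image paths in retrieved_contexts that do not exist on disk."""
    missing = []
    for record in records:
        for context in record.get("retrieved_contexts", []):
            is_local = "://" not in context
            if is_image_reference(context) and is_local and not os.path.isfile(context):
                missing.append(context)
    return missing
```

Plain-text passages that happen to end in an image extension would be misclassified by this heuristic, so treat the result as a hint rather than a guarantee.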

Configuration and Execution#

from evalscope.run import run_task

multi_modal_task_cfg = {
    "eval_backend": "RAGEval",
    "eval_config": {
        "tool": "RAGAS",
        "eval": {
            "testset_file": "outputs/testset_multi_modal.json",
            "critic_llm": {
                "model_name": "gpt-4o",
                "api_base": "http://127.0.0.1:8088/v1",
                "api_key": "EMPTY",
            },
            "embeddings": {
                "model_name_or_path": "AI-ModelScope/bge-large-zh",
            },
            "metrics": [
                "MultiModalFaithfulness",
                "MultiModalRelevance",
            ],
        },
    },
}

run_task(task_cfg=multi_modal_task_cfg)

Example output:

[
    {
        "user_input": "What is the brand of the car in the image?",
        "retrieved_contexts": ["custom_eval/multimodal/images/tesla.jpg"],
        "response": "Tesla is a car brand.",
        "reference": "The car brand in the image is Tesla.",
        "faithful_rate": true,
        "relevance_rate": true
    }
]
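The boolean fields in this output can be summarized as pass rates. The following is a minimal sketch that assumes the output is a list of records shaped like the example above; the helper name `pass_rates` is hypothetical.

```python
def pass_rates(results, keys=("faithful_rate", "relevance_rate")):
    """Compute the fraction of records where each boolean field is true."""
    rates = {}
    for key in keys:
        # Ignore records where the field is absent or not a boolean.
        flags = [r[key] for r in results if isinstance(r.get(key), bool)]
        rates[key] = sum(flags) / len(flags) if flags else None
    return rates


# Hypothetical results for two records.
results = [
    {"faithful_rate": True, "relevance_rate": True},
    {"faithful_rate": False, "relevance_rate": True},
]
print(pass_rates(results))  # prints {'faithful_rate': 0.5, 'relevance_rate': 1.0}
```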

Notes for This Scenario#

  • critic_llm must use a model that supports multi-modal interleaved text-image input (e.g. gpt-4o).

  • Multi-modal specific metrics are MultiModalFaithfulness and MultiModalRelevance.

  • General metrics not involving images (e.g. AnswerCorrectness) can also be used. Refer to the metrics list.

  • embeddings is optional and may be omitted.

Complete Parameter Reference#

General Configuration#

task_cfg = {
    "eval_backend": "RAGEval",       # Fixed value
    "eval_config": {
        "tool": "RAGAS",             # Fixed value
        "eval": { ... },             # Evaluation config (Scenario 1/3)
        "testset_generation": { ... } # Generation config (Scenario 2)
    },
}

eval Parameters (RAGASEvalConfig)#

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| testset_file | str | - | Path to the evaluation dataset file |
| critic_llm | dict | - | Evaluation LLM configuration (see LLM Config table) |
| embeddings | dict | - | Embedding model configuration (see Embedding Config table) |
| metrics | List[str] | ["answer_relevancy", "faithfulness"] | Evaluation metrics, refer to metrics list |
| language | str | "english" | Language setting |
| batch_size | Optional[int] | None | Batch size |
| raise_exceptions | bool | False | Whether to raise exceptions |

testset_generation Parameters (RAGASTestsetConfig)#

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| docs | List[str] | - | List of source document paths |
| test_size | int | 10 | Size of the generated test set |
| output_file | str | "outputs/testset.json" | Output file path |
| generator_llm | dict | - | Generator LLM configuration (see LLM Config table) |
| embeddings | dict | - | Embedding model configuration (see Embedding Config table) |
| language | str | "english" | Language setting |

generator_llm / critic_llm Parameters (RAGASLLMConfig)#

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| model_name | str | - | Model name, e.g. "Qwen/Qwen2.5-72B-Instruct-GPTQ-Int4" |
| provider | str | "openai" | Provider |
| api_base | Optional[str] | None | API base URL, e.g. "http://127.0.0.1:8000/v1" |
| api_key | Optional[str] | None | API key |
| temperature | float | 0.0 | Generation temperature |
| max_tokens | Optional[int] | None | Maximum number of tokens |

embeddings Parameters (RAGASEmbeddingConfig)#

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| model_name_or_path | str | - | Model name or path, e.g. "AI-ModelScope/m3e-base" |
| provider | str | "huggingface" | Provider |
| api_base | Optional[str] | None | API base URL (when using API embeddings) |
| api_key | Optional[str] | None | API key |

FAQ#

LLM Errors / API Timeout#

  • Ensure the LLM service is running and the api_base URL is accessible.

  • If using a local model (e.g. deployed with vLLM), confirm the model has finished loading.

  • For timeout issues, try increasing the client timeout or reducing batch_size.
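The first two checks can be scripted. The sketch below probes the model-listing endpoint, assuming the service is OpenAI-compatible and exposes `GET <api_base>/models` (as vLLM's OpenAI-compatible server does); other servers may not provide this route. The helper names are hypothetical.

```python
import urllib.error
import urllib.request


def models_url(api_base):
    """Build the model-listing URL for an OpenAI-compatible api_base."""
    return api_base.rstrip("/") + "/models"


def is_reachable(api_base, api_key=None, timeout=5.0):
    """Return True if the endpoint answers with HTTP 200 within the timeout."""
    request = urllib.request.Request(models_url(api_base))
    if api_key:
        request.add_header("Authorization", f"Bearer {api_key}")
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, DNS failure, timeout, or a non-2xx HTTP status.
        return False
```

For example, `is_reachable("http://127.0.0.1:8000/v1")` should return `True` once a local model server has finished loading and is accepting requests.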

Metrics Explained#

| Metric | Evaluation Dimension | Requires reference |
| --- | --- | --- |
| Faithfulness | Whether the answer is grounded in context (no hallucination) | No |
| AnswerRelevancy | Whether the answer is relevant to the question | No |
| ContextPrecision | Proportion of relevant information in context | Yes |
| ContextRecall | Completeness of retrieved information for answering | Yes |
| AnswerCorrectness | Agreement between answer and ground truth | Yes |
| MultiModalFaithfulness | Whether multi-modal answer is grounded in text-image context | No |
| MultiModalRelevance | Whether multi-modal answer is relevant to the question | No |
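The "Requires reference" information can be used to pick metrics that fit the data at hand. The following sketch keeps only the metrics that can run when some records lack a `reference`. The mapping is copied from the metrics listed above; the helper name `usable_metrics` and the conservative handling of unlisted metrics are assumptions.

```python
# Whether each metric needs a ground-truth "reference", as listed above.
REQUIRES_REFERENCE = {
    "Faithfulness": False,
    "AnswerRelevancy": False,
    "ContextPrecision": True,
    "ContextRecall": True,
    "AnswerCorrectness": True,
    "MultiModalFaithfulness": False,
    "MultiModalRelevance": False,
}


def usable_metrics(records, requested):
    """Drop reference-based metrics unless every record has a non-empty reference."""
    has_reference = all(record.get("reference") for record in records)
    # Metrics missing from the mapping are treated as reference-based (conservative).
    return [
        name
        for name in requested
        if has_reference or not REQUIRES_REFERENCE.get(name, True)
    ]


records = [{"user_input": "q", "response": "a", "retrieved_contexts": ["c"]}]
print(usable_metrics(records, ["Faithfulness", "ContextRecall", "AnswerCorrectness"]))
# prints ['Faithfulness']
```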

Model Selection Guide#

  • Evaluation model (critic_llm): Use an instruction-tuned model with 70B+ parameters or a proprietary model (e.g. GPT-4o) for stable and reliable scoring.

  • Generator model (generator_llm): A model with 70B+ parameters is likewise recommended; smaller models are prone to output format errors that cause generation to fail.

  • Embedding model (embeddings): Lightweight models such as AI-ModelScope/m3e-base or AI-ModelScope/bge-large-zh work well.