RAGEval

This project supports independent evaluation and end-to-end evaluation for RAG and multimodal RAG:

  • Independent Evaluation: Evaluating the retrieval module on its own. Common retrieval metrics include Hit Rate, Mean Reciprocal Rank (MRR), Normalized Discounted Cumulative Gain (NDCG), Precision, etc. These metrics measure how effectively the system ranks relevant items for a given query or task (see the first sketch after this list).

  • End-to-End Evaluation: Evaluating the final response the RAG model generates for a given input, including the relevance and alignment of the generated answer with the input query. Depending on whether a reference answer is used, the evaluation can be divided into no-reference and reference-based evaluation: no-reference metrics include Context Relevance, Faithfulness, etc.; reference-based metrics include Accuracy, BLEU, ROUGE, etc. (see the second sketch after this list).
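
As a concrete illustration of the retrieval metrics above, the following is a minimal, self-contained sketch with hypothetical data and helper names; it is not part of this framework and assumes binary relevance judgments.

```python
import math

def hit_rate(ranked_ids, relevant_ids, k=10):
    """1 if any relevant document appears in the top-k results, else 0."""
    return int(any(doc_id in relevant_ids for doc_id in ranked_ids[:k]))

def mrr(ranked_ids, relevant_ids, k=10):
    """Reciprocal rank of the first relevant document in the top-k, else 0."""
    for rank, doc_id in enumerate(ranked_ids[:k], start=1):
        if doc_id in relevant_ids:
            return 1.0 / rank
    return 0.0

def ndcg(ranked_ids, relevant_ids, k=10):
    """Binary-relevance NDCG@k: DCG of the ranking divided by the ideal DCG."""
    dcg = sum(1.0 / math.log2(rank + 1)
              for rank, doc_id in enumerate(ranked_ids[:k], start=1)
              if doc_id in relevant_ids)
    ideal_hits = min(len(relevant_ids), k)
    idcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    return dcg / idcg if idcg > 0 else 0.0

# Toy example: the ranked retrieval result for one query and its relevant docs.
ranked = ["d3", "d7", "d1", "d9", "d4"]
relevant = {"d1", "d2"}
print(hit_rate(ranked, relevant, k=5))  # 1
print(mrr(ranked, relevant, k=5))       # 0.333...
print(ndcg(ranked, relevant, k=5))      # ~0.31
```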
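
The reference-based generation metrics can be illustrated in the same spirit. The sketch below computes exact-match accuracy directly and ROUGE with the rouge-score package; this is just one common implementation of these metrics, not necessarily the one used by the tools described later, and the example texts are made up.

```python
from rouge_score import rouge_scorer  # pip install rouge-score

prediction = "RAG combines a retriever with a generator."
reference = "RAG couples a retriever module with a text generator."

# Exact-match accuracy over a (trivial) single-example "dataset".
accuracy = float(prediction.strip().lower() == reference.strip().lower())

# ROUGE-1 / ROUGE-L between the reference answer and the generated answer.
scorer = rouge_scorer.RougeScorer(["rouge1", "rougeL"], use_stemmer=True)
scores = scorer.score(reference, prediction)  # note: (target, prediction) order

print(f"accuracy = {accuracy}")
print(f"ROUGE-1 F1 = {scores['rouge1'].fmeasure:.3f}")
print(f"ROUGE-L F1 = {scores['rougeL'].fmeasure:.3f}")
```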

See also

Related research on RAG evaluation can be found here.

This framework supports the following:

  • Independent evaluation of the text retrieval module using MTEB/CMTEB.

  • Independent evaluation of the multimodal image-text retrieval module using CLIP Benchmark.

  • End-to-end generation evaluation of RAG and multimodal RAG using RAGAS.

MTEB/CMTEB

For independent evaluation of the retrieval module, supporting embedding models and reranker models.

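As a quick illustration, here is a minimal sketch of running an MTEB retrieval task with the mteb Python package and a SentenceTransformer-style embedding model. The model name, task name, and output folder are placeholders, and the exact API can differ between mteb versions.

```python
import mteb
from sentence_transformers import SentenceTransformer

# Placeholder embedding model and retrieval task; substitute the model and
# the MTEB/CMTEB tasks you actually want to evaluate.
model = SentenceTransformer("BAAI/bge-small-en-v1.5")
tasks = mteb.get_tasks(tasks=["NFCorpus"])

evaluation = mteb.MTEB(tasks=tasks)
evaluation.run(model, output_folder="results/bge-small-en-v1.5")
```
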
CLIP Benchmark

For independent evaluation of the multimodal image-text retrieval module, supporting CLIP models.

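For reference, a hedged sketch of invoking the clip_benchmark command-line tool for zero-shot image-text retrieval on MS-COCO captions is shown below. The dataset name, model/pretrained pair, and flag names follow the CLIP Benchmark CLI and should be treated as assumptions, since they may differ across versions and some datasets require additional download or path options.

```bash
clip_benchmark eval --dataset=mscoco_captions --task=zeroshot_retrieval \
    --model=ViT-B-32 --pretrained=openai \
    --output=result.json --batch_size=64
```
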
RAGAS

For end-to-end generation evaluation of RAG and multimodal RAG, also supporting automatic generation of evaluation sets.

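For illustration, a minimal sketch of scoring a single RAG interaction with the ragas Python package follows. The sample data, metric selection, and column names are assumptions; the expected dataset schema varies across ragas versions, and the LLM-judged metrics require a configured LLM backend (by default an OpenAI API key).

```python
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import answer_relevancy, context_precision, faithfulness

# One toy RAG interaction; a real evaluation set would contain many rows.
# Column names follow the ragas v0.1-style schema and may differ by version.
data = {
    "question": ["What does RAG stand for?"],
    "contexts": [["RAG stands for Retrieval-Augmented Generation."]],
    "answer": ["RAG stands for Retrieval-Augmented Generation."],
    "ground_truth": ["Retrieval-Augmented Generation."],
}
dataset = Dataset.from_dict(data)

# LLM-judged metrics need a configured LLM backend (e.g. OPENAI_API_KEY).
result = evaluate(dataset, metrics=[faithfulness, answer_relevancy, context_precision])
print(result)
```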