RAGEval
This project supports independent evaluation and end-to-end evaluation for RAG and multimodal RAG:
Independent Evaluation: Evaluating the retrieval module separately. The evaluation metrics for the retrieval module include Hit Rate, Mean Reciprocal Rank (MRR), Normalized Discounted Cumulative Gain (NDCG), Precision, etc. These metrics measure how effectively the system ranks relevant items for a given query or task (a short sketch of these metrics follows this list).
End-to-End Evaluation: Evaluating the final response generated by the RAG model for a given input, including the relevance and alignment of the generated answer with the input query. From the perspective of the content-generation objective, evaluation can be divided into no-reference and reference-based evaluation: no-reference metrics include Context Relevance, Faithfulness, etc.; reference-based metrics include Accuracy, BLEU, ROUGE, etc. (a token-overlap sketch of a reference-based metric also follows this list).
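As a concrete illustration of the retrieval metrics above, the following minimal sketch computes Hit Rate, MRR, and binary-relevance NDCG for a single query from a ranked list of retrieved document IDs; the function names and data layout are illustrative assumptions and are not part of this project's API.

```python
import math

def hit_rate(retrieved, relevant, k=10):
    """1.0 if any relevant document appears in the top-k results, else 0.0."""
    return float(any(doc_id in relevant for doc_id in retrieved[:k]))

def mrr(retrieved, relevant, k=10):
    """Reciprocal rank of the first relevant document in the top-k results."""
    for rank, doc_id in enumerate(retrieved[:k], start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0

def ndcg(retrieved, relevant, k=10):
    """Binary-relevance NDCG@k: DCG of the ranking divided by the ideal DCG."""
    dcg = sum(
        1.0 / math.log2(rank + 1)
        for rank, doc_id in enumerate(retrieved[:k], start=1)
        if doc_id in relevant
    )
    ideal_dcg = sum(
        1.0 / math.log2(rank + 1)
        for rank in range(1, min(len(relevant), k) + 1)
    )
    return dcg / ideal_dcg if ideal_dcg > 0 else 0.0

# Example: five retrieved documents for one query; "d2" and "d9" are relevant.
retrieved = ["d1", "d2", "d5", "d9", "d7"]
relevant = {"d2", "d9"}
print(hit_rate(retrieved, relevant), mrr(retrieved, relevant), ndcg(retrieved, relevant))
# -> 1.0 0.5 0.650...
```

In practice these per-query scores are averaged over the whole evaluation set.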
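For the reference-based end-to-end metrics, the sketch below computes a simplified ROUGE-1-style unigram precision, recall, and F1 between a generated answer and a reference answer; it omits stemming, stopword handling, and the multi-n-gram statistics of full BLEU/ROUGE implementations.

```python
from collections import Counter

def rouge1_scores(prediction: str, reference: str) -> dict:
    """Unigram-overlap precision, recall, and F1 (a simplified ROUGE-1)."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    precision = overlap / len(pred_tokens) if pred_tokens else 0.0
    recall = overlap / len(ref_tokens) if ref_tokens else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}

print(rouge1_scores(
    "Paris is the capital of France",
    "The capital of France is Paris",
))
# Identical token sets -> precision, recall, and F1 are all 1.0.
```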
See also
Related research on RAG evaluation can be found here.
This framework supports the following:
Independent evaluation of the text retrieval module using MTEB/CMTEB (see the MTEB sketch after this list).
Independent evaluation of the multimodal image-text retrieval module using CLIP Benchmark.
End-to-end generation evaluation of RAG and multimodal RAG using RAGAS (see the RAGAS sketch after this list).
Independent evaluation of the retrieval module supports both embedding models and reranker models.
Independent evaluation of the multimodal image-text retrieval module supports CLIP models.
End-to-end generation evaluation of RAG and multimodal RAG also supports automatic generation of evaluation sets.
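For the independent text retrieval evaluation, a minimal MTEB sketch is shown below; the task name, embedding model, and output folder are placeholders, and the exact mteb API may differ between versions (CMTEB tasks can be selected the same way).

```python
# Requires: pip install mteb sentence-transformers
from mteb import MTEB
from sentence_transformers import SentenceTransformer

# Placeholder embedding model; swap in the model you want to evaluate.
model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

# Placeholder retrieval task name.
evaluation = MTEB(tasks=["SciFact"])
results = evaluation.run(model, output_folder="results/all-MiniLM-L6-v2")
print(results)
```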
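For the end-to-end generation evaluation, a minimal RAGAS sketch follows; it assumes a judge LLM is already configured (for example via an OpenAI API key), and the column names and metric selection reflect common RAGAS usage, which may vary between ragas versions.

```python
# Requires: pip install ragas datasets
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import answer_relevancy, context_precision, faithfulness

# Toy evaluation set: one question, the retrieved contexts, the generated
# answer, and a reference answer (needed only for reference-based metrics).
eval_data = {
    "question": ["What does the retrieval module return?"],
    "contexts": [["The retrieval module returns the top-k ranked passages."]],
    "answer": ["It returns the top-k ranked passages for the query."],
    "ground_truth": ["The top-k ranked passages."],
}

dataset = Dataset.from_dict(eval_data)
result = evaluate(dataset, metrics=[faithfulness, answer_relevancy, context_precision])
print(result)
```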