MTEB#

This framework supports MTEB and CMTEB, with the following details:

  • MTEB (Massive Text Embedding Benchmark) is a large-scale benchmark designed to measure the performance of text embedding models across diverse embedding tasks. MTEB includes 56 datasets covering 8 tasks and supports over 112 different languages. The goal of this benchmark is to assist developers in finding the best text embedding models suitable for various tasks.

  • C-MTEB (Chinese Massive Text Embedding Benchmark) is a dedicated evaluation benchmark for Chinese text vectors, built on MTEB, aimed at assessing the performance of Chinese text vector models. C-MTEB collects 35 public datasets and is divided into 6 categories of evaluation tasks, including retrieval, re-ranking, semantic text similarity (STS), classification, pair classification, and clustering.

Supported Datasets#

Here is an overview of the available tasks and datasets in C-MTEB:

Name

Hub Link

Description

Type

Category

Number of Test Samples

T2Retrieval

C-MTEB/T2Retrieval

T2Ranking: A large-scale Chinese paragraph ranking benchmark

Retrieval

s2p

24,832

MMarcoRetrieval

C-MTEB/MMarcoRetrieval

mMARCO is the multilingual version of the MS MARCO paragraph ranking dataset

Retrieval

s2p

7,437

DuRetrieval

C-MTEB/DuRetrieval

A large-scale Chinese web search engine paragraph retrieval benchmark

Retrieval

s2p

4,000

CovidRetrieval

C-MTEB/CovidRetrieval

COVID-19 news articles

Retrieval

s2p

949

CmedqaRetrieval

C-MTEB/CmedqaRetrieval

Online medical consultation texts

Retrieval

s2p

3,999

EcomRetrieval

C-MTEB/EcomRetrieval

Paragraph retrieval dataset collected from Alibaba e-commerce search engine systems

Retrieval

s2p

1,000

MedicalRetrieval

C-MTEB/MedicalRetrieval

Paragraph retrieval dataset collected from Alibaba medical search engine systems

Retrieval

s2p

1,000

VideoRetrieval

C-MTEB/VideoRetrieval

Paragraph retrieval dataset collected from Alibaba video search engine systems

Retrieval

s2p

1,000

T2Reranking

C-MTEB/T2Reranking

T2Ranking: A large-scale Chinese paragraph ranking benchmark

Re-ranking

s2p

24,382

MMarcoReranking

C-MTEB/MMarco-reranking

mMARCO is the multilingual version of the MS MARCO paragraph ranking dataset

Re-ranking

s2p

7,437

CMedQAv1

C-MTEB/CMedQAv1-reranking

Chinese community medical Q&A

Re-ranking

s2p

2,000

CMedQAv2

C-MTEB/CMedQAv2-reranking

Chinese community medical Q&A

Re-ranking

s2p

4,000

Ocnli

C-MTEB/OCNLI

Original Chinese natural language inference dataset

Pair Classification

s2s

3,000

Cmnli

C-MTEB/CMNLI

Chinese multi-class natural language inference

Pair Classification

s2s

139,000

CLSClusteringS2S

C-MTEB/CLSClusteringS2S

Clustering titles from the CLS dataset. Clustering based on 13 sets of main categories.

Clustering

s2s

10,000

CLSClusteringP2P

C-MTEB/CLSClusteringP2P

Clustering titles + abstracts from the CLS dataset. Clustering based on 13 sets of main categories.

Clustering

p2p

10,000

ThuNewsClusteringS2S

C-MTEB/ThuNewsClusteringS2S

Clustering titles from the THUCNews dataset

Clustering

s2s

10,000

ThuNewsClusteringP2P

C-MTEB/ThuNewsClusteringP2P

Clustering titles + abstracts from the THUCNews dataset

Clustering

p2p

10,000

ATEC

C-MTEB/ATEC

ATEC NLP Sentence Pair Similarity Competition

STS

s2s

20,000

BQ

C-MTEB/BQ

Banking Question Semantic Similarity

STS

s2s

10,000

LCQMC

C-MTEB/LCQMC

Large-scale Chinese Question Matching Corpus

STS

s2s

12,500

PAWSX

C-MTEB/PAWSX

Translated PAWS evaluation pairs

STS

s2s

2,000

STSB

C-MTEB/STSB

Translated STS-B into Chinese

STS

s2s

1,360

AFQMC

C-MTEB/AFQMC

Ant Financial Question Matching Corpus

STS

s2s

3,861

QBQTC

C-MTEB/QBQTC

QQ Browser Query Title Corpus

STS

s2s

5,000

TNews

C-MTEB/TNews-classification

News Short Text Classification

Classification

s2s

10,000

IFlyTek

C-MTEB/IFlyTek-classification

Long Text Classification of Application Descriptions

Classification

s2s

2,600

Waimai

C-MTEB/waimai-classification

Sentiment Analysis of User Reviews on Food Delivery Platforms

Classification

s2s

1,000

OnlineShopping

C-MTEB/OnlineShopping-classification

Sentiment Analysis of User Reviews on Online Shopping Websites

Classification

s2s

1,000

MultilingualSentiment

C-MTEB/MultilingualSentiment-classification

A set of multilingual sentiment datasets grouped into three categories: positive, neutral, negative

Classification

s2s

3,000

JDReview

C-MTEB/JDReview-classification

Reviews of iPhone

Classification

s2s

533

For retrieval tasks, a sample of 100,000 candidates (including the ground truth) is drawn from the entire corpus to reduce inference costs.

Environment Setup#

Install dependencies

pip install mteb

Configure Evaluation Parameters#

The framework supports two evaluation modes: single-stage evaluation and two-stage evaluation:

  • Single-Stage Evaluation: Directly use the model for prediction and compute metrics. Supports tasks such as retrieval, re-ranking, and classification for embedding models.

  • Two-Stage Evaluation: Use the model for retrieval first, then use the model for re-ranking, and compute metrics. Supports re-ranking models.

Single-stage Evaluation#

Example configuration file:

one_stage_task_cfg = {
    "eval_backend": "RAGEval",
    "eval_config": {
        "tool": "MTEB",
        "model": [
            {
                "model_name_or_path": "AI-ModelScope/m3e-base",
                "pooling_mode": None,
                "max_seq_length": 512,
                "prompt": "",
                "model_kwargs": {"torch_dtype": "auto"},
                "encode_kwargs": {
                    "batch_size": 128,
                },
            }
        ],
        "eval": {
            "tasks": [
                "TNews",
                "CLSClusteringS2S",
                "T2Reranking",
                "T2Retrieval",
                "ATEC",
            ],
            "verbosity": 2,
            "output_folder": "outputs",
            "overwrite_results": True,
            "topk": 10,
            "limits": 500,
        },
    },
}

Parameter Explanation#

  • eval_backend: Default value is RAGEval, indicating the use of the RAGEval evaluation backend.

  • eval_config: A dictionary containing the following fields:

    • tool: The evaluation tool, using MTEB.

    • model: A list of model configurations. Only one model can be placed for single-stage evaluation, containing the following fields:

      • model_name_or_path: str The model name or path. Supports automatic downloading from the ModelScope repository.

      • is_cross_encoder: bool Whether the model is a cross-encoder. Default is False.

      • pooling_mode: Optional[str] Pooling mode. Default is mean. Possible values are: β€œcls”, β€œlasttoken”, β€œmax”, β€œmean”, β€œmean_sqrt_len_tokens”, or β€œweightedmean”. For the bge series models, please set it to β€œcls”.

      • max_seq_length: int Maximum sequence length. Default is 512.

      • prompt: str Prompt for the retrieval task in front of the model. Default is an empty string.

      • model_kwargs: dict Keyword arguments for the model. Default value is {"torch_dtype": "auto"}.

      • config_kwargs: Dict[str, Any] Keyword arguments for the configuration. Default is an empty dictionary.

      • encode_kwargs: dict Keyword arguments for encoding. Default value is:

        {
            "show_progress_bar": True,
            "batch_size": 32
        }
        
      • hub: str Source of the model, which can be β€œmodelscope” or β€œhuggingface”.

    • eval: A dictionary containing the following fields:

      • tasks: List[str] Task names. See Task List.

      • top_k: int Number of top results to select for retrieval tasks.

      • verbosity: int Level of verbosity. Range is 0-3.

      • output_folder: str Output folder. Default is β€œoutputs”.

      • overwrite_results: bool Whether to overwrite results. Default is True.

      • limits: Optional[int] Limit the number of samples. Default is None; it is not recommended to set this for retrieval tasks.

      • hub: str Source of the dataset, which can be β€œmodelscope” or β€œhuggingface”.

Two-stage Evaluation#

Example configuration file: first perform retrieval, then reranking:

two_stage_task_cfg = {
    "eval_backend": "RAGEval",
    "eval_config": {
        "tool": "MTEB",
        "model": [
            {
                "model_name_or_path": "AI-ModelScope/m3e-base",
                "is_cross_encoder": False,
                "max_seq_length": 512,
                "model_kwargs": {"torch_dtype": "auto"},
                "encode_kwargs": {
                    "batch_size": 64,
                },
            },
            {
                "model_name_or_path": "OpenBMB/MiniCPM-Reranker",
                "is_cross_encoder": True,
                "max_seq_length": 512,
                "prompt": "Generate a retrieval representation for this question",
                "model_kwargs": {"torch_dtype": "auto"},
                "encode_kwargs": {
                    "batch_size": 32,
                },
            },
        ],
        "eval": {
            "tasks": ["T2Retrieval"],
            "verbosity": 2,
            "output_folder": "outputs",
            "overwrite_results": True,
            "topk": 5,
            "limits": 100,
        },
    },
}

Parameter Explanation#

The basic parameters are the same as those for single-stage evaluation. The difference lies in the model field, where two models need to be provided. The first model is used for retrieval, and the second model is used for reranking. The reranking model needs to be a cross-encoder, i.e., is_cross_encoder=True.

Model Evaluation#

from evalscope.run import run_task
from evalscope.utils.logger import get_logger
logger = get_logger()

one_stage_task_cfg = one_stage_task_cfg
# or
# two_stage_task_cfg = two_stage_task_cfg

# Run task
run_task(task_cfg=one_stage_task_cfg) 
# or 
# run_task(task_cfg=two_stage_task_cfg)

The following is an example of the output:

One-Stage Evaluation

Outputs
outputs/m3e-base/master/TNews.json#
{
  "dataset_revision": "317f262bf1e6126357bbe89e875451e4b0938fe4",
  "evaluation_time": 16.50650382041931,
  "kg_co2_emissions": null,
  "mteb_version": "1.14.15",
  "scores": {
    "validation": [
      {
        "accuracy": 0.4744,
        "f1": 0.44562489526640825,
        "f1_weighted": 0.47540307398330806,
        "hf_subset": "default",
        "languages": [
          "cmn-Hans"
        ],
        "main_score": 0.4744,
        "scores_per_experiment": [
          {
            "accuracy": 0.48,
            "f1": 0.4536376605217497,
            "f1_weighted": 0.47800277926811163
          },
          {
            "accuracy": 0.48,
            "f1": 0.44713633954639176,
            "f1_weighted": 0.4826984434763292
          },
          {
            "accuracy": 0.462,
            "f1": 0.433365706955334,
            "f1_weighted": 0.4640970055245127
          },
          {
            "accuracy": 0.484,
            "f1": 0.4586732839614161,
            "f1_weighted": 0.4857359110392786
          },
          {
            "accuracy": 0.462,
            "f1": 0.4293797541165097,
            "f1_weighted": 0.4632657330831137
          },
          {
            "accuracy": 0.474,
            "f1": 0.44775120246296396,
            "f1_weighted": 0.4737182842092953
          },
          {
            "accuracy": 0.47,
            "f1": 0.4431197566080463,
            "f1_weighted": 0.4714830140231783
          },
          {
            "accuracy": 0.472,
            "f1": 0.44322381694059326,
            "f1_weighted": 0.47100005556357255
          },
          {
            "accuracy": 0.484,
            "f1": 0.45454749692062835,
            "f1_weighted": 0.4856239367465818
          },
          {
            "accuracy": 0.476,
            "f1": 0.44541393463044954,
            "f1_weighted": 0.47840557689910646
          }
        ]
      }
    ]
  },
  "task_name": "TNews"
}

Two-stage Evaluation

first stage
outputs/stage1/m3e-base/v1/T2Retrieval.json#
{
  "dataset_revision": "8731a845f1bf500a4f111cf1070785c793d10e64",
  "evaluation_time": 599.5170171260834,
  "kg_co2_emissions": null,
  "mteb_version": "1.14.15",
  "scores": {
    "dev": [
      {
        "hf_subset": "default",
        "languages": [
          "cmn-Hans"
        ],
        "main_score": 0.73143,
        "map_at_1": 0.22347,
        "map_at_10": 0.63237,
        "map_at_100": 0.67533,
        "map_at_1000": 0.67651,
        "map_at_20": 0.66282,
        "map_at_3": 0.43874,
        "map_at_5": 0.54049,
        "mrr_at_1": 0.7898912852884447,
        "mrr_at_10": 0.8402654617870331,
        "mrr_at_100": 0.8421827758769684,
        "mrr_at_1000": 0.8422583001072272,
        "mrr_at_20": 0.8415411456315557,
        "mrr_at_3": 0.8307469752761716,
        "mrr_at_5": 0.8368029984218875,
        "nauc_map_at_1000_diff1": 0.17749400860890877,
        "nauc_map_at_1000_max": 0.42844516520725967,
        "nauc_map_at_1000_std": 0.18789871694419072,
        "nauc_map_at_100_diff1": 0.17747467084779375,
        "nauc_map_at_100_max": 0.42732291785494575,
        "nauc_map_at_100_std": 0.18694287087286737,
        "nauc_map_at_10_diff1": 0.19976199493034202,
        "nauc_map_at_10_max": 0.3374436217668296,
        "nauc_map_at_10_std": 0.07951451707732717,
        "nauc_map_at_1_diff1": 0.41727578149080663,
        "nauc_map_at_1_max": -0.1402656422184478,
        "nauc_map_at_1_std": -0.26168722519030313,
        "nauc_map_at_20_diff1": 0.1811898211371171,
        "nauc_map_at_20_max": 0.40563441466210043,
        "nauc_map_at_20_std": 0.15927727170010608,
        "nauc_map_at_3_diff1": 0.31255422845809033,
        "nauc_map_at_3_max": 0.007523677231905161,
        "nauc_map_at_3_std": -0.19578481884353466,
        "nauc_map_at_5_diff1": 0.26073699217160473,
        "nauc_map_at_5_max": 0.14665611579604088,
        "nauc_map_at_5_std": -0.09600383298672226,
        "nauc_mrr_at_1000_diff1": 0.3819666309367981,
        "nauc_mrr_at_1000_max": 0.6285393024619401,
        "nauc_mrr_at_1000_std": 0.3294970299417527,
        "nauc_mrr_at_100_diff1": 0.3819436006743644,
        "nauc_mrr_at_100_max": 0.6286346262471935,
        "nauc_mrr_at_100_std": 0.32963045935037844,
        "nauc_mrr_at_10_diff1": 0.3819124721154632,
        "nauc_mrr_at_10_max": 0.6292778905762176,
        "nauc_mrr_at_10_std": 0.3298187966196067,
        "nauc_mrr_at_1_diff1": 0.3862589251033909,
        "nauc_mrr_at_1_max": 0.589976680174432,
        "nauc_mrr_at_1_std": 0.2780515387897469,
        "nauc_mrr_at_20_diff1": 0.38198959771391816,
        "nauc_mrr_at_20_max": 0.6290569436652999,
        "nauc_mrr_at_20_std": 0.3301570340189363,
        "nauc_mrr_at_3_diff1": 0.3825046940733129,
        "nauc_mrr_at_3_max": 0.6282507269128365,
        "nauc_mrr_at_3_std": 0.3260807934869131,
        "nauc_mrr_at_5_diff1": 0.3816317396711923,
        "nauc_mrr_at_5_max": 0.6288655177904692,
        "nauc_mrr_at_5_std": 0.3298854062538469,
        "nauc_ndcg_at_1000_diff1": 0.21319598381916555,
        "nauc_ndcg_at_1000_max": 0.5328295949130256,
        "nauc_ndcg_at_1000_std": 0.2946773445135694,
        "nauc_ndcg_at_100_diff1": 0.2089807772703975,
        "nauc_ndcg_at_100_max": 0.5239397690321543,
        "nauc_ndcg_at_100_std": 0.29123456982125717,
        "nauc_ndcg_at_10_diff1": 0.20555333230027603,
        "nauc_ndcg_at_10_max": 0.44316027023003046,
        "nauc_ndcg_at_10_std": 0.1921835220940756,
        "nauc_ndcg_at_1_diff1": 0.3862589251033909,
        "nauc_ndcg_at_1_max": 0.589976680174432,
        "nauc_ndcg_at_1_std": 0.2780515387897469,
        "nauc_ndcg_at_20_diff1": 0.20754208582741446,
        "nauc_ndcg_at_20_max": 0.4786092392092643,
        "nauc_ndcg_at_20_std": 0.23536973680564616,
        "nauc_ndcg_at_3_diff1": 0.1902823773882388,
        "nauc_ndcg_at_3_max": 0.5400466380622567,
        "nauc_ndcg_at_3_std": 0.2713874990424778,
        "nauc_ndcg_at_5_diff1": 0.18279298790691637,
        "nauc_ndcg_at_5_max": 0.4916119327522918,
        "nauc_ndcg_at_5_std": 0.2375397192963552,
        "nauc_precision_at_1000_diff1": -0.20510380600112582,
        "nauc_precision_at_1000_max": 0.4958820760698651,
        "nauc_precision_at_1000_std": 0.5402465580496146,
        "nauc_precision_at_100_diff1": -0.1994322347949809,
        "nauc_precision_at_100_max": 0.5206762748551254,
        "nauc_precision_at_100_std": 0.5568154081333078,
        "nauc_precision_at_10_diff1": -0.16707155441197413,
        "nauc_precision_at_10_max": 0.5600612846655972,
        "nauc_precision_at_10_std": 0.49419688804691536,
        "nauc_precision_at_1_diff1": 0.3862589251033909,
        "nauc_precision_at_1_max": 0.589976680174432,
        "nauc_precision_at_1_std": 0.2780515387897469,
        "nauc_precision_at_20_diff1": -0.18471041949530417,
        "nauc_precision_at_20_max": 0.5458950955439645,
        "nauc_precision_at_20_std": 0.5355982267058214,
        "nauc_precision_at_3_diff1": -0.03826790088047189,
        "nauc_precision_at_3_max": 0.5833083970750171,
        "nauc_precision_at_3_std": 0.380196662597275,
        "nauc_precision_at_5_diff1": -0.11789367842600275,
        "nauc_precision_at_5_max": 0.5708494593335263,
        "nauc_precision_at_5_std": 0.42860609671688105,
        "nauc_recall_at_1000_diff1": 0.1341309660059583,
        "nauc_recall_at_1000_max": 0.5923755841077135,
        "nauc_recall_at_1000_std": 0.5980459502693942,
        "nauc_recall_at_100_diff1": 0.12181394285840096,
        "nauc_recall_at_100_max": 0.47090136790318127,
        "nauc_recall_at_100_std": 0.3959369184297595,
        "nauc_recall_at_10_diff1": 0.17356300971546512,
        "nauc_recall_at_10_max": 0.25475707245853674,
        "nauc_recall_at_10_std": 0.041819982320384745,
        "nauc_recall_at_1_diff1": 0.41727578149080663,
        "nauc_recall_at_1_max": -0.1402656422184478,
        "nauc_recall_at_1_std": -0.26168722519030313,
        "nauc_recall_at_20_diff1": 0.14273713155999543,
        "nauc_recall_at_20_max": 0.36251116771924663,
        "nauc_recall_at_20_std": 0.1912123941692314,
        "nauc_recall_at_3_diff1": 0.2873719855400218,
        "nauc_recall_at_3_max": -0.041198403561830285,
        "nauc_recall_at_3_std": -0.21921947922872737,
        "nauc_recall_at_5_diff1": 0.23680082643694844,
        "nauc_recall_at_5_max": 0.06580524171324151,
        "nauc_recall_at_5_std": -0.14104561361502632,
        "ndcg_at_1": 0.78989,
        "ndcg_at_10": 0.73143,
        "ndcg_at_100": 0.78829,
        "ndcg_at_1000": 0.80026,
        "ndcg_at_20": 0.75787,
        "ndcg_at_3": 0.7417,
        "ndcg_at_5": 0.72641,
        "precision_at_1": 0.78989,
        "precision_at_10": 0.37304,
        "precision_at_100": 0.04828,
        "precision_at_1000": 0.00511,
        "precision_at_20": 0.21403,
        "precision_at_3": 0.65461,
        "precision_at_5": 0.54942,
        "recall_at_1": 0.22347,
        "recall_at_10": 0.73318,
        "recall_at_100": 0.91093,
        "recall_at_1000": 0.97197,
        "recall_at_20": 0.81286,
        "recall_at_3": 0.46573,
        "recall_at_5": 0.59383
      }
    ]
  },
  "task_name": "T2Retrieval"
}
second stage
outputs/stage2/jina-reranker-v2-base-multilingual/master/T2Retrieval.json#
{
  "dataset_revision": "8731a845f1bf500a4f111cf1070785c793d10e64",
  "evaluation_time": 332.15709686279297,
  "kg_co2_emissions": null,
  "mteb_version": "1.14.15",
  "scores": {
    "dev": [
      {
        "hf_subset": "default",
        "languages": [
          "cmn-Hans"
        ],
        "main_score": 0.661,
        "map_at_1": 0.24264,
        "map_at_10": 0.56291,
        "map_at_100": 0.56291,
        "map_at_1000": 0.56291,
        "map_at_20": 0.56291,
        "map_at_3": 0.4714,
        "map_at_5": 0.56291,
        "mrr_at_1": 0.841969139049623,
        "mrr_at_10": 0.8689147524694633,
        "mrr_at_100": 0.8689147524694633,
        "mrr_at_1000": 0.8689147524694633,
        "mrr_at_20": 0.8689147524694633,
        "mrr_at_3": 0.8664883979192248,
        "mrr_at_5": 0.8689147524694633,
        "nauc_map_at_1000_diff1": 0.12071580301051653,
        "nauc_map_at_1000_max": 0.2536691069727338,
        "nauc_map_at_1000_std": 0.343624832364704,
        "nauc_map_at_100_diff1": 0.12071580301051653,
        "nauc_map_at_100_max": 0.2536691069727338,
        "nauc_map_at_100_std": 0.343624832364704,
        "nauc_map_at_10_diff1": 0.12071580301051653,
        "nauc_map_at_10_max": 0.2536691069727338,
        "nauc_map_at_10_std": 0.343624832364704,
        "nauc_map_at_1_diff1": 0.47964980727810325,
        "nauc_map_at_1_max": -0.08015044571696166,
        "nauc_map_at_1_std": 0.3507257834956417,
        "nauc_map_at_20_diff1": 0.12071580301051653,
        "nauc_map_at_20_max": 0.2536691069727338,
        "nauc_map_at_20_std": 0.343624832364704,
        "nauc_map_at_3_diff1": 0.23481937699306626,
        "nauc_map_at_3_max": 0.10372745264123306,
        "nauc_map_at_3_std": 0.45345158923063256,
        "nauc_map_at_5_diff1": 0.12071580301051653,
        "nauc_map_at_5_max": 0.2536691069727338,
        "nauc_map_at_5_std": 0.343624832364704,
        "nauc_mrr_at_1000_diff1": 0.23393918304502795,
        "nauc_mrr_at_1000_max": 0.8703379129725659,
        "nauc_mrr_at_1000_std": 0.5785333616122065,
        "nauc_mrr_at_100_diff1": 0.23393918304502795,
        "nauc_mrr_at_100_max": 0.8703379129725659,
        "nauc_mrr_at_100_std": 0.5785333616122065,
        "nauc_mrr_at_10_diff1": 0.23393918304502795,
        "nauc_mrr_at_10_max": 0.8703379129725659,
        "nauc_mrr_at_10_std": 0.5785333616122065,
        "nauc_mrr_at_1_diff1": 0.2520016067648708,
        "nauc_mrr_at_1_max": 0.8560897633767299,
        "nauc_mrr_at_1_std": 0.5642467684745208,
        "nauc_mrr_at_20_diff1": 0.23393918304502795,
        "nauc_mrr_at_20_max": 0.8703379129725659,
        "nauc_mrr_at_20_std": 0.5785333616122065,
        "nauc_mrr_at_3_diff1": 0.2343988881957151,
        "nauc_mrr_at_3_max": 0.8695482778251757,
        "nauc_mrr_at_3_std": 0.5799167198804328,
        "nauc_mrr_at_5_diff1": 0.23393918304502795,
        "nauc_mrr_at_5_max": 0.8703379129725659,
        "nauc_mrr_at_5_std": 0.5785333616122065,
        "nauc_ndcg_at_1000_diff1": 0.11252208055013257,
        "nauc_ndcg_at_1000_max": 0.3417865079349515,
        "nauc_ndcg_at_1000_std": 0.3623961771041499,
        "nauc_ndcg_at_100_diff1": 0.11252208055013257,
        "nauc_ndcg_at_100_max": 0.3417865079349515,
        "nauc_ndcg_at_100_std": 0.3623961771041499,
        "nauc_ndcg_at_10_diff1": 0.10015448775533999,
        "nauc_ndcg_at_10_max": 0.3761759074862075,
        "nauc_ndcg_at_10_std": 0.35152523471339914,
        "nauc_ndcg_at_1_diff1": 0.2524564785684737,
        "nauc_ndcg_at_1_max": 0.8566368743831702,
        "nauc_ndcg_at_1_std": 0.5635391925059349,
        "nauc_ndcg_at_20_diff1": 0.11228113618796766,
        "nauc_ndcg_at_20_max": 0.34274993051851965,
        "nauc_ndcg_at_20_std": 0.36216437469674284,
        "nauc_ndcg_at_3_diff1": -0.062134030685870506,
        "nauc_ndcg_at_3_max": 0.7183183844837573,
        "nauc_ndcg_at_3_std": 0.3352626268658533,
        "nauc_ndcg_at_5_diff1": -0.04476981761624879,
        "nauc_ndcg_at_5_max": 0.6272060974309411,
        "nauc_ndcg_at_5_std": 0.21341258393783158,
        "nauc_precision_at_1000_diff1": -0.3554940965683014,
        "nauc_precision_at_1000_max": 0.605443274008298,
        "nauc_precision_at_1000_std": -0.13073611213585504,
        "nauc_precision_at_100_diff1": -0.35549409656830133,
        "nauc_precision_at_100_max": 0.6054432740082977,
        "nauc_precision_at_100_std": -0.13073611213585531,
        "nauc_precision_at_10_diff1": -0.3554940965683011,
        "nauc_precision_at_10_max": 0.6054432740082981,
        "nauc_precision_at_10_std": -0.1307361121358551,
        "nauc_precision_at_1_diff1": 0.2524564785684737,
        "nauc_precision_at_1_max": 0.8566368743831702,
        "nauc_precision_at_1_std": 0.5635391925059349,
        "nauc_precision_at_20_diff1": -0.3554940965683011,
        "nauc_precision_at_20_max": 0.6054432740082981,
        "nauc_precision_at_20_std": -0.1307361121358551,
        "nauc_precision_at_3_diff1": -0.3377658698816377,
        "nauc_precision_at_3_max": 0.6780151277792397,
        "nauc_precision_at_3_std": 0.12291559606586676,
        "nauc_precision_at_5_diff1": -0.3554940965683011,
        "nauc_precision_at_5_max": 0.6054432740082981,
        "nauc_precision_at_5_std": -0.1307361121358551,
        "nauc_recall_at_1000_diff1": 0.1091970342988605,
        "nauc_recall_at_1000_max": 0.18339955163544436,
        "nauc_recall_at_1000_std": 0.30756376767627086,
        "nauc_recall_at_100_diff1": 0.1091970342988605,
        "nauc_recall_at_100_max": 0.18339955163544436,
        "nauc_recall_at_100_std": 0.30756376767627086,
        "nauc_recall_at_10_diff1": 0.1091970342988605,
        "nauc_recall_at_10_max": 0.18339955163544436,
        "nauc_recall_at_10_std": 0.30756376767627086,
        "nauc_recall_at_1_diff1": 0.47964980727810325,
        "nauc_recall_at_1_max": -0.08015044571696166,
        "nauc_recall_at_1_std": 0.3507257834956417,
        "nauc_recall_at_20_diff1": 0.1091970342988605,
        "nauc_recall_at_20_max": 0.18339955163544436,
        "nauc_recall_at_20_std": 0.30756376767627086,
        "nauc_recall_at_3_diff1": 0.22013063499116758,
        "nauc_recall_at_3_max": 0.054749114965246065,
        "nauc_recall_at_3_std": 0.4258163949018153,
        "nauc_recall_at_5_diff1": 0.1091970342988605,
        "nauc_recall_at_5_max": 0.18339955163544436,
        "nauc_recall_at_5_std": 0.30756376767627086,
        "ndcg_at_1": 0.84188,
        "ndcg_at_10": 0.661,
        "ndcg_at_100": 0.6534,
        "ndcg_at_1000": 0.6534,
        "ndcg_at_20": 0.65358,
        "ndcg_at_3": 0.7826,
        "ndcg_at_5": 0.74517,
        "precision_at_1": 0.84188,
        "precision_at_10": 0.27471,
        "precision_at_100": 0.02747,
        "precision_at_1000": 0.00275,
        "precision_at_20": 0.13736,
        "precision_at_3": 0.68633,
        "precision_at_5": 0.54942,
        "recall_at_1": 0.24264,
        "recall_at_10": 0.59383,
        "recall_at_100": 0.59383,
        "recall_at_1000": 0.59383,
        "recall_at_20": 0.59383,
        "recall_at_3": 0.48934,
        "recall_at_5": 0.59383
      }
    ]
  },
  "task_name": "T2Retrieval"
}

Custom Dataset Evaluation#

Custom Text Retrieval Evaluation#

  1. Construct the Dataset with the following format:

retrieval_data
β”œβ”€β”€ corpus.jsonl
β”œβ”€β”€ queries.jsonl
└── qrels
    └── test.tsv

Where:

  • corpus.jsonl: Corpus file, each line is a JSON object formatted as {"_id": "xxx", "text": "xxx"}. _id is the corpus ID, and text is the corpus text. For example:

    {"_id": "doc1", "text": "Climate change is leading to more extreme weather patterns."}
    {"_id": "doc2", "text": "The stock market surged today, led by tech stocks."}
    {"_id": "doc3", "text": "AI is transforming various industries by automating tasks and providing insights."}
    {"_id": "doc4", "text": "With technological advancements, renewable energy sources like wind and solar are becoming more widespread."}
    {"_id": "doc5", "text": "Recent studies show that a balanced diet and regular exercise can significantly improve mental health."}
    {"_id": "doc6", "text": "Virtual reality is creating new opportunities in education, entertainment, and training."}
    {"_id": "doc7", "text": "Electric vehicles are becoming increasingly popular due to environmental benefits and advances in battery technology."}
    {"_id": "doc8", "text": "Space exploration missions are revealing new information about our solar system and beyond."}
    {"_id": "doc9", "text": "Blockchain technology has potential applications beyond cryptocurrency, including supply chain management and secure voting systems."}
    {"_id": "doc10", "text": "The benefits of remote work include greater flexibility and reduced commuting time."}
    
  • queries.jsonl: Query file, each line is a JSON object formatted as {"_id": "xxx", "text": "xxx"}. _id is the query ID, and text is the query text. For example:

    {"_id": "query1", "text": "What are the impacts of climate change?"}
    {"_id": "query2", "text": "What caused the stock market to rise today?"}
    {"_id": "query3", "text": "How is AI changing industries?"}
    {"_id": "query4", "text": "What are the advancements in renewable energy?"}
    {"_id": "query5", "text": "How does a balanced diet improve mental health?"}
    {"_id": "query6", "text": "What new opportunities does virtual reality create?"}
    {"_id": "query7", "text": "Why are electric vehicles becoming more popular?"}
    {"_id": "query8", "text": "What new information has space exploration revealed?"}
    {"_id": "query9", "text": "What are the applications of blockchain technology beyond cryptocurrency?"}
    {"_id": "query10", "text": "What are the benefits of remote work?"}
    
  • qrels: Evaluation file(s) in TSV format, with columns query-id, doc-id, and score. query-id is the query ID, doc-id is the corpus ID, and score is the relevance score. For example:

    query-id  corpus-id  score
    query1    doc1       1
    query2    doc2       1
    query3    doc3       1
    query4    doc4       1
    query5    doc5       1
    query6    doc6       1
    query7    doc7       1
    query8    doc8       1
    query9    doc9       1
    query10   doc10      1
    
  1. Construct the Configuration File

task_cfg = {
    "eval_backend": "RAGEval",
    "eval_config": {
        "tool": "MTEB",
        "model": [
            {
                "model_name_or_path": "AI-ModelScope/m3e-base",
                "pooling_mode": None,  # load from model config
                "max_seq_length": 512,
                "prompt": "",
                "model_kwargs": {"torch_dtype": "auto"},
                "encode_kwargs": {
                    "batch_size": 128,
                },
            }
        ],
        "eval": {
            "tasks": ["CustomRetrieval"],
            "dataset_path": "custom_eval/text/retrieval",
            "verbosity": 2,
            "output_folder": "outputs",
            "overwrite_results": True,
            "limits": 500,
        },
    },
}

Parameter description, with essential parameters modified from the default configuration:

  • eval:

    • tasks: Evaluation task, must be CustomRetrieval.

    • dataset_path: Path to the custom dataset.

  1. Run the Evaluation

from evalscope.run import run_task

run_task(task_cfg=task_cfg)