MTEB#
This framework supports MTEB and C-MTEB, described below:
MTEB (Massive Text Embedding Benchmark) is a large-scale benchmark that measures the performance of text embedding models across diverse embedding tasks. MTEB includes 56 datasets covering 8 tasks and supports over 112 languages. Its goal is to help developers find the best text embedding model for a given task.
C-MTEB (Chinese Massive Text Embedding Benchmark) is a benchmark built on MTEB for evaluating Chinese text embedding models. C-MTEB collects 35 public datasets, divided into 6 categories of evaluation tasks: retrieval, re-ranking, semantic textual similarity (STS), classification, pair classification, and clustering.
Supported Datasets#
Here is an overview of the available tasks and datasets in C-MTEB:
| Name | Hub Link | Description | Type | Category | Number of Test Samples |
|---|---|---|---|---|---|
| | | T2Ranking: A large-scale Chinese paragraph ranking benchmark | Retrieval | s2p | 24,832 |
| | | mMARCO is the multilingual version of the MS MARCO paragraph ranking dataset | Retrieval | s2p | 7,437 |
| | | A large-scale Chinese web search engine paragraph retrieval benchmark | Retrieval | s2p | 4,000 |
| | | COVID-19 news articles | Retrieval | s2p | 949 |
| | | Online medical consultation texts | Retrieval | s2p | 3,999 |
| | | Paragraph retrieval dataset collected from Alibaba e-commerce search engine systems | Retrieval | s2p | 1,000 |
| | | Paragraph retrieval dataset collected from Alibaba medical search engine systems | Retrieval | s2p | 1,000 |
| | | Paragraph retrieval dataset collected from Alibaba video search engine systems | Retrieval | s2p | 1,000 |
| | | T2Ranking: A large-scale Chinese paragraph ranking benchmark | Re-ranking | s2p | 24,382 |
| | | mMARCO is the multilingual version of the MS MARCO paragraph ranking dataset | Re-ranking | s2p | 7,437 |
| | | Chinese community medical Q&A | Re-ranking | s2p | 2,000 |
| | | Chinese community medical Q&A | Re-ranking | s2p | 4,000 |
| | | Original Chinese natural language inference dataset | Pair Classification | s2s | 3,000 |
| | | Chinese multi-class natural language inference | Pair Classification | s2s | 139,000 |
| | | Clustering titles from the CLS dataset. Clustering based on 13 sets of main categories. | Clustering | s2s | 10,000 |
| | | Clustering titles + abstracts from the CLS dataset. Clustering based on 13 sets of main categories. | Clustering | p2p | 10,000 |
| | | Clustering titles from the THUCNews dataset | Clustering | s2s | 10,000 |
| | | Clustering titles + abstracts from the THUCNews dataset | Clustering | p2p | 10,000 |
| | | ATEC NLP Sentence Pair Similarity Competition | STS | s2s | 20,000 |
| | | Banking Question Semantic Similarity | STS | s2s | 10,000 |
| | | Large-scale Chinese Question Matching Corpus | STS | s2s | 12,500 |
| | | Translated PAWS evaluation pairs | STS | s2s | 2,000 |
| | | Translated STS-B into Chinese | STS | s2s | 1,360 |
| | | Ant Financial Question Matching Corpus | STS | s2s | 3,861 |
| | | QQ Browser Query Title Corpus | STS | s2s | 5,000 |
| | | News Short Text Classification | Classification | s2s | 10,000 |
| | | Long Text Classification of Application Descriptions | Classification | s2s | 2,600 |
| | | Sentiment Analysis of User Reviews on Food Delivery Platforms | Classification | s2s | 1,000 |
| | | Sentiment Analysis of User Reviews on Online Shopping Websites | Classification | s2s | 1,000 |
| | | A set of multilingual sentiment datasets grouped into three categories: positive, neutral, negative | Classification | s2s | 3,000 |
| | | Reviews of iPhone | Classification | s2s | 533 |
For retrieval tasks, a sample of 100,000 candidates (including the ground truth) is drawn from the entire corpus to reduce inference costs.
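The sampling described above can be pictured with a short sketch. It is only an illustration under assumptions: the function name `sample_candidates`, its arguments, and the fixed seed are hypothetical and are not part of C-MTEB or this framework. The sketch keeps every ground-truth document and fills the rest of the pool with randomly drawn documents.

```python
import random


def sample_candidates(corpus_ids, relevant_ids, pool_size=100_000, seed=0):
    """Return a candidate pool that always contains the ground-truth documents.

    Hypothetical helper for illustration; not the benchmark's actual code.
    """
    relevant = set(relevant_ids)
    # Documents that are not ground truth are eligible as random fillers.
    others = [doc_id for doc_id in corpus_ids if doc_id not in relevant]
    n_fill = max(0, min(pool_size - len(relevant), len(others)))
    rng = random.Random(seed)
    return relevant | set(rng.sample(others, n_fill))


corpus = [f"doc{i}" for i in range(1_000)]
pool = sample_candidates(corpus, relevant_ids=["doc3", "doc42"], pool_size=100)
print(len(pool))  # 100 candidates, ground truth included
```

Scoring each query against a reduced pool is what lowers the inference cost; keeping the ground truth in the pool ensures that every relevant document can still be retrieved.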
Environment Setup#
Install dependencies
pip install mteb
Configure Evaluation Parameters#
The framework supports two evaluation modes, single-stage and two-stage:
Single-Stage Evaluation: Use the model directly for prediction and compute metrics. Supports embedding-model tasks such as retrieval, re-ranking, and classification.
Two-Stage Evaluation: First retrieve with an embedding model, then re-rank the retrieved results with a second model, and compute metrics. Supports re-ranking models.
Single-stage Evaluation#
Example configuration file:
one_stage_task_cfg = {
    "eval_backend": "RAGEval",
    "eval_config": {
        "tool": "MTEB",
        "model": [
            {
                "model_name_or_path": "AI-ModelScope/m3e-base",
                "pooling_mode": None,
                "max_seq_length": 512,
                "prompt": "",
                "model_kwargs": {"torch_dtype": "auto"},
                "encode_kwargs": {
                    "batch_size": 128,
                },
            }
        ],
        "eval": {
            "tasks": [
                "TNews",
                "CLSClusteringS2S",
                "T2Reranking",
                "T2Retrieval",
                "ATEC",
            ],
            "verbosity": 2,
            "output_folder": "outputs",
            "overwrite_results": True,
            "top_k": 10,
            "limits": 500,
        },
    },
}
Parameter Explanation#
- `eval_backend`: Default value is `RAGEval`, indicating the use of the RAGEval evaluation backend.
- `eval_config`: A dictionary containing the following fields:
  - `tool`: The evaluation tool; use `MTEB`.
  - `model`: A list of model configurations. Single-stage evaluation accepts only one model. Each entry contains the following fields:
    - `model_name_or_path`: `str`. The model name or path. Supports automatic downloading from the ModelScope repository.
    - `is_cross_encoder`: `bool`. Whether the model is a cross-encoder. Default is `False`.
    - `pooling_mode`: `Optional[str]`. Pooling mode. Default is `mean`. Possible values are "cls", "lasttoken", "max", "mean", "mean_sqrt_len_tokens", or "weightedmean". For the `bge` series models, set it to "cls".
    - `max_seq_length`: `int`. Maximum sequence length. Default is 512.
    - `prompt`: `str`. Prompt placed in front of the model input for retrieval tasks. Default is an empty string.
    - `model_kwargs`: `dict`. Keyword arguments for the model. Default value is `{"torch_dtype": "auto"}`.
    - `config_kwargs`: `Dict[str, Any]`. Keyword arguments for the configuration. Default is an empty dictionary.
    - `encode_kwargs`: `dict`. Keyword arguments for encoding. Default value is `{"show_progress_bar": True, "batch_size": 32}`.
    - `hub`: `str`. Source of the model, either "modelscope" or "huggingface".
  - `eval`: A dictionary containing the following fields:
    - `tasks`: `List[str]`. Task names. See Task List.
    - `top_k`: `int`. Number of top results to select for retrieval tasks.
    - `verbosity`: `int`. Level of verbosity. Range is 0-3.
    - `output_folder`: `str`. Output folder. Default is "outputs".
    - `overwrite_results`: `bool`. Whether to overwrite results. Default is `True`.
    - `limits`: `Optional[int]`. Limit on the number of samples. Default is `None`; setting it is not recommended for retrieval tasks.
    - `hub`: `str`. Source of the dataset, either "modelscope" or "huggingface".
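As an illustration of these fields, the sketch below spells out a complete model entry for a bge-series embedding model, using the recommended "cls" pooling. The model ID is an assumed example rather than something this document specifies, and the final check is a hypothetical convenience, not part of the framework.

```python
# Pooling modes listed in the parameter explanation above.
POOLING_MODES = {"cls", "lasttoken", "max", "mean", "mean_sqrt_len_tokens", "weightedmean"}

# Model entry for a bge-series embedding model.
# NOTE: the model ID below is an assumed example; replace it with your own checkpoint.
bge_model_cfg = {
    "model_name_or_path": "AI-ModelScope/bge-large-zh-v1.5",  # assumed ID
    "is_cross_encoder": False,
    "pooling_mode": "cls",  # recommended above for bge models
    "max_seq_length": 512,
    "prompt": "",
    "model_kwargs": {"torch_dtype": "auto"},
    "encode_kwargs": {"show_progress_bar": True, "batch_size": 32},
    "hub": "modelscope",
}

# Catch typos in the pooling mode before a long evaluation run starts.
mode = bge_model_cfg["pooling_mode"]
assert mode is None or mode in POOLING_MODES, f"unknown pooling_mode: {mode!r}"
```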
Two-stage Evaluation#
Example configuration file: first perform retrieval, then reranking:
two_stage_task_cfg = {
    "eval_backend": "RAGEval",
    "eval_config": {
        "tool": "MTEB",
        "model": [
            {
                "model_name_or_path": "AI-ModelScope/m3e-base",
                "is_cross_encoder": False,
                "max_seq_length": 512,
                "model_kwargs": {"torch_dtype": "auto"},
                "encode_kwargs": {
                    "batch_size": 64,
                },
            },
            {
                "model_name_or_path": "OpenBMB/MiniCPM-Reranker",
                "is_cross_encoder": True,
                "max_seq_length": 512,
                "prompt": "Generate a retrieval representation for this question",
                "model_kwargs": {"torch_dtype": "auto"},
                "encode_kwargs": {
                    "batch_size": 32,
                },
            },
        ],
        "eval": {
            "tasks": ["T2Retrieval"],
            "verbosity": 2,
            "output_folder": "outputs",
            "overwrite_results": True,
            "top_k": 5,
            "limits": 100,
        },
    },
}
Parameter Explanation#
The basic parameters are the same as those for single-stage evaluation. The difference lies in the model field, where two models need to be provided. The first model is used for retrieval, and the second model is used for reranking. The reranking model needs to be a cross-encoder, i.e., is_cross_encoder=True.
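Conceptually, two-stage evaluation is a retrieve-then-rerank pipeline. The sketch below illustrates that flow with toy scoring functions; `embed_score` and `cross_score` are hypothetical stand-ins for a real embedding model and a real cross-encoder, and nothing here reflects the framework's internal implementation.

```python
def embed_score(query, doc):
    """Toy first-stage score: number of shared words (stand-in for embedding similarity)."""
    return len(set(query.lower().split()) & set(doc.lower().split()))


def cross_score(query, doc):
    """Toy second-stage score: share of document words that occur in the query
    (stand-in for a cross-encoder that reads query and document together)."""
    query_words = set(query.lower().split())
    words = doc.lower().split()
    return sum(w in query_words for w in words) / len(words)


def two_stage_search(query, corpus, top_k=5):
    # Stage 1: rank the whole corpus with the cheap scorer and keep top_k candidates.
    candidates = sorted(corpus, key=lambda d: embed_score(query, corpus[d]), reverse=True)[:top_k]
    # Stage 2: re-score only those candidates with the more expensive scorer.
    return sorted(candidates, key=lambda d: cross_score(query, corpus[d]), reverse=True)


corpus = {
    "doc1": "remote work offers flexibility",
    "doc2": "the benefits of remote work include greater flexibility and reduced commuting time",
    "doc3": "benefits of remote work",
    "doc4": "electric vehicles are becoming popular",
}
ranking = two_stage_search("benefits of remote work", corpus, top_k=3)
print(ranking)
```

Because the second stage only reorders the candidates kept by the first stage, a document dropped in stage 1 (here `doc4`) can never reappear in the final ranking.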
Model Evaluation#
from evalscope.run import run_task
from evalscope.utils.logger import get_logger

logger = get_logger()

# Use one of the configurations defined above:
# one_stage_task_cfg or two_stage_task_cfg

# Run the single-stage task
run_task(task_cfg=one_stage_task_cfg)
# or run the two-stage task
# run_task(task_cfg=two_stage_task_cfg)
The following is an example of the output:
Single-stage Evaluation
Outputs
{
"dataset_revision": "317f262bf1e6126357bbe89e875451e4b0938fe4",
"evaluation_time": 16.50650382041931,
"kg_co2_emissions": null,
"mteb_version": "1.14.15",
"scores": {
"validation": [
{
"accuracy": 0.4744,
"f1": 0.44562489526640825,
"f1_weighted": 0.47540307398330806,
"hf_subset": "default",
"languages": [
"cmn-Hans"
],
"main_score": 0.4744,
"scores_per_experiment": [
{
"accuracy": 0.48,
"f1": 0.4536376605217497,
"f1_weighted": 0.47800277926811163
},
{
"accuracy": 0.48,
"f1": 0.44713633954639176,
"f1_weighted": 0.4826984434763292
},
{
"accuracy": 0.462,
"f1": 0.433365706955334,
"f1_weighted": 0.4640970055245127
},
{
"accuracy": 0.484,
"f1": 0.4586732839614161,
"f1_weighted": 0.4857359110392786
},
{
"accuracy": 0.462,
"f1": 0.4293797541165097,
"f1_weighted": 0.4632657330831137
},
{
"accuracy": 0.474,
"f1": 0.44775120246296396,
"f1_weighted": 0.4737182842092953
},
{
"accuracy": 0.47,
"f1": 0.4431197566080463,
"f1_weighted": 0.4714830140231783
},
{
"accuracy": 0.472,
"f1": 0.44322381694059326,
"f1_weighted": 0.47100005556357255
},
{
"accuracy": 0.484,
"f1": 0.45454749692062835,
"f1_weighted": 0.4856239367465818
},
{
"accuracy": 0.476,
"f1": 0.44541393463044954,
"f1_weighted": 0.47840557689910646
}
]
}
]
},
"task_name": "TNews"
}
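In the TNews result above, `main_score` matches the mean of the `accuracy` values in `scores_per_experiment`. The following sketch checks this with the ten values copied from the output. Reading `main_score` as this mean is an observation about this particular output, not a general statement about how every MTEB task computes its score.

```python
from statistics import mean, stdev

# Accuracy of each of the ten experiments, copied from the TNews output above.
accuracies = [0.48, 0.48, 0.462, 0.484, 0.462, 0.474, 0.47, 0.472, 0.484, 0.476]
reported_main_score = 0.4744

avg = mean(accuracies)
spread = stdev(accuracies)  # sample standard deviation across experiments

print(f"mean accuracy = {avg:.4f}, std = {spread:.4f}")
assert abs(avg - reported_main_score) < 1e-9
```

The spread across experiments gives a rough sense of how much of a score difference between two models could be noise.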
Two-stage Evaluation
first stage
{
"dataset_revision": "8731a845f1bf500a4f111cf1070785c793d10e64",
"evaluation_time": 599.5170171260834,
"kg_co2_emissions": null,
"mteb_version": "1.14.15",
"scores": {
"dev": [
{
"hf_subset": "default",
"languages": [
"cmn-Hans"
],
"main_score": 0.73143,
"map_at_1": 0.22347,
"map_at_10": 0.63237,
"map_at_100": 0.67533,
"map_at_1000": 0.67651,
"map_at_20": 0.66282,
"map_at_3": 0.43874,
"map_at_5": 0.54049,
"mrr_at_1": 0.7898912852884447,
"mrr_at_10": 0.8402654617870331,
"mrr_at_100": 0.8421827758769684,
"mrr_at_1000": 0.8422583001072272,
"mrr_at_20": 0.8415411456315557,
"mrr_at_3": 0.8307469752761716,
"mrr_at_5": 0.8368029984218875,
"nauc_map_at_1000_diff1": 0.17749400860890877,
"nauc_map_at_1000_max": 0.42844516520725967,
"nauc_map_at_1000_std": 0.18789871694419072,
"nauc_map_at_100_diff1": 0.17747467084779375,
"nauc_map_at_100_max": 0.42732291785494575,
"nauc_map_at_100_std": 0.18694287087286737,
"nauc_map_at_10_diff1": 0.19976199493034202,
"nauc_map_at_10_max": 0.3374436217668296,
"nauc_map_at_10_std": 0.07951451707732717,
"nauc_map_at_1_diff1": 0.41727578149080663,
"nauc_map_at_1_max": -0.1402656422184478,
"nauc_map_at_1_std": -0.26168722519030313,
"nauc_map_at_20_diff1": 0.1811898211371171,
"nauc_map_at_20_max": 0.40563441466210043,
"nauc_map_at_20_std": 0.15927727170010608,
"nauc_map_at_3_diff1": 0.31255422845809033,
"nauc_map_at_3_max": 0.007523677231905161,
"nauc_map_at_3_std": -0.19578481884353466,
"nauc_map_at_5_diff1": 0.26073699217160473,
"nauc_map_at_5_max": 0.14665611579604088,
"nauc_map_at_5_std": -0.09600383298672226,
"nauc_mrr_at_1000_diff1": 0.3819666309367981,
"nauc_mrr_at_1000_max": 0.6285393024619401,
"nauc_mrr_at_1000_std": 0.3294970299417527,
"nauc_mrr_at_100_diff1": 0.3819436006743644,
"nauc_mrr_at_100_max": 0.6286346262471935,
"nauc_mrr_at_100_std": 0.32963045935037844,
"nauc_mrr_at_10_diff1": 0.3819124721154632,
"nauc_mrr_at_10_max": 0.6292778905762176,
"nauc_mrr_at_10_std": 0.3298187966196067,
"nauc_mrr_at_1_diff1": 0.3862589251033909,
"nauc_mrr_at_1_max": 0.589976680174432,
"nauc_mrr_at_1_std": 0.2780515387897469,
"nauc_mrr_at_20_diff1": 0.38198959771391816,
"nauc_mrr_at_20_max": 0.6290569436652999,
"nauc_mrr_at_20_std": 0.3301570340189363,
"nauc_mrr_at_3_diff1": 0.3825046940733129,
"nauc_mrr_at_3_max": 0.6282507269128365,
"nauc_mrr_at_3_std": 0.3260807934869131,
"nauc_mrr_at_5_diff1": 0.3816317396711923,
"nauc_mrr_at_5_max": 0.6288655177904692,
"nauc_mrr_at_5_std": 0.3298854062538469,
"nauc_ndcg_at_1000_diff1": 0.21319598381916555,
"nauc_ndcg_at_1000_max": 0.5328295949130256,
"nauc_ndcg_at_1000_std": 0.2946773445135694,
"nauc_ndcg_at_100_diff1": 0.2089807772703975,
"nauc_ndcg_at_100_max": 0.5239397690321543,
"nauc_ndcg_at_100_std": 0.29123456982125717,
"nauc_ndcg_at_10_diff1": 0.20555333230027603,
"nauc_ndcg_at_10_max": 0.44316027023003046,
"nauc_ndcg_at_10_std": 0.1921835220940756,
"nauc_ndcg_at_1_diff1": 0.3862589251033909,
"nauc_ndcg_at_1_max": 0.589976680174432,
"nauc_ndcg_at_1_std": 0.2780515387897469,
"nauc_ndcg_at_20_diff1": 0.20754208582741446,
"nauc_ndcg_at_20_max": 0.4786092392092643,
"nauc_ndcg_at_20_std": 0.23536973680564616,
"nauc_ndcg_at_3_diff1": 0.1902823773882388,
"nauc_ndcg_at_3_max": 0.5400466380622567,
"nauc_ndcg_at_3_std": 0.2713874990424778,
"nauc_ndcg_at_5_diff1": 0.18279298790691637,
"nauc_ndcg_at_5_max": 0.4916119327522918,
"nauc_ndcg_at_5_std": 0.2375397192963552,
"nauc_precision_at_1000_diff1": -0.20510380600112582,
"nauc_precision_at_1000_max": 0.4958820760698651,
"nauc_precision_at_1000_std": 0.5402465580496146,
"nauc_precision_at_100_diff1": -0.1994322347949809,
"nauc_precision_at_100_max": 0.5206762748551254,
"nauc_precision_at_100_std": 0.5568154081333078,
"nauc_precision_at_10_diff1": -0.16707155441197413,
"nauc_precision_at_10_max": 0.5600612846655972,
"nauc_precision_at_10_std": 0.49419688804691536,
"nauc_precision_at_1_diff1": 0.3862589251033909,
"nauc_precision_at_1_max": 0.589976680174432,
"nauc_precision_at_1_std": 0.2780515387897469,
"nauc_precision_at_20_diff1": -0.18471041949530417,
"nauc_precision_at_20_max": 0.5458950955439645,
"nauc_precision_at_20_std": 0.5355982267058214,
"nauc_precision_at_3_diff1": -0.03826790088047189,
"nauc_precision_at_3_max": 0.5833083970750171,
"nauc_precision_at_3_std": 0.380196662597275,
"nauc_precision_at_5_diff1": -0.11789367842600275,
"nauc_precision_at_5_max": 0.5708494593335263,
"nauc_precision_at_5_std": 0.42860609671688105,
"nauc_recall_at_1000_diff1": 0.1341309660059583,
"nauc_recall_at_1000_max": 0.5923755841077135,
"nauc_recall_at_1000_std": 0.5980459502693942,
"nauc_recall_at_100_diff1": 0.12181394285840096,
"nauc_recall_at_100_max": 0.47090136790318127,
"nauc_recall_at_100_std": 0.3959369184297595,
"nauc_recall_at_10_diff1": 0.17356300971546512,
"nauc_recall_at_10_max": 0.25475707245853674,
"nauc_recall_at_10_std": 0.041819982320384745,
"nauc_recall_at_1_diff1": 0.41727578149080663,
"nauc_recall_at_1_max": -0.1402656422184478,
"nauc_recall_at_1_std": -0.26168722519030313,
"nauc_recall_at_20_diff1": 0.14273713155999543,
"nauc_recall_at_20_max": 0.36251116771924663,
"nauc_recall_at_20_std": 0.1912123941692314,
"nauc_recall_at_3_diff1": 0.2873719855400218,
"nauc_recall_at_3_max": -0.041198403561830285,
"nauc_recall_at_3_std": -0.21921947922872737,
"nauc_recall_at_5_diff1": 0.23680082643694844,
"nauc_recall_at_5_max": 0.06580524171324151,
"nauc_recall_at_5_std": -0.14104561361502632,
"ndcg_at_1": 0.78989,
"ndcg_at_10": 0.73143,
"ndcg_at_100": 0.78829,
"ndcg_at_1000": 0.80026,
"ndcg_at_20": 0.75787,
"ndcg_at_3": 0.7417,
"ndcg_at_5": 0.72641,
"precision_at_1": 0.78989,
"precision_at_10": 0.37304,
"precision_at_100": 0.04828,
"precision_at_1000": 0.00511,
"precision_at_20": 0.21403,
"precision_at_3": 0.65461,
"precision_at_5": 0.54942,
"recall_at_1": 0.22347,
"recall_at_10": 0.73318,
"recall_at_100": 0.91093,
"recall_at_1000": 0.97197,
"recall_at_20": 0.81286,
"recall_at_3": 0.46573,
"recall_at_5": 0.59383
}
]
},
"task_name": "T2Retrieval"
}
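The `precision_at_k`, `recall_at_k`, and `mrr_at_k` entries follow the usual information-retrieval definitions. As a reminder, here is a minimal sketch for a single query with binary relevance. The helper names are hypothetical, and MTEB's own implementation, which averages over all queries, may differ in details.

```python
def precision_at_k(ranked_ids, relevant_ids, k):
    """Fraction of the top-k results that are relevant."""
    hits = sum(doc_id in relevant_ids for doc_id in ranked_ids[:k])
    return hits / k


def recall_at_k(ranked_ids, relevant_ids, k):
    """Fraction of all relevant documents that appear in the top k."""
    hits = sum(doc_id in relevant_ids for doc_id in ranked_ids[:k])
    return hits / len(relevant_ids)


def reciprocal_rank_at_k(ranked_ids, relevant_ids, k):
    """1 / rank of the first relevant result within the top k, or 0 if none."""
    for rank, doc_id in enumerate(ranked_ids[:k], start=1):
        if doc_id in relevant_ids:
            return 1.0 / rank
    return 0.0


ranked = ["doc7", "doc2", "doc9", "doc4", "doc1"]
relevant = {"doc2", "doc4", "doc8"}

print(precision_at_k(ranked, relevant, k=3))        # 1 hit in the top 3
print(recall_at_k(ranked, relevant, k=5))           # 2 of 3 relevant documents found
print(reciprocal_rank_at_k(ranked, relevant, k=5))  # first hit at rank 2
```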
second stage
{
"dataset_revision": "8731a845f1bf500a4f111cf1070785c793d10e64",
"evaluation_time": 332.15709686279297,
"kg_co2_emissions": null,
"mteb_version": "1.14.15",
"scores": {
"dev": [
{
"hf_subset": "default",
"languages": [
"cmn-Hans"
],
"main_score": 0.661,
"map_at_1": 0.24264,
"map_at_10": 0.56291,
"map_at_100": 0.56291,
"map_at_1000": 0.56291,
"map_at_20": 0.56291,
"map_at_3": 0.4714,
"map_at_5": 0.56291,
"mrr_at_1": 0.841969139049623,
"mrr_at_10": 0.8689147524694633,
"mrr_at_100": 0.8689147524694633,
"mrr_at_1000": 0.8689147524694633,
"mrr_at_20": 0.8689147524694633,
"mrr_at_3": 0.8664883979192248,
"mrr_at_5": 0.8689147524694633,
"nauc_map_at_1000_diff1": 0.12071580301051653,
"nauc_map_at_1000_max": 0.2536691069727338,
"nauc_map_at_1000_std": 0.343624832364704,
"nauc_map_at_100_diff1": 0.12071580301051653,
"nauc_map_at_100_max": 0.2536691069727338,
"nauc_map_at_100_std": 0.343624832364704,
"nauc_map_at_10_diff1": 0.12071580301051653,
"nauc_map_at_10_max": 0.2536691069727338,
"nauc_map_at_10_std": 0.343624832364704,
"nauc_map_at_1_diff1": 0.47964980727810325,
"nauc_map_at_1_max": -0.08015044571696166,
"nauc_map_at_1_std": 0.3507257834956417,
"nauc_map_at_20_diff1": 0.12071580301051653,
"nauc_map_at_20_max": 0.2536691069727338,
"nauc_map_at_20_std": 0.343624832364704,
"nauc_map_at_3_diff1": 0.23481937699306626,
"nauc_map_at_3_max": 0.10372745264123306,
"nauc_map_at_3_std": 0.45345158923063256,
"nauc_map_at_5_diff1": 0.12071580301051653,
"nauc_map_at_5_max": 0.2536691069727338,
"nauc_map_at_5_std": 0.343624832364704,
"nauc_mrr_at_1000_diff1": 0.23393918304502795,
"nauc_mrr_at_1000_max": 0.8703379129725659,
"nauc_mrr_at_1000_std": 0.5785333616122065,
"nauc_mrr_at_100_diff1": 0.23393918304502795,
"nauc_mrr_at_100_max": 0.8703379129725659,
"nauc_mrr_at_100_std": 0.5785333616122065,
"nauc_mrr_at_10_diff1": 0.23393918304502795,
"nauc_mrr_at_10_max": 0.8703379129725659,
"nauc_mrr_at_10_std": 0.5785333616122065,
"nauc_mrr_at_1_diff1": 0.2520016067648708,
"nauc_mrr_at_1_max": 0.8560897633767299,
"nauc_mrr_at_1_std": 0.5642467684745208,
"nauc_mrr_at_20_diff1": 0.23393918304502795,
"nauc_mrr_at_20_max": 0.8703379129725659,
"nauc_mrr_at_20_std": 0.5785333616122065,
"nauc_mrr_at_3_diff1": 0.2343988881957151,
"nauc_mrr_at_3_max": 0.8695482778251757,
"nauc_mrr_at_3_std": 0.5799167198804328,
"nauc_mrr_at_5_diff1": 0.23393918304502795,
"nauc_mrr_at_5_max": 0.8703379129725659,
"nauc_mrr_at_5_std": 0.5785333616122065,
"nauc_ndcg_at_1000_diff1": 0.11252208055013257,
"nauc_ndcg_at_1000_max": 0.3417865079349515,
"nauc_ndcg_at_1000_std": 0.3623961771041499,
"nauc_ndcg_at_100_diff1": 0.11252208055013257,
"nauc_ndcg_at_100_max": 0.3417865079349515,
"nauc_ndcg_at_100_std": 0.3623961771041499,
"nauc_ndcg_at_10_diff1": 0.10015448775533999,
"nauc_ndcg_at_10_max": 0.3761759074862075,
"nauc_ndcg_at_10_std": 0.35152523471339914,
"nauc_ndcg_at_1_diff1": 0.2524564785684737,
"nauc_ndcg_at_1_max": 0.8566368743831702,
"nauc_ndcg_at_1_std": 0.5635391925059349,
"nauc_ndcg_at_20_diff1": 0.11228113618796766,
"nauc_ndcg_at_20_max": 0.34274993051851965,
"nauc_ndcg_at_20_std": 0.36216437469674284,
"nauc_ndcg_at_3_diff1": -0.062134030685870506,
"nauc_ndcg_at_3_max": 0.7183183844837573,
"nauc_ndcg_at_3_std": 0.3352626268658533,
"nauc_ndcg_at_5_diff1": -0.04476981761624879,
"nauc_ndcg_at_5_max": 0.6272060974309411,
"nauc_ndcg_at_5_std": 0.21341258393783158,
"nauc_precision_at_1000_diff1": -0.3554940965683014,
"nauc_precision_at_1000_max": 0.605443274008298,
"nauc_precision_at_1000_std": -0.13073611213585504,
"nauc_precision_at_100_diff1": -0.35549409656830133,
"nauc_precision_at_100_max": 0.6054432740082977,
"nauc_precision_at_100_std": -0.13073611213585531,
"nauc_precision_at_10_diff1": -0.3554940965683011,
"nauc_precision_at_10_max": 0.6054432740082981,
"nauc_precision_at_10_std": -0.1307361121358551,
"nauc_precision_at_1_diff1": 0.2524564785684737,
"nauc_precision_at_1_max": 0.8566368743831702,
"nauc_precision_at_1_std": 0.5635391925059349,
"nauc_precision_at_20_diff1": -0.3554940965683011,
"nauc_precision_at_20_max": 0.6054432740082981,
"nauc_precision_at_20_std": -0.1307361121358551,
"nauc_precision_at_3_diff1": -0.3377658698816377,
"nauc_precision_at_3_max": 0.6780151277792397,
"nauc_precision_at_3_std": 0.12291559606586676,
"nauc_precision_at_5_diff1": -0.3554940965683011,
"nauc_precision_at_5_max": 0.6054432740082981,
"nauc_precision_at_5_std": -0.1307361121358551,
"nauc_recall_at_1000_diff1": 0.1091970342988605,
"nauc_recall_at_1000_max": 0.18339955163544436,
"nauc_recall_at_1000_std": 0.30756376767627086,
"nauc_recall_at_100_diff1": 0.1091970342988605,
"nauc_recall_at_100_max": 0.18339955163544436,
"nauc_recall_at_100_std": 0.30756376767627086,
"nauc_recall_at_10_diff1": 0.1091970342988605,
"nauc_recall_at_10_max": 0.18339955163544436,
"nauc_recall_at_10_std": 0.30756376767627086,
"nauc_recall_at_1_diff1": 0.47964980727810325,
"nauc_recall_at_1_max": -0.08015044571696166,
"nauc_recall_at_1_std": 0.3507257834956417,
"nauc_recall_at_20_diff1": 0.1091970342988605,
"nauc_recall_at_20_max": 0.18339955163544436,
"nauc_recall_at_20_std": 0.30756376767627086,
"nauc_recall_at_3_diff1": 0.22013063499116758,
"nauc_recall_at_3_max": 0.054749114965246065,
"nauc_recall_at_3_std": 0.4258163949018153,
"nauc_recall_at_5_diff1": 0.1091970342988605,
"nauc_recall_at_5_max": 0.18339955163544436,
"nauc_recall_at_5_std": 0.30756376767627086,
"ndcg_at_1": 0.84188,
"ndcg_at_10": 0.661,
"ndcg_at_100": 0.6534,
"ndcg_at_1000": 0.6534,
"ndcg_at_20": 0.65358,
"ndcg_at_3": 0.7826,
"ndcg_at_5": 0.74517,
"precision_at_1": 0.84188,
"precision_at_10": 0.27471,
"precision_at_100": 0.02747,
"precision_at_1000": 0.00275,
"precision_at_20": 0.13736,
"precision_at_3": 0.68633,
"precision_at_5": 0.54942,
"recall_at_1": 0.24264,
"recall_at_10": 0.59383,
"recall_at_100": 0.59383,
"recall_at_1000": 0.59383,
"recall_at_20": 0.59383,
"recall_at_3": 0.48934,
"recall_at_5": 0.59383
}
]
},
"task_name": "T2Retrieval"
}
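One way to read the two outputs is to place a few metrics side by side. The values below are copied from the results above. Note that second-stage recall stops growing at `recall_at_5`; this is consistent with the reranker only seeing the top 5 first-stage candidates (the top-k value is 5 in the example configuration), although that reading is an inference from the numbers rather than documented behavior.

```python
# Selected metrics copied from the first-stage and second-stage outputs above.
first_stage = {"ndcg_at_1": 0.78989, "mrr_at_10": 0.8402654617870331,
               "recall_at_5": 0.59383, "recall_at_10": 0.73318}
second_stage = {"ndcg_at_1": 0.84188, "mrr_at_10": 0.8689147524694633,
                "recall_at_5": 0.59383, "recall_at_10": 0.59383}

print(f"{'metric':<14}{'stage 1':>10}{'stage 2':>10}{'delta':>10}")
for name in first_stage:
    s1, s2 = first_stage[name], second_stage[name]
    print(f"{name:<14}{s1:>10.4f}{s2:>10.4f}{s2 - s1:>+10.4f}")

# Reranking reorders candidates but cannot add documents that stage 1 dropped,
# so recall beyond the cut-off stays at the stage-1 recall_at_5 value.
assert second_stage["recall_at_10"] == first_stage["recall_at_5"]
```

In this run the reranker improves the top of the ranking (`ndcg_at_1`, `mrr_at_10`), while metrics at larger cut-offs such as `recall_at_10` are lower than in the first stage because only five candidates are reranked.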
Custom Dataset Evaluation#
Custom Text Retrieval Evaluation#
Construct the Dataset with the following format:
retrieval_data
├── corpus.jsonl
├── queries.jsonl
└── qrels
    └── test.tsv
Where:
- `corpus.jsonl`: Corpus file. Each line is a JSON object of the form `{"_id": "xxx", "text": "xxx"}`, where `_id` is the corpus ID and `text` is the corpus text. For example:

      {"_id": "doc1", "text": "Climate change is leading to more extreme weather patterns."}
      {"_id": "doc2", "text": "The stock market surged today, led by tech stocks."}
      {"_id": "doc3", "text": "AI is transforming various industries by automating tasks and providing insights."}
      {"_id": "doc4", "text": "With technological advancements, renewable energy sources like wind and solar are becoming more widespread."}
      {"_id": "doc5", "text": "Recent studies show that a balanced diet and regular exercise can significantly improve mental health."}
      {"_id": "doc6", "text": "Virtual reality is creating new opportunities in education, entertainment, and training."}
      {"_id": "doc7", "text": "Electric vehicles are becoming increasingly popular due to environmental benefits and advances in battery technology."}
      {"_id": "doc8", "text": "Space exploration missions are revealing new information about our solar system and beyond."}
      {"_id": "doc9", "text": "Blockchain technology has potential applications beyond cryptocurrency, including supply chain management and secure voting systems."}
      {"_id": "doc10", "text": "The benefits of remote work include greater flexibility and reduced commuting time."}

- `queries.jsonl`: Query file. Each line is a JSON object of the form `{"_id": "xxx", "text": "xxx"}`, where `_id` is the query ID and `text` is the query text. For example:

      {"_id": "query1", "text": "What are the impacts of climate change?"}
      {"_id": "query2", "text": "What caused the stock market to rise today?"}
      {"_id": "query3", "text": "How is AI changing industries?"}
      {"_id": "query4", "text": "What are the advancements in renewable energy?"}
      {"_id": "query5", "text": "How does a balanced diet improve mental health?"}
      {"_id": "query6", "text": "What new opportunities does virtual reality create?"}
      {"_id": "query7", "text": "Why are electric vehicles becoming more popular?"}
      {"_id": "query8", "text": "What new information has space exploration revealed?"}
      {"_id": "query9", "text": "What are the applications of blockchain technology beyond cryptocurrency?"}
      {"_id": "query10", "text": "What are the benefits of remote work?"}

- `qrels`: Evaluation file(s) in TSV format, with columns `query-id`, `corpus-id`, and `score`, where `query-id` is the query ID, `corpus-id` is the corpus ID, and `score` is the relevance score. For example:

      query-id	corpus-id	score
      query1	doc1	1
      query2	doc2	1
      query3	doc3	1
      query4	doc4	1
      query5	doc5	1
      query6	doc6	1
      query7	doc7	1
      query8	doc8	1
      query9	doc9	1
      query10	doc10	1
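If the data already lives in Python objects, the three files can be written with the standard library. This is a minimal sketch: the helper name `write_retrieval_dataset` is hypothetical, the qrels header follows the example shown above, and a temporary directory is used only so that the sketch has no side effects.

```python
import csv
import json
import tempfile
from pathlib import Path


def write_retrieval_dataset(root, corpus, queries, qrels):
    """Write corpus.jsonl, queries.jsonl and qrels/test.tsv under `root`.

    corpus / queries: dict mapping ID -> text
    qrels: iterable of (query_id, corpus_id, score) tuples
    """
    root = Path(root)
    (root / "qrels").mkdir(parents=True, exist_ok=True)

    for filename, records in (("corpus.jsonl", corpus), ("queries.jsonl", queries)):
        with open(root / filename, "w", encoding="utf-8") as f:
            for _id, text in records.items():
                # One JSON object per line; keep non-ASCII text readable.
                f.write(json.dumps({"_id": _id, "text": text}, ensure_ascii=False) + "\n")

    with open(root / "qrels" / "test.tsv", "w", encoding="utf-8", newline="") as f:
        writer = csv.writer(f, delimiter="\t", lineterminator="\n")
        writer.writerow(["query-id", "corpus-id", "score"])
        writer.writerows(qrels)


# Temporary location for the demo; point this at your real dataset directory.
out_dir = Path(tempfile.mkdtemp()) / "retrieval_data"
write_retrieval_dataset(
    out_dir,
    corpus={"doc1": "Climate change is leading to more extreme weather patterns."},
    queries={"query1": "What are the impacts of climate change?"},
    qrels=[("query1", "doc1", 1)],
)
print(sorted(p.relative_to(out_dir).as_posix() for p in out_dir.rglob("*")))
```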
Construct the Configuration File
task_cfg = {
    "eval_backend": "RAGEval",
    "eval_config": {
        "tool": "MTEB",
        "model": [
            {
                "model_name_or_path": "AI-ModelScope/m3e-base",
                "pooling_mode": None,  # load from model config
                "max_seq_length": 512,
                "prompt": "",
                "model_kwargs": {"torch_dtype": "auto"},
                "encode_kwargs": {
                    "batch_size": 128,
                },
            }
        ],
        "eval": {
            "tasks": ["CustomRetrieval"],
            "dataset_path": "custom_eval/text/retrieval",
            "verbosity": 2,
            "output_folder": "outputs",
            "overwrite_results": True,
            "limits": 500,
        },
    },
}
Parameter description. Only the following parameters need to change from the default configuration:
- `eval`:
  - `tasks`: Evaluation task; must be `CustomRetrieval`.
  - `dataset_path`: Path to the custom dataset.
Run the Evaluation
from evalscope.run import run_task
run_task(task_cfg=task_cfg)