MTEB

C-MTEB Evaluation Datasets

Name	Hub Link	Description	Type	Category	Number of Test Samples
T2Retrieval	C-MTEB/T2Retrieval	T2Ranking: A large-scale Chinese passage ranking benchmark	Retrieval	s2p	24,832
MMarcoRetrieval	C-MTEB/MMarcoRetrieval	mMARCO, a multilingual version of the MS MARCO passage ranking dataset	Retrieval	s2p	7,437
DuRetrieval	C-MTEB/DuRetrieval	A large-scale Chinese benchmark for passage retrieval from a web search engine	Retrieval	s2p	4,000
CovidRetrieval	C-MTEB/CovidRetrieval	COVID-19 news articles	Retrieval	s2p	949
CmedqaRetrieval	C-MTEB/CmedqaRetrieval	Online medical consultation texts	Retrieval	s2p	3,999
EcomRetrieval	C-MTEB/EcomRetrieval	Passage retrieval dataset collected from Alibaba's e-commerce search engine	Retrieval	s2p	1,000
MedicalRetrieval	C-MTEB/MedicalRetrieval	Passage retrieval dataset collected from Alibaba's medical search engine	Retrieval	s2p	1,000
VideoRetrieval	C-MTEB/VideoRetrieval	Passage retrieval dataset collected from Alibaba's video search engine	Retrieval	s2p	1,000
T2Reranking	C-MTEB/T2Reranking	T2Ranking: A large-scale Chinese passage ranking benchmark	Re-ranking	s2p	24,382
MMarcoReranking	C-MTEB/MMarco-reranking	mMARCO, a multilingual version of the MS MARCO passage ranking dataset	Re-ranking	s2p	7,437
CMedQAv1	C-MTEB/CMedQAv1-reranking	Chinese community medical Q&A	Re-ranking	s2p	2,000
CMedQAv2	C-MTEB/CMedQAv2-reranking	Chinese community medical Q&A	Re-ranking	s2p	4,000
Ocnli	C-MTEB/OCNLI	Original Chinese natural language inference dataset	Pair Classification	s2s	3,000
Cmnli	C-MTEB/CMNLI	Chinese Multi-Genre Natural Language Inference	Pair Classification	s2s	139,000
CLSClusteringS2S	C-MTEB/CLSClusteringS2S	Clustering of titles from the CLS dataset into 13 main categories	Clustering	s2s	10,000
CLSClusteringP2P	C-MTEB/CLSClusteringP2P	Clustering of titles + abstracts from the CLS dataset into 13 main categories	Clustering	p2p	10,000
ThuNewsClusteringS2S	C-MTEB/ThuNewsClusteringS2S	Clustering titles from the THUCNews dataset	Clustering	s2s	10,000
ThuNewsClusteringP2P	C-MTEB/ThuNewsClusteringP2P	Clustering titles + abstracts from the THUCNews dataset	Clustering	p2p	10,000
ATEC	C-MTEB/ATEC	ATEC NLP Sentence Pair Similarity Competition	STS	s2s	20,000
BQ	C-MTEB/BQ	Banking Question Semantic Similarity	STS	s2s	10,000
LCQMC	C-MTEB/LCQMC	Large-scale Chinese Question Matching Corpus	STS	s2s	12,500
PAWSX	C-MTEB/PAWSX	Translated PAWS evaluation pairs	STS	s2s	2,000
STSB	C-MTEB/STSB	STS-B translated into Chinese	STS	s2s	1,360
AFQMC	C-MTEB/AFQMC	Ant Financial Question Matching Corpus	STS	s2s	3,861
QBQTC	C-MTEB/QBQTC	QQ Browser Query Title Corpus	STS	s2s	5,000
TNews	C-MTEB/TNews-classification	News Short Text Classification	Classification	s2s	10,000
IFlyTek	C-MTEB/IFlyTek-classification	Long Text Classification of Application Descriptions	Classification	s2s	2,600
Waimai	C-MTEB/waimai-classification	Sentiment Analysis of User Reviews on Food Delivery Platforms	Classification	s2s	1,000
OnlineShopping	C-MTEB/OnlineShopping-classification	Sentiment Analysis of User Reviews on Online Shopping Websites	Classification	s2s	1,000
MultilingualSentiment	C-MTEB/MultilingualSentiment-classification	A set of multilingual sentiment datasets grouped into three categories: positive, neutral, negative	Classification	s2s	3,000
JDReview	C-MTEB/JDReview-classification	Product reviews of the iPhone	Classification	s2s	533

For retrieval tasks, 100,000 candidates (always including the ground-truth passages) are sampled from the full corpus to reduce inference cost.
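The candidate sampling for retrieval tasks can be sketched as follows. This is a minimal illustration, not the procedure actually used to build C-MTEB. It assumes BEIR-style inputs (a `corpus` dict mapping document id to text and a `qrels` dict mapping query id to `{doc_id: relevance}`), and it assumes the non-relevant candidates are drawn uniformly at random, which the text above does not specify. The function name `sample_candidate_corpus` is hypothetical.

```python
import random
from typing import Dict, Set


def sample_candidate_corpus(
    corpus: Dict[str, str],
    qrels: Dict[str, Dict[str, int]],
    max_candidates: int = 100_000,
    seed: int = 42,
) -> Dict[str, str]:
    """Return a sub-corpus of at most `max_candidates` documents that
    always keeps every document judged relevant for some query.
    (Hypothetical helper; uniform negative sampling is an assumption.)"""
    # Ground truth: every document with a positive relevance label.
    relevant: Set[str] = {
        doc_id
        for judgments in qrels.values()
        for doc_id, score in judgments.items()
        if score > 0 and doc_id in corpus
    }
    if len(relevant) > max_candidates:
        raise ValueError("more relevant documents than the candidate budget")

    # Fill the remaining budget with randomly chosen non-relevant documents.
    others = [doc_id for doc_id in corpus if doc_id not in relevant]
    rng = random.Random(seed)  # fixed seed keeps the sample reproducible
    n_fill = min(max_candidates - len(relevant), len(others))
    keep = relevant | set(rng.sample(others, n_fill))

    # Preserve the original corpus order in the returned dict.
    return {doc_id: text for doc_id, text in corpus.items() if doc_id in keep}


corpus = {f"d{i}": f"passage {i}" for i in range(1000)}
qrels = {"q1": {"d7": 1}, "q2": {"d500": 1, "d999": 0}}
sub = sample_candidate_corpus(corpus, qrels, max_candidates=50)
print(len(sub), "d7" in sub, "d500" in sub)  # 50 True True
```

Because the relevant documents are added before the random fill, every query's ground truth remains retrievable no matter how small the candidate budget is, as long as it can hold all relevant documents.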