MSR-VTT
Overview
MSR-VTT is a large-scale open-domain video captioning benchmark for evaluating video-to-text generation.
The native adapter groups records by `video_id`, so multiple annotation rows for one video become a single sample
with multiple reference captions.
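
The grouping described above can be sketched in a few lines. This is a minimal illustration, not the adapter's actual code: the `caption` field name and the helper `group_captions_by_video` are assumptions made for the example; only `video_id` comes from this page.

```python
from collections import defaultdict


def group_captions_by_video(rows):
    """Collapse annotation rows into one sample per video_id (hypothetical helper)."""
    grouped = defaultdict(list)
    for row in rows:
        # "caption" is an assumed field name for the annotation text.
        grouped[row["video_id"]].append(row["caption"])
    # Dicts preserve insertion order, so samples follow first appearance.
    return [{"video_id": vid, "references": caps} for vid, caps in grouped.items()]


rows = [
    {"video_id": "video1", "caption": "a man is cooking"},
    {"video_id": "video2", "caption": "a dog runs on grass"},
    {"video_id": "video1", "caption": "someone prepares food"},
]
samples = group_captions_by_video(rows)
```

Here the three annotation rows collapse into two samples, and the sample for `video1` carries both of its captions as references.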
Task Description
Task Type: Video captioning
Input: Video clip or URL
Output: One concise natural-language caption
Domains: Open-domain video understanding and description
Evaluation Notes
Default data source: `AI-ModelScope/msr-vtt` on ModelScope, `validation` split
Hugging Face `VLM2Vec/MSR-VTT` remains available by setting `extra_params.dataset_hub="huggingface"`
Primary metric: CIDEr
Additional metrics: BLEU-1/2/3/4, METEOR, ROUGE-L
Set `extra_params.video_dir` to prefer local media files over URL metadata
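
To give a feel for how a multi-reference metric such as BLEU-1 uses several captions per video, here is a simplified sentence-level sketch. It is not the scorer used by the benchmark: tokenization is plain whitespace splitting, there is no corpus-level aggregation, and the function name `bleu1` is hypothetical.

```python
import math
from collections import Counter


def bleu1(candidate, references):
    """Simplified BLEU-1: clipped unigram precision times a brevity penalty."""
    cand = candidate.lower().split()
    refs = [r.lower().split() for r in references]
    if not cand or not refs:
        return 0.0
    # Clip each candidate word count by its maximum count in any single reference.
    max_ref = Counter()
    for ref in refs:
        for word, count in Counter(ref).items():
            max_ref[word] = max(max_ref[word], count)
    clipped = sum(min(count, max_ref[word]) for word, count in Counter(cand).items())
    precision = clipped / len(cand)
    # The brevity penalty uses the reference length closest to the candidate length.
    closest = min((len(r) for r in refs), key=lambda n: (abs(n - len(cand)), n))
    bp = 1.0 if len(cand) >= closest else math.exp(1 - closest / len(cand))
    return bp * precision


score = bleu1("a family is talking", ["a family is having a conversation", "a family is talking"])
```

Because each word is clipped against the most generous reference, a caption that matches any one of the references well is not penalized for differing from the others.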
Properties
| Property | Value |
|---|---|
| Benchmark Name | `msr_vtt` |
| Dataset ID | `AI-ModelScope/msr-vtt` |
| Paper | |
| Tags | |
| Metrics | |
| Default Shots | 0-shot |
| Evaluation Split | `validation` |
Data Statistics
| Metric | Value |
|---|---|
| Total Samples | 497 |
| Prompt Length (Mean) | 43 chars |
| Prompt Length (Min/Max) | 43 / 43 chars |
Video Statistics:
| Metric | Value |
|---|---|
| Total Videos | 497 |
| Videos per Sample | min: 1, max: 1, mean: 1 |
| Formats | mp4 |
Sample Example
Subset: `default`
{
  "input": [
    {
      "id": "36044e4b",
      "content": [
        {
          "text": "Describe the video in one concise sentence."
        },
        {
          "video": "https://www.youtube.com/watch?v=A9pM9iOuAzM",
          "format": "mp4",
          "start": 116.03,
          "end": 126.21
        }
      ]
    }
  ],
  "target": "[\"a family is having coversation\"]",
  "id": 0,
  "group_id": 0,
  "metadata": {
    "references": [
      "a family is having coversation"
    ],
    "subset": "default",
    "dataset_id": "AI-ModelScope/msr-vtt",
    "dataset_hub": "modelscope",
    "video": "https://www.youtube.com/watch?v=A9pM9iOuAzM",
    "start": 116.03,
    "end": 126.21,
    "fps": null,
    "video_id": "video6513",
    "category": 14
  }
}
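
Note that `target` in the sample is a JSON-encoded string holding the list of reference captions, while `metadata.references` holds the same captions as a plain list. A small sketch of reading such a record follows; the dictionary is trimmed to the fields used here, and the variable names are illustrative only.

```python
import json

# Trimmed copy of the sample above; only the fields used below are kept.
sample = {
    "target": '["a family is having coversation"]',
    "metadata": {"video_id": "video6513", "start": 116.03, "end": 126.21},
}

# Decode the JSON string into a Python list of reference captions.
references = json.loads(sample["target"])

# Clip length in seconds, derived from the start/end offsets into the source video.
clip_seconds = round(sample["metadata"]["end"] - sample["metadata"]["start"], 2)
```

The caption text is kept exactly as it appears in the dataset, including its original spelling.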
Prompt Template
Prompt Template:
Describe the video in one concise sentence.
Extra Parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
| `dataset_hub` | | `modelscope` | Dataset hub used to load MSR-VTT annotations. Choices: ['huggingface', 'modelscope', 'local'] |
| | | `` | Source split to load; defaults to validation for ModelScope and test for Hugging Face. |
| | | `` | Optional dataset revision; leave empty to use the hub default. |
| `video_dir` | | `` | Optional local directory containing MSR-VTT video files. |
| | | `` | Optional extension override for local videos, for example "mp4". |
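
As a hedged illustration of overriding these parameters, the dictionary below switches the hub to Hugging Face and points at a local video directory. It assumes that `extra_params` nests under the benchmark name inside `dataset_args`, as the commented-out line in the Python usage section suggests; the directory path is a placeholder and the variable name `msr_vtt_args` is hypothetical.

```python
# Overrides for the msr_vtt benchmark; values not listed keep their defaults.
msr_vtt_args = {
    'msr_vtt': {
        'extra_params': {
            'dataset_hub': 'huggingface',            # switch from the ModelScope default
            'video_dir': '/path/to/msr-vtt/videos',  # placeholder local directory
        }
    }
}
```

Under that assumption, the dictionary would be passed as `dataset_args=msr_vtt_args` when constructing `TaskConfig`.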
Usage
Using CLI
evalscope eval \
    --model YOUR_MODEL \
    --api-url OPENAI_API_COMPAT_URL \
    --api-key EMPTY_TOKEN \
    --datasets msr_vtt \
    --limit 10  # Remove this line for formal evaluation
Using Python
from evalscope import run_task
from evalscope.config import TaskConfig

task_cfg = TaskConfig(
    model='YOUR_MODEL',
    api_url='OPENAI_API_COMPAT_URL',
    api_key='EMPTY_TOKEN',
    datasets=['msr_vtt'],
    dataset_args={
        'msr_vtt': {
            # extra_params: {}  # uses default extra parameters
        }
    },
    limit=10,  # Remove this line for formal evaluation
)

run_task(task_cfg=task_cfg)