MSVD

Overview

MSVD is a classic video captioning benchmark of short web videos, each annotated with many human-written captions. The native adapter treats each video as one evaluation sample and uses all of its available captions as references.

Task Description

Task Type: Video captioning
Input: Video clip
Output: One concise natural-language caption
Domains: Open-domain video understanding and description

Evaluation Notes

Default data source: evalscope/MSVD on ModelScope, test split
Alternative data source: VLM2Vec/MSVD on Hugging Face, selected by setting dataset_hub="huggingface" in TaskConfig
Primary metric: CIDEr
Additional metrics: BLEU-1/2/3/4, METEOR, ROUGE-L
Set extra_params.video_dir when the dataset provides only video file names and the media files must be read from a local directory
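
Because every caption is scored against all references for its video, the n-gram metrics are multi-reference metrics. The following is a minimal sketch of sentence-level BLEU-1 with several references, intended only to illustrate the idea. It is not the adapter's scorer: the function name `bleu1`, the whitespace tokenization, and the lowercase normalization are assumptions made for this example, and the benchmark's actual scorers may tokenize and aggregate differently.

```python
import math
from collections import Counter


def bleu1(candidate: str, references: list[str]) -> float:
    """Illustrative BLEU-1: clipped unigram precision times a brevity penalty."""
    cand = candidate.lower().split()
    refs = [r.lower().split() for r in references]
    if not cand or not refs:
        return 0.0

    # A candidate token is credited at most as often as it appears
    # in the single reference that contains it most often.
    cand_counts = Counter(cand)
    max_ref_counts: Counter = Counter()
    for ref in refs:
        for tok, n in Counter(ref).items():
            max_ref_counts[tok] = max(max_ref_counts[tok], n)
    clipped = sum(min(n, max_ref_counts[tok]) for tok, n in cand_counts.items())
    precision = clipped / len(cand)

    # The brevity penalty compares against the reference whose length is
    # closest to the candidate's (ties resolved toward the shorter one).
    closest = min((len(r) for r in refs), key=lambda n: (abs(n - len(cand)), n))
    bp = 1.0 if len(cand) >= closest else math.exp(1 - closest / len(cand))
    return bp * precision
```

Having more references makes it more likely that a valid paraphrase such as "ping pong" in place of "table tennis" is matched by at least one of them, which is why the adapter keeps all captions rather than one.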

Properties

Property	Value
Benchmark Name	`msvd`
Dataset ID	evalscope/MSVD
Paper	Paper
Tags	`ImageCaptioning`, `MultiModal`, `Video`
Metrics	`Bleu_1`, `Bleu_2`, `Bleu_3`, `Bleu_4`, `METEOR`, `ROUGE_L`, `CIDEr`
Default Shots	0-shot
Evaluation Split	`test`

Data Statistics

Metric	Value
Total Samples	670
Prompt Length (Mean)	43 chars
Prompt Length (Min/Max)	43 / 43 chars

Video Statistics:

Metric	Value
Total Videos	670
Videos per Sample	min: 1, max: 1, mean: 1
Formats	mp4

Sample Example

Subset: default

{
  "input": [
    {
      "id": "a4b83275",
      "content": [
        {
          "text": "Describe the video in one concise sentence."
        },
        {
          "video": "fr9H1WLcF1A_256_261.avi",
          "format": "mp4"
        }
      ]
    }
  ],
  "target": "[\"two young men are playing table tennis\", \"men are playing table tennis\", \"two men are playing table tennis\", \"two men are playing a tabletennis\", \"a couple of people are playing a game of ping pong\", \"peoples are playing table tennis\", \"two ... [TRUNCATED 306 chars] ... e boys are playing\", \"people are playing ping pong\", \"18 kids and counting show the newest baby josie\", \"there is some kids and playing to each other\", \"the men played pingpong together\", \"the boys are playing ping pong\", \"2 boys is playing\"]",
  "id": 0,
  "group_id": 0,
  "metadata": {
    "references": [
      "two young men are playing table tennis",
      "men are playing table tennis",
      "two men are playing table tennis",
      "two men are playing a tabletennis",
      "a couple of people are playing a game of ping pong",
      "peoples are playing table tennis",
      "two men are playing pingpong",
      "two guys play table tennis",
      "two people are playing ping pong",
      "two boys are playing ping pong",
      "... [TRUNCATED 12 more items] ..."
    ],
    "subset": "default",
    "dataset_id": "evalscope/MSVD",
    "dataset_hub": "modelscope",
    "video": "fr9H1WLcF1A_256_261.avi",
    "video_id": "fr9H1WLcF1A_256_261",
    "source": "MSVD"
  }
}
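
In the sample above, `target` is a JSON-encoded string rather than a list, while `metadata.references` holds the same captions already decoded. A minimal sketch of recovering the reference list from `target` follows. The helper name `load_references` is hypothetical, and the sample dictionary is an abbreviated copy with only two of the captions.

```python
import json

# Abbreviated copy of the sample above; the caption list is shortened.
sample = {
    "target": "[\"two young men are playing table tennis\", \"men are playing table tennis\"]",
    "metadata": {"video_id": "fr9H1WLcF1A_256_261"},
}


def load_references(sample: dict) -> list[str]:
    """Decode the JSON-encoded `target` string into a list of captions."""
    refs = json.loads(sample["target"])
    if not isinstance(refs, list) or not all(isinstance(r, str) for r in refs):
        raise ValueError("target is not a JSON list of strings")
    return refs


references = load_references(sample)
```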

Prompt Template

Describe the video in one concise sentence.

Extra Parameters

Parameter	Type	Default	Description
`video_dir`	`str`	``	Optional local directory containing MSVD video files.
`video_extension`	`str`	``	Optional extension override for local videos, for example `mp4`.
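
The sample shown earlier names its video `fr9H1WLcF1A_256_261.avi` while reporting the format as `mp4`, which is the kind of mismatch an extension override addresses. The sketch below shows one plausible way the two parameters could combine into a local path. It is an assumption for illustration only: the helper `resolve_video_path` and the directory `/path/to/msvd/videos` are hypothetical, and the adapter's real resolution logic may differ.

```python
from pathlib import Path


def resolve_video_path(video_dir: str, video_name: str, video_extension: str = "") -> Path:
    """Hypothetical helper: join the directory and file name, then swap the suffix if asked."""
    path = Path(video_dir) / video_name
    if video_extension:
        # Accept both "mp4" and ".mp4" as the override value.
        path = path.with_suffix("." + video_extension.lstrip("."))
    return path


# Placeholder directory; replace with the folder that holds the MSVD videos.
local_path = resolve_video_path("/path/to/msvd/videos", "fr9H1WLcF1A_256_261.avi", "mp4")
```

With an empty `video_extension`, the file name from the dataset is used unchanged.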

Usage

Using CLI

evalscope eval \
    --model YOUR_MODEL \
    --api-url OPENAI_API_COMPAT_URL \
    --api-key EMPTY_TOKEN \
    --datasets msvd \
    --limit 10  # Remove this line for formal evaluation

Using Python

from evalscope import run_task
from evalscope.config import TaskConfig

task_cfg = TaskConfig(
    model='YOUR_MODEL',
    api_url='OPENAI_API_COMPAT_URL',
    api_key='EMPTY_TOKEN',
    datasets=['msvd'],
    dataset_args={
        'msvd': {
            # extra_params: {}  # uses default extra parameters
        }
    },
    limit=10,  # Remove this line for formal evaluation
)

run_task(task_cfg=task_cfg)
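
To evaluate against the Hugging Face copy of the dataset with locally stored videos, the configuration above might be extended as sketched below. The `dataset_hub` value and the `video_dir` parameter come from the notes on this page. The nesting of `extra_params` under `dataset_args['msvd']` follows the commented line in the example above and should be treated as an assumption, and the directory path is a placeholder. The arguments are built as a plain dictionary so that the snippet runs without evalscope installed.

```python
# Keyword arguments mirroring the Python usage above; all values are placeholders.
task_kwargs = {
    "model": "YOUR_MODEL",
    "api_url": "OPENAI_API_COMPAT_URL",
    "api_key": "EMPTY_TOKEN",
    "datasets": ["msvd"],
    "dataset_hub": "huggingface",  # switch from the default ModelScope source
    "dataset_args": {
        "msvd": {
            # Assumed nesting; the path is hypothetical.
            "extra_params": {"video_dir": "/path/to/msvd/videos"},
        }
    },
    "limit": 10,  # remove for a formal evaluation
}

# With evalscope installed, these would be passed on as:
#   from evalscope import run_task
#   from evalscope.config import TaskConfig
#   run_task(task_cfg=TaskConfig(**task_kwargs))
```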