EQ-Bench#

Overview#

EQ-Bench is a benchmark for evaluating the emotional intelligence of language models. Given a dialogue, the model predicts a character's likely emotional responses by rating the intensity of several candidate emotions.

Task Description#

  • Task Type: Emotional Intelligence Assessment

  • Input: Dialogue scenario with characters

  • Output: An intensity rating for each listed emotion, in the format the prompt specifies

  • Domains: Emotional understanding, social cognition

Key Features#

  • Tests ability to predict emotional responses in conversations

  • Requires rating intensity of multiple possible emotions

  • Uses official EQ-Bench v2 scoring algorithm

  • Scoring applies sigmoid scaling so that small differences from the reference ratings are penalized less

  • An adjustment constant is chosen so that random answers score 0
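
The scoring features listed above can be sketched roughly as follows. This is a minimal sketch, not evalscope's code: the curve parameters (6.5, 1.2, 4), the cutoff at a difference of 5, and the adjustment constant 0.7477 are recalled from the public EQ-Bench v2 reference implementation and should be verified against it, and the function names are hypothetical. The reference ratings are the reference_answer_fullscale values from the sample on this page.

```python
import math

# Assumed value: chosen so that random answers average a score of 0.
ADJUST_CONST = 0.7477


def scaled_difference(d: float) -> float:
    """Penalty for an absolute rating difference d on a 0-10 scale."""
    if d == 0:
        return 0.0
    if d <= 5:
        # S-shaped curve: small differences cost far less than large ones.
        return 6.5 / (1 + math.exp(-1.2 * (d - 4)))
    return d  # large differences are penalized linearly


def question_score(reference: dict, predicted: dict) -> float:
    """Score one question; 10 means every rating matches the reference."""
    tally = 0.0
    for i in range(1, 5):
        emotion = reference[f"emotion{i}"]
        # Reference scores may be stored as int or str, so coerce to float.
        ref_score = float(reference[f"emotion{i}_score"])
        tally += scaled_difference(abs(float(predicted[emotion]) - ref_score))
    return 10 - ADJUST_CONST * tally


reference = {
    "emotion1": "Remorseful", "emotion2": "Indifferent",
    "emotion3": "Affectionate", "emotion4": "Annoyed",
    "emotion1_score": 0, "emotion2_score": "6",
    "emotion3_score": 0, "emotion4_score": "7",
}
perfect = {"Remorseful": 0, "Indifferent": 6, "Affectionate": 0, "Annoyed": 7}
print(question_score(reference, perfect))  # 10.0
```

Under these assumptions a difference of 1 costs well under one point, while a difference above 5 costs its full value, which is what makes near misses cheap and large misses expensive.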

Evaluation Notes#

  • Default evaluation uses the validation split

  • Primary metric: EQ-Bench Score (eq_bench_score), defined on a 0-100 scale and reported normalized to 0-1

  • Uses zero-shot evaluation (no few-shot examples)

  • Responses must include the emotion ratings in the specific JSON-like format the prompt requests (for example "Remorseful: <score>")

  • Scoring follows the official algorithm described in the EQ-Bench paper and on its homepage
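
To illustrate the format requirement, here is a minimal sketch of pulling "Emotion: score" lines out of a model response. It is a hypothetical parser written for this page, not the one evalscope uses; the name parse_ratings and the regular expression are assumptions, and the actual evaluator may be stricter or more lenient.

```python
import re

# Matches lines such as "Annoyed: 5" or "Remorseful: 2.5".
RATING_LINE = re.compile(
    r"^\s*([A-Za-z][A-Za-z \-]*?)\s*:\s*(\d+(?:\.\d+)?)\s*$",
    re.MULTILINE,
)


def parse_ratings(response: str) -> dict[str, float]:
    """Return {emotion: rating}; a later line overrides an earlier duplicate."""
    return {name: float(score) for name, score in RATING_LINE.findall(response)}


response = (
    "Remorseful: 2\n"
    "Indifferent: 3\n"
    "Affectionate: 0\n"
    "Annoyed: 5\n"
    "\n"
    "[End of answer]"
)
print(parse_ratings(response))
```

Lines without a numeric rating, such as "[End of answer]", are ignored, so a response that contains no parseable ratings yields an empty dictionary that the caller can treat as a format failure.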

Properties#

| Property | Value |
|---|---|
| Benchmark Name | eq_bench |
| Dataset ID | evalscope/EQ-Bench |
| Paper | N/A |
| Tags | InstructionFollowing |
| Metrics | eq_bench_score |
| Default Shots | 0-shot |
| Evaluation Split | validation |

Data Statistics#

| Metric | Value |
|---|---|
| Total Samples | 171 |
| Prompt Length (Mean) | 1550.02 chars |
| Prompt Length (Min/Max) | 922 / 3737 chars |

Sample Example#

Subset: default

{
  "input": [
    {
      "id": "97129bc9",
      "content": "Your task is to predict the likely emotional responses of a character in this dialogue:\n\nRobert: Claudia, you've always been the idealist. But let's be practical for once, shall we?\nClaudia: Practicality, according to you, means bulldozing ev ... [TRUNCATED] ... ary:\n\nRemorseful: <score>\nIndifferent: <score>\nAffectionate: <score>\nAnnoyed: <score>\n\n\n[End of answer]\n\nRemember: zero is a valid score, meaning they are likely not feeling that emotion. You must score at least one emotion > 0.\n\nYour answer:"
    }
  ],
  "target": "{'emotion1': 'Remorseful', 'emotion2': 'Indifferent', 'emotion3': 'Affectionate', 'emotion4': 'Annoyed', 'emotion1_score': 2, 'emotion2_score': 3, 'emotion3_score': 0, 'emotion4_score': 5}",
  "id": 0,
  "group_id": 0,
  "metadata": {
    "reference_answer": {
      "emotion1": "Remorseful",
      "emotion2": "Indifferent",
      "emotion3": "Affectionate",
      "emotion4": "Annoyed",
      "emotion1_score": 2,
      "emotion2_score": 3,
      "emotion3_score": 0,
      "emotion4_score": 5
    },
    "reference_answer_fullscale": {
      "emotion1": "Remorseful",
      "emotion2": "Indifferent",
      "emotion3": "Affectionate",
      "emotion4": "Annoyed",
      "emotion1_score": 0,
      "emotion2_score": "6",
      "emotion3_score": 0,
      "emotion4_score": "7"
    }
  }
}

Note: Some content was truncated for display.
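
The target field in the sample above is a Python-style dictionary literal with single quotes, so it is not valid JSON. A minimal sketch of reading it, assuming only the standard library (the variable names are illustrative):

```python
import ast

target = (
    "{'emotion1': 'Remorseful', 'emotion2': 'Indifferent', "
    "'emotion3': 'Affectionate', 'emotion4': 'Annoyed', "
    "'emotion1_score': 2, 'emotion2_score': 3, "
    "'emotion3_score': 0, 'emotion4_score': 5}"
)

# ast.literal_eval safely evaluates Python literals; json.loads would
# reject this string because of the single quotes.
reference = ast.literal_eval(target)

# Pair each emotion with its score. float() also handles scores stored
# as strings, as some are in reference_answer_fullscale.
ratings = {
    reference[f"emotion{i}"]: float(reference[f"emotion{i}_score"])
    for i in range(1, 5)
}
print(ratings)
```

The same coercion applies to metadata.reference_answer_fullscale, where the sample mixes integer and string scores.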

Prompt Template#

{question}

Usage#

Using CLI#

evalscope eval \
    --model YOUR_MODEL \
    --api-url OPENAI_API_COMPAT_URL \
    --api-key EMPTY_TOKEN \
    --datasets eq_bench \
    --limit 10  # Remove this line for formal evaluation

Using Python#

from evalscope import run_task
from evalscope.config import TaskConfig

task_cfg = TaskConfig(
    model='YOUR_MODEL',
    api_url='OPENAI_API_COMPAT_URL',
    api_key='EMPTY_TOKEN',
    datasets=['eq_bench'],
    limit=10,  # Remove this line for formal evaluation
)

run_task(task_cfg=task_cfg)