CoinFlip#

Overview#

CoinFlip is a symbolic reasoning benchmark that tests LLMs’ ability to track binary state changes through sequences of actions. Each problem involves determining a coin’s final state (heads/tails) after various flipping operations.

Task Description#

Task Type: Symbolic Reasoning / State Tracking
Input: Description of coin flip operations by different people
Output: Final coin state (YES for heads-up, NO for tails-up)
Focus: Binary state tracking and logical inference

Key Features#

Tests state tracking through action sequences
Binary reasoning (flip/no-flip) decisions
Requires careful attention to operator effects
Evaluates systematic logical reasoning
Clear, unambiguous answers

Evaluation Notes#

Default configuration uses 0-shot evaluation
Answers should follow “ANSWER: YES/NO” format
Five metrics: accuracy, precision, recall, F1, yes_ratio
F1 score is the primary aggregation metric
Supports few-shot evaluation with reasoning examples

Properties#

Property	Value
Benchmark Name	`coin_flip`
Dataset ID	extraordinarylab/coin-flip
Paper	N/A
Tags	`Reasoning`, `Yes/No`
Metrics	`accuracy`, `precision`, `recall`, `f1_score`, `yes_ratio`
Default Shots	0-shot
Evaluation Split	`test`
Train Split	`validation`
Aggregation	`f1`

Data Statistics#

Metric	Value
Total Samples	3,333
Prompt Length (Mean)	500.15 chars
Prompt Length (Min/Max)	453 / 551 chars

Sample Example#

Subset: default

{
  "input": [
    {
      "id": "05503706",
      "content": [
        {
          "text": "\nSolve the following coin flip problem step by step. The last line of your response should be of the form \"ANSWER: [ANSWER]\" (without quotes) where [ANSWER] is the answer to the problem.\n\nQ: A coin is heads up. rushawn flips the coin. yerania ... [TRUNCATED] ...  the coin. jostin does not flip the coin.  Is the coin still heads up?\n\nRemember to put your answer on its own line at the end in the form \"ANSWER: [ANSWER]\" (without quotes) where [ANSWER] is the answer YES or NO to the problem.\n\nReasoning:\n"
        }
      ]
    }
  ],
  "target": "NO",
  "id": 0,
  "group_id": 0,
  "metadata": {
    "answer": "NO"
  }
}

Note: Some content was truncated for display.

Prompt Template#

Prompt Template:

Solve the following coin flip problem step by step. The last line of your response should be of the form "ANSWER: [ANSWER]" (without quotes) where [ANSWER] is the answer to the problem.

{question}

Remember to put your answer on its own line at the end in the form "ANSWER: [ANSWER]" (without quotes) where [ANSWER] is the answer YES or NO to the problem.

Reasoning:

Few-shot Template

Here are some examples of how to solve similar problems:

{fewshot}

Solve the following coin flip problem step by step. The last line of your response should be of the form "ANSWER: [ANSWER]" (without quotes) where [ANSWER] is the answer to the problem.

{question}

Remember to put your answer on its own line at the end in the form "ANSWER: [ANSWER]" (without quotes) where [ANSWER] is the answer YES or NO to the problem.

Reasoning:

Usage#

Using CLI#

evalscope eval \
    --model YOUR_MODEL \
    --api-url OPENAI_API_COMPAT_URL \
    --api-key EMPTY_TOKEN \
    --datasets coin_flip \
    --limit 10  # Remove this line for formal evaluation

Using Python#

from evalscope import run_task
from evalscope.config import TaskConfig

task_cfg = TaskConfig(
    model='YOUR_MODEL',
    api_url='OPENAI_API_COMPAT_URL',
    api_key='EMPTY_TOKEN',
    datasets=['coin_flip'],
    limit=10,  # Remove this line for formal evaluation
)

run_task(task_cfg=task_cfg)