CoinFlip#
Overview#
CoinFlip is a symbolic reasoning benchmark that tests LLMs’ ability to track binary state changes through sequences of actions. Each problem involves determining a coin’s final state (heads/tails) after various flipping operations.
Task Description#
Task Type: Symbolic Reasoning / State Tracking
Input: Description of coin flip operations by different people
Output: Final coin state (YES for heads-up, NO for tails-up)
Focus: Binary state tracking and logical inference
Key Features#
Tests state tracking through action sequences
Binary reasoning (flip/no-flip) decisions
Requires careful attention to operator effects
Evaluates systematic logical reasoning
Clear, unambiguous answers
Evaluation Notes#
Default configuration uses 0-shot evaluation
Answers should follow “ANSWER: YES/NO” format
Five metrics: accuracy, precision, recall, F1, yes_ratio
F1 score is the primary aggregation metric
Supports few-shot evaluation with reasoning examples
Properties#
Property |
Value |
|---|---|
Benchmark Name |
|
Dataset ID |
|
Paper |
N/A |
Tags |
|
Metrics |
|
Default Shots |
0-shot |
Evaluation Split |
|
Train Split |
|
Aggregation |
|
Data Statistics#
Metric |
Value |
|---|---|
Total Samples |
3,333 |
Prompt Length (Mean) |
500.15 chars |
Prompt Length (Min/Max) |
453 / 551 chars |
Sample Example#
Subset: default
{
"input": [
{
"id": "05503706",
"content": [
{
"text": "\nSolve the following coin flip problem step by step. The last line of your response should be of the form \"ANSWER: [ANSWER]\" (without quotes) where [ANSWER] is the answer to the problem.\n\nQ: A coin is heads up. rushawn flips the coin. yerania ... [TRUNCATED] ... the coin. jostin does not flip the coin. Is the coin still heads up?\n\nRemember to put your answer on its own line at the end in the form \"ANSWER: [ANSWER]\" (without quotes) where [ANSWER] is the answer YES or NO to the problem.\n\nReasoning:\n"
}
]
}
],
"target": "NO",
"id": 0,
"group_id": 0,
"metadata": {
"answer": "NO"
}
}
Note: Some content was truncated for display.
Prompt Template#
Prompt Template:
Solve the following coin flip problem step by step. The last line of your response should be of the form "ANSWER: [ANSWER]" (without quotes) where [ANSWER] is the answer to the problem.
{question}
Remember to put your answer on its own line at the end in the form "ANSWER: [ANSWER]" (without quotes) where [ANSWER] is the answer YES or NO to the problem.
Reasoning:
Few-shot Template
Here are some examples of how to solve similar problems:
{fewshot}
Solve the following coin flip problem step by step. The last line of your response should be of the form "ANSWER: [ANSWER]" (without quotes) where [ANSWER] is the answer to the problem.
{question}
Remember to put your answer on its own line at the end in the form "ANSWER: [ANSWER]" (without quotes) where [ANSWER] is the answer YES or NO to the problem.
Reasoning:
Usage#
Using CLI#
evalscope eval \
--model YOUR_MODEL \
--api-url OPENAI_API_COMPAT_URL \
--api-key EMPTY_TOKEN \
--datasets coin_flip \
--limit 10 # Remove this line for formal evaluation
Using Python#
from evalscope import run_task
from evalscope.config import TaskConfig
task_cfg = TaskConfig(
model='YOUR_MODEL',
api_url='OPENAI_API_COMPAT_URL',
api_key='EMPTY_TOKEN',
datasets=['coin_flip'],
limit=10, # Remove this line for formal evaluation
)
run_task(task_cfg=task_cfg)