ZebraLogicBench#

Overview#

ZebraLogicBench is a comprehensive evaluation framework for assessing LLM reasoning performance on logic grid puzzles derived from constraint satisfaction problems (CSPs). It tests systematic logical reasoning abilities.

Task Description#

  • Task Type: Logic Grid Puzzle Solving

  • Input: Logic puzzle with houses, attributes, and clues

  • Output: JSON solution with reasoning explanation

  • Domains: Constraint satisfaction, logical deduction

Key Features#

  • Puzzles derived from constraint satisfaction problems

  • Requires systematic step-by-step logical reasoning

  • Varying difficulty levels (Easy/Hard) and sizes (Small/Medium/Large/XL)

  • Tests ability to process multiple interdependent clues

  • Solutions must be valid JSON format

Evaluation Notes#

  • Default evaluation uses the test split with zero-shot

  • Multiple metrics tracked:

    • puzzle_acc: Correctly solved complete puzzles

    • cell_acc: Correctly identified individual cells

    • Difficulty-based: easy_puzzle_acc, hard_puzzle_acc

    • Size-based: small, medium, large, xl_puzzle_acc

    • avg_reason_lens: Average reasoning length

  • Output must include reasoning and solution in JSON format

Properties#

Property

Value

Benchmark Name

zebralogicbench

Dataset ID

allenai/ZebraLogicBench-private

Paper

N/A

Tags

Reasoning

Metrics

puzzle_acc, cell_acc, easy_puzzle_acc, hard_puzzle_acc, small_puzzle_acc, medium_puzzle_acc, large_puzzle_acc, xl_puzzle_acc, avg_reason_lens, no_answer_num

Default Shots

0-shot

Evaluation Split

test

Data Statistics#

Metric

Value

Total Samples

1,000

Prompt Length (Mean)

3262.38 chars

Prompt Length (Min/Max)

2011 / 5658 chars

Sample Example#

Subset: grid_mode

{
  "input": [
    {
      "id": "e6c901c7",
      "content": "# Example Puzzle\n\nThere are 3 houses, numbered 1 to 3 from left to right, as seen from across the street. Each house is occupied by a different person. Each house has a unique attribute for each of the following characteristics:\n - Each perso ... [TRUNCATED] ... Animal\": \"___\"\n        },\n        \"House 5\": {\n            \"Name\": \"___\",\n            \"Nationality\": \"___\",\n            \"BookGenre\": \"___\",\n            \"Food\": \"___\",\n            \"Color\": \"___\",\n            \"Animal\": \"___\"\n        }\n    }\n}\n\n"
    }
  ],
  "target": "{\"header\": [\"House\", \"Name\", \"Nationality\", \"BookGenre\", \"Food\", \"Color\", \"Animal\"], \"rows\": [[\"1\", \"Bob\", \"german\", \"mystery\", \"grilled cheese\", \"yellow\", \"dog\"], [\"2\", \"Eric\", \"norwegian\", \"fantasy\", \"stew\", \"blue\", \"fish\"], [\"3\", \"Peter\", \"dane\", \"science fiction\", \"spaghetti\", \"green\", \"cat\"], [\"4\", \"Arnold\", \"swede\", \"biography\", \"stir fry\", \"red\", \"bird\"], [\"5\", \"Alice\", \"brit\", \"romance\", \"pizza\", \"white\", \"horse\"]]}",
  "id": 0,
  "group_id": 0,
  "metadata": {
    "created_at": "2024-07-03T21:21:29.209499"
  }
}

Note: Some content was truncated for display.

Prompt Template#

Prompt Template:

# Example Puzzle

There are 3 houses, numbered 1 to 3 from left to right, as seen from across the street. Each house is occupied by a different person. Each house has a unique attribute for each of the following characteristics:
 - Each person has a unique name: `Peter`, `Eric`, `Arnold`.
 - Each person has a unique favorite drink: `tea`, `water`, `milk`

## Clues for the Example Puzzle

1. Peter is in the second house.
2. Arnold is directly left of the one who only drinks water.
3. The one who only drinks water is directly left of the person who likes milk.

## Answer to the Example Puzzle

{{
    "reasoning": "Given Clue 1, we know Peter is in House 2. According to Clue 2, Arnold is directly left of the one who only drinks water. The person in House 3 cannot be on the left of anyone, so Arnold must be in House 1. Thus, Peter drinks water, and Eric lives in House 3. Then, according to Clue 3, Eric drinks milk. Therefore, Arnold drinks tea.",
    "solution": {{
        "House 1": {{
            "Name": "Arnold",
            "Drink": "tea"
        }},
        "House 2": {{
            "Name": "Peter",
            "Drink": "water"
        }},
        "House 3": {{
            "Name": "Eric",
            "Drink": "milk"
        }}
    }}
}}

# Puzzle to Solve

{question}


# Instruction

Now please solve the above puzzle. Present your reasoning and solution in the following json format:

{json_template}

Usage#

Using CLI#

evalscope eval \
    --model YOUR_MODEL \
    --api-url OPENAI_API_COMPAT_URL \
    --api-key EMPTY_TOKEN \
    --datasets zebralogicbench \
    --limit 10  # Remove this line for formal evaluation

Using Python#

from evalscope import run_task
from evalscope.config import TaskConfig

task_cfg = TaskConfig(
    model='YOUR_MODEL',
    api_url='OPENAI_API_COMPAT_URL',
    api_key='EMPTY_TOKEN',
    datasets=['zebralogicbench'],
    limit=10,  # Remove this line for formal evaluation
)

run_task(task_cfg=task_cfg)