BrowseComp

Overview

BrowseComp is an OpenAI benchmark for evaluating browsing and search agents. It contains 1,266 hard-to-find, fact-seeking questions with short, verifiable answers. EvalScope loads the mirrored dataset from ModelScope (evalscope/browse_comp).

Task Description

Task Type: Search-agent factual question answering
Input: Challenging natural-language question that generally requires persistent web browsing
Output: Explanation, exact answer, and confidence
Grading: LLM judge compares the final answer against the reference answer

Key Features

Tests persistence, creative search, and multi-hop evidence gathering
Uses short answers to keep grading tractable
Official data is distributed as encrypted CSV rows and decrypted at evaluation time
Classified as an Agent benchmark and compatible with EvalScope agent loop modes
Supports single-turn model evaluation by default and native/external agent execution when TaskConfig.agent_config is provided
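
The "encrypted CSV rows" point can be illustrated with a small sketch. It assumes the scheme that, to our understanding, OpenAI's reference implementation uses: each field is Base64 text whose bytes are XORed with a keystream made by repeating the SHA-256 digest of a per-row password. Whether the ModelScope mirror uses exactly this scheme, and which column holds the password, are assumptions here. The `encrypt` helper and the sample strings are hypothetical and exist only to make the round trip checkable.

```python
import base64
import hashlib


def derive_key(password: str, length: int) -> bytes:
    """Repeat the SHA-256 digest of `password` until it covers `length` bytes."""
    digest = hashlib.sha256(password.encode()).digest()
    return digest * (length // len(digest)) + digest[: length % len(digest)]


def xor_bytes(data: bytes, key: bytes) -> bytes:
    """XOR two equal-length byte strings."""
    return bytes(a ^ b for a, b in zip(data, key))


def encrypt(plaintext: str, password: str) -> str:
    """Hypothetical inverse of `decrypt`, included only for the round-trip demo."""
    raw = plaintext.encode()
    return base64.b64encode(xor_bytes(raw, derive_key(password, len(raw)))).decode()


def decrypt(ciphertext_b64: str, password: str) -> str:
    """Decode Base64, then XOR with the password-derived keystream."""
    raw = base64.b64decode(ciphertext_b64)
    return xor_bytes(raw, derive_key(password, len(raw))).decode()


if __name__ == "__main__":
    row_password = "hypothetical-row-password"  # assumption: stored alongside each row
    ciphertext = encrypt("Which author fits all the clues?", row_password)
    print(decrypt(ciphertext, row_password))
```

XOR obfuscation of this kind is not meant as real security. Its purpose is to keep plaintext questions and answers out of web crawls, which would contaminate training data.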

Evaluation Notes

Default evaluation loads evalscope/browse_comp from ModelScope through the standard EvalScope dataset loader.
Use TaskConfig.agent_config to evaluate BrowseComp with EvalScope agent loop capabilities such as native tool-use or external agent runners.
The primary metric is is_correct; is_incorrect is also reported.
The LLM judge is enabled by default. With JudgeStrategy.RULE, no judge model is called and grading falls back to normalized exact match against the reference answer.
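
As a rough illustration of what a normalized exact-match fallback can look like, the sketch below lowercases both strings, strips ASCII punctuation, and collapses whitespace before comparing them. The normalization rules and the helper names (`normalize`, `rule_score`, `mean_metric`) are assumptions made for this example; EvalScope's own rule-based scorer may normalize differently.

```python
import string
import unicodedata


def normalize(text: str) -> str:
    """Lowercase, drop ASCII punctuation, and collapse runs of whitespace."""
    text = unicodedata.normalize("NFKC", text).lower()
    text = text.translate(str.maketrans("", "", string.punctuation))
    return " ".join(text.split())


def rule_score(prediction: str, reference: str) -> dict:
    """Return the two per-sample metrics for one prediction."""
    correct = normalize(prediction) == normalize(reference)
    return {"is_correct": float(correct), "is_incorrect": float(not correct)}


def mean_metric(scores: list, name: str) -> float:
    """Average one metric over all samples (0.0 for an empty list)."""
    return sum(s[name] for s in scores) / len(scores) if scores else 0.0


if __name__ == "__main__":
    scores = [
        rule_score("The Eiffel Tower.", "the eiffel tower"),
        rule_score("Paris", "London"),
    ]
    print(mean_metric(scores, "is_correct"))
```

Exact match is strict: a correct answer phrased differently from the reference (for example "Eiffel Tower" without the article) is scored as incorrect, which is one reason an LLM judge is the default.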

Properties

Property	Value
Benchmark Name	`browsecomp`
Dataset ID	evalscope/browse_comp
Paper	Paper
Tags	`Agent`, `Knowledge`, `QA`
Metrics	`is_correct`, `is_incorrect`
Default Shots	0-shot
Evaluation Split	`test`

Data Statistics

Metric	Value
Total Samples	1,266
Prompt Length (Mean)	811.02 chars
Prompt Length (Min/Max)	424 / 2,219 chars

Sample Example

Sample example not available.

Prompt Template

Prompt Template:

{question}

Your response should be in the following format:
Explanation: {{your explanation for your final answer}}
Exact Answer: {{your succinct, final answer}}
Confidence: {{your confidence score between 0% and 100% for your answer}}
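
The doubled braces in the template suggest it is filled in with Python's `str.format`, where `{{` and `}}` render as literal braces and only `{question}` is substituted. The sketch below works under that assumption and also shows one way to pull the three fields back out of a model response. `build_prompt`, `parse_response`, and the regular expressions are hypothetical helpers written for this example, not EvalScope functions.

```python
import re

PROMPT_TEMPLATE = """{question}

Your response should be in the following format:
Explanation: {{your explanation for your final answer}}
Exact Answer: {{your succinct, final answer}}
Confidence: {{your confidence score between 0% and 100% for your answer}}"""


def build_prompt(question: str) -> str:
    """Substitute the question; doubled braces become single literal braces."""
    return PROMPT_TEMPLATE.format(question=question)


def parse_response(text: str) -> dict:
    """Extract explanation, exact answer, and confidence; missing fields are None."""
    explanation = re.search(
        r"Explanation:\s*(.*?)(?=\n\s*Exact Answer:|\Z)", text, re.S
    )
    answer = re.search(r"Exact Answer:\s*(.*?)(?=\n\s*Confidence:|\Z)", text, re.S)
    confidence = re.search(r"Confidence:\s*(\d+(?:\.\d+)?)\s*%?", text)
    return {
        "explanation": explanation.group(1).strip() if explanation else None,
        "exact_answer": answer.group(1).strip() if answer else None,
        "confidence": float(confidence.group(1)) if confidence else None,
    }


if __name__ == "__main__":
    print(build_prompt("Which mathematician fits all of these clues?"))
    reply = (
        "Explanation: Two independent sources agree.\n"
        "Exact Answer: Ada Lovelace\n"
        "Confidence: 85%"
    )
    print(parse_response(reply))
```

Returning None for a missing field, rather than raising, lets a grader treat malformed responses as incorrect without aborting the run.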

Usage

Using CLI

evalscope eval \
    --model YOUR_MODEL \
    --api-url OPENAI_API_COMPAT_URL \
    --api-key EMPTY_TOKEN \
    --datasets browsecomp \
    --limit 10  # Remove this line for formal evaluation

Using Python

from evalscope import run_task
from evalscope.config import TaskConfig

task_cfg = TaskConfig(
    model='YOUR_MODEL',
    api_url='OPENAI_API_COMPAT_URL',
    api_key='EMPTY_TOKEN',
    datasets=['browsecomp'],
    limit=10,  # Remove this line for formal evaluation
)

run_task(task_cfg=task_cfg)