BrowseComp#
Overview#
BrowseComp is an OpenAI benchmark for evaluating browsing and search agents. It contains 1,266 hard-to-find, fact-seeking questions with short, verifiable answers. EvalScope loads the mirrored dataset from ModelScope (evalscope/browse_comp).
Task Description#
Task Type: Search-agent factual question answering
Input: Challenging natural-language question that generally requires persistent web browsing
Output: Explanation, exact answer, and confidence
Grading: LLM judge compares the final answer against the reference answer
Key Features#
Tests persistence, creative search, and multi-hop evidence gathering
Uses short answers to keep grading tractable
Official data is distributed as encrypted CSV rows and decrypted at evaluation time
Classified as an Agent benchmark and compatible with EvalScope agent loop modes
Supports single-turn model evaluation by default, and native or external agent execution when TaskConfig.agent_config is provided
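The "encrypted CSV rows" feature listed above can be illustrated with a minimal sketch. It assumes a scheme in which each text field is XOR-ed with a key stretched from the SHA-256 digest of a per-row password and then base64-encoded. That scheme, the function names, and the `problem` and `canary` field names are assumptions made for illustration; the mirrored dataset and the EvalScope loader may differ, so check the loader before relying on any of this.

```python
import base64
import hashlib


def derive_key(password: str, length: int) -> bytes:
    """Stretch SHA-256(password) by repetition to exactly `length` bytes."""
    digest = hashlib.sha256(password.encode("utf-8")).digest()
    return digest * (length // len(digest)) + digest[: length % len(digest)]


def xor_bytes(data: bytes, key: bytes) -> bytes:
    """XOR two equal-length byte strings."""
    return bytes(a ^ b for a, b in zip(data, key))


def encrypt(plaintext: str, password: str) -> str:
    """Hypothetical helper: produce a base64 ciphertext for the demo below."""
    raw = plaintext.encode("utf-8")
    return base64.b64encode(xor_bytes(raw, derive_key(password, len(raw)))).decode("ascii")


def decrypt(ciphertext_b64: str, password: str) -> str:
    """Reverse `encrypt`: base64-decode, then XOR with the same derived key."""
    raw = base64.b64decode(ciphertext_b64)
    return xor_bytes(raw, derive_key(password, len(raw))).decode("utf-8")


if __name__ == "__main__":
    # Assumed row layout: an encrypted question plus the password used for it.
    row = {"problem": encrypt("Which year was the club founded?", "demo-canary"),
           "canary": "demo-canary"}
    print(decrypt(row["problem"], row["canary"]))
```

Because XOR is its own inverse, encryption and decryption share the same key derivation. This kind of scheme hides the questions from casual scraping but is not meant to provide strong security.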
Evaluation Notes#
Default evaluation loads evalscope/browse_comp from ModelScope through the standard EvalScope dataset loader.
Use TaskConfig.agent_config to evaluate BrowseComp with EvalScope agent loop capabilities such as native tool use or external agent runners.
The primary metric is is_correct; is_incorrect is also reported.
The LLM judge is enabled by default. JudgeStrategy.RULE falls back to normalized exact match.
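The rule-based fallback mentioned above can be pictured with a small sketch of normalized exact match. The normalization steps chosen here (Unicode NFKC folding, lowercasing, punctuation removal, whitespace collapsing) are assumptions for illustration; the normalization EvalScope actually applies may differ.

```python
import re
import string
import unicodedata


def normalize(text: str) -> str:
    """Lowercase, drop ASCII punctuation, and collapse whitespace."""
    text = unicodedata.normalize("NFKC", text).lower()
    text = text.translate(str.maketrans("", "", string.punctuation))
    return re.sub(r"\s+", " ", text).strip()


def is_exact_match(prediction: str, reference: str) -> bool:
    """Return True when both strings are identical after normalization."""
    return normalize(prediction) == normalize(reference)


if __name__ == "__main__":
    print(is_exact_match("  The Eiffel Tower. ", "the eiffel tower"))
    print(is_exact_match("Paris", "London"))
```

Exact match of this kind is strict: paraphrased or differently formatted answers (for example "1,266" versus "1266 questions") can be scored as incorrect, which is one reason an LLM judge is the default.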
Properties#
| Property | Value |
|---|---|
| Benchmark Name | browsecomp |
| Dataset ID | evalscope/browse_comp |
| Paper | |
| Tags | |
| Metrics | is_correct, is_incorrect |
| Default Shots | 0-shot |
| Evaluation Split | |
Data Statistics#
| Metric | Value |
|---|---|
| Total Samples | 1,266 |
| Prompt Length (Mean) | 811.02 chars |
| Prompt Length (Min/Max) | 424 / 2219 chars |
Sample Example#
Sample example not available.
Prompt Template#
Prompt Template:
{question}
Your response should be in the following format:
Explanation: {{your explanation for your final answer}}
Exact Answer: {{your succinct, final answer}}
Confidence: {{your confidence score between 0% and 100% for your answer}}
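A response that follows the template above can be split into its three fields before grading. The sketch below is one way to do that with regular expressions; `ParsedResponse` and `parse_response` are hypothetical names introduced for illustration and are not part of EvalScope.

```python
import re
from dataclasses import dataclass
from typing import Optional

_LABELS = r"(?:Explanation|Exact Answer|Confidence)"


@dataclass
class ParsedResponse:
    explanation: Optional[str]
    exact_answer: Optional[str]
    confidence: Optional[float]  # percentage in [0, 100], or None if missing


def _field(label: str, text: str) -> Optional[str]:
    """Capture the text after `label:` up to the next label or end of string."""
    pattern = rf"^\s*{re.escape(label)}:\s*(.*?)\s*(?=^\s*{_LABELS}:|\Z)"
    match = re.search(pattern, text, flags=re.MULTILINE | re.DOTALL | re.IGNORECASE)
    return match.group(1) if match and match.group(1) else None


def parse_response(text: str) -> ParsedResponse:
    """Extract explanation, exact answer, and confidence from a model response."""
    raw_conf = _field("Confidence", text)
    confidence = None
    if raw_conf:
        number = re.search(r"\d+(?:\.\d+)?", raw_conf)
        if number:
            # Clamp to the 0-100 range the template asks for.
            confidence = min(100.0, max(0.0, float(number.group(0))))
    return ParsedResponse(
        explanation=_field("Explanation", text),
        exact_answer=_field("Exact Answer", text),
        confidence=confidence,
    )


if __name__ == "__main__":
    reply = "Explanation: Found on the club site.\nExact Answer: 1987\nConfidence: 85%"
    print(parse_response(reply))
```

Missing fields come back as `None` rather than raising, so a caller can decide whether an unparseable response counts as incorrect.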
Usage#
Using CLI#
evalscope eval \
    --model YOUR_MODEL \
    --api-url OPENAI_API_COMPAT_URL \
    --api-key EMPTY_TOKEN \
    --datasets browsecomp \
    --limit 10  # Remove this line for formal evaluation
Using Python#
from evalscope import run_task
from evalscope.config import TaskConfig

task_cfg = TaskConfig(
    model='YOUR_MODEL',
    api_url='OPENAI_API_COMPAT_URL',
    api_key='EMPTY_TOKEN',
    datasets=['browsecomp'],
    limit=10,  # Remove this line for formal evaluation
)

run_task(task_cfg=task_cfg)