FIRE-Bench: Evaluating Agents on the Rediscovery of Scientific Insights
Abstract
Researchers developed FIRE-Bench, a comprehensive evaluation framework that challenges autonomous agents to rediscover established scientific findings through complete research cycles involving hypothesis generation, experimentation, coding, and evidence-based conclusion drawing.
Autonomous agents powered by large language models (LLMs) promise to accelerate scientific discovery end-to-end, but rigorously evaluating their capacity for verifiable discovery remains a central challenge. Existing benchmarks face a trade-off: they either rely heavily on LLM-as-judge evaluations of automatically generated research outputs or optimize convenient yet isolated performance metrics that provide only coarse proxies for scientific insight. To address this gap, we introduce FIRE-Bench (Full-cycle Insight Rediscovery Evaluation), a benchmark that evaluates agents through the rediscovery of established findings from recent, high-impact machine learning research. Agents are given only a high-level research question extracted from a published, verified study and must autonomously explore ideas, design experiments, implement code, execute their plans, and derive conclusions supported by empirical evidence. We evaluate a range of state-of-the-art agents built on frontier LLM backbones, such as GPT-5, on FIRE-Bench. Our results show that full-cycle scientific research remains challenging for current agent systems: even the strongest agents achieve limited rediscovery success (<50 F1), exhibit high variance across runs, and display recurring failure modes in experimental design, execution, and evidence-based reasoning. FIRE-Bench provides a rigorous and diagnostic framework for measuring progress toward reliable agent-driven scientific discovery.
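To make the task interface described above concrete, the sketch below shows one plausible way a FIRE-Bench instance and an agent submission could be represented. The class and field names (`FireBenchTask`, `AgentSubmission`, etc.) are illustrative assumptions, not the benchmark's released schema.

```python
from dataclasses import dataclass, field

# Hypothetical task record: field names are illustrative assumptions,
# not FIRE-Bench's released data format.
@dataclass
class FireBenchTask:
    task_id: str
    research_question: str          # high-level question extracted from the source paper
    ground_truth_claims: list[str]  # masked, human-verified findings (hidden from the agent)

# Hypothetical agent output: the agent returns conclusions as discrete claims,
# each backed by empirical evidence produced during its own experiments.
@dataclass
class AgentSubmission:
    task_id: str
    claims: list[str]                                        # conclusions stated as claims
    evidence: dict[str, str] = field(default_factory=dict)   # claim -> supporting artifact/summary
```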
Community
FIRE-Bench is a human-grounded benchmark designed to test whether AI can actually do science end-to-end, from ideation and planning through implementation and execution to conclusions. It converts recent, expert-validated scientific insights from top ML conferences into masked discovery challenges, forcing agents to rediscover human-verified insights rather than reproduce methods.
By anchoring open-ended exploration to human-grounded ground truth and evaluating discovery at the claim level, FIRE-Bench reveals a clear “science gap”: today’s best agents achieve only limited rediscovery success (<50 F1), are unreliable across runs, and fail mainly at planning and reasoning rather than at coding. The benchmark offers a scalable, structured way to convert a paper into a constrained discovery problem, measures progress toward reliable, creative, full-cycle scientific discovery, and points toward live, continuously updated evaluation of research-capable AI.
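For intuition about claim-level scoring, here is a minimal sketch of how precision, recall, and F1 could be computed over matched claims. The `claims_match` predicate is a placeholder assumption (the paper's matching mechanism is not specified here), so treat this as an illustration of claim-level F1, not the benchmark's actual scorer.

```python
from typing import Callable

def claim_level_f1(
    predicted: list[str],
    ground_truth: list[str],
    claims_match: Callable[[str, str], bool],
) -> float:
    """Greedy one-to-one matching of predicted claims to ground-truth claims.

    `claims_match` is a placeholder for whatever equivalence test the
    benchmark uses (e.g., expert judgment or semantic matching); it is
    an assumption here, not FIRE-Bench's documented scoring rule.
    """
    if not predicted or not ground_truth:
        return 0.0

    unmatched_truth = list(ground_truth)
    true_positives = 0
    for claim in predicted:
        for gt in unmatched_truth:
            if claims_match(claim, gt):
                true_positives += 1
                unmatched_truth.remove(gt)
                break  # each ground-truth claim can be matched at most once

    precision = true_positives / len(predicted)
    recall = true_positives / len(ground_truth)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Toy usage with a purely illustrative exact-match predicate:
if __name__ == "__main__":
    toy_match = lambda a, b: a.lower() == b.lower()
    score = claim_level_f1(
        ["Scaling data helps", "Dropout hurts"],
        ["scaling data helps"],
        toy_match,
    )
    print(round(score, 2))  # 0.67: one correct claim out of two predicted, one ground truth
```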
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API:
- Hunt Instead of Wait: Evaluating Deep Data Research on Large Language Models (2026)
- HeurekaBench: A Benchmarking Framework for AI Co-scientist (2026)
- NL2Repo-Bench: Towards Long-Horizon Repository Generation Evaluation of Coding Agents (2025)
- DR-Arena: an Automated Evaluation Framework for Deep Research Agents (2026)
- EternalMath: A Living Benchmark of Frontier Mathematics that Evolves with Human Discovery (2026)
- ABC-Bench: Benchmarking Agentic Backend Coding in Real-World Development (2026)
- IDRBench: Interactive Deep Research Benchmark (2026)