DeepResearch Bench: which 'deep research' tool is the best?
The race to Research Supremacy now has a comparison benchmark!
According to this paper, the current winner is Gemini-2.5-Pro (preview).
They also have a leaderboard on Hugging Face where you can compare the individual results from each tool, so I imagine this will get updated over time.
Ok… what is “DeepResearch Bench”? Well… it’s a “benchmark consisting of 100 PhD-level research tasks, each meticulously crafted by domain experts across 22 distinct fields.” So, the folks who helped put this together are much smarter than I am.
Here’s how they built it:

Since that’s a lot to follow… here’s a brief breakdown of their workflow:
- Create tasks: 100 PhD-level prompts.
- Agent execution: dispatch a Deep Research Agent to complete the task.
- Evaluate the output:
  - Compare against an expert reference report for quality.
  - Count and verify the citations.
- Score the results. (There’s a rough sketch of this loop just below.)
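If you want to picture that workflow in code, here’s a minimal sketch. To be loud about it: every name here (the `Task` fields, `agent.research`, the `judge` methods) is my own invention to mirror the steps above, not anything from their actual harness; the real thing lives in their repo.

```python
from dataclasses import dataclass

# Hypothetical stand-in for a benchmark task; the real task format
# is defined in the DeepResearch Bench repo.
@dataclass
class Task:
    id: str
    prompt: str            # the PhD-level research question
    reference_report: str  # the expert-written report it gets compared to

def run_benchmark(tasks, agent, judge):
    """Run each task through an agent, then score quality and citations.

    `agent` is whatever Deep Research tool you're testing; `judge` is an
    assumed LLM-based grader. Both are placeholders.
    """
    results = []
    for task in tasks:
        # Step 2: dispatch the Deep Research Agent on the prompt.
        report = agent.research(task.prompt)

        # Step 3a (the RACE side): grade the report against the
        # expert reference report.
        quality = judge.score_against_reference(report, task.reference_report)

        # Step 3b (the FACT side): pull out cited claims and check
        # how many actually hold up.
        citations = judge.extract_citations(report)
        supported = [c for c in citations
                     if judge.claim_is_supported(c.statement, c.url)]

        # Step 4: record the scores.
        results.append({
            "task": task.id,
            "quality": quality,
            "citations": len(citations),
            "citation_accuracy": len(supported) / max(len(citations), 1),
        })
    return results
```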
And if you’re paying attention, they have not one, but two evaluation frameworks in there: RACE and FACT.
RACE: assesses the overall quality of the report and how well it satisfies the task, in a way that’s aligned with how a human would judge its academic quality.
FACT: measures the effectiveness of the report’s citations. (Because everyone hates citation hallucinations!)
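FACT is the part I’d most want to steal for my own projects, so here’s a toy sketch of what that citation check might look like. Again, this is my guess at the shape of it: the naive fetching, the `judge.supports` call, and all the names below are assumptions on my part, not their code.

```python
import urllib.request

def fetch_page(url: str, timeout: int = 10) -> str:
    """Naively pull down the cited page. A real harness would handle
    PDFs, paywalls, and retries; this does none of that."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return resp.read().decode("utf-8", errors="replace")

def check_citations(pairs, judge):
    """pairs: (statement, url) tuples pulled out of a report.

    Returns two FACT-flavored numbers: how many citations actually
    support their statement, and the fraction that do.
    """
    supported = 0
    for statement, url in pairs:
        try:
            page = fetch_page(url)
        except Exception:
            continue  # a dead link can't support anything
        if judge.supports(statement, page):  # hypothetical LLM-judge call
            supported += 1
    return {
        "supported_citations": supported,
        "citation_accuracy": supported / max(len(pairs), 1),
    }
```

Tracking both numbers seems like the smart move: accuracy alone would presumably reward reports that barely cite anything, while a raw count alone would reward padding.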
Now, if you’re building an Agentic Deep Research tool of any type… I highly suggest taking a look at their prompts! They’re in their GitHub repo and in the appendix of their paper.