Surgical AI benchmark
Evaluating AI models on contemporary surgical cases.
Surg Bench evaluates leading AI models on recent surgical exam cases and shows how they compare on answer quality, refusals, response time, and specialty coverage.
What Surg Bench measures
Overall answer quality, answered-only quality, refusal behavior, response time, and specialty-level consistency.
How to use this page
Start with the benchmark board, check grader agreement, use the specialty views to see where performance is broad or narrow, then open the example case for one question-level walkthrough.
01
Benchmark board
Reading the board
| Model | Score | Rejects | Wins |
|---|---|---|---|
Quality vs coverage
Answered score against refusal rate
Top-left is stronger; left edge is more reliable.
Speed vs quality
Answered score against response time
Top-left is stronger; left edge is faster.
02
Grader agreement
Two independent grader models scored each open-ended response. This section summarizes how closely they aligned across the full benchmark and where the largest aggregate differences appeared.
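As a concrete illustration, here is a minimal Python sketch of the kind of summary this section reports, assuming each grader returns one numeric score per response on a shared scale (the 0-10 scale, field names, and example values are assumptions for illustration, not Surg Bench's published format):

```python
from statistics import mean

def agreement_summary(scores_a: list[float], scores_b: list[float]) -> dict:
    """Summarize how closely two graders' paired scores align."""
    assert len(scores_a) == len(scores_b), "graders must score the same responses"
    gaps = [abs(a - b) for a, b in zip(scores_a, scores_b)]
    return {
        "exact_agreement": mean(1 if g == 0 else 0 for g in gaps),  # identical scores
        "within_one": mean(1 if g <= 1 else 0 for g in gaps),       # off by at most one point
        "mean_abs_gap": mean(gaps),                                  # average disagreement size
    }

# Hypothetical paired scores for five responses on an assumed 0-10 scale
print(agreement_summary([7, 9, 4, 8, 6], [7, 8, 4, 9, 6]))
```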
Agreement summary
How closely the graders matched
Agreement density
Gemini 2.5 Flash against GPT-5 Mini
The diagonal marks exact agreement; denser cells show where paired scores concentrate.
Gap by specialty
Where the graders differed most
Bars show the mean absolute score gap between the two graders by specialty; lower bars indicate closer agreement.
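For the per-specialty view, a minimal sketch under the same assumptions (hypothetical row fields `specialty`, `grader_a`, and `grader_b`) would group the absolute gaps before averaging:

```python
from collections import defaultdict
from statistics import mean

def gap_by_specialty(rows: list[dict]) -> dict[str, float]:
    """Mean absolute score gap between two graders, grouped by specialty."""
    gaps: dict[str, list[float]] = defaultdict(list)
    for row in rows:
        gaps[row["specialty"]].append(abs(row["grader_a"] - row["grader_b"]))
    # Sort widest disagreement first, matching how the bar chart reads.
    return dict(sorted(((s, mean(g)) for s, g in gaps.items()),
                       key=lambda item: item[1], reverse=True))
```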
03
Specialty breakdown
These views show whether a strong overall rank comes from broad consistency across surgery topics or from a smaller set of standout specialties.
Category heatmap
Per-category score grid
Category wins
Who leads the most specialties
Current leaders
Category-by-category winners
Example case
Representative case
Prompt
Reference answer
What the case expected
Scoring lens
What the graders looked for
Model answers
How each model handled this case
04
How Surg Bench works
This section explains what was tested, how answers were scored, and what is intentionally not reproduced on the public page.
Benchmark design
What Surg Bench evaluates
Surg Bench evaluates AI models on recent surgical exam cases drawn from a 2025 textbook.
This public page shows the benchmark structure, scoring views, and model results without reproducing the protected source material.
Because these cases are open-ended and not multiple-choice, answers were graded by two independent model graders rather than by simple answer matching.
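As a rough illustration of that setup, the sketch below shows what a two-grader rubric pass could look like. The prompt wording, the 0-10 scale, and the `query_model` helper are all assumptions made for illustration; they are not Surg Bench's actual grading code:

```python
# Hypothetical rubric prompt; the real grading instructions are not published.
GRADER_PROMPT = """You are grading an open-ended surgical exam answer.

Reference answer:
{reference}

Candidate answer:
{candidate}

Score the candidate from 0 to 10 for clinical accuracy and completeness
relative to the reference. Reply with only the integer score."""

def grade_response(candidate: str, reference: str, query_model) -> dict[str, int]:
    # `query_model(name, prompt) -> str` stands in for whatever LLM API
    # client is in use; it is an assumed helper, not a real library call.
    prompt = GRADER_PROMPT.format(reference=reference, candidate=candidate)
    return {grader: int(query_model(grader, prompt))
            for grader in ("gemini-2.5-flash", "gpt-5-mini")}
```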
Source material policy
What stays out of the public release
- The full set of 290 source cases from the textbook
- The full answer and image corpus needed to reconstruct the book
- Question-by-question grader notes across the entire benchmark
- Only one illustrative case is reproduced above; the rest of the benchmark is summarized at aggregate level
How to read the metrics
What each view is telling you
- All cases treats rejections and empty answers as zero-score responses.
- Answered only measures the quality of non-empty responses.
- Reject rate shows how often a model declined or failed to answer.
- Response time uses the median time taken on answered cases.

A minimal sketch of how these four metrics can be computed follows below.
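The sketch assumes one record per case with hypothetical fields `score`, `answered` (False for refusals and empty answers), and `seconds`; the actual data layout is not published:

```python
from statistics import mean, median

def board_metrics(cases: list[dict]) -> dict:
    """Compute the four board metrics for a single model."""
    answered = [c for c in cases if c["answered"]]
    return {
        # All cases: refusals and empty answers count as zero.
        "all_cases": mean(c["score"] if c["answered"] else 0.0 for c in cases),
        # Answered only: quality over non-empty responses.
        "answered_only": mean(c["score"] for c in answered) if answered else 0.0,
        # Reject rate: share of cases declined or left unanswered.
        "reject_rate": 1 - len(answered) / len(cases),
        # Response time: median seconds across answered cases only.
        "median_seconds": median(c["seconds"] for c in answered) if answered else None,
    }
```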