Surgical AI benchmark

Evaluating AI models on contemporary surgical cases.

Surg Bench evaluates leading AI models on recent surgical exam cases and shows how they compare on answer quality, refusals, response time, and specialty coverage.

What Surg Bench measures

Overall answer quality, answered-only quality, refusal behavior, response time, and specialty-level consistency.

How to use this page

Start with the benchmark board, check grader agreement, use the specialty views to see where performance is broad or narrow, then open the example case for a question-level walkthrough.

01

Benchmark board

Ranking graph: all questions counted.

Reading the board

Columns: Model, Score, Rejects, Wins.

Quality vs coverage

Answered score against refusal rate

Top-left is stronger; the left edge is more reliable.

Speed vs quality

Answered score against response time

Top-left is stronger; left edge is faster.

02

Grader agreement

Two independent grader models scored each open-ended response. This section summarizes how closely they aligned across the full benchmark and where the largest aggregate differences appeared.

Agreement summary

How closely the graders matched


Agreement density

Gemini 2.5 Flash against GPT-5 Mini

The diagonal marks exact agreement; denser cells show where paired scores concentrate.
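The grid is simple to construct from the paired scores. A minimal sketch in Python, assuming integer scores on a 0-10 scale (the benchmark's actual scale and data layout are not stated on this page):

    import numpy as np

    # Hypothetical paired scores, one pair per open-ended response.
    # Names, scale, and values are illustrative, not benchmark data.
    gemini_scores = np.array([8, 7, 9, 6, 8, 5, 7])
    gpt5_mini_scores = np.array([8, 6, 9, 7, 8, 4, 7])

    # Bin the pairs into a 2D grid: diagonal cells are exact agreement,
    # off-diagonal cells show where the two graders diverged.
    edges = np.arange(0, 12)  # integer bins covering scores 0..10
    density, _, _ = np.histogram2d(gemini_scores, gpt5_mini_scores,
                                   bins=[edges, edges])
    print(density.astype(int))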

Gap by specialty

Where the graders differed most

Bars show mean absolute score gap by specialty. Lower is closer agreement.
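The underlying statistic is a one-pass aggregation. A minimal sketch, with hypothetical specialty names and scores:

    from collections import defaultdict

    # Hypothetical per-question records:
    # (specialty, grader A score, grader B score).
    records = [
        ("Vascular", 8, 7),
        ("Vascular", 6, 6),
        ("Trauma", 9, 6),
        ("Trauma", 7, 5),
        ("Colorectal", 8, 8),
    ]

    gaps = defaultdict(list)
    for specialty, a, b in records:
        gaps[specialty].append(abs(a - b))

    # Mean absolute gap per specialty; lower means closer agreement.
    for specialty, values in sorted(gaps.items()):
        print(f"{specialty}: {sum(values) / len(values):.2f}")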

03

Specialty breakdown

These views show whether a strong overall rank comes from broad consistency across surgery topics or from a smaller set of standout specialties.

Category heatmap

Per-category score grid

Color scale runs from lower to higher scores.

Category wins

Who leads the most specialties

Current leaders

Category-by-category winners

Example case

Representative case


Prompt

Reference answer

What the case expected

Scoring lens

What the graders looked for

Model answers

How each model handled this case

04

How Surg Bench works

This section explains what was tested, how answers were scored, and what is intentionally not reproduced on the public page.

Benchmark design

What Surg Bench evaluates

Surg Bench evaluates AI models on recent surgical exam cases drawn from a 2025 textbook.

This public page shows the benchmark structure, scoring views, and model results without reproducing the protected source material.

Because these cases are open-ended rather than multiple-choice, answers were graded by two independent model graders instead of by simple answer matching.
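In outline, that grading step looks like the sketch below. call_grader is a hypothetical hook, and averaging the two scores is an assumption; the real grading prompt, score scale, and combination rule are not specified on this page:

    from statistics import mean

    GRADERS = ["gemini-2.5-flash", "gpt-5-mini"]  # the two graders used above

    def call_grader(grader: str, prompt: str, reference: str, answer: str) -> float:
        # Hypothetical hook: ask one grader model to score an answer
        # against the reference on a fixed rubric; returns a numeric score.
        raise NotImplementedError

    def grade_response(prompt: str, reference: str, answer: str) -> dict:
        # Each grader scores independently; neither sees the other's output.
        scores = {g: call_grader(g, prompt, reference, answer) for g in GRADERS}
        return {
            "per_grader": scores,
            "combined": mean(scores.values()),  # assumed combination rule
            "gap": abs(scores[GRADERS[0]] - scores[GRADERS[1]]),  # feeds section 02
        }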

Source material policy

What stays out of the public release

  • The full set of 290 source cases from the textbook
  • The full answer and image corpus needed to reconstruct the book
  • Question-by-question grader notes across the entire benchmark

Only one illustrative case is reproduced above; the rest of the benchmark is summarized at the aggregate level.

How to read the metrics

What each view is telling you

  • All cases treats rejections and empty answers as zero.
  • Answered only measures the quality of non-empty responses.
  • Reject rate shows how often a model declined or failed to answer.
  • Response time uses the median time taken on answered cases (all four views are sketched in code below).
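Put concretely, here is a minimal sketch of how the four views could be computed from per-case results. The record layout and the use of None to encode a refusal or empty answer are assumptions:

    from statistics import median

    # Hypothetical per-case results for one model.
    results = [
        {"score": 0.9, "seconds": 12.0},
        {"score": None, "seconds": 3.0},  # refusal or empty answer
        {"score": 0.7, "seconds": 20.0},
    ]

    answered = [r for r in results if r["score"] is not None]

    # All cases: rejections and empty answers count as zero.
    all_cases = sum(r["score"] or 0.0 for r in results) / len(results)

    # Answered only: quality over non-empty responses.
    answered_only = sum(r["score"] for r in answered) / len(answered)

    # Reject rate: share of cases declined or failed.
    reject_rate = 1 - len(answered) / len(results)

    # Response time: median seconds over answered cases only.
    response_time = median(r["seconds"] for r in answered)

    print(all_cases, answered_only, reject_rate, response_time)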