Mirror of https://github.com/qodo-ai/pr-agent.git (synced 2025-07-02 11:50:37 +08:00)

Merge pull request #1866 from qodo-ai/tr/new_benchmark23

docs: update PR benchmark to ranking-based methodology with expanded …
## Methodology

Qodo Merge PR Benchmark evaluates and compares the performance of Large Language Models (LLMs) in analyzing pull request code and providing meaningful code suggestions.

Our diverse dataset comprises 400 pull requests from over 100 repositories, spanning various programming languages and frameworks to reflect real-world scenarios.

- For each pull request, we have pre-generated suggestions from [11](https://qodo-merge-docs.qodo.ai/pr_benchmark/#models-used-for-generating-the-benchmark-baseline) different top-performing models using the Qodo Merge `improve` tool. The prompt for response generation can be found [here](https://github.com/qodo-ai/pr-agent/blob/main/pr_agent/settings/code_suggestions/pr_code_suggestions_prompts_not_decoupled.toml).

- To benchmark a model, we generate its suggestions for the same pull requests and ask a high-performing judge model to **rank** the new model's output against the 11 pre-generated baseline suggestions. We utilize OpenAI's `o3` model as the judge, though other models have yielded consistent results. The prompt for this ranking judgment is available [here](https://github.com/Codium-ai/pr-agent-settings/tree/main/benchmark).

- We aggregate ranking outcomes across all pull requests, calculating performance metrics for the evaluated model. We also analyze the qualitative feedback from the judge to identify the model's comparative strengths and weaknesses against the established baselines (a minimal sketch of this aggregation step appears below).

- For each model we build a "Model Card", comparing it against others. To ensure full transparency and enable community scrutiny, we also share the raw code suggestions generated by each model, and the judge's specific feedback. See an example of the full output [here](https://github.com/Codium-ai/pr-agent-settings/blob/main/benchmark/sonnet_37_vs_gemini-2.5-pro-preview-05-06.md).

This approach provides not just a quantitative score but also a detailed analysis of each model's strengths and weaknesses.
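
The documentation links the ranking prompt but does not spell out how the judge's rankings become the final score. The sketch below is purely illustrative and is not the benchmark's actual implementation: it assumes a hypothetical `judge_rank` helper that returns the evaluated model's position (1 = best) among the 12 candidate suggestion sets (the 11 baselines plus the evaluated model), and averages a rank-derived score over all pull requests.

```python
# Illustrative sketch only, not the actual Qodo Merge benchmark code.
# `judge_rank` is a hypothetical helper that asks the judge model (e.g., o3) to rank
# the candidate's suggestions against the 11 pre-generated baseline suggestion sets
# for one PR, returning (rank, qualitative_feedback) with rank 1 = best.

from statistics import mean


def rank_to_score(rank: int, num_candidates: int) -> float:
    """Map a rank (1 = best) onto a 0-100 scale; the real scoring formula may differ."""
    return 100.0 * (num_candidates - rank) / (num_candidates - 1)


def benchmark_model(prs, candidate_model, baseline_suggestions, judge_rank):
    scores, feedback = [], []
    for pr in prs:  # e.g., the 400 benchmark pull requests
        baselines = baseline_suggestions[pr.id]  # 11 pre-generated suggestion sets
        rank, why = judge_rank(pr, candidate_model, baselines)
        scores.append(rank_to_score(rank, num_candidates=len(baselines) + 1))
        feedback.append(why)  # the judge's "why" explanations feed the model card
    return mean(scores), feedback  # aggregate score + qualitative material
```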

Note that this benchmark focuses on quality: the ability of an LLM to process complex pull requests with multiple files and nuanced tasks to produce high-quality code suggestions. Other factors like speed, cost, and availability, while also relevant for model selection, are outside this benchmark's scope. We do specify the thinking budget used by each model, which can be a factor in the model's performance.

## Results

<table>
<thead>
<tr>
<th style="text-align:left;">Model Name</th>
<th style="text-align:left;">Version (Date)</th>
<th style="text-align:left;">Thinking budget tokens</th>
<th style="text-align:center;">Score</th>
</tr>
</thead>
<tbody>
<tr>
<td style="text-align:left;">o3</td>
<td style="text-align:left;">2025-04-16</td>
<td style="text-align:left;">'medium' (<a href="https://ai.google.dev/gemini-api/docs/openai">8000</a>)</td>
<td style="text-align:center;"><b>62.5</b></td>
</tr>
<tr>
<td style="text-align:left;">o4-mini</td>
<td style="text-align:left;">2025-04-16</td>
<td style="text-align:left;">'medium' (<a href="https://ai.google.dev/gemini-api/docs/openai">8000</a>)</td>
<td style="text-align:center;"><b>57.7</b></td>
</tr>
<tr>
<td style="text-align:left;">Gemini-2.5-pro</td>
<td style="text-align:left;">2025-06-05</td>
<td style="text-align:left;">4096</td>
<td style="text-align:center;"><b>56.3</b></td>
</tr>
<tr>
<td style="text-align:left;">Gemini-2.5-pro</td>
<td style="text-align:left;">2025-06-05</td>
<td style="text-align:left;">1024</td>
<td style="text-align:center;"><b>44.3</b></td>
</tr>
<tr>
<td style="text-align:left;">Claude-4-sonnet</td>
<td style="text-align:left;">2025-05-14</td>
<td style="text-align:left;">4096</td>
<td style="text-align:center;"><b>39.7</b></td>
</tr>
<tr>
<td style="text-align:left;">Claude-4-sonnet</td>
<td style="text-align:left;">2025-05-14</td>
<td style="text-align:left;"></td>
<td style="text-align:center;"><b>39.0</b></td>
</tr>
<tr>
<td style="text-align:left;">Gemini-2.5-flash</td>
<td style="text-align:left;">2025-04-17</td>
<td style="text-align:left;"></td>
<td style="text-align:center;"><b>33.5</b></td>
</tr>
<tr>
<td style="text-align:left;">Claude-3.7-sonnet</td>
<td style="text-align:left;">2025-02-19</td>
<td style="text-align:left;"></td>
<td style="text-align:center;"><b>32.4</b></td>
</tr>
<tr>
<td style="text-align:left;">GPT-4.1</td>
<td style="text-align:left;">2025-04-14</td>
<td style="text-align:left;"></td>
<td style="text-align:center;"><b>26.5</b></td>
</tr>
</tbody>
</table>

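
The "Thinking budget tokens" column lists how much internal reasoning each model was allowed during generation. The benchmark harness itself is not shown in this document, so the snippet below is only a hedged sketch of how such budgets are typically passed to each provider's SDK; the model IDs and budget values are taken from the table and appendix, and everything else (clients, prompt variable) is an assumption.

```python
# Hedged illustration of configuring per-model thinking budgets as listed in the table.
# This is not taken from the benchmark harness; adapt model IDs and clients as needed.

from openai import OpenAI
from anthropic import Anthropic
from google import genai
from google.genai import types

prompt = "..."  # the Qodo Merge `improve` prompt rendered for a given PR (placeholder)

# OpenAI o3 / o4-mini: reasoning effort level ('medium' in the table).
openai_resp = OpenAI().chat.completions.create(
    model="o3",
    reasoning_effort="medium",
    messages=[{"role": "user", "content": prompt}],
)

# Claude 4 Sonnet: explicit extended-thinking token budget (4096 in the table).
anthropic_resp = Anthropic().messages.create(
    model="claude-4-sonnet-20250514",  # ID as listed in the appendix; verify the exact provider name
    max_tokens=8192,                   # must exceed the thinking budget
    thinking={"type": "enabled", "budget_tokens": 4096},
    messages=[{"role": "user", "content": prompt}],
)

# Gemini 2.5 Pro: thinking budget in tokens (4096 or 1024 in the table).
gemini_resp = genai.Client().models.generate_content(
    model="gemini-2.5-pro",
    contents=prompt,
    config=types.GenerateContentConfig(
        thinking_config=types.ThinkingConfig(thinking_budget=4096)
    ),
)
```
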
## Results Analysis

### O3

Final score: **62.5**

strengths:

- **High precision & compliance:** Generally respects task rules (limits, “added lines” scope, YAML schema) and avoids false-positive advice, often returning an empty list when appropriate.
- **Clear, actionable output:** Suggestions are concise, well-explained and include correct before/after patches, so reviewers can apply them directly.
- **Good critical-bug detection rate:** Frequently spots compile-breakers or obvious runtime faults (nil / NPE, overflow, race, wrong selector, etc.), putting it at least on par with many peers.
- **Consistent formatting:** Produces syntactically valid YAML with correct labels, making automated consumption easy (an illustrative example of this output format appears after this section).

weaknesses:

- **Narrow coverage:** Tends to stop after 1-2 issues; regularly misses additional critical defects that better answers catch, so it is seldom the top-ranked review.
- **Occasional inaccuracies:** A few replies introduce new bugs, give partial/duplicate fixes, or (rarely) violate rules (e.g., import suggestions), hurting trust.
- **Conservative bias:** Prefers silence over risk; while this keeps precision high, it lowers recall and overall usefulness on larger diffs.
- **Little added insight:** Rarely offers broader context, optimisations or holistic improvements, causing it to rank only mid-tier in many comparisons.
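
Much of the analysis in this and the following sections refers to the structured output the `improve` prompt asks each model to emit: a YAML list of at most three suggestions, each with a language label and an `improved_code` patch. The exact schema is defined in the prompt file linked in the Methodology section; the snippet below is only an illustrative approximation, with field names inferred from the analysis rather than copied from that prompt.

```python
# Illustrative approximation of the kind of YAML output the analysis refers to.
# Field names are inferred from this document (language label, improved_code, etc.),
# not copied from the actual `improve` prompt file.

import yaml  # pip install pyyaml

example_output = """
code_suggestions:
  - relevant_file: src/utils/date.ts
    language: typescript
    suggestion_content: Guard against a null `date` before calling toISOString() to avoid a runtime crash.
    existing_code: |
      const ts = date.toISOString();
    improved_code: |
      const ts = date ? date.toISOString() : null;
    label: possible issue
"""

suggestions = yaml.safe_load(example_output)["code_suggestions"]
assert len(suggestions) <= 3  # the analyses refer to a three-suggestion limit
print(suggestions[0]["improved_code"])
```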

### O4 Mini ('medium' thinking tokens)

Final score: **57.7**

strengths:

- **Good rule adherence:** Most answers respect the “new-lines only”, 3-suggestion, and YAML-schema limits, and frequently choose the safe empty list when the diff truly adds no critical bug.
- **Clear, minimal patches:** When the model does spot a defect it usually supplies terse, valid before/after snippets and short, targeted explanations, making fixes easy to read and apply.
- **Language & domain breadth:** Demonstrates competence across many ecosystems (C/C++, Java, TS/JS, Go, Rust, Python, Bash, Markdown, YAML, SQL, CSS, translation files, etc.) and can detect both compile-time and runtime mistakes.
- **Often competitive:** In a sizeable minority of cases the model ties for best or near-best answer, occasionally being the only response to catch a subtle crash or build blocker.

weaknesses:

- **High miss rate:** A large share of examples show the model returning an empty list or only minor advice while other reviewers catch clear, high-impact bugs—indicative of weak defect-detection recall.
- **False or harmful fixes:** Several answers introduce new compilation errors, propose out-of-scope changes, or violate explicit rules (e.g., adding imports, version bumps, touching untouched lines), reducing trustworthiness.
- **Shallow coverage:** Even when it identifies one real issue it often stops there, missing additional critical problems found by stronger peers; breadth and depth are inconsistent.

### Gemini-2.5 Pro (4096 thinking tokens)

Final score: **56.3**

strengths:

- **High formatting compliance:** The model almost always produces valid YAML, respects the three-suggestion limit, and supplies clear before/after code snippets and short rationales.
- **Good “first-bug” detection:** It frequently notices the single most obvious regression (crash, compile error, nil/NPE risk, wrong path, etc.) and gives a minimal, correct patch—often judged “on-par” with other solid answers.
- **Clear, concise writing:** Explanations are brief yet understandable for reviewers; fixes are scoped to the changed lines and rarely include extraneous context.
- **Low rate of harmful fixes:** Truly dangerous or build-breaking advice is rare; most mistakes are omissions rather than wrong code.

weaknesses:

- **Limited breadth of review:** The model regularly stops after the first or second issue, missing additional critical problems that stronger answers surface, so it is often out-ranked by more comprehensive peers.
- **Occasional guideline violations:** A noticeable minority of answers touch unchanged lines, exceed the 3-item cap, suggest adding imports, or drop the required YAML wrapper, leading to automatic downgrades.
- **False positives / speculative fixes:** In several cases it flags non-issues (style, performance, redundant code) or supplies debatable “improvements”, lowering precision and sometimes breaching the “critical bugs only” rule.
- **Inconsistent error coverage:** For certain domains (build scripts, schema files, test code) it either returns an empty list when real regressions exist or proposes cosmetic edits, indicating gaps in specialised knowledge.

### Claude-4 Sonnet (4096 thinking tokens)

Final score: **39.7**

strengths:

- **High guideline & format compliance:** Almost always returns valid YAML, keeps ≤ 3 suggestions, avoids forbidden import/boiler-plate changes and provides clear before/after snippets.
- **Good pinpoint accuracy on single issues:** Frequently spots at least one real critical bug and proposes a concise, technically correct fix that compiles/runs.
- **Clarity & brevity of patches:** Explanations are short, actionable, and focused on changed lines, making the advice easy for reviewers to apply.

weaknesses:

- **Low coverage / recall:** Regularly surfaces only one minor issue (or none) while missing other, often more severe, problems caught by peer models.
- **High “empty-list” rate:** In many diffs the model returns no suggestions even when clear critical bugs exist, offering zero reviewer value.
- **Occasional incorrect or harmful fixes:** A non-trivial number of suggestions are speculative, contradict code intent, or would break compilation/runtime; the model sometimes duplicates or contradicts itself.
- **Inconsistent severity labelling & duplication:** Repeats the same point in multiple slots, marks cosmetic edits as “critical”, or leaves `improved_code` identical to the original.

### Claude-4 Sonnet

Final score: **39.0**

strengths:

- **Consistently well-formatted & rule-compliant output:** Almost every answer follows the required YAML schema, keeps within the 3-suggestion limit, and returns an empty list when no issues are found, showing good instruction following.
- **Actionable, code-level patches:** When it does spot a defect the model usually supplies clear, minimal diffs or replacement snippets that compile / run, making the fix easy to apply.
- **Decent hit-rate on “obvious” bugs:** The model reliably catches the most blatant syntax errors, null-checks, enum / cast problems, and other first-order issues, so it often ties or slightly beats weaker baseline replies.

weaknesses:

- **Shallow coverage:** It frequently stops after one easy bug and overlooks additional, equally-critical problems that stronger reviewers find, leaving significant risks unaddressed.
- **False positives & harmful fixes:** In a noticeable minority of cases it misdiagnoses code, suggests changes that break compilation or behaviour, or flags non-issues, sometimes making its output worse than doing nothing.
- **Drifts into non-critical or out-of-scope advice:** The model regularly proposes style tweaks, documentation edits, or changes to unchanged lines, violating the “critical new-code only” requirement.

### Gemini-2.5 Flash

Final score: **33.5**

strengths:

- **High precision / low false-positive rate:** The model often stays silent or gives a single, well-justified fix, so when it does speak the suggestion is usually correct and seldom touches unchanged lines, keeping guideline compliance high.
- **Good guideline awareness:** YAML structure is consistently valid; suggestions rarely exceed the 3-item limit and generally restrict themselves to newly-added lines.
- **Clear, concise patches:** When a defect is found, the model produces short rationales and tidy “improved_code” blocks that reviewers can apply directly.
- **Risk-averse behaviour pays off in “no-bug” PRs:** In examples where the diff truly contained no critical issue, the model’s empty output ranked above peers that offered speculative or stylistic advice.

weaknesses:

- **Very low recall / shallow coverage:** In a large majority of cases it gives 0-1 suggestions and misses other evident, critical bugs highlighted by peer models, leading to inferior rankings.
- **Occasional incorrect or harmful fixes:** A noticeable subset of answers propose changes that break functionality or misunderstand the code (e.g. bad constant, wrong header logic, speculative rollbacks).
- **Non-actionable placeholders:** Some “improved_code” sections contain comments or “…” rather than real patches, reducing practical value.

### GPT-4.1

Final score: **26.5**

strengths:

- **Consistent format & guideline obedience:** Output is almost always valid YAML, within the 3-suggestion limit, and rarely touches lines not prefixed with “+”.
- **Low false-positive rate:** When no real defect exists, the model correctly returns an empty list instead of inventing speculative fixes, avoiding the “noise” many baseline answers add.
- **Clear, concise patches when it does act:** In the minority of cases where it detects a bug (e.g., ex-13, 46, 212), the fix is usually correct, minimal, and easy to apply.

weaknesses:

- **Very low recall / coverage:** In a large majority of examples it outputs an empty list or only 1 trivial suggestion while obvious critical issues remain unfixed; it systematically misses circular bugs, null-checks, schema errors, etc.
- **Shallow analysis:** Even when it finds one problem it seldom looks deeper, so more severe or additional bugs in the same diff are left unaddressed.
- **Occasional technical inaccuracies:** A noticeable subset of suggestions are wrong (mis-ordered assertions, harmful Bash `set` change, false dangling-reference claims) or carry metadata errors (mis-labeling files as “python”).
- **Repetitive / derivative fixes:** Many outputs duplicate earlier simplistic ideas (e.g., single null-check) without new insight, showing limited reasoning breadth.

## Appendix - models used for generating the benchmark baseline

- anthropic_sonnet_3.7_v1:0
- claude-4-opus-20250514
- claude-4-sonnet-20250514
- claude-4-sonnet-20250514_thinking_2048
- gemini-2.5-flash-preview-04-17
- gemini-2.5-pro-preview-05-06
- gemini-2.5-pro-preview-06-05_1024
- gemini-2.5-pro-preview-06-05_4096
- gpt-4.1
- o3
- o4-mini_medium