docs: add benchmark methodology and improve model comparison formatting
@@ -2,9 +2,23 @@
## Methodology
...
Qodo Merge PR Benchmark evaluates and compares the performance of two Large Language Models (LLMs) in analyzing pull request code and providing meaningful code suggestions.
Our diverse dataset comprises 400 pull requests from over 100 repositories, spanning various programming languages and frameworks to reflect real-world scenarios.
- For each pull request, two distinct LLMs process the same prompt using the Qodo Merge `improve` tool, each generating two sets of responses. The prompt for response generation can be found [here](https://github.com/qodo-ai/pr-agent/blob/main/pr_agent/settings/code_suggestions/pr_code_suggestions_prompts_not_decoupled.toml).
- Subsequently, a high-performing third model (an AI judge) evaluates the responses from the initial two models to determine the superior one. We utilize OpenAI's `o3` model as the judge, though other models have yielded consistent results. The prompt for this comparative judgment is available [here](https://github.com/Codium-ai/pr-agent-settings/tree/main/benchmark).
- We aggregate comparison outcomes across all the pull requests, calculating the win rate for each model. We also analyze the qualitative feedback (the "why" explanations from the judge) to identify each model's comparative strengths and weaknesses.
This approach provides not just a quantitative score but also a detailed analysis of each model's strengths and weaknesses.
- The final output is a "Model Card", comparing the evaluated model against others. To ensure full transparency and enable community scrutiny, we also share the raw code suggestions generated by each model, and the judge's specific feedback.
Note that this benchmark focuses on quality: the ability of an LLM to process complex pull requests with multiple files and nuanced tasks, and to produce high-quality code suggestions.
Other factors like speed, cost, and availability, while also relevant for model selection, are outside this benchmark's scope.
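
To make the methodology concrete, the sketch below shows one way the generation, judging, and aggregation steps could fit together. It is an illustrative sketch only: `generate_suggestions`, `judge_pair`, and the all-pairs judging scheme are assumptions made for illustration, not the actual benchmark harness.

```python
import random
from collections import Counter

def generate_suggestions(model: str, pr_diff: str) -> str:
    """Placeholder for running the Qodo Merge `improve` prompt against one model."""
    return f"suggestions from {model} for {pr_diff}"

def judge_pair(response_a: str, response_b: str) -> str:
    """Placeholder for the judge model (o3 in the real benchmark); returns 'A' or 'B'."""
    return random.choice(["A", "B"])  # stands in for a real LLM judgment

def run_benchmark(model_a: str, model_b: str, pr_diffs: list[str]) -> dict[str, float]:
    """Tally pairwise verdicts across all PRs and compute each model's win rate."""
    wins = Counter()
    for diff in pr_diffs:
        # Each model generates two sets of responses for the same prompt.
        responses_a = [generate_suggestions(model_a, diff) for _ in range(2)]
        responses_b = [generate_suggestions(model_b, diff) for _ in range(2)]
        # Judge every cross-model pairing (one possible pairing scheme).
        for resp_a in responses_a:
            for resp_b in responses_b:
                wins[judge_pair(resp_a, resp_b)] += 1
    total = sum(wins.values())
    return {model_a: wins["A"] / total, model_b: wins["B"] / total}

print(run_benchmark("gemini-2.5-pro-preview-05-06", "gpt-4.1", ["diff-1", "diff-2"]))
```

In the actual benchmark, the judge's verdicts also include textual explanations, which feed the qualitative strengths-and-weaknesses analysis in the model cards below.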
## Gemini-2.5-pro-preview-05-06 - Model Card
### Gemini-2.5-pro-preview-05-06 vs GPT-4.1
@@ -14,15 +28,15 @@
Model 'Gemini-2.5-pro-preview-05-06' is generally more useful thanks to wider and more accurate bug detection and concrete patches, but it sacrifices compliance discipline and sometimes oversteps the task rules. Model 'GPT-4.1' is safer and highly rule-abiding, yet often too timid, missing many genuine issues and providing limited insight. An ideal reviewer would combine the restraint of 'GPT-4.1' with the thoroughness of 'Gemini-2.5-pro-preview-05-06'.
#### Detailed Analysis
Gemini-2.5-pro-preview-05-06 vs GPT-4.1 strengths:
- better_bug_coverage: Detects and explains more critical issues, winning in ~70% of comparisons and achieving a higher average score.
- actionable_fixes: Supplies clear code snippets, correct language labels, and often multiple coherent suggestions per diff.
- deeper_reasoning: Shows stronger grasp of logic, edge cases, and cross-file implications, leading to broader, high-impact reviews.
Gemini-2.5-pro-preview-05-06 vs GPT-4.1 weaknesses:
- guideline_violations: More prone to over-eager advice—non-critical tweaks, touching unchanged code, suggesting new imports, or minor format errors.
- occasional_overreach: Some fixes are speculative or risky, potentially introducing new bugs.
@@ -40,15 +54,15 @@ Model 'Gemini-2.5-pro-preview-05-06' is the stronger reviewer—more frequently
See raw results [here](https://github.com/Codium-ai/pr-agent-settings/blob/main/benchmark/sonnet_37_vs_gemini-2.5-pro-preview-05-06.md)
#### Detailed Analysis
Gemini-2.5-pro-preview-05-06 vs Sonnet 3.7 strengths:
- higher_accuracy_and_coverage: finds real critical bugs and supplies actionable patches in most examples (better in 78% of cases).
- guideline_awareness: usually respects new-lines-only scope, ≤3 suggestions, proper YAML, and stays silent when no issues exist.
- detailed_reasoning_and_patches: explanations tie directly to the diff and fixes are concrete, often catching multiple related defects that 'Sonnet 3.7' overlooks.
Gemini-2.5-pro-preview-05-06 vs Sonnet 3.7 weaknesses:
- occasional_rule_violations: sometimes proposes new imports, package-version changes, or edits outside the added lines.
- overzealous_suggestions: may add speculative or stylistic fixes that exceed the “critical” scope, or mis-label severity.