diff --git a/docs/docs/pr_benchmark/index.md b/docs/docs/pr_benchmark/index.md
index 2624e1b2..37d2d022 100644
--- a/docs/docs/pr_benchmark/index.md
+++ b/docs/docs/pr_benchmark/index.md
@@ -21,15 +21,42 @@ Other factors like speed, cost, and availability, while also relevant for model

 Here's a summary of the win rates based on the benchmark:

-| Model A                       | Model B                       | Model A Win Rate | Model B Win Rate |
-|-------------------------------|-------------------------------|------------------|------------------|
-| Gemini-2.5-pro-preview-05-06  | GPT-4.1                       | 70.4%            | 29.6%            |
-| Gemini-2.5-pro-preview-05-06  | Sonnet 3.7                    | 78.1%            | 21.9%            |
-| GPT-4.1                       | Sonnet 3.7                    | 61.0%            | 39.0%            |
+[//]: # (| Model A                       | Model B                       | Model A Win Rate | Model B Win Rate |)
+
+[//]: # (|:------------------------------|:------------------------------|:----------------:|:----------------:|)
+
+[//]: # (| Gemini-2.5-pro-preview-05-06  | GPT-4.1                       | 70.4%            | 29.6%            |)
+
+[//]: # (| Gemini-2.5-pro-preview-05-06  | Sonnet 3.7                    | 78.1%            | 21.9%            |)
+
+[//]: # (| GPT-4.1                       | Sonnet 3.7                    | 61.0%            | 39.0%            |)
+
+<table>
+  <thead>
+    <tr>
+      <th>Model A</th>
+      <th>Model B</th>
+      <th>Model A Win Rate</th>
+      <th>Model B Win Rate</th>
+    </tr>
+  </thead>
+  <tbody>
+    <tr>
+      <td>Gemini-2.5-pro-preview-05-06</td>
+      <td>GPT-4.1</td>
+      <td>70.4%</td>
+      <td>29.6%</td>
+    </tr>
+    <tr>
+      <td>Gemini-2.5-pro-preview-05-06</td>
+      <td>Sonnet 3.7</td>
+      <td>78.1%</td>
+      <td>21.9%</td>
+    </tr>
+    <tr>
+      <td>GPT-4.1</td>
+      <td>Sonnet 3.7</td>
+      <td>61.0%</td>
+      <td>39.0%</td>
+    </tr>
+  </tbody>
+</table>

 ## Gemini-2.5-pro-preview-05-06 - Model Card

-### Gemini-2.5-pro-preview-05-06 vs GPT-4.1
+### Comparison against GPT-4.1

 ![Comparison](https://codium.ai/images/qodo_merge_benchmark/gpt-4.1_vs_gemini-2.5-pro-preview-05-06_judge_o3.png){width=768}

@@ -52,7 +79,7 @@ Gemini-2.5-pro-preview-05-06 vs GPT-4.1 weaknesses:

 - redundant_or_duplicate: At times repeats the same point or exceeds the required brevity.

-### Gemini-2.5-pro-preview-05-06 vs Sonnet 3.7
+### Comparison against Sonnet 3.7

 ![Comparison](https://codium.ai/images/qodo_merge_benchmark/sonnet_37_vs_gemini-2.5-pro-preview-05-06_judge_o3.png){width=768}

@@ -79,7 +106,7 @@ Gemini-2.5-pro-preview-05-06 vs Sonnet 3.7 weaknesses:

 ## GPT-4.1 - Model Card

-### GPT-4.1 vs Sonnet 3.7
+### Comparison against Sonnet 3.7

 ![Comparison](https://codium.ai/images/qodo_merge_benchmark/gpt-4.1_vs_sonnet_3.7_judge_o3.png){width=768}

@@ -104,7 +131,7 @@ weaknesses:

 - Sparse feedback: when it does comment, it tends to give fewer suggestions and sometimes lacks depth or completeness.
 - Occasional metadata/slip-ups (wrong language tags, overly broad code spans), though less harmful than Sonnet 3.7 errors.

-### GPT-4.1 vs Gemini-2.5-pro-preview-05-06
+### Comparison against Gemini-2.5-pro-preview-05-06

 ![Comparison](https://codium.ai/images/qodo_merge_benchmark/gpt-4.1_vs_gemini-2.5-pro-preview-05-06_judge_o3.png){width=768}

@@ -127,7 +154,7 @@ GPT-4.1 weaknesses:

 ## Sonnet 3.7 - Model Card

-### Sonnet 3.7 vs GPT-4.1
+### Comparison against GPT-4.1

 ![Comparison](https://codium.ai/images/qodo_merge_benchmark/gpt-4.1_vs_sonnet_3.7_judge_o3.png){width=768}

@@ -152,7 +179,7 @@ Model 'Sonnet 3.7' vs 'GPT-4.1'

 - Occasional schema or formatting mistakes (missing list value, duplicated suggestions), reducing reliability.

-### Sonnet 3.7 vs Gemini-2.5-pro-preview-05-06
+### Comparison against Gemini-2.5-pro-preview-05-06

 ![Comparison](https://codium.ai/images/qodo_merge_benchmark/sonnet_37_vs_gemini-2.5-pro-preview-05-06_judge_o3.png){width=768}