diff --git a/docs/docs/pr_benchmark/index.md b/docs/docs/pr_benchmark/index.md
index 2624e1b2..37d2d022 100644
--- a/docs/docs/pr_benchmark/index.md
+++ b/docs/docs/pr_benchmark/index.md
@@ -21,15 +21,42 @@ Other factors like speed, cost, and availability, while also relevant for model
Here's a summary of the win rates based on the benchmark:
-| Model A | Model B | Model A Win Rate | Model B Win Rate |
-|-------------------------------|-------------------------------|------------------|------------------|
-| Gemini-2.5-pro-preview-05-06 | GPT-4.1 | 70.4% | 29.6% |
-| Gemini-2.5-pro-preview-05-06 | Sonnet 3.7 | 78.1% | 21.9% |
-| GPT-4.1 | Sonnet 3.7 | 61.0% | 39.0% |
+[//]: # (| Model A | Model B | Model A Win Rate | Model B Win Rate |)
+
+[//]: # (|:-------------------------------|:-------------------------------|:----------------:|:----------------:|)
+
+[//]: # (| Gemini-2.5-pro-preview-05-06 | GPT-4.1 | 70.4% | 29.6% |)
+
+[//]: # (| Gemini-2.5-pro-preview-05-06 | Sonnet 3.7 | 78.1% | 21.9% |)
+
+[//]: # (| GPT-4.1 | Sonnet 3.7 | 61.0% | 39.0% |)
+
+<table>
+  <thead>
+    <tr>
+      <th>Model A</th>
+      <th>Model B</th>
+      <th>Model A Win Rate</th>
+      <th>Model B Win Rate</th>
+    </tr>
+  </thead>
+  <tbody>
+    <tr>
+      <td>Gemini-2.5-pro-preview-05-06</td>
+      <td>GPT-4.1</td>
+      <td>70.4%</td>
+      <td>29.6%</td>
+    </tr>
+    <tr>
+      <td>Gemini-2.5-pro-preview-05-06</td>
+      <td>Sonnet 3.7</td>
+      <td>78.1%</td>
+      <td>21.9%</td>
+    </tr>
+    <tr>
+      <td>GPT-4.1</td>
+      <td>Sonnet 3.7</td>
+      <td>61.0%</td>
+      <td>39.0%</td>
+    </tr>
+  </tbody>
+</table>
+
## Gemini-2.5-pro-preview-05-06 - Model Card
-### Gemini-2.5-pro-preview-05-06 vs GPT-4.1
+### Comparison against GPT-4.1
{width=768}
@@ -52,7 +79,7 @@ Gemini-2.5-pro-preview-05-06 vs GPT-4.1 weaknesses:
- redundant_or_duplicate: At times repeats the same point or is less concise than required.
-### Gemini-2.5-pro-preview-05-06 vs Sonnet 3.7
+### Comparison against Sonnet 3.7
{width=768}
@@ -79,7 +106,7 @@ Gemini-2.5-pro-preview-05-06 vs Sonnet 3.7 weaknesses:
## GPT-4.1 - Model Card
-### GPT-4.1 vs Sonnet 3.7
+### Comparison against Sonnet 3.7
{width=768}
@@ -104,7 +131,7 @@ weaknesses:
- Sparse feedback: when it does comment, it tends to give fewer suggestions and sometimes lacks depth or completeness.
- Occasional metadata/slip-ups (wrong language tags, overly broad code spans), though less harmful than Sonnet 3.7 errors.
-### GPT-4.1 vs Gemini-2.5-pro-preview-05-06
+### Comparison against Gemini-2.5-pro-preview-05-06
{width=768}
@@ -127,7 +154,7 @@ GPT-4.1 weaknesses:
## Sonnet 3.7 - Model Card
-### Sonnet 3.7 vs GPT-4.1
+### Comparison against GPT-4.1
{width=768}
@@ -152,7 +179,7 @@ Model 'Sonnet 3.7' vs 'GPT-4.1'
- Occasional schema or formatting mistakes (missing list value, duplicated suggestions), reducing reliability.
-### Sonnet 3.7 vs Gemini-2.5-pro-preview-05-06
+### Comparison against Gemini-2.5-pro-preview-05-06
{width=768}