docs: add Claude-4 Opus evaluation section with strengths and weaknesses

2025-07-21 04:50:39 +08:00 · 2025-07-06 21:47:22 +03:00
parent ef2e69dbf3
commit 17a90c536f
1 changed file with 22 additions and 0 deletions
--- a/docs/docs/pr_benchmark/index.md
+++ b/docs/docs/pr_benchmark/index.md
@@ -82,6 +82,12 @@ A list of the models used for generating the baseline suggestions, and example r
      <td style="text-align:left;"></td>
      <td style="text-align:center;"><b>33.5</b></td>
    </tr>
+    <tr>
+      <td style="text-align:left;">Claude-4-opus-20250514</td>
+      <td style="text-align:left;">2025-05-14</td>
+      <td style="text-align:left;"></td>
+      <td style="text-align:center;"><b>32.8</b></td>
+    </tr>
    <tr>
      <td style="text-align:left;">Claude-3.7-sonnet</td>
      <td style="text-align:left;">2025-02-19</td>
@@ -240,6 +246,22 @@ weaknesses:
 - **Introduces new problems:** Several suggestions add unsupported APIs, undeclared variables, wrong types, or break compilation, hurting trust in the recommendations.
 - **Rule violations:** It often edits lines outside the diff, exceeds the 3-suggestion cap, or labels cosmetic tweaks as “critical”, showing inconsistent guideline compliance.

+### Claude-4 Opus
+
+final score: **32.8**
+
+strengths:
+
+- **Format & rule adherence:** Almost always returns valid YAML, stays within the ≤3-suggestion limit, and usually restricts edits to newly added lines, so its output is easy to apply automatically.
+- **Concise, focused patches:** When it does find a real bug, it gives short, well-scoped explanations plus minimal diff snippets, often outperforming verbose baselines in clarity.
+- **Subtle edge-case detection:** In several examples it detected overflow, race-condition, or enum-mismatch issues that many other models missed, showing solid code-analysis capability.
+
+weaknesses:
+
+- **Low recall / narrow coverage:** In a large share of the 399 examples, the model produced an empty list or only one minor tip even though more serious defects were present, so it was rated inferior to most baselines.
+- **Frequent incorrect or no-op fixes:** It sometimes supplies identical “before/after” code, flags non-issues, or suggests changes that would break compilation or logic, reducing reviewer trust.
+- **Shaky guideline consistency:** Although generally compliant, it still occasionally violates rules (touches unchanged lines, offers stylistic advice, adds imports) and duplicates suggestions, indicating unstable internal checks.
+
 ## Appendix - Example Results

 Some examples of benchmarked PRs and their results: