mirror of
https://github.com/qodo-ai/pr-agent.git
synced 2025-07-07 22:30:38 +08:00
docs: add Claude-4 Opus evaluation section with strengths and weaknesses
This commit is contained in:
@ -82,6 +82,12 @@ A list of the models used for generating the baseline suggestions, and example r
|
||||
<td style="text-align:left;"></td>
|
||||
<td style="text-align:center;"><b>33.5</b></td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td style="text-align:left;">Claude-4-opus-20250514</td>
|
||||
<td style="text-align:left;">2025-05-14</td>
|
||||
<td style="text-align:left;"></td>
|
||||
<td style="text-align:center;"><b>32.8</b></td>
|
||||
</tr>
|
||||
<tr>
|
||||
<td style="text-align:left;">Claude-3.7-sonnet</td>
|
||||
<td style="text-align:left;">2025-02-19</td>
|
||||
@ -240,6 +246,22 @@ weaknesses:
|
||||
- **Introduces new problems:** Several suggestions add unsupported APIs, undeclared variables, wrong types, or break compilation, hurting trust in the recommendations.
|
||||
- **Rule violations:** It often edits lines outside the diff, exceeds the 3-suggestion cap, or labels cosmetic tweaks as “critical”, showing inconsistent guideline compliance.
|
||||
|
||||
### Claude-4 Opus
|
||||
|
||||
final score: **32.8**
|
||||
|
||||
strengths:
|
||||
|
||||
- **Format & rule adherence:** Almost always returns valid YAML, stays within the ≤3-suggestion limit, and usually restricts edits to newly-added lines, so its output is easy to apply automatically.
|
||||
- **Concise, focused patches:** When it does find a real bug it gives short, well-scoped explanations plus minimal diff snippets, often outperforming verbose baselines in clarity.
|
||||
- **Able to catch subtle edge-cases:** In several examples it detected overflow, race-condition or enum-mismatch issues that many other models missed, showing solid code‐analysis capability.
|
||||
|
||||
weaknesses:
|
||||
|
||||
- **Low recall / narrow coverage:** In a large share of the 399 examples the model produced an empty list or only one minor tip while more serious defects were present, causing it to be rated inferior to most baselines.
|
||||
- **Frequent incorrect or no-op fixes:** It sometimes supplies identical “before/after” code, flags non-issues, or suggests changes that would break compilation or logic, reducing reviewer trust.
|
||||
- **Shaky guideline consistency:** Although generally compliant, it still occasionally violates rules (touches unchanged lines, offers stylistic advice, adds imports) and duplicates suggestions, indicating unstable internal checks.
|
||||
|
||||
## Appendix - Example Results
|
||||
|
||||
Some examples of benchmarked PRs and their results:
|
||||
|
Reference in New Issue
Block a user