diff --git a/docs/docs/pr_benchmark/index.md b/docs/docs/pr_benchmark/index.md
index 7c2c7096..53851e36 100644
--- a/docs/docs/pr_benchmark/index.md
+++ b/docs/docs/pr_benchmark/index.md
@@ -12,7 +12,7 @@ Our diverse dataset comprises of 400 pull requests from over 100 repositories, s
 
 - We aggregate comparison outcomes across all the pull requests, calculating the win rate for each model. We also analyze the qualitative feedback (the "why" explanations from the judge) to identify each model's comparative strengths and weaknesses. This approach provides not just a quantitative score but also a detailed analysis of each model's strengths and weaknesses.
 
-- The final output is a "Model Card", comparing the evaluated model against others. To ensure full transparency and enable community scrutiny, we also share the raw code suggestions generated by each model, and the judge's specific feedback.
+- For each model, we build a "Model Card", comparing it against others. To ensure full transparency and enable community scrutiny, we also share the raw code suggestions generated by each model, and the judge's specific feedback. See an example of the full output [here](https://github.com/Codium-ai/pr-agent-settings/blob/main/benchmark/sonnet_37_vs_gemini-2.5-pro-preview-05-06.md).
 
 Note that this benchmark focuses on quality: the ability of an LLM to process complex pull request with multiple files and nuanced task to produce high-quality code suggestions. Other factors like speed, cost, and availability, while also relevant for model selection, are outside this benchmark's scope.
 
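
The win-rate aggregation mentioned in the hunk above can be illustrated with a minimal sketch. It assumes each judge verdict is recorded with `model_a`, `model_b`, and `winner` fields and that ties are excluded from the rate; these field names, the tie handling, and the data layout are illustrative assumptions, not the benchmark's actual schema or implementation.

```python
from collections import defaultdict

def win_rates(judgments):
    """Compute each model's win rate from pairwise judge verdicts.

    A minimal sketch: `judgments` is a list of dicts with hypothetical
    keys "model_a", "model_b", and "winner" ("model_a", "model_b", or "tie").
    """
    wins = defaultdict(int)         # decisive wins per model
    comparisons = defaultdict(int)  # decisive comparisons per model

    for j in judgments:
        a, b, winner = j["model_a"], j["model_b"], j["winner"]
        if winner == "tie":
            continue  # ties are left out of the win rate in this sketch
        comparisons[a] += 1
        comparisons[b] += 1
        wins[a if winner == "model_a" else b] += 1

    return {m: wins[m] / comparisons[m] for m in comparisons}

# Example with two judgments over three models (names are illustrative)
judgments = [
    {"model_a": "sonnet_37", "model_b": "gemini-2.5-pro", "winner": "model_b"},
    {"model_a": "sonnet_37", "model_b": "gpt-4o", "winner": "model_a"},
]
print(win_rates(judgments))
# {'sonnet_37': 0.5, 'gemini-2.5-pro': 1.0, 'gpt-4o': 0.0}
```

Qualitative analysis of the judge's "why" explanations, as described in the diff, would then be layered on top of a table like this to produce each model card.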