From 3ddd53d4fee33cf1f506593d3b1ba775cb4db4d3 Mon Sep 17 00:00:00 2001
From: mrT23
Date: Tue, 13 May 2025 11:48:01 +0300
Subject: [PATCH] docs: add link to example model card in benchmark documentation

---
 docs/docs/pr_benchmark/index.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/docs/docs/pr_benchmark/index.md b/docs/docs/pr_benchmark/index.md
index 7c2c7096..53851e36 100644
--- a/docs/docs/pr_benchmark/index.md
+++ b/docs/docs/pr_benchmark/index.md
@@ -12,7 +12,7 @@ Our diverse dataset comprises of 400 pull requests from over 100 repositories, s
 
 - We aggregate comparison outcomes across all the pull requests, calculating the win rate for each model. We also analyze the qualitative feedback (the "why" explanations from the judge) to identify each model's comparative strengths and weaknesses. This approach provides not just a quantitative score but also a detailed analysis of each model's strengths and weaknesses.
 
-- The final output is a "Model Card", comparing the evaluated model against others. To ensure full transparency and enable community scrutiny, we also share the raw code suggestions generated by each model, and the judge's specific feedback.
+- For each model we build a "Model Card", comparing it against others. To ensure full transparency and enable community scrutiny, we also share the raw code suggestions generated by each model, and the judge's specific feedback. See an example of the full output [here](https://github.com/Codium-ai/pr-agent-settings/blob/main/benchmark/sonnet_37_vs_gemini-2.5-pro-preview-05-06.md).
 
 Note that this benchmark focuses on quality: the ability of an LLM to process complex pull request with multiple files and nuanced task to produce high-quality code suggestions. Other factors like speed, cost, and availability, while also relevant for model selection, are outside this benchmark's scope.