docs: add link to example model card in benchmark documentation

commit 3ddd53d4fe
parent eb449202c9
Author: mrT23
Date:   2025-05-13 11:48:01 +03:00

@@ -12,7 +12,7 @@ Our diverse dataset comprises of 400 pull requests from over 100 repositories, s
 - We aggregate comparison outcomes across all the pull requests, calculating the win rate for each model. We also analyze the qualitative feedback (the "why" explanations from the judge) to identify each model's comparative strengths and weaknesses.
 This approach provides not just a quantitative score but also a detailed analysis of each model's strengths and weaknesses.
-- The final output is a "Model Card", comparing the evaluated model against others. To ensure full transparency and enable community scrutiny, we also share the raw code suggestions generated by each model, and the judge's specific feedback.
+- For each model we build a "Model Card", comparing it against others. To ensure full transparency and enable community scrutiny, we also share the raw code suggestions generated by each model, and the judge's specific feedback. See an example of the full output [here](https://github.com/Codium-ai/pr-agent-settings/blob/main/benchmark/sonnet_37_vs_gemini-2.5-pro-preview-05-06.md).
 Note that this benchmark focuses on quality: the ability of an LLM to process complex pull request with multiple files and nuanced task to produce high-quality code suggestions.
 Other factors like speed, cost, and availability, while also relevant for model selection, are outside this benchmark's scope.
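
For illustration only (not part of this commit), the aggregation step described in the diff — turning pairwise judge verdicts into a per-model win rate — could be sketched roughly as below. The `Verdict` record and `win_rates` helper are hypothetical names, not the actual benchmark implementation.

```python
# Minimal sketch: aggregate pairwise judge verdicts into per-model win rates.
# The data model here is an assumption for illustration, not PR-Agent's code.
from collections import defaultdict
from typing import Iterable, NamedTuple


class Verdict(NamedTuple):
    model_a: str
    model_b: str
    winner: str  # name of the model the judge preferred for this pull request


def win_rates(verdicts: Iterable[Verdict]) -> dict[str, float]:
    wins: dict[str, int] = defaultdict(int)
    comparisons: dict[str, int] = defaultdict(int)
    for v in verdicts:
        # Each verdict counts as one comparison for both models involved.
        comparisons[v.model_a] += 1
        comparisons[v.model_b] += 1
        wins[v.winner] += 1
    return {model: wins[model] / comparisons[model] for model in comparisons}


# Example: two judged pull requests comparing the same pair of models.
print(win_rates([
    Verdict("sonnet-3.7", "gemini-2.5-pro", "gemini-2.5-pro"),
    Verdict("sonnet-3.7", "gemini-2.5-pro", "sonnet-3.7"),
]))
# -> {'sonnet-3.7': 0.5, 'gemini-2.5-pro': 0.5}
```

The qualitative "why" explanations from the judge are analyzed separately and feed the Model Card's strengths/weaknesses narrative rather than this numeric aggregation.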