mirror of https://github.com/qodo-ai/pr-agent.git (synced 2025-07-03 20:30:41 +08:00)
docs: add link to example model card in benchmark documentation
@@ -12,7 +12,7 @@ Our diverse dataset comprises of 400 pull requests from over 100 repositories, s
 - We aggregate comparison outcomes across all the pull requests, calculating the win rate for each model. We also analyze the qualitative feedback (the "why" explanations from the judge) to identify each model's comparative strengths and weaknesses.
 This approach provides not just a quantitative score but also a detailed analysis of each model's strengths and weaknesses.
 
-- The final output is a "Model Card", comparing the evaluated model against others. To ensure full transparency and enable community scrutiny, we also share the raw code suggestions generated by each model, and the judge's specific feedback.
+- For each model we build a "Model Card", comparing it against others. To ensure full transparency and enable community scrutiny, we also share the raw code suggestions generated by each model, and the judge's specific feedback. See example for the full output [here](https://github.com/Codium-ai/pr-agent-settings/blob/main/benchmark/sonnet_37_vs_gemini-2.5-pro-preview-05-06.md)
 
 Note that this benchmark focuses on quality: the ability of an LLM to process complex pull request with multiple files and nuanced task to produce high-quality code suggestions.
 Other factors like speed, cost, and availability, while also relevant for model selection, are outside this benchmark's scope.
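For readers of this diff, the win-rate aggregation mentioned in the first bullet could be sketched roughly as below. This is only an illustration and not PR-Agent's actual benchmark code: the `(model_a, model_b, winner)` record shape, the `win_rates` function name, and the choice to count a tie as a non-win for both models are all assumptions made for this example.

```python
from collections import Counter


def win_rates(outcomes):
    """Compute a per-model win rate from pairwise comparison outcomes.

    `outcomes` is an iterable of (model_a, model_b, winner) tuples, one per
    judged pull request. `winner` is model_a, model_b, or None for a tie.
    This record format is hypothetical, chosen only for illustration.
    """
    wins = Counter()
    comparisons = Counter()
    for model_a, model_b, winner in outcomes:
        if winner not in (model_a, model_b, None):
            raise ValueError(f"winner {winner!r} is not one of the compared models")
        comparisons[model_a] += 1
        comparisons[model_b] += 1
        if winner is not None:
            # Assumption: a tie counts as a non-win for both sides.
            wins[winner] += 1
    return {model: wins[model] / comparisons[model] for model in comparisons}


if __name__ == "__main__":
    sample = [
        ("model_x", "model_y", "model_x"),
        ("model_x", "model_y", "model_y"),
        ("model_x", "model_y", "model_x"),
        ("model_x", "model_y", None),
    ]
    for model, rate in sorted(win_rates(sample).items()):
        print(f"{model}: {rate:.2f}")
```

A real pipeline would also keep the judge's "why" explanations next to each outcome, since the documentation says the qualitative feedback is analyzed as well.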