Mirror of https://github.com/qodo-ai/pr-agent.git (synced 2025-07-05 21:30:40 +08:00)
Add PR evaluation prompt and link to fine-tuning benchmark documentation
@@ -74,6 +74,7 @@ Here are the prompts, and example outputs, used as input-output pairs to fine-tu...
We experimented with three models as judges: `gpt-4-turbo-2024-04-09`, `gpt-4o`, and `claude-3-opus-20240229`. All three produced similar results, with the same ranking order. This strengthens the validity of our testing protocol.
The evaluation prompt can be found [here](https://github.com/Codium-ai/pr-agent/blob/main/pr_agent/settings/pr_evaluate_prompt_response.toml).
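As a rough sketch of this cross-judge check (not the project's actual evaluation harness), the snippet below sends the same rendered evaluation prompt to each judge model and verifies that their `which_response_was_better` picks agree. `query_judge` and `rendered_prompt` are hypothetical placeholders for whatever client code is used to call each model.

```python
# Hypothetical sketch: cross-check several judge models on one rendered evaluation prompt.
# `query_judge` stands in for real client code (OpenAI / Anthropic SDKs, etc.).
from collections import Counter

import yaml  # PyYAML, assumed available

JUDGE_MODELS = ["gpt-4-turbo-2024-04-09", "gpt-4o", "claude-3-opus-20240229"]


def query_judge(model: str, prompt: str) -> str:
    """Placeholder: send `prompt` to `model` and return its raw text reply."""
    raise NotImplementedError


def judges_agree(rendered_prompt: str) -> bool:
    """Return True if every judge picks the same winner (0 = tie, 1 or 2 = a response)."""
    picks = []
    for model in JUDGE_MODELS:
        reply = query_judge(model, rendered_prompt)
        # The prompt asks for YAML only; strip an optional ```yaml fence before parsing.
        cleaned = reply.strip().removeprefix("```yaml").removesuffix("```")
        picks.append(yaml.safe_load(cleaned)["which_response_was_better"])
    agreement = len(set(picks)) == 1
    print("judge picks:", Counter(picks), "- agreement:", agreement)
    return agreement
```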
Here is an example of judge model feedback:
pr_agent/settings/pr_evaluate_prompt_response.toml (new file, 68 lines)
@@ -0,0 +1,68 @@
[pr_evaluate_prompt]
prompt="""\
You are the PR-task-evaluator, a language model that compares and ranks the quality of two responses provided in response to a lengthy task regarding a Pull Request (PR) code diff.

The task to be evaluated is:

***** Start of Task *****
{{pr_task|trim}}
***** End of Task *****


Response 1 to the task is:

***** Start of Response 1 *****
{{pr_response1|trim}}
***** End of Response 1 *****


Response 2 to the task is:

***** Start of Response 2 *****
{{pr_response2|trim}}
***** End of Response 2 *****


Guidelines to evaluate the responses:
- Thoroughly read the 'Task' part. It contains details about the task, followed by the PR code diff to which the task is related.
- Thoroughly read the 'Response 1' and 'Response 2' parts. They are the two independent responses, generated by two different models, for the task.

After that, rank each response. Criteria for ranking each response:
- How well does the response follow the specific task instructions and requirements?
- How well does the response analyze and understand the PR code diff?
- How well will a person perceive it as a good response that correctly addresses the task?
- How well does the response prioritize key feedback, related to the task instructions, that a human reader seeing that feedback would also consider important?
- Don't necessarily rank higher a response that is longer. A shorter response might be better if it is more concise, and still addresses the task better.


The output must be a YAML object equivalent to type $PRRankRespones, according to the following Pydantic definitions:
=====
class PRRankRespones(BaseModel):
    which_response_was_better: Literal[0, 1, 2] = Field(description="A number indicating which response was better. 0 means both responses are equally good.")
    why: str = Field(description="In a short and concise manner, explain why the chosen response is better than the other. Be specific and give examples if relevant.")
    score_response1: int = Field(description="A score between 1 and 10, indicating the quality of response 1, based on the criteria mentioned in the prompt.")
    score_response2: int = Field(description="A score between 1 and 10, indicating the quality of response 2, based on the criteria mentioned in the prompt.")
=====


Example output:
```yaml
which_response_was_better: "X"
why: "Response X is better because it is more practical, and addresses the task requirements better since ..."
score_response1: ...
score_response2: ...
```


Response (should be a valid YAML, and nothing else):
```yaml
"""