diff --git a/docs/docs/finetuning_benchmark/index.md b/docs/docs/finetuning_benchmark/index.md
index b39b7d80..dc7cfa6b 100644
--- a/docs/docs/finetuning_benchmark/index.md
+++ b/docs/docs/finetuning_benchmark/index.md
@@ -74,6 +74,7 @@ Here are the prompts, and example outputs, used as input-output pairs to fine-tune the model:
 
 We experimented with three models as judges: `gpt-4-turbo-2024-04-09`, `gpt-4o`, and `claude-3-opus-20240229`. All three produced similar results, with the same ranking order. This strengthens the validity of our testing protocol.
+The evaluation prompt can be found [here](https://github.com/Codium-ai/pr-agent/blob/main/pr_agent/settings/pr_evaluate_prompt_response.toml).
 
 Here is an example of a judge model feedback:
 
 
diff --git a/pr_agent/settings/pr_evaluate_prompt_response.toml b/pr_agent/settings/pr_evaluate_prompt_response.toml
new file mode 100644
index 00000000..2cc0c6e7
--- /dev/null
+++ b/pr_agent/settings/pr_evaluate_prompt_response.toml
@@ -0,0 +1,68 @@
+[pr_evaluate_prompt]
+prompt="""\
+You are the PR-task-evaluator, a language model that compares and ranks the quality of two responses to a lengthy task regarding a Pull Request (PR) code diff.
+
+
+The task to be evaluated is:
+
+***** Start of Task *****
+{{pr_task|trim}}
+
+***** End of Task *****
+
+
+
+Response 1 to the task is:
+
+***** Start of Response 1 *****
+
+{{pr_response1|trim}}
+
+***** End of Response 1 *****
+
+
+
+Response 2 to the task is:
+
+***** Start of Response 2 *****
+
+{{pr_response2|trim}}
+
+***** End of Response 2 *****
+
+
+
+Guidelines to evaluate the responses:
+- Thoroughly read the 'Task' part. It contains details about the task, followed by the PR code diff to which the task is related.
+- Thoroughly read the 'Response 1' and 'Response 2' parts. They are two independent responses to the task, generated by two different models.
+
+After that, rank each response. Criteria for ranking each response:
+- How well does the response follow the specific task instructions and requirements?
+- How well does the response analyze and understand the PR code diff?
+- How likely is a reader to perceive it as a good response that correctly addresses the task?
+- How well does the response prioritize key feedback, related to the task instructions, that a human reader would also consider important?
+- Don't necessarily rank a longer response higher. A shorter response might be better if it is more concise and still addresses the task better.
+
+
+The output must be a YAML object equivalent to type $PRRankRespones, according to the following Pydantic definitions:
+=====
+class PRRankRespones(BaseModel):
+    which_response_was_better: Literal[0, 1, 2] = Field(description="A number indicating which response was better. 0 means both responses are equally good.")
+    why: str = Field(description="In a short and concise manner, explain why the chosen response is better than the other. Be specific and give examples if relevant.")
+    score_response1: int = Field(description="A score between 1 and 10, indicating the quality of Response 1, based on the criteria mentioned in the prompt.")
+    score_response2: int = Field(description="A score between 1 and 10, indicating the quality of Response 2, based on the criteria mentioned in the prompt.")
+=====
+
+
+Example output:
+```yaml
+which_response_was_better: "X"
+why: "Response X is better because it is more practical, and addresses the task requirements better since ..."
+score_response1: ...
+score_response2: ...
+```
+
+
+Response (should be a valid YAML, and nothing else):
+```yaml
+"""
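
For context, here is a minimal sketch (not part of the PR) of how the `pr_evaluate_prompt` template above could be rendered and how a judge model's reply could be parsed. The settings-file path and the Jinja2 placeholders (`pr_task`, `pr_response1`, `pr_response2`) come from the diff; the helper names, the fence-stripping logic, and the 1-10 score bounds are illustrative assumptions, not pr-agent's actual evaluation code.

```python
# Illustrative sketch only -- helper names and fence handling are assumptions,
# not pr-agent's real evaluation pipeline.
import tomllib  # Python 3.11+; use the `tomli` package on older versions
from typing import Literal

import yaml  # PyYAML
from jinja2 import Template
from pydantic import BaseModel, Field


class PRRankRespones(BaseModel):
    # Mirrors the Pydantic definition embedded in the prompt (name kept verbatim);
    # the ge/le bounds are added here for validation and are an assumption.
    which_response_was_better: Literal[0, 1, 2]
    why: str
    score_response1: int = Field(ge=1, le=10)
    score_response2: int = Field(ge=1, le=10)


def render_eval_prompt(pr_task: str, pr_response1: str, pr_response2: str) -> str:
    # Load the Jinja2 template string from the settings file added in this PR.
    with open("pr_agent/settings/pr_evaluate_prompt_response.toml", "rb") as f:
        settings = tomllib.load(f)
    template = Template(settings["pr_evaluate_prompt"]["prompt"])
    return template.render(
        pr_task=pr_task, pr_response1=pr_response1, pr_response2=pr_response2
    )


def parse_judge_reply(reply: str) -> PRRankRespones:
    # The prompt ends with an open ```yaml fence, so the judge may or may not
    # repeat it; keep whatever sits between the (optional) fences.
    body = reply.split("```yaml")[-1].split("```")[0]
    return PRRankRespones(**yaml.safe_load(body))
```

A caller would send the rendered prompt to one of the judge models listed above and pass the raw completion to `parse_judge_reply`.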