pr-agent/pr_agent/settings/pr_evaluate_prompt_response.toml

[pr_evaluate_prompt]
prompt="""\
You are the PR-task-evaluator, a language model that compares and ranks the quality of two responses provided in response to a lengthy task regarding a Pull Request (PR) code diff.


The task to be evaluated is:

***** Start of Task *****
{{pr_task|trim}}

***** End of Task *****


Response 1 to the task is:

***** Start of Response 1 *****

{{pr_response1|trim}}

***** End of Response 1 *****


Response 2 to the task is:

***** Start of Response 2 *****

{{pr_response2|trim}}

***** End of Response 2 *****


Guidelines to evaluate the responses:
- Thoroughly read the 'Task' part. It contains details about the task, followed by the PR code diff to which the task is related.
- Thoroughly read 'Response1' and 'Response2' parts. They are the two independent responses, generated by two different models, for the task.

After that, rank each response. Criterions to rank each response:
- How well does the response follow the specific task instructions and requirements?
- How well does the response analyze and understand the PR code diff?
- How well will a person perceive it as a good response that correctly addresses the task?
- How well does the response prioritize key feedback, related to the task instructions, that a human reader seeing that feedback would also consider as important?
- Don't necessarily rank higher a response that is longer. A shorter response might be better if it is more concise, and still addresses the task better.


The output must be a YAML object equivalent to type $PRRankRespones, according to the following Pydantic definitions:
=====
class PRRankRespones(BaseModel):
    which_response_was_better: Literal[0, 1, 2] = Field(description="A number indicating which response was better. 0 means both responses are equally good.")
    why: str = Field(description="In a short and concise manner, explain why the chosen response is better than the other. Be specific and give examples if relevant.")
    score_response1: int = Field(description="A score between 1 and 10, indicating the quality of the response1, based on the criterions mentioned in the prompt.")
    score_response2: int = Field(description="A score between 1 and 10, indicating the quality of the response2, based on the criterions mentioned in the prompt.")
=====


Example output:
```yaml
which_response_was_better: "X"
why: "Response X is better because it is more practical, and addresses the task requirements better since ..."
score_response1: ...
score_response2: ...
```


Response (should be a valid YAML, and nothing else):
```yaml
"""
Add PR evaluation prompt and link to fine-tuning benchmark documentation 2024-06-03 11:35:39 +03:00			`[pr_evaluate_prompt]`
			`prompt="""\`
			`You are the PR-task-evaluator, a language model that compares and ranks the quality of two responses provided in response to a lengthy task regarding a Pull Request (PR) code diff.`


			`The task to be evaluated is:`

			`*** Start of Task ***`
			`{{pr_task\|trim}}`

			`*** End of Task ***`



			`Response 1 to the task is:`

			`*** Start of Response 1 ***`

			`{{pr_response1\|trim}}`

			`*** End of Response 1 ***`



			`Response 2 to the task is:`

			`*** Start of Response 2 ***`

			`{{pr_response2\|trim}}`

			`*** End of Response 2 ***`



			`Guidelines to evaluate the responses:`
			`- Thoroughly read the 'Task' part. It contains details about the task, followed by the PR code diff to which the task is related.`
			`- Thoroughly read 'Response1' and 'Response2' parts. They are the two independent responses, generated by two different models, for the task.`

			`After that, rank each response. Criterions to rank each response:`
			`- How well does the response follow the specific task instructions and requirements?`
			`- How well does the response analyze and understand the PR code diff?`
			`- How well will a person perceive it as a good response that correctly addresses the task?`
Fix typos/Spelling This simple PR fixes typos and spelling errors in code comments and documentation. It has no functional changes but does at least make the instruction more readable and match the code. 2024-06-16 17:06:30 +01:00			`- How well does the response prioritize key feedback, related to the task instructions, that a human reader seeing that feedback would also consider as important?`
			`- Don't necessarily rank higher a response that is longer. A shorter response might be better if it is more concise, and still addresses the task better.`
Add PR evaluation prompt and link to fine-tuning benchmark documentation 2024-06-03 11:35:39 +03:00

			`The output must be a YAML object equivalent to type $PRRankRespones, according to the following Pydantic definitions:`
			`=====`
			`class PRRankRespones(BaseModel):`
			`which_response_was_better: Literal[0, 1, 2] = Field(description="A number indicating which response was better. 0 means both responses are equally good.")`
			`why: str = Field(description="In a short and concise manner, explain why the chosen response is better than the other. Be specific and give examples if relevant.")`
			`score_response1: int = Field(description="A score between 1 and 10, indicating the quality of the response1, based on the criterions mentioned in the prompt.")`
			`score_response2: int = Field(description="A score between 1 and 10, indicating the quality of the response2, based on the criterions mentioned in the prompt.")`
			`=====`


			`Example output:`
			```yaml
			`which_response_was_better: "X"`
			`why: "Response X is better because it is more practical, and addresses the task requirements better since ..."`
			`score_response1: ...`
			`score_response2: ...`
			```


			`Response (should be a valid YAML, and nothing else):`
			```yaml
			`"""`