mirror of
https://github.com/qodo-ai/pr-agent.git
synced 2025-07-05 05:10:38 +08:00
Merge pull request #13 from Codium-ai/docs/pr_compression_doc
Docs/pr compression doc
This commit is contained in:
42
PR_COMPRESSION.md
Normal file
42
PR_COMPRESSION.md
Normal file
@ -0,0 +1,42 @@
|
|||||||
|
# Git Patch Logic
|
||||||
|
There are two scenarios:
|
||||||
|
1. The PR is small enough to fit in a single prompt (including system and user prompt)
|
||||||
|
2. The PR is too large to fit in a single prompt (including system and user prompt)
|
||||||
|
|
||||||
|
For both scenarios, we first use the following strategy
|
||||||
|
#### Repo language prioritization strategy
|
||||||
|
|
||||||
|
We prioritize the languages of the repo based on the following criteria:
|
||||||
|
1. Exclude binary files and non code files (e.g. images, pdfs, etc)
|
||||||
|
2. Given the main languages used in the repo
|
||||||
|
2. We sort the PR files by the most common languages in the repo (in descending order):
|
||||||
|
* ```[[file.py, file2.py],[file3.js, file4.jsx],[readme.md]]```
|
||||||
|
|
||||||
|
|
||||||
|
## Small PR
|
||||||
|
In this case, we can fit the entire PR in a single prompt:
|
||||||
|
1. Exclude binary files and non code files (e.g. images, pdfs, etc)
|
||||||
|
2. We Expand the surrounding context of each patch to 6 lines above and below the patch
|
||||||
|
## Large PR
|
||||||
|
|
||||||
|
### Motivation
|
||||||
|
Pull Requests can be very long and contain a lot of information with varying degree of relevance to the pr-agent.
|
||||||
|
We want to be able to pack as much information as possible in a single LMM prompt, while keeping the information relevant to the pr-agent.
|
||||||
|
|
||||||
|
|
||||||
|
|
||||||
|
#### PR compression strategy
|
||||||
|
We prioritize additions over deletions:
|
||||||
|
- Combine all deleted files into a single list (`deleted files`)
|
||||||
|
- File patches are a list of hunks, remove all hunks of type deletion-only from the hunks in the file patch
|
||||||
|
#### Adaptive and token-aware file patch fitting
|
||||||
|
We use [tiktoken](https://github.com/openai/tiktoken) to tokenize the patches after the modifications described above, and we use the following strategy to fit the patches into the prompt:
|
||||||
|
1. Withing each language we sort the files by the number of tokens in the file (in descending order):
|
||||||
|
* ```[[file2.py, file.py],[file4.jsx, file3.js],[readme.md]]```
|
||||||
|
2. Iterate through the patches in the order described above
|
||||||
|
2. Add the patches to the prompt until the prompt reaches a certain buffer from the max token length
|
||||||
|
3. If there are still patches left, add the remaining patches as a list called `other modified files` to the prompt until the prompt reaches the max token length (hard stop), skip the rest of the patches.
|
||||||
|
4. If we haven't reached the max token length, add the `deleted files` to the prompt until the prompt reaches the max token length (hard stop), skip the rest of the patches.
|
||||||
|
|
||||||
|
### Example
|
||||||
|

|
@ -14,6 +14,7 @@ CodiumAI `pr-agent` is an open-source tool is powered by GPT-4 aming to help dev
|
|||||||
- [Quickstart](#Quickstart)
|
- [Quickstart](#Quickstart)
|
||||||
- [Usage and Tools](#usage-and-tools)
|
- [Usage and Tools](#usage-and-tools)
|
||||||
- [Configuration](#Configuration)
|
- [Configuration](#Configuration)
|
||||||
|
- [How it works](#how-it-works)
|
||||||
- [Roadmap](#roadmap)
|
- [Roadmap](#roadmap)
|
||||||
- [Similar projects](#similar-projects)
|
- [Similar projects](#similar-projects)
|
||||||
|
|
||||||
@ -282,6 +283,12 @@ Example for extended suggestion:
|
|||||||
|
|
||||||
---
|
---
|
||||||
|
|
||||||
|
## How it works
|
||||||
|

|
||||||
|
|
||||||
|
Check out the [PR Compression strategy](./PR_COMPRESSION.md) page for more details on how pr-agent works.
|
||||||
|
|
||||||
|
|
||||||
## Roadmap
|
## Roadmap
|
||||||
|
|
||||||
- [ ] Support open-source models, as a replacement for openai models. Note that a minimal requirement for each open-source model is to have 8k+ context, and good support for generating json as an output
|
- [ ] Support open-source models, as a replacement for openai models. Note that a minimal requirement for each open-source model is to have 8k+ context, and good support for generating json as an output
|
||||||
|
BIN
pics/git_patch_logic.png
Normal file
BIN
pics/git_patch_logic.png
Normal file
Binary file not shown.
After Width: | Height: | Size: 346 KiB |
BIN
pics/pr_agent_overview.png
Normal file
BIN
pics/pr_agent_overview.png
Normal file
Binary file not shown.
After Width: | Height: | Size: 406 KiB |
Reference in New Issue
Block a user