Merge pull request #357 from jamesrom/feat/file_ignores

Add support for ignoring files
mrT23
2023-10-08 16:30:02 +03:00
committed by GitHub
7 changed files with 166 additions and 28 deletions

View File

@ -40,7 +40,7 @@ For other git providers, update CONFIG.GIT_PROVIDER accordingly, and check the `
```
docker run --rm -it -e OPENAI.KEY=<your key> -e GITHUB.USER_TOKEN=<your token> codiumai/pr-agent --pr_url <pr_url> ask "<your question>"
```
Note: If you want to ensure you're running a specific version of the Docker image, consider using the image's digest.
The digest is a unique identifier for a specific version of an image. You can pull and run an image using its digest by referencing it like so: repository@sha256:digest. Always ensure you're using the correct and trusted digest for your operations.
1. To request a review for a PR using a specific digest, run the following command:
@ -89,17 +89,17 @@ chmod 600 pr_agent/settings/.secrets.toml
```
export PYTHONPATH=[$PYTHONPATH:]<PATH to pr_agent folder>
python3 -m pr_agent.cli --pr_url <pr_url> /review
python3 -m pr_agent.cli --pr_url <pr_url> /ask <your question>
python3 -m pr_agent.cli --pr_url <pr_url> /describe
python3 -m pr_agent.cli --pr_url <pr_url> /improve
python3 -m pr_agent.cli --pr_url <pr_url> review
python3 -m pr_agent.cli --pr_url <pr_url> ask <your question>
python3 -m pr_agent.cli --pr_url <pr_url> describe
python3 -m pr_agent.cli --pr_url <pr_url> improve
```
---
### Method 3: Run as a GitHub Action
You can use our pre-built GitHub Action Docker image to run PR-Agent as a GitHub Action.
1. Add the following file to your repository under `.github/workflows/pr_agent.yml`:
@ -153,7 +153,7 @@ OPENAI_KEY: <your key>
The GITHUB_TOKEN secret is automatically created by GitHub.
3. Merge this change to your main branch.
When you open your next PR, you should see a comment from `github-actions` bot with a review of your PR, and instructions on how to use the rest of the tools.
4. You may configure PR-Agent by adding environment variables under the env section corresponding to any configurable property in the [configuration](pr_agent/settings/configuration.toml) file. Some examples:
@ -221,12 +221,12 @@ git clone https://github.com/Codium-ai/pr-agent.git
- Copy your app's webhook secret to the webhook_secret field.
- Set deployment_type to 'app' in [configuration.toml](./pr_agent/settings/configuration.toml)
> The .secrets.toml file is not copied to the Docker image by default, and is only used for local development.
> If you want to use the .secrets.toml file in your Docker image, you can remove it from the .dockerignore file.
> In most production environments, you would inject the secrets file as environment variables or as mounted volumes.
> For example, in order to inject a secrets file as a volume in a Kubernetes environment you can update your pod spec to include the following,
> assuming you have a secret named `pr-agent-settings` with a key named `.secrets.toml`:
```
```
volumes:
- name: settings-volume
secret:
@ -322,7 +322,7 @@ Example IAM permissions to that user to allow access to CodeCommit:
"codecommit:PostComment*",
"codecommit:PutCommentReaction",
"codecommit:UpdatePullRequestDescription",
"codecommit:UpdatePullRequestTitle"
],
"Resource": "*"
}
@ -366,8 +366,8 @@ WEBHOOK_SECRET=$(python -c "import secrets; print(secrets.token_hex(10))")
- Your OpenAI key.
- In the [gitlab] section, fill in personal_access_token and shared_secret. The access token can be a personal access token, or a group or project access token.
- Set deployment_type to 'gitlab' in [configuration.toml](./pr_agent/settings/configuration.toml)
5. Create a webhook in GitLab. Set the URL to the URL of your app's server. Set the secret token to the generated secret from step 2.
In the "Trigger" section, check the comments and merge request events boxes.
6. Test your installation by opening a merge request or commenting on a merge request using one of CodiumAI's commands.

View File

@ -29,6 +29,16 @@ In addition to general configuration options, each tool has its own configuratio
The [Tools Guide](./docs/TOOLS_GUIDE.md) provides a detailed description of the different tools and their configurations.
#### Ignoring files from analysis
In some cases, you may want to exclude specific files or directories from the analysis performed by CodiumAI PR-Agent. This can be useful, for example, when you have files that are generated automatically or files that shouldn't be reviewed, like vendored code.
To ignore files or directories, edit the **[ignore.toml](/pr_agent/settings/ignore.toml)** configuration file. This setting is also exposed through the following environment variables:
- `IGNORE.GLOB`
- `IGNORE.REGEX`
See [dynaconf envvars documentation](https://www.dynaconf.com/envvars/).
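As a rough illustration of how the two pattern lists behave, the sketch below translates glob patterns to regular expressions with `fnmatch.translate` and drops every path that matches. The helper name `is_ignored` and the sample paths are hypothetical and not part of PR-Agent. Note that `fnmatch` does not treat `/` specially, so `*` also matches across directory separators.

```python
import fnmatch
import re


def is_ignored(path, globs=(), regexes=()):
    """Return True if path matches any glob or regex ignore pattern."""
    # Glob patterns are converted to regex so both kinds are handled the same way.
    patterns = list(regexes) + [fnmatch.translate(g) for g in globs]
    # re.match anchors at the start of the path.
    return any(re.match(p, path) for p in patterns)


# Hypothetical list of changed files in a PR.
changed = ["vendor/lib/util.go", "src/app.py", "docs/api.generated.md"]
kept = [
    p for p in changed
    if not is_ignored(p, globs=["vendor/**"], regexes=[r".*\.generated\.md$"])
]
print(kept)  # ['src/app.py']
```

Because `*` crosses `/` here, a glob such as `*.py` would also match `a/b/c.py`, which differs from shell-style globbing.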
#### git provider
The [git_provider](pr_agent/settings/configuration.toml#L4) field in the configuration file determines the GIT provider that will be used by PR-Agent. Currently, the following providers are supported:
`
@ -101,7 +111,7 @@ Any configuration value in [configuration file](pr_agent/settings/configuration.
When running PR-Agent from [GitHub App](INSTALL.md#method-5-run-as-a-github-app), the default configurations from a pre-built docker will be initially loaded.
#### GitHub app automatic tools
The [github_app](pr_agent/settings/configuration.toml#L56) section defines GitHub app specific configurations.
An important parameter is `pr_commands`, which is a list of tools that will be **run automatically** when a new PR is opened:
```
[github_app]
@ -133,7 +143,7 @@ Note that a local `.pr_agent.toml` file enables you to edit and customize the de
#### Editing the prompts
The prompts for the various PR-Agent tools are defined in the `pr_agent/settings` folder.
In practice, the prompts are loaded and stored as a standard setting object.
Hence, editing them is similar to editing any other configuration value: just place the relevant key in the `.pr_agent.toml` file, and override the default value.
For example, if you want to edit the prompts of the [describe](./pr_agent/settings/pr_description_prompts.toml) tool, you can add the following to your `.pr_agent.toml` file:
@ -158,7 +168,7 @@ You can configure settings in GitHub action by adding environment variables unde
PR_CODE_SUGGESTIONS.NUM_CODE_SUGGESTIONS: 6 # Increase number of code suggestions
github_action.auto_review: "true" # Enable auto review
github_action.auto_describe: "true" # Enable auto describe
github_action.auto_improve: "false" # Disable auto improve
```
Specifically, `github_action.auto_review`, `github_action.auto_describe` and `github_action.auto_improve` are used to enable/disable automatic tools that run when a new PR is opened.
@ -171,7 +181,7 @@ To use a different model than the default (GPT-4), you need to edit [configurati
For models and environments not from OPENAI, you might need to provide additional keys and other parameters. See below for instructions.
#### Azure
To use Azure, set in your .secrets.toml:
```
api_key = "" # your azure api key
api_type = "azure"
@ -180,16 +190,16 @@ api_base = "" # The base URL for your Azure OpenAI resource. e.g. "https://<you
deployment_id = "" # The deployment name you chose when you deployed the engine
```
and
```
[config]
model="" # the OpenAI model you've deployed on Azure (e.g. gpt-3.5-turbo)
```
in the configuration.toml
#### Huggingface
**Local**
You can run Huggingface models locally through either [VLLM](https://docs.litellm.ai/docs/providers/vllm) or [Ollama](https://docs.litellm.ai/docs/providers/ollama)
E.g. to use a new Huggingface model locally via Ollama, set:
@ -209,7 +219,7 @@ MAX_TOKENS={
model = "ollama/llama2"
[ollama] # in .secrets.toml
api_base = ... # the base url for your huggingface inference endpoint
```
**Inference Endpoints**
@ -230,7 +240,7 @@ model = "huggingface/meta-llama/Llama-2-7b-chat-hf"
[huggingface] # in .secrets.toml
key = ... # your huggingface api key
api_base = ... # the base url for your huggingface inference endpoint
```
(you can obtain a Llama2 key from [here](https://replicate.com/replicate/llama-2-70b-chat/api))
@ -251,12 +261,12 @@ Also review the [AiHandler](pr_agent/algo/ai_handler.py) file for instruction ho
### Working with large PRs
The default mode of CodiumAI is to have a single call per tool, using GPT-4, which has a token limit of 8000 tokens.
This mode provides a very good speed-quality-cost tradeoff, and can handle most PRs successfully.
When the PR is above the token limit, it employs a [PR Compression strategy](./PR_COMPRESSION.md).
However, for very large PRs, or in case you want to emphasize quality over speed and cost, there are 2 possible solutions:
1) [Use a model](#changing-a-model) with larger context, like GPT-32K, or claude-100K. This solution will be applicable for all the tools.
2) For the `/improve` tool, there is an ['extended' mode](./docs/IMPROVE.md) (`/improve --extended`),
which divides the PR into chunks and processes each chunk separately. With this mode, regardless of the model, no compression will be done (but for large PRs, multiple model calls may occur).
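The idea of processing a large PR in several chunks can be sketched as a greedy grouping of patches under a token budget. This is only an illustration under stated assumptions, not PR-Agent's actual implementation: `split_into_chunks` is a hypothetical helper, and `len` stands in for a real tokenizer.

```python
def split_into_chunks(patches, max_tokens, count_tokens=len):
    """Greedily group patches so that each group stays within max_tokens.

    A single patch larger than max_tokens still gets a chunk of its own.
    """
    chunks, current, used = [], [], 0
    for patch in patches:
        size = count_tokens(patch)
        # Start a new chunk when adding this patch would exceed the budget.
        if current and used + size > max_tokens:
            chunks.append(current)
            current, used = [], 0
        current.append(patch)
        used += size
    if current:
        chunks.append(current)
    return chunks


# Character counts stand in for token counts in this toy example.
print(split_into_chunks(["aaaa", "bb", "cccc", "d"], max_tokens=6))
# [['aaaa', 'bb'], ['cccc', 'd']]
```

Each resulting chunk would then be sent to the model in a separate call, which is why large PRs may trigger multiple calls in this mode.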
### Appendix - additional configurations walkthrough
@ -305,4 +315,4 @@ And use the following settings (you have to replace the values) in .secrets.toml
[azure_devops]
org = "https://dev.azure.com/YOUR_ORGANIZATION/"
pat = "YOUR_PAT_TOKEN"
```
```

View File

@ -0,0 +1,31 @@
import fnmatch
import re

from pr_agent.config_loader import get_settings


def filter_ignored(files):
    """
    Filter out files that match the ignore patterns.
    """
    try:
        # load regex patterns (copied, so the settings list is not mutated in place),
        # and translate glob patterns to regex
        patterns = list(get_settings().ignore.regex)
        patterns += [fnmatch.translate(glob) for glob in get_settings().ignore.glob]

        # compile all valid patterns
        compiled_patterns = []
        for r in patterns:
            try:
                compiled_patterns.append(re.compile(r))
            except re.error:
                pass

        # keep filenames that _don't_ match the ignore regex
        for r in compiled_patterns:
            files = [f for f in files if not r.match(f.filename)]

    except Exception as e:
        print(f"Could not filter file list: {e}")

    return files
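
Two behaviours of the filter above are easy to miss: patterns that fail to compile are skipped silently, and `re.match` only matches at the start of the filename, so a regex meant to match a suffix needs a leading `.*`. A minimal standalone sketch (the helper name `compile_valid` and the sample patterns are hypothetical):

```python
import re


def compile_valid(patterns):
    """Compile patterns, skipping any that raise re.error."""
    compiled = []
    for p in patterns:
        try:
            compiled.append(re.compile(p))
        except re.error:
            # Invalid patterns are dropped rather than aborting the whole filter.
            continue
    return compiled


rules = compile_valid(["(((||", r"\.lock$", r".*\.lock$"])
print(len(rules))                           # 2 (the first pattern is invalid)
print(bool(rules[0].match("poetry.lock")))  # False: match() anchors at the start
print(bool(rules[1].match("poetry.lock")))  # True
```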

View File

@ -11,6 +11,7 @@ from github import RateLimitExceededException
from pr_agent.algo import MAX_TOKENS
from pr_agent.algo.git_patch_processing import convert_to_hunks_with_lines_numbers, extend_patch, handle_patch_deletions
from pr_agent.algo.language_handler import sort_files_by_main_languages
from pr_agent.algo.file_filter import filter_ignored
from pr_agent.algo.token_handler import TokenHandler, get_token_encoder
from pr_agent.config_loader import get_settings
from pr_agent.git_providers.git_provider import FilePatchInfo, GitProvider
@ -53,6 +54,8 @@ def get_pr_diff(git_provider: GitProvider, token_handler: TokenHandler, model: s
logging.error(f"Rate limit exceeded for git provider API. original message {e}")
raise
diff_files = filter_ignored(diff_files)
# get pr languages
pr_languages = sort_files_by_main_languages(git_provider.get_languages(), diff_files)
@ -348,16 +351,16 @@ def get_pr_multi_diffs(git_provider: GitProvider,
"""
Retrieves the diff files from a Git provider, sorts them by main language, and generates patches for each file.
The patches are split into multiple groups based on the maximum number of tokens allowed for the given model.
Args:
git_provider (GitProvider): An object that provides access to Git provider APIs.
token_handler (TokenHandler): An object that handles tokens in the context of a pull request.
model (str): The name of the model.
max_calls (int, optional): The maximum number of calls to retrieve diff files. Defaults to 5.
Returns:
List[str]: A list of final diff strings, split into multiple groups based on the maximum number of tokens allowed for the given model.
Raises:
RateLimitExceededException: If the rate limit for the Git provider API is exceeded.
"""
@ -367,6 +370,8 @@ def get_pr_multi_diffs(git_provider: GitProvider,
logging.error(f"Rate limit exceeded for git provider API. original message {e}")
raise
diff_files = filter_ignored(diff_files)
# Sort files by main language
pr_languages = sort_files_by_main_languages(git_provider.get_languages(), diff_files)

View File

@ -14,6 +14,7 @@ global_settings = Dynaconf(
settings_files=[join(current_dir, f) for f in [
"settings/.secrets.toml",
"settings/configuration.toml",
"settings/ignore.toml",
"settings/language_extensions.toml",
"settings/pr_reviewer_prompts.toml",
"settings/pr_questions_prompts.toml",

View File

@ -0,0 +1,11 @@
[ignore]
glob = [
# Ignore files and directories matching these glob patterns.
# See https://docs.python.org/3/library/glob.html
'vendor/**',
]
regex = [
# Ignore files and directories matching these regex patterns.
# See https://learnbyexample.github.io/python-regex-cheatsheet/
]

View File

@ -0,0 +1,80 @@
import pytest
from pr_agent.algo.file_filter import filter_ignored
from pr_agent.config_loader import global_settings


class TestIgnoreFilter:
    def test_no_ignores(self):
        """
        Test no files are ignored when no patterns are specified.
        """
        files = [
            type('', (object,), {'filename': 'file1.py'})(),
            type('', (object,), {'filename': 'file2.java'})(),
            type('', (object,), {'filename': 'file3.cpp'})(),
            type('', (object,), {'filename': 'file4.py'})(),
            type('', (object,), {'filename': 'file5.py'})()
        ]
        assert filter_ignored(files) == files, "Expected all files to be returned when no ignore patterns are given."

    def test_glob_ignores(self, monkeypatch):
        """
        Test files are ignored when glob patterns are specified.
        """
        monkeypatch.setattr(global_settings.ignore, 'glob', ['*.py'])

        files = [
            type('', (object,), {'filename': 'file1.py'})(),
            type('', (object,), {'filename': 'file2.java'})(),
            type('', (object,), {'filename': 'file3.cpp'})(),
            type('', (object,), {'filename': 'file4.py'})(),
            type('', (object,), {'filename': 'file5.py'})()
        ]
        expected = [
            files[1],
            files[2]
        ]

        filtered_files = filter_ignored(files)
        assert filtered_files == expected, f"Expected {[file.filename for file in expected]}, but got {[file.filename for file in filtered_files]}."

    def test_regex_ignores(self, monkeypatch):
        """
        Test files are ignored when regex patterns are specified.
        """
        # raw string avoids the invalid "\." escape sequence warning
        monkeypatch.setattr(global_settings.ignore, 'regex', [r'^file[2-4]\..*$'])

        files = [
            type('', (object,), {'filename': 'file1.py'})(),
            type('', (object,), {'filename': 'file2.java'})(),
            type('', (object,), {'filename': 'file3.cpp'})(),
            type('', (object,), {'filename': 'file4.py'})(),
            type('', (object,), {'filename': 'file5.py'})()
        ]
        expected = [
            files[0],
            files[4]
        ]

        filtered_files = filter_ignored(files)
        assert filtered_files == expected, f"Expected {[file.filename for file in expected]}, but got {[file.filename for file in filtered_files]}."

    def test_invalid_regex(self, monkeypatch):
        """
        Test invalid patterns are quietly ignored.
        """
        monkeypatch.setattr(global_settings.ignore, 'regex', ['(((||', r'^file[2-4]\..*$'])

        files = [
            type('', (object,), {'filename': 'file1.py'})(),
            type('', (object,), {'filename': 'file2.java'})(),
            type('', (object,), {'filename': 'file3.cpp'})(),
            type('', (object,), {'filename': 'file4.py'})(),
            type('', (object,), {'filename': 'file5.py'})()
        ]
        expected = [
            files[0],
            files[4]
        ]

        filtered_files = filter_ignored(files)
        assert filtered_files == expected, f"Expected {[file.filename for file in expected]}, but got {[file.filename for file in filtered_files]}."