Merge pull request #357 from jamesrom/feat/file_ignores

Add support for ignoring files
2025-07-21 04:50:39 +08:00 · 2023-10-08 16:30:02 +03:00
parent fd8c90041c 92e9012fb6
commit 51c817ba29
7 changed files with 166 additions and 28 deletions
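
As a rough illustration of the behaviour this change introduces, the sketch below mirrors the filtering logic on plain filename strings. `filter_names` and its parameters are hypothetical names chosen for this example; the actual entry point in the diff is `filter_ignored` in `pr_agent/algo/file_filter.py`, which reads its patterns from the settings object and works on objects with a `filename` attribute.

```python
import fnmatch
import re


def filter_names(names, regex_patterns, glob_patterns):
    """Return the names that match none of the ignore patterns.

    Hypothetical stand-in for filter_ignored that works on plain strings.
    """
    # Copy the regex list so the caller's list is left untouched,
    # then append the glob patterns translated to regex.
    patterns = list(regex_patterns)
    patterns += [fnmatch.translate(g) for g in glob_patterns]

    compiled = []
    for p in patterns:
        try:
            compiled.append(re.compile(p))
        except re.error:
            # Invalid patterns are skipped rather than aborting the filter.
            pass

    # Keep only names that no compiled pattern matches from the start.
    return [n for n in names if not any(c.match(n) for c in compiled)]


names = ["vendor/lib/a.go", "src/main.py", "file2.java"]
kept = filter_names(names, [r"^file[2-4]\..*$", "(((||"], ["vendor/**"])
print(kept)  # ['src/main.py']
```

The invalid pattern `(((||` is dropped silently, which matches the behaviour exercised by `test_invalid_regex` in this change.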
--- a/INSTALL.md
+++ b/INSTALL.md
@ -40,7 +40,7 @@ For other git providers, update CONFIG.GIT_PROVIDER accordingly, and check the `
 ```
 docker run --rm -it -e OPENAI.KEY=<your key> -e GITHUB.USER_TOKEN=<your token> codiumai/pr-agent --pr_url <pr_url> ask "<your question>"
 ```
-Note: If you want to ensure you're running a specific version of the Docker image, consider using the image's digest. 
+Note: If you want to ensure you're running a specific version of the Docker image, consider using the image's digest.
 The digest is a unique identifier for a specific version of an image. You can pull and run an image using its digest by referencing it like so: repository@sha256:digest. Always ensure you're using the correct and trusted digest for your operations.

 1. To request a review for a PR using a specific digest, run the following command:
@ -89,17 +89,17 @@ chmod 600 pr_agent/settings/.secrets.toml

 ```
 export PYTHONPATH=[$PYTHONPATH:]<PATH to pr_agent folder>
-python3 -m pr_agent.cli --pr_url <pr_url> /review
-python3 -m pr_agent.cli --pr_url <pr_url> /ask <your question>
-python3 -m pr_agent.cli --pr_url <pr_url> /describe
-python3 -m pr_agent.cli --pr_url <pr_url> /improve
+python3 -m pr_agent.cli --pr_url <pr_url> review
+python3 -m pr_agent.cli --pr_url <pr_url> ask <your question>
+python3 -m pr_agent.cli --pr_url <pr_url> describe
+python3 -m pr_agent.cli --pr_url <pr_url> improve
 ```

 ---

 ### Method 3: Run as a GitHub Action

-You can use our pre-built Github Action Docker image to run PR-Agent as a Github Action. 
+You can use our pre-built Github Action Docker image to run PR-Agent as a Github Action.

 1. Add the following file to your repository under `.github/workflows/pr_agent.yml`:

@ -153,7 +153,7 @@ OPENAI_KEY: <your key>

 The GITHUB_TOKEN secret is automatically created by GitHub.

-3. Merge this change to your main branch. 
+3. Merge this change to your main branch.
 When you open your next PR, you should see a comment from `github-actions` bot with a review of your PR, and instructions on how to use the rest of the tools.

 4. You may configure PR-Agent by adding environment variables under the env section corresponding to any configurable property in the [configuration](pr_agent/settings/configuration.toml) file. Some examples:
@ -221,12 +221,12 @@ git clone https://github.com/Codium-ai/pr-agent.git
   - Copy your app's webhook secret to the webhook_secret field.
   - Set deployment_type to 'app' in [configuration.toml](./pr_agent/settings/configuration.toml)

-> The .secrets.toml file is not copied to the Docker image by default, and is only used for local development. 
+> The .secrets.toml file is not copied to the Docker image by default, and is only used for local development.
 > If you want to use the .secrets.toml file in your Docker image, you can remove it from the .dockerignore file.
-> In most production environments, you would inject the secrets file as environment variables or as mounted volumes. 
+> In most production environments, you would inject the secrets file as environment variables or as mounted volumes.
 > For example, in order to inject a secrets file as a volume in a Kubernetes environment you can update your pod spec to include the following,
 > assuming you have a secret named `pr-agent-settings` with a key named `.secrets.toml`:
-``` 
+```
       volumes:
        - name: settings-volume
          secret:
@ -322,7 +322,7 @@ Example IAM permissions to that user to allow access to CodeCommit:
                "codecommit:PostComment*",
                "codecommit:PutCommentReaction",
                "codecommit:UpdatePullRequestDescription",
-                "codecommit:UpdatePullRequestTitle"                
+                "codecommit:UpdatePullRequestTitle"
            ],
            "Resource": "*"
        }
@ -366,8 +366,8 @@ WEBHOOK_SECRET=$(python -c "import secrets; print(secrets.token_hex(10))")
    - Your OpenAI key.
    - In the [gitlab] section, fill in personal_access_token and shared_secret. The access token can be a personal access token, or a group or project access token.
    - Set deployment_type to 'gitlab' in [configuration.toml](./pr_agent/settings/configuration.toml)
-5. Create a webhook in GitLab. Set the URL to the URL of your app's server. Set the secret token to the generated secret from step 2. 
-In the "Trigger" section, check the ‘comments’ and ‘merge request events’ boxes. 
+5. Create a webhook in GitLab. Set the URL to the URL of your app's server. Set the secret token to the generated secret from step 2.
+In the "Trigger" section, check the ‘comments’ and ‘merge request events’ boxes.
 6. Test your installation by opening a merge request or commenting on a merge request using one of CodiumAI's commands.


--- a/Usage.md
+++ b/Usage.md
@ -29,6 +29,16 @@ In addition to general configuration options, each tool has its own configuratio

 The [Tools Guide](./docs/TOOLS_GUIDE.md) provides a detailed description of the different tools and their configurations.

+#### Ignoring files from analysis
+In some cases, you may want to exclude specific files or directories from the analysis performed by CodiumAI PR-Agent. This can be useful, for example, when you have files that are generated automatically or files that shouldn't be reviewed, like vendored code.
+
+To ignore files or directories, edit the **[ignore.toml](/pr_agent/settings/ignore.toml)** configuration file. This setting is also exposed as the following environment variables:
+
+ - `IGNORE.GLOB`
+ - `IGNORE.REGEX`
+
+See [dynaconf envvars documentation](https://www.dynaconf.com/envvars/).
+
 #### git provider
 The [git_provider](pr_agent/settings/configuration.toml#L4) field in the configuration file determines the GIT provider that will be used by PR-Agent. Currently, the following providers are supported:
 `
@ -101,7 +111,7 @@ Any configuration value in [configuration file](pr_agent/settings/configuration.
 When running PR-Agent from [GitHub App](INSTALL.md#method-5-run-as-a-github-app), the default configurations from a pre-built docker will be initially loaded.

 #### GitHub app automatic tools
-The [github_app](pr_agent/settings/configuration.toml#L56) section defines GitHub app specific configurations. 
+The [github_app](pr_agent/settings/configuration.toml#L56) section defines GitHub app specific configurations.
 An important parameter is `pr_commands`, which is a list of tools that will be **run automatically** when a new PR is opened:
 ```
 [github_app]
@ -133,7 +143,7 @@ Note that a local `.pr_agent.toml` file enables you to edit and customize the de

 #### Editing the prompts
 The prompts for the various PR-Agent tools are defined in the `pr_agent/settings` folder.
-In practice, the prompts are loaded and stored as a standard setting object. 
+In practice, the prompts are loaded and stored as a standard setting object.
 Hence, editing them is similar to editing any other configuration value - just place the relevant key in the `.pr_agent.toml` file, and override the default value.

 For example, if you want to edit the prompts of the [describe](./pr_agent/settings/pr_description_prompts.toml) tool, you can add the following to your `.pr_agent.toml` file:
@ -158,7 +168,7 @@ You can configure settings in GitHub action by adding environment variables unde
        PR_CODE_SUGGESTIONS.NUM_CODE_SUGGESTIONS: 6 # Increase number of code suggestions
        github_action.auto_review: "true" # Enable auto review
        github_action.auto_describe: "true" # Enable auto describe
-        github_action.auto_improve: "false" # Disable auto improve      
+        github_action.auto_improve: "false" # Disable auto improve
 ```
 Specifically, `github_action.auto_review`, `github_action.auto_describe` and `github_action.auto_improve` are used to enable/disable automatic tools that run when a new PR is opened.

@ -171,7 +181,7 @@ To use a different model than the default (GPT-4), you need to edit [configurati
 For models and environments not from OPENAI, you might need to provide additional keys and other parameters. See below for instructions.

 #### Azure
-To use Azure, set in your .secrets.toml: 
+To use Azure, set in your .secrets.toml:
 ```
 api_key = "" # your azure api key
 api_type = "azure"
@ -180,16 +190,16 @@ api_base = ""  # The base URL for your Azure OpenAI resource. e.g. "https://<you
 deployment_id = ""  # The deployment name you chose when you deployed the engine
 ```

-and 
+and
 ```
 [config]
 model="" # the OpenAI model you've deployed on Azure (e.g. gpt-3.5-turbo)
 ```
-in the configuration.toml 
+in the configuration.toml

 #### Huggingface

-**Local**  
+**Local**
 You can run Huggingface models locally through either [VLLM](https://docs.litellm.ai/docs/providers/vllm) or [Ollama](https://docs.litellm.ai/docs/providers/ollama)

 E.g. to use a new Huggingface model locally via Ollama, set:
@ -209,7 +219,7 @@ MAX_TOKENS={
 model = "ollama/llama2"

 [ollama] # in .secrets.toml
-api_base = ... # the base url for your huggingface inference endpoint 
+api_base = ... # the base url for your huggingface inference endpoint
 ```

 **Inference Endpoints**
@ -230,7 +240,7 @@ model = "huggingface/meta-llama/Llama-2-7b-chat-hf"

 [huggingface] # in .secrets.toml
 key = ... # your huggingface api key
-api_base = ... # the base url for your huggingface inference endpoint 
+api_base = ... # the base url for your huggingface inference endpoint
 ```
 (you can obtain a Llama2 key from [here](https://replicate.com/replicate/llama-2-70b-chat/api))

@ -251,12 +261,12 @@ Also review the [AiHandler](pr_agent/algo/ai_handler.py) file for instruction ho
 ### Working with large PRs

 The default mode of CodiumAI is to have a single call per tool, using GPT-4, which has a token limit of 8000 tokens.
-This mode provides a very good speed-quality-cost tradeoff, and can handle most PRs successfully. 
+This mode provides a very good speed-quality-cost tradeoff, and can handle most PRs successfully.
 When the PR is above the token limit, it employs a [PR Compression strategy](./PR_COMPRESSION.md).

 However, for very large PRs, or in case you want to emphasize quality over speed and cost, there are 2 possible solutions:
 1) [Use a model](#changing-a-model) with larger context, like GPT-32K, or claude-100K. This solution will be applicable for all the tools.
-2) For the `/improve` tool, there is an ['extended' mode](./docs/IMPROVE.md) (`/improve --extended`), 
+2) For the `/improve` tool, there is an ['extended' mode](./docs/IMPROVE.md) (`/improve --extended`),
 which divides the PR into chunks and processes each chunk separately. With this mode, regardless of the model, no compression will be done (but for large PRs, multiple model calls may occur)

 ### Appendix - additional configurations walkthrough
@ -305,4 +315,4 @@ And use the following settings (you have to replace the values) in .secrets.toml
 [azure_devops]
 org = "https://dev.azure.com/YOUR_ORGANIZATION/"
 pat = "YOUR_PAT_TOKEN"
-```
+```
--- a/pr_agent/algo/file_filter.py
+++ b/pr_agent/algo/file_filter.py
@ -0,0 +1,31 @@
+import fnmatch
+import re
+
+from pr_agent.config_loader import get_settings
+
+def filter_ignored(files):
+    """
+    Filter out files that match the ignore patterns.
+    """
+
+    try:
+        # load regex patterns, and translate glob patterns to regex
+        patterns = list(get_settings().ignore.regex)  # copy, so += below does not mutate the settings list
+        patterns += [fnmatch.translate(glob) for glob in get_settings().ignore.glob]
+
+        # compile all valid patterns
+        compiled_patterns = []
+        for r in patterns:
+            try:
+                compiled_patterns.append(re.compile(r))
+            except re.error:
+                pass
+
+        # keep filenames that _don't_ match the ignore regex
+        for r in compiled_patterns:
+            files = [f for f in files if not r.match(f.filename)]
+
+    except Exception as e:
+        print(f"Could not filter file list: {e}")
+
+    return files
--- a/pr_agent/algo/pr_processing.py
+++ b/pr_agent/algo/pr_processing.py
@ -11,6 +11,7 @@ from github import RateLimitExceededException
 from pr_agent.algo import MAX_TOKENS
 from pr_agent.algo.git_patch_processing import convert_to_hunks_with_lines_numbers, extend_patch, handle_patch_deletions
 from pr_agent.algo.language_handler import sort_files_by_main_languages
+from pr_agent.algo.file_filter import filter_ignored
 from pr_agent.algo.token_handler import TokenHandler, get_token_encoder
 from pr_agent.config_loader import get_settings
 from pr_agent.git_providers.git_provider import FilePatchInfo, GitProvider
@ -53,6 +54,8 @@ def get_pr_diff(git_provider: GitProvider, token_handler: TokenHandler, model: s
        logging.error(f"Rate limit exceeded for git provider API. original message {e}")
        raise

+    diff_files = filter_ignored(diff_files)
+
    # get pr languages
    pr_languages = sort_files_by_main_languages(git_provider.get_languages(), diff_files)

@ -348,16 +351,16 @@ def get_pr_multi_diffs(git_provider: GitProvider,
    """
    Retrieves the diff files from a Git provider, sorts them by main language, and generates patches for each file.
    The patches are split into multiple groups based on the maximum number of tokens allowed for the given model.
-    
+
    Args:
        git_provider (GitProvider): An object that provides access to Git provider APIs.
        token_handler (TokenHandler): An object that handles tokens in the context of a pull request.
        model (str): The name of the model.
        max_calls (int, optional): The maximum number of calls to retrieve diff files. Defaults to 5.
-    
+
    Returns:
        List[str]: A list of final diff strings, split into multiple groups based on the maximum number of tokens allowed for the given model.
-    
+
    Raises:
        RateLimitExceededException: If the rate limit for the Git provider API is exceeded.
    """
@ -367,6 +370,8 @@ def get_pr_multi_diffs(git_provider: GitProvider,
        logging.error(f"Rate limit exceeded for git provider API. original message {e}")
        raise

+    diff_files = filter_ignored(diff_files)
+
    # Sort files by main language
    pr_languages = sort_files_by_main_languages(git_provider.get_languages(), diff_files)

--- a/pr_agent/config_loader.py
+++ b/pr_agent/config_loader.py
@ -14,6 +14,7 @@ global_settings = Dynaconf(
    settings_files=[join(current_dir, f) for f in [
        "settings/.secrets.toml",
        "settings/configuration.toml",
+        "settings/ignore.toml",
        "settings/language_extensions.toml",
        "settings/pr_reviewer_prompts.toml",
        "settings/pr_questions_prompts.toml",
--- a/pr_agent/settings/ignore.toml
+++ b/pr_agent/settings/ignore.toml
@ -0,0 +1,11 @@
+[ignore]
+
+glob = [
+    # Ignore files and directories matching these glob patterns.
+    # Patterns are translated with fnmatch; see https://docs.python.org/3/library/fnmatch.html
+    'vendor/**',
+]
+regex = [
+    # Ignore files and directories matching these regex patterns.
+    # See https://learnbyexample.github.io/python-regex-cheatsheet/
+]
--- a/tests/unittest/test_file_filter.py
+++ b/tests/unittest/test_file_filter.py
@ -0,0 +1,80 @@
+import pytest
+from pr_agent.algo.file_filter import filter_ignored
+from pr_agent.config_loader import global_settings
+
+class TestIgnoreFilter:
+    def test_no_ignores(self):
+        """
+        Test no files are ignored when no patterns are specified.
+        """
+        files = [
+            type('', (object,), {'filename': 'file1.py'})(),
+            type('', (object,), {'filename': 'file2.java'})(),
+            type('', (object,), {'filename': 'file3.cpp'})(),
+            type('', (object,), {'filename': 'file4.py'})(),
+            type('', (object,), {'filename': 'file5.py'})()
+        ]
+        assert filter_ignored(files) == files, "Expected all files to be returned when no ignore patterns are given."
+
+    def test_glob_ignores(self, monkeypatch):
+        """
+        Test files are ignored when glob patterns are specified.
+        """
+        monkeypatch.setattr(global_settings.ignore, 'glob', ['*.py'])
+
+        files = [
+            type('', (object,), {'filename': 'file1.py'})(),
+            type('', (object,), {'filename': 'file2.java'})(),
+            type('', (object,), {'filename': 'file3.cpp'})(),
+            type('', (object,), {'filename': 'file4.py'})(),
+            type('', (object,), {'filename': 'file5.py'})()
+        ]
+        expected = [
+            files[1],
+            files[2]
+        ]
+
+        filtered_files = filter_ignored(files)
+        assert filtered_files == expected, f"Expected {[file.filename for file in expected]}, but got {[file.filename for file in filtered_files]}."
+
+    def test_regex_ignores(self, monkeypatch):
+        """
+        Test files are ignored when regex patterns are specified.
+        """
+        monkeypatch.setattr(global_settings.ignore, 'regex', [r'^file[2-4]\..*$'])
+
+        files = [
+            type('', (object,), {'filename': 'file1.py'})(),
+            type('', (object,), {'filename': 'file2.java'})(),
+            type('', (object,), {'filename': 'file3.cpp'})(),
+            type('', (object,), {'filename': 'file4.py'})(),
+            type('', (object,), {'filename': 'file5.py'})()
+        ]
+        expected = [
+            files[0],
+            files[4]
+        ]
+
+        filtered_files = filter_ignored(files)
+        assert filtered_files == expected, f"Expected {[file.filename for file in expected]}, but got {[file.filename for file in filtered_files]}."
+
+    def test_invalid_regex(self, monkeypatch):
+        """
+        Test invalid patterns are quietly ignored.
+        """
+        monkeypatch.setattr(global_settings.ignore, 'regex', ['(((||', r'^file[2-4]\..*$'])
+
+        files = [
+            type('', (object,), {'filename': 'file1.py'})(),
+            type('', (object,), {'filename': 'file2.java'})(),
+            type('', (object,), {'filename': 'file3.cpp'})(),
+            type('', (object,), {'filename': 'file4.py'})(),
+            type('', (object,), {'filename': 'file5.py'})()
+        ]
+        expected = [
+            files[0],
+            files[4]
+        ]
+
+        filtered_files = filter_ignored(files)
+        assert filtered_files == expected, f"Expected {[file.filename for file in expected]}, but got {[file.filename for file in filtered_files]}."
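
One detail of the glob handling may be worth spelling out: because globs are converted with `fnmatch.translate` and then applied with `re.match`, a `*` also matches path separators, and every pattern is anchored at the start of the filename. The sketch below demonstrates this under the assumption that filenames are repository-relative paths; `glob_matches` is a hypothetical helper written for this example and is not part of the change.

```python
import fnmatch
import re


def glob_matches(pattern, path):
    """Return True if the glob pattern matches the whole path.

    Hypothetical helper: translates the glob to a regex the same way
    the filter in this change does, then matches from the start.
    """
    return re.match(fnmatch.translate(pattern), path) is not None


# '*' is not limited to a single path segment, so '*.py' also
# matches Python files inside subdirectories.
print(glob_matches("*.py", "src/app.py"))             # True

# 'vendor/**' matches anything below a top-level vendor directory...
print(glob_matches("vendor/**", "vendor/lib/a.go"))   # True

# ...but not a vendor directory nested deeper in the tree, because
# the translated pattern must match from the start of the path.
print(glob_matches("vendor/**", "src/vendor/a.go"))   # False
```

To ignore nested directories as well, a pattern such as `**/vendor/**` or an explicit regex would likely be needed.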