Improve YAML parsing with additional fallback strategies for AI predictions

Merge pull request #1786 from qodo-ai/pr-1736
Pr 1736
2025-07-12 16:50:37 +08:00 · 2025-05-17 20:38:05 +03:00 · 2025-05-17 15:29:23 +03:00 · 2025-05-16 17:25:10 +03:00 · 2025-05-16 17:23:27 +03:00 · 2025-05-16 17:20:54 +03:00
13 changed files with 203 additions and 14 deletions
--- a/docs/docs/usage-guide/additional_configurations.md
+++ b/docs/docs/usage-guide/additional_configurations.md
@@ -164,6 +164,7 @@ Qodo Merge allows you to automatically ignore certain PRs based on various crite
 - PRs with specific titles (using regex matching)
 - PRs between specific branches (using regex matching)
 - PRs from specific repositories (using regex matching)
 - PRs not from specific folders
 - PRs containing specific labels
 - PRs opened by specific users
@@ -172,7 +173,7 @@ Qodo Merge allows you to automatically ignore certain PRs based on various crite
 To ignore PRs with a specific title such as "[Bump]: ...", you can add the following to your `configuration.toml` file:
-```
+```toml
 [config]
 ignore_pr_title = ["\\[Bump\\]"]
 ```
@@ -183,7 +184,7 @@ Where the `ignore_pr_title` is a list of regex patterns to match the PR title yo
 To ignore PRs from specific source or target branches, you can add the following to your `configuration.toml` file:
-```
+```toml
 [config]
 ignore_pr_source_branches = ['develop', 'main', 'master', 'stage']
 ignore_pr_target_branches = ["qa"]
@@ -192,6 +193,18 @@ ignore_pr_target_branches = ["qa"]
 Where the `ignore_pr_source_branches` and `ignore_pr_target_branches` are lists of regex patterns to match the source and target branches you want to ignore.
 They are not mutually exclusive, you can use them together or separately.
 ### Ignoring PRs from specific repositories
 To ignore PRs from specific repositories, you can add the following to your `configuration.toml` file:
 ```toml
 [config]
 ignore_repositories = ["my-org/my-repo1", "my-org/my-repo2"]
 ```
 Where the `ignore_repositories` is a list of regex patterns to match the repositories you want to ignore. This is useful when you have multiple repositories and want to exclude certain ones from analysis.
 ### Ignoring PRs not from specific folders
 To allow only specific folders (often needed in large monorepos), set:
--- a/docs/docs/usage-guide/changing_a_model.md
+++ b/docs/docs/usage-guide/changing_a_model.md
@@ -16,6 +16,23 @@ You can give parameters via a configuration file, or from environment variables.
    See [litellm documentation](https://litellm.vercel.app/docs/proxy/quick_start#supported-llms) for the environment variables needed per model, as they may vary and change over time. Our documentation per-model may not always be up-to-date with the latest changes.
    Failing to set the needed keys of a specific model will usually result in litellm not identifying the model type, and failing to utilize it.
 ### OpenAI like API
 To use an OpenAI like API, set the following in your `.secrets.toml` file:
 ```toml
 [openai]
 api_base = "https://api.openai.com/v1"
 api_key = "sk-..."
 ```
 or use the environment variables (make sure to use double underscores `__`):
 ```bash
 OPENAI__API_BASE=https://api.openai.com/v1
 OPENAI__KEY=sk-...
 ```
 ### Azure
 To use Azure, set in your `.secrets.toml` (working from CLI), or in the GitHub `Settings > Secrets and variables` (working from GitHub App or GitHub Action):
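Review note: the double-underscore convention in the added docs can be pictured with a small standard-library sketch. It only illustrates the `SECTION__KEY` naming scheme; `env_to_nested` is a hypothetical helper written for this note, not the settings loader's actual implementation.

```python
def env_to_nested(environ, sep="__"):
    """Group SECTION__KEY variables into {section: {key: value}} (illustration only)."""
    nested = {}
    for name, value in environ.items():
        if sep not in name:
            continue  # names without a double underscore are not split into section/key
        section, key = name.split(sep, 1)
        nested.setdefault(section.lower(), {})[key.lower()] = value
    return nested


example_env = {
    "OPENAI__API_BASE": "https://api.openai.com/v1",
    "OPENAI__KEY": "sk-...",
    "OPENAI_KEY": "ignored",  # single underscore: skipped by this sketch
}
settings = env_to_nested(example_env)
print(settings)
```

Under that convention, `OPENAI__KEY` corresponds to a `key` entry under `[openai]`, whereas the TOML snippet in the same hunk sets `api_key`; this diff does not show which of the two names the handler reads.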
--- a/pr_agent/algo/__init__.py
+++ b/pr_agent/algo/__init__.py
@@ -58,6 +58,7 @@ MAX_TOKENS = {
    'vertex_ai/claude-3-7-sonnet@20250219': 200000,
    'vertex_ai/gemini-1.5-pro': 1048576,
    'vertex_ai/gemini-2.5-pro-preview-03-25': 1048576,
    'vertex_ai/gemini-2.5-pro-preview-05-06': 1048576,
    'vertex_ai/gemini-1.5-flash': 1048576,
    'vertex_ai/gemini-2.0-flash': 1048576,
    'vertex_ai/gemini-2.5-flash-preview-04-17': 1048576,
@@ -66,6 +67,7 @@ MAX_TOKENS = {
    'gemini/gemini-1.5-flash': 1048576,
    'gemini/gemini-2.0-flash': 1048576,
    'gemini/gemini-2.5-pro-preview-03-25': 1048576,
    'gemini/gemini-2.5-pro-preview-05-06': 1048576,
    'codechat-bison': 6144,
    'codechat-bison-32k': 32000,
    'anthropic.claude-instant-v1': 100000,
--- a/pr_agent/algo/ai_handlers/litellm_ai_handler.py
+++ b/pr_agent/algo/ai_handlers/litellm_ai_handler.py
@@ -59,6 +59,7 @@ class LiteLLMAIHandler(BaseAiHandler):
            litellm.api_version = get_settings().openai.api_version
        if get_settings().get("OPENAI.API_BASE", None):
            litellm.api_base = get_settings().openai.api_base
            self.api_base = get_settings().openai.api_base
        if get_settings().get("ANTHROPIC.KEY", None):
            litellm.anthropic_key = get_settings().anthropic.key
        if get_settings().get("COHERE.KEY", None):
--- a/pr_agent/algo/utils.py
+++ b/pr_agent/algo/utils.py
@@ -731,8 +731,9 @@ def try_fix_yaml(response_text: str,
                 response_text_original="") -> dict:
    response_text_lines = response_text.split('\n')
-    keys_yaml = ['relevant line:', 'suggestion content:', 'relevant file:', 'existing code:', 'improved code:']
+    keys_yaml = ['relevant line:', 'suggestion content:', 'relevant file:', 'existing code:', 'improved code:', 'label:']
    keys_yaml = keys_yaml + keys_fix_yaml
    # first fallback - try to convert 'relevant line: ...' to relevant line: |-\n        ...'
    response_text_lines_copy = response_text_lines.copy()
    for i in range(0, len(response_text_lines_copy)):
@@ -747,8 +748,29 @@ def try_fix_yaml(response_text: str,
    except:
        pass
-    # second fallback - try to extract only range from first ```yaml to ````
-    snippet_pattern = r'```(yaml)?[\s\S]*?```'
+    # 1.5 fallback - try to convert '|' to '|2'. Will solve cases of indent decreasing during the code
+    response_text_copy = copy.deepcopy(response_text)
    response_text_copy = response_text_copy.replace('|\n', '|2\n')
    try:
        data = yaml.safe_load(response_text_copy)
        get_logger().info(f"Successfully parsed AI prediction after replacing | with |2")
        return data
    except:
        # if it fails, we can try to add spaces to the lines that are not indented properly, and contain '}'.
        response_text_lines_copy = response_text_copy.split('\n')
        for i in range(0, len(response_text_lines_copy)):
            initial_space = len(response_text_lines_copy[i]) - len(response_text_lines_copy[i].lstrip())
            if initial_space == 2 and '|2' not in response_text_lines_copy[i] and '}' in response_text_lines_copy[i]:
                response_text_lines_copy[i] = '    ' + response_text_lines_copy[i].lstrip()
        try:
            data = yaml.safe_load('\n'.join(response_text_lines_copy))
            get_logger().info(f"Successfully parsed AI prediction after replacing | with |2 and adding spaces")
            return data
        except:
            pass
    # second fallback - try to extract only range from first ```yaml to the last ```
    snippet_pattern = r'```yaml([\s\S]*?)```(?=\s*$|")'
    snippet = re.search(snippet_pattern, '\n'.join(response_text_lines_copy))
    if not snippet:
        snippet = re.search(snippet_pattern, response_text_original) # before we removed the "```"
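Review note: the effect of the new `snippet_pattern` can be checked in isolation with the standard `re` module. The sketch below compares the removed and added patterns on a made-up response whose YAML body itself contains a nested code fence; the sample text is an assumption for illustration, and the fence is built with `"`" * 3` only to keep literal triple backticks out of this block.

```python
import re

FENCE = "`" * 3  # three backticks, built programmatically

# Pattern removed by this hunk and pattern added by it.
OLD_PATTERN = FENCE + r'(yaml)?[\s\S]*?' + FENCE
NEW_PATTERN = FENCE + r'yaml([\s\S]*?)' + FENCE + r'(?=\s*$|")'

# Hypothetical model response: a YAML block that embeds a nested python fence.
response = (
    FENCE + "yaml\n"
    "improved_code: |\n"
    "  " + FENCE + "python\n"
    "  x = 1\n"
    "  " + FENCE + "\n"
    "label: bug\n"
    + FENCE + "\n"
)

old_match = re.search(OLD_PATTERN, response)
new_match = re.search(NEW_PATTERN, response)

# The old pattern stops at the first inner fence; the new one needs the
# closing fence to sit at the end of the text (or before a double quote).
print(repr(old_match.group(0)))
print(repr(new_match.group(1)))
```

Because `$` is used without `re.MULTILINE`, the lookahead only accepts a closing fence followed by trailing whitespace up to the end of the string, which is why the nested fences are skipped here.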
@@ -803,16 +825,47 @@ def try_fix_yaml(response_text: str,
    except:
        pass
-    # sixth fallback - try to remove last lines
-    for i in range(1, len(response_text_lines)):
-        response_text_lines_tmp = '\n'.join(response_text_lines[:-i])
+    # sixth fallback - replace tabs with spaces
+    if '\t' in response_text:
+        response_text_copy = copy.deepcopy(response_text)
        response_text_copy = response_text_copy.replace('\t', '    ')
        try:
-            data = yaml.safe_load(response_text_lines_tmp)
-            get_logger().info(f"Successfully parsed AI prediction after removing {i} lines")
+            data = yaml.safe_load(response_text_copy)
+            get_logger().info(f"Successfully parsed AI prediction after replacing tabs with spaces")
            return data
        except:
            pass
    # seventh fallback - add indent for sections of code blocks
    response_text_copy = copy.deepcopy(response_text)
    response_text_copy_lines = response_text_copy.split('\n')
    start_line = -1
    for i, line in enumerate(response_text_copy_lines):
        if 'existing_code:' in line or 'improved_code:' in line:
            start_line = i
        elif line.endswith(': |') or line.endswith(': |-') or line.endswith(': |2') or line.endswith(':'):
            start_line = -1
        elif start_line != -1:
            response_text_copy_lines[i] = '    ' + line
    response_text_copy = '\n'.join(response_text_copy_lines)
    try:
        data = yaml.safe_load(response_text_copy)
        get_logger().info(f"Successfully parsed AI prediction after adding indent for sections of code blocks")
        return data
    except:
        pass
    # # sixth fallback - try to remove last lines
    # for i in range(1, len(response_text_lines)):
    #     response_text_lines_tmp = '\n'.join(response_text_lines[:-i])
    #     try:
    #         data = yaml.safe_load(response_text_lines_tmp)
    #         get_logger().info(f"Successfully parsed AI prediction after removing {i} lines")
    #         return data
    #     except:
    #         pass
 def set_custom_labels(variables, git_provider=None):
    if not get_settings().config.enable_custom_labels:
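Review note: the text transformation performed by the seventh fallback can be exercised without a YAML parser. The sketch below copies the loop from the hunk into a standalone function (`indent_code_sections` is a name chosen for this note) and runs it on a made-up prediction whose code lines sit at the same indentation as their keys.

```python
def indent_code_sections(text):
    """Indent lines that follow existing_code/improved_code keys by four spaces."""
    lines = text.split('\n')
    start_line = -1
    for i, line in enumerate(lines):
        if 'existing_code:' in line or 'improved_code:' in line:
            start_line = i  # a code section begins on the next line
        elif line.endswith((': |', ': |-', ': |2', ':')):
            start_line = -1  # any other key-like line ends the section
        elif start_line != -1:
            lines[i] = '    ' + line
    return '\n'.join(lines)


# Hypothetical malformed prediction: code lines are not indented under their keys.
sample = '\n'.join([
    "- relevant_file: |",
    "    src/app.py",
    "  existing_code: |",
    "  x = 1",
    "  y = 2",
    "  improved_code: |",
    "  x = 10",
    "  label: |",
    "    bug",
])

fixed = indent_code_sections(sample)
print(fixed)
```

One consequence of the `endswith(':')` check as written is that a code line such as `def f():` also ends the section, so the lines after it are left unchanged.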
--- a/pr_agent/git_providers/utils.py
+++ b/pr_agent/git_providers/utils.py
@@ -6,8 +6,7 @@ from dynaconf import Dynaconf
 from starlette_context import context
 from pr_agent.config_loader import get_settings
-from pr_agent.git_providers import (get_git_provider,
-                                    get_git_provider_with_context)
+from pr_agent.git_providers import get_git_provider_with_context
 from pr_agent.log import get_logger
--- a/pr_agent/identity_providers/identity_provider.py
+++ b/pr_agent/identity_providers/identity_provider.py
@@ -10,7 +10,7 @@ class Eligibility(Enum):
 class IdentityProvider(ABC):
    @abstractmethod
-    def verify_eligibility(self, git_provider, git_provier_id, pr_url):
+    def verify_eligibility(self, git_provider, git_provider_id, pr_url):
        pass
    @abstractmethod
--- a/pr_agent/servers/bitbucket_app.py
+++ b/pr_agent/servers/bitbucket_app.py
@@ -127,6 +127,14 @@ def should_process_pr_logic(data) -> bool:
        source_branch = pr_data.get("source", {}).get("branch", {}).get("name", "")
        target_branch = pr_data.get("destination", {}).get("branch", {}).get("name", "")
        sender = _get_username(data)
        repo_full_name = pr_data.get("destination", {}).get("repository", {}).get("full_name", "")
        # logic to ignore PRs from specific repositories
        ignore_repos = get_settings().get("CONFIG.IGNORE_REPOSITORIES", [])
        if repo_full_name and ignore_repos:
            if any(re.search(regex, repo_full_name) for regex in ignore_repos):
                get_logger().info(f"Ignoring PR from repository '{repo_full_name}' due to 'config.ignore_repositories' setting")
                return False
        # logic to ignore PRs from specific users
        ignore_pr_users = get_settings().get("CONFIG.IGNORE_PR_AUTHORS", [])
--- a/pr_agent/servers/github_app.py
+++ b/pr_agent/servers/github_app.py
@@ -258,6 +258,14 @@ def should_process_pr_logic(body) -> bool:
        source_branch = pull_request.get("head", {}).get("ref", "")
        target_branch = pull_request.get("base", {}).get("ref", "")
        sender = body.get("sender", {}).get("login")
        repo_full_name = body.get("repository", {}).get("full_name", "")
        # logic to ignore PRs from specific repositories
        ignore_repos = get_settings().get("CONFIG.IGNORE_REPOSITORIES", [])
        if ignore_repos and repo_full_name:
            if any(re.search(regex, repo_full_name) for regex in ignore_repos):
                get_logger().info(f"Ignoring PR from repository '{repo_full_name}' due to 'config.ignore_repositories' setting")
                return False
        # logic to ignore PRs from specific users
        ignore_pr_users = get_settings().get("CONFIG.IGNORE_PR_AUTHORS", [])
--- a/pr_agent/servers/gitlab_webhook.py
+++ b/pr_agent/servers/gitlab_webhook.py
@@ -113,6 +113,14 @@ def should_process_pr_logic(data) -> bool:
            return False
        title = data['object_attributes'].get('title')
        sender = data.get("user", {}).get("username", "")
        repo_full_name = data.get('project', {}).get('path_with_namespace', "")
        # logic to ignore PRs from specific repositories
        ignore_repos = get_settings().get("CONFIG.IGNORE_REPOSITORIES", [])
        if ignore_repos and repo_full_name:
            if any(re.search(regex, repo_full_name) for regex in ignore_repos):
                get_logger().info(f"Ignoring MR from repository '{repo_full_name}' due to 'config.ignore_repositories' setting")
                return False
        # logic to ignore PRs from specific users
        ignore_pr_users = get_settings().get("CONFIG.IGNORE_PR_AUTHORS", [])
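Review note: all three providers apply `re.search` to the repository full name, so each `ignore_repositories` entry is an unanchored regex rather than an exact name. The sketch below mirrors that check in a standalone function (`is_ignored` is a name chosen for this note) to show the practical difference between a plain entry and an anchored one.

```python
import re


def is_ignored(repo_full_name, ignore_patterns):
    # Same shape as the check in the diff: unanchored re.search per pattern.
    return any(re.search(pattern, repo_full_name) for pattern in ignore_patterns)


plain = ["my-org/my-repo1"]
anchored = [r"^my-org/my-repo1$"]

exact_hit = is_ignored("my-org/my-repo1", plain)
prefix_hit = is_ignored("my-org/my-repo10", plain)        # substring match also hits
anchored_hit = is_ignored("my-org/my-repo1", anchored)
anchored_miss = is_ignored("my-org/my-repo10", anchored)  # anchors rule this out

print(exact_hit, prefix_hit, anchored_hit, anchored_miss)
```

If exact repository names are intended, anchoring the pattern with `^` and `$` (and escaping characters such as `.`) keeps similarly named repositories from being skipped.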
--- a/pr_agent/settings/configuration.toml
+++ b/pr_agent/settings/configuration.toml
@@ -55,6 +55,7 @@ ignore_pr_target_branches = [] # a list of regular expressions of target branche
 ignore_pr_source_branches = [] # a list of regular expressions of source branches to ignore from PR agent when an PR is created
 ignore_pr_labels = [] # labels to ignore from PR agent when an PR is created
 ignore_pr_authors = [] # authors to ignore from PR agent when an PR is created
 ignore_repositories = [] # a list of regular expressions of repository full names (e.g. "org/repo") to ignore from PR agent processing
 #
 is_auto_command = false # will be auto-set to true if the command is triggered by an automation
 enable_ai_metadata = false # will enable adding ai metadata
--- a/requirements.txt
+++ b/requirements.txt
@@ -13,7 +13,7 @@ google-cloud-aiplatform==1.38.0
 google-generativeai==0.8.3
 google-cloud-storage==2.10.0
 Jinja2==3.1.2
-litellm==1.66.3
+litellm==1.69.3
 loguru==0.7.2
 msrest==0.7.1
 openai>=1.55.3
--- a/tests/unittest/test_ignore_repositories.py
+++ b/tests/unittest/test_ignore_repositories.py
@@ -0,0 +1,79 @@
 import pytest
 from pr_agent.servers.github_app import should_process_pr_logic as github_should_process_pr_logic
 from pr_agent.servers.bitbucket_app import should_process_pr_logic as bitbucket_should_process_pr_logic
 from pr_agent.servers.gitlab_webhook import should_process_pr_logic as gitlab_should_process_pr_logic
 from pr_agent.config_loader import get_settings
 def make_bitbucket_payload(full_name):
    return {
        "data": {
            "pullrequest": {
                "title": "Test PR",
                "source": {"branch": {"name": "feature/test"}},
                "destination": {
                    "branch": {"name": "main"},
                    "repository": {"full_name": full_name}
                }
            },
            "actor": {"username": "user", "type": "user"}
        }
    }
 def make_github_body(full_name):
    return {
        "pull_request": {},
        "repository": {"full_name": full_name},
        "sender": {"login": "user"}
    }
 def make_gitlab_body(full_name):
    return {
        "object_attributes": {"title": "Test MR"},
        "project": {"path_with_namespace": full_name}
    }
 PROVIDERS = [
    ("github", github_should_process_pr_logic, make_github_body),
    ("bitbucket", bitbucket_should_process_pr_logic, make_bitbucket_payload),
    ("gitlab", gitlab_should_process_pr_logic, make_gitlab_body),
 ]
 class TestIgnoreRepositories:
    def setup_method(self):
        get_settings().set("CONFIG.IGNORE_REPOSITORIES", [])
    @pytest.mark.parametrize("provider_name, provider_func, body_func", PROVIDERS)
    def test_should_ignore_matching_repository(self, provider_name, provider_func, body_func):
        get_settings().set("CONFIG.IGNORE_REPOSITORIES", ["org/repo-to-ignore"])
        body = {
            "pull_request": {},
            "repository": {"full_name": "org/repo-to-ignore"},
            "sender": {"login": "user"}
        }
        result = provider_func(body_func(body["repository"]["full_name"]))
        # print(f"DEBUG: Provider={provider_name}, test_should_ignore_matching_repository, result={result}")
        assert result is False, f"{provider_name}: PR from ignored repository should be ignored (return False)"
    @pytest.mark.parametrize("provider_name, provider_func, body_func", PROVIDERS)
    def test_should_not_ignore_non_matching_repository(self, provider_name, provider_func, body_func):
        get_settings().set("CONFIG.IGNORE_REPOSITORIES", ["org/repo-to-ignore"])
        body = {
            "pull_request": {},
            "repository": {"full_name": "org/other-repo"},
            "sender": {"login": "user"}
        }
        result = provider_func(body_func(body["repository"]["full_name"]))
        # print(f"DEBUG: Provider={provider_name}, test_should_not_ignore_non_matching_repository, result={result}")
        assert result is True, f"{provider_name}: PR from non-ignored repository should not be ignored (return True)"
    @pytest.mark.parametrize("provider_name, provider_func, body_func", PROVIDERS)
    def test_should_not_ignore_when_config_empty(self, provider_name, provider_func, body_func):
        get_settings().set("CONFIG.IGNORE_REPOSITORIES", [])
        body = {
            "pull_request": {},
            "repository": {"full_name": "org/repo-to-ignore"},
            "sender": {"login": "user"}
        }
        result = provider_func(body_func(body["repository"]["full_name"]))
        # print(f"DEBUG: Provider={provider_name}, test_should_not_ignore_when_config_empty, result={result}")
        assert result is True, f"{provider_name}: PR should not be ignored if ignore_repositories config is empty"
Author	SHA1	Message	Date
mrT23	db5138dc42	Improve YAML parsing with additional fallback strategies for AI predictions	2025-05-17 20:38:05 +03:00
Tal	9a9feb47a6	Merge pull request #1786 from qodo-ai/pr-1736 Pr 1736	2025-05-17 15:29:23 +03:00
mrT23	52ce74a31a	Remove debug print statements from repository filtering tests	2025-05-16 17:25:10 +03:00
mrT23	f47da75e6f	Remove debug print statement from should_process_pr_logic function	2025-05-16 17:23:27 +03:00
mrT23	42557feb97	Enhance repository filtering with regex pattern matching for ignore_repositories	2025-05-16 17:20:54 +03:00
Tal	c15fb16528	Merge pull request #1779 from dnnspaul/main Enable usage of OpenAI like APIs	2025-05-16 16:59:18 +03:00
Tal	d268db5f0d	Merge pull request #1778 from smartandhandsome/main Cleanup: Remove Unused import and Fix Parameter Typo	2025-05-16 16:54:55 +03:00
Tal	ec626f0193	Merge pull request #1785 from qodo-ai/tr/gemini-2.5-pro-preview-05-06 Add Gemini-2.5-pro-preview-05-06 model and update litellm dependency	2025-05-16 16:53:50 +03:00
mrT23	9974015682	Add Gemini-2.5-pro-preview-05-06 model and update litellm dependency	2025-05-16 16:32:45 +03:00
Dennis Paul	250870a3da	enable usage of openai like apis	2025-05-15 16:05:05 +02:00
Sangmin Park	a3c9fbbf2c	revert try except	2025-05-15 19:40:40 +09:00
Sangmin Park	c79b655864	Fix typo in method parameter name	2025-05-15 18:42:08 +09:00
Sangmin Park	e55fd64bda	Remove unnecessary nested try-except block for cleaner code. Streamlined the import statement to remove an unused reference to `get_git_provider`.	2025-05-15 18:41:39 +09:00
Mike Davies	d606672801	Add ignore_repositories config for PR filtering What Changed? * Added support to ignore PRs/MRs from specific repositories in GitHub, Bitbucket, and GitLab webhook logic * Updated configuration.toml to include ignore_repositories option * Added unit tests for ignore_repositories across all supported providers	2025-04-30 14:09:40 -07:00