diff --git a/README.md b/README.md index 4b151f2b..1586d48e 100644 --- a/README.md +++ b/README.md @@ -41,6 +41,10 @@ Qode Merge PR-Agent aims to help efficiently review and handle pull requests, by ## News and Updates +### December 30, 2024 + +Following [feedback](https://research.kudelskisecurity.com/2024/08/29/careful-where-you-code-multiple-vulnerabilities-in-ai-powered-pr-agent/) from the community, we have addressed two vulnerabilities identified in the open-source PR-Agent project. The fixes are now included in the newly released version (v0.26), available as of today. + ### December 25, 2024 The `review` tool previously included a legacy feature for providing code suggestions (controlled by '--pr_reviewer.num_code_suggestion'). This functionality has been deprecated. Use instead the [`improve`](https://qodo-merge-docs.qodo.ai/tools/improve/) tool, which offers higher quality and more actionable code suggestions. diff --git a/docs/docs/faq/index.md b/docs/docs/faq/index.md index dcd5df6b..66f89c48 100644 --- a/docs/docs/faq/index.md +++ b/docs/docs/faq/index.md @@ -1,6 +1,6 @@ # FAQ -??? note "Question: Can Qodo Merge serve as a substitute for a human reviewer?" +??? note "Q: Can Qodo Merge serve as a substitute for a human reviewer?" #### Answer:1 Qodo Merge is designed to assist, not replace, human reviewers. @@ -12,7 +12,7 @@ 1. Preserves user's original PR header 2. Places user's description above the AI-generated PR description - 3. Cannot approve PRs; approval remains reviewer's responsibility + 3. Won't approve PRs; approval remains reviewer's responsibility 4. The code suggestions are optional, and aim to: - Encourage self-review and self-reflection - Highlight potential bugs or oversights @@ -22,15 +22,15 @@ ___ -??? note "Question: I received an incorrect or irrelevant suggestion. Why?" +??? note "Q: I received an incorrect or irrelevant suggestion. Why?" #### Answer:2 - Modern AI models, like Claude 3.5 Sonnet and GPT-4, are improving rapidly but remain imperfect. Users should critically evaluate all suggestions rather than accepting them automatically. - - AI errors are rare, but possible. A main value from reviewing the code suggestions lies in their high probability of catching **mistakes or bugs made by the PR author**. We believe it's worth spending 30-60 seconds reviewing suggestions, even if some aren't relevant, as this practice can enhances code quality and prevent bugs in production. + - AI errors are rare, but possible. A main value from reviewing the code suggestions lies in their high probability of catching **mistakes or bugs made by the PR author**. We believe it's worth spending 30-60 seconds reviewing suggestions, even if some aren't relevant, as this practice can enhance code quality and prevent bugs in production. - - The hierarchical structure of the suggestions is designed to help the user to _quickly_ understand them, and to decide which ones are relevant and which are not: + - The hierarchical structure of the suggestions is designed to help the user _quickly_ understand them, and to decide which ones are relevant and which are not: - Only if the `Category` header is relevant, the user should move to the summarized suggestion description. - Only if the summarized suggestion description is relevant, the user should click on the collapsible, to read the full suggestion description with a code preview example. @@ -40,14 +40,14 @@ ___ ___ -??? note "Question: How can I get more tailored suggestions?" +??? 
note "Q: How can I get more tailored suggestions?" #### Answer:3 See [here](https://qodo-merge-docs.qodo.ai/tools/improve/#extra-instructions-and-best-practices) for more information on how to use the `extra_instructions` and `best_practices` configuration options, to guide the model to more tailored suggestions. ___ -??? note "Question: Will you store my code ? Are you using my code to train models?" +??? note "Q: Will you store my code? Are you using my code to train models?" #### Answer:4 No. Qodo Merge strict privacy policy ensures that your code is not stored or used for training purposes. @@ -56,12 +56,35 @@ ___ ___ -??? note "Question: Can I use my own LLM keys with Qodo Merge?" +??? note "Q: Can I use my own LLM keys with Qodo Merge?" #### Answer:5 - When you self-host, you use your own keys. + When you self-host the [open-source](https://github.com/Codium-ai/pr-agent) version, you use your own keys. Qodo Merge Pro with SaaS deployment is a hosted version of Qodo Merge, where Qodo manages the infrastructure and the keys. For enterprise customers, on-prem deployment is also available. [Contact us](https://www.codium.ai/contact/#pricing) for more information. +___ + +??? note "Q: Can Qodo Merge review draft/offline PRs?" + #### Answer:5 + + Yes. While Qodo Merge won't automatically review draft PRs, you can still get feedback by manually requesting it through [online commenting](https://qodo-merge-docs.qodo.ai/usage-guide/automations_and_usage/#online-usage). + + For active PRs, you can customize the automatic feedback settings [here](https://qodo-merge-docs.qodo.ai/usage-guide/automations_and_usage/#qodo-merge-automatic-feedback) to match your team's workflow. +___ + +??? note "Q: Can the 'Review effort' feedback be calibrated or customized?" + #### Answer:5 + + Yes, you can customize review effort estimates using the `extra_instructions` configuration option (see [documentation](https://qodo-merge-docs.qodo.ai/tools/review/#configuration-options)). + + Example mapping: + + - Effort 1: < 30 minutes review time + - Effort 2: 30-60 minutes review time + - Effort 3: 60-90 minutes review time + - ... + + Note: The effort levels (1-5) are primarily meant for _comparative_ purposes, helping teams prioritize reviewing smaller PRs first. The actual review duration may vary, as the focus is on providing consistent relative effort estimates. ___ diff --git a/pr_agent/agent/pr_agent.py b/pr_agent/agent/pr_agent.py index 2ea72077..8dea48a5 100644 --- a/pr_agent/agent/pr_agent.py +++ b/pr_agent/agent/pr_agent.py @@ -46,7 +46,6 @@ commands = list(command2class.keys()) class PRAgent: def __init__(self, ai_handler: partial[BaseAiHandler,] = LiteLLMAIHandler): self.ai_handler = ai_handler # will be initialized in run_action - self.forbidden_cli_args = ['enable_auto_approval'] async def handle_request(self, pr_url, request, notify=None) -> bool: # First, apply repo specific settings if exists @@ -61,14 +60,23 @@ class PRAgent: else: action, *args = request + forbidden_cli_args = ['enable_auto_approval', 'base_url', 'url', 'app_name', 'secret_provider', + 'git_provider', 'skip_keys', 'key', 'ANALYTICS_FOLDER', 'uri', 'app_id', 'webhook_secret', + 'bearer_token', 'PERSONAL_ACCESS_TOKEN', 'override_deployment_type', 'private_key', 'api_base', 'api_type', 'api_version'] if args: - for forbidden_arg in self.forbidden_cli_args: - for arg in args: - if forbidden_arg in arg: - get_logger().error( - f"CLI argument for param '{forbidden_arg}' is forbidden. Use instead a configuration file." 
- ) - return False + for arg in args: + if arg.startswith('--'): + arg_word = arg.lower() + arg_word = arg_word.replace('__', '.') # replace double underscore with dot, e.g. --openai__key -> --openai.key + for forbidden_arg in forbidden_cli_args: + forbidden_arg_word = forbidden_arg.lower() + if '.' not in forbidden_arg_word: + forbidden_arg_word = '.' + forbidden_arg_word + if forbidden_arg_word in arg_word: + get_logger().error( + f"CLI argument for param '{forbidden_arg}' is forbidden. Use instead a configuration file." + ) + return False args = update_settings_from_args(args) action = action.lstrip("/").lower() diff --git a/pr_agent/algo/__init__.py b/pr_agent/algo/__init__.py index cfa1bbe3..72eb7912 100644 --- a/pr_agent/algo/__init__.py +++ b/pr_agent/algo/__init__.py @@ -66,10 +66,10 @@ MAX_TOKENS = { 'claude-3-5-sonnet': 100000, 'groq/llama3-8b-8192': 8192, 'groq/llama3-70b-8192': 8192, + 'groq/llama-3.1-8b-instant': 8192, + 'groq/llama-3.3-70b-versatile': 128000, 'groq/mixtral-8x7b-32768': 32768, - 'groq/llama-3.1-8b-instant': 131072, - 'groq/llama-3.1-70b-versatile': 131072, - 'groq/llama-3.1-405b-reasoning': 131072, + 'groq/gemma2-9b-it': 8192, 'ollama/llama3': 4096, 'watsonx/meta-llama/llama-3-8b-instruct': 4096, "watsonx/meta-llama/llama-3-70b-instruct": 4096, diff --git a/pr_agent/algo/git_patch_processing.py b/pr_agent/algo/git_patch_processing.py index 9b319746..8c7943b3 100644 --- a/pr_agent/algo/git_patch_processing.py +++ b/pr_agent/algo/git_patch_processing.py @@ -364,48 +364,51 @@ __old hunk__ def extract_hunk_lines_from_patch(patch: str, file_name, line_start, line_end, side) -> tuple[str, str]: + try: + patch_with_lines_str = f"\n\n## File: '{file_name.strip()}'\n\n" + selected_lines = "" + patch_lines = patch.splitlines() + RE_HUNK_HEADER = re.compile( + r"^@@ -(\d+)(?:,(\d+))? \+(\d+)(?:,(\d+))? @@[ ]?(.*)") + match = None + start1, size1, start2, size2 = -1, -1, -1, -1 + skip_hunk = False + selected_lines_num = 0 + for line in patch_lines: + if 'no newline at end of file' in line.lower(): + continue - patch_with_lines_str = f"\n\n## File: '{file_name.strip()}'\n\n" - selected_lines = "" - patch_lines = patch.splitlines() - RE_HUNK_HEADER = re.compile( - r"^@@ -(\d+)(?:,(\d+))? \+(\d+)(?:,(\d+))? 
@@[ ]?(.*)") - match = None - start1, size1, start2, size2 = -1, -1, -1, -1 - skip_hunk = False - selected_lines_num = 0 - for line in patch_lines: - if 'no newline at end of file' in line.lower(): - continue + if line.startswith('@@'): + skip_hunk = False + selected_lines_num = 0 + header_line = line - if line.startswith('@@'): - skip_hunk = False - selected_lines_num = 0 - header_line = line + match = RE_HUNK_HEADER.match(line) - match = RE_HUNK_HEADER.match(line) + section_header, size1, size2, start1, start2 = extract_hunk_headers(match) - section_header, size1, size2, start1, start2 = extract_hunk_headers(match) - - # check if line range is in this hunk - if side.lower() == 'left': # check if line range is in this hunk - if not (start1 <= line_start <= start1 + size1): - skip_hunk = True - continue - elif side.lower() == 'right': - if not (start2 <= line_start <= start2 + size2): - skip_hunk = True - continue - patch_with_lines_str += f'\n{header_line}\n' + if side.lower() == 'left': + # check if line range is in this hunk + if not (start1 <= line_start <= start1 + size1): + skip_hunk = True + continue + elif side.lower() == 'right': + if not (start2 <= line_start <= start2 + size2): + skip_hunk = True + continue + patch_with_lines_str += f'\n{header_line}\n' - elif not skip_hunk: - if side.lower() == 'right' and line_start <= start2 + selected_lines_num <= line_end: - selected_lines += line + '\n' - if side.lower() == 'left' and start1 <= selected_lines_num + start1 <= line_end: - selected_lines += line + '\n' - patch_with_lines_str += line + '\n' - if not line.startswith('-'): # currently we don't support /ask line for deleted lines - selected_lines_num += 1 + elif not skip_hunk: + if side.lower() == 'right' and line_start <= start2 + selected_lines_num <= line_end: + selected_lines += line + '\n' + if side.lower() == 'left' and start1 <= selected_lines_num + start1 <= line_end: + selected_lines += line + '\n' + patch_with_lines_str += line + '\n' + if not line.startswith('-'): # currently we don't support /ask line for deleted lines + selected_lines_num += 1 + except Exception as e: + get_logger().error(f"Failed to extract hunk lines from patch: {e}", artifact={"traceback": traceback.format_exc()}) + return "", "" return patch_with_lines_str.rstrip(), selected_lines.rstrip() diff --git a/pr_agent/algo/pr_processing.py b/pr_agent/algo/pr_processing.py index d45a4fab..6d5a86d9 100644 --- a/pr_agent/algo/pr_processing.py +++ b/pr_agent/algo/pr_processing.py @@ -205,10 +205,11 @@ def pr_generate_extended_diff(pr_languages: list, if not extended_patch: get_logger().warning(f"Failed to extend patch for file: {file.filename}") continue - full_extended_patch = f"\n\n## {file.filename}\n{extended_patch.rstrip()}\n" if add_line_numbers_to_hunks: full_extended_patch = convert_to_hunks_with_lines_numbers(extended_patch, file) + else: + full_extended_patch = f"\n\n## File: '{file.filename.strip()}'\n{extended_patch.rstrip()}\n" # add AI-summary metadata to the patch if file.ai_file_summary and get_settings().get("config.enable_ai_metadata", False): @@ -316,13 +317,13 @@ def generate_full_patch(convert_hunks_to_line_numbers, file_dict, max_tokens_mod # TODO: Option for alternative logic to remove hunks from the patch to reduce the number of tokens # until we meet the requirements if get_settings().config.verbosity_level >= 2: - get_logger().warning(f"Patch too large, skipping it, {filename}") + get_logger().warning(f"Patch too large, skipping it: '{filename}'") 
remaining_files_list_new.append(filename) continue if patch: if not convert_hunks_to_line_numbers: - patch_final = f"\n\n## File: '{filename.strip()}\n\n{patch.strip()}\n'" + patch_final = f"\n\n## File: '{filename.strip()}'\n\n{patch.strip()}\n" else: patch_final = "\n\n" + patch.strip() patches.append(patch_final) diff --git a/pr_agent/algo/utils.py b/pr_agent/algo/utils.py index af829e14..9b49c060 100644 --- a/pr_agent/algo/utils.py +++ b/pr_agent/algo/utils.py @@ -23,6 +23,7 @@ from pydantic import BaseModel from starlette_context import context from pr_agent.algo import MAX_TOKENS +from pr_agent.algo.git_patch_processing import extract_hunk_lines_from_patch from pr_agent.algo.token_handler import TokenEncoder from pr_agent.algo.types import FilePatchInfo from pr_agent.config_loader import get_settings, global_settings @@ -272,7 +273,11 @@ def convert_to_markdown_v2(output_data: dict, return markdown_text -def extract_relevant_lines_str(end_line, files, relevant_file, start_line, dedent=False): + +def extract_relevant_lines_str(end_line, files, relevant_file, start_line, dedent=False) -> str: + """ + Finds 'relevant_file' in 'files', and extracts the lines from 'start_line' to 'end_line' string from the file content. + """ try: relevant_lines_str = "" if files: @@ -280,10 +285,23 @@ def extract_relevant_lines_str(end_line, files, relevant_file, start_line, deden for file in files: if file.filename.strip() == relevant_file: if not file.head_file: - get_logger().warning(f"No content found in file: {file.filename}") - return "" - relevant_file_lines = file.head_file.splitlines() - relevant_lines_str = "\n".join(relevant_file_lines[start_line - 1:end_line]) + # as a fallback, extract relevant lines directly from patch + patch = file.patch + get_logger().info(f"No content found in file: '{file.filename}' for 'extract_relevant_lines_str'. Using patch instead") + _, selected_lines = extract_hunk_lines_from_patch(patch, file.filename, start_line, end_line,side='right') + if not selected_lines: + get_logger().error(f"Failed to extract relevant lines from patch: {file.filename}") + return "" + # filter out '-' lines + relevant_lines_str = "" + for line in selected_lines.splitlines(): + if line.startswith('-'): + continue + relevant_lines_str += line[1:] + '\n' + else: + relevant_file_lines = file.head_file.splitlines() + relevant_lines_str = "\n".join(relevant_file_lines[start_line - 1:end_line]) + if dedent and relevant_lines_str: # Remove the longest leading string of spaces and tabs common to all lines. relevant_lines_str = textwrap.dedent(relevant_lines_str) @@ -565,27 +583,20 @@ def load_large_diff(filename, new_file_content_str: str, original_file_content_s """ Generate a patch for a modified file by comparing the original content of the file with the new content provided as input. - - Args: - new_file_content_str: The new content of the file as a string. - original_file_content_str: The original content of the file as a string. - - Returns: - The generated or provided patch string. - - Raises: - None. """ - patch = "" + if not original_file_content_str and not new_file_content_str: + return "" + try: diff = difflib.unified_diff(original_file_content_str.splitlines(keepends=True), new_file_content_str.splitlines(keepends=True)) if get_settings().config.verbosity_level >= 2 and show_warning: - get_logger().warning(f"File was modified, but no patch was found. Manually creating patch: {filename}.") + get_logger().info(f"File was modified, but no patch was found. 
Manually creating patch: {filename}.") patch = ''.join(diff) - except Exception: - pass - return patch + return patch + except Exception as e: + get_logger().exception(f"Failed to generate patch for file: {filename}") + return "" def update_settings_from_args(args: List[str]) -> List[str]: diff --git a/pr_agent/git_providers/bitbucket_server_provider.py b/pr_agent/git_providers/bitbucket_server_provider.py index 4dfa8226..cbbb4a21 100644 --- a/pr_agent/git_providers/bitbucket_server_provider.py +++ b/pr_agent/git_providers/bitbucket_server_provider.py @@ -402,10 +402,21 @@ class BitbucketServerProvider(GitProvider): try: projects_index = path_parts.index("projects") - except ValueError as e: + except ValueError: + projects_index = -1 + + try: + users_index = path_parts.index("users") + except ValueError: + users_index = -1 + + if projects_index == -1 and users_index == -1: raise ValueError(f"The provided URL '{pr_url}' does not appear to be a Bitbucket PR URL") - path_parts = path_parts[projects_index:] + if projects_index != -1: + path_parts = path_parts[projects_index:] + else: + path_parts = path_parts[users_index:] if len(path_parts) < 6 or path_parts[2] != "repos" or path_parts[4] != "pull-requests": raise ValueError( @@ -413,6 +424,8 @@ class BitbucketServerProvider(GitProvider): ) workspace_slug = path_parts[1] + if users_index != -1: + workspace_slug = f"~{workspace_slug}" repo_slug = path_parts[3] try: pr_number = int(path_parts[5]) diff --git a/pr_agent/git_providers/github_provider.py b/pr_agent/git_providers/github_provider.py index 9ee3e2bf..a91b1476 100644 --- a/pr_agent/git_providers/github_provider.py +++ b/pr_agent/git_providers/github_provider.py @@ -174,6 +174,24 @@ class GithubProvider(GitProvider): diff_files = [] invalid_files_names = [] + is_close_to_rate_limit = False + + # The base.sha will point to the current state of the base branch (including parallel merges), not the original base commit when the PR was created + # We can fix this by finding the merge base commit between the PR head and base branches + # Note that The pr.head.sha is actually correct as is - it points to the latest commit in your PR branch. + # This SHA isn't affected by parallel merges to the base branch since it's specific to your PR's branch. + repo = self.repo_obj + pr = self.pr + try: + compare = repo.compare(pr.base.sha, pr.head.sha) # communication with GitHub + merge_base_commit = compare.merge_base_commit + except Exception as e: + get_logger().error(f"Failed to get merge base commit: {e}") + merge_base_commit = pr.base + if merge_base_commit.sha != pr.base.sha: + get_logger().info( + f"Using merge base commit {merge_base_commit.sha} instead of base commit ") + counter_valid = 0 for file in files: if not is_valid_file(file.filename): @@ -181,48 +199,36 @@ class GithubProvider(GitProvider): continue patch = file.patch - - # allow only a limited number of files to be fully loaded. We can manage the rest with diffs only - counter_valid += 1 - avoid_load = False - if counter_valid >= MAX_FILES_ALLOWED_FULL and patch and not self.incremental.is_incremental: - avoid_load = True - if counter_valid == MAX_FILES_ALLOWED_FULL: - get_logger().info(f"Too many files in PR, will avoid loading full content for rest of files") - - if avoid_load: + if is_close_to_rate_limit: new_file_content_str = "" + original_file_content_str = "" else: - new_file_content_str = self._get_pr_file_content(file, self.pr.head.sha) # communication with GitHub + # allow only a limited number of files to be fully loaded. 
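The comment block added to `github_provider.py` above explains why file contents should be compared against the merge-base commit rather than the current `base.sha`, which keeps moving as other PRs are merged into the base branch. A minimal sketch of that lookup, assuming PyGithub `Repository` and `PullRequest` objects; the function name is ours, not part of the codebase:

```python
def find_diff_base_sha(repo, pr) -> str:
    """Pick the commit the PR file contents should be compared against (sketch)."""
    try:
        compare = repo.compare(pr.base.sha, pr.head.sha)  # one extra GitHub API call
        return compare.merge_base_commit.sha              # the PR branch's original fork point
    except Exception:
        return pr.base.sha                                 # fall back to the moving base-branch tip
```

Hoisting this lookup out of the per-file loop, as the hunk above does, also avoids repeating the API call for every file in the PR.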
We can manage the rest with diffs only + counter_valid += 1 + avoid_load = False + if counter_valid >= MAX_FILES_ALLOWED_FULL and patch and not self.incremental.is_incremental: + avoid_load = True + if counter_valid == MAX_FILES_ALLOWED_FULL: + get_logger().info(f"Too many files in PR, will avoid loading full content for rest of files") - if self.incremental.is_incremental and self.unreviewed_files_set: - original_file_content_str = self._get_pr_file_content(file, self.incremental.last_seen_commit_sha) - patch = load_large_diff(file.filename, new_file_content_str, original_file_content_str) - self.unreviewed_files_set[file.filename] = patch - else: if avoid_load: - original_file_content_str = "" + new_file_content_str = "" else: - # The base.sha will point to the current state of the base branch (including parallel merges), not the original base commit when the PR was created - # We can fix this by finding the merge base commit between the PR head and base branches - # Note that The pr.head.sha is actually correct as is - it points to the latest commit in your PR branch. - # This SHA isn't affected by parallel merges to the base branch since it's specific to your PR's branch. - repo = self.repo_obj - pr = self.pr - try: - compare = repo.compare(pr.base.sha, pr.head.sha) - merge_base_commit = compare.merge_base_commit - except Exception as e: - get_logger().error(f"Failed to get merge base commit: {e}") - merge_base_commit = pr.base - if merge_base_commit.sha != pr.base.sha: - get_logger().info( - f"Using merge base commit {merge_base_commit.sha} instead of base commit " - f"{pr.base.sha} for {file.filename}") - original_file_content_str = self._get_pr_file_content(file, merge_base_commit.sha) + new_file_content_str = self._get_pr_file_content(file, self.pr.head.sha) # communication with GitHub - if not patch: + if self.incremental.is_incremental and self.unreviewed_files_set: + original_file_content_str = self._get_pr_file_content(file, self.incremental.last_seen_commit_sha) patch = load_large_diff(file.filename, new_file_content_str, original_file_content_str) + self.unreviewed_files_set[file.filename] = patch + else: + if avoid_load: + original_file_content_str = "" + else: + original_file_content_str = self._get_pr_file_content(file, merge_base_commit.sha) + # original_file_content_str = self._get_pr_file_content(file, self.pr.base.sha) + if not patch: + patch = load_large_diff(file.filename, new_file_content_str, original_file_content_str) + if file.status == 'added': edit_type = EDIT_TYPE.ADDED @@ -237,9 +243,14 @@ class GithubProvider(GitProvider): edit_type = EDIT_TYPE.UNKNOWN # count number of lines added and removed - patch_lines = patch.splitlines(keepends=True) - num_plus_lines = len([line for line in patch_lines if line.startswith('+')]) - num_minus_lines = len([line for line in patch_lines if line.startswith('-')]) + if hasattr(file, 'additions') and hasattr(file, 'deletions'): + num_plus_lines = file.additions + num_minus_lines = file.deletions + else: + patch_lines = patch.splitlines(keepends=True) + num_plus_lines = len([line for line in patch_lines if line.startswith('+')]) + num_minus_lines = len([line for line in patch_lines if line.startswith('-')]) + file_patch_canonical_structure = FilePatchInfo(original_file_content_str, new_file_content_str, patch, file.filename, edit_type=edit_type, num_plus_lines=num_plus_lines, diff --git a/pr_agent/settings/configuration.toml b/pr_agent/settings/configuration.toml index ec902799..3bc91099 100644 --- 
a/pr_agent/settings/configuration.toml +++ b/pr_agent/settings/configuration.toml @@ -88,6 +88,7 @@ publish_description_as_comment_persistent=true ## changes walkthrough section enable_semantic_files_types=true collapsible_file_list='adaptive' # true, false, 'adaptive' +collapsible_file_list_threshold=8 inline_file_summary=false # false, true, 'table' # markers use_description_markers=false @@ -96,7 +97,6 @@ include_generated_by_header=true enable_large_pr_handling=true max_ai_calls=4 async_ai_calls=true -mention_extra_files=true #custom_labels = ['Bug fix', 'Tests', 'Bug fix with tests', 'Enhancement', 'Documentation', 'Other'] [pr_questions] # /ask # diff --git a/pr_agent/settings/pr_description_prompts.toml b/pr_agent/settings/pr_description_prompts.toml index 364dd9af..0a15eee3 100644 --- a/pr_agent/settings/pr_description_prompts.toml +++ b/pr_agent/settings/pr_description_prompts.toml @@ -1,15 +1,11 @@ [pr_description_prompt] system="""You are PR-Reviewer, a language model designed to review a Git Pull Request (PR). -{%- if enable_custom_labels %} -Your task is to provide a full description for the PR content - files walkthrough, title, type, description and labels. -{%- else %} -Your task is to provide a full description for the PR content - files walkthrough, title, type, and description. -{%- endif %} -- Focus on the new PR code (lines starting with '+'). +Your task is to provide a full description for the PR content - type, description, title and files walkthrough. +- Focus on the new PR code (lines starting with '+' in the 'PR Git Diff' section). - Keep in mind that the 'Previous title', 'Previous description' and 'Commit messages' sections may be partial, simplistic, non-informative or out of date. Hence, compare them to the PR diff code, and use them only as a reference. - The generated title and description should prioritize the most significant changes. -- If needed, each YAML output should be in block scalar indicator ('|-') -- When quoting variables or names from the code, use backticks (`) instead of single quote ('). +- If needed, each YAML output should be in block scalar indicator ('|') +- When quoting variables, names or file paths from the code, use backticks (`) instead of single quote ('). {%- if extra_instructions %} @@ -38,22 +34,20 @@ class PRType(str, Enum): {%- if enable_semantic_files_types %} class FileDescription(BaseModel): - filename: str = Field(description="The full file path of the relevant file.") - language: str = Field(description="The programming language of the relevant file.") + filename: str = Field(description="The full file path of the relevant file") +{%- if include_file_summary_changes %} changes_summary: str = Field(description="concise summary of the changes in the relevant file, in bullet points (1-4 bullet points).") - changes_title: str = Field(description="an informative title for the changes in the files, describing its main theme (5-10 words).") +{%- endif %} + changes_title: str = Field(description="one-line summary (5-10 words) capturing the main theme of changes in the file") label: str = Field(description="a single semantic label that represents a type of code changes that occurred in the File. Possible values (partial list): 'bug fix', 'tests', 'enhancement', 'documentation', 'error handling', 'configuration changes', 'dependencies', 'formatting', 'miscellaneous', ...") {%- endif %} class PRDescription(BaseModel): type: List[PRType] = Field(description="one or more types that describe the PR content. 
Return the label member value (e.g. 'Bug fix', not 'bug_fix')") + description: str = Field(description="summarize the PR changes in up to four bullet points, each up to 8 words. For large PRs, add sub-bullets if needed. Order bullets by importance, with each bullet highlighting a key change group.") + title: str = Field(description="a concise and descriptive title that captures the PR's main theme") {%- if enable_semantic_files_types %} - pr_files: List[FileDescription] = Field(max_items=15, description="a list of the files in the PR, and summary of their changes") -{%- endif %} - description: str = Field(description="an informative and concise description of the PR. Use bullet points. Display first the most significant changes.") - title: str = Field(description="an informative title for the PR, describing its main theme") -{%- if enable_custom_labels %} - labels: List[Label] = Field(min_items=0, description="choose the relevant custom labels that describe the PR content, and return their keys. Use the value field of the Label object to better understand the label meaning.") + pr_files: List[FileDescription] = Field(max_items=20, description="a list of all the files that were changed in the PR, and summary of their changes. Each file must be analyzed regardless of change size.") {%- endif %} ===== @@ -64,30 +58,23 @@ Example output: type: - ... - ... -{%- if enable_semantic_files_types %} -pr_files: -- filename: | - ... - language: | - ... - changes_summary: | - ... - changes_title: | - ... - label: | - ... -... -{%- endif %} description: | ... title: | ... -{%- if enable_custom_labels %} -labels: -- | - ... -- | - ... +{%- if enable_semantic_files_types %} +pr_files: +- filename: | + ... +{%- if include_file_summary_changes %} + changes_summary: | + ... +{%- endif %} + changes_title: | + ... + label: | + label_key_1 +... 
{%- endif %} ``` @@ -136,7 +123,7 @@ Commit messages: {%- endif %} -The PR Diff: +The PR Git Diff: ===== {{ diff|trim }} ===== diff --git a/pr_agent/tools/pr_description.py b/pr_agent/tools/pr_description.py index 357e9470..18df4f10 100644 --- a/pr_agent/tools/pr_description.py +++ b/pr_agent/tools/pr_description.py @@ -1,6 +1,7 @@ import asyncio import copy import re +import traceback from functools import partial from typing import List, Tuple @@ -57,6 +58,7 @@ class PRDescription: self.ai_handler.main_pr_language = self.main_pr_language # Initialize the variables dictionary + self.COLLAPSIBLE_FILE_LIST_THRESHOLD = get_settings().pr_description.get("collapsible_file_list_threshold", 8) self.vars = { "title": self.git_provider.pr.title, "branch": self.git_provider.get_pr_branch(), @@ -69,6 +71,7 @@ class PRDescription: "custom_labels_class": "", # will be filled if necessary in 'set_custom_labels' function "enable_semantic_files_types": get_settings().pr_description.enable_semantic_files_types, "related_tickets": "", + "include_file_summary_changes": len(self.git_provider.get_diff_files()) <= self.COLLAPSIBLE_FILE_LIST_THRESHOLD } self.user_description = self.git_provider.get_user_description() @@ -85,7 +88,6 @@ class PRDescription: self.patches_diff = None self.prediction = None self.file_label_dict = None - self.COLLAPSIBLE_FILE_LIST_THRESHOLD = 8 async def run(self): try: @@ -114,6 +116,8 @@ class PRDescription: pr_labels, pr_file_changes = [], [] if get_settings().pr_description.publish_labels: pr_labels = self._prepare_labels() + else: + get_logger().debug(f"Publishing labels disabled") if get_settings().pr_description.use_description_markers: pr_title, pr_body, changes_walkthrough, pr_file_changes = self._prepare_pr_answer_with_markers() @@ -137,6 +141,7 @@ class PRDescription: pr_body += show_relevant_configurations(relevant_section='pr_description') if get_settings().config.publish_output: + # publish labels if get_settings().pr_description.publish_labels and pr_labels and self.git_provider.is_supported("get_labels"): original_labels = self.git_provider.get_pr_labels(update=True) @@ -164,7 +169,7 @@ class PRDescription: self.git_provider.publish_description(pr_title, pr_body) # publish final update message - if (get_settings().pr_description.final_update_message): + if (get_settings().pr_description.final_update_message and not get_settings().config.get('is_auto_command', False)): latest_commit_url = self.git_provider.get_latest_commit_url() if latest_commit_url: pr_url = self.git_provider.get_pr_url() @@ -176,35 +181,37 @@ class PRDescription: get_settings().data = {"artifact": pr_body} return except Exception as e: - get_logger().error(f"Error generating PR description {self.pr_id}: {e}") + get_logger().error(f"Error generating PR description {self.pr_id}: {e}", + artifact={"traceback": traceback.format_exc()}) return "" async def _prepare_prediction(self, model: str) -> None: if get_settings().pr_description.use_description_markers and 'pr_agent:' not in self.user_description: - get_logger().info( - "Markers were enabled, but user description does not contain markers. skipping AI prediction") + get_logger().info("Markers were enabled, but user description does not contain markers. 
skipping AI prediction") return None large_pr_handling = get_settings().pr_description.enable_large_pr_handling and "pr_description_only_files_prompts" in get_settings() - output = get_pr_diff(self.git_provider, self.token_handler, model, large_pr_handling=large_pr_handling, - return_remaining_files=True) + output = get_pr_diff(self.git_provider, self.token_handler, model, large_pr_handling=large_pr_handling, return_remaining_files=True) if isinstance(output, tuple): patches_diff, remaining_files_list = output else: patches_diff = output remaining_files_list = [] + if not large_pr_handling or patches_diff: self.patches_diff = patches_diff if patches_diff: + # generate the prediction get_logger().debug(f"PR diff", artifact=self.patches_diff) self.prediction = await self._get_prediction(model, patches_diff, prompt="pr_description_prompt") - if (remaining_files_list and 'pr_files' in self.prediction and 'label:' in self.prediction and - get_settings().pr_description.mention_extra_files): - get_logger().debug(f"Extending additional files, {len(remaining_files_list)} files") - self.prediction = await self.extend_additional_files(remaining_files_list) + + # extend the prediction with additional files not shown + if get_settings().pr_description.enable_semantic_files_types: + self.prediction = await self.extend_uncovered_files(self.prediction) else: - get_logger().error(f"Error getting PR diff {self.pr_id}") + get_logger().error(f"Error getting PR diff {self.pr_id}", + artifact={"traceback": traceback.format_exc()}) self.prediction = None else: # get the diff in multiple patches, with the token handler only for the files prompt @@ -289,43 +296,81 @@ class PRDescription: prompt="pr_description_only_description_prompts") prediction_headers = prediction_headers.strip().removeprefix('```yaml').strip('`').strip() - # manually add extra files to final prediction - MAX_EXTRA_FILES_TO_OUTPUT = 100 - if get_settings().pr_description.mention_extra_files: - for i, file in enumerate(remaining_files_list): - extra_file_yaml = f"""\ -- filename: | - {file} - changes_summary: | - ... - changes_title: | - ... - label: | - additional files (token-limit) -""" - files_walkthrough = files_walkthrough.strip() + "\n" + extra_file_yaml.strip() - if i >= MAX_EXTRA_FILES_TO_OUTPUT: - files_walkthrough += f"""\ -extra_file_yaml = -- filename: | - Additional {len(remaining_files_list) - MAX_EXTRA_FILES_TO_OUTPUT} files not shown - changes_summary: | - ... - changes_title: | - ... 
- label: | - additional files (token-limit) -""" - break + # extend the tables with the files not shown + files_walkthrough_extended = await self.extend_uncovered_files(files_walkthrough) # final processing - self.prediction = prediction_headers + "\n" + "pr_files:\n" + files_walkthrough + self.prediction = prediction_headers + "\n" + "pr_files:\n" + files_walkthrough_extended if not load_yaml(self.prediction, keys_fix_yaml=self.keys_fix): get_logger().error(f"Error getting valid YAML in large PR handling for describe {self.pr_id}") if load_yaml(prediction_headers, keys_fix_yaml=self.keys_fix): get_logger().debug(f"Using only headers for describe {self.pr_id}") self.prediction = prediction_headers + async def extend_uncovered_files(self, original_prediction: str) -> str: + try: + prediction = original_prediction + + # get the original prediction filenames + original_prediction_loaded = load_yaml(original_prediction, keys_fix_yaml=self.keys_fix) + if isinstance(original_prediction_loaded, list): + original_prediction_dict = {"pr_files": original_prediction_loaded} + else: + original_prediction_dict = original_prediction_loaded + filenames_predicted = [file['filename'].strip() for file in original_prediction_dict.get('pr_files', [])] + + # extend the prediction with additional files not included in the original prediction + pr_files = self.git_provider.get_diff_files() + prediction_extra = "pr_files:" + MAX_EXTRA_FILES_TO_OUTPUT = 100 + counter_extra_files = 0 + for file in pr_files: + if file.filename in filenames_predicted: + continue + + # add up to MAX_EXTRA_FILES_TO_OUTPUT files + counter_extra_files += 1 + if counter_extra_files > MAX_EXTRA_FILES_TO_OUTPUT: + extra_file_yaml = f"""\ +- filename: | + Additional files not shown + changes_title: | + ... + label: | + additional files +""" + prediction_extra = prediction_extra + "\n" + extra_file_yaml.strip() + get_logger().debug(f"Too many remaining files, clipping to {MAX_EXTRA_FILES_TO_OUTPUT}") + break + + extra_file_yaml = f"""\ +- filename: | + {file.filename} + changes_title: | + ... 
+ label: | + additional files +""" + prediction_extra = prediction_extra + "\n" + extra_file_yaml.strip() + + # merge the two dictionaries + if counter_extra_files > 0: + get_logger().info(f"Adding {counter_extra_files} unprocessed extra files to table prediction") + prediction_extra_dict = load_yaml(prediction_extra, keys_fix_yaml=self.keys_fix) + if isinstance(original_prediction_dict, dict) and isinstance(prediction_extra_dict, dict): + original_prediction_dict["pr_files"].extend(prediction_extra_dict["pr_files"]) + new_yaml = yaml.dump(original_prediction_dict) + if load_yaml(new_yaml, keys_fix_yaml=self.keys_fix): + prediction = new_yaml + if isinstance(original_prediction, list): + prediction = yaml.dump(original_prediction_dict["pr_files"]) + + return prediction + except Exception as e: + get_logger().error(f"Error extending uncovered files {self.pr_id}: {e}") + return original_prediction + + async def extend_additional_files(self, remaining_files_list) -> str: prediction = self.prediction try: @@ -397,31 +442,31 @@ extra_file_yaml = self.data['pr_files'] = self.data.pop('pr_files') def _prepare_labels(self) -> List[str]: - pr_types = [] + pr_labels = [] # If the 'PR Type' key is present in the dictionary, split its value by comma and assign it to 'pr_types' - if 'labels' in self.data: + if 'labels' in self.data and self.data['labels']: if type(self.data['labels']) == list: - pr_types = self.data['labels'] + pr_labels = self.data['labels'] elif type(self.data['labels']) == str: - pr_types = self.data['labels'].split(',') - elif 'type' in self.data: + pr_labels = self.data['labels'].split(',') + elif 'type' in self.data and self.data['type'] and get_settings().pr_description.publish_labels: if type(self.data['type']) == list: - pr_types = self.data['type'] + pr_labels = self.data['type'] elif type(self.data['type']) == str: - pr_types = self.data['type'].split(',') - pr_types = [label.strip() for label in pr_types] + pr_labels = self.data['type'].split(',') + pr_labels = [label.strip() for label in pr_labels] # convert lowercase labels to original case try: if "labels_minimal_to_labels_dict" in self.variables: d: dict = self.variables["labels_minimal_to_labels_dict"] - for i, label_i in enumerate(pr_types): + for i, label_i in enumerate(pr_labels): if label_i in d: - pr_types[i] = d[label_i] + pr_labels[i] = d[label_i] except Exception as e: get_logger().error(f"Error converting labels to original case {self.pr_id}: {e}") - return pr_types + return pr_labels def _prepare_pr_answer_with_markers(self) -> Tuple[str, str, str, List[dict]]: get_logger().info(f"Using description marker replacements {self.pr_id}") @@ -511,6 +556,11 @@ extra_file_yaml = elif 'pr_files' in key.lower() and get_settings().pr_description.enable_semantic_files_types: changes_walkthrough, pr_file_changes = self.process_pr_files_prediction(changes_walkthrough, value) changes_walkthrough = f"{PRDescriptionHeader.CHANGES_WALKTHROUGH.value}\n{changes_walkthrough}" + elif key.lower().strip() == 'description': + if isinstance(value, list): + value = ', '.join(v.rstrip() for v in value) + value = value.replace('\n-', '\n\n-').strip() # makes the bullet points more readable by adding double space + pr_body += f"{value}\n" else: # if the value is a list, join its items by comma if isinstance(value, list): @@ -528,14 +578,18 @@ extra_file_yaml = return file_label_dict for file in self.data['pr_files']: try: - required_fields = ['changes_summary', 'changes_title', 'filename', 'label'] + required_fields = ['changes_title', 
'filename', 'label'] if not all(field in file for field in required_fields): # can happen for example if a YAML generation was interrupted in the middle (no more tokens) get_logger().warning(f"Missing required fields in file label dict {self.pr_id}, skipping file", artifact={"file": file}) continue + if not file.get('changes_title'): + get_logger().warning(f"Empty changes title or summary in file label dict {self.pr_id}, skipping file", + artifact={"file": file}) + continue filename = file['filename'].replace("'", "`").replace('"', '`') - changes_summary = file['changes_summary'] + changes_summary = file.get('changes_summary', "").strip() changes_title = file['changes_title'].strip() label = file.get('label').strip().lower() if label not in file_label_dict: @@ -578,12 +632,14 @@ extra_file_yaml = for filename, file_changes_title, file_change_description in list_tuples: filename = filename.replace("'", "`").rstrip() filename_publish = filename.split("/")[-1] - - file_changes_title_code = f"{file_changes_title}" - file_changes_title_code_br = insert_br_after_x_chars(file_changes_title_code, x=(delta - 5)).strip() - if len(file_changes_title_code_br) < (delta - 5): - file_changes_title_code_br += "  " * ((delta - 5) - len(file_changes_title_code_br)) - filename_publish = f"{filename_publish}
{file_changes_title_code_br}
" + if file_changes_title and file_changes_title.strip() != "...": + file_changes_title_code = f"{file_changes_title}" + file_changes_title_code_br = insert_br_after_x_chars(file_changes_title_code, x=(delta - 5)).strip() + if len(file_changes_title_code_br) < (delta - 5): + file_changes_title_code_br += "  " * ((delta - 5) - len(file_changes_title_code_br)) + filename_publish = f"{filename_publish}
{file_changes_title_code_br}
" + else: + filename_publish = f"{filename_publish}" diff_plus_minus = "" delta_nbsp = "" diff_files = self.git_provider.get_diff_files() @@ -592,6 +648,8 @@ extra_file_yaml = num_plus_lines = f.num_plus_lines num_minus_lines = f.num_minus_lines diff_plus_minus += f"+{num_plus_lines}/-{num_minus_lines}" + if len(diff_plus_minus) > 12 or diff_plus_minus == "+0/-0": + diff_plus_minus = "[link]" delta_nbsp = "  " * max(0, (8 - len(diff_plus_minus))) break @@ -600,9 +658,40 @@ extra_file_yaml = if hasattr(self.git_provider, 'get_line_link'): filename = filename.strip() link = self.git_provider.get_line_link(filename, relevant_line_start=-1) + if (not link or not diff_plus_minus) and ('additional files' not in filename.lower()): + get_logger().warning(f"Error getting line link for '{filename}'") + continue + # Add file data to the PR body file_change_description_br = insert_br_after_x_chars(file_change_description, x=(delta - 5)) - pr_body += f""" + pr_body = self.add_file_data(delta_nbsp, diff_plus_minus, file_change_description_br, filename, + filename_publish, link, pr_body) + + # Close the collapsible file list + if use_collapsible_file_list: + pr_body += """""" + else: + pr_body += """""" + pr_body += """""" + + except Exception as e: + get_logger().error(f"Error processing pr files to markdown {self.pr_id}: {str(e)}") + pass + return pr_body, pr_comments + + def add_file_data(self, delta_nbsp, diff_plus_minus, file_change_description_br, filename, filename_publish, link, + pr_body) -> str: + + if not file_change_description_br: + pr_body += f""" + + {filename_publish} + {diff_plus_minus}{delta_nbsp} + + +""" + else: + pr_body += f"""
@@ -622,17 +711,7 @@ extra_file_yaml = """ - if use_collapsible_file_list: - pr_body += """
""" - else: - pr_body += """""" - pr_body += """""" - - except Exception as e: - get_logger().error(f"Error processing pr files to markdown {self.pr_id}: {e}") - pass - return pr_body, pr_comments - + return pr_body def count_chars_without_html(string): if '<' not in string: @@ -641,11 +720,14 @@ def count_chars_without_html(string): return len(no_html_string) -def insert_br_after_x_chars(text, x=70): +def insert_br_after_x_chars(text: str, x=70): """ Insert
into a string after a word that increases its length above x characters. Use proper HTML tags for code and new lines. """ + + if not text: + return "" if count_chars_without_html(text) < x: return text diff --git a/pr_agent/tools/pr_line_questions.py b/pr_agent/tools/pr_line_questions.py index 4ef5b9a7..760c81ff 100644 --- a/pr_agent/tools/pr_line_questions.py +++ b/pr_agent/tools/pr_line_questions.py @@ -79,13 +79,17 @@ class PR_LineQuestions: line_end=line_end, side=side) if self.patch_with_lines: - response = await retry_with_fallback_models(self._get_prediction, model_type=ModelType.WEAK) + model_answer = await retry_with_fallback_models(self._get_prediction, model_type=ModelType.WEAK) + # sanitize the answer so that no line will start with "/" + model_answer_sanitized = model_answer.strip().replace("\n/", "\n /") + if model_answer_sanitized.startswith("/"): + model_answer_sanitized = " " + model_answer_sanitized get_logger().info('Preparing answer...') if comment_id: - self.git_provider.reply_to_comment_from_comment_id(comment_id, response) + self.git_provider.reply_to_comment_from_comment_id(comment_id, model_answer_sanitized) else: - self.git_provider.publish_comment(response) + self.git_provider.publish_comment(model_answer_sanitized) return "" diff --git a/pr_agent/tools/pr_questions.py b/pr_agent/tools/pr_questions.py index 1ab496dc..6f400b96 100644 --- a/pr_agent/tools/pr_questions.py +++ b/pr_agent/tools/pr_questions.py @@ -117,6 +117,16 @@ class PRQuestions: return response def _prepare_pr_answer(self) -> str: + model_answer = self.prediction.strip() + # sanitize the answer so that no line will start with "/" + model_answer_sanitized = model_answer.replace("\n/", "\n /") + if model_answer_sanitized.startswith("/"): + model_answer_sanitized = " " + model_answer_sanitized + if model_answer_sanitized != model_answer: + get_logger().debug(f"Sanitized model answer", + artifact={"model_answer": model_answer, "sanitized_answer": model_answer_sanitized}) + + answer_str = f"### **Ask**❓\n{self.question_str}\n\n" - answer_str += f"### **Answer:**\n{self.prediction.strip()}\n\n" + answer_str += f"### **Answer:**\n{model_answer_sanitized}\n\n" return answer_str diff --git a/tests/unittest/test_bitbucket_provider.py b/tests/unittest/test_bitbucket_provider.py index 5c672928..d883d55b 100644 --- a/tests/unittest/test_bitbucket_provider.py +++ b/tests/unittest/test_bitbucket_provider.py @@ -24,6 +24,13 @@ class TestBitbucketServerProvider: assert repo_slug == "my-repo" assert pr_number == 1 + def test_parse_pr_url_with_users(self): + url = "https://bitbucket.company-server.url/users/username/repos/my-repo/pull-requests/1" + workspace_slug, repo_slug, pr_number = BitbucketServerProvider._parse_pr_url(url) + assert workspace_slug == "~username" + assert repo_slug == "my-repo" + assert pr_number == 1 + def mock_get_content_of_file(self, project_key, repository_slug, filename, at=None, markup=None): content_map = { '9c1cffdd9f276074bfb6fb3b70fbee62d298b058': 'file\nwith\nsome\nlines\nto\nemulate\na\nreal\nfile\n', diff --git a/tests/unittest/test_extend_patch.py b/tests/unittest/test_extend_patch.py index 6689186c..dfe4c502 100644 --- a/tests/unittest/test_extend_patch.py +++ b/tests/unittest/test_extend_patch.py @@ -145,8 +145,8 @@ class TestExtendedPatchMoreLines: # Check that with no extra lines, the patches are the same as the original patches p0 = patches_extended_no_extra_lines[0].strip() p1 = patches_extended_no_extra_lines[1].strip() - assert p0 == '## file1\n' + 
pr_languages[0]['files'][0].patch.strip() - assert p1 == '## file2\n' + pr_languages[0]['files'][1].patch.strip() + assert p0 == "## File: 'file1'\n" + pr_languages[0]['files'][0].patch.strip() + assert p1 == "## File: 'file2'\n" + pr_languages[0]['files'][1].patch.strip() patches_extended_with_extra_lines, total_tokens, patches_extended_tokens = pr_generate_extended_diff( pr_languages, token_handler, add_line_numbers_to_hunks=False, @@ -154,5 +154,6 @@ class TestExtendedPatchMoreLines: patch_extra_lines_after=1 ) + p0_extended = patches_extended_with_extra_lines[0].strip() - assert p0_extended == '## file1\n\n@@ -3,8 +3,8 @@ \n line0\n line1\n-original content\n+modified content\n line2\n line3\n line4\n line5\n line6' + assert p0_extended == "## File: 'file1'\n\n@@ -3,8 +3,8 @@ \n line0\n line1\n-original content\n+modified content\n line2\n line3\n line4\n line5\n line6"
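The updated test expectations reflect the new quoted-file patch header (`## File: 'file1'`) used across the diff-building code. A tiny illustration of how a final per-file patch string is assembled under that format; the inputs are made up:

```python
filename = "file1"
patch = "@@ -3,8 +3,8 @@ \n line0\n line1\n-original content\n+modified content\n line2"
patch_final = f"\n\n## File: '{filename.strip()}'\n\n{patch.strip()}\n"
assert patch_final.startswith("\n\n## File: 'file1'\n\n@@ -3,8")
```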