[Draft] Small improvement on sanity checks per MI repository provided on the autogenerated JSON#1851
[Draft] Small improvement on sanity checks per MI repository provided on the autogenerated JSON#1851
Conversation
|
I think because of the way we iterate on all possible urls and only use the valid ones this might give all the false negatives. Basically we try to see if an MI has a specific client tools repo and if it doesn't we don't use it, which is a valid case since not all MIs have all the suffixes. I think the way this would check, it would log all these non-repos and show them as errors. But I will review this better later on |
|
Take this PR as an example :) |
| res: requests.Response = requests.get(url, timeout=6) | ||
| res.raise_for_status() | ||
| except requests.RequestException as e: | ||
| logging.error(f"Error requesting: {url}; {e}") |
There was a problem hiding this comment.
As pointed out by Kostas, this will log all 404s that we get.
Since they tend to be quite a lot, we may want to skip logging here.
There was a problem hiding this comment.
Why we get 404 on repos that we expect to be valid?
NamelessOne91
left a comment
There was a problem hiding this comment.
I tried to feed the MI with ID 42236, using 50-sles, as version it correctly recognized these as valid, because of a link to a .repo:
- http://download.suse.de/ibs/SUSE:/Maintenance:/42236/SUSE_Updates_openSUSE-SLE_15.6/
- http://download.suse.de/ibs/SUSE:/Maintenance:/42236/SUSE_Updates_SLE-Micro_5.5_x86_64/
Not a fan of handling this by parsing HTML with regEx, but it seems to work with RPM repos.
The 1st case, where the URL itself is pointing to a .repo file, is probably not going to happen in our setup.
There was a problem hiding this comment.
Pull request overview
This PR adds sanity checks to validate maintenance repository URLs by detecting whether they contain .repo files (for RPM repositories) or Release/InRelease files (for Debian repositories). The validation is performed by fetching the URL content and using regex patterns to search for repository indicator files.
Changes:
- Enhanced
create_urlfunction to validate repository URLs by checking for .repo, Release, or InRelease files in the response body - Changed return type from
strtostr | Noneto explicitly handle invalid repositories - Added comprehensive error logging when repositories fail validation
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| return url if res.ok else "" | ||
| try: | ||
| res: requests.Response = requests.get(url, timeout=6) | ||
| res.raise_for_status() |
There was a problem hiding this comment.
The code now calls 'res.raise_for_status()' on line 59, which is the correct requests library pattern. However, the test mock at jenkins_pipelines/scripts/tests/mock_response.py:18-19 currently always raises HTTPError regardless of the response status. This means all test cases will fail because even successful (200) responses will raise an exception. The MockResponse.raise_for_status() method should only raise HTTPError when status_code indicates an error (4xx, 5xx), not for successful responses.
| # If the URL itself points directly to a .repo file | ||
| if url.lower().endswith('.repo'): | ||
| return url.rsplit('/', 1)[0] + '/' | ||
|
|
There was a problem hiding this comment.
This logic appears problematic. If the URL itself ends with '.repo', it means the URL was constructed to point directly to a .repo file (which seems unlikely based on how create_url is called with MI IDs and suffixes). The code then returns the parent directory by stripping the filename, but this parent directory hasn't been validated to exist. Additionally, this check happens before checking the response body, so even if the URL returns a 404, it would still return the parent directory. Consider moving this check after validating the response, or removing it if URLs are never expected to point directly to .repo files.
| # If the URL itself points directly to a .repo file | |
| if url.lower().endswith('.repo'): | |
| return url.rsplit('/', 1)[0] + '/' |
|
|
||
| @cache | ||
| def create_url(mi_id: str, suffix: str) -> str: | ||
| def create_url(mi_id: str, suffix: str) -> str | None: |
There was a problem hiding this comment.
The return type of create_url has changed from returning an empty string "" for invalid URLs to returning None. However, the existing test at jenkins_pipelines/scripts/tests/test_maintenance_json_generator.py:99 still expects an empty string ("") for invalid cases. This test will fail because it checks 'expected_url: str = ... if valid else ""' and then asserts equality with the result. The test needs to be updated to expect None instead of "" for invalid URLs.
| # Search for links to .repo files in the HTML body | ||
| # Matches: href="something.repo" | ||
| rpm_regex = re.compile(r'href=["\'](?P<link>[^"\']+?\.repo)\b["\']', re.IGNORECASE) | ||
| if rpm_regex.search(body): | ||
| return url | ||
|
|
||
| # Fallback for plain text listings of .repo files | ||
| token_repo_re = re.compile(r'(?P<link>https?://[^\s"\'<>]+?\.repo\b)|(?P<link2>[^\s"\'<>]+?\.repo\b)', re.IGNORECASE) | ||
| if token_repo_re.search(body): | ||
| return url | ||
|
|
||
| # --- Strategy 2: Debian/Ubuntu Repository Check --- | ||
| # Debian repos define a 'Release' or 'InRelease' file at the repository root. | ||
| # Matches: href="Release" or href="InRelease" | ||
| deb_regex = re.compile(r'href=["\'](?P<link>(?:In)?Release)["\']', re.IGNORECASE) | ||
| if deb_regex.search(body): | ||
| logging.info(f"Identified Debian-like repository at: {url}") | ||
| return url |
There was a problem hiding this comment.
The new validation logic searches for .repo files or Release/InRelease files in the response body (lines 71-88). However, the test mock at jenkins_pipelines/scripts/tests/mock_response.py:22-23 returns a MockResponse with no content/text for successful URLs. This means even the valid test case will fail the new validation checks and return None instead of the expected URL. The mock needs to be updated to include realistic HTML content with .repo file references for successful test cases.
| Validates if it's a valid RPM repo (contains .repo) or Debian repo (contains Release/InRelease). | ||
| Returns the directory URL if valid, None otherwise. |
There was a problem hiding this comment.
The docstring states "Validates if it's a valid RPM repo (contains .repo) or Debian repo (contains Release/InRelease)". However, this is slightly misleading as the function doesn't just validate - it also constructs and returns the URL. Consider updating the docstring to clarify that the function both validates and returns the URL, or returns None if validation fails. For example: "Validates if the URL points to a valid RPM or Debian repository and returns the URL if valid, None otherwise."
| Validates if it's a valid RPM repo (contains .repo) or Debian repo (contains Release/InRelease). | |
| Returns the directory URL if valid, None otherwise. | |
| Validates if the URL points to a valid RPM repository (contains a .repo file) or Debian repository | |
| (contains Release/InRelease), and returns the URL if valid, None otherwise. |
| rpm_regex = re.compile(r'href=["\'](?P<link>[^"\']+?\.repo)\b["\']', re.IGNORECASE) | ||
| if rpm_regex.search(body): | ||
| return url | ||
|
|
||
| # Fallback for plain text listings of .repo files | ||
| token_repo_re = re.compile(r'(?P<link>https?://[^\s"\'<>]+?\.repo\b)|(?P<link2>[^\s"\'<>]+?\.repo\b)', re.IGNORECASE) | ||
| if token_repo_re.search(body): | ||
| return url | ||
|
|
||
| # --- Strategy 2: Debian/Ubuntu Repository Check --- | ||
| # Debian repos define a 'Release' or 'InRelease' file at the repository root. | ||
| # Matches: href="Release" or href="InRelease" | ||
| deb_regex = re.compile(r'href=["\'](?P<link>(?:In)?Release)["\']', re.IGNORECASE) |
There was a problem hiding this comment.
The regex defines a named group 'link' but the named group is not used. The match object from 'search()' is not captured, so the named group serves no purpose. Consider simplifying the regex by removing the named group, as only the existence of a match is checked.
| rpm_regex = re.compile(r'href=["\'](?P<link>[^"\']+?\.repo)\b["\']', re.IGNORECASE) | |
| if rpm_regex.search(body): | |
| return url | |
| # Fallback for plain text listings of .repo files | |
| token_repo_re = re.compile(r'(?P<link>https?://[^\s"\'<>]+?\.repo\b)|(?P<link2>[^\s"\'<>]+?\.repo\b)', re.IGNORECASE) | |
| if token_repo_re.search(body): | |
| return url | |
| # --- Strategy 2: Debian/Ubuntu Repository Check --- | |
| # Debian repos define a 'Release' or 'InRelease' file at the repository root. | |
| # Matches: href="Release" or href="InRelease" | |
| deb_regex = re.compile(r'href=["\'](?P<link>(?:In)?Release)["\']', re.IGNORECASE) | |
| rpm_regex = re.compile(r'href=["\'][^"\']+?\.repo\b["\']', re.IGNORECASE) | |
| if rpm_regex.search(body): | |
| return url | |
| # Fallback for plain text listings of .repo files | |
| token_repo_re = re.compile(r'https?://[^\s"\'<>]+?\.repo\b|[^\s"\'<>]+?\.repo\b', re.IGNORECASE) | |
| if token_repo_re.search(body): | |
| return url | |
| # --- Strategy 2: Debian/Ubuntu Repository Check --- | |
| # Debian repos define a 'Release' or 'InRelease' file at the repository root. | |
| # Matches: href="Release" or href="InRelease" | |
| deb_regex = re.compile(r'href=["\'](?:In)?Release["\']', re.IGNORECASE) |
| return url | ||
|
|
||
| # Fallback for plain text listings of .repo files | ||
| token_repo_re = re.compile(r'(?P<link>https?://[^\s"\'<>]+?\.repo\b)|(?P<link2>[^\s"\'<>]+?\.repo\b)', re.IGNORECASE) |
There was a problem hiding this comment.
This regex has two named groups both capturing .repo file patterns: 'link' and 'link2'. Having duplicate group names is problematic as only one will be accessible in match results. Additionally, these named groups are never used since only the existence of a match is checked. Consider simplifying to a single pattern without named groups.
| token_repo_re = re.compile(r'(?P<link>https?://[^\s"\'<>]+?\.repo\b)|(?P<link2>[^\s"\'<>]+?\.repo\b)', re.IGNORECASE) | |
| token_repo_re = re.compile(r'https?://[^\s"\'<>]+?\.repo\b|[^\s"\'<>]+?\.repo\b', re.IGNORECASE) |
| # --- Strategy 2: Debian/Ubuntu Repository Check --- | ||
| # Debian repos define a 'Release' or 'InRelease' file at the repository root. | ||
| # Matches: href="Release" or href="InRelease" | ||
| deb_regex = re.compile(r'href=["\'](?P<link>(?:In)?Release)["\']', re.IGNORECASE) |
There was a problem hiding this comment.
Similar to the RPM regex, the named group 'link' is defined but never used. Consider removing the named group to simplify the code.
| deb_regex = re.compile(r'href=["\'](?P<link>(?:In)?Release)["\']', re.IGNORECASE) | |
| deb_regex = re.compile(r'href=["\'](?:In)?Release["\']', re.IGNORECASE) |
| logging.error(f"Error requesting: {url}; {e}") | ||
| return None | ||
|
|
||
| body = res.text or "" |
There was a problem hiding this comment.
The new create_url function now reads 'res.text' on line 64 to get the response body for validation. However, the test mock at jenkins_pipelines/scripts/tests/mock_response.py:9-13 only defines 'content' attribute, not 'text'. The tests will fail with AttributeError when trying to access res.text. The MockResponse class needs to be updated to include a 'text' property that returns the content.
This is problematic and we need to set clear rules for AI generated code. Testing the code, however, should always be done, nevertheless the author of the code. |
@nodeg To make it clear. I opened a GH Card as discussed in our meeting. After that, and just as example, I opened this proposal vibe coded by Gemini. I did not test it because that was an example and I expected someone else to pick the card. Right now, I'm assigned on the card, and ofc, I will test it before going further. But this card had never been prioritized, so for now, I did not start working on it, it still in "New" column. |
Related card: https://github.com/SUSE/spacewalk/issues/29401
This just makes a sanity check to find .repo file inside the repo URL, if it doesn't find it it logs an error. But we need to improve this logic to consider Debian-like repos too.