Long-running Azure test reliability and test-code ref override #11932
brooke-hamilton wants to merge 5 commits
Dependency Review
✅ No vulnerabilities, license issues, or OpenSSF Scorecard issues found.
Scanned Files: None
Pull request overview
This PR improves long-running Azure test reliability by allowing test-code checkout overrides, pre-restoring UDT Bicep types, and making Kubernetes/Flux test teardown more robust.
Changes:
- Adds optional `test_code_ref` support to the release-code checkout script and workflow.
- Restores UDT testresources Bicep artifacts before functional tests.
- Updates Flux/DeploymentTemplate cleanup to delete templates before namespaces and add namespace finalizer fallback handling.
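The ref override in the first bullet amounts to a fallback chain. The sketch below is a minimal Go illustration, not the script's actual logic: the function name `resolveTestCodeRef` is hypothetical, and the precedence shown (positional argument, then `TEST_CODE_REF`, then the release tag) is an assumption, since the PR lists the inputs without stating their order.

```go
package main

import (
	"fmt"
	"os"
)

// resolveTestCodeRef picks the git ref to check out for test code.
// Assumed precedence (not confirmed by the PR): positional argument,
// then the TEST_CODE_REF environment variable, then the release tag
// that matches the installed CLI.
func resolveTestCodeRef(positionalArg, envValue, releaseTag string) string {
	if positionalArg != "" {
		return positionalArg
	}
	if envValue != "" {
		return envValue
	}
	return releaseTag
}

func main() {
	arg := ""
	if len(os.Args) > 1 {
		arg = os.Args[1]
	}
	// "v0.0.0" is a placeholder for the tag derived from the installed CLI.
	ref := resolveTestCodeRef(arg, os.Getenv("TEST_CODE_REF"), "v0.0.0")
	fmt.Println("checking out test code at:", ref)
}
```

With no override supplied, the function returns the release tag, which matches the behavior described for runs that do not set a test-code ref.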
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| `.github/scripts/checkout-release-codebase.sh` | Adds configurable release/test-code ref checkout and workflow outputs. |
| `.github/workflows/long-running-azure.yaml` | Adds manual input for test-code ref and restores UDT Bicep artifacts. |
| `test/functional-portable/kubernetes/noncloud/deploymenttemplate_test.go` | Enhances namespace deletion waiting and finalizer fallback cleanup. |
| `test/functional-portable/kubernetes/noncloud/flux_test.go` | Tracks and deletes DeploymentTemplates before namespace teardown. |
Codecov Report
✅ All modified and coverable lines are covered by tests.
Additional details and impacted files

    @@           Coverage Diff           @@
    ##             main   #11932   +/-   ##
    =======================================
      Coverage   51.69%   51.69%
    =======================================
      Files         724      724
      Lines       45508    45508
    =======================================
      Hits        23525    23525
      Misses      19763    19763
      Partials     2220     2220
fix: add testresources Bicep extension pre-restore to prevent token expiry failures

The long-running-azure.yaml workflow pre-restores Bicep extension artifacts before running functional tests, but was missing the testresources extension. The comment explicitly noted all three extensions needed to be restored (Radius, AWS, UDT testresources), but only two were implemented.

During long-running tests (~1hr), Bicep tries to restore the testresources extension from the Azure Container Registry using Azure CLI credentials. The OIDC token from the "Re-login to Azure" step only lasts 5 minutes, so any build happening after that fails with:

    AADSTS700024: Client assertion is not within its valid time range

By pre-restoring the testresources extension (br:crradfunctest9lhu.azurecr.io/testresources:latest) while the Azure CLI token is still valid, Bicep caches the artifact locally. Subsequent builds use the cached version without re-authenticating to the registry.

Agent-Logs-Url: https://github.com/radius-project/radius/sessions/98d8a698-3cbb-44ef-a1d8-f5315fb82e4b
Co-authored-by: brooke-hamilton <45323234+brooke-hamilton@users.noreply.github.com>

improve flux text namespace delete logic
Signed-off-by: Brooke Hamilton <45323234+brooke-hamilton@users.noreply.github.com>

improve flux test
Signed-off-by: Brooke Hamilton <45323234+brooke-hamilton@users.noreply.github.com>

run tests on selected branch
Signed-off-by: Brooke Hamilton <45323234+brooke-hamilton@users.noreply.github.com>

(cherry picked from commit 0ca765527959b5ef3bdb28aa12e4638a2f8de2a9)
Signed-off-by: Brooke Hamilton <45323234+brooke-hamilton@users.noreply.github.com>
Branch updated from fd4b69b to 391e0f0.
    if len(ns.Spec.Finalizers) > 0 {
        t.Logf("Namespace %s stuck terminating with finalizers: %v; clearing them", namespace, ns.Spec.Finalizers)
        ns.Spec.Finalizers = nil
        _, err = opts.K8sClient.CoreV1().Namespaces().Finalize(ctx, ns, metav1.UpdateOptions{})
I think having a finalizer is a symptom that there are still hanging resources. IMO we should never force delete finalizers in this case, since it can result in the next run having unexpected behavior (existing but untracked resources).
    // Track DeploymentTemplates created across steps so we can delete them and wait for
    // the Radius finalizer to drain before tearing down namespaces.
    var deploymentTemplates []*radappiov1alpha3.DeploymentTemplate
where is this slice being populated?
Populated inside the per-step loop at test/functional-portable/kubernetes/noncloud/flux_test.go#L359-L361:
    deploymentTemplate, err := waitForDeploymentTemplateToBeReadyWithGeneration(t, ctx, types.NamespacedName{Name: name, Namespace: namespace}, stepNumber, opts.Client)
    require.NoError(t, err)
    deploymentTemplates = append(deploymentTemplates, deploymentTemplate)

For every step, after each DeploymentTemplate becomes Ready we append it to the slice that the t.Cleanup closure captures, so teardown deletes every DT created across all steps before tearing down namespaces.
Signed-off-by: Brooke Hamilton <45323234+brooke-hamilton@users.noreply.github.com>
Signed-off-by: Brooke Hamilton <45323234+brooke-hamilton@users.noreply.github.com>
Radius functional test overview
Test Status: ⌛ Building Radius and pushing container images for functional tests...
@brooke-hamilton @willdavsmith any progress on this fix?
Description
Improvements to the long-running Azure test workflow to prevent timeouts and to allow test code to be patched without cutting a product patch release.
Three independent fixes:
- `test_code_ref` override for `checkout-release-codebase.sh` and the long-running Azure workflow. When set (via positional arg, `TEST_CODE_REF` env var, or `workflow_dispatch` input), the script clones the supplied git ref into `current_release/` instead of the tag matching the installed CLI. The product under test is still the installed release; only the on-disk test/infrastructure code changes. This lets us iterate on test fixes without releasing a new patch.
- Pre-restore of `usertypealpha-recipe.bicep` types in the workflow. Restoring them lazily later in the run raced with Azure CLI token expiry during the long-running tests.
- `deleteNamespace` now waits for normal deletion with `assert.Eventually` and falls back to clearing namespace finalizers (test-only escape hatch) when a namespace is stuck `Terminating`. `testFluxIntegration` now deletes `DeploymentTemplate` resources first and waits for the Radius finalizer to drain before tearing down namespaces, so namespaces are not stuck on `radapp.io/deployment-template-finalizer` and `radius-rp` does not recreate application namespaces while their backing Applications.Core resources still exist. Cleanup uses `assert.*` instead of `require.*` so one stuck resource does not skip cleanup of the rest.

Type of change
Fixes: #issue_number
Contributor checklist
Please verify that the PR meets the following requirements, where applicable:
`eng/design-notes/` in this repository, if new APIs are being introduced.