fix: Adds retry on GCP "No Capacity" error during endpoint creation for privatelinkendpoint #4221

Draft
EspenAlbert wants to merge 1 commit into master from gcp-retry

@EspenAlbert
Collaborator

Description

Please include a summary of the fix/feature/change, including any relevant motivation and context.

Link to any related issue(s):

Type of change:

  • Bug fix (non-breaking change which fixes an issue). Please, add the "bug" label to the PR.
  • New feature (non-breaking change which adds functionality). Please, add the "enhancement" label to the PR. A migration guide must be created or updated if the new feature will go in a major version.
  • Breaking change (fix or feature that would cause existing functionality to not work as expected). Please, add the "breaking change" label to the PR. A migration guide must be created or updated.
  • This change requires a documentation update
  • Documentation fix/enhancement

Required Checklist:

  • I have signed the MongoDB CLA
  • I have read the contributing guides
  • I have checked that this change does not generate any credentials and that they are NOT accidentally logged anywhere.
  • I have added tests that prove my fix is effective or that my feature works per HashiCorp requirements
  • I have added any necessary documentation (if appropriate)
  • I have run make fmt and formatted my code
  • If changes include deprecations or removals I have added appropriate changelog entries.
  • If changes include removal or addition of 3rd party GitHub actions, I updated our internal document. Reach out to the APIx Integration slack channel to get access to the internal document.

Further comments

Copilot AI review requested due to automatic review settings February 24, 2026 14:28
@EspenAlbert EspenAlbert changed the title from "fix(privatelinkendpoint): add retry on GCP "No Capacity" error during endpoint creation" to "fix: Adds retry on GCP "No Capacity" error during endpoint creation for privatelinkendpoint" on Feb 24, 2026
@github-actions github-actions Bot added the bug label Feb 24, 2026
Contributor

Copilot AI left a comment

Pull request overview

This pull request adds retry logic to handle GCP "No Capacity" errors during private endpoint creation. The change addresses transient capacity issues by implementing a retry mechanism with up to 5 attempts and 10-second delays between retries.

Changes:

  • Added custom retry loop with 5 attempts and 10-second sleep intervals for handling GCP "No Capacity" errors during private endpoint service creation

    }
    if admin.IsErrorCode(err, "ATLAS_GENERAL_ERROR") && strings.Contains(err.Error(), "No Capacity") && attempt < maxRetries-1 {
        log.Printf("[DEBUG] Attempt %d/%d: GCP private endpoint creation returned 'No Capacity', retrying in %s...", attempt+1, maxRetries, retrySleep)
        time.Sleep(retrySleep)
Copilot AI Feb 24, 2026

The retry logic doesn't respect context cancellation or timeouts. The loop uses time.Sleep(retrySleep) without checking if the context has been cancelled or if the configured timeout (1 hour, as defined in the resource timeouts at line 128) has been exceeded. This could result in the function continuing to retry even after the Terraform operation timeout has been reached, leading to a poor user experience and potential resource leaks.

Suggested change
-   time.Sleep(retrySleep)
+   select {
+   case <-ctx.Done():
+       return diag.FromErr(ctx.Err())
+   case <-time.After(retrySleep):
+   }

Comment on lines +152 to 167
    const maxRetries = 5
    const retrySleep = 10 * time.Second
    var privateEndpoint *admin.EndpointService
    for attempt := range maxRetries {
        var err error
        privateEndpoint, _, err = connV2.PrivateEndpointServicesApi.CreatePrivateEndpointService(ctx, projectID, request).Execute()
        if err == nil {
            break
        }
        if admin.IsErrorCode(err, "ATLAS_GENERAL_ERROR") && strings.Contains(err.Error(), "No Capacity") && attempt < maxRetries-1 {
            log.Printf("[DEBUG] Attempt %d/%d: GCP private endpoint creation returned 'No Capacity', retrying in %s...", attempt+1, maxRetries, retrySleep)
            time.Sleep(retrySleep)
            continue
        }
        return diag.FromErr(fmt.Errorf(errorPrivateLinkEndpointsCreate, err))
    }
Copilot AI Feb 24, 2026

The new retry logic for handling "No Capacity" errors lacks test coverage. The existing tests in resource_test.go don't cover this retry scenario. Given that this file has comprehensive test coverage for other scenarios (basicAWS, basicAzure, basicGCP, deleteOnCreateTimeout, etc.), the new retry behavior should also have test coverage to ensure it works correctly and doesn't introduce regressions.

Comment on lines +152 to +153
    const maxRetries = 5
    const retrySleep = 10 * time.Second
Copilot AI Feb 24, 2026

The retry configuration (5 for maxRetries and 10 seconds for retrySleep) is declared as function-local constants rather than following the existing pattern of package-level constants (see lines 24-30, where delayAndMinTimeout is defined). Consider defining these retry parameters as package-level constants with descriptive names for better maintainability and consistency.

Comment on lines +155 to 168
    for attempt := range maxRetries {
        var err error
        privateEndpoint, _, err = connV2.PrivateEndpointServicesApi.CreatePrivateEndpointService(ctx, projectID, request).Execute()
        if err == nil {
            break
        }
        if admin.IsErrorCode(err, "ATLAS_GENERAL_ERROR") && strings.Contains(err.Error(), "No Capacity") && attempt < maxRetries-1 {
            log.Printf("[DEBUG] Attempt %d/%d: GCP private endpoint creation returned 'No Capacity', retrying in %s...", attempt+1, maxRetries, retrySleep)
            time.Sleep(retrySleep)
            continue
        }
        return diag.FromErr(fmt.Errorf(errorPrivateLinkEndpointsCreate, err))
    }

Copilot AI Feb 24, 2026

The nil-safety of privateEndpoint after the loop depends on a subtle detail of the retry condition. When attempt equals 4 (the last iteration), the condition attempt < maxRetries-1 on line 161 is false, so a final "No Capacity" error skips the retry branch (lines 162-164) and falls through to the error return on line 166. The loop therefore exits without returning only through the break on success, and privateEndpoint is non-nil when line 169 calls privateEndpoint.GetId().

Because nothing after the loop enforces this, a later change to the retry condition could let the loop finish with privateEndpoint still nil and cause a panic at line 169. Tracking the last error outside the loop and adding an explicit nil check makes the invariant explicit.

Suggested change
-   for attempt := range maxRetries {
-       var err error
-       privateEndpoint, _, err = connV2.PrivateEndpointServicesApi.CreatePrivateEndpointService(ctx, projectID, request).Execute()
-       if err == nil {
-           break
-       }
-       if admin.IsErrorCode(err, "ATLAS_GENERAL_ERROR") && strings.Contains(err.Error(), "No Capacity") && attempt < maxRetries-1 {
-           log.Printf("[DEBUG] Attempt %d/%d: GCP private endpoint creation returned 'No Capacity', retrying in %s...", attempt+1, maxRetries, retrySleep)
-           time.Sleep(retrySleep)
-           continue
-       }
-       return diag.FromErr(fmt.Errorf(errorPrivateLinkEndpointsCreate, err))
-   }
+   var lastErr error
+   for attempt := range maxRetries {
+       privateEndpoint, _, lastErr = connV2.PrivateEndpointServicesApi.CreatePrivateEndpointService(ctx, projectID, request).Execute()
+       if lastErr == nil {
+           break
+       }
+       if admin.IsErrorCode(lastErr, "ATLAS_GENERAL_ERROR") && strings.Contains(lastErr.Error(), "No Capacity") && attempt < maxRetries-1 {
+           log.Printf("[DEBUG] Attempt %d/%d: GCP private endpoint creation returned 'No Capacity', retrying in %s...", attempt+1, maxRetries, retrySleep)
+           time.Sleep(retrySleep)
+           continue
+       }
+       return diag.FromErr(fmt.Errorf(errorPrivateLinkEndpointsCreate, lastErr))
+   }
+   if privateEndpoint == nil {
+       return diag.FromErr(fmt.Errorf(errorPrivateLinkEndpointsCreate, lastErr))
+   }
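
The last-attempt path of the loop can be traced with a small stand-alone sketch. It replays only the branch structure of the loop under review, under the assumption that every call fails with a retryable "No Capacity" error; `traceBranches` is a hypothetical helper and makes no API calls.

```go
package main

import "fmt"

// traceBranches records which branch each attempt takes when every
// call fails with a retryable "No Capacity" error (an assumption).
func traceBranches(maxRetries int) []string {
	var trace []string
	for attempt := 0; attempt < maxRetries; attempt++ {
		noCapacity := true // assumption: the error is always "No Capacity"
		if noCapacity && attempt < maxRetries-1 {
			trace = append(trace, "retry")
			continue
		}
		// Reached whenever the retry condition is false,
		// which includes the last attempt.
		trace = append(trace, "return error")
		return trace
	}
	// Only reachable if the loop body never runs (maxRetries < 1).
	return append(trace, "loop exhausted")
}

func main() {
	fmt.Println(traceBranches(5)) // prints [retry retry retry retry return error]
}
```

In this sketch the final attempt reaches the return statement, so a nil check placed after the loop acts as a guard against future changes to the condition rather than as the only exit for the exhausted case.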

Comment on lines +152 to 166
    const maxRetries = 5
    const retrySleep = 10 * time.Second
    var privateEndpoint *admin.EndpointService
    for attempt := range maxRetries {
        var err error
        privateEndpoint, _, err = connV2.PrivateEndpointServicesApi.CreatePrivateEndpointService(ctx, projectID, request).Execute()
        if err == nil {
            break
        }
        if admin.IsErrorCode(err, "ATLAS_GENERAL_ERROR") && strings.Contains(err.Error(), "No Capacity") && attempt < maxRetries-1 {
            log.Printf("[DEBUG] Attempt %d/%d: GCP private endpoint creation returned 'No Capacity', retrying in %s...", attempt+1, maxRetries, retrySleep)
            time.Sleep(retrySleep)
            continue
        }
        return diag.FromErr(fmt.Errorf(errorPrivateLinkEndpointsCreate, err))
Copilot AI Feb 24, 2026

This custom retry implementation is inconsistent with the established codebase pattern. The repository extensively uses retry.RetryContext from the Terraform SDK's helper package for retry logic (see internal/service/globalclusterconfig/resource_global_cluster_config.go:129, internal/service/streamconnection/state_transition.go:27, internal/service/team/resource_team.go:177, among others).

Using retry.RetryContext provides several advantages: it respects context cancellation, handles timeouts properly, integrates with Terraform's logging, and provides exponential backoff. The current implementation with time.Sleep doesn't check for context cancellation and uses a fixed delay, which could cause the operation to exceed configured timeouts.

Suggested change
-   const maxRetries = 5
-   const retrySleep = 10 * time.Second
-   var privateEndpoint *admin.EndpointService
-   for attempt := range maxRetries {
-       var err error
-       privateEndpoint, _, err = connV2.PrivateEndpointServicesApi.CreatePrivateEndpointService(ctx, projectID, request).Execute()
-       if err == nil {
-           break
-       }
-       if admin.IsErrorCode(err, "ATLAS_GENERAL_ERROR") && strings.Contains(err.Error(), "No Capacity") && attempt < maxRetries-1 {
-           log.Printf("[DEBUG] Attempt %d/%d: GCP private endpoint creation returned 'No Capacity', retrying in %s...", attempt+1, maxRetries, retrySleep)
-           time.Sleep(retrySleep)
-           continue
-       }
-       return diag.FromErr(fmt.Errorf(errorPrivateLinkEndpointsCreate, err))
+   var privateEndpoint *admin.EndpointService
+   attempts := 0
+   retryErr := retry.RetryContext(ctx, d.Timeout(schema.TimeoutCreate), func() *retry.RetryError {
+       attempts++
+       pe, _, err := connV2.PrivateEndpointServicesApi.CreatePrivateEndpointService(ctx, projectID, request).Execute()
+       if err != nil {
+           if admin.IsErrorCode(err, "ATLAS_GENERAL_ERROR") && strings.Contains(err.Error(), "No Capacity") {
+               log.Printf("[DEBUG] Attempt %d: GCP private endpoint creation returned 'No Capacity', retrying...", attempts)
+               return retry.RetryableError(err)
+           }
+           return retry.NonRetryableError(err)
+       }
+       privateEndpoint = pe
+       return nil
+   })
+   if retryErr != nil {
+       return diag.FromErr(fmt.Errorf(errorPrivateLinkEndpointsCreate, retryErr))

@svc-apix-Bot
Collaborator

This PR has gone 7 days without any activity and meets the project’s definition of "stale". This will be auto-closed if there is no new activity over the next 7 days. If the issue is still relevant and active, you can simply comment with a "bump" to keep it open, or add the label "not_stale". Thanks for keeping our repository healthy!

@svc-apix-Bot svc-apix-Bot added stale and removed stale labels Mar 2, 2026
@svc-apix-Bot svc-apix-Bot added stale and removed stale labels Mar 8, 2026
@svc-apix-Bot svc-apix-Bot added stale and removed stale labels Mar 15, 2026
@EspenAlbert EspenAlbert added the not_stale Not stale issue or PR label Mar 16, 2026
@EspenAlbert
Collaborator Author

Still no action from upstream, keeping this open

Labels

bug not_stale Not stale issue or PR

3 participants