fix: Adds retry on GCP "No Capacity" error during endpoint creation for privatelinkendpoint #4221
EspenAlbert wants to merge 1 commit into master from
Conversation
… endpoint creation
Pull request overview
This pull request adds retry logic to handle GCP "No Capacity" errors during private endpoint creation. The change addresses transient capacity issues by implementing a retry mechanism with up to 5 attempts and 10-second delays between retries.
Changes:
- Added custom retry loop with 5 attempts and 10-second sleep intervals for handling GCP "No Capacity" errors during private endpoint service creation
```go
	}
	if admin.IsErrorCode(err, "ATLAS_GENERAL_ERROR") && strings.Contains(err.Error(), "No Capacity") && attempt < maxRetries-1 {
		log.Printf("[DEBUG] Attempt %d/%d: GCP private endpoint creation returned 'No Capacity', retrying in %s...", attempt+1, maxRetries, retrySleep)
		time.Sleep(retrySleep)
```
The retry logic doesn't respect context cancellation or timeouts. The loop uses time.Sleep(retrySleep) without checking if the context has been cancelled or if the configured timeout (1 hour, as defined in the resource timeouts at line 128) has been exceeded. This could result in the function continuing to retry even after the Terraform operation timeout has been reached, leading to a poor user experience and potential resource leaks.
Suggested change:

```diff
-	time.Sleep(retrySleep)
+	select {
+	case <-ctx.Done():
+		return diag.FromErr(ctx.Err())
+	case <-time.After(retrySleep):
+	}
```
```go
const maxRetries = 5
const retrySleep = 10 * time.Second
var privateEndpoint *admin.EndpointService
for attempt := range maxRetries {
	var err error
	privateEndpoint, _, err = connV2.PrivateEndpointServicesApi.CreatePrivateEndpointService(ctx, projectID, request).Execute()
	if err == nil {
		break
	}
	if admin.IsErrorCode(err, "ATLAS_GENERAL_ERROR") && strings.Contains(err.Error(), "No Capacity") && attempt < maxRetries-1 {
		log.Printf("[DEBUG] Attempt %d/%d: GCP private endpoint creation returned 'No Capacity', retrying in %s...", attempt+1, maxRetries, retrySleep)
		time.Sleep(retrySleep)
		continue
	}
	return diag.FromErr(fmt.Errorf(errorPrivateLinkEndpointsCreate, err))
}
```
The new retry logic for handling "No Capacity" errors lacks test coverage. The existing tests in resource_test.go don't cover this retry scenario. Given that this file has comprehensive test coverage for other scenarios (basicAWS, basicAzure, basicGCP, deleteOnCreateTimeout, etc.), the new retry behavior should also have test coverage to ensure it works correctly and doesn't introduce regressions.
```go
const maxRetries = 5
const retrySleep = 10 * time.Second
```
The retry configuration uses magic numbers (5 for maxRetries and 10 seconds for retrySleep) instead of following the pattern of defining constants at the package level. The existing code defines timeout constants at the package level (see lines 24-30 where delayAndMinTimeout is defined). Consider defining these retry parameters as package-level constants with descriptive names for better maintainability and consistency.
```go
for attempt := range maxRetries {
	var err error
	privateEndpoint, _, err = connV2.PrivateEndpointServicesApi.CreatePrivateEndpointService(ctx, projectID, request).Execute()
	if err == nil {
		break
	}
	if admin.IsErrorCode(err, "ATLAS_GENERAL_ERROR") && strings.Contains(err.Error(), "No Capacity") && attempt < maxRetries-1 {
		log.Printf("[DEBUG] Attempt %d/%d: GCP private endpoint creation returned 'No Capacity', retrying in %s...", attempt+1, maxRetries, retrySleep)
		time.Sleep(retrySleep)
		continue
	}
	return diag.FromErr(fmt.Errorf(errorPrivateLinkEndpointsCreate, err))
}
```
There's a potential nil pointer dereference bug. If all retries are exhausted and the last error is a "No Capacity" error that passes the retry condition check (line 161), the loop will exit normally without breaking or returning. This means privateEndpoint will remain nil, causing a panic at line 169 when calling privateEndpoint.GetId().
The condition attempt < maxRetries-1 prevents retrying on the last attempt, but doesn't handle the case where the last attempt also returns a "No Capacity" error. When attempt equals 4 (the last iteration), the condition on line 161 will be false, so the code will skip both the retry logic (line 162-164) and the error return (line 166), allowing the loop to complete with privateEndpoint still nil.
Suggested change:

```diff
-for attempt := range maxRetries {
-	var err error
-	privateEndpoint, _, err = connV2.PrivateEndpointServicesApi.CreatePrivateEndpointService(ctx, projectID, request).Execute()
-	if err == nil {
-		break
-	}
-	if admin.IsErrorCode(err, "ATLAS_GENERAL_ERROR") && strings.Contains(err.Error(), "No Capacity") && attempt < maxRetries-1 {
-		log.Printf("[DEBUG] Attempt %d/%d: GCP private endpoint creation returned 'No Capacity', retrying in %s...", attempt+1, maxRetries, retrySleep)
-		time.Sleep(retrySleep)
-		continue
-	}
-	return diag.FromErr(fmt.Errorf(errorPrivateLinkEndpointsCreate, err))
-}
+var lastErr error
+for attempt := range maxRetries {
+	privateEndpoint, _, lastErr = connV2.PrivateEndpointServicesApi.CreatePrivateEndpointService(ctx, projectID, request).Execute()
+	if lastErr == nil {
+		break
+	}
+	if admin.IsErrorCode(lastErr, "ATLAS_GENERAL_ERROR") && strings.Contains(lastErr.Error(), "No Capacity") && attempt < maxRetries-1 {
+		log.Printf("[DEBUG] Attempt %d/%d: GCP private endpoint creation returned 'No Capacity', retrying in %s...", attempt+1, maxRetries, retrySleep)
+		time.Sleep(retrySleep)
+		continue
+	}
+	return diag.FromErr(fmt.Errorf(errorPrivateLinkEndpointsCreate, lastErr))
+}
+if privateEndpoint == nil {
+	return diag.FromErr(fmt.Errorf(errorPrivateLinkEndpointsCreate, lastErr))
+}
```
```go
const maxRetries = 5
const retrySleep = 10 * time.Second
var privateEndpoint *admin.EndpointService
for attempt := range maxRetries {
	var err error
	privateEndpoint, _, err = connV2.PrivateEndpointServicesApi.CreatePrivateEndpointService(ctx, projectID, request).Execute()
	if err == nil {
		break
	}
	if admin.IsErrorCode(err, "ATLAS_GENERAL_ERROR") && strings.Contains(err.Error(), "No Capacity") && attempt < maxRetries-1 {
		log.Printf("[DEBUG] Attempt %d/%d: GCP private endpoint creation returned 'No Capacity', retrying in %s...", attempt+1, maxRetries, retrySleep)
		time.Sleep(retrySleep)
		continue
	}
	return diag.FromErr(fmt.Errorf(errorPrivateLinkEndpointsCreate, err))
```
This custom retry implementation is inconsistent with the established codebase pattern. The repository extensively uses retry.RetryContext from the Terraform SDK's helper package for retry logic (see internal/service/globalclusterconfig/resource_global_cluster_config.go:129, internal/service/streamconnection/state_transition.go:27, internal/service/team/resource_team.go:177, among others).
Using retry.RetryContext provides several advantages: it respects context cancellation, handles timeouts properly, integrates with Terraform's logging, and provides exponential backoff. The current implementation with time.Sleep doesn't check for context cancellation and uses a fixed delay, which could cause the operation to exceed configured timeouts.
Suggested change:

```diff
-const maxRetries = 5
-const retrySleep = 10 * time.Second
-var privateEndpoint *admin.EndpointService
-for attempt := range maxRetries {
-	var err error
-	privateEndpoint, _, err = connV2.PrivateEndpointServicesApi.CreatePrivateEndpointService(ctx, projectID, request).Execute()
-	if err == nil {
-		break
-	}
-	if admin.IsErrorCode(err, "ATLAS_GENERAL_ERROR") && strings.Contains(err.Error(), "No Capacity") && attempt < maxRetries-1 {
-		log.Printf("[DEBUG] Attempt %d/%d: GCP private endpoint creation returned 'No Capacity', retrying in %s...", attempt+1, maxRetries, retrySleep)
-		time.Sleep(retrySleep)
-		continue
-	}
-	return diag.FromErr(fmt.Errorf(errorPrivateLinkEndpointsCreate, err))
+var privateEndpoint *admin.EndpointService
+attempts := 0
+retryErr := retry.RetryContext(ctx, d.Timeout(schema.TimeoutCreate), func() *retry.RetryError {
+	attempts++
+	pe, _, err := connV2.PrivateEndpointServicesApi.CreatePrivateEndpointService(ctx, projectID, request).Execute()
+	if err != nil {
+		if admin.IsErrorCode(err, "ATLAS_GENERAL_ERROR") && strings.Contains(err.Error(), "No Capacity") {
+			log.Printf("[DEBUG] Attempt %d: GCP private endpoint creation returned 'No Capacity', retrying...", attempts)
+			return retry.RetryableError(err)
+		}
+		return retry.NonRetryableError(err)
+	}
+	privateEndpoint = pe
+	return nil
+})
+if retryErr != nil {
+	return diag.FromErr(fmt.Errorf(errorPrivateLinkEndpointsCreate, retryErr))
+}
```
This PR has gone 7 days without any activity and meets the project's definition of "stale". This will be auto-closed if there is no new activity over the next 7 days. If the issue is still relevant and active, you can simply comment with a "bump" to keep it open, or add the label "not_stale". Thanks for keeping our repository healthy!
Still no action from upstream, keeping this open |
Description
Please include a summary of the fix/feature/change, including any relevant motivation and context.
Link to any related issue(s):
Type of change:
Required Checklist:
Further comments