diff --git a/docs/manual-test-plan.md b/docs/manual-test-plan.md index db8c1ce..a08a89f 100644 --- a/docs/manual-test-plan.md +++ b/docs/manual-test-plan.md @@ -1502,6 +1502,112 @@ Expected: per-entry findings count exactly, valid forms produce zero findings, a Regression anchor: the no-overlap unit test `TestMultidocNetAddrHandlers_NoOverlapWithRefHandlers` pins the dispatch-map disjointness contract. A future entry that lands in BOTH `multidocHandlers` and `multidocNetAddrHandlers` produces double findings — verify via the unit suite before manual smokes. +## N. Output discipline & error diagnostics + +These scenarios pin two operator-facing contracts: (1) stdout stays clean enough to be piped into another file without filtering progress text out; (2) when a `lookup` against the live Talos API fails, the error chain names the dialed endpoint, the resource kind, and a class-specific remedy. + +### N.1 stdout cleanliness — `talm template --file X > Y` + +```bash +cd $PROJECT +talm template --offline -f nodes/node0.yaml > /tmp/rendered.yaml +head -1 /tmp/rendered.yaml +# Expected: the first line is `# talm: nodes=[...]` (rendered modeline comment), NOT `- talm: file=...`. +# A leading `- talm:` would mean the per-file progress line is leaking into stdout. + +# Same redirect without --offline against a reachable node: +talm template -f nodes/node0.yaml > /tmp/rendered.yaml +head -1 /tmp/rendered.yaml +# Expected: same `# talm: nodes=[...]` first line; progress is on stderr only. + +# Confirm progress lands on stderr: +talm template --offline -f nodes/node0.yaml 1>/dev/null 2>/tmp/stderr.txt +grep '^- talm: file=' /tmp/stderr.txt +# Expected: one matching line per --file argument; stderr is the only channel that carries it. +``` + +Run the same shape against `talm apply --dry-run -f nodes/node0.yaml > /tmp/rendered.yaml`. Note that `talm apply` writes nothing to stdout in normal mode (only `--debug` emits the recipe stream); the assertion here is **defensive** — verifying that nothing new starts leaking onto stdout in the future, not filtering an existing stream. The progress line must still go to stderr; assert stdout is empty and stderr carries `- talm: file=...`. + +### N.2 Lookup error diagnostics — node unreachable + +Run with `--endpoints` pointing at an address that is reachable on TCP but does not run Talos (or a local closed port), and WITHOUT `--offline`: + +```bash +talm template -f nodes/node0.yaml --endpoints 192.0.2.123 || true +# Expected error chain MUST contain: +# - `looking up resource kind="disks" namespace="" id="" on endpoints=[192.0.2.123]` +# - a class hint mentioning the dialed endpoint (192.0.2.123) +# - one of the remedies depending on class: +# * TLS handshake failure → mentions "cert SANs" / "maintenance mode" +# * connection refused → mentions "firewall" / "wrong port" +# * timeout → mentions "timed out" +# * authn rejection → mentions "talosconfig" +# - a closing sentence pointing at `--offline` as the safe escape ("If live discovery is not required, pass --offline"). +``` + +Repeat with `talm apply --dry-run` against the same unreachable endpoint: + +```bash +talm apply --dry-run -f nodes/node0.yaml --endpoints 192.0.2.123 || true +# Expected: same class hint shape, but the remedy clause MUST say +# "Fix node reachability before re-running apply." and MUST NOT mention `--offline` +# (apply has no --offline flag — suggesting it would teach a non-existent workflow). +``` + +### N.3 Lookup retry policy — transient connectivity + +Bring up a node, then drop its API briefly mid-template (e.g. `systemctl stop apid` for ~300ms on a Talos test node, or use `iptables -A INPUT -p tcp --dport 50000 -j DROP` for half a second on a non-prod target). The expectation is that `talm template -f nodes/node0.yaml` succeeds on the second or third attempt without the operator seeing an error — retry budget is 3 attempts with 200ms/400ms exponential backoff (~600ms total). + +```bash +# In one shell: +talm template -f nodes/node0.yaml > /tmp/rendered.yaml + +# In another (within ~300ms of the first): +# Briefly block the talos API: +ssh ${NODE} iptables -A INPUT -p tcp --dport 50000 -j DROP +sleep 0.3 +ssh ${NODE} iptables -D INPUT -p tcp --dport 50000 -j DROP +``` + +Expected: template render succeeds after retry (1 to 2 retried attempts visible only if `--debug` is on); operator sees the rendered output, no error message. + +Same shape against a node that stays down longer than the budget. Use `REJECT --reject-with tcp-reset` rather than `-j DROP` — DROP silently discards SYN packets, the talos client retries the TCP handshake until its own deadline expires, and the lookup classifies as deadline (timeout) instead of refused. REJECT sends an RST so the kernel surfaces ECONNREFUSED immediately, matching what an operator typically sees from a firewall-blocked or stopped-service node: + +```bash +ssh ${NODE} iptables -A INPUT -p tcp --dport 50000 -j REJECT --reject-with tcp-reset +talm template -f nodes/node0.yaml || true +ssh ${NODE} iptables -D INPUT -p tcp --dport 50000 -j REJECT --reject-with tcp-reset +``` + +Expected: failure after ~600ms+ (3 attempts × backoff); operator sees the `connection refused` class hint pointing at firewall / port / `--offline`. + +(If you intentionally want the deadline-class path instead, swap REJECT for `-j DROP` — the expected hint then mentions `timed out` rather than `connection refused`.) + +Ctrl+C should interrupt retries well before the full retry budget: + +```bash +ssh ${NODE} iptables -A INPUT -p tcp --dport 50000 -j REJECT --reject-with tcp-reset +talm template -f nodes/node0.yaml & +sleep 0.1 +kill -INT $! +ssh ${NODE} iptables -D INPUT -p tcp --dport 50000 -j REJECT --reject-with tcp-reset +``` + +Expected: `talm` exits well under 1s (the cancellation interrupts the inter-attempt backoff sleep immediately) rather than the full ~600ms retry budget. A hard sub-100ms bound is too brittle across ssh hops and busy hosts — anything visibly faster than the full budget proves the cancellation path is wired correctly. + +### N.4 Lookup error diagnostics — non-network failure + +Forge a chart that calls `lookup` for an unknown resource kind (e.g. add `{{- (lookup "nonexistent_resource_kind" "" "").items }}` to a helper template), then run `talm template -f nodes/node0.yaml` against a reachable Talos node: + +```bash +talm template -f nodes/node0.yaml || true +# Expected error chain MUST contain: +# - `looking up resource kind="nonexistent_resource_kind" ...` +# - hint mentioning "chart bug" / "unsupported Talos resource" +# - `talosctl get nonexistent_resource_kind` as the verification step +# - MUST NOT suggest `--offline` (would mask the chart bug by rendering against empty discovery). +``` + ## Sanity-check block Run after every destructive section (E, F, H, and anything that touches `--mode=reboot` / `--mode=staged` / `apply -I`): diff --git a/pkg/commands/apply.go b/pkg/commands/apply.go index 77d5f17..9e30743 100644 --- a/pkg/commands/apply.go +++ b/pkg/commands/apply.go @@ -42,10 +42,12 @@ import ( const parentDir = ".." // applyCommandName labels this subcommand inside engine.Options for -// FailIfMultiNodes error wording. Centralised so the template -// rendering options block and any test asserting against the field -// share a single canonical value. -const applyCommandName = "talm apply" +// FailIfMultiNodes error wording and for the engine's apply-vs-template +// hint branching. Aliased to engine.CommandNameApply so the engine +// constant remains the single source of truth — drift between the +// two would let apply error hints silently start suggesting the +// non-existent --offline flag. +const applyCommandName = engine.CommandNameApply // roleAnchor and roleSidePatch label the role of a config file in // the apply chain for operator-facing error messages @@ -417,12 +419,16 @@ func applyOneFileTemplateMode(configFile string, sidePatches, modelineTemplates // dominant invocation shape is `talm apply -f nodes/.yaml` // without side-patches; reporting `side-patches=[]` on every line // was visible noise without operator value. + // Progress line goes to stderr. Apply normally writes nothing to + // stdout (only `--debug` emits the recipe stream via + // engine.debugPhase); routing progress here matches the + // stdout-cleanliness contract already in effect for `talm template + // > file.yaml` and keeps the `--debug` recipe stream + // uncontaminated. if len(sidePatches) == 0 { - //nolint:forbidigo // CLI progress line surfaces the file-to-target mapping for the operator - fmt.Printf("- talm: file=%s, nodes=[%s], endpoints=[%s]\n", configFile, strings.Join(nodes, ","), strings.Join(GlobalArgs.Endpoints, ",")) + fmt.Fprintf(os.Stderr, "- talm: file=%s, nodes=[%s], endpoints=[%s]\n", configFile, strings.Join(nodes, ","), strings.Join(GlobalArgs.Endpoints, ",")) } else { - //nolint:forbidigo // CLI progress line surfaces the file-to-target mapping for the operator - fmt.Printf("- talm: file=%s, side-patches=[%s], nodes=[%s], endpoints=[%s]\n", configFile, strings.Join(sidePatches, ","), strings.Join(nodes, ","), strings.Join(GlobalArgs.Endpoints, ",")) + fmt.Fprintf(os.Stderr, "- talm: file=%s, side-patches=[%s], nodes=[%s], endpoints=[%s]\n", configFile, strings.Join(sidePatches, ","), strings.Join(nodes, ","), strings.Join(GlobalArgs.Endpoints, ",")) } applyClosure := buildApplyClosure() @@ -567,8 +573,8 @@ func applyOneFileDirectPatchMode(configFile, withSecretsPath string) error { return err } - //nolint:forbidigo // CLI progress line surfaces the file-to-target mapping for the operator - fmt.Printf("- talm: file=%s, nodes=[%s], endpoints=[%s]\n", configFile, strings.Join(targetNodes, ","), strings.Join(GlobalArgs.Endpoints, ",")) + // Progress line goes to stderr; stdout is reserved for rendered output. + fmt.Fprintf(os.Stderr, "- talm: file=%s, nodes=[%s], endpoints=[%s]\n", configFile, strings.Join(targetNodes, ","), strings.Join(GlobalArgs.Endpoints, ",")) read := cosiVersionReader(c) @@ -1010,6 +1016,7 @@ func buildApplyRenderOptions(modelineTemplates []string, withSecretsPath string) Root: Config.RootDir, TemplateFiles: resolvedTemplates, CommandName: applyCommandName, + TalosEndpoints: append([]string(nil), GlobalArgs.Endpoints...), } } diff --git a/pkg/commands/contract_apply_command_name_test.go b/pkg/commands/contract_apply_command_name_test.go new file mode 100644 index 0000000..7d53a48 --- /dev/null +++ b/pkg/commands/contract_apply_command_name_test.go @@ -0,0 +1,41 @@ +// Copyright Cozystack Authors +// +// Licensed under the Apache License, Version 2.0 (the "License"); +// you may not use this file except in compliance with the License. +// You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, software +// distributed under the License is distributed on an "AS IS" BASIS, +// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +// See the License for the specific language governing permissions and +// limitations under the License. + +// Contract: applyCommandName is the exact engine.CommandNameApply +// value. Drift between the two would cause apply error hints to +// silently start suggesting the non-existent `--offline` flag — the +// apply hint formatter in pkg/engine/lookup_classify.go branches on +// equality with engine.CommandNameApply to omit the `--offline` +// remedy clause. +// +// `applyCommandName` is declared as `= engine.CommandNameApply` in +// pkg/commands/apply.go so a Go compile error catches drift at the +// declaration site already. This test is the belt-and-suspenders +// safety net: a future refactor that re-introduces a literal would +// fail here too. + +package commands + +import ( + "testing" + + "github.com/cozystack/talm/pkg/engine" +) + +func TestContract_ApplyCommandName_MatchesEngineConstant(t *testing.T) { + if applyCommandName != engine.CommandNameApply { + t.Errorf("applyCommandName=%q drifted from engine.CommandNameApply=%q — apply hints will silently suggest --offline", + applyCommandName, engine.CommandNameApply) + } +} diff --git a/pkg/commands/contract_stdout_silence_test.go b/pkg/commands/contract_stdout_silence_test.go new file mode 100644 index 0000000..eb2396a --- /dev/null +++ b/pkg/commands/contract_stdout_silence_test.go @@ -0,0 +1,282 @@ +// Copyright Cozystack Authors +// +// Licensed under the Apache License, Version 2.0 (the "License"); +// you may not use this file except in compliance with the License. +// You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, software +// distributed under the License is distributed on an "AS IS" BASIS, +// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +// See the License for the specific language governing permissions and +// limitations under the License. + +// Contract: `talm` never emits human-facing progress / status text to +// stdout. stdout is reserved for the rendered config stream (so +// `talm template --file X > Y` produces a clean Y). Progress lines — +// the `- talm: file=…` lines printed during multi-file template / +// apply runs — must go to stderr. +// +// Pinning the rule in two layers: +// +// 1. Source-level invariant: walk pkg/commands/*.go and assert that +// no `fmt.Print` / `fmt.Println` / `fmt.Printf` call starts a +// literal with "- talm: ". The only acceptable channel for that +// prefix is `fmt.Fprintf(os.Stderr, …)`. This locks all four +// known sites at once (template.go:225 + apply.go:422/425/571) +// and prevents regression at any future call site. +// +// 2. Integration: run `templateOneFile` in --offline mode against a +// synthetic chart, capture stdout and stderr separately, and +// assert stdout contains only the rendered config (modeline + +// warn banner + body) while the `- talm: file=…` progress line +// lands on stderr. + +package commands + +import ( + "context" + "go/ast" + "go/parser" + "go/token" + "os" + "path/filepath" + "strings" + "testing" +) + +// TestContract_NoTalmProgressOnStdout walks every .go source file in +// pkg/commands via go/ast and asserts no call expression with the +// shape `fmt.Print*("- talm: …", …)` OR `fmt.Fprint*(os.Stdout, "- +// talm: …", …)` exists. AST-level matching catches multi-line call +// sites that a line-by-line text grep would miss (e.g. +// `fmt.Printf(\n\t"- talm: …",\n\t…)`). +// +// The progress prefix MUST only be emitted via stderr writers +// (`fmt.Fprint*(os.Stderr, "- talm: …", …)`); the test inspects the +// first non-writer string-literal argument and rejects any call that +// targets stdout (explicit or default). +func TestContract_NoTalmProgressOnStdout(t *testing.T) { + entries, err := os.ReadDir(".") + if err != nil { + t.Fatalf("read pkg/commands directory: %v", err) + } + + fset := token.NewFileSet() + + var violations []string + + for _, entry := range entries { + if entry.IsDir() || !strings.HasSuffix(entry.Name(), ".go") || strings.HasSuffix(entry.Name(), "_test.go") { + continue + } + + path := filepath.Join(".", entry.Name()) + + file, parseErr := parser.ParseFile(fset, path, nil, parser.SkipObjectResolution) + if parseErr != nil { + t.Fatalf("parse %s: %v", path, parseErr) + } + + ast.Inspect(file, func(n ast.Node) bool { + call, ok := n.(*ast.CallExpr) + if !ok { + return true + } + + if v := violationForFmtCall(call, fset); v != "" { + violations = append(violations, path+":"+v) + } + + return true + }) + } + + if len(violations) > 0 { + t.Errorf("found %d stdout-progress violation(s):\n%s\nfix: rewrite as `fmt.Fprintf(os.Stderr, \"- talm: …\", …)`", + len(violations), strings.Join(violations, "\n")) + } +} + +// violationForFmtCall returns a human-readable position+description +// when `call` matches the forbidden shape: a fmt.{Print,Printf,Println} +// invocation or fmt.{Fprint,Fprintf,Fprintln}(os.Stdout, …) where the +// first non-writer argument is a string literal starting with `- talm:`. +// Returns the empty string otherwise. +func violationForFmtCall(call *ast.CallExpr, fset *token.FileSet) string { + sel, ok := call.Fun.(*ast.SelectorExpr) + if !ok { + return "" + } + + ident, ok := sel.X.(*ast.Ident) + if !ok || ident.Name != "fmt" { + return "" + } + + args := call.Args + + switch sel.Sel.Name { + case "Print", "Printf", "Println": + // stdout by default — first arg is the format / value. + case "Fprint", "Fprintf", "Fprintln": + // stdout only if first arg is os.Stdout; skip os.Stderr & writers. + if len(args) == 0 || !isOsStdout(args[0]) { + return "" + } + + args = args[1:] + default: + return "" + } + + if len(args) == 0 { + return "" + } + + lit, ok := args[0].(*ast.BasicLit) + if !ok || lit.Kind != token.STRING { + return "" + } + + if !strings.HasPrefix(lit.Value, `"- talm:`) { + return "" + } + + pos := fset.Position(call.Pos()) + + return pos.String() + ": forbidden fmt." + sel.Sel.Name + "(… " + lit.Value + " …) on stdout" +} + +func isOsStdout(e ast.Expr) bool { + sel, ok := e.(*ast.SelectorExpr) + if !ok { + return false + } + + ident, ok := sel.X.(*ast.Ident) + + return ok && ident.Name == "os" && sel.Sel.Name == "Stdout" +} + +// TestContract_TemplateProgress_GoesToStderr runs `templateOneFile` +// in offline mode against a synthetic chart, captures stdout and +// stderr separately, and asserts: +// - stdout contains the rendered config (modeline banner + body) +// - stdout does NOT contain the `- talm: file=` progress line +// - stderr DOES contain the progress line +// +// Without --offline, templateOneFile would dial a Talos node; --offline +// short-circuits that and exercises the in-process rendering path +// alone. The chart fixture is the same minimal layout used by the +// other generateOutput-level tests (Chart.yaml + values.yaml + +// secrets.yaml + templates/config.yaml). +func TestContract_TemplateProgress_GoesToStderr(t *testing.T) { + withTemplateFlagsSnapshot(t) + + root := makeMinimalChart(t) + + nodeFile := filepath.Join(root, "node0.yaml") + nodeBody := "# talm: nodes=[\"" + testNodeIP + "\"], endpoints=[\"" + testNodeIP + "\"], templates=[\"templates/config.yaml\"]\n" + + "machine:\n type: worker\n" + + if err := os.WriteFile(nodeFile, []byte(nodeBody), 0o600); err != nil { + t.Fatalf("write node file: %v", err) + } + + Config.RootDir = root + templateCmdFlags.configFiles = []string{nodeFile} + templateCmdFlags.offline = true + templateCmdFlags.templatesFromArgs = false + templateCmdFlags.nodesFromArgs = false + templateCmdFlags.endpointsFromArgs = false + + stdout, stderr := captureStdoutAndStderr(t, func() { + runErr := templateWithFiles(nil)(context.Background(), nil) + if runErr != nil { + t.Fatalf("templateWithFiles failed: %v", runErr) + } + }) + + if strings.Contains(stdout, "- talm: file=") { + t.Errorf("stdout MUST NOT contain `- talm: file=` progress line; got stdout:\n%s", stdout) + } + + if !strings.Contains(stderr, "- talm: file=") { + t.Errorf("stderr MUST contain `- talm: file=` progress line; got stderr:\n%s", stderr) + } + + if !strings.Contains(stdout, "# talm: nodes=") { + t.Errorf("stdout MUST contain rendered modeline `# talm: nodes=…`; got stdout:\n%s", stdout) + } + + if !strings.Contains(stdout, "machine:") { + t.Errorf("stdout MUST contain rendered config body (`machine:`); got stdout:\n%s", stdout) + } +} + +const testNodeIP = "192.0.2.10" // RFC 5737 TEST-NET-1 + +// captureStdoutAndStderr redirects both os.Stdout and os.Stderr to +// pipes for the duration of fn, returns whatever fn printed on each +// channel. Companion to captureStdout in contract_template_test.go; +// extracted as a sibling helper so future stdout/stderr-discipline +// tests can reuse the two-channel form without duplicating the +// goroutine plumbing. +func captureStdoutAndStderr(t *testing.T, fn func()) (string, string) { + t.Helper() + + rOut, wOut, err := os.Pipe() + if err != nil { + t.Fatalf("pipe stdout: %v", err) + } + + rErr, wErr, err := os.Pipe() + if err != nil { + t.Fatalf("pipe stderr: %v", err) + } + + origOut := os.Stdout + origErr := os.Stderr + os.Stdout = wOut + os.Stderr = wErr + + t.Cleanup(func() { + os.Stdout = origOut + os.Stderr = origErr + }) + + doneOut := drainPipe(rOut) + doneErr := drainPipe(rErr) + + fn() + _ = wOut.Close() + _ = wErr.Close() + + return <-doneOut, <-doneErr +} + +func drainPipe(r *os.File) <-chan string { + out := make(chan string, 1) + + go func() { + buf := make([]byte, 0, 64*1024) + readBuf := make([]byte, 4096) + + for { + n, err := r.Read(readBuf) + if n > 0 { + buf = append(buf, readBuf[:n]...) + } + + if err != nil { + break + } + } + + out <- string(buf) + }() + + return out +} diff --git a/pkg/commands/template.go b/pkg/commands/template.go index 32edb5d..3968adb 100644 --- a/pkg/commands/template.go +++ b/pkg/commands/template.go @@ -221,8 +221,8 @@ func templateOneFile(ctx context.Context, args []string, configFile string, firs GlobalArgs.Endpoints = modelineConfig.Endpoints } - //nolint:forbidigo // CLI progress line surfaces the file-to-target mapping for the operator - fmt.Printf("- talm: file=%s, nodes=%s, endpoints=%s, templates=%s\n", configFile, GlobalArgs.Nodes, GlobalArgs.Endpoints, templateCmdFlags.templateFiles) + // Progress line goes to stderr; stdout is reserved for the rendered config stream so `talm template --file X > Y` produces a clean Y. + fmt.Fprintf(os.Stderr, "- talm: file=%s, nodes=%s, endpoints=%s, templates=%s\n", configFile, GlobalArgs.Nodes, GlobalArgs.Endpoints, templateCmdFlags.templateFiles) if len(GlobalArgs.Nodes) < 1 { //nolint:wrapcheck // sentinel constructed in-place; WithHint attaches operator guidance @@ -315,7 +315,8 @@ func generateOutput(ctx context.Context, c *client.Client, _ []string) (string, Offline: templateCmdFlags.offline, KubernetesVersion: templateCmdFlags.kubernetesVersion, TemplateFiles: resolvedTemplateFiles, - CommandName: "talm template", + CommandName: engine.CommandNameTemplate, + TalosEndpoints: append([]string(nil), GlobalArgs.Endpoints...), } result, err := engine.Render(ctx, c, opts) diff --git a/pkg/engine/contract_lookup_classify_test.go b/pkg/engine/contract_lookup_classify_test.go new file mode 100644 index 0000000..38df298 --- /dev/null +++ b/pkg/engine/contract_lookup_classify_test.go @@ -0,0 +1,163 @@ +// Copyright Cozystack Authors +// +// Licensed under the Apache License, Version 2.0 (the "License"); +// you may not use this file except in compliance with the License. +// You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, software +// distributed under the License is distributed on an "AS IS" BASIS, +// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +// See the License for the specific language governing permissions and +// limitations under the License. + +// Contract: classifyLookupError partitions gRPC failures from the +// talos client into one of six diagnosable classes. Each class drives +// a distinct human-facing hint downstream; misclassification leads the +// operator to a useless remedy ("pass --offline" is dangerous on a +// real chart bug, "verify cert SANs" is irrelevant on a deadline). + +package engine + +import ( + "testing" + + "google.golang.org/grpc/codes" + "google.golang.org/grpc/status" +) + +func TestClassifyLookupError_TLSHandshake(t *testing.T) { + err := status.Error(codes.Unavailable, "connection error: desc = \"transport: authentication handshake failed: EOF\"") + if got := classifyLookupError(err); got != lookupErrTLSHandshake { + t.Errorf("got %v, want lookupErrTLSHandshake", got) + } +} + +func TestClassifyLookupError_TLSHandshake_TLSPrefix(t *testing.T) { + err := status.Error(codes.Unavailable, "connection error: desc = \"tls: bad certificate\"") + if got := classifyLookupError(err); got != lookupErrTLSHandshake { + t.Errorf("got %v, want lookupErrTLSHandshake", got) + } +} + +func TestClassifyLookupError_Refused(t *testing.T) { + err := status.Error(codes.Unavailable, "connection error: desc = \"transport: Error while dialing dial tcp 192.0.2.10:50000: connect: connection refused\"") + if got := classifyLookupError(err); got != lookupErrRefused { + t.Errorf("got %v, want lookupErrRefused", got) + } +} + +func TestClassifyLookupError_Refused_NoRouteToHost(t *testing.T) { + err := status.Error(codes.Unavailable, "no route to host") + if got := classifyLookupError(err); got != lookupErrRefused { + t.Errorf("got %v, want lookupErrRefused", got) + } +} + +func TestClassifyLookupError_DeadlineExceeded(t *testing.T) { + err := status.Error(codes.DeadlineExceeded, "context deadline exceeded") + if got := classifyLookupError(err); got != lookupErrDeadline { + t.Errorf("got %v, want lookupErrDeadline", got) + } +} + +func TestClassifyLookupError_Authn(t *testing.T) { + err := status.Error(codes.Unauthenticated, "talosconfig credentials rejected") + if got := classifyLookupError(err); got != lookupErrAuthn { + t.Errorf("got %v, want lookupErrAuthn", got) + } +} + +func TestClassifyLookupError_Authn_UnknownAuthority(t *testing.T) { + err := status.Error(codes.Unavailable, "x509: certificate signed by unknown authority") + if got := classifyLookupError(err); got != lookupErrAuthn { + t.Errorf("got %v, want lookupErrAuthn", got) + } +} + +// Contract: an x509 SAN mismatch ("certificate is valid for X, not +// Y") classifies as TLSHandshake, not Authn. The cert chain trusts +// fine — the certificate just doesn't cover the dialed address. The +// TLS-handshake hint explicitly names cert SANs as a common cause; +// the Authn hint would mislead the operator toward talosconfig +// credentials, which is the wrong fix. +func TestClassifyLookupError_TLSHandshake_x509SANMismatch(t *testing.T) { + err := status.Error(codes.Unavailable, "x509: certificate is valid for 192.0.2.10, not 192.0.2.11") + if got := classifyLookupError(err); got != lookupErrTLSHandshake { + t.Errorf("got %v, want lookupErrTLSHandshake (SAN mismatch → TLS, not Authn)", got) + } +} + +// Contract: an x509 expiry error also classifies as TLSHandshake. +// The TLS-handshake hint's "node may be in maintenance mode / wrong +// port" causes don't directly cover expiry, but routing here is +// still more useful than the Authn branch — the operator's next +// action is to renew the cert, not to touch talosconfig. +func TestClassifyLookupError_TLSHandshake_x509Expired(t *testing.T) { + err := status.Error(codes.Unavailable, "x509: certificate has expired or is not yet valid") + if got := classifyLookupError(err); got != lookupErrTLSHandshake { + t.Errorf("got %v, want lookupErrTLSHandshake (expired cert → TLS, not Authn)", got) + } +} + +func TestClassifyLookupError_Resource_InternalCode(t *testing.T) { + err := status.Error(codes.Internal, "no such resource type") + if got := classifyLookupError(err); got != lookupErrResource { + t.Errorf("got %v, want lookupErrResource", got) + } +} + +func TestClassifyLookupError_Resource_InvalidArgument(t *testing.T) { + err := status.Error(codes.InvalidArgument, "invalid resource selector") + if got := classifyLookupError(err); got != lookupErrResource { + t.Errorf("got %v, want lookupErrResource", got) + } +} + +// Contract: NotFound from the helpers.ForEachResource return value +// (ResolveResourceKind couldn't find the kind in the target Talos +// version) classifies as Resource, NOT Unknown. The per-node +// callback filters NotFound for missing instances, so any NotFound +// surfacing here is definitionally a chart bug / version mismatch +// signal — the operator needs the "verify with talosctl get" / +// "don't try --offline" hint, not the generic "file an issue" one. +func TestClassifyLookupError_Resource_NotFound_FromResolveResourceKind(t *testing.T) { + err := status.Error(codes.NotFound, "resource type \"nonexistent\" not registered") + if got := classifyLookupError(err); got != lookupErrResource { + t.Errorf("NotFound at the helper level (ResolveResourceKind miss) must classify as Resource; got %v", got) + } +} + +func TestClassifyLookupError_Resource_Unimplemented(t *testing.T) { + err := status.Error(codes.Unimplemented, "method not implemented in this Talos version") + if got := classifyLookupError(err); got != lookupErrResource { + t.Errorf("Unimplemented (older Talos lacking the resource API) must classify as Resource; got %v", got) + } +} + +func TestClassifyLookupError_Unknown_PlainError(t *testing.T) { + if got := classifyLookupError(errTestNonGRPC); got != lookupErrUnknown { + t.Errorf("got %v, want lookupErrUnknown", got) + } +} + +// Contract: a bare Unavailable with no diagnostic substring falls +// back to lookupErrRefused — the dominant real-world cause of plain +// Unavailable from gRPC dial. The TLS-handshake and connection-refused +// substrings are the precise discriminators; absence of both means +// "we know the channel isn't up but can't pinpoint why," and +// "verify network path / firewall / port" is a strictly more useful +// remedy than the generic Unknown fallback. +func TestClassifyLookupError_Unavailable_GenericText(t *testing.T) { + err := status.Error(codes.Unavailable, "something went wrong") + if got := classifyLookupError(err); got != lookupErrRefused { + t.Errorf("got %v, want lookupErrRefused (Unavailable default)", got) + } +} + +func TestClassifyLookupError_Nil(t *testing.T) { + if got := classifyLookupError(nil); got != lookupErrUnknown { + t.Errorf("got %v, want lookupErrUnknown for nil input", got) + } +} diff --git a/pkg/engine/contract_lookup_error_diagnostics_test.go b/pkg/engine/contract_lookup_error_diagnostics_test.go new file mode 100644 index 0000000..770c304 --- /dev/null +++ b/pkg/engine/contract_lookup_error_diagnostics_test.go @@ -0,0 +1,350 @@ +// Copyright Cozystack Authors +// +// Licensed under the Apache License, Version 2.0 (the "License"); +// you may not use this file except in compliance with the License. +// You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, software +// distributed under the License is distributed on an "AS IS" BASIS, +// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +// See the License for the specific language governing permissions and +// limitations under the License. + +// Contract: wrapLookupError enriches every lookup failure with two +// layers of operator-facing context: +// +// 1. ALWAYS: the wrapped message names the resource kind, namespace, +// id, and dialed endpoints. The pre-fix chain stripped this and +// left the operator with a generic "iterating resources: rpc +// error: …" trail. +// +// 2. ALWAYS: an errors.WithHint annotation steers the operator +// toward the right remedy based on classifyLookupError's verdict. +// The hint distinguishes connectivity-class failures (TLS / +// refused / deadline / authn — where `--offline` is a safe +// escape for `talm template`) from resource-class failures +// (`--offline` would mask a real chart bug — explicitly NOT +// suggested in that branch). +// +// For `talm apply`, hints never mention `--offline` because apply has +// no such flag; suggesting it would teach a non-existent workflow. + +package engine + +import ( + "strings" + "testing" + + cockroachErrors "github.com/cockroachdb/errors" + "google.golang.org/grpc/codes" + "google.golang.org/grpc/status" +) + +// errTestNonGRPC is a sentinel returned by tests that need a +// non-gRPC, non-status error to exercise the unknown-class branch. +// Package-level so err113 sees it as a wrapped static error rather +// than a dynamic `errors.New` at the call site. +var errTestNonGRPC = cockroachErrors.New("some non-grpc thing went wrong") + +const ( + testLookupKind = "disks" + testLookupNamespace = "" + testLookupDocID = "" + testLookupCmdTemplate = "talm template" + // Pinned to the exported engine constant so any drift between + // pkg/commands/apply.go (which also reads CommandNameApply) and + // the apply-no-offline contract surfaces here at compile / run + // time rather than as a silent hint regression. + testLookupCmdApply = CommandNameApply +) + +var testLookupEndpoints = []string{"192.0.2.10"} //nolint:gochecknoglobals // table-test fixture; immutable + +// === Level 1: wrap context (always-on, class-independent) === + +func TestContract_WrapLookupError_Nil(t *testing.T) { + got := wrapLookupError(nil, testLookupKind, testLookupNamespace, testLookupDocID, testLookupEndpoints, testLookupCmdTemplate) + if got != nil { + t.Errorf("nil input must produce nil output; got %v", got) + } +} + +func TestContract_WrapLookupError_NamesKindNamespaceIdEndpoints(t *testing.T) { + inner := status.Error(codes.Unavailable, "connection error: desc = \"transport: authentication handshake failed: EOF\"") + got := wrapLookupError(inner, "disks", "system", "system-disk", []string{"192.0.2.10", "192.0.2.11"}, testLookupCmdTemplate) + + if got == nil { + t.Fatal("expected non-nil error") + } + + msg := got.Error() + for _, want := range []string{ + `kind="disks"`, + `namespace="system"`, + `id="system-disk"`, + `192.0.2.10`, + `192.0.2.11`, + } { + if !strings.Contains(msg, want) { + t.Errorf("wrapped message missing %q; got: %s", want, msg) + } + } + + if !strings.Contains(msg, "handshake failed") { + t.Errorf("wrapped message must preserve the underlying error; got: %s", msg) + } +} + +// === Level 2: class-specific hints === + +func hintsOf(t *testing.T, err error) string { + t.Helper() + + hints := cockroachErrors.GetAllHints(err) + + return strings.Join(hints, "\n") +} + +func TestContract_WrapLookupError_TLSHandshake_HintMentionsCertSANsAndOffline(t *testing.T) { + inner := status.Error(codes.Unavailable, "connection error: desc = \"transport: authentication handshake failed: EOF\"") + got := wrapLookupError(inner, testLookupKind, testLookupNamespace, testLookupDocID, testLookupEndpoints, testLookupCmdTemplate) + hint := hintsOf(t, got) + + for _, want := range []string{"TLS handshake", "cert SAN", "maintenance mode", "--offline"} { + if !strings.Contains(hint, want) { + t.Errorf("TLS-handshake hint missing %q; got hint:\n%s", want, hint) + } + } + + if !strings.Contains(hint, "192.0.2.10") { + t.Errorf("hint must name the dialed endpoint; got:\n%s", hint) + } + + if !strings.Contains(hint, "talosctl --endpoints 192.0.2.10 get nodeaddress") { + t.Errorf("hint must point at `talosctl --endpoints X get nodeaddress`; got:\n%s", hint) + } +} + +// Contract: the talosctl suggestion lists ALL configured endpoints +// comma-separated, not just the first one. Naming only the first +// would mislead an operator whose actual failing node is somewhere +// in the middle of the list (e.g. 3-node modeline, middle node +// down). +func TestContract_WrapLookupError_TLSHandshake_MultipleEndpoints_AllListed(t *testing.T) { + inner := status.Error(codes.Unavailable, "connection error: desc = \"transport: authentication handshake failed: EOF\"") + got := wrapLookupError(inner, testLookupKind, testLookupNamespace, testLookupDocID, []string{"192.0.2.10", "192.0.2.11", "192.0.2.12"}, testLookupCmdTemplate) + hint := hintsOf(t, got) + + if !strings.Contains(hint, "talosctl --endpoints 192.0.2.10,192.0.2.11,192.0.2.12 get nodeaddress") { + t.Errorf("hint must list all endpoints comma-separated; got:\n%s", hint) + } +} + +// Contract: when endpoints is empty (e.g. no --endpoints flag, no +// modeline endpoints, no talosconfig context), the `talosctl` +// suggestion must remain grammatical — no double space, no dangling +// `--endpoints` with no value, no literal `[]` leaking through +// fmt.Sprintf("%v", nil). Reaches the engine through +// `engine.Options.TalosEndpoints` populated by `append([]string(nil), +// GlobalArgs.Endpoints...)`; an empty GlobalArgs.Endpoints flows +// through as a nil/empty slice. +func TestContract_WrapLookupError_TLSHandshake_EmptyEndpoints(t *testing.T) { + inner := status.Error(codes.Unavailable, "connection error: desc = \"transport: authentication handshake failed: EOF\"") + got := wrapLookupError(inner, testLookupKind, testLookupNamespace, testLookupDocID, nil, testLookupCmdTemplate) + hint := hintsOf(t, got) + + if strings.Contains(hint, "talosctl get") || strings.Contains(hint, "talosctl --") { + t.Errorf("hint has a double space — empty endpoints leaked into the suggestion; got:\n%s", hint) + } + + if strings.Contains(hint, "--endpoints get") || strings.Contains(hint, "--endpoints \n") { + t.Errorf("hint has a dangling --endpoints with no value; got:\n%s", hint) + } + + if !strings.Contains(hint, "talosctl get nodeaddress") { + t.Errorf("hint must fall back to a clean `talosctl get nodeaddress` form; got:\n%s", hint) + } + + if strings.Contains(hint, "endpoint []") || strings.Contains(hint, "aborted ") { + t.Errorf("empty endpoints leaked into the hint lead text (literal `[]` or double space); got:\n%s", hint) + } +} + +// Contract: every hint — including resource-class and unknown-class +// — must read grammatically with empty endpoints. No leading-bracket +// leak, no double space before the em-dash. The resource branch +// historically never names endpoints (chart bug is endpoint- +// independent) but the test still pins absence of `[]` so a future +// refactor that introduces an endpoint reference there is forced to +// handle the empty case. +func TestContract_WrapLookupError_AllClasses_EmptyEndpointsGrammatical(t *testing.T) { + cases := []struct { + name string + inner error + }{ + {"tls", status.Error(codes.Unavailable, "transport: authentication handshake failed: EOF")}, + {"refused", status.Error(codes.Unavailable, "connection refused")}, + {"deadline", status.Error(codes.DeadlineExceeded, "context deadline exceeded")}, + {"authn", status.Error(codes.Unauthenticated, "rejected")}, + {"resource", status.Error(codes.Internal, "no such resource type")}, + {"unknown_class", errTestNonGRPC}, + } + + for _, tc := range cases { + t.Run(tc.name, func(t *testing.T) { + got := wrapLookupError(tc.inner, testLookupKind, testLookupNamespace, testLookupDocID, nil, testLookupCmdTemplate) + hint := hintsOf(t, got) + + if strings.Contains(hint, "[]") { + t.Errorf("literal `[]` leaked into hint (fmt.Sprintf %%v on nil slice); got:\n%s", hint) + } + + if strings.Contains(hint, " ") { + t.Errorf("double space in hint (empty endpoints leaked); got:\n%s", hint) + } + }) + } +} + +func TestContract_WrapLookupError_Refused_HintMentionsFirewallAndOffline(t *testing.T) { + inner := status.Error(codes.Unavailable, "connection error: desc = \"transport: Error while dialing dial tcp 192.0.2.10:50000: connect: connection refused\"") + got := wrapLookupError(inner, testLookupKind, testLookupNamespace, testLookupDocID, testLookupEndpoints, testLookupCmdTemplate) + hint := hintsOf(t, got) + + for _, want := range []string{"refused", "firewall", "--offline"} { + if !strings.Contains(hint, want) { + t.Errorf("refused hint missing %q; got hint:\n%s", want, hint) + } + } +} + +func TestContract_WrapLookupError_Deadline_HintMentionsTimeoutAndOffline(t *testing.T) { + inner := status.Error(codes.DeadlineExceeded, "context deadline exceeded") + got := wrapLookupError(inner, testLookupKind, testLookupNamespace, testLookupDocID, testLookupEndpoints, testLookupCmdTemplate) + hint := hintsOf(t, got) + + for _, want := range []string{"timed out", "--offline"} { + if !strings.Contains(hint, want) { + t.Errorf("deadline hint missing %q; got hint:\n%s", want, hint) + } + } +} + +func TestContract_WrapLookupError_Authn_HintMentionsTalosconfigAndOffline(t *testing.T) { + inner := status.Error(codes.Unauthenticated, "talosconfig credentials rejected") + got := wrapLookupError(inner, testLookupKind, testLookupNamespace, testLookupDocID, testLookupEndpoints, testLookupCmdTemplate) + hint := hintsOf(t, got) + + for _, want := range []string{"talosconfig", "--offline"} { + if !strings.Contains(hint, want) { + t.Errorf("authn hint missing %q; got hint:\n%s", want, hint) + } + } +} + +// Contract: a resource-class error (chart bug / unsupported resource) +// must NEVER suggest --offline — `--offline` would render the chart +// against an empty discovery surface and produce broken output that +// looks valid. The hint instead steers the operator at `talosctl get +// ` to verify the resource exists. +func TestContract_WrapLookupError_Resource_HintWarnsAgainstOffline(t *testing.T) { + inner := status.Error(codes.Internal, "no such resource type") + got := wrapLookupError(inner, testLookupKind, testLookupNamespace, testLookupDocID, testLookupEndpoints, testLookupCmdTemplate) + hint := hintsOf(t, got) + + if !strings.Contains(hint, "chart") { + t.Errorf("resource-class hint must mention chart bug; got:\n%s", hint) + } + + if !strings.Contains(hint, "talosctl get") { + t.Errorf("resource-class hint must point at `talosctl get`; got:\n%s", hint) + } + + if strings.Contains(hint, "--offline") { + t.Errorf("resource-class hint must NOT suggest --offline (would mask chart bug); got:\n%s", hint) + } +} + +// Contract: for `talm apply`, no hint ever mentions --offline because +// apply has no such flag. The diagnosis itself stays the same; only +// the remedy phrasing flips ("verify reachability" instead). +func TestContract_WrapLookupError_Apply_HintNeverMentionsOffline(t *testing.T) { + cases := []struct { + name string + inner error + }{ + {"tls", status.Error(codes.Unavailable, "connection error: desc = \"transport: authentication handshake failed: EOF\"")}, + {"refused", status.Error(codes.Unavailable, "connection refused")}, + {"deadline", status.Error(codes.DeadlineExceeded, "context deadline exceeded")}, + {"authn", status.Error(codes.Unauthenticated, "rejected")}, + } + + for _, tc := range cases { + t.Run(tc.name, func(t *testing.T) { + got := wrapLookupError(tc.inner, testLookupKind, testLookupNamespace, testLookupDocID, testLookupEndpoints, testLookupCmdApply) + hint := hintsOf(t, got) + + if hint == "" { + t.Errorf("apply hint must not be empty; got error: %v", got) + } + + if strings.Contains(hint, "--offline") { + t.Errorf("apply hint must NOT mention --offline (flag does not exist for apply); got:\n%s", hint) + } + }) + } +} + +// Contract: when CommandName is empty (caller forgot to set it, +// untested call path, future subcommand) the hint MUST NOT suggest +// `--offline`. The remedy clause is allow-listed against +// CommandNameTemplate: anything else falls through to the safe +// generic "fix reachability" form. Reviewer flagged this as a real +// failure mode — engine.go defaults cmdName to "talm" when empty, +// and "talm" was previously falling into the template branch and +// suggesting --offline incorrectly. +func TestContract_WrapLookupError_EmptyCommandName_NoOfflineSuggestion(t *testing.T) { + inner := status.Error(codes.Unavailable, "transport: authentication handshake failed: EOF") + got := wrapLookupError(inner, testLookupKind, testLookupNamespace, testLookupDocID, testLookupEndpoints, "") + hint := hintsOf(t, got) + + if strings.Contains(hint, "--offline") { + t.Errorf("empty commandName must not get --offline suggestion (no such flag exists on unknown subcommands); got hint:\n%s", hint) + } + + if hint == "" { + t.Errorf("empty commandName must still receive a fallback hint; got error: %v", got) + } +} + +// Contract: an unrecognized commandName (e.g. a future subcommand +// not yet allow-listed) MUST also fall to the safe remedy. Mirrors +// the apply branch's safety semantics. +func TestContract_WrapLookupError_UnknownCommandName_NoOfflineSuggestion(t *testing.T) { + inner := status.Error(codes.Unavailable, "connection refused") + got := wrapLookupError(inner, testLookupKind, testLookupNamespace, testLookupDocID, testLookupEndpoints, "talm future-subcommand") + hint := hintsOf(t, got) + + if strings.Contains(hint, "--offline") { + t.Errorf("unknown commandName must not get --offline suggestion; got hint:\n%s", hint) + } +} + +// Contract: unknown class still gets a hint, but a generic one that +// guides toward filing an issue rather than guessing a specific +// remedy. Pinning so a refactor that drops the fallback hint +// surfaces here. +func TestContract_WrapLookupError_Unknown_HasFallbackHint(t *testing.T) { + got := wrapLookupError(errTestNonGRPC, "machinetype", testLookupNamespace, testLookupDocID, testLookupEndpoints, testLookupCmdTemplate) + hint := hintsOf(t, got) + + if hint == "" { + t.Errorf("unknown-class error must still carry a fallback hint; got error: %v", got) + } + + if !strings.Contains(got.Error(), `kind="machinetype"`) { + t.Errorf("wrap must name the kind passed by caller; got: %s", got) + } +} diff --git a/pkg/engine/contract_lookup_retry_test.go b/pkg/engine/contract_lookup_retry_test.go new file mode 100644 index 0000000..cfe3f29 --- /dev/null +++ b/pkg/engine/contract_lookup_retry_test.go @@ -0,0 +1,445 @@ +// Copyright Cozystack Authors +// +// Licensed under the Apache License, Version 2.0 (the "License"); +// you may not use this file except in compliance with the License. +// You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, software +// distributed under the License is distributed on an "AS IS" BASIS, +// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +// See the License for the specific language governing permissions and +// limitations under the License. + +// Contract: retryWithFailFast wraps a single `lookup` attempt with a +// bounded retry loop. Transient connectivity classes (refused, +// deadline) are retried up to policy.maxAttempts; deterministic +// classes (TLS handshake, authn, resource, unknown) return after the +// first attempt — the underlying condition won't clear on a second +// try and the operator should not pay the latency. +// +// Context cancellation interrupts the inter-attempt backoff +// immediately so Ctrl+C in `talm template / apply` is responsive. + +package engine + +import ( + "context" + "errors" + "testing" + "time" + + "github.com/hashicorp/go-multierror" + "google.golang.org/grpc/codes" + "google.golang.org/grpc/status" +) + +// testInstantPolicy disables backoff so unit tests don't wait. The +// retry-count contract is tested without sleeping; the actual backoff +// shape is exercised by TestRetryWithFailFast_BoundedTime. +// +//nolint:gochecknoglobals // test fixture, immutable +var testInstantPolicy = retryPolicy{ + maxAttempts: 3, + backoff: func(int) time.Duration { return 0 }, +} + +// grpcTestErr is a tiny test helper that returns a gRPC status error +// verbatim. wrapcheck normally insists on `errors.Wrap` around any +// error returned from an external package; for unit tests that need +// to inject a specific gRPC error shape into the function under test, +// wrapping would defeat the purpose. The one nolint here is paid +// once instead of at every retry-test call site. +// +//nolint:wrapcheck // test fixture: returning a raw gRPC error is the point +func grpcTestErr(c codes.Code, msg string) error { + return status.Error(c, msg) +} + +func TestRetryWithFailFast_NoRetryOnImmediateSuccess(t *testing.T) { + calls := 0 + + err := retryWithFailFast(t.Context(), func() error { + calls++ + + return nil + }, defaultShouldRetry, testInstantPolicy) + if err != nil { + t.Fatalf("expected nil error; got %v", err) + } + + if calls != 1 { + t.Errorf("expected 1 attempt on immediate success; got %d", calls) + } +} + +func TestRetryWithFailFast_RetriesOnRefused_SucceedsSecondAttempt(t *testing.T) { + calls := 0 + + err := retryWithFailFast(t.Context(), func() error { + calls++ + + if calls == 1 { + return grpcTestErr(codes.Unavailable, "connection refused") + } + + return nil + }, defaultShouldRetry, testInstantPolicy) + if err != nil { + t.Fatalf("expected nil error on retry success; got %v", err) + } + + if calls != 2 { + t.Errorf("expected 2 attempts (refused then ok); got %d", calls) + } +} + +func TestRetryWithFailFast_RetriesOnRefused_AllFail(t *testing.T) { + calls := 0 + + err := retryWithFailFast(t.Context(), func() error { + calls++ + + return grpcTestErr(codes.Unavailable, "connection refused") + }, defaultShouldRetry, testInstantPolicy) + if err == nil { + t.Fatal("expected error after exhausting retries") + } + + if calls != testInstantPolicy.maxAttempts { + t.Errorf("expected %d attempts; got %d", testInstantPolicy.maxAttempts, calls) + } +} + +func TestRetryWithFailFast_RetriesOnDeadline(t *testing.T) { + calls := 0 + + err := retryWithFailFast(t.Context(), func() error { + calls++ + + if calls < testInstantPolicy.maxAttempts { + return grpcTestErr(codes.DeadlineExceeded, "context deadline exceeded") + } + + return nil + }, defaultShouldRetry, testInstantPolicy) + if err != nil { + t.Fatalf("expected nil on retry success; got %v", err) + } + + if calls != testInstantPolicy.maxAttempts { + t.Errorf("expected %d attempts before success; got %d", testInstantPolicy.maxAttempts, calls) + } +} + +func TestRetryWithFailFast_FailsFastOnTLSHandshake(t *testing.T) { + calls := 0 + err := retryWithFailFast(t.Context(), func() error { + calls++ + + return grpcTestErr(codes.Unavailable, "transport: authentication handshake failed: EOF") + }, defaultShouldRetry, testInstantPolicy) + if err == nil { + t.Fatal("expected TLS error to bubble up") + } + + if calls != 1 { + t.Errorf("TLS handshake must fail fast (deterministic — cert SAN / maintenance — no retry); got %d attempts", calls) + } +} + +func TestRetryWithFailFast_FailsFastOnAuthn(t *testing.T) { + calls := 0 + + err := retryWithFailFast(t.Context(), func() error { + calls++ + + return grpcTestErr(codes.Unauthenticated, "credentials rejected") + }, defaultShouldRetry, testInstantPolicy) + if err == nil { + t.Fatal("expected authn error to bubble up") + } + + if calls != 1 { + t.Errorf("authn must fail fast (bad credentials won't clear); got %d attempts", calls) + } +} + +func TestRetryWithFailFast_FailsFastOnResource(t *testing.T) { + calls := 0 + + err := retryWithFailFast(t.Context(), func() error { + calls++ + + return grpcTestErr(codes.Internal, "no such resource type") + }, defaultShouldRetry, testInstantPolicy) + if err == nil { + t.Fatal("expected resource error to bubble up") + } + + if calls != 1 { + t.Errorf("resource-class must fail fast (chart bug, deterministic); got %d attempts", calls) + } +} + +func TestRetryWithFailFast_FailsFastOnUnknown(t *testing.T) { + calls := 0 + + err := retryWithFailFast(t.Context(), func() error { + calls++ + + return errTestNonGRPC + }, defaultShouldRetry, testInstantPolicy) + if err == nil { + t.Fatal("expected unknown error to bubble up") + } + + if calls != 1 { + t.Errorf("unknown-class must fail fast (cannot assume transience); got %d attempts", calls) + } +} + +func TestRetryWithFailFast_HonoursContextCancellation(t *testing.T) { + ctx, cancel := context.WithCancel(t.Context()) + + slowPolicy := retryPolicy{ + maxAttempts: 3, + backoff: func(int) time.Duration { return 10 * time.Second }, + } + + calls := 0 + + go func() { + // Give the first attempt time to enter the backoff sleep + // before cancelling. + time.Sleep(50 * time.Millisecond) + cancel() + }() + + start := time.Now() + + err := retryWithFailFast(ctx, func() error { + calls++ + + return grpcTestErr(codes.Unavailable, "connection refused") + }, defaultShouldRetry, slowPolicy) + elapsed := time.Since(start) + + if err == nil { + t.Fatal("expected error on cancellation") + } + + // Backoff is 10s; we should be back well under that if cancellation + // interrupted the sleep correctly. + if elapsed > 2*time.Second { + t.Errorf("cancellation did not interrupt backoff; elapsed=%v", elapsed) + } + + if calls < 1 || calls > 2 { + t.Errorf("expected 1-2 attempts before cancellation; got %d", calls) + } +} + +func TestRetryWithFailFast_BoundedTime(t *testing.T) { + policy := retryPolicy{ + maxAttempts: 3, + backoff: interAttemptBackoff, + } + + calls := 0 + start := time.Now() + + err := retryWithFailFast(t.Context(), func() error { + calls++ + + return grpcTestErr(codes.Unavailable, "connection refused") + }, defaultShouldRetry, policy) + elapsed := time.Since(start) + + if err == nil { + t.Fatal("expected error after exhausting retries") + } + + if calls != policy.maxAttempts { + t.Errorf("expected %d attempts; got %d", policy.maxAttempts, calls) + } + + // Production backoff: 200ms · 2^n between attempts 0→1 and 1→2. + // 200ms + 400ms = 600ms; allow up to 2s for scheduling slack. + if elapsed > 2*time.Second { + t.Errorf("production retry budget exceeded 2s; elapsed=%v", elapsed) + } + + if elapsed < 500*time.Millisecond { + t.Errorf("production retry budget too short — backoff not actually waiting; elapsed=%v", elapsed) + } +} + +// Contract: firstLookupError surfaces per-node multierror failures +// to the retry classifier, which closes the gap that would otherwise +// leave the "brief partition against one node in a multi-node +// lookup" case unretried. helpers.ForEachResource itself returns nil +// when only callback (per-node) errors occurred; pre-fix the +// closure returned that nil, retryWithFailFast exited successfully, +// and the multiErr was wrapped+hinted without a second attempt. +func TestFirstLookupError_PrefersHelperErrOverMultiErr(t *testing.T) { + helperErr := grpcTestErr(codes.Unavailable, "transport: handshake failed") + mErr := &multierror.Error{} + mErr = multierror.Append(mErr, grpcTestErr(codes.Unavailable, "connection refused")) + + got := firstLookupError(helperErr, mErr) + if !errors.Is(got, helperErr) { + t.Errorf("expected helperErr to win over multiErr; got %v", got) + } +} + +func TestFirstLookupError_SurfacesMultiErrWhenHelperErrNil(t *testing.T) { + mErr := &multierror.Error{} + mErr = multierror.Append(mErr, grpcTestErr(codes.Unavailable, "connection refused")) + + got := firstLookupError(nil, mErr) + if got == nil { + t.Fatal("multiErr-only failure must surface (so retry can see it); got nil") + } + + if classifyLookupError(got) != lookupErrRefused { + t.Errorf("classifier must see Refused class through multierror wrap; got class %v", classifyLookupError(got)) + } +} + +func TestFirstLookupError_NilBoth(t *testing.T) { + if got := firstLookupError(nil, nil); got != nil { + t.Errorf("nil/nil must return nil; got %v", got) + } +} + +func TestFirstLookupError_NilMultiErr(t *testing.T) { + if got := firstLookupError(nil, (*multierror.Error)(nil)); got != nil { + t.Errorf("nil/nil-multierror must return nil; got %v", got) + } +} + +// Contract: when multiErr has accumulated errors from multiple nodes +// with DIFFERENT classes (e.g. node A: x509 / authn; node B: refused), +// firstLookupError returns the LAST appended error rather than the +// multierror envelope. classifyLookupError downstream substring- +// matches err.Error(); if we returned the multierror, the x509 from +// node A would mis-flip the whole batch to lookupErrAuthn and node +// B's refused (the more relevant signal — latest outcome) would be +// masked. +func TestFirstLookupError_PicksLastAppended_NotMultierror(t *testing.T) { + mErr := &multierror.Error{} + mErr = multierror.Append(mErr, grpcTestErr(codes.Unavailable, "x509: certificate signed by unknown authority")) + mErr = multierror.Append(mErr, grpcTestErr(codes.Unavailable, "connection refused")) + + got := firstLookupError(nil, mErr) + if got == nil { + t.Fatal("expected non-nil error from multiErr; got nil") + } + + if classifyLookupError(got) != lookupErrRefused { + t.Errorf("classifier must see the LAST error (refused) — not whole multierror concat which would match x509 first; got class %v", classifyLookupError(got)) + } +} + +// Contract: a multierror-only refused failure on attempt 1 triggers +// a second attempt (proving the retry path sees per-node errors). +// The pre-fix closure returned only helperErr (nil here) so retry +// exited on attempt 1 and the operator never got the retry budget +// for the dominant transient class. +func TestRetryWithFailFast_RetriesOnMultiErrOnly(t *testing.T) { + calls := 0 + + err := retryWithFailFast(t.Context(), func() error { + calls++ + + if calls == 1 { + mErr := &multierror.Error{} + mErr = multierror.Append(mErr, grpcTestErr(codes.Unavailable, "connection refused")) + + return firstLookupError(nil, mErr) + } + + return nil + }, defaultShouldRetry, testInstantPolicy) + if err != nil { + t.Fatalf("expected nil after retry success; got %v", err) + } + + if calls != 2 { + t.Errorf("expected 2 attempts (multierror refused then ok); got %d", calls) + } +} + +func TestRetryableLookupError(t *testing.T) { + cases := []struct { + class lookupErrorClass + want bool + }{ + {lookupErrRefused, true}, + {lookupErrDeadline, true}, + {lookupErrTLSHandshake, false}, + {lookupErrAuthn, false}, + {lookupErrResource, false}, + {lookupErrUnknown, false}, + } + + for _, tc := range cases { + if got := retryableLookupError(tc.class); got != tc.want { + t.Errorf("retryableLookupError(%v): got %v, want %v", tc.class, got, tc.want) + } + } +} + +// Contract: when the closure-derived `shouldRetry` reports false +// because partial data has been collected, retry stops on attempt 1 +// even with a connection-refused error. Mirrors the stale-modeline +// case in newLookupFunction: one node responds with data, another +// is persistently down. Retry would discard the good data and pay +// up to 600ms while reaching the same final verdict — fast-path +// avoids both costs. +func TestRetryWithFailFast_PartialSuccessFastPath(t *testing.T) { + calls := 0 + partialSuccessReached := false + + shouldRetry := func(err error) bool { + if partialSuccessReached { + return false + } + + return retryableLookupError(classifyLookupError(err)) + } + + err := retryWithFailFast(t.Context(), func() error { + calls++ + partialSuccessReached = true + + return grpcTestErr(codes.Unavailable, "connection refused") + }, shouldRetry, testInstantPolicy) + if err == nil { + t.Fatal("expected error to bubble up; partial-success fast-path returns the error, not nil") + } + + if calls != 1 { + t.Errorf("partial-success closure must short-circuit retry on attempt 1; got %d attempts", calls) + } +} + +// Contract: a future class added to the iota set without being +// threaded into retryableLookupError's switch MUST panic at first +// reach, not silently fall to "not retryable". Pinned via a +// synthetic class value past the known range; the panic recovery +// proves the default branch fires loudly so the regression is caught +// in CI rather than producing a silent UX degradation. +func TestRetryableLookupError_UnknownClassPanics(t *testing.T) { + defer func() { + if r := recover(); r == nil { + t.Fatal("expected panic on unhandled class; got none") + } + }() + + // Synthetic class value beyond the iota set — emulates a future + // class added to the type without switch update. + _ = retryableLookupError(lookupErrorClass(999)) +} diff --git a/pkg/engine/engine.go b/pkg/engine/engine.go index 00aa92d..de9480c 100644 --- a/pkg/engine/engine.go +++ b/pkg/engine/engine.go @@ -61,6 +61,11 @@ type Options struct { // CommandName names the caller subcommand for error messages such as // the one produced by FailIfMultiNodes. Empty value falls back to "talm". CommandName string + // TalosEndpoints carries the addresses dialed by the talos client for + // chart `lookup` calls. Surfaced through wrapLookupError so failed + // lookups name the endpoints the operator actually targeted, instead + // of forcing them to reconstruct from CLI flags / modeline. + TalosEndpoints []string } // NormalizeTemplatePath converts OS-specific path separators to forward slash. @@ -1594,7 +1599,7 @@ func Render(ctx context.Context, c *client.Client, opts Options) ([]byte, error) return nil, errors.Wrap(err, "checking node selector") } - helmEngine.LookupFunc = newLookupFunction(ctx, c) + helmEngine.LookupFunc = newLookupFunction(ctx, c, cmdName, opts.TalosEndpoints) } chartPath, err := os.Getwd() @@ -2085,13 +2090,17 @@ func extractResourceData(r resource.Resource) (map[string]any, error) { // template function, dispatching across COSI resource kinds and emitting // a deterministic error envelope on miss. // -//nolint:funlen // 62 lines: closure over ctx/c with a single linear dispatch over resource kinds; extracting helpers would either thread (ctx, c) through every signature or hoist the closure body to package level. -func newLookupFunction(ctx context.Context, c *client.Client) func(resource string, namespace string, id string) (map[string]any, error) { +//nolint:funlen // closure over ctx/c plus a single linear dispatch over resource kinds wrapped in a retry-with-fail-fast loop; extracting helpers would either thread (ctx, c) through every signature or hoist the closure body to package level. +func newLookupFunction(ctx context.Context, c *client.Client, commandName string, endpoints []string) func(resource string, namespace string, id string) (map[string]any, error) { return func(kind string, namespace string, docID string) (map[string]any, error) { var multiErr *multierror.Error var resources []map[string]any + // Signature is fixed by helpers.ForEachResource; the callback + // always returns nil because per-item errors are accumulated + // into multiErr / passed through for retry classification. + //nolint:unparam // callback shape fixed by helpers.ForEachResource API callbackResource := func(_ context.Context, _ string, r resource.Resource, callError error) error { if callError != nil { // Ignore NotFound and PermissionDenied errors - resource doesn't exist or is not accessible @@ -2123,14 +2132,53 @@ func newLookupFunction(ctx context.Context, c *client.Client) func(resource stri return nil } - helperErr := helpers.ForEachResource(ctx, c, callbackRD, callbackResource, namespace, kind, docID) - if helperErr != nil { - return map[string]any{}, errors.Wrap(helperErr, "iterating resources") + // Retry transient connectivity failures up to defaultMaxAttempts + // times; TLS handshake / authn / resource / unknown classes + // fail fast — see retryWithFailFast docs. The closure resets + // multiErr and resources at the top of each attempt so a + // half-collected partial result from a failed attempt does not + // leak into the next one. + // + // helpers.ForEachResource routes per-node dial failures (the + // dominant transient class — a single node briefly partitioned + // from the rest of a multi-node lookup) through callbackResource + // as callError, where they land in multiErr and ForEachResource + // itself returns nil. firstLookupError surfaces those to the + // retry predicate so the brief-partition case is actually + // retried, not silently wrapped on the first attempt. + // + // shouldRetry adds a partial-success fast-path: when at least + // one node responded with data this attempt, retrying would + // reset both `resources` AND `multiErr` and re-collect from + // the same set — wasting up to 600ms while producing the same + // final outcome (a persistently-down node stays down). + // + // Note this saves ONLY latency: the post-retry caller still + // returns an error envelope and discards `resources` whenever + // any node failed (matches pre-retry behaviour at engine.go's + // wrapLookupError site below). The partial data is not + // preserved by this fast-path; what is preserved is the + // operator's wall-clock budget on the stale-modeline case + // (decommissioned node still listed in modeline) which is + // common enough that paying per-render latency for it was + // the dominant cost of the retry feature before this + // fast-path landed. + shouldRetry := func(err error) bool { + if len(resources) > 0 { + return false + } + + return retryableLookupError(classifyLookupError(err)) } - err := multiErr.ErrorOrNil() - if err != nil { - return map[string]any{}, errors.Wrap(err, "collecting resource lookup errors") + attemptErr := retryWithFailFast(ctx, func() error { + multiErr = nil + resources = resources[:0] + + return firstLookupError(helpers.ForEachResource(ctx, c, callbackRD, callbackResource, namespace, kind, docID), multiErr) + }, shouldRetry, defaultRetryPolicy()) + if attemptErr != nil { + return map[string]any{}, wrapLookupError(attemptErr, kind, namespace, docID, endpoints, commandName) } if len(resources) == 0 { diff --git a/pkg/engine/lookup_classify.go b/pkg/engine/lookup_classify.go new file mode 100644 index 0000000..0de3be5 --- /dev/null +++ b/pkg/engine/lookup_classify.go @@ -0,0 +1,413 @@ +// Copyright Cozystack Authors +// +// Licensed under the Apache License, Version 2.0 (the "License"); +// you may not use this file except in compliance with the License. +// You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, software +// distributed under the License is distributed on an "AS IS" BASIS, +// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +// See the License for the specific language governing permissions and +// limitations under the License. + +package engine + +import ( + "context" + "fmt" + "strings" + "time" + + "github.com/cockroachdb/errors" + "github.com/hashicorp/go-multierror" + "google.golang.org/grpc/codes" + "google.golang.org/grpc/status" +) + +// CommandNameApply is the CommandName value `talm apply` passes to +// engine.Render. Exported so pkg/commands/apply.go uses it directly +// as the source of truth — eliminates the drift class that would +// otherwise let apply error hints silently start suggesting the +// non-existent `--offline` flag. +const CommandNameApply = "talm apply" + +// CommandNameTemplate is the CommandName value `talm template` passes +// to engine.Render. Pinning to a constant lets offlineRemedyFor +// allow-list `--offline` suggestions (rather than deny-list against +// known no-offline subcommands) — that way any future subcommand or +// untested caller that forgets to set CommandName falls through to +// the safe "fix reachability" remedy instead of getting an +// `--offline` hint that may not exist for its flow. +const CommandNameTemplate = "talm template" + +// lookupErrorClass partitions failures from the talos client `lookup` +// path into actionable categories. Each class drives a distinct +// remedy in hintForClass; the partition is intentionally lossy +// (substring-matched on gRPC code + description) — the goal is +// "right-enough hint 95% of the time", not exhaustive taxonomy. +type lookupErrorClass int + +const ( + lookupErrUnknown lookupErrorClass = iota + lookupErrTLSHandshake + lookupErrRefused + lookupErrDeadline + lookupErrAuthn + lookupErrResource +) + +// classifyLookupError inspects err and returns the most specific +// class it can identify. Discriminators are checked in priority +// order; `Unavailable` with no diagnostic substring falls back to +// `lookupErrRefused` (the dominant real-world cause of bare +// Unavailable from gRPC dial). +func classifyLookupError(err error) lookupErrorClass { + if err == nil { + return lookupErrUnknown + } + + desc := strings.ToLower(err.Error()) + + // "Signed by unknown authority" is a CA / PKI trust failure — + // the dialed cert is fine, but the operator's talosconfig CA bundle + // doesn't recognize the signer. That's an authn-class problem + // (talosconfig context mismatch); the Authn hint steers the + // operator at the talosconfig. + if strings.Contains(desc, "certificate signed by unknown authority") { + return lookupErrAuthn + } + + // Any other x509 / tls error (SAN mismatch like "x509: certificate + // is valid for X, not Y", expired cert, hostname mismatch) is a + // TLS-handshake-class problem. The TLS-handshake hint explicitly + // names cert SANs as a common cause and tells the operator how to + // verify with `talosctl get nodeaddress`. Routing these to Authn + // would land them on the "talosconfig credentials rejected" hint, + // which is irrelevant when the cert chain trusts fine but the + // SAN / expiry / hostname is wrong. + if strings.Contains(desc, "handshake failed") || + strings.Contains(desc, "tls:") || + strings.Contains(desc, "x509:") { + return lookupErrTLSHandshake + } + + if strings.Contains(desc, "connection refused") || + strings.Contains(desc, "no route to host") || + strings.Contains(desc, "network is unreachable") { + return lookupErrRefused + } + + switch status.Code(err) { + case codes.DeadlineExceeded: + return lookupErrDeadline + case codes.Unauthenticated: + return lookupErrAuthn + case codes.Internal, codes.InvalidArgument, codes.FailedPrecondition, codes.NotFound, codes.Unimplemented: + // NotFound and Unimplemented at this level come from the + // ResolveResourceKind path inside helpers.ForEachResource — + // the operator asked for a resource kind the target Talos + // version doesn't know about. The per-node callback already + // filters NotFound for missing instances (engine.go), so any + // NotFound reaching the classifier is definitionally a + // resource-class problem (chart bug, version mismatch). + return lookupErrResource + case codes.Unavailable: + // Diagnostic substrings already checked above; reaching here + // means "channel down, root cause unidentified". Operators + // expect "refused / firewall / wrong port" guidance for this + // class far more often than "TLS / authn / chart bug" — the + // refused-class remedy is a strict superset of useful + // suggestions for the unknown-Unavailable case. + return lookupErrRefused + case codes.OK, + codes.Canceled, + codes.Unknown, + codes.AlreadyExists, + codes.PermissionDenied, + codes.ResourceExhausted, + codes.Aborted, + codes.OutOfRange, + codes.DataLoss: + // Not connectivity-class; fall through to substring fallback. + } + + if strings.Contains(desc, "context deadline exceeded") { + return lookupErrDeadline + } + + return lookupErrUnknown +} + +// wrapLookupError enriches a raw `lookup` error with two layers of +// operator-facing context: +// +// 1. The wrap message names the resource kind, namespace, id, and +// dialed endpoints — the pre-fix chain stripped these and forced +// the operator to read the chart template trace to recover them. +// 2. An errors.WithHint annotation steers the operator toward the +// right remedy for the classified failure mode. +// +// Returns nil when err is nil so call sites can use it +// unconditionally after `helpers.ForEachResource`. +func wrapLookupError(err error, kind, namespace, docID string, endpoints []string, commandName string) error { + if err == nil { + return nil + } + + enriched := errors.Wrapf(err, + "looking up resource kind=%q namespace=%q id=%q on endpoints=%v", + kind, namespace, docID, endpoints) + + class := classifyLookupError(err) + + //nolint:wrapcheck // WithHint annotates the already-wrapped error chain; double-wrap would obscure the chain. + return errors.WithHint(enriched, hintForClass(class, commandName, kind, endpoints)) +} + +// hintForClass returns the human-facing remedy for a classified +// lookup failure. The `talm apply` branch never mentions `--offline` +// because that flag is template-only; suggesting it would teach a +// non-existent workflow. +func hintForClass(class lookupErrorClass, commandName, kind string, endpoints []string) string { + offlineRemedy := offlineRemedyFor(commandName) + + switch class { + case lookupErrTLSHandshake: + return fmt.Sprintf( + "TLS handshake aborted%s — common causes: "+ + "(1) node is not running Talos or is in maintenance mode, "+ + "(2) cert SANs do not include the dialed address, "+ + "(3) wrong port. "+ + "Verify with `talosctl%s get nodeaddress` first. %s", + endpointPhrase(endpoints), endpointsSuggestionSuffix(endpoints), offlineRemedy) + + case lookupErrRefused: + return fmt.Sprintf( + "connection refused%s — node down, wrong port, or blocked by firewall. "+ + "Verify the network path. %s", + endpointPhrase(endpoints), offlineRemedy) + + case lookupErrDeadline: + return fmt.Sprintf( + "request timed out%s — node is slow to respond or the network is partitioned. %s", + endpointPhrase(endpoints), offlineRemedy) + + case lookupErrAuthn: + return fmt.Sprintf( + "talosconfig credentials rejected%s — verify the talosconfig context and the node's PKI. %s", + endpointPhrase(endpoints), offlineRemedy) + + case lookupErrResource: + return fmt.Sprintf( + "lookup of resource kind=%q failed with a non-network error — likely a chart bug or unsupported Talos resource. "+ + "Verify the resource exists with `talosctl get %s`. "+ + "Skipping live discovery would not help here: chart helpers depend on this value and an empty result would produce broken output that looks valid.", + kind, kind) + + case lookupErrUnknown: + // fall through to the generic remedy below. + } + + return fmt.Sprintf( + "lookup failed%s with an unrecognized error — "+ + "file an issue at https://github.com/cozystack/talm/issues with the full output.", + endpointPhrase(endpoints)) +} + +// offlineRemedyFor returns the closing sentence of a connectivity- +// class hint. Allow-listed: only callers that explicitly identify as +// `CommandNameTemplate` get the `--offline` escape clause, because +// only `talm template` ships the flag. Every other caller (apply, +// untested callers, callers that forgot to set CommandName, future +// subcommands) gets the safe generic "fix reachability" remedy. The +// allow-list avoids the failure mode where a missing CommandName +// would silently leak `--offline` into a hint for a flow that has +// no such flag. +func offlineRemedyFor(commandName string) string { + if commandName == CommandNameTemplate { + return "If live discovery is not required for this render, pass --offline to skip it." + } + + return "Fix node reachability before re-running." +} + +// endpointsSuggestionSuffix builds the ` --endpoints X,Y,…` suffix +// appended to a `talosctl` suggestion in hints. Returns the empty +// string when endpoints is empty so the rendered suggestion stays +// grammatical (`talosctl get nodeaddress` rather than `talosctl +// --endpoints get nodeaddress` with a double space). Lists all +// configured endpoints comma-separated — `talosctl --endpoints` +// accepts multiple, and naming only the first would mislead the +// operator when the failing one is somewhere in the middle of the +// list. +func endpointsSuggestionSuffix(endpoints []string) string { + if len(endpoints) == 0 { + return "" + } + + return " --endpoints " + strings.Join(endpoints, ",") +} + +// endpointPhrase returns the operator-facing phrase that names the +// dialed endpoint set in hint leads. When endpoints is empty (no +// --endpoints flag, no modeline endpoints, no talosconfig context) +// returns the empty string so the hint reads grammatically without +// a literal `[]` bracket pair leaking through fmt.Sprintf("%v", nil). +// Includes the leading space so the call site can write "verb%s —" +// and get clean output in both the populated and empty cases. +func endpointPhrase(endpoints []string) string { + if len(endpoints) == 0 { + return "" + } + + return fmt.Sprintf(" by endpoint %v", endpoints) +} + +// retryPolicy parameterises retryWithFailFast so tests can drop the +// backoff to zero while production keeps the exponential schedule. +// Both fields are required; a zero maxAttempts skips the loop body +// entirely and would silently swallow any error. +type retryPolicy struct { + maxAttempts int + backoff func(attempt int) time.Duration +} + +// defaultMaxAttempts is the upper bound on `lookup` retries in +// production. Three attempts with 200ms · 2^n backoff = total +// inter-attempt wait of 600ms (200 + 400). Keeps the CLI responsive +// while catching genuinely transient failures (connection blip, +// brief partition, controller restart). +const defaultMaxAttempts = 3 + +// defaultRetryPolicy is the production retry budget for `lookup` +// against the live Talos API. Returned by function rather than held +// as a var so test-time policy injection is obviously local and the +// gochecknoglobals lint suppression is not paid module-wide. +func defaultRetryPolicy() retryPolicy { + return retryPolicy{ + maxAttempts: defaultMaxAttempts, + backoff: interAttemptBackoff, + } +} + +// interAttemptBackoff is the wait between attempt N and attempt N+1 +// (200ms · 2^N). Caller passes the just-finished attempt's zero- +// based index, so the maximum value actually slept is +// `interAttemptBackoff(maxAttempts - 2)` — the last attempt's value +// is computed by the loop but never slept on because the loop breaks +// at `i == maxAttempts - 1`. Renamed from the earlier +// `productionBackoff` to make the inter-attempt semantics explicit. +func interAttemptBackoff(attempt int) time.Duration { + return 200 * time.Millisecond * time.Duration(1<