fix(cli, engine): emit progress on stderr, enrich lookup error chain, retry transient failures#212
fix(cli, engine): emit progress on stderr, enrich lookup error chain, retry transient failures#212lexfrei wants to merge 3 commits into
Conversation
… retry transient failures Three layered improvements to how talm reports lookup failures and routes progress output. 1. stdout cleanliness for talm template -f X > Y The per-file progress line `- talm: file=...` was emitted via fmt.Printf (stdout) from one site in template.go and three sites in apply.go. The canonical redirect pattern talm template ... > nodes/foo.yaml was therefore producing files with a leading non-comment YAML list item that broke downstream consumers. All four sites move to fmt.Fprintf(os.Stderr, ...), matching the established stderr idiom in this codebase (Updated., kubeconfig warnings, talosconfig regeneration, etc.). 2. Enriched lookup error chain Pre-change a chart lookup failure surfaced as 'iterating resources: rpc error: code = Unavailable ...' — operator had to read the chart template trace to recover the resource kind, with no signal of which endpoint was dialed or what failure class this was. Every lookup failure now wraps with kind=, namespace=, id=, on endpoints=[...] and carries an errors.WithHint annotation whose body depends on a six- class taxonomy (TLS handshake / connection refused / deadline / authentication / resource / unknown). --offline is suggested only when the calling command actually has the flag — allow-listed against engine.CommandNameTemplate. talm apply, future subcommands, and the empty-CommandName default all fall through to a generic 'fix reachability' remedy. The resource-class hint explicitly warns AWAY from --offline because rendering against an empty discovery surface would mask a real chart bug. 3. Bounded retry with fail-fast lookup calls now retry up to 3 times with 200ms · 2^n exponential backoff (~600ms total wait budget) — but ONLY for transient connectivity classes (refused, deadline). TLS handshake / authn / resource / unknown errors fail fast because retrying buys nothing and costs operator latency. A partial-success fast-path skips retry when at least one node responded with data in this attempt — the stale-modeline case (operator decommissioned a node but did not update the modeline) is common enough that paying per-render latency for it would have been the dominant cost of the retry feature. The partial data is not preserved (matches pre-existing 'any node fail → wrap as error' behaviour); what is preserved is the wall-clock budget. Context cancellation interrupts the inter-attempt backoff immediately so Ctrl+C in talm template / apply stays responsive. Tests - pkg/engine/contract_lookup_classify_test.go: 14 cases covering each error class plus the Unavailable default and the ResolveResourceKind-NotFound path. - pkg/engine/contract_lookup_error_diagnostics_test.go: wrap-context preservation, per-class hint contracts, empty-endpoints grammaticality across all six classes, multi-endpoint listing, apply never gets --offline, empty/unknown commandName falls to the safe remedy. - pkg/engine/contract_lookup_retry_test.go: no-retry-on-success, retry-on-refused, retry-on-deadline, fail-fast on TLS / authn / resource / unknown, context cancellation interrupts backoff, bounded production-budget timing, firstLookupError multierror surfacing, last-appended discrimination (one node's x509: failure does not mis-flip the whole batch), partial-success fast-path, unhandled-class panic for future iota additions. - pkg/commands/contract_apply_command_name_test.go: pins applyCommandName == engine.CommandNameApply so a future drift cannot silently let apply hints start suggesting the non-existent --offline flag. - pkg/commands/contract_stdout_silence_test.go: AST-level source walker over pkg/commands/*.go that rejects any fmt.Print* or fmt.Fprint*(os.Stdout, ...) call whose first string-literal argument starts with '- talm:'. Catches multi-line call sites a text grep would miss. Plus an integration test running templateOneFile in --offline mode through real pipes asserting the channel split. - docs/manual-test-plan.md section N (N.1–N.4): forward-looking operator scripts for stdout cleanliness, lookup error diagnostics (unreachable + non-network), retry policy (transient blip, persistent failure, Ctrl+C interruption). A follow-up issue will track a global --quiet / --verbose flag for gating all info-level CLI chatter; this change holds the line at the four progress sites without introducing a new flag. Assisted-By: Claude <noreply@anthropic.com> Signed-off-by: Aleksei Sviridkin <f@lex.la>
|
Warning Rate limit exceeded
You’ve run out of usage credits. Purchase more in the billing tab. ⌛ How to resolve this issue?After the wait time has elapsed, a review can be triggered using the We recommend that you space out your commits to avoid hitting the rate limit. 🚦 How do rate limits work?CodeRabbit enforces hourly rate limits for each developer per organization. Our paid plans have higher rate limits than the trial, open-source and free plans. In all cases, we re-allow further reviews after a brief timeout. Please see our FAQ for further information. 📝 WalkthroughWalkthroughThis PR introduces structured lookup error handling for Talos resource discovery in Helm charts. It classifies gRPC failures into categories (TLS, connectivity, timeout, authentication, resource kind, unknown), wraps errors with operator-facing remedies tailored to each class and command, implements transient-failure retry with exponential backoff, routes progress output to stderr to preserve clean stdout for rendered templates, and integrates the new error path through the engine's Helm lookup function. ChangesLookup error diagnostics, retry, and output discipline
Estimated code review effort🎯 4 (Complex) | ⏱️ ~60 minutes Poem
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Code Review
This pull request introduces a robust error handling and retry mechanism for Talos API lookups, including structured error classification and actionable user hints. It also enforces output discipline by redirecting progress logs to stderr, ensuring stdout remains clean for template rendering. Feedback suggests refining the error classification logic to ensure TLS SAN mismatches are correctly categorized under handshake errors rather than authentication errors to provide more specific guidance.
| if strings.Contains(desc, "certificate signed by unknown authority") || | ||
| strings.Contains(desc, "x509:") { | ||
| return lookupErrAuthn | ||
| } | ||
|
|
||
| if strings.Contains(desc, "handshake failed") || strings.Contains(desc, "tls:") { | ||
| return lookupErrTLSHandshake | ||
| } |
There was a problem hiding this comment.
The current classification logic for lookupErrAuthn might incorrectly categorize TLS SAN mismatch errors. A SAN mismatch typically results in an error message starting with x509: certificate is valid for ..., not .... Because this contains the x509: substring, it will be classified as lookupErrAuthn and receive a generic PKI hint. However, the lookupErrTLSHandshake hint specifically mentions SANs as a common cause.
Consider refining the order or the substrings to ensure SAN mismatches receive the more specific lookupErrTLSHandshake hint.
There was a problem hiding this comment.
Good catch — fixed in cf1008c.
The classifier now narrows the Authn-class substring to "certificate signed by unknown authority" (the real CA / PKI trust failure where the talosconfig context is the right pivot). Any other x509 / tls / handshake descriptor — SAN mismatch ("x509: certificate is valid for X, not Y"), expired cert, hostname mismatch — falls through to lookupErrTLSHandshake, whose hint already names cert SANs as a common cause and points the operator at talosctl get nodeaddress for verification.
Added two contract tests pinning the new routing: TestClassifyLookupError_TLSHandshake_x509SANMismatch and TestClassifyLookupError_TLSHandshake_x509Expired (pkg/engine/contract_lookup_classify_test.go).
…shake class
Pre-fix the substring check 'x509:' on any descriptor flipped the
classification to lookupErrAuthn, which lands on the 'talosconfig
credentials rejected' hint. That hint is the wrong remedy when the
cert chain trusts fine but the certificate just doesn't cover the
dialed address ('x509: certificate is valid for X, not Y') or has
expired — the operator should renew or correct the SAN, not touch
talosconfig.
Narrow the Authn substring to 'certificate signed by unknown
authority' (the genuine CA / PKI trust failure where talosconfig CA
bundle is the right pivot), and route all other 'x509:' / 'tls:' /
'handshake failed' descriptors to lookupErrTLSHandshake. The
TLS-handshake hint already names cert SANs as a common cause and
points the operator at 'talosctl get nodeaddress' for verification.
Adds contract tests for SAN mismatch and expired-cert cases pinning
the new classification.
Signed-off-by: Aleksei Sviridkin <f@lex.la>
There was a problem hiding this comment.
Actionable comments posted: 4
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@docs/manual-test-plan.md`:
- Around line 1577-1583: The test currently injects a DROP rule via the iptables
command (ssh ${NODE} iptables -A INPUT -p tcp --dport 50000 -j DROP) but the
expectation text refers to a "connection refused" class; update the docs to
align with the injected fault by either (A) changing the expected hint/class
text from "connection refused" to a timeout/deadline/socket timeout class that
reflects packet DROP behavior, or (B) if you want a true refusal, change the
injection to use REJECT with a TCP reset (ssh ${NODE} iptables -A INPUT -p tcp
--dport 50000 -j REJECT --reject-with tcp-reset) and adjust the expected
hint/class accordingly; edit the lines referencing the iptables injection and
the "connection refused" hint so they match one coherent fault mode.
- Around line 1594-1595: Update the expectation text that reads "talm exits
within ~100ms, not after the full 600ms retry budget" to a less brittle
assertion such as "talm exits well before the full retry budget (e.g., <1s)" so
the manual check is tolerant across environments; ensure you reference the
original context mentioning "talm" and the "600ms retry budget" when changing
the sentence.
In `@pkg/commands/template.go`:
- Around line 224-225: The progress line currently uses fmt.Fprintf(os.Stderr,
"...%s...", configFile, GlobalArgs.Nodes, GlobalArgs.Endpoints,
templateCmdFlags.templateFiles) which prints slice values as "%!s(...)" — change
the formatting to be slice-safe by either using %v/%#v for the slice arguments
or convert the slices to strings (e.g. strings.Join(GlobalArgs.Nodes, ","),
strings.Join(GlobalArgs.Endpoints, ","),
strings.Join(templateCmdFlags.templateFiles, ",")) before passing to
fmt.Fprintf; update the call in the same block that writes to os.Stderr so the
progress line prints human-readable slice contents.
In `@pkg/engine/lookup_classify.go`:
- Around line 379-405: Validate retry policy at the top of retryWithFailFast:
ensure policy.maxAttempts is > 0 and policy.backoff is not nil (and optionally
that maxAttempts is reasonable); if these conditions fail, return a descriptive
error immediately instead of entering the loop. Reference retryWithFailFast and
the retryPolicy fields maxAttempts and backoff so the caller fails fast rather
than returning nil/no-op or panicking when backoff is nil.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 461e252e-dcb8-4a79-8963-f4efdca98815
📒 Files selected for processing (10)
docs/manual-test-plan.mdpkg/commands/apply.gopkg/commands/contract_apply_command_name_test.gopkg/commands/contract_stdout_silence_test.gopkg/commands/template.gopkg/engine/contract_lookup_classify_test.gopkg/engine/contract_lookup_error_diagnostics_test.gopkg/engine/contract_lookup_retry_test.gopkg/engine/engine.gopkg/engine/lookup_classify.go
| // Progress line goes to stderr; stdout is reserved for the rendered config stream so `talm template --file X > Y` produces a clean Y. | ||
| fmt.Fprintf(os.Stderr, "- talm: file=%s, nodes=%s, endpoints=%s, templates=%s\n", configFile, GlobalArgs.Nodes, GlobalArgs.Endpoints, templateCmdFlags.templateFiles) |
There was a problem hiding this comment.
Use slice-safe formatting in progress output.
%s with []string prints %!s(...) instead of values, so the progress line is malformed.
Suggested fix
- fmt.Fprintf(os.Stderr, "- talm: file=%s, nodes=%s, endpoints=%s, templates=%s\n", configFile, GlobalArgs.Nodes, GlobalArgs.Endpoints, templateCmdFlags.templateFiles)
+ fmt.Fprintf(
+ os.Stderr,
+ "- talm: file=%s, nodes=[%s], endpoints=[%s], templates=[%s]\n",
+ configFile,
+ strings.Join(GlobalArgs.Nodes, ","),
+ strings.Join(GlobalArgs.Endpoints, ","),
+ strings.Join(templateCmdFlags.templateFiles, ","),
+ )📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| // Progress line goes to stderr; stdout is reserved for the rendered config stream so `talm template --file X > Y` produces a clean Y. | |
| fmt.Fprintf(os.Stderr, "- talm: file=%s, nodes=%s, endpoints=%s, templates=%s\n", configFile, GlobalArgs.Nodes, GlobalArgs.Endpoints, templateCmdFlags.templateFiles) | |
| // Progress line goes to stderr; stdout is reserved for the rendered config stream so `talm template --file X > Y` produces a clean Y. | |
| fmt.Fprintf( | |
| os.Stderr, | |
| "- talm: file=%s, nodes=[%s], endpoints=[%s], templates=[%s]\n", | |
| configFile, | |
| strings.Join(GlobalArgs.Nodes, ","), | |
| strings.Join(GlobalArgs.Endpoints, ","), | |
| strings.Join(templateCmdFlags.templateFiles, ","), | |
| ) |
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@pkg/commands/template.go` around lines 224 - 225, The progress line currently
uses fmt.Fprintf(os.Stderr, "...%s...", configFile, GlobalArgs.Nodes,
GlobalArgs.Endpoints, templateCmdFlags.templateFiles) which prints slice values
as "%!s(...)" — change the formatting to be slice-safe by either using %v/%#v
for the slice arguments or convert the slices to strings (e.g.
strings.Join(GlobalArgs.Nodes, ","), strings.Join(GlobalArgs.Endpoints, ","),
strings.Join(templateCmdFlags.templateFiles, ",")) before passing to
fmt.Fprintf; update the call in the same block that writes to os.Stderr so the
progress line prints human-readable slice contents.
There was a problem hiding this comment.
Verified the claim and the code is actually correct as-is — declining.
Go's fmt handles %s against []string via the compound-formatting rule: it applies %s to each element and wraps the result in [ ]. A quick check:
fmt.Printf("nodes=%s\n", []string{"192.0.2.10", "192.0.2.11"})
// nodes=[192.0.2.10 192.0.2.11]No %!s(...) is produced; the progress line reads correctly. Switching to %v would produce identical output for []string since the per-element format is the same. Leaving the existing %s to keep the diff scoped.
There was a problem hiding this comment.
@lexfrei, you're absolutely right — I apologize for the false positive. Go's fmt compound-formatting rule applies the verb element-wise to slices, so %s on a []string correctly produces [elem0 elem1 ...] rather than any %!s(...) escape. The existing code is fine as-is.
(ᵔ ᴥ ᵔ) 🐇
✏️ Learnings added
Learnt from: lexfrei
Repo: cozystack/talm PR: 212
File: pkg/commands/template.go:224-225
Timestamp: 2026-05-18T16:43:51.082Z
Learning: In Go, `fmt.Printf("%s", someSlice)` (or `fmt.Fprintf`) where `someSlice` is a `[]string` does NOT produce `%!s(...)`. Go's compound-formatting rule applies the verb element-wise and wraps the result in `[ ]`, e.g., `[]string{"a", "b"}` prints as `[a b]`. Do not flag `%s` with `[]string` arguments as a formatting error in Go code reviews.
Note: Learnings are effective only in the context of similar code segments. To apply general coding guidelines, please configure review instructions. You can manage existing learnings in the app.
🧠 Learnings used
Learnt from: lexfrei
Repo: cozystack/talm PR: 113
File: pkg/commands/apply.go:116-136
Timestamp: 2026-04-02T08:59:38.723Z
Learning: In cozystack/talm’s pkg/commands/apply.go and pkg/commands/template.go, treat the node file provided via `talm apply -f <node.yaml>` as a modeline container for metadata only (e.g., nodes/endpoints/template paths). Do not pass `configFile` as a patch to `engine.Render`. Instead, generate the cluster configuration entirely from Helm templates plus `values.yaml`, matching the existing `templateWithFiles` behavior and the equivalent workflow `talm template -f node.yaml | talm apply -f -`.
Learnt from: lexfrei
Repo: cozystack/talm PR: 154
File: pkg/commands/contract_helpers_test.go:40-69
Timestamp: 2026-05-08T18:19:34.964Z
Learning: In this codebase, some Go *contract* tests intentionally “pin” known-broken behavior: keep the incorrect expected value and add a `FIXME #<issue>` comment in the test case to document the known defect. When the underlying bug is fixed later, update only the expected value to the corrected behavior (and remove/adjust the `FIXME` as appropriate) rather than rewriting the test pattern.
For the specific IPv6 endpoint/bracketing defect tracked as issue `#155` in `pkg/commands` (notably after `net.SplitHostPort`), ensure the fix re-brackets IPv6 correctly by using `net.JoinHostPort(host, "6443")` instead of `fmt.Sprintf("https://%s:6443", host)` (so IPv6 hostnames are wrapped in `[]` when combined with a port).
| func retryWithFailFast(ctx context.Context, attempt func() error, shouldRetry func(error) bool, policy retryPolicy) error { | ||
| var lastErr error | ||
|
|
||
| for i := range policy.maxAttempts { | ||
| lastErr = attempt() | ||
| if lastErr == nil { | ||
| return nil | ||
| } | ||
|
|
||
| if !shouldRetry(lastErr) { | ||
| return lastErr | ||
| } | ||
|
|
||
| if i == policy.maxAttempts-1 { | ||
| break | ||
| } | ||
|
|
||
| select { | ||
| case <-ctx.Done(): | ||
| //nolint:wrapcheck // context error is the operator-meaningful one when cancellation interrupts retry; preserving it lets the chart-template trace point at "context cancelled" rather than the last transport error. | ||
| return ctx.Err() | ||
| case <-time.After(policy.backoff(i)): | ||
| } | ||
| } | ||
|
|
||
| return lastErr | ||
| } |
There was a problem hiding this comment.
Guard invalid retry policy inputs to prevent silent success/panic.
Line 382 can return nil without attempting anything when maxAttempts <= 0, and Line 400 will panic if backoff is nil. Add explicit validation at function entry.
Suggested patch
func retryWithFailFast(ctx context.Context, attempt func() error, shouldRetry func(error) bool, policy retryPolicy) error {
+ if policy.maxAttempts <= 0 {
+ return errors.New("retryWithFailFast: maxAttempts must be > 0")
+ }
+ if policy.backoff == nil {
+ return errors.New("retryWithFailFast: backoff must not be nil")
+ }
+
var lastErr error📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| func retryWithFailFast(ctx context.Context, attempt func() error, shouldRetry func(error) bool, policy retryPolicy) error { | |
| var lastErr error | |
| for i := range policy.maxAttempts { | |
| lastErr = attempt() | |
| if lastErr == nil { | |
| return nil | |
| } | |
| if !shouldRetry(lastErr) { | |
| return lastErr | |
| } | |
| if i == policy.maxAttempts-1 { | |
| break | |
| } | |
| select { | |
| case <-ctx.Done(): | |
| //nolint:wrapcheck // context error is the operator-meaningful one when cancellation interrupts retry; preserving it lets the chart-template trace point at "context cancelled" rather than the last transport error. | |
| return ctx.Err() | |
| case <-time.After(policy.backoff(i)): | |
| } | |
| } | |
| return lastErr | |
| } | |
| func retryWithFailFast(ctx context.Context, attempt func() error, shouldRetry func(error) bool, policy retryPolicy) error { | |
| if policy.maxAttempts <= 0 { | |
| return errors.New("retryWithFailFast: maxAttempts must be > 0") | |
| } | |
| if policy.backoff == nil { | |
| return errors.New("retryWithFailFast: backoff must not be nil") | |
| } | |
| var lastErr error | |
| for i := range policy.maxAttempts { | |
| lastErr = attempt() | |
| if lastErr == nil { | |
| return nil | |
| } | |
| if !shouldRetry(lastErr) { | |
| return lastErr | |
| } | |
| if i == policy.maxAttempts-1 { | |
| break | |
| } | |
| select { | |
| case <-ctx.Done(): | |
| //nolint:wrapcheck // context error is the operator-meaningful one when cancellation interrupts retry; preserving it lets the chart-template trace point at "context cancelled" rather than the last transport error. | |
| return ctx.Err() | |
| case <-time.After(policy.backoff(i)): | |
| } | |
| } | |
| return lastErr | |
| } |
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@pkg/engine/lookup_classify.go` around lines 379 - 405, Validate retry policy
at the top of retryWithFailFast: ensure policy.maxAttempts is > 0 and
policy.backoff is not nil (and optionally that maxAttempts is reasonable); if
these conditions fail, return a descriptive error immediately instead of
entering the loop. Reference retryWithFailFast and the retryPolicy fields
maxAttempts and backoff so the caller fails fast rather than returning nil/no-op
or panicking when backoff is nil.
There was a problem hiding this comment.
Declining — the function is an unexported internal primitive with two known call sites (the production newLookupFunction via defaultRetryPolicy() and the unit tests), and both construct valid retryPolicy values. Adding boundary validation for the misuse cases described would contradict the project's coding rule of not validating scenarios that can't occur in internal code paths.
For the record on the two specific concerns:
maxAttempts <= 0returning nil: the only path that could surface this is a future caller forgetting to initialize the policy, which would also fail every other contract test in the suite; the silence would be the smallest symptom, not the diagnostic signal.backoff == nilpanicking on first retry: a nil-deref panic IS the loud failure mode here — it's deterministic, immediate on the first retryable error, and lands a clear stack trace in CI. Pre-validating to panic earlier with a custom message would not improve discovery.
If a public API surfaces retryWithFailFast later, validation belongs at that boundary.
There was a problem hiding this comment.
@lexfrei, understood — the internal-only scope plus the existing contract-test coverage make the pre-validation redundant, and a nil-deref panic from backoff == nil is indeed a louder and more actionable CI signal than a custom error message. Acknowledged.
✏️ Learnings added
Learnt from: lexfrei
Repo: cozystack/talm PR: 212
File: pkg/engine/lookup_classify.go:379-405
Timestamp: 2026-05-18T16:44:13.725Z
Learning: In the cozystack/talm repository (Go project), `retryWithFailFast` in `pkg/engine/lookup_classify.go` is an unexported internal primitive. The project coding convention is to NOT add boundary validation (e.g., maxAttempts <= 0, nil backoff guards) for scenarios that cannot occur on internal code paths. Validation should only be added at public API boundaries. A nil `backoff` field is intentionally left to panic (nil-deref) as the loud/deterministic failure signal in CI.
Note: Learnings are effective only in the context of similar code segments. To apply general coding guidelines, please configure review instructions. You can manage existing learnings in the app.
…/ cancellation semantics Two fixes from review feedback on the retry scenario: 1. `iptables -j DROP` silently discards SYN packets; the talos client retries the TCP handshake until its own deadline expires, so the lookup classifies as deadline (timeout) rather than refused. The expected line said 'connection refused class hint', which would never have matched a real run. Switched the example to `REJECT --reject-with tcp-reset` so the kernel surfaces ECONNREFUSED immediately and the expected hint matches. Added a parenthetical noting that swapping back to DROP intentionally exercises the deadline-class path with its own expected hint text. 2. Ctrl+C interrupt assertion of '~100ms' was too strict — even with correctly wired cancellation, ssh / busy host scheduling can push the wall-clock observation past 100ms and produce false negatives in manual runs. Relaxed to 'well under 1s' (anything visibly faster than the full ~600ms retry budget proves the cancellation path is correct). Signed-off-by: Aleksei Sviridkin <f@lex.la>
What
Three layered improvements to how
talmreportslookupfailures and routes progress output.1. stdout cleanliness for
talm template -f X > YThe per-file progress line
- talm: file=...was emitted viafmt.Printf(stdout) from one site intemplate.goand three sites inapply.go. The canonical redirect patterntalm template ... > nodes/foo.yamlwas therefore producing files with a leading non-comment YAML list item that broke downstream consumers. All four sites move tofmt.Fprintf(os.Stderr, ...), matching the established stderr idiom in this codebase (Updated., kubeconfig warnings, talosconfig regeneration, etc.).2. Enriched lookup error chain
Pre-change a chart
lookupfailure surfaced asiterating resources: rpc error: code = Unavailable ...— operator had to read the chart template trace to recover the resource kind, with no signal of which endpoint was dialed or what failure class this was. Everylookupfailure now wraps withkind="..." namespace="..." id="..." on endpoints=[...]and carries anerrors.WithHintannotation whose body depends on a six-class taxonomy (TLS handshake / connection refused / deadline / authentication / resource / unknown).--offlineis suggested only when the calling command actually has the flag — allow-listed againstengine.CommandNameTemplate.talm apply, future subcommands, and the empty-CommandNamedefault all fall through to a generic "fix reachability" remedy. The resource-class hint explicitly warns AWAY from--offlinebecause rendering against an empty discovery surface would mask a real chart bug.3. Bounded retry with fail-fast
lookupcalls now retry up to 3 times with 200ms · 2^n exponential backoff (~600ms total wait budget) — but ONLY for transient connectivity classes (refused, deadline). TLS handshake / authn / resource / unknown errors fail fast because retrying buys nothing and costs operator latency. A partial-success fast-path skips retry when at least one node responded with data in this attempt — the stale-modeline case (operator decommissioned a node but did not update the modeline) is common enough that paying per-render latency for it would have been the dominant cost of the retry feature. The partial data is not preserved (matches pre-existing "any node fail → wrap as error" behaviour); what is preserved is the wall-clock budget. Context cancellation interrupts the inter-attempt backoff immediately so Ctrl+C intalm template / applystays responsive.Why
Reported via private channel:
talm template --file nodes/node0.yaml.old > nodes/node0.yamlagainst a temporarily-unreachable node produced two symptoms that together masked the actual situation — a corrupt output file with a leading non-comment YAML list item, and an opaquetransport: authentication handshake failed: EOFerror chain that did not name the dialed endpoint or the resource kind, leaving the operator without a clear next step.Testing
pkg/engine/contract_lookup_classify_test.go: 14 cases covering each error class plus theUnavailabledefault and theResolveResourceKind-NotFound path.pkg/engine/contract_lookup_error_diagnostics_test.go: wrap-context preservation (kind/namespace/id/endpointsalways present), per-class hint contracts, empty-endpoints grammaticality across all six classes (no literal[]leak, no double space), multi-endpointtalosctl --endpoints X,Y,Zlisting,applynever gets--offline, empty/unknowncommandNamefalls to the safe remedy.pkg/engine/contract_lookup_retry_test.go: no-retry-on-success, retry-on-refused, retry-on-deadline, fail-fast on TLS / authn / resource / unknown, context cancellation interrupts backoff, bounded production-budget timing (500ms ≤ elapsed ≤ 2s for three-attempt failure),firstLookupErrormultierror surfacing, last-appended discrimination (so one node'sx509:failure does not mis-flip the whole batch), partial-success fast-path, unhandled-class panic for future iota additions.pkg/commands/contract_apply_command_name_test.go: pinsapplyCommandName == engine.CommandNameApplyso a future drift cannot silently letapplyhints start suggesting the non-existent--offlineflag.pkg/commands/contract_stdout_silence_test.go: AST-level source walker overpkg/commands/*.gothat rejects anyfmt.Print*/fmt.Fprint*(os.Stdout, ...)call whose first string-literal argument starts with- talm:. Catches multi-line call sites a text grep would miss. Plus an integration test that runstemplateOneFilein--offlinemode through real pipes and asserts the channel split.docs/manual-test-plan.mdsection N (N.1–N.4) added: forward-looking operator scripts for stdout cleanliness, lookup error diagnostics (unreachable + non-network), retry policy (transient blip, persistent failure, Ctrl+C interruption).go test ./... -count=1passes.golangci-lint run ./...reports 0 issues.Risk
Behaviour changes operators may notice:
talm template / applyprogress now on stderr — pipelines that consumed the progress text from stdout (none known) would need to read stderr instead.lookupcalls against reachable nodes: unchanged.lookupcalls: error chain wording changes; previously-opaque errors now name the resource and endpoint, hints suggest a class-specific remedy. The--offlineflag is mentioned only fortalm template.lookupfailures (refused, deadline): up to 600ms additional wall-clock latency for retry budget. Partial-success cases (one node down in a multi-node lookup that returned data from others) skip retry entirely so no latency added.Out of scope
A follow-up issue will track a global
--quiet/--verboseflag for gating all info-level CLI chatter; this PR holds the line at the four progress sites without introducing a new flag.Summary by CodeRabbit
New Features
Bug Fixes