Skip to content

fix(cli, engine): emit progress on stderr, enrich lookup error chain, retry transient failures#212

Open
lexfrei wants to merge 3 commits into
mainfrom
fix/stdout-stderr-and-lookup-diagnostics
Open

fix(cli, engine): emit progress on stderr, enrich lookup error chain, retry transient failures#212
lexfrei wants to merge 3 commits into
mainfrom
fix/stdout-stderr-and-lookup-diagnostics

Conversation

@lexfrei
Copy link
Copy Markdown
Contributor

@lexfrei lexfrei commented May 18, 2026

What

Three layered improvements to how talm reports lookup failures and routes progress output.

1. stdout cleanliness for talm template -f X > Y

The per-file progress line - talm: file=... was emitted via fmt.Printf (stdout) from one site in template.go and three sites in apply.go. The canonical redirect pattern talm template ... > nodes/foo.yaml was therefore producing files with a leading non-comment YAML list item that broke downstream consumers. All four sites move to fmt.Fprintf(os.Stderr, ...), matching the established stderr idiom in this codebase (Updated., kubeconfig warnings, talosconfig regeneration, etc.).

2. Enriched lookup error chain

Pre-change a chart lookup failure surfaced as iterating resources: rpc error: code = Unavailable ... — operator had to read the chart template trace to recover the resource kind, with no signal of which endpoint was dialed or what failure class this was. Every lookup failure now wraps with kind="..." namespace="..." id="..." on endpoints=[...] and carries an errors.WithHint annotation whose body depends on a six-class taxonomy (TLS handshake / connection refused / deadline / authentication / resource / unknown).

--offline is suggested only when the calling command actually has the flag — allow-listed against engine.CommandNameTemplate. talm apply, future subcommands, and the empty-CommandName default all fall through to a generic "fix reachability" remedy. The resource-class hint explicitly warns AWAY from --offline because rendering against an empty discovery surface would mask a real chart bug.

3. Bounded retry with fail-fast

lookup calls now retry up to 3 times with 200ms · 2^n exponential backoff (~600ms total wait budget) — but ONLY for transient connectivity classes (refused, deadline). TLS handshake / authn / resource / unknown errors fail fast because retrying buys nothing and costs operator latency. A partial-success fast-path skips retry when at least one node responded with data in this attempt — the stale-modeline case (operator decommissioned a node but did not update the modeline) is common enough that paying per-render latency for it would have been the dominant cost of the retry feature. The partial data is not preserved (matches pre-existing "any node fail → wrap as error" behaviour); what is preserved is the wall-clock budget. Context cancellation interrupts the inter-attempt backoff immediately so Ctrl+C in talm template / apply stays responsive.

Why

Reported via private channel: talm template --file nodes/node0.yaml.old > nodes/node0.yaml against a temporarily-unreachable node produced two symptoms that together masked the actual situation — a corrupt output file with a leading non-comment YAML list item, and an opaque transport: authentication handshake failed: EOF error chain that did not name the dialed endpoint or the resource kind, leaving the operator without a clear next step.

Testing

  • New pkg/engine/contract_lookup_classify_test.go: 14 cases covering each error class plus the Unavailable default and the ResolveResourceKind-NotFound path.
  • New pkg/engine/contract_lookup_error_diagnostics_test.go: wrap-context preservation (kind/namespace/id/endpoints always present), per-class hint contracts, empty-endpoints grammaticality across all six classes (no literal [] leak, no double space), multi-endpoint talosctl --endpoints X,Y,Z listing, apply never gets --offline, empty/unknown commandName falls to the safe remedy.
  • New pkg/engine/contract_lookup_retry_test.go: no-retry-on-success, retry-on-refused, retry-on-deadline, fail-fast on TLS / authn / resource / unknown, context cancellation interrupts backoff, bounded production-budget timing (500ms ≤ elapsed ≤ 2s for three-attempt failure), firstLookupError multierror surfacing, last-appended discrimination (so one node's x509: failure does not mis-flip the whole batch), partial-success fast-path, unhandled-class panic for future iota additions.
  • New pkg/commands/contract_apply_command_name_test.go: pins applyCommandName == engine.CommandNameApply so a future drift cannot silently let apply hints start suggesting the non-existent --offline flag.
  • New pkg/commands/contract_stdout_silence_test.go: AST-level source walker over pkg/commands/*.go that rejects any fmt.Print* / fmt.Fprint*(os.Stdout, ...) call whose first string-literal argument starts with - talm:. Catches multi-line call sites a text grep would miss. Plus an integration test that runs templateOneFile in --offline mode through real pipes and asserts the channel split.
  • docs/manual-test-plan.md section N (N.1–N.4) added: forward-looking operator scripts for stdout cleanliness, lookup error diagnostics (unreachable + non-network), retry policy (transient blip, persistent failure, Ctrl+C interruption).

go test ./... -count=1 passes. golangci-lint run ./... reports 0 issues.

Risk

Behaviour changes operators may notice:

  • talm template / apply progress now on stderr — pipelines that consumed the progress text from stdout (none known) would need to read stderr instead.
  • Successful lookup calls against reachable nodes: unchanged.
  • Failed lookup calls: error chain wording changes; previously-opaque errors now name the resource and endpoint, hints suggest a class-specific remedy. The --offline flag is mentioned only for talm template.
  • Transient lookup failures (refused, deadline): up to 600ms additional wall-clock latency for retry budget. Partial-success cases (one node down in a multi-node lookup that returned data from others) skip retry entirely so no latency added.

Out of scope

A follow-up issue will track a global --quiet / --verbose flag for gating all info-level CLI chatter; this PR holds the line at the four progress sites without introducing a new flag.

Summary by CodeRabbit

  • New Features

    • Automatic retry for transient Talos API connectivity failures
    • Enhanced error diagnostics with failure-specific remediation guidance (TLS, firewall, authentication, resource issues)
  • Bug Fixes

    • Fixed output discipline: progress lines now correctly route to stderr for clean stdout redirection

Review Change Stack

… retry transient failures

Three layered improvements to how talm reports lookup failures and
routes progress output.

1. stdout cleanliness for talm template -f X > Y

The per-file progress line `- talm: file=...` was emitted via fmt.Printf
(stdout) from one site in template.go and three sites in apply.go. The
canonical redirect pattern talm template ... > nodes/foo.yaml was therefore
producing files with a leading non-comment YAML list item that broke
downstream consumers. All four sites move to fmt.Fprintf(os.Stderr, ...),
matching the established stderr idiom in this codebase (Updated.,
kubeconfig warnings, talosconfig regeneration, etc.).

2. Enriched lookup error chain

Pre-change a chart lookup failure surfaced as 'iterating resources:
rpc error: code = Unavailable ...' — operator had to read the chart
template trace to recover the resource kind, with no signal of which
endpoint was dialed or what failure class this was. Every lookup
failure now wraps with kind=, namespace=, id=, on endpoints=[...] and
carries an errors.WithHint annotation whose body depends on a six-
class taxonomy (TLS handshake / connection refused / deadline /
authentication / resource / unknown).

--offline is suggested only when the calling command actually has the
flag — allow-listed against engine.CommandNameTemplate. talm apply,
future subcommands, and the empty-CommandName default all fall
through to a generic 'fix reachability' remedy. The resource-class
hint explicitly warns AWAY from --offline because rendering against
an empty discovery surface would mask a real chart bug.

3. Bounded retry with fail-fast

lookup calls now retry up to 3 times with 200ms · 2^n exponential
backoff (~600ms total wait budget) — but ONLY for transient
connectivity classes (refused, deadline). TLS handshake / authn /
resource / unknown errors fail fast because retrying buys nothing
and costs operator latency. A partial-success fast-path skips retry
when at least one node responded with data in this attempt — the
stale-modeline case (operator decommissioned a node but did not
update the modeline) is common enough that paying per-render latency
for it would have been the dominant cost of the retry feature. The
partial data is not preserved (matches pre-existing 'any node fail
→ wrap as error' behaviour); what is preserved is the wall-clock
budget. Context cancellation interrupts the inter-attempt backoff
immediately so Ctrl+C in talm template / apply stays responsive.

Tests

- pkg/engine/contract_lookup_classify_test.go: 14 cases covering each
  error class plus the Unavailable default and the
  ResolveResourceKind-NotFound path.
- pkg/engine/contract_lookup_error_diagnostics_test.go: wrap-context
  preservation, per-class hint contracts, empty-endpoints
  grammaticality across all six classes, multi-endpoint listing,
  apply never gets --offline, empty/unknown commandName falls to the
  safe remedy.
- pkg/engine/contract_lookup_retry_test.go: no-retry-on-success,
  retry-on-refused, retry-on-deadline, fail-fast on TLS / authn /
  resource / unknown, context cancellation interrupts backoff,
  bounded production-budget timing, firstLookupError multierror
  surfacing, last-appended discrimination (one node's x509: failure
  does not mis-flip the whole batch), partial-success fast-path,
  unhandled-class panic for future iota additions.
- pkg/commands/contract_apply_command_name_test.go: pins
  applyCommandName == engine.CommandNameApply so a future drift cannot
  silently let apply hints start suggesting the non-existent --offline
  flag.
- pkg/commands/contract_stdout_silence_test.go: AST-level source
  walker over pkg/commands/*.go that rejects any fmt.Print* or
  fmt.Fprint*(os.Stdout, ...) call whose first string-literal
  argument starts with '- talm:'. Catches multi-line call sites a
  text grep would miss. Plus an integration test running
  templateOneFile in --offline mode through real pipes asserting the
  channel split.
- docs/manual-test-plan.md section N (N.1–N.4): forward-looking
  operator scripts for stdout cleanliness, lookup error diagnostics
  (unreachable + non-network), retry policy (transient blip,
  persistent failure, Ctrl+C interruption).

A follow-up issue will track a global --quiet / --verbose flag for
gating all info-level CLI chatter; this change holds the line at the
four progress sites without introducing a new flag.

Assisted-By: Claude <noreply@anthropic.com>
Signed-off-by: Aleksei Sviridkin <f@lex.la>
@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented May 18, 2026

Warning

Rate limit exceeded

@lexfrei has exceeded the limit for the number of commits that can be reviewed per hour. Please wait 40 minutes and 25 seconds before requesting another review.

You’ve run out of usage credits. Purchase more in the billing tab.

⌛ How to resolve this issue?

After the wait time has elapsed, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

We recommend that you space out your commits to avoid hitting the rate limit.

🚦 How do rate limits work?

CodeRabbit enforces hourly rate limits for each developer per organization.

Our paid plans have higher rate limits than the trial, open-source and free plans. In all cases, we re-allow further reviews after a brief timeout.

Please see our FAQ for further information.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: f2b8067d-514d-4540-a26a-9e2775388fbe

📥 Commits

Reviewing files that changed from the base of the PR and between cf1008c and cc0239d.

📒 Files selected for processing (1)
  • docs/manual-test-plan.md
📝 Walkthrough

Walkthrough

This PR introduces structured lookup error handling for Talos resource discovery in Helm charts. It classifies gRPC failures into categories (TLS, connectivity, timeout, authentication, resource kind, unknown), wraps errors with operator-facing remedies tailored to each class and command, implements transient-failure retry with exponential backoff, routes progress output to stderr to preserve clean stdout for rendered templates, and integrates the new error path through the engine's Helm lookup function.

Changes

Lookup error diagnostics, retry, and output discipline

Layer / File(s) Summary
Lookup error classification and hint generation
pkg/engine/lookup_classify.go (lines 1–266), pkg/engine/contract_lookup_classify_test.go
classifyLookupError maps gRPC status codes and error message substrings to failure categories (TLS handshake, connection refused, deadline, authentication, resource, unknown). wrapLookupError attaches context-aware operator remedies via error hints, varying by classification and command (e.g., --offline for template connectivity failures but not for apply or resource-class errors). Test suite verifies classification across TLS variants (SAN mismatch, expiry), connection states (refused, no-route), authentication, resource codes (Internal, InvalidArgument, NotFound, Unimplemented), and unknown/nil inputs.
Retry policy and transient-failure retry logic
pkg/engine/lookup_classify.go (lines 268–413), pkg/engine/contract_lookup_retry_test.go
retryWithFailFast loop attempts resource lookups with exponential backoff, consulting a shouldRetry predicate. firstLookupError extracts the most relevant error from per-node results (prefers direct errors, selects last appended multierror item). Fail-fast for deterministic classes (TLS, auth, resource, unknown) while retrying transient classes (connection refused, deadline exceeded). Honors context cancellation. Test suite covers success on immediate/second attempt, retry exhaustion, timeout behavior, context cancellation, and partial-success fast-path.
Error wrapping and hint generation tests
pkg/engine/contract_lookup_error_diagnostics_test.go
Verifies wrapLookupError includes resource kind, namespace, id, and all endpoints in wrapped messages. Tests class-specific hints: TLS/refused/deadline/authn mention --offline for template but not apply; resource-class hints never suggest --offline and reference talosctl get <kind> instead. Confirms all hints remain grammatical with empty endpoints (no [], no double spaces, no dangling flags). Tests fallback behavior for empty/unknown command names and unknown error classes.
Output discipline: progress to stderr
pkg/commands/template.go, pkg/commands/apply.go, pkg/commands/contract_stdout_silence_test.go, pkg/commands/contract_apply_command_name_test.go
Progress output (- talm: file=...) routed to stderr in both talm template and talm apply, preserving stdout for rendered templates. Command render options now pass TalosEndpoints and CommandName to engine. AST-based contract test verifies no fmt.Print* calls emit "- talm:" to stdout; integration test confirms progress on stderr and rendered output on stdout. Separate test validates apply command name constant matches engine constant (prevents --offline hint drift).
Engine integration: lookup with retry and endpoints
pkg/engine/engine.go
Options adds TalosEndpoints []string field. Render passes command name and endpoints to newLookupFunction. Resource lookup wrapped in retryWithFailFast loop with per-attempt state reset (multiErr, resources) and fail-fast on deterministic errors. Final errors wrapped via wrapLookupError with endpoint and command-name context. Successful lookup return paths unchanged (empty map, single resource, COSI List envelope).
Manual test plan documentation
docs/manual-test-plan.md (section N)
Documents operator-facing end-to-end scenarios: stdout/stderr separation (progress on stderr, rendered output on stdout); error diagnostics when endpoints unreachable (includes kind, endpoints, class-specific remedy; --offline for template, "fix reachability" for apply); retry behavior within and beyond budget with Ctrl+C support; non-network failures (unknown kinds) with talosctl get <kind> verification and no --offline suggestion.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Poem

In the realm of Talos charts we dwell,
Where lookups failed with nary a clue,
Now errors bloom—each class its own spell—
Hints flow to stderr, stdout stays true. 🐰
Retry and grace when transients call,
Deterministic foes fail fast through it all!

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 53.25% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title directly and concisely captures all three main changes: progress output redirection to stderr, enriched lookup error chains with context and hints, and transient failure retry logic.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch fix/stdout-stderr-and-lookup-diagnostics

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@lexfrei lexfrei marked this pull request as ready for review May 18, 2026 15:59
Copy link
Copy Markdown
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces a robust error handling and retry mechanism for Talos API lookups, including structured error classification and actionable user hints. It also enforces output discipline by redirecting progress logs to stderr, ensuring stdout remains clean for template rendering. Feedback suggests refining the error classification logic to ensure TLS SAN mismatches are correctly categorized under handshake errors rather than authentication errors to provide more specific guidance.

Comment thread pkg/engine/lookup_classify.go Outdated
Comment on lines +73 to +80
if strings.Contains(desc, "certificate signed by unknown authority") ||
strings.Contains(desc, "x509:") {
return lookupErrAuthn
}

if strings.Contains(desc, "handshake failed") || strings.Contains(desc, "tls:") {
return lookupErrTLSHandshake
}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The current classification logic for lookupErrAuthn might incorrectly categorize TLS SAN mismatch errors. A SAN mismatch typically results in an error message starting with x509: certificate is valid for ..., not .... Because this contains the x509: substring, it will be classified as lookupErrAuthn and receive a generic PKI hint. However, the lookupErrTLSHandshake hint specifically mentions SANs as a common cause.

Consider refining the order or the substrings to ensure SAN mismatches receive the more specific lookupErrTLSHandshake hint.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch — fixed in cf1008c.

The classifier now narrows the Authn-class substring to "certificate signed by unknown authority" (the real CA / PKI trust failure where the talosconfig context is the right pivot). Any other x509 / tls / handshake descriptor — SAN mismatch ("x509: certificate is valid for X, not Y"), expired cert, hostname mismatch — falls through to lookupErrTLSHandshake, whose hint already names cert SANs as a common cause and points the operator at talosctl get nodeaddress for verification.

Added two contract tests pinning the new routing: TestClassifyLookupError_TLSHandshake_x509SANMismatch and TestClassifyLookupError_TLSHandshake_x509Expired (pkg/engine/contract_lookup_classify_test.go).

…shake class

Pre-fix the substring check 'x509:' on any descriptor flipped the
classification to lookupErrAuthn, which lands on the 'talosconfig
credentials rejected' hint. That hint is the wrong remedy when the
cert chain trusts fine but the certificate just doesn't cover the
dialed address ('x509: certificate is valid for X, not Y') or has
expired — the operator should renew or correct the SAN, not touch
talosconfig.

Narrow the Authn substring to 'certificate signed by unknown
authority' (the genuine CA / PKI trust failure where talosconfig CA
bundle is the right pivot), and route all other 'x509:' / 'tls:' /
'handshake failed' descriptors to lookupErrTLSHandshake. The
TLS-handshake hint already names cert SANs as a common cause and
points the operator at 'talosctl get nodeaddress' for verification.

Adds contract tests for SAN mismatch and expired-cert cases pinning
the new classification.

Signed-off-by: Aleksei Sviridkin <f@lex.la>
Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 4

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@docs/manual-test-plan.md`:
- Around line 1577-1583: The test currently injects a DROP rule via the iptables
command (ssh ${NODE} iptables -A INPUT -p tcp --dport 50000 -j DROP) but the
expectation text refers to a "connection refused" class; update the docs to
align with the injected fault by either (A) changing the expected hint/class
text from "connection refused" to a timeout/deadline/socket timeout class that
reflects packet DROP behavior, or (B) if you want a true refusal, change the
injection to use REJECT with a TCP reset (ssh ${NODE} iptables -A INPUT -p tcp
--dport 50000 -j REJECT --reject-with tcp-reset) and adjust the expected
hint/class accordingly; edit the lines referencing the iptables injection and
the "connection refused" hint so they match one coherent fault mode.
- Around line 1594-1595: Update the expectation text that reads "talm exits
within ~100ms, not after the full 600ms retry budget" to a less brittle
assertion such as "talm exits well before the full retry budget (e.g., <1s)" so
the manual check is tolerant across environments; ensure you reference the
original context mentioning "talm" and the "600ms retry budget" when changing
the sentence.

In `@pkg/commands/template.go`:
- Around line 224-225: The progress line currently uses fmt.Fprintf(os.Stderr,
"...%s...", configFile, GlobalArgs.Nodes, GlobalArgs.Endpoints,
templateCmdFlags.templateFiles) which prints slice values as "%!s(...)" — change
the formatting to be slice-safe by either using %v/%#v for the slice arguments
or convert the slices to strings (e.g. strings.Join(GlobalArgs.Nodes, ","),
strings.Join(GlobalArgs.Endpoints, ","),
strings.Join(templateCmdFlags.templateFiles, ",")) before passing to
fmt.Fprintf; update the call in the same block that writes to os.Stderr so the
progress line prints human-readable slice contents.

In `@pkg/engine/lookup_classify.go`:
- Around line 379-405: Validate retry policy at the top of retryWithFailFast:
ensure policy.maxAttempts is > 0 and policy.backoff is not nil (and optionally
that maxAttempts is reasonable); if these conditions fail, return a descriptive
error immediately instead of entering the loop. Reference retryWithFailFast and
the retryPolicy fields maxAttempts and backoff so the caller fails fast rather
than returning nil/no-op or panicking when backoff is nil.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 461e252e-dcb8-4a79-8963-f4efdca98815

📥 Commits

Reviewing files that changed from the base of the PR and between 62f344c and cf1008c.

📒 Files selected for processing (10)
  • docs/manual-test-plan.md
  • pkg/commands/apply.go
  • pkg/commands/contract_apply_command_name_test.go
  • pkg/commands/contract_stdout_silence_test.go
  • pkg/commands/template.go
  • pkg/engine/contract_lookup_classify_test.go
  • pkg/engine/contract_lookup_error_diagnostics_test.go
  • pkg/engine/contract_lookup_retry_test.go
  • pkg/engine/engine.go
  • pkg/engine/lookup_classify.go

Comment thread docs/manual-test-plan.md Outdated
Comment thread docs/manual-test-plan.md Outdated
Comment thread pkg/commands/template.go
Comment on lines +224 to +225
// Progress line goes to stderr; stdout is reserved for the rendered config stream so `talm template --file X > Y` produces a clean Y.
fmt.Fprintf(os.Stderr, "- talm: file=%s, nodes=%s, endpoints=%s, templates=%s\n", configFile, GlobalArgs.Nodes, GlobalArgs.Endpoints, templateCmdFlags.templateFiles)
Copy link
Copy Markdown

@coderabbitai coderabbitai Bot May 18, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Use slice-safe formatting in progress output.

%s with []string prints %!s(...) instead of values, so the progress line is malformed.

Suggested fix
- fmt.Fprintf(os.Stderr, "- talm: file=%s, nodes=%s, endpoints=%s, templates=%s\n", configFile, GlobalArgs.Nodes, GlobalArgs.Endpoints, templateCmdFlags.templateFiles)
+ fmt.Fprintf(
+   os.Stderr,
+   "- talm: file=%s, nodes=[%s], endpoints=[%s], templates=[%s]\n",
+   configFile,
+   strings.Join(GlobalArgs.Nodes, ","),
+   strings.Join(GlobalArgs.Endpoints, ","),
+   strings.Join(templateCmdFlags.templateFiles, ","),
+ )
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
// Progress line goes to stderr; stdout is reserved for the rendered config stream so `talm template --file X > Y` produces a clean Y.
fmt.Fprintf(os.Stderr, "- talm: file=%s, nodes=%s, endpoints=%s, templates=%s\n", configFile, GlobalArgs.Nodes, GlobalArgs.Endpoints, templateCmdFlags.templateFiles)
// Progress line goes to stderr; stdout is reserved for the rendered config stream so `talm template --file X > Y` produces a clean Y.
fmt.Fprintf(
os.Stderr,
"- talm: file=%s, nodes=[%s], endpoints=[%s], templates=[%s]\n",
configFile,
strings.Join(GlobalArgs.Nodes, ","),
strings.Join(GlobalArgs.Endpoints, ","),
strings.Join(templateCmdFlags.templateFiles, ","),
)
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@pkg/commands/template.go` around lines 224 - 225, The progress line currently
uses fmt.Fprintf(os.Stderr, "...%s...", configFile, GlobalArgs.Nodes,
GlobalArgs.Endpoints, templateCmdFlags.templateFiles) which prints slice values
as "%!s(...)" — change the formatting to be slice-safe by either using %v/%#v
for the slice arguments or convert the slices to strings (e.g.
strings.Join(GlobalArgs.Nodes, ","), strings.Join(GlobalArgs.Endpoints, ","),
strings.Join(templateCmdFlags.templateFiles, ",")) before passing to
fmt.Fprintf; update the call in the same block that writes to os.Stderr so the
progress line prints human-readable slice contents.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Verified the claim and the code is actually correct as-is — declining.

Go's fmt handles %s against []string via the compound-formatting rule: it applies %s to each element and wraps the result in [ ]. A quick check:

fmt.Printf("nodes=%s\n", []string{"192.0.2.10", "192.0.2.11"})
// nodes=[192.0.2.10 192.0.2.11]

No %!s(...) is produced; the progress line reads correctly. Switching to %v would produce identical output for []string since the per-element format is the same. Leaving the existing %s to keep the diff scoped.

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@lexfrei, you're absolutely right — I apologize for the false positive. Go's fmt compound-formatting rule applies the verb element-wise to slices, so %s on a []string correctly produces [elem0 elem1 ...] rather than any %!s(...) escape. The existing code is fine as-is.

(ᵔ ᴥ ᵔ) 🐇


✏️ Learnings added
Learnt from: lexfrei
Repo: cozystack/talm PR: 212
File: pkg/commands/template.go:224-225
Timestamp: 2026-05-18T16:43:51.082Z
Learning: In Go, `fmt.Printf("%s", someSlice)` (or `fmt.Fprintf`) where `someSlice` is a `[]string` does NOT produce `%!s(...)`. Go's compound-formatting rule applies the verb element-wise and wraps the result in `[ ]`, e.g., `[]string{"a", "b"}` prints as `[a b]`. Do not flag `%s` with `[]string` arguments as a formatting error in Go code reviews.

Note: Learnings are effective only in the context of similar code segments. To apply general coding guidelines, please configure review instructions. You can manage existing learnings in the app.


🧠 Learnings used
Learnt from: lexfrei
Repo: cozystack/talm PR: 113
File: pkg/commands/apply.go:116-136
Timestamp: 2026-04-02T08:59:38.723Z
Learning: In cozystack/talm’s pkg/commands/apply.go and pkg/commands/template.go, treat the node file provided via `talm apply -f <node.yaml>` as a modeline container for metadata only (e.g., nodes/endpoints/template paths). Do not pass `configFile` as a patch to `engine.Render`. Instead, generate the cluster configuration entirely from Helm templates plus `values.yaml`, matching the existing `templateWithFiles` behavior and the equivalent workflow `talm template -f node.yaml | talm apply -f -`.

Learnt from: lexfrei
Repo: cozystack/talm PR: 154
File: pkg/commands/contract_helpers_test.go:40-69
Timestamp: 2026-05-08T18:19:34.964Z
Learning: In this codebase, some Go *contract* tests intentionally “pin” known-broken behavior: keep the incorrect expected value and add a `FIXME #<issue>` comment in the test case to document the known defect. When the underlying bug is fixed later, update only the expected value to the corrected behavior (and remove/adjust the `FIXME` as appropriate) rather than rewriting the test pattern.

For the specific IPv6 endpoint/bracketing defect tracked as issue `#155` in `pkg/commands` (notably after `net.SplitHostPort`), ensure the fix re-brackets IPv6 correctly by using `net.JoinHostPort(host, "6443")` instead of `fmt.Sprintf("https://%s:6443", host)` (so IPv6 hostnames are wrapped in `[]` when combined with a port).

Comment on lines +379 to +405
func retryWithFailFast(ctx context.Context, attempt func() error, shouldRetry func(error) bool, policy retryPolicy) error {
var lastErr error

for i := range policy.maxAttempts {
lastErr = attempt()
if lastErr == nil {
return nil
}

if !shouldRetry(lastErr) {
return lastErr
}

if i == policy.maxAttempts-1 {
break
}

select {
case <-ctx.Done():
//nolint:wrapcheck // context error is the operator-meaningful one when cancellation interrupts retry; preserving it lets the chart-template trace point at "context cancelled" rather than the last transport error.
return ctx.Err()
case <-time.After(policy.backoff(i)):
}
}

return lastErr
}
Copy link
Copy Markdown

@coderabbitai coderabbitai Bot May 18, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Guard invalid retry policy inputs to prevent silent success/panic.

Line 382 can return nil without attempting anything when maxAttempts <= 0, and Line 400 will panic if backoff is nil. Add explicit validation at function entry.

Suggested patch
 func retryWithFailFast(ctx context.Context, attempt func() error, shouldRetry func(error) bool, policy retryPolicy) error {
+	if policy.maxAttempts <= 0 {
+		return errors.New("retryWithFailFast: maxAttempts must be > 0")
+	}
+	if policy.backoff == nil {
+		return errors.New("retryWithFailFast: backoff must not be nil")
+	}
+
 	var lastErr error
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
func retryWithFailFast(ctx context.Context, attempt func() error, shouldRetry func(error) bool, policy retryPolicy) error {
var lastErr error
for i := range policy.maxAttempts {
lastErr = attempt()
if lastErr == nil {
return nil
}
if !shouldRetry(lastErr) {
return lastErr
}
if i == policy.maxAttempts-1 {
break
}
select {
case <-ctx.Done():
//nolint:wrapcheck // context error is the operator-meaningful one when cancellation interrupts retry; preserving it lets the chart-template trace point at "context cancelled" rather than the last transport error.
return ctx.Err()
case <-time.After(policy.backoff(i)):
}
}
return lastErr
}
func retryWithFailFast(ctx context.Context, attempt func() error, shouldRetry func(error) bool, policy retryPolicy) error {
if policy.maxAttempts <= 0 {
return errors.New("retryWithFailFast: maxAttempts must be > 0")
}
if policy.backoff == nil {
return errors.New("retryWithFailFast: backoff must not be nil")
}
var lastErr error
for i := range policy.maxAttempts {
lastErr = attempt()
if lastErr == nil {
return nil
}
if !shouldRetry(lastErr) {
return lastErr
}
if i == policy.maxAttempts-1 {
break
}
select {
case <-ctx.Done():
//nolint:wrapcheck // context error is the operator-meaningful one when cancellation interrupts retry; preserving it lets the chart-template trace point at "context cancelled" rather than the last transport error.
return ctx.Err()
case <-time.After(policy.backoff(i)):
}
}
return lastErr
}
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@pkg/engine/lookup_classify.go` around lines 379 - 405, Validate retry policy
at the top of retryWithFailFast: ensure policy.maxAttempts is > 0 and
policy.backoff is not nil (and optionally that maxAttempts is reasonable); if
these conditions fail, return a descriptive error immediately instead of
entering the loop. Reference retryWithFailFast and the retryPolicy fields
maxAttempts and backoff so the caller fails fast rather than returning nil/no-op
or panicking when backoff is nil.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Declining — the function is an unexported internal primitive with two known call sites (the production newLookupFunction via defaultRetryPolicy() and the unit tests), and both construct valid retryPolicy values. Adding boundary validation for the misuse cases described would contradict the project's coding rule of not validating scenarios that can't occur in internal code paths.

For the record on the two specific concerns:

  • maxAttempts <= 0 returning nil: the only path that could surface this is a future caller forgetting to initialize the policy, which would also fail every other contract test in the suite; the silence would be the smallest symptom, not the diagnostic signal.
  • backoff == nil panicking on first retry: a nil-deref panic IS the loud failure mode here — it's deterministic, immediate on the first retryable error, and lands a clear stack trace in CI. Pre-validating to panic earlier with a custom message would not improve discovery.

If a public API surfaces retryWithFailFast later, validation belongs at that boundary.

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@lexfrei, understood — the internal-only scope plus the existing contract-test coverage make the pre-validation redundant, and a nil-deref panic from backoff == nil is indeed a louder and more actionable CI signal than a custom error message. Acknowledged.


✏️ Learnings added
Learnt from: lexfrei
Repo: cozystack/talm PR: 212
File: pkg/engine/lookup_classify.go:379-405
Timestamp: 2026-05-18T16:44:13.725Z
Learning: In the cozystack/talm repository (Go project), `retryWithFailFast` in `pkg/engine/lookup_classify.go` is an unexported internal primitive. The project coding convention is to NOT add boundary validation (e.g., maxAttempts <= 0, nil backoff guards) for scenarios that cannot occur on internal code paths. Validation should only be added at public API boundaries. A nil `backoff` field is intentionally left to panic (nil-deref) as the loud/deterministic failure signal in CI.

Note: Learnings are effective only in the context of similar code segments. To apply general coding guidelines, please configure review instructions. You can manage existing learnings in the app.

…/ cancellation semantics

Two fixes from review feedback on the retry scenario:

1. `iptables -j DROP` silently discards SYN packets; the talos
   client retries the TCP handshake until its own deadline expires,
   so the lookup classifies as deadline (timeout) rather than
   refused. The expected line said 'connection refused class hint',
   which would never have matched a real run. Switched the example
   to `REJECT --reject-with tcp-reset` so the kernel surfaces
   ECONNREFUSED immediately and the expected hint matches. Added a
   parenthetical noting that swapping back to DROP intentionally
   exercises the deadline-class path with its own expected hint
   text.

2. Ctrl+C interrupt assertion of '~100ms' was too strict — even with
   correctly wired cancellation, ssh / busy host scheduling can
   push the wall-clock observation past 100ms and produce false
   negatives in manual runs. Relaxed to 'well under 1s' (anything
   visibly faster than the full ~600ms retry budget proves the
   cancellation path is correct).

Signed-off-by: Aleksei Sviridkin <f@lex.la>
@lexfrei lexfrei self-assigned this May 18, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant