diff --git a/.claude/CLAUDE.md b/.claude/CLAUDE.md index 9f4fe10f6..e2a6e557a 100644 --- a/.claude/CLAUDE.md +++ b/.claude/CLAUDE.md @@ -3,3 +3,7 @@ See @../AGENTS.md for project guidelines. See @../redhat-compliance-and-responsible-ai.md for Red Hat compliance and responsible AI rules. + +## Architecture Decision Records + +Before marking work complete, check: did you reject an alternative or accept a trade-off? If yes, suggest creating an ADR in `adr/`. Before proposing changes, check existing ADRs for prior decisions that might be affected. See [adr/README.md](../adr/README.md) for conventions. diff --git a/AGENTS.md b/AGENTS.md index 13ddb169b..303dc9c54 100644 --- a/AGENTS.md +++ b/AGENTS.md @@ -748,11 +748,33 @@ Use this to trigger a reconciliation without changing the spec. kubectl patch dw --type merge -p "{\"metadata\": {\"annotations\": {\"force-update\": \"$(date +%s)\"}}}" ``` +## Architecture Decision Records (ADRs) + +ADRs live in the [`adr/`](adr/) directory. They capture the **why** behind non-trivial design decisions — information that is lost in code and PR descriptions over time. + +### When to Create an ADR + +Before marking work complete, ask: **did I reject an alternative or accept a trade-off?** If yes, create an ADR. + +Specifically, write an ADR when your change involves any of these: + +1. **You rejected an alternative** — There were 2+ reasonable approaches and you picked one +2. **You accepted a trade-off** — Something got worse in exchange for something more important +3. **You changed a resource lifecycle or ownership model** — Who creates, owns, or cleans up a Kubernetes resource +4. **You changed an external contract** — API shape, CRD fields, Secret/ConfigMap naming, image paths + +**Don't** write an ADR for: bug fixes, dependency bumps, refactors preserving behavior, test additions, docs, or performance optimizations with no trade-offs. + +### How to Create an ADR + +Copy [`adr/TEMPLATE.md`](adr/TEMPLATE.md) to `adr/YYYY-MM-DD-short-slug.md` and fill in all sections. See [`adr/AGENTS.md`](adr/AGENTS.md) for maintenance conventions. + ## See Also - **[README.md](README.md)** - Project overview, installation, what is DevWorkspace Operator - **[CONTRIBUTING.md](CONTRIBUTING.md)** - Development setup, debugging, testing, contribution guidelines - **[redhat-compliance-and-responsible-ai.md](redhat-compliance-and-responsible-ai.md)** - Red Hat compliance & responsible AI rules +- **[adr/](adr/)** - Architecture Decision Records - **[docs/additional-configuration.adoc](docs/additional-configuration.adoc)** - DevWorkspace configuration options - **[docs/unsupported-devfile-api.adoc](docs/unsupported-devfile-api.adoc)** - Unsupported Devfile API features - **[Devfile Documentation](https://devfile.io/)** - Devfile specification @@ -760,4 +782,4 @@ kubectl patch dw --type merge -p "{\"metadata\": {\"annotations\": {\"for --- -**Last Updated**: 2025-12-24 +**Last Updated**: 2026-05-15 diff --git a/adr/2026-05-11-backup-auth-secret-lifecycle.md b/adr/2026-05-11-backup-auth-secret-lifecycle.md new file mode 100644 index 000000000..d60481e5b --- /dev/null +++ b/adr/2026-05-11-backup-auth-secret-lifecycle.md @@ -0,0 +1,97 @@ +# Backup Auth Secret Must Not Be Owned by Any Workspace + +**Status**: Accepted +**Date**: 2026-05-11 +**Deciders**: DevWorkspace Operator maintainers +**Related Issue**: [CRW-10760](https://redhat.atlassian.net/browse/CRW-10760) + +## Context + +The backup system copies the registry authentication secret (e.g., quay.io credentials) from the operator namespace into each workspace namespace as `devworkspace-backup-registry-auth`. The original implementation set a Kubernetes controller ownerReference from this secret to the DevWorkspace that triggered the copy: + +```go +controllerutil.SetControllerReference(workspace, desiredSecret, scheme) +``` + +This was likely intended to clean up the secret when no workspaces need it anymore — standard Kubernetes garbage collection pattern. + +However, two properties of this secret make ownerReference-based lifecycle management incorrect: + +1. **The secret is a namespace singleton**: All workspaces in a namespace share the same `devworkspace-backup-registry-auth` secret, but a Kubernetes object can have only one controller owner. Whichever workspace's backup job ran last "wins" ownership. + +2. **The secret is needed after all workspaces are deleted**: The primary restore use case is creating a new workspace from a backup of a deleted one. If the auth secret is garbage-collected when the last workspace is deleted, the user cannot authenticate to the private registry to pull the backup image. + +**Bug observed (CRW-10760)**: When using quay.io (private registry) for backups, deleting a workspace caused backup entries to disappear from the Dashboard backup list for ALL workspaces in the namespace. The auth secret was garbage-collected, and the Dashboard could no longer query the registry. + +**Validated on CRC cluster** (DWO 0.40.1, quay.io/okurinny): +- The `devworkspace-backup-registry-auth` secret was confirmed to have ownerReference to a single workspace (`nodejs`) +- Deleting that workspace triggered K8s GC, removing the secret +- The secret was not re-created by subsequent backup cycles (remaining workspaces already had recent backups) +- Backup listing in the Dashboard failed for all workspaces + +## Decision + +**Remove the ownerReference from the backup registry auth secret.** The secret becomes a namespace-scoped resource with no ownership tie to any workspace. + +### What Changes + +- `pkg/secrets/backup.go`: Remove the `controllerutil.SetControllerReference()` call in `CopySecret()` +- The secret is still created via `c.Create()` with `AlreadyExists` handling, just without an ownerReference +- The restore path now resolves the operator namespace via `infrastructure.GetNamespace()` to copy the secret on demand when it is missing + +### What Doesn't Change + +- Per-workspace resources (job runner ServiceAccount, image-builder RoleBinding) retain their ownerReferences — their GC on workspace deletion is correct and expected +- The backup Job itself retains its ownerReference (short-lived, TTL-cleaned) +- The `CopySecret` function signature stays the same + +## Considered Alternatives + +### Alternative 1: Multi-owner references (non-controller) + +Add each workspace as a non-controller owner of the secret. K8s GC would only delete the secret when ALL owning workspaces are gone. + +**Rejected because**: +- The secret must survive deletion of ALL workspaces (for restore) +- Adds complexity to track and merge owner lists +- Non-controller ownerReferences have subtle GC semantics + +### Alternative 2: Finalizer-based cleanup + +Remove ownerReference but add a cleanup mechanism (e.g., a controller that deletes the secret when the namespace has zero DevWorkspaces). + +**Rejected because**: +- Adds complexity for a marginal benefit (one small secret in an otherwise-empty namespace) +- Could race with restore operations (user creates a workspace from backup right after the last one is deleted) +- Namespace deletion already cleans up all resources + +### Alternative 3: Per-workspace auth secrets + +Create a unique auth secret per workspace (e.g., `devworkspace-backup-registry-auth-{workspace-id}`). + +**Rejected because**: +- Multiplies secrets unnecessarily (all contain the same credentials) +- The restore path expects the predefined name `devworkspace-backup-registry-auth` +- Still wouldn't survive workspace deletion for restore use case + +## Consequences + +### Positive + +1. **Backups survive workspace deletion**: Users can delete all workspaces and still restore from backups +2. **No cross-workspace interference**: Deleting one workspace no longer affects other workspaces' backup capabilities +3. **Simpler lifecycle**: No ownership tracking needed for a namespace-scoped singleton + +### Negative + +1. **Secret persists in empty namespaces**: If all workspaces are deleted and the user never restores, the auth secret remains until the namespace is deleted. This is a minor leak — one small secret per namespace. + +### Neutral + +1. **Existing secrets on upgraded clusters**: Secrets created by older DWO versions will retain their stale ownerReference until the next `CopySecret` call overwrites them (via `SyncObjectWithCluster`). In the worst case, one more GC event occurs before the fix takes effect. + +## References + +- `pkg/secrets/backup.go` — `CopySecret()` function +- `controllers/backupcronjob/rbac.go` — Per-workspace SA/RoleBinding (unchanged) +- `pkg/constants/metadata.go:204` — `DevWorkspaceBackupAuthSecretName` diff --git a/adr/AGENTS.md b/adr/AGENTS.md new file mode 100644 index 000000000..dec93c549 --- /dev/null +++ b/adr/AGENTS.md @@ -0,0 +1,32 @@ +# ADR Maintenance Guide for AI Agents + +This guide covers how to create and maintain ADRs. For when to create an ADR, see the root [AGENTS.md](../AGENTS.md). + +## Creating a New ADR + +1. Copy [TEMPLATE.md](TEMPLATE.md) to a new file named `YYYY-MM-DD-short-slug.md` +2. Use today's date and a descriptive lowercase slug with hyphens +3. Fill in all sections — the "Considered Alternatives" section is the most valuable part +4. Set status to **Proposed** if seeking feedback, or **Accepted** if the decision is final + +## Naming Convention + +`YYYY-MM-DD-short-description.md` + +- Date prefix for chronological ordering +- Lowercase, hyphens only, no special characters +- Keep the slug short but descriptive + +## Superseding an ADR + +When a decision is replaced: + +1. Update the old ADR's status line: `**Status**: Superseded by [New Title](YYYY-MM-DD-new-slug.md)` +2. Create the new ADR with its own context explaining why the previous decision changed +3. Reference the old ADR in the new one's Context section + +## Format Rules + +- Follow the structure in [TEMPLATE.md](TEMPLATE.md) +- End every file with a trailing newline +- Use the existing ADRs as examples for tone and level of detail diff --git a/adr/README.md b/adr/README.md new file mode 100644 index 000000000..810cca124 --- /dev/null +++ b/adr/README.md @@ -0,0 +1,39 @@ +# Architecture Decision Records (ADRs) + +This directory contains Architecture Decision Records for the DevWorkspace Operator. + +## What is an ADR? + +An ADR captures a significant design decision along with its context, alternatives considered, and trade-offs accepted. ADRs preserve the **why** behind decisions — information that is lost in code, commit messages, and PR descriptions over time. + +## When to Write an ADR + +Write an ADR when your change involves any of these: + +1. **You rejected an alternative** — There were 2+ reasonable approaches and you picked one. The code shows what you chose but not what you didn't. +2. **You accepted a trade-off** — Something got worse (performance, complexity, a minor leak) in exchange for something more important. +3. **You changed a resource lifecycle or ownership model** — Who creates, owns, or cleans up a Kubernetes resource. +4. **You changed an external contract** — API shape, CRD fields, Secret/ConfigMap naming conventions, image paths. + +**Don't** write an ADR for: bug fixes, dependency bumps, refactors preserving behavior, test additions, docs updates, or performance optimizations with no trade-offs. + +## Lifecycle + +| Status | Meaning | +|--------|---------| +| **Proposed** | Under discussion, not yet accepted | +| **Accepted** | Decision is in effect | +| **Deprecated** | Decision is outdated but not yet replaced | +| **Superseded** | Replaced by a newer ADR (link to it in the status line) | + +To supersede an ADR, update its status to `Superseded by [new ADR title](new-adr-file.md)` and create a new ADR. + +## Naming Convention + +Files use date-prefixed slugs: `YYYY-MM-DD-short-description.md` + +Example: `2026-05-11-backup-auth-secret-lifecycle.md` + +## Creating a New ADR + +Copy [TEMPLATE.md](TEMPLATE.md) and fill in the sections. See existing ADRs for examples. diff --git a/adr/TEMPLATE.md b/adr/TEMPLATE.md new file mode 100644 index 000000000..50813b062 --- /dev/null +++ b/adr/TEMPLATE.md @@ -0,0 +1,51 @@ +# [Title of Decision] + +**Status**: Proposed | Accepted | Deprecated | Superseded by [title](link) +**Date**: YYYY-MM-DD +**Deciders**: DevWorkspace Operator maintainers +**Related Issue**: [ISSUE-ID](link) *(optional)* + +## Context + +What is the problem or need that prompted this decision? Include enough background for someone unfamiliar with the issue to understand why a decision was needed. + +## Decision + +What did we decide to do? Be specific about what changes and what stays the same. + +## Considered Alternatives + +### Alternative 1: [Name] + +Brief description. + +**Rejected because**: +- Reason 1 +- Reason 2 + +### Alternative 2: [Name] + +Brief description. + +**Rejected because**: +- Reason 1 +- Reason 2 + +## Consequences + +### Positive + +1. Benefit 1 +2. Benefit 2 + +### Negative + +1. Downside or accepted trade-off + +### Neutral + +1. Side effects that are neither positive nor negative + +## References + +- Links to relevant code, docs, PRs, or external resources