[addon-operator] added new metrics #20476

Draft
diyliv wants to merge 1 commit into main from feature/converge_modules

diyliv (Contributor) commented Jun 4, 2026

Description

Changes in addon-operator:

  1. New metric deckhouse_tasks_queue_head_info with labels queue, module, task_type, hook — shows what task is at the head of each non-empty queue.
  2. New label critical on deckhouse_mm_module_info — value from BasicModule.GetCritical().

Why do we need it, and what problem does it solve?

deckhouse_tasks_queue_length exposes only the queue label, so there is no visibility into which task, module, or hook sits at the head of a hung queue. Alert severity is always "7" regardless of module criticality, and deckhouse_mm_module_info lacks a critical label.

This change enables severity-tiered alerting based on module criticality: hung queues of critical modules (e.g., cni-cilium) get severity 4, those of non-critical modules (e.g., console) get severity 6, and hung global or parallel tasks get severity 4. Operators get an immediate signal about what is stuck and how important it is.
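
The severity tiers described above could be expressed roughly as follows. This is only an illustrative sketch, not the actual alert rule (which lives in a separate PR): the QueueHead type and the alertSeverity function are hypothetical names, and it assumes that a global task is one whose head carries an empty module label. How parallel task hangs are identified is not stated here, so the sketch leaves that case out.

```go
package main

import "fmt"

// QueueHead mirrors the labels of deckhouse_tasks_queue_head_info.
// The type itself is hypothetical and exists only for this sketch.
type QueueHead struct {
	Queue    string
	Module   string
	TaskType string
	Hook     string
}

// alertSeverity returns the severity tier for a hung queue.
// critical maps a module name to the value of its critical label
// on deckhouse_mm_module_info.
func alertSeverity(head QueueHead, critical map[string]bool) int {
	// Assumption: a head without a module label is a global task.
	if head.Module == "" {
		return 4
	}
	// Critical modules get the higher-priority tier.
	if critical[head.Module] {
		return 4
	}
	// Everything else is treated as non-critical.
	return 6
}

func main() {
	critical := map[string]bool{"cni-cilium": true, "console": false}
	heads := []QueueHead{
		{Queue: "main", Module: "cni-cilium", TaskType: "ModuleRun"},
		{Queue: "main", Module: "console", TaskType: "ModuleRun"},
		{Queue: "main"}, // no module label: assumed to be a global task
	}
	for _, h := range heads {
		fmt.Printf("module=%q severity=%d\n", h.Module, alertSeverity(h, critical))
	}
}
```

A module missing from the map is treated as non-critical here; a real rule would get the same effect by joining on the critical label.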

Example

critical label on deckhouse_mm_module_info

deckhouse_mm_module_info{critical="true",enabled="true",module="cni-cilium"} 1
deckhouse_mm_module_info{critical="true",enabled="true",module="control-plane-manager"} 1
deckhouse_mm_module_info{critical="true",enabled="true",module="deckhouse"} 1
deckhouse_mm_module_info{critical="true",enabled="true",module="node-manager"} 1
deckhouse_mm_module_info{critical="true",enabled="false",module="cloud-provider-aws"} 1
deckhouse_mm_module_info{critical="true",enabled="false",module="cloud-provider-vsphere"} 1
deckhouse_mm_module_info{critical="false",enabled="true",module="cert-manager"} 1
deckhouse_mm_module_info{critical="false",enabled="true",module="console"} 1
deckhouse_mm_module_info{critical="false",enabled="true",module="ingress-nginx"} 1

On the dev cluster, 26 modules report critical="true" and 41 report critical="false".
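
One way to reproduce such a count from a metrics dump is sketched below. It is a minimal helper for illustration, not part of this PR: countByCritical is a hypothetical name, and the input is assumed to be plain Prometheus text exposition output like the sample lines above.

```go
package main

import (
	"bufio"
	"fmt"
	"regexp"
	"strings"
)

// criticalLabel extracts the value of the critical label from a series line.
var criticalLabel = regexp.MustCompile(`critical="(true|false)"`)

// countByCritical counts deckhouse_mm_module_info series per critical value.
// Lines belonging to other metrics are ignored.
func countByCritical(exposition string) map[string]int {
	counts := map[string]int{}
	sc := bufio.NewScanner(strings.NewReader(exposition))
	for sc.Scan() {
		line := strings.TrimSpace(sc.Text())
		if !strings.HasPrefix(line, "deckhouse_mm_module_info{") {
			continue
		}
		if m := criticalLabel.FindStringSubmatch(line); m != nil {
			counts[m[1]]++
		}
	}
	return counts
}

func main() {
	sample := `deckhouse_mm_module_info{critical="true",enabled="true",module="cni-cilium"} 1
deckhouse_mm_module_info{critical="true",enabled="false",module="cloud-provider-aws"} 1
deckhouse_mm_module_info{critical="false",enabled="true",module="console"} 1`
	counts := countByCritical(sample)
	fmt.Println("critical=true:", counts["true"], "critical=false:", counts["false"])
}
```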

deckhouse_tasks_queue_head_info with non-empty queues

Captured during startup (tasks queued across main, parallel, and module-specific queues):

deckhouse_tasks_queue_head_info{hook="",module="node-manager",queue="main",task_type="ModuleRun"} 1
deckhouse_tasks_queue_head_info{hook="",module="ingress-nginx",queue="parallel_queue_12",task_type="ModuleRun"} 1
deckhouse_tasks_queue_head_info{hook="",module="upmeter",queue="parallel_queue_18",task_type="ModuleRun"} 1
deckhouse_tasks_queue_head_info{hook="300-prometheus/hooks/disk_metrics.go",module="prometheus",queue="/modules/prometheus/disk_metrics",task_type="ModuleHookRun"} 1
deckhouse_tasks_queue_head_info{hook="402-ingress-nginx/hooks/safe_daemonset_update.go",module="ingress-nginx",queue="/modules/ingress-nginx/safe_daemonset_update",task_type="ModuleHookRun"} 1

When all queues are empty (steady state), deckhouse_tasks_queue_head_info correctly produces no series. Verified on dev cluster.
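
The behaviour shown above (one series per non-empty queue, none in steady state) can be sketched as follows. The Task, Queue and HeadLabels types and the headInfo function are hypothetical and do not reflect the actual addon-operator code; the sketch only illustrates how the label sets might be derived from queue state.

```go
package main

import "fmt"

// Task is a simplified, hypothetical stand-in for a queued task.
type Task struct {
	Type   string
	Module string
	Hook   string
}

// Queue is a simplified, hypothetical stand-in for a task queue.
type Queue struct {
	Name  string
	Tasks []Task
}

// HeadLabels holds the label values of one deckhouse_tasks_queue_head_info series.
type HeadLabels struct {
	Queue    string
	Module   string
	TaskType string
	Hook     string
}

// headInfo returns one label set per non-empty queue, describing the task
// at its head. Empty queues contribute nothing, so the metric has no series
// when every queue is empty.
func headInfo(queues []Queue) []HeadLabels {
	var out []HeadLabels
	for _, q := range queues {
		if len(q.Tasks) == 0 {
			continue
		}
		head := q.Tasks[0]
		out = append(out, HeadLabels{
			Queue:    q.Name,
			Module:   head.Module,
			TaskType: head.Type,
			Hook:     head.Hook,
		})
	}
	return out
}

func main() {
	queues := []Queue{
		{Name: "main", Tasks: []Task{{Type: "ModuleRun", Module: "node-manager"}}},
		{Name: "parallel_queue_12"}, // empty queue: no series expected
	}
	for _, l := range headInfo(queues) {
		fmt.Printf("deckhouse_tasks_queue_head_info{hook=%q,module=%q,queue=%q,task_type=%q} 1\n",
			l.Hook, l.Module, l.Queue, l.TaskType)
	}
}
```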

Checklist

  • Unit tests pass (11 tests in pkg/metrics + full pkg/... suite).
  • Alert rules in deckhouse (separate PR).
  • Documentation updated.
  • Deployed to dev cluster, critical label and tasks_queue_head_info verified.

Changelog entries

section: deckhouse
type: feature
summary: Rework D8DeckhouseQueueIsHung alert with queue head info and severity based on module criticality.

Signed-off-by: diyliv <onlogn081@gmail.com>
github-actions bot added the type/dependencies, area/core, and security/gitleaks/success labels Jun 4, 2026
diyliv marked this pull request as draft June 4, 2026 13:45
diyliv changed the title from "[addon-operator] rework D8DeckhouseQueueIsHung alert with queue head info" to "[addon-operator] added new metrics" Jun 4, 2026
github-actions bot added the security/rootless/success, security/antivirus/success, and security/cve/failed labels Jun 4, 2026