Merged
Changes from 1 commit
28 changes: 18 additions & 10 deletions rollback/playbooks/rollback_telemetry.yml
@@ -68,17 +68,25 @@
ansible.builtin.debug:
msg: "[ROLLBACK] Component '{{ component_name }}' — status changed to: in-progress"

# TODO: Implement telemetry rollback steps per ESpec §4.8.5:
# 1. Helm uninstall new components (powerscale, vast, victorialogs, ufm)
# 2. Rollback Strimzi operator + Kafka brokers to previous version
# 3. Rollback VictoriaMetrics StatefulSet(s) to previous version
# 4. Rollback iDRAC telemetry receiver + pump images
# 5. Restore LDMS sampler/aggregator configs from backup
# 6. Rolling restart LDMS pods
# 7. Validate: all telemetry pods Running, metrics/logs flowing
- name: Telemetry rollback placeholder
# ── Telemetry rollback strategy ──────────────────────────────────
# The K8s rollback (etcd snapshot restore) restores ALL Kubernetes objects
# to their pre-upgrade (2.1) state, which includes all telemetry namespace
# resources: Deployments, StatefulSets, Services, ConfigMaps, Secrets,
# PVCs, Helm release secrets, and CRDs.
#
# 2.2-only components (vector-ldms, vector-ome, victoria-logs cluster,
# victoria-metrics-operator CRDs, vlagent-vector, vmagent-vector) are
# automatically removed because they did not exist in the etcd snapshot.
#
# Post-K8s-rollback telemetry pod verification is performed by
# verify_telemetry_rollback.yml (Stage 8d) inside the rollback_k8s role.
- name: "Telemetry rollback — handled by K8s rollback (etcd restore)"
ansible.builtin.debug:
msg: "Telemetry rollback tasks to be implemented (Helm uninstall, component rollback)"
msg:
- "Telemetry rollback is handled by K8s rollback via etcd snapshot restore."
- "The etcd snapshot contains the full 2.1 telemetry namespace state."
- "2.2-only components will be removed automatically."
- "Post-rollback telemetry pod verification runs in K8s rollback Stage 8d."

- name: Mark telemetry rollback as completed
ansible.builtin.copy:
4 changes: 4 additions & 0 deletions rollback/roles/rollback_k8s/tasks/main.yml
@@ -197,6 +197,10 @@
- name: "Stage 8c — Clean up stale CSI VolumeAttachments"
ansible.builtin.include_tasks: cleanup_stale_volume_attachments.yml

# ── Stage 8d: Verify telemetry pods after etcd restore ───────
- name: "Stage 8d — Verify telemetry rollback"
ansible.builtin.include_tasks: verify_telemetry_rollback.yml

# ── Stage 9: Restore BSS boot params and cloud-init ──────────
- name: "Stage 9 — Restore BSS boot params and cloud-init"
ansible.builtin.include_tasks: restore_bss_cloud_init.yml
104 changes: 104 additions & 0 deletions rollback/roles/rollback_k8s/tasks/verify_telemetry_rollback.yml
@@ -0,0 +1,104 @@
# Copyright 2026 Dell Inc. or its subsidiaries. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
---
# ============================================================================
# Verify Telemetry Rollback (Post-K8s-Rollback)
# ============================================================================
# After etcd restore, the 2.1 telemetry objects are restored and 2.2-only
# objects are removed. This task verifies:
# 1. All telemetry pods are displayed with their status
# 2. No 2.2-only components remain (vector-ldms, vector-ome, victoria-logs,
# victoria-metrics-operator, vlagent-vector, vmagent-vector)
# 3. Pod readiness summary
#
# Prerequisites (guaranteed by rollback_k8s/main.yml execution order):
# - kube_vip: set in load_version_vars.yml, added to inventory in
# load_rollback_status.yml, verified reachable in post_validation.yml
# ============================================================================

# ── Get all telemetry pods ────────────────────────────────────────────
- name: "Telemetry rollback verification — Get all pods in telemetry namespace"
ansible.builtin.command:
cmd: kubectl get pods -n telemetry -o wide
delegate_to: "{{ kube_vip }}"
connection: ssh
register: telemetry_pods_result
changed_when: false
failed_when: false

- name: "Display telemetry pods after K8s rollback"
ansible.builtin.debug:
msg: "{{ telemetry_pods_result.stdout_lines | default(['No pods found in telemetry namespace']) }}"

# ── Check for stale 2.2-only components ───────────────────────────────
- name: "Check for 2.2-only components that should have been removed by etcd restore"
ansible.builtin.command:
cmd: >-
kubectl get deploy,sts -n telemetry --no-headers
-o custom-columns='KIND:.kind,NAME:.metadata.name'
delegate_to: "{{ kube_vip }}"
connection: ssh
register: telemetry_resources_result
changed_when: false
failed_when: false

- name: "Identify any 2.2-only telemetry components still present"
ansible.builtin.set_fact:
stale_22_components: >-
{{ telemetry_resources_result.stdout_lines | default([])
| select('search',
'vector-ldms|vector-ome|victoria-metrics-operator|vlagent-vector|vmagent-vector|vlstorage|vlinsert|vlselect|vlagent-vlagent|vmstorage-victoria|vmselect-victoria|vminsert-victoria|vmagent-vmagent')
| list }}

- name: "Display 2.2-only component cleanup status"
ansible.builtin.debug:
msg: >-
{% if stale_22_components | length == 0 -%}
All 2.2-only components removed successfully by etcd restore.{%- else -%}
WARNING: {{ stale_22_components | length }} stale 2.2 component(s) still present: {{ stale_22_components | join(', ') }}{%- endif %}

# ── Pod readiness summary ─────────────────────────────────────────────
- name: "Check telemetry pod readiness summary"
ansible.builtin.shell:
cmd: |
set -o pipefail
# grep -c prints 0 itself when nothing matches; '|| true' only absorbs its non-zero exit.
total=$(kubectl get pods -n telemetry --no-headers 2>/dev/null | grep -cv 'Completed' || true)
running=$(kubectl get pods -n telemetry --no-headers 2>/dev/null | grep -c 'Running' || true)
not_ready=$(kubectl get pods -n telemetry --no-headers 2>/dev/null | grep -v 'Running\|Completed' || echo "")
echo "TOTAL=${total}"
echo "RUNNING=${running}"
if [ -n "${not_ready}" ]; then
echo "NOT_READY:"
echo "${not_ready}"
fi
executable: /bin/bash
delegate_to: "{{ kube_vip }}"
connection: ssh
register: telemetry_summary
changed_when: false
failed_when: false

- name: "Display telemetry pod readiness summary"
ansible.builtin.debug:
msg: "{{ telemetry_summary.stdout_lines | default(['Could not retrieve pod summary']) }}"

- name: "Warn if any telemetry pods are not Running"
ansible.builtin.debug:
msg: >-
WARNING: Some telemetry pods are not in Running state after K8s rollback.
This may be transient — pods may need time to start after etcd restore.
If pods remain unhealthy after 5-10 minutes, manual investigation is needed.
when:
- telemetry_pods_result.stdout is defined
- telemetry_pods_result.stdout | regex_search('CrashLoopBackOff|Error|ImagePullBackOff|Pending|ContainerCreating') is not none
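
The two checks in this file (the stale 2.2 component filter and the pod readiness counts) can be exercised offline against canned `kubectl` output. The following is a minimal sketch, not part of the PR: the helper names `find_stale_22` and `summarize_pods` and all sample pod and resource names are hypothetical. The search pattern is copied from the `set_fact` task above. One deliberate difference: the counters use `awk` on the STATUS column (column 3 of `kubectl get pods --no-headers`) rather than `grep -c`. In general, `grep -c` prints `0` and also exits non-zero when nothing matches, so a fallback such as `|| echo 0` can emit the count twice; `awk` prints each counter exactly once.

```shell
#!/usr/bin/env bash
# Sketch only: helper names and sample data are hypothetical, not from the PR.

# Print the stdin lines that name a 2.2-only component.
# The pattern is the one used by the select('search', ...) filter in the task file.
find_stale_22() {
  grep -E 'vector-ldms|vector-ome|victoria-metrics-operator|vlagent-vector|vmagent-vector|vlstorage|vlinsert|vlselect|vlagent-vlagent|vmstorage-victoria|vmselect-victoria|vminsert-victoria|vmagent-vmagent' || true
}

# Summarize `kubectl get pods --no-headers` style lines read from stdin.
# Assumes column 3 is the STATUS column. Empty input yields zero for both counters.
summarize_pods() {
  awk '
    NF == 0           { next }      # ignore blank lines
    $3 != "Completed" { total++ }   # every pod except finished jobs
    $3 == "Running"   { running++ }
    END { printf "TOTAL=%d\nRUNNING=%d\n", total, running }
  '
}

# Hypothetical sample data standing in for real cluster output.
sample_pods='idrac-telemetry-0   2/2   Running            0   5m
kafka-0             1/1   Running            0   5m
init-job-abc        0/1   Completed          0   6m
ldms-agg-0          0/1   CrashLoopBackOff   3   5m'

sample_resources='Deployment    vector-ldms
StatefulSet   idrac-telemetry
StatefulSet   vmstorage-victoria-cluster'

printf '%s\n' "$sample_pods" | summarize_pods
printf '%s\n' "$sample_resources" | find_stale_22
```

Against a live cluster the same helpers could be fed from `kubectl get pods -n telemetry --no-headers` and `kubectl get deploy,sts -n telemetry --no-headers`, as the tasks above do.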
12 changes: 12 additions & 0 deletions upgrade/roles/upgrade_k8s/tasks/step_uncordon.yml
@@ -12,8 +12,20 @@
# See the License for the specific language governing permissions and
# limitations under the License.
---
- name: Wait for API server to be reachable before uncordon
ansible.builtin.command: kubectl get --raw /healthz
delegate_to: "{{ kube_vip }}"
register: api_health
changed_when: false
retries: 30
delay: 10
until: api_health.rc == 0

- name: Uncordon node {{ current_node_name }}
ansible.builtin.command: kubectl uncordon {{ node_ip }}
delegate_to: "{{ kube_vip }}"
register: uncordon_result
changed_when: true
retries: 5
delay: 10
until: uncordon_result.rc == 0
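
The `retries` / `delay` / `until` combination added here is a bounded retry loop: re-run the command until it exits 0 or the attempt budget runs out. A rough shell equivalent is sketched below for illustration. The `retry` helper and the `flaky` stand-in command are hypothetical and not part of the PR, and Ansible's exact attempt accounting may differ from this sketch.

```shell
#!/usr/bin/env bash
# Sketch only: `retry` and `flaky` are hypothetical helpers, not from the PR.

# retry ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
# Runs COMMAND until it exits 0, at most ATTEMPTS times, sleeping DELAY_SECONDS
# between attempts. Returns 0 on success, 1 once the budget is exhausted.
retry() {
  local attempts=$1 delay=$2
  shift 2
  local n=1
  while true; do
    if "$@"; then
      return 0
    fi
    if [ "$n" -ge "$attempts" ]; then
      return 1
    fi
    n=$((n + 1))
    sleep "$delay"
  done
}

# Stand-in for a command that fails twice before succeeding,
# like an API server that is still coming up.
calls=0
flaky() {
  calls=$((calls + 1))
  [ "$calls" -ge 3 ]
}

retry 5 0 flaky && echo "succeeded after ${calls} attempts"

# On a real control plane the health gate would look roughly like:
#   retry 30 10 kubectl get --raw /healthz
```

Gating the uncordon on the health check first means the uncordon retries only have to cover short API interruptions rather than a full API server restart.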
4 changes: 2 additions & 2 deletions upgrade/roles/upgrade_telemetry/tasks/backup_telemetry.yml
@@ -72,10 +72,10 @@
failed_when: false

- name: Backup telemetry.sh from control plane
ansible.builtin.fetch:
ansible.builtin.copy:
src: /root/telemetry.sh
dest: "{{ tel_backup_dir }}/telemetry.sh"
flat: true
mode: '0644'
delegate_to: "{{ kube_vip }}"
connection: ssh
when:
55 changes: 31 additions & 24 deletions upgrade/roles/upgrade_telemetry/tasks/execute_telemetry_sh.yml
@@ -34,9 +34,7 @@

- name: Fail if telemetry.sh is missing
ansible.builtin.fail:
msg: >-
telemetry.sh not found at {{ telemetry_script_path }} on kube_vip.
Ensure the provision playbook has generated the telemetry deployment script.
msg: "{{ telemetry_sh_missing_msg }}"
when: not (telemetry_sh_stat.stat.exists | default(false))

- name: Verify kustomization.yaml exists in deployments directory
@@ -48,9 +46,7 @@

- name: Fail if kustomization.yaml is missing
ansible.builtin.fail:
msg: >-
kustomization.yaml not found at {{ telemetry_kustomization_dir }}/kustomization.yaml on kube_vip.
Ensure the provision playbook has generated the kustomize deployment files.
msg: "{{ kustomization_missing_msg }}"
when: not (kustomization_stat.stat.exists | default(false))

- name: Execute telemetry.sh and validate deployment
@@ -98,7 +94,7 @@

- name: Fail telemetry.sh for non-Helm errors
ansible.builtin.fail:
msg: "telemetry.sh failed with non-Helm error: {{ telemetry_sh_result.stderr }}"
msg: "{{ telemetry_sh_non_helm_error_msg }}"
when:
- telemetry_sh_result.rc != 0
- not (telemetry_sh_helm_error | default(false))
@@ -125,11 +121,20 @@
delegate_to: "{{ kube_vip }}"
connection: ssh
changed_when: false
failed_when: false
register: rollout_scale_result
when:
- idrac_sts_check.rc == 0
- idrac_replica_count.stdout | int > 0
- restore_replicas_result.changed | default(false)

- name: Warn if idrac-telemetry rollout did not complete in time
ansible.builtin.debug:
msg: "{{ idrac_rollout_timeout_warning_msg }}"
when:
- rollout_scale_result is defined
- rollout_scale_result.rc | default(0) != 0

- name: Display replica restore status
ansible.builtin.debug:
msg: "{{ idrac_replica_restore_msg }}"
@@ -156,11 +161,14 @@
failed_when: false

# ── Post-deployment validation: Check for MySQL issues ──
- name: Check idrac-telemetry pod status after deployment
- name: Check idrac-telemetry pod container status after deployment
ansible.builtin.shell:
cmd: >
kubectl get pods -n {{ telemetry_namespace }} -l app=idrac-telemetry
--no-headers -o custom-columns=NAME:.metadata.name,STATUS:.status.phase,READY:.status.containerStatuses[*].ready
cmd: |
set -o pipefail
kubectl get pods -n {{ telemetry_namespace }} -l app=idrac-telemetry \
--no-headers \
-o custom-columns=NAME:.metadata.name,PHASE:.status.phase,CONTAINER_STATUSES:.status.containerStatuses[*].state.waiting.reason \
2>/dev/null || true
delegate_to: "{{ kube_vip }}"
connection: ssh
register: idrac_pod_status_check
@@ -170,14 +178,14 @@
- name: Display idrac-telemetry pod status
ansible.builtin.debug:
msg: "idrac-telemetry pod status: {{ idrac_pod_status_check.stdout }}"
when: idrac_pod_status_check.stdout != ""
when: idrac_pod_status_check.stdout | default('') != ""

- name: Fail if idrac-telemetry MySQL container is in CrashLoopBackOff
ansible.builtin.fail:
msg: "{{ mysql_crash_error_msg }}"
when:
- idrac_pod_status_check.stdout != ""
- "'CrashLoopBackOff' in idrac_pod_status_check.stdout or 'Error' in idrac_pod_status_check.stdout"
- idrac_pod_status_check.stdout | default('') != ""
- "'CrashLoopBackOff' in idrac_pod_status_check.stdout"

- name: Generate telemetry pod status report
ansible.builtin.command:
@@ -205,25 +213,24 @@

- name: Fail if some pods are not ready
ansible.builtin.fail:
msg: >-
{{ pods_not_ready_msg }}
Review the pod status report above and check pod logs for errors:
kubectl logs -n {{ telemetry_namespace }} <pod-name>
msg: "{{ pods_not_ready_detailed_msg }}"
when: pods_not_ready.stdout | int > 0

- name: Display telemetry.sh success
ansible.builtin.debug:
msg: "{{ telemetry_sh_success_msg }}"

rescue:
- name: Display telemetry.sh failure details
- name: Display actual failing task details
ansible.builtin.debug:
msg:
- "{{ telemetry_sh_fail_msg }}"
- "stdout: {{ telemetry_sh_result.stdout | default('N/A') }}"
- "stderr: {{ telemetry_sh_result.stderr | default('N/A') }}"
- "rc: {{ telemetry_sh_result.rc | default('N/A') }}"
- "Telemetry deployment failed during post-deployment validation."
- "Failed task: {{ ansible_failed_task.name | default('unknown') }}"
- "Failure reason: {{ ansible_failed_result.msg | default(ansible_failed_result.stderr | default('unknown')) }}"
- "telemetry.sh rc: {{ telemetry_sh_result.rc | default('N/A') }}"

- name: Fail the telemetry upgrade
ansible.builtin.fail:
msg: "Telemetry deployment failed. See error details above."
msg: >-
Telemetry deployment failed at task '{{ ansible_failed_task.name | default('unknown') }}':
{{ ansible_failed_result.msg | default(ansible_failed_result.stderr | default('See error details above.')) }}
14 changes: 14 additions & 0 deletions upgrade/roles/upgrade_telemetry/vars/main.yml
@@ -111,7 +111,21 @@ telemetry_sh_helm_skip_msg: >-
telemetry.sh had Helm errors (existing releases), but StatefulSet was already patched
with terminationGracePeriodSeconds=120s for MySQL safety. Skipping full re-deployment.
Image tags should be updated via kustomize if needed.
telemetry_sh_missing_msg: >-
telemetry.sh not found at {{ telemetry_script_path }} on kube_vip.
Ensure the provision playbook has generated the telemetry deployment script.
kustomization_missing_msg: >-
kustomization.yaml not found at {{ telemetry_kustomization_dir }}/kustomization.yaml on kube_vip.
Ensure the provision playbook has generated the kustomize deployment files.
telemetry_sh_non_helm_error_msg: "telemetry.sh failed with non-Helm error: {{ telemetry_sh_result.stderr }}"
idrac_rollout_timeout_warning_msg: >-
WARNING: idrac-telemetry rollout did not complete within 300s (rc={{ rollout_scale_result.rc }}).
Continuing with pod readiness check.
pods_not_ready_msg: "Some telemetry pods are not ready after deployment."
pods_not_ready_detailed_msg: >-
{{ pods_not_ready_msg }}
Review the pod status report above and check pod logs for errors:
kubectl logs -n {{ telemetry_namespace }} <pod-name>
mysql_crash_error_msg: |
ERROR: idrac-telemetry MySQL container failed to start after graceful shutdown.
Manual intervention required: