Commit 7dd0fb8

Merge pull request #41 from robusta-dev/more-alert-enrichers-and-docs
More alert enrichers and docs
2 parents 4013f3a + e1e1b17 commit 7dd0fb8

4 files changed

Lines changed: 195 additions & 95 deletions

File tree

docs/images/graph-enricher.png

58.3 KB

docs/user-guide/alerts.rst

Lines changed: 156 additions & 91 deletions
@@ -4,7 +4,7 @@ Prometheus Alert Enrichment
44
##################################
55

66
Introduction
7-
^^^^^^^^^^^^^^^
7+
--------------
88
Robusta has special features for handling Prometheus alerts in Kubernetes clusters including:
99

1010
1. **Enrichers:** playbooks that enrich alerts with extra information based on the alert type
@@ -17,38 +17,34 @@ These features are still in beta and therefore have been implemented differently
1717
of operation, you configure a root ``alerts_integration`` playbook in ``active_playbooks.yaml`` and then add special enrichment
1818
and silencer playbooks underneath that playbook. In the future, this functionality will likely be merged into regular playbooks.
1919

20-
Setup and configuration
21-
^^^^^^^^^^^^^^^^^^^^^^^^^^
20+
Configure Robusta
21+
---------------------------------
2222

23-
Configure Prometheus AlertManager
24-
----------------------------------
25-
Before you can enrich prometheus alerts, you must forward Prometheus alerts to Robusta by adding a webhook receiver to AlertsManager.
26-
See :ref:`Setting up the webhook` for details.
23+
.. admonition:: Configure Prometheus AlertManager
2724

28-
Configure Robusta
29-
------------------------------
30-
Lets look at the simplest possible ``active_playbooks.yaml`` which instructs Robusta to forward Prometheus alerts to Slack without any enrichment:
25+
Before you can enrich Prometheus alerts, you must forward them to Robusta by adding a webhook receiver to AlertManager.
3126

32-
| **Enabling it:**
27+
See :ref:`Setting up the webhook` for details.
28+
29+
30+
Let's look at the simplest possible ``active_playbooks.yaml``, which instructs Robusta to forward Prometheus alerts to Slack without any enrichment:
3331

3432
.. code-block:: yaml
3533
3634
active_playbooks:
3735
- name: "alerts_integration"
3836
3937
The above configuration isn't very useful because we haven't enriched any alerts yet.
40-
However, we do get a minor aesthetic benefit because Robusta adds pretty formatting to alerts as you can see below:
38+
However, Robusta still sends default information for every alert, as you can see below.
4139

4240
.. image:: /images/default-slack-enrichment.png
4341
:width: 30 %
4442
:align: center
4543

4644
Adding an Enricher
47-
-------------------
45+
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
4846
Now let's add an enricher to ``active_playbooks.yaml`` which enriches the ``HostHighCPULoad`` alert:
4947

50-
| **Enabling it:**
51-
5248
.. code-block:: yaml
5349
5450
active_playbooks:
@@ -78,7 +74,7 @@ Therefore, in the above example, we explicitly added back the ``AlertDefaults``
7874
Make sure to check out the full list of enrichers to see what you can add.
7975

8076
Setting the default enricher
81-
------------------------------
77+
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
8278

8379
You can change the default enricher(s) for all alerts using the ``default_enrichers`` parameter.
8480

@@ -91,10 +87,8 @@ You can change the default enricher(s) for all alerts using the ``default_enrich
9187
- name: "AlertDefaults"
9288
9389
Adding a Silencer
94-
-----------------
95-
Now lets look at an example ``active_playbooks.yaml`` which silences KubePodCrashLooping alerts in the first ten minutes after a node (re)starts:
96-
97-
| **Enabling it:**
90+
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
91+
Let's silence ``KubePodCrashLooping`` alerts in the first ten minutes after a node (re)starts:
9892

9993
.. code-block:: yaml
10094
@@ -109,8 +103,8 @@ Now lets look at an example ``active_playbooks.yaml`` which silences KubePodCras
109103
post_restart_silence: 600 # seconds
110104
111105
Full example
112-
----------------
113-
Here is an example which shows all the features discussed above working together:
106+
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
107+
Here are all the above features working together:
114108

115109
.. code-block:: yaml
116110
@@ -133,108 +127,179 @@ Here is an example which shows all the features discussed above working together
133127
params:
134128
post_restart_silence: 600 # seconds
135129
136-
Available enrichers
137-
^^^^^^^^^^^^^^^^^^^^^^^^^^
130+
Available Enrichers
131+
-----------------------
138132

139-
**AlertDefaults:** send the alert message and labels to Slack
133+
AlertDefaults
134+
^^^^^^^^^^^^^^^^
135+
Send the alert message and labels to Slack
140136

141-
**NodeCPUAnalysis:** provide deep analysis of node cpu usage
142-
This enricher use ``prometheus``. The ``prometheus`` url can be overriden in the ``global_config`` section.
143-
For example - ``prometheus_url: "http://prometheus-k8s.monitoring.svc.cluster.local:9090"``
137+
NodeCPUAnalysis
138+
^^^^^^^^^^^^^^^^^^^^^
139+
Provide analysis of node CPU usage.
144140

145-
**OOMKillerEnricher:** shows which pods were recently OOM Killed on a node
141+
.. note::
142+
This enricher uses Prometheus. The Prometheus URL can be overridden in the ``global_config`` section.
146143

147-
**GraphEnricher:** display a graph of the Prometheus query which triggered the alert
148-
This enricher use ``prometheus``. The ``prometheus`` url can be overriden in the ``global_config`` section.
149-
For example - ``prometheus_url: "http://prometheus-k8s.monitoring.svc.cluster.local:9090"``
144+
For example: ``prometheus_url: "http://prometheus-k8s.monitoring.svc.cluster.local:9090"``
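
As a rough sketch of the override described above, the snippet below places ``prometheus_url`` under ``global_config``. It assumes that ``global_config`` is a top-level section of ``active_playbooks.yaml``, which this page does not spell out; the URL is the one quoted in the example above.

.. code-block:: yaml

    # Assumption: global_config sits at the top level of active_playbooks.yaml
    global_config:
      prometheus_url: "http://prometheus-k8s.monitoring.svc.cluster.local:9090"
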
150145

151-
**StackOverflowEnricher:** add a button in Slack to search for the alert name on StackOverflow
146+
GraphEnricher
147+
^^^^^^^^^^^^^^^^^^^^^
148+
Display a graph of the Prometheus query which triggered the alert.
152149

153-
**NodeRunningPodsEnricher:** add a list of the pods running on the node, with the pod Ready status
150+
See the note under NodeCPUAnalysis regarding the ``prometheus_url`` parameter.
154151

155-
.. image:: /images/node-running-pods.png
156-
:width: 80 %
157-
:align: center
152+
.. admonition:: Example
158153

159-
**NodeAllocatableResourcesEnricher:** add the allocatable resources available on the node
154+
.. image:: /images/graph-enricher.png
155+
:width: 50 %
156+
:align: center
160157

161-
.. image:: /images/node-allocatable-resources.png
162-
:width: 80 %
163-
:align: center
158+
TemplateEnricher
159+
^^^^^^^^^^^^^^^^^^^^^
160+
Add a paragraph to the alert's description containing templated markdown. You can inject any of the alert's Prometheus labels into the markdown.
164161

165-
**DaemonsetEnricher:** for daemonset related alerts, adds details about the daemonset status
162+
A variable like ``$foo`` will be replaced by the value of the Prometheus label ``foo``. If a label isn't present then the text "<missing>" will be used instead.
166163

167-
.. image:: /images/daemonset-enricher.png
168-
:width: 80 %
169-
:align: center
164+
Common variables to use are ``$alertname``, ``$deployment``, ``$namespace``, and ``$node``.
170165

171-
**DaemonsetMisscheduledAnalysis:** analyze the known Prometheus alert ``KubernetesDaemonsetMisscheduled`` and provide
172-
actionable advice on how to fix it. This enricher **only** displays output when it can verify that the alert is a false
173-
positive.
166+
The template can include all markdown directives supported by Slack. Note that Slack markdown links use a different format than GitHub.
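
To illustrate the difference in link syntax, here is a sketch of a template that combines label variables with a Slack-style link, which is written ``<url|text>`` rather than ``[text](url)`` as on GitHub. The wording of the template and the ``example.com`` URL are hypothetical placeholders, and ``(...)`` stands for the rest of the configuration, as in the other examples on this page.

.. code-block:: yaml

    active_playbooks:
      (...)
      - alert_name: "ContainerVolumeUsage"
        enrichers:
        - name: "TemplateEnricher"
          params:
            # Slack link format: <url|link text>; the URL below is a placeholder
            template: "Alert $alertname fired in namespace $namespace - see the <https://example.com/runbook|runbook>"

If the alert has no ``namespace`` label, that variable would be rendered as "<missing>", as described above.
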
174167

175-
.. image:: /images/daemonset-misscheduled.png
168+
.. admonition:: Example
176169

177-
**PodBashEnricher:** runs the specified bash command, on the **pod** associated with the alert
170+
.. code-block:: yaml
178171
179-
| **Note:** The bash command must be installed on the target pod
172+
active_playbooks:
173+
(...)
174+
- alert_name: "ContainerVolumeUsage"
175+
enrichers:
176+
- name: "TemplateEnricher"
177+
params:
178+
template: "The alertname is $alertname and the pod is $pod"
180179
181-
| **Example Usage:**
180+
LogsEnricher
181+
^^^^^^^^^^^^^^^^^^^^^
182+
Fetch logs related to the alert and attach them to the alert as a file.
182183

183-
.. code-block:: yaml
184+
The pod to fetch logs for is determined by the alert's ``pod`` label from Prometheus.
184185

185-
active_playbooks:
186-
(...)
187-
- alert_name: "ContainerVolumeUsage"
188-
enrichers:
189-
- name: "PodBashEnricher"
190-
params:
191-
bash_command: "df -h"
186+
By default, if the alert has no label named ``pod``, this enricher will silently do nothing. To show an explicit error, set the ``warn_on_missing_label`` parameter to ``true``.
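
A minimal sketch of enabling that explicit error is shown below. It assumes ``warn_on_missing_label`` is passed under ``params`` in the same way as the parameters of the other enrichers on this page; the choice of the ``KubePodCrashLooping`` alert is only illustrative.

.. code-block:: yaml

    active_playbooks:
      (...)
      - alert_name: "KubePodCrashLooping"
        enrichers:
        - name: "LogsEnricher"
          params:
            # Assumption: set under params, like bash_command or template elsewhere
            warn_on_missing_label: true
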
192187

193-
| **The results:**
188+
OOMKillerEnricher
189+
^^^^^^^^^^^^^^^^^^^^^
190+
Shows which pods were recently OOM-killed on a node.
194191

195-
.. image:: /images/disk-usage.png
196-
:width: 80 %
197-
:align: center
192+
StackOverflowEnricher
193+
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
194+
Add a button in Slack to search for the alert name on StackOverflow
198195

199-
**NodeBashEnricher:** runs the specified bash command, on the **node** associated with the alert
196+
NodeRunningPodsEnricher
197+
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
198+
Add a list of the pods running on the node, with the pod Ready status
200199

201-
| **Example Usage:**
200+
.. admonition:: Example
202201

203-
.. code-block:: yaml
202+
.. image:: /images/node-running-pods.png
203+
:width: 80 %
204+
:align: center
204205

205-
active_playbooks:
206-
(...)
207-
- alert_name: "HostOutOfDiskSpace"
208-
enrichers:
209-
- name: "NodeBashEnricher"
210-
params:
211-
bash_command: "df -h"
206+
NodeAllocatableResourcesEnricher
207+
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
208+
Add the allocatable resources available on the node
212209

213-
**DeploymentStatusEnricher:** adds deployment condition statuses
210+
.. admonition:: Example
214211

215-
| **Example Usage:**
212+
.. image:: /images/node-allocatable-resources.png
213+
:width: 80 %
214+
:align: center
216215

217-
.. code-block:: yaml
216+
DaemonsetEnricher
217+
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
218+
For daemonset-related alerts, adds details about the daemonset status.
218219

219-
active_playbooks:
220-
(...)
221-
- alert_name: "KubernetesDeploymentReplicasMismatch"
222-
enrichers:
223-
- name: "DeploymentStatusEnricher"
220+
.. admonition:: Example
224221

225-
| **The results:**
222+
.. image:: /images/daemonset-enricher.png
223+
:width: 80 %
224+
:align: center
226225

227-
.. image:: /images/deployment-status-details.png
228-
:width: 100 %
229-
:align: center
226+
DaemonsetMisscheduledAnalysis
227+
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
228+
Analyze the known Prometheus alert ``KubernetesDaemonsetMisscheduled`` and provide actionable advice on how to fix it.
229+
This enricher **only** displays output when it can verify that the alert is a false positive.
230+
231+
.. admonition:: Example
232+
233+
.. image:: /images/daemonset-misscheduled.png
234+
235+
PodBashEnricher
236+
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
237+
Runs the specified bash command on the **pod** associated with the alert. The bash command must already be installed in the target pod.
238+
239+
.. admonition:: Example
240+
241+
.. code-block:: yaml
242+
243+
active_playbooks:
244+
(...)
245+
- alert_name: "ContainerVolumeUsage"
246+
enrichers:
247+
- name: "PodBashEnricher"
248+
params:
249+
bash_command: "df -h"
250+
251+
.. image:: /images/disk-usage.png
252+
:width: 80 %
253+
:align: center
254+
255+
NodeBashEnricher
256+
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
257+
Runs the specified bash command on the **node** associated with the alert.
258+
259+
.. admonition:: Example
260+
261+
.. code-block:: yaml
262+
263+
active_playbooks:
264+
(...)
265+
- alert_name: "HostOutOfDiskSpace"
266+
enrichers:
267+
- name: "NodeBashEnricher"
268+
params:
269+
bash_command: "df -h"
270+
271+
272+
DeploymentStatusEnricher
273+
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
274+
Adds deployment condition statuses
275+
276+
.. admonition:: Example
277+
278+
.. code-block:: yaml
279+
280+
active_playbooks:
281+
(...)
282+
- alert_name: "KubernetesDeploymentReplicasMismatch"
283+
enrichers:
284+
- name: "DeploymentStatusEnricher"
285+
286+
.. image:: /images/deployment-status-details.png
287+
:width: 100 %
288+
:align: center
230289

231290
Available Silencers
232-
^^^^^^^^^^^^^^^^^^^^^^^^^^
291+
-----------------------
292+
293+
NodeRestartSilencer
294+
^^^^^^^^^^^^^^^^^^^^^^^^^
295+
After a node is restarted, silence alerts for pods running on it.
296+
297+
.. admonition:: Parameters
233298

234-
**NodeRestartSilencer:** After a node is restarted, silence alerts for pods running on it.
235-
| params: post_restart_silence, (seconds), default to 300
299+
**post_restart_silence**: length of the silencing period in seconds; defaults to 300
236300

237301

238-
**DaemonsetMisscheduledSmartSilencer:** Silence the Prometheus alert ``KubernetesDaemonsetMisscheduled`` under
239-
conditions matching a known false alarm
302+
DaemonsetMisscheduledSmartSilencer
303+
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
304+
Silence the Prometheus alert ``KubernetesDaemonsetMisscheduled`` under conditions matching a known false alarm.
240305
