9 changes: 5 additions & 4 deletions concepts/AI-monitoring-user-apps.adoc
@@ -66,11 +66,12 @@ Number of requests.
*Unit:* integer

gen_ai.usage.cost::
The distribution of GenAI request costs.
The distribution of GenAI request costs.
This is a non-standard metric, available only when explicitly set by the application or when the application is instrumented with {openlit}.
+
*Type:* histogram
*Type:* histogram
+
*Unit:* USD
*Unit:* USD

gen_ai.usage.input_tokens::
Number of prompt tokens processed.
Expand Down Expand Up @@ -107,7 +108,7 @@ No metrics received from any components. ::

No metrics received from the GPU. ::
+
* Verify if the RBAC rules were applied.
* Verify that the `clusterRole` configuration is included in the `otel-values.yaml` file and that the collector has been installed or upgraded with it.
* Verify if the metrics receiver scraper is configured.
* Check the {nvidia} DCGM Exporter for errors.
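
The RBAC check in the list above can be sketched with `kubectl auth can-i`. This is a minimal sketch, not the documented procedure: it assumes that the collector needs the `list` and `watch` verbs on `services` and `endpoints` (the verbs used in the earlier `Role` definition), and the function name `check_scrape_rbac` and all argument values are hypothetical placeholders.

```shell
# Hypothetical helper: report whether a service account may list and watch
# services and endpoints in the namespace that exposes the GPU metrics.
# Arguments: <target namespace> <service account namespace> <service account name>
check_scrape_rbac() {
  local target_ns="$1" sa_ns="$2" sa_name="$3"
  # Kubernetes addresses service accounts as system:serviceaccount:<ns>:<name>.
  local subject="system:serviceaccount:${sa_ns}:${sa_name}"
  local verb resource
  for verb in list watch; do
    for resource in services endpoints; do
      printf '%s %s: ' "$verb" "$resource"
      # Prints "yes" or "no" for the impersonated service account.
      kubectl auth can-i "$verb" "$resource" \
        --namespace "$target_ns" --as "$subject"
    done
  done
}
```

For example, `check_scrape_rbac gpu-operator OBSERVABILITY OPENTELEMETRY-COLLECTOR` runs the four checks; replace the last two placeholders with the namespace and service account name of your collector. Any `no` answer suggests that the `clusterRole` configuration was not applied.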

51 changes: 5 additions & 46 deletions tasks/AI-monitoring-gpu.adoc
@@ -7,56 +7,15 @@ To effectively monitor the performance and utilization of your GPUs, configure t

[#ai-monitoring-gpu-metrics]
.Collect GPU metrics (recommended)
. *Grant permissions (RBAC).* The {otelemetry} Collector requires specific permissions to discover the GPU metrics endpoint within the cluster.
+
Create a file named `otel-rbac.yaml`
with the following content.
It defines a `Role` with permissions to get services and endpoints, and a `RoleBinding` to grant these permissions to the {otelemetry} Collector's service account.
. *Verify RBAC permissions.* The {otelemetry} Collector requires specific permissions to discover the GPU metrics endpoint within the cluster.
These permissions are automatically configured when you install the collector with the `clusterRole` section in the `otel-values.yaml` file (see xref:observability-settingup-ai.adoc[]).
+
----
---
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
name: suse-observability-otel-scraper
rules:
- apiGroups:
- ""
resources:
- services
- endpoints
verbs:
- list
- watch

---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
name: suse-observability-otel-scraper
roleRef:
apiGroup: rbac.authorization.k8s.io
kind: Role
name: suse-observability-otel-scraper
subjects:
- kind: ServiceAccount
name: OPENTELEMETRY-COLLECTOR
namespace: OBSERVABILITY
---
----
+
[IMPORTANT]
[NOTE]
====
Verify that the `ServiceAccount` name and namespace in the `RoleBinding` match your {otelemetry} Collector's deployment.
If you installed the {otelemetry} Collector without the `clusterRole` configuration, you must upgrade the collector with the updated `otel-values.yaml` that includes the `clusterRole` section.
====
+
. Apply this configuration to the `gpu-operator` namespace.
+
[source,bash]
----
> kubectl apply -n gpu-operator -f otel-rbac.yaml
----
. *Configure the {otelemetry} Collector.* Add the following Prometheus receiver configuration to your {otelemetry} Collector's values file. This tells the collector to scrape metrics from any endpoint in the `gpu-operator` namespace every 10 seconds.
. *Configure the {otelemetry} Collector.* Add the following Prometheus receiver configuration to your {otelemetry} Collector's values file. This tells the collector to scrape metrics from any endpoint in the `gpu-operator` namespace every 10 seconds.
+
[source,yaml]
----
6 changes: 3 additions & 3 deletions tasks/AI-monitoring-owui.adoc
@@ -19,13 +19,13 @@ pipelines:
storageClass: longhorn <.>
extraEnvVars: <.>
- name: PIPELINES_URLS <.>
value: "https://raw.githubusercontent.com/SUSE/suse-ai-observability-extension/refs/heads/main/integrations/oi-filter/suse_ai_filter.py"
value: "https://raw.githubusercontent.com/SUSE/suse-ai-observability-extension/refs/tags/v2.0.0/integrations/oi-filter/suse_ai_filter.py"
- name: OTEL_SERVICE_NAME <.>
value: "Open WebUI"
- name: OTEL_EXPORTER_HTTP_OTLP_ENDPOINT <.>
value: "http://opentelemetry-collector.suse-observability.svc.cluster.local:4318"
- name: PRICING_JSON <.>
value: "https://raw.githubusercontent.com/SUSE/suse-ai-observability-extension/refs/heads/main/integrations/oi-filter/pricing.json"
value: "https://raw.githubusercontent.com/SUSE/suse-ai-observability-extension/refs/tags/v2.0.0/integrations/oi-filter/pricing.json"
extraEnvVars:
- name: OPENAI_API_KEY <.>
value: "0p3n-w3bu!"
Expand Down Expand Up @@ -102,7 +102,7 @@ include::../snippets/openwebui-requirement-admin-privileges.adoc[]

. In the bottom left of the {owui} window, click your avatar icon to open the user menu and select menu:Admin Panel[].
. Click the menu:Settings[] tab and select menu:Pipelines[] from the left menu.
. In the menu:Install from Github URL[] section, enter `https://raw.githubusercontent.com/SUSE/suse-ai-observability-extension/refs/heads/main/integrations/oi-filter/suse_ai_filter.py` and click the upload button on the right to upload the pipeline from the URL.
. In the menu:Install from Github URL[] section, enter `https://raw.githubusercontent.com/SUSE/suse-ai-observability-extension/refs/tags/v2.0.0/integrations/oi-filter/suse_ai_filter.py` and click the upload button on the right to upload the pipeline from the URL.
. After the upload is finished, you can review the configuration of the pipeline. Confirm with menu:Save[].
+
[#fig-ai-monitoring-owui-pipelines-webui]