diff --git a/inference/a3mega/llama3.1-70b/trtllm-gke/README.md b/inference/a3mega/llama3.1-70b/trtllm-gke/README.md new file mode 100644 index 00000000..dbd7d53d --- /dev/null +++ b/inference/a3mega/llama3.1-70b/trtllm-gke/README.md @@ -0,0 +1,386 @@ +# Single Host Model Serving with NVIDIA TensorRT-LLM (TRT-LLM) on A3mega GKE Node Pool + +This document outlines the steps to serve and benchmark various Large Language Models (LLMs) using the [NVIDIA TensorRT-LLM](https://github.com/NVIDIA/TensorRT-LLM) framework on a single [A3-Mega GKE Node pool](https://cloud.google.com/kubernetes-engine). + +This guide walks you through setting up the necessary cloud infrastructure, configuring your environment, and deploying a high-performance LLM for inference. + + +## Table of Contents + +* [1. Test Environment](#test-environment) +* [2. High-Level Architecture](#architecture) +* [3. Environment Setup (One-Time)](#environment-setup) + * [3.1. Clone the Repository](#clone-repo) + * [3.2. Configure Environment Variables](#configure-vars) + * [3.3. Connect to your GKE Cluster](#connect-cluster) + * [3.4. Get Hugging Face Token](#get-hf-token) + * [3.5. Create Hugging Face Kubernetes Secret](#setup-hf-secret) +* [4. Run the Recipe](#run-the-recipe) + * [4.1. Supported Models](#supported-models) + * [4.2. Deploy and Benchmark a Model](#deploy-model) +* [5. Monitoring and Troubleshooting](#monitoring) + * [5.1. Check Deployment Status](#check-status) + * [5.2. View Logs](#view-logs) +* [6. Cleanup](#cleanup) + + +## 1. Test Environment + +[Back to Top](#table-of-contents) + +The recipe uses the following setup: + +* **Orchestration**: [Google Kubernetes Engine (GKE)](https://cloud.google.com/kubernetes-engine) +* **Deployment Configuration**: A [Helm chart](https://helm.sh/) is used to configure and deploy a [Kubernetes Deployment](https://kubernetes.io/docs/concepts/workloads/controllers/deployment/). This deployment encapsulates the inference of the target LLM using the TensorRT-LLM framework. + +This recipe has been optimized for and tested with the following configuration: + +* **GKE Cluster**: + * A [regional standard cluster](https://cloud.google.com/kubernetes-engine/docs/concepts/configuration-overview) version: `1.33.4-gke.1036000` or later. + * A GPU node pool with 1 [a3-megagpu-8g](https://docs.cloud.google.com/compute/docs/accelerator-optimized-machines#a3-mega-vms) machine. + * [Workload Identity Federation for GKE](https://cloud.google.com/kubernetes-engine/docs/concepts/workload-identity) enabled. + * [Cloud Storage FUSE CSI driver for GKE](https://cloud.google.com/kubernetes-engine/docs/concepts/cloud-storage-fuse-csi-driver) enabled. + * [DCGM metrics](https://cloud.google.com/kubernetes-engine/docs/how-to/dcgm-metrics) enabled. + * [Kueue](https://kueue.sigs.k8s.io/docs/reference/kueue.v1beta1/) and [JobSet](https://jobset.sigs.k8s.io/docs/overview/) APIs installed. + * Kueue configured to support [Topology Aware Scheduling](https://kueue.sigs.k8s.io/docs/concepts/topology_aware_scheduling/). +* A regional Google Cloud Storage (GCS) bucket to store logs generated by the recipe runs. + +> [!IMPORTANT] +> To prepare the required environment, see the [GKE environment setup guide](../../../../docs/configuring-environment-gke-a3-mega.md). +> Provisioning a new GKE cluster is a long-running operation and can take **20-30 minutes**. + + +## 2. 
High-Level Flow + +[Back to Top](#table-of-contents) + +Here is a simplified diagram of the flow that we follow in this recipe: + +```mermaid +--- +config: + layout: dagre +--- +flowchart TD + subgraph workstation["Client Workstation"] + T["Cluster Toolkit"] + B("Kubernetes API") + A["helm install"] + end + subgraph huggingface["Hugging Face Hub"] + I["Model Weights"] + end + subgraph gke["GKE Cluster (A3-Mega)"] + C["Deployment"] + D["Pod"] + E["TensorRT-LLM container"] + F["Service"] + end + subgraph storage["Cloud Storage"] + J["Bucket"] + end + + %% Logical/actual flow + T -- Create Cluster --> gke + A --> B + B --> C & F + C --> D + D --> E + F --> C + E -- Downloads at runtime --> I + E -- Write logs --> J + + + %% Layout control + gke +``` + +* **helm:** A package manager for Kubernetes to define, install, and upgrade applications. It's used here to configure and deploy the Kubernetes Deployment. +* **Deployment:** Manages the lifecycle of your model server pod, ensuring it stays running. +* **Service:** Provides a stable network endpoint (a DNS name and IP address) to access your model server. +* **Pod:** The smallest deployable unit in Kubernetes. The Triton server container with TensorRT-LLM runs inside this pod on a GPU-enabled node. +* **Cloud Storage:** A Cloud Storage bucket to store benchmark logs and other artifacts. + + +## 3. Environment Setup (One-Time) + +[Back to Top](#table-of-contents) + +First, you'll configure your local environment. These steps are required once before you can deploy any models. + + +### 3.1. Clone the Repository + +```bash +git clone https://github.com/ai-hypercomputer/gpu-recipes.git +cd gpu-recipes +export REPO_ROOT=$(pwd) +export RECIPE_ROOT=$REPO_ROOT/inference/a3mega/llama3.1-70b/trtllm-gke +``` + + +### 3.2. Configure Environment Variables + +This is the most critical step. These variables are used in subsequent commands to target the correct resources. + +```bash +export PROJECT_ID= +export CLUSTER_REGION= +export CLUSTER_NAME= +export KUEUE_NAME= +export GCS_BUCKET= +export TRTLLM_VERSION=1.3.0rc3 + +# Set the project for gcloud commands +gcloud config set project $PROJECT_ID +``` + +Replace the following values: + +| Variable | Description | Example | +| --------------------- | ------------------------------------------------------------------------------------------------------- | ------------------------------------------------------- | +| `PROJECT_ID` | Your Google Cloud Project ID. | `gcp-project-12345` | +| `CLUSTER_REGION` | The GCP region where your GKE cluster is located. | `us-central1` | +| `CLUSTER_NAME` | The name of your GKE cluster. | `a3-mega` | +| `KUEUE_NAME` | The name of the Kueue local queue. The default queue created by the cluster toolkit is `a3mega`. Verify the name in your cluster. | `a3mega` | +| `ARTIFACT_REGISTRY` | Full path to your Artifact Registry repository. | `us-central1-docker.pkg.dev/gcp-project-12345/my-repo` | +| `GCS_BUCKET` | Name of your GCS bucket (do not include `gs://`). | `my-benchmark-logs-bucket` | +| `TRTLLM_VERSION` | The tag/version for the Docker image. Other verions can be found at https://catalog.ngc.nvidia.com/orgs/nvidia/teams/tensorrt-llm/containers/release | `1.3.0rc3` | + + + +### 3.3. Connect to your GKE Cluster + +Fetch credentials for `kubectl` to communicate with your cluster. + +```bash +gcloud container clusters get-credentials $CLUSTER_NAME --region $CLUSTER_REGION +``` + + +### 3.4. 
Get Hugging Face token + +To access models through Hugging Face, you'll need a Hugging Face token. + 1. Create a [Hugging Face account](https://huggingface.co/) if you don't have one. + 2. For **gated models** like Llama 4, ensure you have requested and been granted access on Hugging Face before proceeding. + 3. Generate an Access Token: Go to **Your Profile > Settings > Access Tokens**. + 4. Select **New Token**. + 5. Specify a Name and a Role of at least `Read`. + 6. Select **Generate a token**. + 7. Copy the generated token to your clipboard. You'll use this later. + + + +### 3.5. Create Hugging Face Kubernetes Secret + +Create a Kubernetes Secret with your Hugging Face token to enable the pod to download model checkpoints from Hugging Face. + +```bash +# Paste your Hugging Face token here +export HF_TOKEN= + +kubectl create secret generic hf-secret \ +--from-literal=hf_api_token=${HF_TOKEN} \ +--dry-run=client -o yaml | kubectl apply -f - +``` + + +## 4. Run the recipe + +[Back to Top](#table-of-contents) + +> [!NOTE] +> After running the recipe with `helm install`, it can take **up to 30 minutes** for the deployment to become fully available. This is because the GKE node must first pull the Docker image and then download the model weights from Hugging Face. + +> [!TIP] +> You can use the [NVIDIA Model Optimizer](https://github.com/NVIDIA/Model-Optimizer/tree/main/examples/llm_ptq) to quantize these models to FP8 for improved performance. + + +### 4.1. Supported Models + +[Back to Top](#table-of-contents) + +This recipe supports the deployment of the following models: + +| Model Name | Hugging Face ID | Configuration File | Release Name Suffix | +| :--- | :--- | :--- | :--- | +| **Llama 3.1 70B** | `meta-llama/Llama-3.1-70B-Instruct` | `llama-3.1-70b.yaml` | `llama-3-1-70b` | + + +### 4.2. Deploy and Benchmark a Model + +[Back to Top](#table-of-contents) + +The recipe uses [`trtllm-bench`](https://github.com/NVIDIA/TensorRT-LLM/blob/main/docs/source/legacy/performance/perf-benchmarking.md), a command-line tool from NVIDIA to benchmark the performance of TensorRT-LLM engine. + +1. **Configure model-specific variables.** Choose a model from the [table above](#supported-models) and set the variables: + + ```bash + # Example for Llama 3.1 70B + export HF_MODEL_ID="meta-llama/Llama-3.1-70B-Instruct" + export CONFIG_FILE="llama-3.1-70b.yaml" + export RELEASE_NAME="$USER-serving-llama-3-1-70b" + ``` + +2. **Install the helm chart:** + + ```bash + cd $RECIPE_ROOT + helm install -f values.yaml \ + --set-file workload_launcher=$REPO_ROOT/src/launchers/trtllm-launcher.sh \ + --set-file serving_config=$REPO_ROOT/src/frameworks/a3mega/trtllm-configs/${CONFIG_FILE} \ + --set queue=${KUEUE_NAME} \ + --set "volumes.gcsMounts[0].bucketName=${GCS_BUCKET}" \ + --set workload.model.name=${HF_MODEL_ID} \ + --set workload.image=nvcr.io/nvidia/tensorrt-llm/release:${TRTLLM_VERSION} \ + --set workload.framework=trtllm \ + ${RELEASE_NAME} \ + $REPO_ROOT/src/helm-charts/a3mega/trtllm-inference/single-node + ``` + +3. **Check the deployment status:** + + ```bash + kubectl get deployment/${RELEASE_NAME} + ``` + + Wait until the `READY` column shows `1/1`. See the [Monitoring and Troubleshooting](#monitoring) section to view the deployment logs. + + +## 5. Monitoring and Troubleshooting + +[Back to Top](#table-of-contents) + +After the model is deployed via Helm as described in the sections [above](#run-the-recipe), use the following steps to monitor the deployment and interact with the model. 
Replace `` and `` with the appropriate names from the model-specific deployment instructions (e.g., `$USER-serving-llama3.1-70b` and `$USER-serving-llama3.1-70b-svc`). + + +### 5.1. Check Deployment Status + +Check the status of your deployment. Replace the name if you deployed a different model. + +```bash +# Example for Llama 3.1 70B +kubectl get deployment/$USER-serving-llama3.1-70b +``` + +Wait until the `READY` column shows `1/1`. If it shows `0/1`, the pod is still starting up. + +> [!NOTE] +> In the GKE UI on Cloud Console, you might see a status of "Does not have minimum availability" during startup. This is normal and will resolve once the pod is ready. + + +### 5.2. View Logs + +To see the logs from the TRTLLM server (useful for debugging), use the `-f` flag to follow the log stream: + +```bash +kubectl logs -f deployment/$USER-serving-llama3.1-70b +``` + +You should see logs indicating preparing the model, and then running the throughput benchmark test, similar to this: + +```bash +Running benchmark for nvidia/Llama3.1-70b with ISL=128, OSL=128, TP=8, EP=1, PP=1 + +=========================================================== += PYTORCH BACKEND +=========================================================== +Model: nvidia/Llama3.1-70b +Model Path: /ssd/nvidia/Llama3.1-70b +TensorRT LLM Version: 1.2 +Dtype: bfloat16 +KV Cache Dtype: FP8 +Quantization: FP8 + +=========================================================== += REQUEST DETAILS +=========================================================== +Number of requests: 1000 +Number of concurrent requests: 985.9849 +Average Input Length (tokens): 128.0000 +Average Output Length (tokens): 128.0000 +=========================================================== += WORLD + RUNTIME INFORMATION +=========================================================== +TP Size: 8 +PP Size: 1 +EP Size: 1 +Max Runtime Batch Size: 2304 +Max Runtime Tokens: 4608 +Scheduling Policy: GUARANTEED_NO_EVICT +KV Memory Percentage: 85.00% +Issue Rate (req/sec): 8.3913E+13 + +=========================================================== += PERFORMANCE OVERVIEW +=========================================================== +Request Throughput (req/sec): X.XX +Total Output Throughput (tokens/sec): X.XX +Total Token Throughput (tokens/sec): X.XX +Total Latency (ms): X.XX +Average request latency (ms): X.XX +Per User Output Throughput [w/ ctx] (tps/user): X.XX +Per GPU Output Throughput (tps/gpu): X.XX + +-- Request Latency Breakdown (ms) ----------------------- + +[Latency] P50 : X.XX +[Latency] P90 : X.XX +[Latency] P95 : X.XX +[Latency] P99 : X.XX +[Latency] MINIMUM: X.XX +[Latency] MAXIMUM: X.XX +[Latency] AVERAGE: X.XX + +=========================================================== += DATASET DETAILS +=========================================================== +Dataset Path: /ssd/token-norm-dist_llama3.1-70b_128_128_tp4.json +Number of Sequences: 1000 + +-- Percentiles statistics --------------------------------- + + Input Output Seq. Length +----------------------------------------------------------- +MIN: 128.0000 128.0000 256.0000 +MAX: 128.0000 128.0000 256.0000 +AVG: 128.0000 128.0000 256.0000 +P50: 128.0000 128.0000 256.0000 +P90: 128.0000 128.0000 256.0000 +P95: 128.0000 128.0000 256.0000 +P99: 128.0000 128.0000 256.0000 +=========================================================== +``` + + +## 6. Cleanup + +To avoid incurring further charges, clean up the resources you created. + +1. 
**Uninstall the Helm Release:** + + First, list your releases to get the deployed models: + + ```bash + # list deployed models + helm list --filter $USER-serving- + ``` + + Then, uninstall the desired release: + + ```bash + # uninstall the deployed model + helm uninstall + ``` + Replace `` with the helm release names listed. + +2. **Delete the Kubernetes Secret:** + + ```bash + kubectl delete secret hf-secret --ignore-not-found=true + ``` + +3. (Optional) Delete the built Docker image from Artifact Registry if no longer needed. +4. (Optional) Delete Cloud Build logs. +5. (Optional) Clean up files in your GCS bucket if benchmarking was performed. +6. (Optional) Delete the [test environment](#test-environment) provisioned including GKE cluster. \ No newline at end of file diff --git a/inference/a3mega/llama3.1-70b/trtllm-gke/values.yaml b/inference/a3mega/llama3.1-70b/trtllm-gke/values.yaml new file mode 100644 index 00000000..2649ebdd --- /dev/null +++ b/inference/a3mega/llama3.1-70b/trtllm-gke/values.yaml @@ -0,0 +1,64 @@ +# Copyright 2026 Google LLC +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +queue: + +dwsSettings: + maxRunDurationSeconds: + +huggingface: + secretName: hf-secret + secretData: + token: "hf_api_token" + +volumes: + gcsVolumes: true + ssdMountPath: "/ssd" + gcsMounts: + - bucketName: + mountPath: "/gcs" + +service: + type: ClusterIP + ports: + http: 8000 + +workload: + model: + name: + gpus: 8 + image: + framework: + configFile: serving-args.yaml + configPath: /workload/configs + envs: + - name: LAUNCHER_SCRIPT + value: "/workload/launcher/launch-workload.sh" + - name: HF_HUB_ENABLE_HF_TRANSFER + value: "0" + - name: SERVER_ARGS_FILE + value: "/workload/configs/serving-args.yaml" + benchmarks: + experiments: + - isl: 128 + osl: 128 + num_requests: 30000 + +network: + hostNetwork: true + subnetworks[]: + gibVersion: us-docker.pkg.dev/gce-ai-infra/gpudirect-gib/nccl-plugin-gib:v1.1.1 + ncclSettings: + - name: NCCL_DEBUG + value: "VERSION" diff --git a/src/frameworks/a3mega/trtllm-configs/llama3.1-70b.yaml b/src/frameworks/a3mega/trtllm-configs/llama3.1-70b.yaml new file mode 100644 index 00000000..8cfa6a4e --- /dev/null +++ b/src/frameworks/a3mega/trtllm-configs/llama3.1-70b.yaml @@ -0,0 +1,4 @@ +tp_size: 8 +pp_size: 1 +backend: pytorch +kv_cache_free_gpu_mem_fraction: 0.80 diff --git a/src/helm-charts/a3mega/inference-templates/deployment/Chart.yaml b/src/helm-charts/a3mega/inference-templates/deployment/Chart.yaml new file mode 100644 index 00000000..4f584cc4 --- /dev/null +++ b/src/helm-charts/a3mega/inference-templates/deployment/Chart.yaml @@ -0,0 +1,20 @@ +# Copyright 2025 Google LLC +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. 
+# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +apiVersion: v2 +name: single-host-serving-deployment-template +description: single-host-serving-deployment-template +type: application +version: 0.1.0 +appVersion: "1.16.0" diff --git a/src/helm-charts/a3mega/inference-templates/deployment/templates/serving-config-configmap.yaml b/src/helm-charts/a3mega/inference-templates/deployment/templates/serving-config-configmap.yaml new file mode 100644 index 00000000..a17bdf49 --- /dev/null +++ b/src/helm-charts/a3mega/inference-templates/deployment/templates/serving-config-configmap.yaml @@ -0,0 +1,25 @@ +# Copyright 2025 Google LLC +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +apiVersion: v1 +kind: ConfigMap +metadata: + name: "{{ .Release.Name }}-config" +data: + serving-configuration: |- +{{- if .Values.serving_config }} +{{ .Values.serving_config | nindent 4 }} +{{- else }} +{{ "config: null" | nindent 4 }} +{{- end }} \ No newline at end of file diff --git a/src/helm-charts/a3mega/inference-templates/deployment/templates/serving-launcher-configmap.yaml b/src/helm-charts/a3mega/inference-templates/deployment/templates/serving-launcher-configmap.yaml new file mode 100644 index 00000000..b111553b --- /dev/null +++ b/src/helm-charts/a3mega/inference-templates/deployment/templates/serving-launcher-configmap.yaml @@ -0,0 +1,27 @@ +# Copyright 2025 Google LLC +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
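+
+# Packages the workload launcher script supplied at install time via
+# `--set-file workload_launcher=...` (this recipe passes src/launchers/trtllm-launcher.sh).
+# The serving pod mounts it at /workload/launcher/launch-workload.sh and execs it
+# from the container start-up script.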
+ +apiVersion: v1 +kind: ConfigMap +metadata: + name: "{{ .Release.Name }}-launcher" +data: + launch-workload.sh: |- +{{- if .Values.workload_launcher }} +{{ .Values.workload_launcher | nindent 4 }} +{{- else }} + #!/bin/bash + echo "No workload launcher specified" + exit 1 +{{- end }} \ No newline at end of file diff --git a/src/helm-charts/a3mega/inference-templates/deployment/templates/serving-launcher.yaml b/src/helm-charts/a3mega/inference-templates/deployment/templates/serving-launcher.yaml new file mode 100644 index 00000000..23b7ddfd --- /dev/null +++ b/src/helm-charts/a3mega/inference-templates/deployment/templates/serving-launcher.yaml @@ -0,0 +1,265 @@ +# Copyright 2025 Google LLC +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +{{ $nodes := div .Values.workload.gpus 8 | max 1 }} +{{ $gpusPerNode := min .Values.workload.gpus 8 }} + +{{ $root := . }} + +apiVersion: apps/v1 +kind: Deployment +metadata: + name: "{{ .Release.Name }}" + namespace: default + labels: + app: {{ .Release.Name }}-serving + {{- if $root.Values.queue }} + kueue.x-k8s.io/queue-name: "{{ $root.Values.queue }}" + {{- end }} +spec: + replicas: {{ $nodes }} + selector: + matchLabels: + app: {{ .Release.Name }}-serving + template: + metadata: + labels: + app: {{ .Release.Name }}-serving + annotations: + kubectl.kubernetes.io/default-container: serving + {{- if $root.Values.volumes.gcsVolumes }} + gke-gcsfuse/volumes: "true" + gke-gcsfuse/cpu-limit: "0" + gke-gcsfuse/memory-limit: "0" + gke-gcsfuse/ephemeral-storage-limit: "0" + {{- end }} + {{- if and $root.Values.queue $root.Values.dwsSettings.maxRunDurationSeconds }} + provreq.kueue.x-k8s.io/maxRunDurationSeconds: "{{ $root.Values.dwsSettings.maxRunDurationSeconds }}" + {{- end }} + {{- if not $root.Values.network.hostNetwork }} + networking.gke.io/default-interface: "eth0" + networking.gke.io/interfaces: | + {{- if $root.Values.network.subnetworks }} + [ + {{- range $i, $subnetwork := $root.Values.network.subnetworks }} + {"interfaceName":"eth{{ $i }}","network":"{{ $subnetwork }}"}{{ eq $i 9 | ternary "" ","}} + {{- end }} + ] + {{- else }} + [ + {"interfaceName":"eth0","network":"default"} + ] + {{- end }} + {{- end }} + spec: + {{- if $root.Values.network.hostNetwork }} + hostNetwork: true + dnsPolicy: ClusterFirstWithHostNet + {{- end }} + subdomain: "{{.Release.Name}}" + restartPolicy: Always + {{- if $root.Values.targetNodes }} + affinity: + nodeAffinity: + requiredDuringSchedulingIgnoredDuringExecution: + nodeSelectorTerms: + - matchExpressions: + - key: kubernetes.io/hostname + operator: In + values: + {{ range $hostname := $root.Values.targetNodes }} + - {{ $hostname }} + {{ end }} + {{- end }} + tolerations: + - operator: "Exists" + key: nvidia.com/gpu + - operator: "Exists" + key: cloud.google.com/impending-node-termination + volumes: + {{- if $root.Values.network.gibVersion }} + - name: gib + emptyDir: {} + {{- end }} + - name: serving-configuration + configMap: + name: "{{.Release.Name}}-config" + items: + - key: serving-configuration + path: {{ 
$root.Values.workload.configFile | default "serving-args" }} + - name: serving-launcher + configMap: + name: "{{.Release.Name}}-launcher" + defaultMode: 0700 + - name: shared-memory + emptyDir: + medium: "Memory" + sizeLimit: 250Gi + {{- range $gcs := $root.Values.volumes.gcsMounts }} + - name: "{{ $gcs.bucketName }}" + csi: + driver: gcsfuse.csi.storage.gke.io + volumeAttributes: + bucketName: "{{ $gcs.bucketName }}" + {{- if $gcs.mountOptions }} + mountOptions: "{{ $gcs.mountOptions }}" + {{- end }} + {{- end }} + {{- if $root.Values.volumes.ssdMountPath }} + - name: local-ssd + hostPath: + path: /mnt/stateful_partition/kube-ephemeral-ssd + {{- end }} + + initContainers: + {{- if $root.Values.network.gibVersion }} + - name: nccl-plugin-installer + image: {{ $root.Values.network.gibVersion }} + imagePullPolicy: Always + args: + - | + set -ex + /scripts/container_entry.sh install --install-nccl + cp -R /var/lib/gib/lib64/. /target/usr/local/gib/lib64 + cp -R /var/lib/gib/. /target/usr/local/gib + command: + - /bin/sh + - -c + volumeMounts: + - mountPath: /target/usr/local/gib + name: gib + {{- end }} + + containers: + {{- if $root.Values.workload.gcsSidecarImage }} + - name: gke-gcsfuse-sidecar + image: {{ $root.Values.workload.gcsSidecarImage }} + - name: gke-gcsfuse-metadata-prefetch + image: {{ $root.Values.workload.gcsSidecarImage }} + {{- end }} + - name: serving + image: "{{ $root.Values.workload.image }}" + imagePullPolicy: Always + {{- if $root.Values.network.hostNetwork }} + securityContext: + privileged: true + {{- end }} + env: + - name: HF_TOKEN + valueFrom: + secretKeyRef: + name: "{{ $root.Values.huggingface.secretName }}" + key: "{{ $root.Values.huggingface.secretData.token }}" + # Pass NCCL settings to the container + {{- if $root.Values.network.ncclSettings }} + {{- toYaml .Values.network.ncclSettings | nindent 12 }} + {{- end }} + - name: NCCL_PLUGIN_PATH + value: /usr/local/gib/lib64 + - name: LD_LIBRARY_PATH + value: /usr/local/gib/lib64:/usr/local/nvidia/lib64 + {{- if $root.Values.network.gibVersion }} + - name: NCCL_INIT_SCRIPT + value: "/usr/local/gib/scripts/set_nccl_env.sh" + {{- end }} + # Workload specific environment variables + - name: MODEL_NAME + value: "{{ $root.Values.workload.model.name }}" + - name: MODEL_DOWNLOAD_DIR + value: "/ssd/{{ $root.Values.workload.model.name }}" + # A3-Ultra recipe is based on the TensorRT image, which puts tensorrt_llm in a different path than default + - name: TRTLLM_DIR + value: "/app/tensorrt_llm" + {{- if $root.Values.workload.envs }} + {{- toYaml .Values.workload.envs | nindent 12 }} + {{- end }} + + workingDir: /workload + command: ["/bin/bash", "-c"] + args: + - | + #!/bin/bash + + if [[ -n "${NCCL_INIT_SCRIPT}" ]]; then + echo "Running NCCL init script: ${NCCL_INIT_SCRIPT}" + source ${NCCL_INIT_SCRIPT} + env | grep NCCL + fi + + if [ ! -f "$LAUNCHER_SCRIPT" ]; then + echo "Error: Launcher script $LAUNCHER_SCRIPT not found!" 
+ exit 1 + fi + + ARGS=() + + if [ -f "$SERVER_ARGS_FILE" ]; then + echo "Loading server arguments from ConfigMap" + while IFS=': ' read -r key value || [ -n "$key" ]; do + [[ -z "$key" || "$key" == \#* ]] && continue + key=$(echo "$key" | xargs) + value=$(echo "$value" | xargs) + + if [ -n "$key" ]; then + # Handle boolean values + if [[ "$value" == "true" ]]; then + # For true values, just add the flag without a value + ARGS+=("--$key") + elif [[ "$value" == "false" ]]; then + ARGS+=("--$key" "false") + elif [ -n "$value" ]; then + # For non-boolean values, add both the flag and its value + ARGS+=("--$key" "$value") + else + ARGS+=("--$key") + fi + fi + done < "$SERVER_ARGS_FILE" + fi + + {{ if eq $root.Values.workload.framework "trtllm" }} + {{- range $root.Values.workload.benchmarks.experiments }} + echo "Running: $LAUNCHER_SCRIPT --model_name $MODEL_NAME --isl {{ .isl }} --osl {{ .osl }} --num_requests {{ .num_requests }} -- ${ARGS[@]} --max_batch_size 4096 --max_num_tokens 8192" + exec "$LAUNCHER_SCRIPT" --model_name $MODEL_NAME --isl {{ .isl }} --osl {{ .osl }} --num_requests {{ .num_requests }} -- "${ARGS[@]}" --max_batch_size 4096 --max_num_tokens 8192 + {{- end }} + {{ else }} + echo "Running: $LAUNCHER_SCRIPT ${ARGS[@]}" + exec "$LAUNCHER_SCRIPT" "${ARGS[@]}" + {{- end }} + + volumeMounts: + {{- if $root.Values.network.gibVersion }} + - name: gib + mountPath: /usr/local/gib + {{- end }} + - name: serving-configuration + mountPath: {{ $root.Values.workload.configPath | default "/workload/configs" }} + - name: serving-launcher + mountPath: /workload/launcher + - name: shared-memory + mountPath: /dev/shm + {{- range $gcs := $root.Values.volumes.gcsMounts }} + - name: "{{ $gcs.bucketName }}" + mountPath: "{{ $gcs.mountPath }}" + {{- end }} + {{- if $root.Values.volumes.ssdMountPath }} + - name: local-ssd + mountPath: "{{ $root.Values.volumes.ssdMountPath }}" + {{- end }} + + resources: + requests: + nvidia.com/gpu: {{ $gpusPerNode }} + limits: + nvidia.com/gpu: {{ $gpusPerNode }} \ No newline at end of file diff --git a/src/helm-charts/a3mega/inference-templates/deployment/templates/serving-svc.yaml b/src/helm-charts/a3mega/inference-templates/deployment/templates/serving-svc.yaml new file mode 100644 index 00000000..3d1363b9 --- /dev/null +++ b/src/helm-charts/a3mega/inference-templates/deployment/templates/serving-svc.yaml @@ -0,0 +1,26 @@ +# Copyright 2025 Google LLC +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
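+
+# Service for the single-host deployment: exposes the serving pod on the HTTP
+# port from values.yaml (service.ports.http, 8000 by default) using the Service
+# type from values.yaml (ClusterIP by default), selecting pods by the
+# app=<release name>-serving label set in the Deployment template.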
+ +apiVersion: v1 +kind: Service +metadata: + name: {{ .Release.Name }}-svc +spec: + selector: + app: {{ .Release.Name }}-serving + ports: + - name: http + port: {{ .Values.service.ports.http }} + targetPort: {{ .Values.service.ports.http }} + type: {{ .Values.service.type }} \ No newline at end of file diff --git a/src/helm-charts/a3mega/trtllm-inference/single-node/Chart.yaml b/src/helm-charts/a3mega/trtllm-inference/single-node/Chart.yaml new file mode 100644 index 00000000..758baf9c --- /dev/null +++ b/src/helm-charts/a3mega/trtllm-inference/single-node/Chart.yaml @@ -0,0 +1,20 @@ +# Copyright 2025 Google LLC +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +apiVersion: v2 +name: trtllm-llama-3-1-405b-inference +description: trtllm-llama-3-1-405b-inference +type: application +version: 0.1.0 +appVersion: "1.16.0" diff --git a/src/helm-charts/a3mega/trtllm-inference/single-node/templates/serving-config-configmap.yaml b/src/helm-charts/a3mega/trtllm-inference/single-node/templates/serving-config-configmap.yaml new file mode 100644 index 00000000..d8c0759f --- /dev/null +++ b/src/helm-charts/a3mega/trtllm-inference/single-node/templates/serving-config-configmap.yaml @@ -0,0 +1,7 @@ +apiVersion: v1 +kind: ConfigMap +metadata: + name: "{{ .Release.Name }}-config" +data: + serving-configuration: |- +{{ .Values.serving_config | indent 4 }} diff --git a/src/helm-charts/a3mega/trtllm-inference/single-node/templates/serving-launcher-configmap.yaml b/src/helm-charts/a3mega/trtllm-inference/single-node/templates/serving-launcher-configmap.yaml new file mode 100644 index 00000000..9415d5be --- /dev/null +++ b/src/helm-charts/a3mega/trtllm-inference/single-node/templates/serving-launcher-configmap.yaml @@ -0,0 +1,13 @@ +apiVersion: v1 +kind: ConfigMap +metadata: + name: "{{ .Release.Name }}-launcher" +data: + launch-workload.sh: |- +{{- if .Values.workload_launcher }} +{{ .Values.workload_launcher | nindent 4 }} +{{- else }} + #!/bin/bash + echo "No workload launcher specified" + exit 1 +{{- end }} diff --git a/src/helm-charts/a3mega/trtllm-inference/single-node/templates/serving-launcher.yaml b/src/helm-charts/a3mega/trtllm-inference/single-node/templates/serving-launcher.yaml new file mode 100644 index 00000000..f3055ba5 --- /dev/null +++ b/src/helm-charts/a3mega/trtllm-inference/single-node/templates/serving-launcher.yaml @@ -0,0 +1,280 @@ +# Copyright 2026 Google LLC +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
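+
+# Deployment template for the single-node TensorRT-LLM workload: it computes one
+# replica per 8 requested GPUs (minimum 1), injects the Hugging Face token from
+# the token Secret configured in values.yaml (hf-secret in this recipe), mounts
+# the serving config and launcher ConfigMaps under /workload, and, when
+# configured, mounts the GCS bucket used for benchmark logs and the local SSD
+# used for model downloads.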
+ +{{ $nodes := div .Values.workload.gpus 8 | max 1 }} +{{ $gpusPerNode := min .Values.workload.gpus 8 }} + +{{ $root := . }} + +apiVersion: apps/v1 +kind: Deployment +metadata: + name: "{{ .Release.Name }}" + namespace: default + labels: + app: {{ .Release.Name }}-serving + {{- if $root.Values.queue }} + kueue.x-k8s.io/queue-name: "{{ $root.Values.queue }}" + {{- end }} +spec: + replicas: {{ $nodes }} + selector: + matchLabels: + app: {{ .Release.Name }}-serving + template: + metadata: + labels: + app: {{ .Release.Name }}-serving + annotations: + kubectl.kubernetes.io/default-container: serving + {{- if $root.Values.volumes.gcsVolumes }} + gke-gcsfuse/volumes: "true" + gke-gcsfuse/cpu-limit: "0" + gke-gcsfuse/memory-limit: "0" + gke-gcsfuse/ephemeral-storage-limit: "0" + {{- end }} + {{- if and $root.Values.queue $root.Values.dwsSettings.maxRunDurationSeconds }} + provreq.kueue.x-k8s.io/maxRunDurationSeconds: "{{ $root.Values.dwsSettings.maxRunDurationSeconds }}" + {{- end }} + {{- if not $root.Values.network.hostNetwork }} + networking.gke.io/default-interface: "eth0" + networking.gke.io/interfaces: | + {{- if $root.Values.network.subnetworks }} + [ + {{- range $i, $subnetwork := $root.Values.network.subnetworks }} + {"interfaceName":"eth{{ $i }}","network":"{{ $subnetwork }}"}{{ eq $i 9 | ternary "" ","}} + {{- end }} + ] + {{- else }} + [ + {"interfaceName":"eth0","network":"default"}, + {"interfaceName":"eth1","network":"gvnic-1"}, + {{- range $i := until 8 }} + {"interfaceName":"eth{{ add 2 $i }}","network":"rdma-{{ $i }}"}{{ eq $i 7 | ternary "" ","}} + {{- end }} + ] + {{- end }} + {{- end }} + spec: + {{- if $root.Values.network.hostNetwork }} + hostNetwork: true + dnsPolicy: ClusterFirstWithHostNet + {{- end }} + subdomain: "{{.Release.Name}}" + restartPolicy: Always + {{- if $root.Values.targetNodes }} + affinity: + nodeAffinity: + requiredDuringSchedulingIgnoredDuringExecution: + nodeSelectorTerms: + - matchExpressions: + - key: kubernetes.io/hostname + operator: In + values: + {{ range $hostname := $root.Values.targetNodes }} + - {{ $hostname }} + {{ end }} + {{- end }} + tolerations: + - operator: "Exists" + key: nvidia.com/gpu + - operator: "Exists" + key: cloud.google.com/impending-node-termination + volumes: + {{- if $root.Values.network.gibVersion }} + - name: gib + emptyDir: {} + {{- end }} + - name: serving-configuration + configMap: + name: "{{.Release.Name}}-config" + items: + - key: serving-configuration + path: {{ $root.Values.workload.configFile | default "serving-args.yaml" }} + - name: serving-launcher + configMap: + name: "{{.Release.Name}}-launcher" + defaultMode: 0700 + - name: shared-memory + emptyDir: + medium: "Memory" + sizeLimit: 250Gi + {{- range $gcs := $root.Values.volumes.gcsMounts }} + - name: "{{ $gcs.bucketName }}" + csi: + driver: gcsfuse.csi.storage.gke.io + volumeAttributes: + bucketName: "{{ $gcs.bucketName }}" + {{- if $gcs.mountOptions }} + mountOptions: "{{ $gcs.mountOptions }}" + {{- end }} + {{- end }} + {{- if $root.Values.volumes.ssdMountPath }} + - name: local-ssd + hostPath: + path: /mnt/stateful_partition/kube-ephemeral-ssd + {{- end }} + + initContainers: + {{- if $root.Values.network.gibVersion }} + - name: nccl-plugin-installer + image: {{ $root.Values.network.gibVersion }} + imagePullPolicy: Always + args: + - | + set -ex + /scripts/container_entry.sh install --install-nccl + cp -R /var/lib/gib/lib64/. /target/usr/local/gib/lib64 + cp -R /var/lib/gib/. 
/target/usr/local/gib + command: + - /bin/sh + - -c + volumeMounts: + - mountPath: /target/usr/local/gib + name: gib + {{- end }} + + containers: + {{- if $root.Values.workload.gcsSidecarImage }} + - name: gke-gcsfuse-sidecar + image: {{ $root.Values.workload.gcsSidecarImage }} + - name: gke-gcsfuse-metadata-prefetch + image: {{ $root.Values.workload.gcsSidecarImage }} + {{- end }} + - name: serving + image: "{{ $root.Values.workload.image }}" + imagePullPolicy: Always + {{- if $root.Values.network.hostNetwork }} + securityContext: + privileged: true + {{- end }} + env: + - name: HF_TOKEN + valueFrom: + secretKeyRef: + name: "{{ $root.Values.huggingface.secretName }}" + key: "{{ $root.Values.huggingface.secretData.token }}" + {{- if $root.Values.network.ncclSettings }} + {{- toYaml .Values.network.ncclSettings | nindent 12 }} + {{- end }} + - name: NCCL_PLUGIN_PATH + value: /usr/local/gib/lib64 + - name: LD_LIBRARY_PATH + value: /usr/local/gib/lib64:/usr/local/nvidia/lib64 + {{- if $root.Values.network.gibVersion }} + - name: NCCL_INIT_SCRIPT + value: "/usr/local/gib/scripts/set_nccl_env.sh" + {{- end }} + - name: MODEL_NAME + value: "{{ $root.Values.workload.model.name }}" + - name: MODEL_DOWNLOAD_DIR + value: "/ssd/{{ $root.Values.workload.model.name }}" + - name: TRTLLM_DIR + value: "/app/tensorrt_llm" + - name: LAUNCHER_SCRIPT + value: "/workload/launcher/launch-workload.sh" + - name: SERVER_ARGS_FILE + value: "/workload/configs/serving-args.yaml" + - name: NCCL_PROFILER_PLUGIN + value: "none" + - name: NCCL_TUNER_PLUGIN + value: "none" + - name: TRTLLM_DISABLE_AUTOTUNER + value: "1" + - name: TRTLLM_CUSTOM_ALLREDUCE + value: "0" + + + {{- if $root.Values.workload.envs }} + {{- toYaml .Values.workload.envs | nindent 12 }} + {{- end }} + + workingDir: /workload + command: ["/bin/bash", "-c"] + args: + - | + #!/bin/bash + + if [[ -n "${NCCL_INIT_SCRIPT}" ]]; then + echo "Running NCCL init script: ${NCCL_INIT_SCRIPT}" + source ${NCCL_INIT_SCRIPT} + env | grep NCCL + ldconfig + fi + + if [ ! -f "$LAUNCHER_SCRIPT" ]; then + echo "Error: Launcher script $LAUNCHER_SCRIPT not found!" 
+ exit 1 + fi + + ARGS=() + + if [ -f "$SERVER_ARGS_FILE" ]; then + echo "Loading server arguments from ConfigMap" + while IFS=': ' read -r key value || [ -n "$key" ]; do + [[ -z "$key" || "$key" == \#* ]] && continue + key=$(echo "$key" | xargs) + value=$(echo "$value" | xargs) + + if [ -n "$key" ]; then + if [[ "$value" == "true" ]]; then + ARGS+=("--$key") + elif [[ "$value" == "false" ]]; then + ARGS+=("--$key" "false") + elif [ -n "$value" ]; then + ARGS+=("--$key" "$value") + else + ARGS+=("--$key") + fi + fi + done < "$SERVER_ARGS_FILE" + fi + + {{ if eq $root.Values.workload.framework "trtllm" }} + {{- range $root.Values.workload.benchmarks.experiments }} + echo "Running: $LAUNCHER_SCRIPT --model_name $MODEL_NAME --isl {{ .isl }} --osl {{ .osl }} --num_requests {{ .num_requests }} -- ${ARGS[@]}" + exec "$LAUNCHER_SCRIPT" --model_name $MODEL_NAME --isl {{ .isl }} --osl {{ .osl }} --num_requests {{ .num_requests }} -- "${ARGS[@]}" + {{- end }} + {{ else }} + echo "Running: $LAUNCHER_SCRIPT ${ARGS[@]}" + exec "$LAUNCHER_SCRIPT" "${ARGS[@]}" + {{- end }} + + volumeMounts: + {{- if $root.Values.network.gibVersion }} + - name: gib + mountPath: /usr/local/gib + {{- end }} + - name: serving-configuration + mountPath: {{ $root.Values.workload.configPath | default "/workload/configs" }} + - name: serving-launcher + mountPath: /workload/launcher + - name: shared-memory + mountPath: /dev/shm + {{- range $gcs := $root.Values.volumes.gcsMounts }} + - name: "{{ $gcs.bucketName }}" + mountPath: "{{ $gcs.mountPath }}" + {{- end }} + {{- if $root.Values.volumes.ssdMountPath }} + - name: local-ssd + mountPath: "{{ $root.Values.volumes.ssdMountPath }}" + {{- end }} + + resources: + requests: + nvidia.com/gpu: {{ $gpusPerNode }} + memory: "500Gi" + limits: + nvidia.com/gpu: {{ $gpusPerNode }} + memory: "500Gi" diff --git a/src/helm-charts/a3mega/trtllm-inference/single-node/templates/serving-svc.yaml b/src/helm-charts/a3mega/trtllm-inference/single-node/templates/serving-svc.yaml new file mode 100644 index 00000000..c1eeb742 --- /dev/null +++ b/src/helm-charts/a3mega/trtllm-inference/single-node/templates/serving-svc.yaml @@ -0,0 +1,26 @@ +# Copyright 2026 Google LLC +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +apiVersion: v1 +kind: Service +metadata: + name: {{ .Release.Name }}-svc +spec: + selector: + app: {{ .Release.Name }}-serving + ports: + - name: http + port: {{ .Values.service.ports.http }} + targetPort: {{ .Values.service.ports.http }} + type: {{ .Values.service.type }} diff --git a/src/launchers/trtllm-launcher.sh b/src/launchers/trtllm-launcher.sh old mode 100644 new mode 100755 index dc3f828b..7659dddb --- a/src/launchers/trtllm-launcher.sh +++ b/src/launchers/trtllm-launcher.sh @@ -1,4 +1,4 @@ -# Copyright 2025 Google LLC +# Copyright 2026 Google LLC # # Licensed under the Apache License, Version 2.0 (the "License"); # you may not use this file except in compliance with the License. 
@@ -124,6 +124,7 @@ parse_serving_config() { streaming=${SERVING_CONFIG_DICT["streaming"]:="false"} max_input_len=${SERVING_CONFIG_DICT["max_input_len"]:=""} max_batch_size=${SERVING_CONFIG_DICT["max_batch_size"]:=""} + max_num_tokens=${SERVING_CONFIG_DICT["max_num_tokens"]:=""} custom_dataset=${SERVING_CONFIG_DICT["dataset"]:=""} } @@ -145,6 +146,7 @@ print_configuration() { echo "streaming: $streaming" echo "max input length: $max_input_len" echo "max batch size: $max_batch_size" + echo "max num tokens: $max_num_tokens" echo "kv_cache_free_gpu_mem_fraction: $kv_cache_free_gpu_mem_fraction" echo "--------------------------------" } @@ -174,6 +176,7 @@ run_benchmark() { if [ "$streaming" == "true" ]; then vl_args="$vl_args --streaming"; fi if [ -n "$max_input_len" ]; then vl_args="$vl_args --max_input_len $max_input_len"; fi if [ -n "$max_batch_size" ]; then vl_args="$vl_args --max_batch_size $max_batch_size"; fi + if [ -n "$max_num_tokens" ]; then vl_args="$vl_args --max_num_tokens $max_num_tokens"; fi dataset_file=$custom_dataset # If custom_dataset is not set, generate a textual dataset with tokens sampled in normal distribution @@ -198,14 +201,13 @@ run_benchmark() { fi export TOKENIZERS_PARALLELISM=false - echo "enable_cuda_graph: false" > /tmp/extra_llm_api_args.yaml if [[ $backend == "pytorch" ]]; then echo "Running throughput benchmark" export NCCL_P2P_LEVEL=PHB trtllm-bench \ --model $model_name \ - --model_path /ssd/${model_name} throughput \ + --model_path /ssd/${model_name} throughput --tp $tp_size --pp $pp_size \ --dataset $dataset_file \ --num_requests $num_requests \ --tp $tp_size \ @@ -231,10 +233,10 @@ run_benchmark() { echo "Running throughput benchmark" trtllm-bench \ --model $model_name \ - --model_path /ssd/${model_name} throughput \ + --model_path /ssd/${model_name} throughput --tp $tp_size --pp $pp_size \ --dataset $dataset_file \ --engine_dir $engine_dir \ - --kv_cache_free_gpu_mem_fraction $kv_cache_free_gpu_mem_fraction $extra_args >$output_file + --kv_cache_free_gpu_mem_fraction $kv_cache_free_gpu_mem_fraction $extra_args $vl_args >$output_file fi cat $output_file @@ -263,7 +265,8 @@ main() { # Set environment variables export HF_HOME=/ssd -export LD_LIBRARY_PATH=/usr/local/lib/python3.12/dist-packages/torch/lib:/usr/local/lib/python3.12/dist-packages/torch_tensorrt/lib:/usr/local/cuda/compat/lib:/usr/local/nvidia/lib:/usr/local/nvidia/lib64:/usr/local/cuda/lib64:/usr/local/tensorrt/lib +export LD_LIBRARY_PATH=/usr/local/gib/lib64:/usr/local/lib/python3.12/dist-packages/torch/lib:/usr/local/lib/python3.12/dist-packages/torch_tensorrt/lib:/usr/local/cuda/compat/lib:/usr/local/nvidia/lib:/usr/local/nvidia/lib64:/usr/local/cuda/lib64:/usr/local/tensorrt/lib # Run the main function -main "$@" \ No newline at end of file +main "$@" +
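
The launcher change above adds `max_num_tokens` (alongside the existing `max_batch_size`) to the keys parsed from the serving config and forwarded to `trtllm-bench`. The sketch below shows one way to exercise these knobs from the README's install flow: point `--set-file serving_config` at a local copy of the config and override the benchmark experiment shape defined in `values.yaml`. It assumes the environment variables from Sections 3 and 4 are still set; the `/tmp/llama3.1-70b-custom.yaml` path, the `-custom` release suffix, and all override values are illustrative assumptions, not part of the recipe.

```bash
# Write a custom serving config based on the stock Llama 3.1 70B settings,
# adding the batch/token limits that trtllm-launcher.sh forwards to trtllm-bench.
# The extra values here are examples only.
cat > /tmp/llama3.1-70b-custom.yaml <<'EOF'
tp_size: 8
pp_size: 1
backend: pytorch
kv_cache_free_gpu_mem_fraction: 0.80
max_batch_size: 2048
max_num_tokens: 8192
EOF

# Install a separate release that uses the custom config and a different
# benchmark experiment shape (ISL/OSL/request count overridden via --set).
cd $RECIPE_ROOT
helm install -f values.yaml \
  --set-file workload_launcher=$REPO_ROOT/src/launchers/trtllm-launcher.sh \
  --set-file serving_config=/tmp/llama3.1-70b-custom.yaml \
  --set queue=${KUEUE_NAME} \
  --set "volumes.gcsMounts[0].bucketName=${GCS_BUCKET}" \
  --set workload.model.name=${HF_MODEL_ID} \
  --set workload.image=nvcr.io/nvidia/tensorrt-llm/release:${TRTLLM_VERSION} \
  --set workload.framework=trtllm \
  --set "workload.benchmarks.experiments[0].isl=2048" \
  --set "workload.benchmarks.experiments[0].osl=256" \
  --set "workload.benchmarks.experiments[0].num_requests=5000" \
  ${RELEASE_NAME}-custom \
  $REPO_ROOT/src/helm-charts/a3mega/trtllm-inference/single-node
```

The deployment, monitoring, and cleanup steps from Sections 4 through 6 apply unchanged to a release installed this way; only the release name differs.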