diff --git a/inference/a3mega/llama3.1-70b/trtllm-gke/README.md b/inference/a3mega/llama3.1-70b/trtllm-gke/README.md
new file mode 100644
index 00000000..dbd7d53d
--- /dev/null
+++ b/inference/a3mega/llama3.1-70b/trtllm-gke/README.md
@@ -0,0 +1,386 @@
+# Single Host Model Serving with NVIDIA TensorRT-LLM (TRT-LLM) on A3mega GKE Node Pool
+
+This document outlines the steps to serve and benchmark various Large Language Models (LLMs) using the [NVIDIA TensorRT-LLM](https://github.com/NVIDIA/TensorRT-LLM) framework on a single [A3-Mega GKE Node pool](https://cloud.google.com/kubernetes-engine).
+
+This guide walks you through setting up the necessary cloud infrastructure, configuring your environment, and deploying a high-performance LLM for inference.
+
+
+## Table of Contents
+
+* [1. Test Environment](#1-test-environment)
+* [2. High-Level Flow](#2-high-level-flow)
+* [3. Environment Setup (One-Time)](#3-environment-setup-one-time)
+  * [3.1. Clone the Repository](#31-clone-the-repository)
+  * [3.2. Configure Environment Variables](#32-configure-environment-variables)
+  * [3.3. Connect to your GKE Cluster](#33-connect-to-your-gke-cluster)
+  * [3.4. Get Hugging Face Token](#34-get-hugging-face-token)
+  * [3.5. Create Hugging Face Kubernetes Secret](#35-create-hugging-face-kubernetes-secret)
+* [4. Run the Recipe](#4-run-the-recipe)
+  * [4.1. Supported Models](#41-supported-models)
+  * [4.2. Deploy and Benchmark a Model](#42-deploy-and-benchmark-a-model)
+* [5. Monitoring and Troubleshooting](#5-monitoring-and-troubleshooting)
+  * [5.1. Check Deployment Status](#51-check-deployment-status)
+  * [5.2. View Logs](#52-view-logs)
+* [6. Cleanup](#6-cleanup)
+
+
+## 1. Test Environment
+
+[Back to Top](#table-of-contents)
+
+The recipe uses the following setup:
+
+* **Orchestration**: [Google Kubernetes Engine (GKE)](https://cloud.google.com/kubernetes-engine)
+* **Deployment Configuration**: A [Helm chart](https://helm.sh/) is used to configure and deploy a [Kubernetes Deployment](https://kubernetes.io/docs/concepts/workloads/controllers/deployment/). This Deployment encapsulates the inference workload for the target LLM using the TensorRT-LLM framework.
+
+This recipe has been optimized for and tested with the following configuration:
+
+* **GKE Cluster**:
+ * A [regional standard cluster](https://cloud.google.com/kubernetes-engine/docs/concepts/configuration-overview) version: `1.33.4-gke.1036000` or later.
+ * A GPU node pool with 1 [a3-megagpu-8g](https://docs.cloud.google.com/compute/docs/accelerator-optimized-machines#a3-mega-vms) machine.
+ * [Workload Identity Federation for GKE](https://cloud.google.com/kubernetes-engine/docs/concepts/workload-identity) enabled.
+ * [Cloud Storage FUSE CSI driver for GKE](https://cloud.google.com/kubernetes-engine/docs/concepts/cloud-storage-fuse-csi-driver) enabled.
+ * [DCGM metrics](https://cloud.google.com/kubernetes-engine/docs/how-to/dcgm-metrics) enabled.
+ * [Kueue](https://kueue.sigs.k8s.io/docs/reference/kueue.v1beta1/) and [JobSet](https://jobset.sigs.k8s.io/docs/overview/) APIs installed.
+ * Kueue configured to support [Topology Aware Scheduling](https://kueue.sigs.k8s.io/docs/concepts/topology_aware_scheduling/).
+* A regional Google Cloud Storage (GCS) bucket to store logs generated by the recipe runs.
+
+> [!IMPORTANT]
+> To prepare the required environment, see the [GKE environment setup guide](../../../../docs/configuring-environment-gke-a3-mega.md).
+> Provisioning a new GKE cluster is a long-running operation and can take **20-30 minutes**.
+
+
+## 2. High-Level Flow
+
+[Back to Top](#table-of-contents)
+
+Here is a simplified diagram of the flow that we follow in this recipe:
+
+```mermaid
+---
+config:
+ layout: dagre
+---
+flowchart TD
+ subgraph workstation["Client Workstation"]
+ T["Cluster Toolkit"]
+ B("Kubernetes API")
+ A["helm install"]
+ end
+ subgraph huggingface["Hugging Face Hub"]
+ I["Model Weights"]
+ end
+ subgraph gke["GKE Cluster (A3-Mega)"]
+ C["Deployment"]
+ D["Pod"]
+ E["TensorRT-LLM container"]
+ F["Service"]
+ end
+ subgraph storage["Cloud Storage"]
+ J["Bucket"]
+ end
+
+ %% Logical/actual flow
+ T -- Create Cluster --> gke
+ A --> B
+ B --> C & F
+ C --> D
+ D --> E
+ F --> C
+ E -- Downloads at runtime --> I
+ E -- Write logs --> J
+
+
+ %% Layout control
+ gke
+```
+
+* **helm:** A package manager for Kubernetes to define, install, and upgrade applications. It's used here to configure and deploy the Kubernetes Deployment.
+* **Deployment:** Manages the lifecycle of your model server pod, ensuring it stays running.
+* **Service:** Provides a stable network endpoint (a DNS name and IP address) to access your model server.
+* **Pod:** The smallest deployable unit in Kubernetes. The TensorRT-LLM container runs inside this pod on a GPU-enabled node.
+* **Cloud Storage:** A Cloud Storage bucket to store benchmark logs and other artifacts.
+
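+After you run the recipe, you can confirm that these objects exist with `kubectl`. A minimal check, assuming the release name exported later in this guide (`$RELEASE_NAME`):
+
+```bash
+# The Deployment and Service are named after the Helm release.
+kubectl get deployment/${RELEASE_NAME} service/${RELEASE_NAME}-svc
+
+# Pods created by the Deployment carry the app=<release>-serving label.
+kubectl get pods -l app=${RELEASE_NAME}-serving
+```
+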
+
+## 3. Environment Setup (One-Time)
+
+[Back to Top](#table-of-contents)
+
+First, you'll configure your local environment. These steps are required once before you can deploy any models.
+
+
+### 3.1. Clone the Repository
+
+```bash
+git clone https://github.com/ai-hypercomputer/gpu-recipes.git
+cd gpu-recipes
+export REPO_ROOT=$(pwd)
+export RECIPE_ROOT=$REPO_ROOT/inference/a3mega/llama3.1-70b/trtllm-gke
+```
+
+
+### 3.2. Configure Environment Variables
+
+This is the most critical step. These variables are used in subsequent commands to target the correct resources.
+
+```bash
+export PROJECT_ID=
+export CLUSTER_REGION=
+export CLUSTER_NAME=
+export KUEUE_NAME=
+export GCS_BUCKET=
+export TRTLLM_VERSION=1.3.0rc3
+
+# Set the project for gcloud commands
+gcloud config set project $PROJECT_ID
+```
+
+Replace the following values:
+
+| Variable | Description | Example |
+| --------------------- | ------------------------------------------------------------------------------------------------------- | ------------------------------------------------------- |
+| `PROJECT_ID` | Your Google Cloud Project ID. | `gcp-project-12345` |
+| `CLUSTER_REGION` | The GCP region where your GKE cluster is located. | `us-central1` |
+| `CLUSTER_NAME` | The name of your GKE cluster. | `a3-mega` |
+| `KUEUE_NAME` | The name of the Kueue local queue. The default queue created by the cluster toolkit is `a3mega`. Verify the name in your cluster. | `a3mega` |
+| `GCS_BUCKET` | Name of your GCS bucket (do not include `gs://`). | `my-benchmark-logs-bucket` |
+| `TRTLLM_VERSION`      | The tag of the TensorRT-LLM release container image. Other versions can be found at https://catalog.ngc.nvidia.com/orgs/nvidia/teams/tensorrt-llm/containers/release | `1.3.0rc3` |
+
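+Before continuing, you can verify that the required variables are set. A small sanity check:
+
+```bash
+for var in PROJECT_ID CLUSTER_REGION CLUSTER_NAME KUEUE_NAME GCS_BUCKET TRTLLM_VERSION; do
+  [[ -z "${!var}" ]] && echo "WARNING: $var is not set"
+done
+```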
+
+
+### 3.3. Connect to your GKE Cluster
+
+Fetch credentials for `kubectl` to communicate with your cluster.
+
+```bash
+gcloud container clusters get-credentials $CLUSTER_NAME --region $CLUSTER_REGION
+```
+
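+To confirm that the cluster and its GPU node pool are reachable, list the A3-Mega nodes. The accelerator label below is the standard GKE label for H100 Mega GPUs; verify that it matches your node pool:
+
+```bash
+kubectl get nodes -l cloud.google.com/gke-accelerator=nvidia-h100-mega-80gb
+```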
+
+### 3.4. Get Hugging Face Token
+
+To access models through Hugging Face, you'll need a Hugging Face token.
+ 1. Create a [Hugging Face account](https://huggingface.co/) if you don't have one.
+ 2. For **gated models** like Llama 3.1, ensure you have requested and been granted access on Hugging Face before proceeding.
+ 3. Generate an Access Token: Go to **Your Profile > Settings > Access Tokens**.
+ 4. Select **New Token**.
+ 5. Specify a Name and a Role of at least `Read`.
+ 6. Select **Generate a token**.
+ 7. Copy the generated token to your clipboard. You'll use this later.
+
+
+
+### 3.5. Create Hugging Face Kubernetes Secret
+
+Create a Kubernetes Secret with your Hugging Face token to enable the pod to download model checkpoints from Hugging Face.
+
+```bash
+# Paste your Hugging Face token here
+export HF_TOKEN=
+
+kubectl create secret generic hf-secret \
+--from-literal=hf_api_token=${HF_TOKEN} \
+--dry-run=client -o yaml | kubectl apply -f -
+```
+
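+You can confirm that the secret exists and contains the expected key by printing only the first few characters of the stored token:
+
+```bash
+kubectl get secret hf-secret -o jsonpath='{.data.hf_api_token}' | base64 --decode | cut -c1-6
+```
+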
+
+## 4. Run the Recipe
+
+[Back to Top](#table-of-contents)
+
+> [!NOTE]
+> After running the recipe with `helm install`, it can take **up to 30 minutes** for the deployment to become fully available. This is because the GKE node must first pull the Docker image and then download the model weights from Hugging Face.
+
+> [!TIP]
+> You can use the [NVIDIA Model Optimizer](https://github.com/NVIDIA/Model-Optimizer/tree/main/examples/llm_ptq) to quantize these models to FP8 for improved performance.
+
+
+### 4.1. Supported Models
+
+[Back to Top](#table-of-contents)
+
+This recipe supports the deployment of the following models:
+
+| Model Name | Hugging Face ID | Configuration File | Release Name Suffix |
+| :--- | :--- | :--- | :--- |
+| **Llama 3.1 70B** | `meta-llama/Llama-3.1-70B-Instruct` | `llama-3.1-70b.yaml` | `llama-3-1-70b` |
+
+
+### 4.2. Deploy and Benchmark a Model
+
+[Back to Top](#table-of-contents)
+
+The recipe uses [`trtllm-bench`](https://github.com/NVIDIA/TensorRT-LLM/blob/main/docs/source/legacy/performance/perf-benchmarking.md), a command-line tool from NVIDIA for benchmarking the performance of TensorRT-LLM engines.
+
+1. **Configure model-specific variables.** Choose a model from the [table above](#41-supported-models) and set the variables:
+
+ ```bash
+ # Example for Llama 3.1 70B
+ export HF_MODEL_ID="meta-llama/Llama-3.1-70B-Instruct"
+ export CONFIG_FILE="llama-3.1-70b.yaml"
+ export RELEASE_NAME="$USER-serving-llama-3-1-70b"
+ ```
+
+2. **Install the helm chart:**
+
+ ```bash
+ cd $RECIPE_ROOT
+ helm install -f values.yaml \
+ --set-file workload_launcher=$REPO_ROOT/src/launchers/trtllm-launcher.sh \
+ --set-file serving_config=$REPO_ROOT/src/frameworks/a3mega/trtllm-configs/${CONFIG_FILE} \
+ --set queue=${KUEUE_NAME} \
+ --set "volumes.gcsMounts[0].bucketName=${GCS_BUCKET}" \
+ --set workload.model.name=${HF_MODEL_ID} \
+ --set workload.image=nvcr.io/nvidia/tensorrt-llm/release:${TRTLLM_VERSION} \
+ --set workload.framework=trtllm \
+ ${RELEASE_NAME} \
+ $REPO_ROOT/src/helm-charts/a3mega/trtllm-inference/single-node
+ ```
+
+3. **Check the deployment status:**
+
+ ```bash
+ kubectl get deployment/${RELEASE_NAME}
+ ```
+
+   Wait until the `READY` column shows `1/1`. See the [Monitoring and Troubleshooting](#5-monitoring-and-troubleshooting) section to view the deployment logs.
+
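+While the deployment is starting, you can also watch the pod directly. The pod label below comes from the Helm chart used by this recipe:
+
+```bash
+kubectl get pods -l app=${RELEASE_NAME}-serving -w
+```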
+
+## 5. Monitoring and Troubleshooting
+
+[Back to Top](#table-of-contents)
+
+After the model is deployed via Helm as described in the [section above](#4-run-the-recipe), use the following steps to monitor the deployment and view its logs. Replace `<deployment name>` and `<service name>` with the names from the model-specific deployment instructions (e.g., `$USER-serving-llama-3-1-70b` and `$USER-serving-llama-3-1-70b-svc`).
+
+
+### 5.1. Check Deployment Status
+
+Check the status of your deployment. Replace the name if you deployed a different model.
+
+```bash
+# Example for Llama 3.1 70B
+kubectl get deployment/$USER-serving-llama-3-1-70b
+```
+
+Wait until the `READY` column shows `1/1`. If it shows `0/1`, the pod is still starting up.
+
+> [!NOTE]
+> In the GKE UI on Cloud Console, you might see a status of "Does not have minimum availability" during startup. This is normal and will resolve once the pod is ready.
+
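+If the deployment stays at `0/1` for a long time, describe the pod to check for scheduling or image pull issues. The label selector below comes from the Helm chart used by this recipe:
+
+```bash
+kubectl describe pod -l app=$USER-serving-llama-3-1-70b-serving
+```
+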
+
+### 5.2. View Logs
+
+To see the logs from the TensorRT-LLM container (useful for debugging), use the `-f` flag to follow the log stream:
+
+```bash
+kubectl logs -f deployment/$USER-serving-llama-3-1-70b
+```
+
+You should see logs showing the model being prepared and then the throughput benchmark running, similar to the following:
+
+```bash
+Running benchmark for nvidia/Llama3.1-70b with ISL=128, OSL=128, TP=8, EP=1, PP=1
+
+===========================================================
+= PYTORCH BACKEND
+===========================================================
+Model: nvidia/Llama3.1-70b
+Model Path: /ssd/nvidia/Llama3.1-70b
+TensorRT LLM Version: 1.2
+Dtype: bfloat16
+KV Cache Dtype: FP8
+Quantization: FP8
+
+===========================================================
+= REQUEST DETAILS
+===========================================================
+Number of requests: 1000
+Number of concurrent requests: 985.9849
+Average Input Length (tokens): 128.0000
+Average Output Length (tokens): 128.0000
+===========================================================
+= WORLD + RUNTIME INFORMATION
+===========================================================
+TP Size: 8
+PP Size: 1
+EP Size: 1
+Max Runtime Batch Size: 2304
+Max Runtime Tokens: 4608
+Scheduling Policy: GUARANTEED_NO_EVICT
+KV Memory Percentage: 85.00%
+Issue Rate (req/sec): 8.3913E+13
+
+===========================================================
+= PERFORMANCE OVERVIEW
+===========================================================
+Request Throughput (req/sec): X.XX
+Total Output Throughput (tokens/sec): X.XX
+Total Token Throughput (tokens/sec): X.XX
+Total Latency (ms): X.XX
+Average request latency (ms): X.XX
+Per User Output Throughput [w/ ctx] (tps/user): X.XX
+Per GPU Output Throughput (tps/gpu): X.XX
+
+-- Request Latency Breakdown (ms) -----------------------
+
+[Latency] P50 : X.XX
+[Latency] P90 : X.XX
+[Latency] P95 : X.XX
+[Latency] P99 : X.XX
+[Latency] MINIMUM: X.XX
+[Latency] MAXIMUM: X.XX
+[Latency] AVERAGE: X.XX
+
+===========================================================
+= DATASET DETAILS
+===========================================================
+Dataset Path: /ssd/token-norm-dist_llama3.1-70b_128_128_tp4.json
+Number of Sequences: 1000
+
+-- Percentiles statistics ---------------------------------
+
+ Input Output Seq. Length
+-----------------------------------------------------------
+MIN: 128.0000 128.0000 256.0000
+MAX: 128.0000 128.0000 256.0000
+AVG: 128.0000 128.0000 256.0000
+P50: 128.0000 128.0000 256.0000
+P90: 128.0000 128.0000 256.0000
+P95: 128.0000 128.0000 256.0000
+P99: 128.0000 128.0000 256.0000
+===========================================================
+```
+
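+Once the benchmark completes, you can extract just the performance summary from the logs:
+
+```bash
+kubectl logs deployment/$USER-serving-llama-3-1-70b | sed -n '/PERFORMANCE OVERVIEW/,/Request Latency Breakdown/p'
+```
+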
+
+## 6. Cleanup
+
+To avoid incurring further charges, clean up the resources you created.
+
+1. **Uninstall the Helm Release:**
+
+ First, list your releases to get the deployed models:
+
+ ```bash
+ # list deployed models
+ helm list --filter $USER-serving-
+ ```
+
+ Then, uninstall the desired release:
+
+ ```bash
+ # uninstall the deployed model
+    helm uninstall <release name>
+ ```
+    Replace `<release name>` with one of the Helm release names listed by the previous command.
+
+2. **Delete the Kubernetes Secret:**
+
+ ```bash
+ kubectl delete secret hf-secret --ignore-not-found=true
+ ```
+
+3. (Optional) Clean up benchmark logs and other artifacts in your GCS bucket.
+4. (Optional) Delete the provisioned [test environment](#1-test-environment), including the GKE cluster.
\ No newline at end of file
diff --git a/inference/a3mega/llama3.1-70b/trtllm-gke/values.yaml b/inference/a3mega/llama3.1-70b/trtllm-gke/values.yaml
new file mode 100644
index 00000000..2649ebdd
--- /dev/null
+++ b/inference/a3mega/llama3.1-70b/trtllm-gke/values.yaml
@@ -0,0 +1,64 @@
+# Copyright 2026 Google LLC
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+queue:
+
+dwsSettings:
+ maxRunDurationSeconds:
+
+huggingface:
+ secretName: hf-secret
+ secretData:
+ token: "hf_api_token"
+
+volumes:
+ gcsVolumes: true
+ ssdMountPath: "/ssd"
+ gcsMounts:
+ - bucketName:
+ mountPath: "/gcs"
+
+service:
+ type: ClusterIP
+ ports:
+ http: 8000
+
+workload:
+ model:
+ name:
+ gpus: 8
+ image:
+ framework:
+ configFile: serving-args.yaml
+ configPath: /workload/configs
+ envs:
+ - name: LAUNCHER_SCRIPT
+ value: "/workload/launcher/launch-workload.sh"
+ - name: HF_HUB_ENABLE_HF_TRANSFER
+ value: "0"
+ - name: SERVER_ARGS_FILE
+ value: "/workload/configs/serving-args.yaml"
+ benchmarks:
+ experiments:
+ - isl: 128
+ osl: 128
+ num_requests: 30000
+
+network:
+ hostNetwork: true
+ subnetworks[]:
+ gibVersion: us-docker.pkg.dev/gce-ai-infra/gpudirect-gib/nccl-plugin-gib:v1.1.1
+ ncclSettings:
+ - name: NCCL_DEBUG
+ value: "VERSION"
diff --git a/src/frameworks/a3mega/trtllm-configs/llama3.1-70b.yaml b/src/frameworks/a3mega/trtllm-configs/llama3.1-70b.yaml
new file mode 100644
index 00000000..8cfa6a4e
--- /dev/null
+++ b/src/frameworks/a3mega/trtllm-configs/llama3.1-70b.yaml
@@ -0,0 +1,4 @@
+tp_size: 8
+pp_size: 1
+backend: pytorch
+kv_cache_free_gpu_mem_fraction: 0.80
diff --git a/src/helm-charts/a3mega/inference-templates/deployment/Chart.yaml b/src/helm-charts/a3mega/inference-templates/deployment/Chart.yaml
new file mode 100644
index 00000000..4f584cc4
--- /dev/null
+++ b/src/helm-charts/a3mega/inference-templates/deployment/Chart.yaml
@@ -0,0 +1,20 @@
+# Copyright 2025 Google LLC
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+apiVersion: v2
+name: single-host-serving-deployment-template
+description: single-host-serving-deployment-template
+type: application
+version: 0.1.0
+appVersion: "1.16.0"
diff --git a/src/helm-charts/a3mega/inference-templates/deployment/templates/serving-config-configmap.yaml b/src/helm-charts/a3mega/inference-templates/deployment/templates/serving-config-configmap.yaml
new file mode 100644
index 00000000..a17bdf49
--- /dev/null
+++ b/src/helm-charts/a3mega/inference-templates/deployment/templates/serving-config-configmap.yaml
@@ -0,0 +1,25 @@
+# Copyright 2025 Google LLC
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+apiVersion: v1
+kind: ConfigMap
+metadata:
+ name: "{{ .Release.Name }}-config"
+data:
+ serving-configuration: |-
+{{- if .Values.serving_config }}
+{{ .Values.serving_config | nindent 4 }}
+{{- else }}
+{{ "config: null" | nindent 4 }}
+{{- end }}
\ No newline at end of file
diff --git a/src/helm-charts/a3mega/inference-templates/deployment/templates/serving-launcher-configmap.yaml b/src/helm-charts/a3mega/inference-templates/deployment/templates/serving-launcher-configmap.yaml
new file mode 100644
index 00000000..b111553b
--- /dev/null
+++ b/src/helm-charts/a3mega/inference-templates/deployment/templates/serving-launcher-configmap.yaml
@@ -0,0 +1,27 @@
+# Copyright 2025 Google LLC
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+apiVersion: v1
+kind: ConfigMap
+metadata:
+ name: "{{ .Release.Name }}-launcher"
+data:
+ launch-workload.sh: |-
+{{- if .Values.workload_launcher }}
+{{ .Values.workload_launcher | nindent 4 }}
+{{- else }}
+ #!/bin/bash
+ echo "No workload launcher specified"
+ exit 1
+{{- end }}
\ No newline at end of file
diff --git a/src/helm-charts/a3mega/inference-templates/deployment/templates/serving-launcher.yaml b/src/helm-charts/a3mega/inference-templates/deployment/templates/serving-launcher.yaml
new file mode 100644
index 00000000..23b7ddfd
--- /dev/null
+++ b/src/helm-charts/a3mega/inference-templates/deployment/templates/serving-launcher.yaml
@@ -0,0 +1,265 @@
+# Copyright 2025 Google LLC
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+{{ $nodes := div .Values.workload.gpus 8 | max 1 }}
+{{ $gpusPerNode := min .Values.workload.gpus 8 }}
+
+{{ $root := . }}
+
+apiVersion: apps/v1
+kind: Deployment
+metadata:
+ name: "{{ .Release.Name }}"
+ namespace: default
+ labels:
+ app: {{ .Release.Name }}-serving
+ {{- if $root.Values.queue }}
+ kueue.x-k8s.io/queue-name: "{{ $root.Values.queue }}"
+ {{- end }}
+spec:
+ replicas: {{ $nodes }}
+ selector:
+ matchLabels:
+ app: {{ .Release.Name }}-serving
+ template:
+ metadata:
+ labels:
+ app: {{ .Release.Name }}-serving
+ annotations:
+ kubectl.kubernetes.io/default-container: serving
+ {{- if $root.Values.volumes.gcsVolumes }}
+ gke-gcsfuse/volumes: "true"
+ gke-gcsfuse/cpu-limit: "0"
+ gke-gcsfuse/memory-limit: "0"
+ gke-gcsfuse/ephemeral-storage-limit: "0"
+ {{- end }}
+ {{- if and $root.Values.queue $root.Values.dwsSettings.maxRunDurationSeconds }}
+ provreq.kueue.x-k8s.io/maxRunDurationSeconds: "{{ $root.Values.dwsSettings.maxRunDurationSeconds }}"
+ {{- end }}
+ {{- if not $root.Values.network.hostNetwork }}
+ networking.gke.io/default-interface: "eth0"
+ networking.gke.io/interfaces: |
+ {{- if $root.Values.network.subnetworks }}
+ [
+ {{- range $i, $subnetwork := $root.Values.network.subnetworks }}
+ {"interfaceName":"eth{{ $i }}","network":"{{ $subnetwork }}"}{{ eq $i 9 | ternary "" ","}}
+ {{- end }}
+ ]
+ {{- else }}
+ [
+ {"interfaceName":"eth0","network":"default"}
+ ]
+ {{- end }}
+ {{- end }}
+ spec:
+ {{- if $root.Values.network.hostNetwork }}
+ hostNetwork: true
+ dnsPolicy: ClusterFirstWithHostNet
+ {{- end }}
+ subdomain: "{{.Release.Name}}"
+ restartPolicy: Always
+ {{- if $root.Values.targetNodes }}
+ affinity:
+ nodeAffinity:
+ requiredDuringSchedulingIgnoredDuringExecution:
+ nodeSelectorTerms:
+ - matchExpressions:
+ - key: kubernetes.io/hostname
+ operator: In
+ values:
+ {{ range $hostname := $root.Values.targetNodes }}
+ - {{ $hostname }}
+ {{ end }}
+ {{- end }}
+ tolerations:
+ - operator: "Exists"
+ key: nvidia.com/gpu
+ - operator: "Exists"
+ key: cloud.google.com/impending-node-termination
+ volumes:
+ {{- if $root.Values.network.gibVersion }}
+ - name: gib
+ emptyDir: {}
+ {{- end }}
+ - name: serving-configuration
+ configMap:
+ name: "{{.Release.Name}}-config"
+ items:
+ - key: serving-configuration
+ path: {{ $root.Values.workload.configFile | default "serving-args" }}
+ - name: serving-launcher
+ configMap:
+ name: "{{.Release.Name}}-launcher"
+ defaultMode: 0700
+ - name: shared-memory
+ emptyDir:
+ medium: "Memory"
+ sizeLimit: 250Gi
+ {{- range $gcs := $root.Values.volumes.gcsMounts }}
+ - name: "{{ $gcs.bucketName }}"
+ csi:
+ driver: gcsfuse.csi.storage.gke.io
+ volumeAttributes:
+ bucketName: "{{ $gcs.bucketName }}"
+ {{- if $gcs.mountOptions }}
+ mountOptions: "{{ $gcs.mountOptions }}"
+ {{- end }}
+ {{- end }}
+ {{- if $root.Values.volumes.ssdMountPath }}
+ - name: local-ssd
+ hostPath:
+ path: /mnt/stateful_partition/kube-ephemeral-ssd
+ {{- end }}
+
+ initContainers:
+ {{- if $root.Values.network.gibVersion }}
+ - name: nccl-plugin-installer
+ image: {{ $root.Values.network.gibVersion }}
+ imagePullPolicy: Always
+ args:
+ - |
+ set -ex
+ /scripts/container_entry.sh install --install-nccl
+ cp -R /var/lib/gib/lib64/. /target/usr/local/gib/lib64
+ cp -R /var/lib/gib/. /target/usr/local/gib
+ command:
+ - /bin/sh
+ - -c
+ volumeMounts:
+ - mountPath: /target/usr/local/gib
+ name: gib
+ {{- end }}
+
+ containers:
+ {{- if $root.Values.workload.gcsSidecarImage }}
+ - name: gke-gcsfuse-sidecar
+ image: {{ $root.Values.workload.gcsSidecarImage }}
+ - name: gke-gcsfuse-metadata-prefetch
+ image: {{ $root.Values.workload.gcsSidecarImage }}
+ {{- end }}
+ - name: serving
+ image: "{{ $root.Values.workload.image }}"
+ imagePullPolicy: Always
+ {{- if $root.Values.network.hostNetwork }}
+ securityContext:
+ privileged: true
+ {{- end }}
+ env:
+ - name: HF_TOKEN
+ valueFrom:
+ secretKeyRef:
+ name: "{{ $root.Values.huggingface.secretName }}"
+ key: "{{ $root.Values.huggingface.secretData.token }}"
+ # Pass NCCL settings to the container
+ {{- if $root.Values.network.ncclSettings }}
+ {{- toYaml .Values.network.ncclSettings | nindent 12 }}
+ {{- end }}
+ - name: NCCL_PLUGIN_PATH
+ value: /usr/local/gib/lib64
+ - name: LD_LIBRARY_PATH
+ value: /usr/local/gib/lib64:/usr/local/nvidia/lib64
+ {{- if $root.Values.network.gibVersion }}
+ - name: NCCL_INIT_SCRIPT
+ value: "/usr/local/gib/scripts/set_nccl_env.sh"
+ {{- end }}
+ # Workload specific environment variables
+ - name: MODEL_NAME
+ value: "{{ $root.Values.workload.model.name }}"
+ - name: MODEL_DOWNLOAD_DIR
+ value: "/ssd/{{ $root.Values.workload.model.name }}"
+          # The TensorRT-LLM release image puts tensorrt_llm in a non-default path
+ - name: TRTLLM_DIR
+ value: "/app/tensorrt_llm"
+ {{- if $root.Values.workload.envs }}
+ {{- toYaml .Values.workload.envs | nindent 12 }}
+ {{- end }}
+
+ workingDir: /workload
+ command: ["/bin/bash", "-c"]
+ args:
+ - |
+ #!/bin/bash
+
+ if [[ -n "${NCCL_INIT_SCRIPT}" ]]; then
+ echo "Running NCCL init script: ${NCCL_INIT_SCRIPT}"
+ source ${NCCL_INIT_SCRIPT}
+ env | grep NCCL
+ fi
+
+ if [ ! -f "$LAUNCHER_SCRIPT" ]; then
+ echo "Error: Launcher script $LAUNCHER_SCRIPT not found!"
+ exit 1
+ fi
+
+ ARGS=()
+
+ if [ -f "$SERVER_ARGS_FILE" ]; then
+ echo "Loading server arguments from ConfigMap"
+ while IFS=': ' read -r key value || [ -n "$key" ]; do
+ [[ -z "$key" || "$key" == \#* ]] && continue
+ key=$(echo "$key" | xargs)
+ value=$(echo "$value" | xargs)
+
+ if [ -n "$key" ]; then
+ # Handle boolean values
+ if [[ "$value" == "true" ]]; then
+ # For true values, just add the flag without a value
+ ARGS+=("--$key")
+ elif [[ "$value" == "false" ]]; then
+ ARGS+=("--$key" "false")
+ elif [ -n "$value" ]; then
+ # For non-boolean values, add both the flag and its value
+ ARGS+=("--$key" "$value")
+ else
+ ARGS+=("--$key")
+ fi
+ fi
+ done < "$SERVER_ARGS_FILE"
+ fi
+
+ {{ if eq $root.Values.workload.framework "trtllm" }}
+ {{- range $root.Values.workload.benchmarks.experiments }}
+ echo "Running: $LAUNCHER_SCRIPT --model_name $MODEL_NAME --isl {{ .isl }} --osl {{ .osl }} --num_requests {{ .num_requests }} -- ${ARGS[@]} --max_batch_size 4096 --max_num_tokens 8192"
+ exec "$LAUNCHER_SCRIPT" --model_name $MODEL_NAME --isl {{ .isl }} --osl {{ .osl }} --num_requests {{ .num_requests }} -- "${ARGS[@]}" --max_batch_size 4096 --max_num_tokens 8192
+ {{- end }}
+ {{ else }}
+ echo "Running: $LAUNCHER_SCRIPT ${ARGS[@]}"
+ exec "$LAUNCHER_SCRIPT" "${ARGS[@]}"
+ {{- end }}
+
+ volumeMounts:
+ {{- if $root.Values.network.gibVersion }}
+ - name: gib
+ mountPath: /usr/local/gib
+ {{- end }}
+ - name: serving-configuration
+ mountPath: {{ $root.Values.workload.configPath | default "/workload/configs" }}
+ - name: serving-launcher
+ mountPath: /workload/launcher
+ - name: shared-memory
+ mountPath: /dev/shm
+ {{- range $gcs := $root.Values.volumes.gcsMounts }}
+ - name: "{{ $gcs.bucketName }}"
+ mountPath: "{{ $gcs.mountPath }}"
+ {{- end }}
+ {{- if $root.Values.volumes.ssdMountPath }}
+ - name: local-ssd
+ mountPath: "{{ $root.Values.volumes.ssdMountPath }}"
+ {{- end }}
+
+ resources:
+ requests:
+ nvidia.com/gpu: {{ $gpusPerNode }}
+ limits:
+ nvidia.com/gpu: {{ $gpusPerNode }}
\ No newline at end of file
diff --git a/src/helm-charts/a3mega/inference-templates/deployment/templates/serving-svc.yaml b/src/helm-charts/a3mega/inference-templates/deployment/templates/serving-svc.yaml
new file mode 100644
index 00000000..3d1363b9
--- /dev/null
+++ b/src/helm-charts/a3mega/inference-templates/deployment/templates/serving-svc.yaml
@@ -0,0 +1,26 @@
+# Copyright 2025 Google LLC
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+apiVersion: v1
+kind: Service
+metadata:
+ name: {{ .Release.Name }}-svc
+spec:
+ selector:
+ app: {{ .Release.Name }}-serving
+ ports:
+ - name: http
+ port: {{ .Values.service.ports.http }}
+ targetPort: {{ .Values.service.ports.http }}
+ type: {{ .Values.service.type }}
\ No newline at end of file
diff --git a/src/helm-charts/a3mega/trtllm-inference/single-node/Chart.yaml b/src/helm-charts/a3mega/trtllm-inference/single-node/Chart.yaml
new file mode 100644
index 00000000..758baf9c
--- /dev/null
+++ b/src/helm-charts/a3mega/trtllm-inference/single-node/Chart.yaml
@@ -0,0 +1,20 @@
+# Copyright 2025 Google LLC
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+apiVersion: v2
+name: trtllm-inference-single-node
+description: trtllm-inference-single-node
+type: application
+version: 0.1.0
+appVersion: "1.16.0"
diff --git a/src/helm-charts/a3mega/trtllm-inference/single-node/templates/serving-config-configmap.yaml b/src/helm-charts/a3mega/trtllm-inference/single-node/templates/serving-config-configmap.yaml
new file mode 100644
index 00000000..d8c0759f
--- /dev/null
+++ b/src/helm-charts/a3mega/trtllm-inference/single-node/templates/serving-config-configmap.yaml
@@ -0,0 +1,7 @@
+apiVersion: v1
+kind: ConfigMap
+metadata:
+ name: "{{ .Release.Name }}-config"
+data:
+ serving-configuration: |-
+{{ .Values.serving_config | indent 4 }}
diff --git a/src/helm-charts/a3mega/trtllm-inference/single-node/templates/serving-launcher-configmap.yaml b/src/helm-charts/a3mega/trtllm-inference/single-node/templates/serving-launcher-configmap.yaml
new file mode 100644
index 00000000..9415d5be
--- /dev/null
+++ b/src/helm-charts/a3mega/trtllm-inference/single-node/templates/serving-launcher-configmap.yaml
@@ -0,0 +1,13 @@
+apiVersion: v1
+kind: ConfigMap
+metadata:
+ name: "{{ .Release.Name }}-launcher"
+data:
+ launch-workload.sh: |-
+{{- if .Values.workload_launcher }}
+{{ .Values.workload_launcher | nindent 4 }}
+{{- else }}
+ #!/bin/bash
+ echo "No workload launcher specified"
+ exit 1
+{{- end }}
diff --git a/src/helm-charts/a3mega/trtllm-inference/single-node/templates/serving-launcher.yaml b/src/helm-charts/a3mega/trtllm-inference/single-node/templates/serving-launcher.yaml
new file mode 100644
index 00000000..f3055ba5
--- /dev/null
+++ b/src/helm-charts/a3mega/trtllm-inference/single-node/templates/serving-launcher.yaml
@@ -0,0 +1,280 @@
+# Copyright 2026 Google LLC
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+{{ $nodes := div .Values.workload.gpus 8 | max 1 }}
+{{ $gpusPerNode := min .Values.workload.gpus 8 }}
+
+{{ $root := . }}
+
+apiVersion: apps/v1
+kind: Deployment
+metadata:
+ name: "{{ .Release.Name }}"
+ namespace: default
+ labels:
+ app: {{ .Release.Name }}-serving
+ {{- if $root.Values.queue }}
+ kueue.x-k8s.io/queue-name: "{{ $root.Values.queue }}"
+ {{- end }}
+spec:
+ replicas: {{ $nodes }}
+ selector:
+ matchLabels:
+ app: {{ .Release.Name }}-serving
+ template:
+ metadata:
+ labels:
+ app: {{ .Release.Name }}-serving
+ annotations:
+ kubectl.kubernetes.io/default-container: serving
+ {{- if $root.Values.volumes.gcsVolumes }}
+ gke-gcsfuse/volumes: "true"
+ gke-gcsfuse/cpu-limit: "0"
+ gke-gcsfuse/memory-limit: "0"
+ gke-gcsfuse/ephemeral-storage-limit: "0"
+ {{- end }}
+ {{- if and $root.Values.queue $root.Values.dwsSettings.maxRunDurationSeconds }}
+ provreq.kueue.x-k8s.io/maxRunDurationSeconds: "{{ $root.Values.dwsSettings.maxRunDurationSeconds }}"
+ {{- end }}
+ {{- if not $root.Values.network.hostNetwork }}
+ networking.gke.io/default-interface: "eth0"
+ networking.gke.io/interfaces: |
+ {{- if $root.Values.network.subnetworks }}
+ [
+ {{- range $i, $subnetwork := $root.Values.network.subnetworks }}
+ {"interfaceName":"eth{{ $i }}","network":"{{ $subnetwork }}"}{{ eq $i 9 | ternary "" ","}}
+ {{- end }}
+ ]
+ {{- else }}
+ [
+ {"interfaceName":"eth0","network":"default"},
+ {"interfaceName":"eth1","network":"gvnic-1"},
+ {{- range $i := until 8 }}
+ {"interfaceName":"eth{{ add 2 $i }}","network":"rdma-{{ $i }}"}{{ eq $i 7 | ternary "" ","}}
+ {{- end }}
+ ]
+ {{- end }}
+ {{- end }}
+ spec:
+ {{- if $root.Values.network.hostNetwork }}
+ hostNetwork: true
+ dnsPolicy: ClusterFirstWithHostNet
+ {{- end }}
+ subdomain: "{{.Release.Name}}"
+ restartPolicy: Always
+ {{- if $root.Values.targetNodes }}
+ affinity:
+ nodeAffinity:
+ requiredDuringSchedulingIgnoredDuringExecution:
+ nodeSelectorTerms:
+ - matchExpressions:
+ - key: kubernetes.io/hostname
+ operator: In
+ values:
+ {{ range $hostname := $root.Values.targetNodes }}
+ - {{ $hostname }}
+ {{ end }}
+ {{- end }}
+ tolerations:
+ - operator: "Exists"
+ key: nvidia.com/gpu
+ - operator: "Exists"
+ key: cloud.google.com/impending-node-termination
+ volumes:
+ {{- if $root.Values.network.gibVersion }}
+ - name: gib
+ emptyDir: {}
+ {{- end }}
+ - name: serving-configuration
+ configMap:
+ name: "{{.Release.Name}}-config"
+ items:
+ - key: serving-configuration
+ path: {{ $root.Values.workload.configFile | default "serving-args.yaml" }}
+ - name: serving-launcher
+ configMap:
+ name: "{{.Release.Name}}-launcher"
+ defaultMode: 0700
+ - name: shared-memory
+ emptyDir:
+ medium: "Memory"
+ sizeLimit: 250Gi
+ {{- range $gcs := $root.Values.volumes.gcsMounts }}
+ - name: "{{ $gcs.bucketName }}"
+ csi:
+ driver: gcsfuse.csi.storage.gke.io
+ volumeAttributes:
+ bucketName: "{{ $gcs.bucketName }}"
+ {{- if $gcs.mountOptions }}
+ mountOptions: "{{ $gcs.mountOptions }}"
+ {{- end }}
+ {{- end }}
+ {{- if $root.Values.volumes.ssdMountPath }}
+ - name: local-ssd
+ hostPath:
+ path: /mnt/stateful_partition/kube-ephemeral-ssd
+ {{- end }}
+
+ initContainers:
+ {{- if $root.Values.network.gibVersion }}
+ - name: nccl-plugin-installer
+ image: {{ $root.Values.network.gibVersion }}
+ imagePullPolicy: Always
+ args:
+ - |
+ set -ex
+ /scripts/container_entry.sh install --install-nccl
+ cp -R /var/lib/gib/lib64/. /target/usr/local/gib/lib64
+ cp -R /var/lib/gib/. /target/usr/local/gib
+ command:
+ - /bin/sh
+ - -c
+ volumeMounts:
+ - mountPath: /target/usr/local/gib
+ name: gib
+ {{- end }}
+
+ containers:
+ {{- if $root.Values.workload.gcsSidecarImage }}
+ - name: gke-gcsfuse-sidecar
+ image: {{ $root.Values.workload.gcsSidecarImage }}
+ - name: gke-gcsfuse-metadata-prefetch
+ image: {{ $root.Values.workload.gcsSidecarImage }}
+ {{- end }}
+ - name: serving
+ image: "{{ $root.Values.workload.image }}"
+ imagePullPolicy: Always
+ {{- if $root.Values.network.hostNetwork }}
+ securityContext:
+ privileged: true
+ {{- end }}
+ env:
+ - name: HF_TOKEN
+ valueFrom:
+ secretKeyRef:
+ name: "{{ $root.Values.huggingface.secretName }}"
+ key: "{{ $root.Values.huggingface.secretData.token }}"
+ {{- if $root.Values.network.ncclSettings }}
+ {{- toYaml .Values.network.ncclSettings | nindent 12 }}
+ {{- end }}
+ - name: NCCL_PLUGIN_PATH
+ value: /usr/local/gib/lib64
+ - name: LD_LIBRARY_PATH
+ value: /usr/local/gib/lib64:/usr/local/nvidia/lib64
+ {{- if $root.Values.network.gibVersion }}
+ - name: NCCL_INIT_SCRIPT
+ value: "/usr/local/gib/scripts/set_nccl_env.sh"
+ {{- end }}
+ - name: MODEL_NAME
+ value: "{{ $root.Values.workload.model.name }}"
+ - name: MODEL_DOWNLOAD_DIR
+ value: "/ssd/{{ $root.Values.workload.model.name }}"
+ - name: TRTLLM_DIR
+ value: "/app/tensorrt_llm"
+ - name: LAUNCHER_SCRIPT
+ value: "/workload/launcher/launch-workload.sh"
+ - name: SERVER_ARGS_FILE
+ value: "/workload/configs/serving-args.yaml"
+ - name: NCCL_PROFILER_PLUGIN
+ value: "none"
+ - name: NCCL_TUNER_PLUGIN
+ value: "none"
+ - name: TRTLLM_DISABLE_AUTOTUNER
+ value: "1"
+ - name: TRTLLM_CUSTOM_ALLREDUCE
+ value: "0"
+
+
+ {{- if $root.Values.workload.envs }}
+ {{- toYaml .Values.workload.envs | nindent 12 }}
+ {{- end }}
+
+ workingDir: /workload
+ command: ["/bin/bash", "-c"]
+ args:
+ - |
+ #!/bin/bash
+
+ if [[ -n "${NCCL_INIT_SCRIPT}" ]]; then
+ echo "Running NCCL init script: ${NCCL_INIT_SCRIPT}"
+ source ${NCCL_INIT_SCRIPT}
+ env | grep NCCL
+ ldconfig
+ fi
+
+ if [ ! -f "$LAUNCHER_SCRIPT" ]; then
+ echo "Error: Launcher script $LAUNCHER_SCRIPT not found!"
+ exit 1
+ fi
+
+ ARGS=()
+
+ if [ -f "$SERVER_ARGS_FILE" ]; then
+ echo "Loading server arguments from ConfigMap"
+ while IFS=': ' read -r key value || [ -n "$key" ]; do
+ [[ -z "$key" || "$key" == \#* ]] && continue
+ key=$(echo "$key" | xargs)
+ value=$(echo "$value" | xargs)
+
+ if [ -n "$key" ]; then
+ if [[ "$value" == "true" ]]; then
+ ARGS+=("--$key")
+ elif [[ "$value" == "false" ]]; then
+ ARGS+=("--$key" "false")
+ elif [ -n "$value" ]; then
+ ARGS+=("--$key" "$value")
+ else
+ ARGS+=("--$key")
+ fi
+ fi
+ done < "$SERVER_ARGS_FILE"
+ fi
+
+ {{ if eq $root.Values.workload.framework "trtllm" }}
+ {{- range $root.Values.workload.benchmarks.experiments }}
+ echo "Running: $LAUNCHER_SCRIPT --model_name $MODEL_NAME --isl {{ .isl }} --osl {{ .osl }} --num_requests {{ .num_requests }} -- ${ARGS[@]}"
+ exec "$LAUNCHER_SCRIPT" --model_name $MODEL_NAME --isl {{ .isl }} --osl {{ .osl }} --num_requests {{ .num_requests }} -- "${ARGS[@]}"
+ {{- end }}
+ {{ else }}
+ echo "Running: $LAUNCHER_SCRIPT ${ARGS[@]}"
+ exec "$LAUNCHER_SCRIPT" "${ARGS[@]}"
+ {{- end }}
+
+ volumeMounts:
+ {{- if $root.Values.network.gibVersion }}
+ - name: gib
+ mountPath: /usr/local/gib
+ {{- end }}
+ - name: serving-configuration
+ mountPath: {{ $root.Values.workload.configPath | default "/workload/configs" }}
+ - name: serving-launcher
+ mountPath: /workload/launcher
+ - name: shared-memory
+ mountPath: /dev/shm
+ {{- range $gcs := $root.Values.volumes.gcsMounts }}
+ - name: "{{ $gcs.bucketName }}"
+ mountPath: "{{ $gcs.mountPath }}"
+ {{- end }}
+ {{- if $root.Values.volumes.ssdMountPath }}
+ - name: local-ssd
+ mountPath: "{{ $root.Values.volumes.ssdMountPath }}"
+ {{- end }}
+
+ resources:
+ requests:
+ nvidia.com/gpu: {{ $gpusPerNode }}
+ memory: "500Gi"
+ limits:
+ nvidia.com/gpu: {{ $gpusPerNode }}
+ memory: "500Gi"
diff --git a/src/helm-charts/a3mega/trtllm-inference/single-node/templates/serving-svc.yaml b/src/helm-charts/a3mega/trtllm-inference/single-node/templates/serving-svc.yaml
new file mode 100644
index 00000000..c1eeb742
--- /dev/null
+++ b/src/helm-charts/a3mega/trtllm-inference/single-node/templates/serving-svc.yaml
@@ -0,0 +1,26 @@
+# Copyright 2026 Google LLC
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+apiVersion: v1
+kind: Service
+metadata:
+ name: {{ .Release.Name }}-svc
+spec:
+ selector:
+ app: {{ .Release.Name }}-serving
+ ports:
+ - name: http
+ port: {{ .Values.service.ports.http }}
+ targetPort: {{ .Values.service.ports.http }}
+ type: {{ .Values.service.type }}
diff --git a/src/launchers/trtllm-launcher.sh b/src/launchers/trtllm-launcher.sh
old mode 100644
new mode 100755
index dc3f828b..7659dddb
--- a/src/launchers/trtllm-launcher.sh
+++ b/src/launchers/trtllm-launcher.sh
@@ -1,4 +1,4 @@
-# Copyright 2025 Google LLC
+# Copyright 2026 Google LLC
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
@@ -124,6 +124,7 @@ parse_serving_config() {
streaming=${SERVING_CONFIG_DICT["streaming"]:="false"}
max_input_len=${SERVING_CONFIG_DICT["max_input_len"]:=""}
max_batch_size=${SERVING_CONFIG_DICT["max_batch_size"]:=""}
+ max_num_tokens=${SERVING_CONFIG_DICT["max_num_tokens"]:=""}
custom_dataset=${SERVING_CONFIG_DICT["dataset"]:=""}
}
@@ -145,6 +146,7 @@ print_configuration() {
echo "streaming: $streaming"
echo "max input length: $max_input_len"
echo "max batch size: $max_batch_size"
+ echo "max num tokens: $max_num_tokens"
echo "kv_cache_free_gpu_mem_fraction: $kv_cache_free_gpu_mem_fraction"
echo "--------------------------------"
}
@@ -174,6 +176,7 @@ run_benchmark() {
if [ "$streaming" == "true" ]; then vl_args="$vl_args --streaming"; fi
if [ -n "$max_input_len" ]; then vl_args="$vl_args --max_input_len $max_input_len"; fi
if [ -n "$max_batch_size" ]; then vl_args="$vl_args --max_batch_size $max_batch_size"; fi
+ if [ -n "$max_num_tokens" ]; then vl_args="$vl_args --max_num_tokens $max_num_tokens"; fi
dataset_file=$custom_dataset
# If custom_dataset is not set, generate a textual dataset with tokens sampled in normal distribution
@@ -198,14 +201,13 @@ run_benchmark() {
fi
export TOKENIZERS_PARALLELISM=false
- echo "enable_cuda_graph: false" > /tmp/extra_llm_api_args.yaml
if [[ $backend == "pytorch" ]]; then
echo "Running throughput benchmark"
export NCCL_P2P_LEVEL=PHB
trtllm-bench \
--model $model_name \
- --model_path /ssd/${model_name} throughput \
+ --model_path /ssd/${model_name} throughput --tp $tp_size --pp $pp_size \
--dataset $dataset_file \
--num_requests $num_requests \
--tp $tp_size \
@@ -231,10 +233,10 @@ run_benchmark() {
echo "Running throughput benchmark"
trtllm-bench \
--model $model_name \
- --model_path /ssd/${model_name} throughput \
+ --model_path /ssd/${model_name} throughput --tp $tp_size --pp $pp_size \
--dataset $dataset_file \
--engine_dir $engine_dir \
- --kv_cache_free_gpu_mem_fraction $kv_cache_free_gpu_mem_fraction $extra_args >$output_file
+ --kv_cache_free_gpu_mem_fraction $kv_cache_free_gpu_mem_fraction $extra_args $vl_args >$output_file
fi
cat $output_file
@@ -263,7 +265,8 @@ main() {
# Set environment variables
export HF_HOME=/ssd
-export LD_LIBRARY_PATH=/usr/local/lib/python3.12/dist-packages/torch/lib:/usr/local/lib/python3.12/dist-packages/torch_tensorrt/lib:/usr/local/cuda/compat/lib:/usr/local/nvidia/lib:/usr/local/nvidia/lib64:/usr/local/cuda/lib64:/usr/local/tensorrt/lib
+export LD_LIBRARY_PATH=/usr/local/gib/lib64:/usr/local/lib/python3.12/dist-packages/torch/lib:/usr/local/lib/python3.12/dist-packages/torch_tensorrt/lib:/usr/local/cuda/compat/lib:/usr/local/nvidia/lib:/usr/local/nvidia/lib64:/usr/local/cuda/lib64:/usr/local/tensorrt/lib
# Run the main function
-main "$@"
\ No newline at end of file
+main "$@"
+