AI-Hypercomputer · Priya-Quad · May 5, 2026 · May 6, 2026 · May 6, 2026 · May 7, 2026
diff --git a/inference/a3mega/llama3.1-70b/trtllm-gke/README.md b/inference/a3mega/llama3.1-70b/trtllm-gke/README.md
@@ -0,0 +1,386 @@
+# Single Host Model Serving with NVIDIA TensorRT-LLM (TRT-LLM) on A3mega GKE Node Pool
+
+This document outlines the steps to serve and benchmark various Large Language Models (LLMs) using the [NVIDIA TensorRT-LLM](https://github.com/NVIDIA/TensorRT-LLM) framework on a single [A3-Mega GKE Node pool](https://cloud.google.com/kubernetes-engine).
+
+This guide walks you through setting up the necessary cloud infrastructure, configuring your environment, and deploying a high-performance LLM for inference.
+
+<a name="table-of-contents"></a>
+## Table of Contents
+
+* [1. Test Environment](#test-environment)
+* [2. High-Level Flow](#architecture)
+* [3. Environment Setup (One-Time)](#environment-setup)
+  * [3.1. Clone the Repository](#clone-repo)
+  * [3.2. Configure Environment Variables](#configure-vars)
+  * [3.3. Connect to your GKE Cluster](#connect-cluster)
+  * [3.4. Get Hugging Face Token](#get-hf-token)
+  * [3.5. Create Hugging Face Kubernetes Secret](#setup-hf-secret)
+* [4. Run the Recipe](#run-the-recipe)
+  * [4.1. Supported Models](#supported-models)
+  * [4.2. Deploy and Benchmark a Model](#deploy-model)
+* [5. Monitoring and Troubleshooting](#monitoring)
+  * [5.1. Check Deployment Status](#check-status)
+  * [5.2. View Logs](#view-logs)
+* [6. Cleanup](#cleanup)
+
+<a name="test-environment"></a>
+## 1. Test Environment
+
+[Back to Top](#table-of-contents)
+
+The recipe uses the following setup:
+
+* **Orchestration**: [Google Kubernetes Engine (GKE)](https://cloud.google.com/kubernetes-engine)
+* **Deployment Configuration**: A [Helm chart](https://helm.sh/) is used to configure and deploy a [Kubernetes Deployment](https://kubernetes.io/docs/concepts/workloads/controllers/deployment/). This deployment encapsulates the inference of the target LLM using the TensorRT-LLM framework.
+
+This recipe has been optimized for and tested with the following configuration:
+
+* **GKE Cluster**:
+    * A [regional standard cluster](https://cloud.google.com/kubernetes-engine/docs/concepts/configuration-overview) running version `1.33.4-gke.1036000` or later.
+    * A GPU node pool with 1 [a3-megagpu-8g](https://docs.cloud.google.com/compute/docs/accelerator-optimized-machines#a3-mega-vms) machine.
+    * [Workload Identity Federation for GKE](https://cloud.google.com/kubernetes-engine/docs/concepts/workload-identity) enabled.
+    * [Cloud Storage FUSE CSI driver for GKE](https://cloud.google.com/kubernetes-engine/docs/concepts/cloud-storage-fuse-csi-driver) enabled.
+    * [DCGM metrics](https://cloud.google.com/kubernetes-engine/docs/how-to/dcgm-metrics) enabled.
+    * [Kueue](https://kueue.sigs.k8s.io/docs/reference/kueue.v1beta1/) and [JobSet](https://jobset.sigs.k8s.io/docs/overview/) APIs installed.
+    * Kueue configured to support [Topology Aware Scheduling](https://kueue.sigs.k8s.io/docs/concepts/topology_aware_scheduling/).
+* A regional Google Cloud Storage (GCS) bucket to store logs generated by the recipe runs.
+
+> [!IMPORTANT]
+> To prepare the required environment, see the [GKE environment setup guide](../../../../docs/configuring-environment-gke-a3-mega.md).
+> Provisioning a new GKE cluster is a long-running operation and can take **20-30 minutes**.
+
+<a name="architecture"></a>
+## 2. High-Level Flow
+
+[Back to Top](#table-of-contents)
+
+Here is a simplified diagram of the flow that we follow in this recipe:
+
+```mermaid
+---
+config:
+  layout: dagre
+---
+flowchart TD
+ subgraph workstation["Client Workstation"]
+    T["Cluster Toolkit"]
+    B("Kubernetes API")
+    A["helm install"]
+  end
+ subgraph huggingface["Hugging Face Hub"]
+    I["Model Weights"]
+  end
+ subgraph gke["GKE Cluster (A3-Mega)"]
+    C["Deployment"]
+    D["Pod"]
+    E["TensorRT-LLM container"]
+    F["Service"]
+  end
+ subgraph storage["Cloud Storage"]
+    J["Bucket"]
+  end
+
+    %% Logical/actual flow
+    T -- Create Cluster --> gke
+    A --> B
+    B --> C & F
+    C --> D
+    D --> E
+    F --> C
+    E -- Downloads at runtime --> I
+    E -- Write logs --> J
+
+
+    %% Layout control
+    gke
+```
+
+* **helm:** A package manager for Kubernetes to define, install, and upgrade applications. It's used here to configure and deploy the Kubernetes Deployment.
+* **Deployment:** Manages the lifecycle of your model server pod, ensuring it stays running.
+* **Service:** Provides a stable network endpoint (a DNS name and IP address) to access your model server.
+* **Pod:** The smallest deployable unit in Kubernetes. The TensorRT-LLM container runs inside this pod on a GPU-enabled node.
+* **Cloud Storage:** A Cloud Storage bucket to store benchmark logs and other artifacts.
+
+<a name="environment-setup"></a>
+## 3. Environment Setup (One-Time)
+
+[Back to Top](#table-of-contents)
+
+First, you'll configure your local environment. These steps are required once before you can deploy any models.
+
+<a name="clone-repo"></a>
+### 3.1. Clone the Repository
+
+```bash
+git clone https://github.com/ai-hypercomputer/gpu-recipes.git
+cd gpu-recipes
+export REPO_ROOT=$(pwd)
+export RECIPE_ROOT=$REPO_ROOT/inference/a3mega/llama3.1-70b/trtllm-gke
+```
+
+<a name="configure-vars"></a>
+### 3.2. Configure Environment Variables
+
+This is the most critical step. These variables are used in subsequent commands to target the correct resources.
+
+```bash
+export PROJECT_ID=<PROJECT_ID>
+export CLUSTER_REGION=<REGION_of_your_cluster>
+export CLUSTER_NAME=<YOUR_GKE_CLUSTER_NAME>
+export KUEUE_NAME=<YOUR_KUEUE_NAME>
+export GCS_BUCKET=<your-gcs-bucket-for-logs>
+export TRTLLM_VERSION=1.3.0rc3
+
+# Set the project for gcloud commands
+gcloud config set project $PROJECT_ID
+```
+
+Replace the following values:
+
+| Variable              | Description                                                                                             | Example                                                 |
+| --------------------- | ------------------------------------------------------------------------------------------------------- | ------------------------------------------------------- |
+| `PROJECT_ID` | Your Google Cloud Project ID. | `gcp-project-12345` |
+| `CLUSTER_REGION` | The GCP region where your GKE cluster is located. | `us-central1` |
+| `CLUSTER_NAME` | The name of your GKE cluster. | `a3-mega` |
+| `KUEUE_NAME` | The name of the Kueue local queue. The default queue created by the cluster toolkit is `a3mega`. Verify the name in your cluster. | `a3mega` |
+| `ARTIFACT_REGISTRY` | Full path to your Artifact Registry repository. | `us-central1-docker.pkg.dev/gcp-project-12345/my-repo` |
+| `GCS_BUCKET` | Name of your GCS bucket (do not include `gs://`). | `my-benchmark-logs-bucket` |
+| `TRTLLM_VERSION` | The tag/version for the Docker image. Other versions can be found in the [NGC catalog](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/tensorrt-llm/containers/release). | `1.3.0rc3` |
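+
+A quick sanity check can catch an unset variable before it produces a malformed command later on. The sketch below relies only on the shell's `${VAR:?message}` expansion, which stops with the given message when the variable is unset or empty. The messages themselves are illustrative.
+
+```bash
+# Each line fails with an error message if the named variable is unset or empty.
+: "${PROJECT_ID:?PROJECT_ID is not set}"
+: "${CLUSTER_REGION:?CLUSTER_REGION is not set}"
+: "${CLUSTER_NAME:?CLUSTER_NAME is not set}"
+: "${KUEUE_NAME:?KUEUE_NAME is not set}"
+: "${GCS_BUCKET:?GCS_BUCKET is not set}"
+: "${TRTLLM_VERSION:?TRTLLM_VERSION is not set}"
+
+# Print the resolved values for a final visual check.
+echo "project=${PROJECT_ID} region=${CLUSTER_REGION} cluster=${CLUSTER_NAME}"
+echo "queue=${KUEUE_NAME} bucket=${GCS_BUCKET} trtllm=${TRTLLM_VERSION}"
+```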
+
+
+<a name="connect-cluster"></a>
+### 3.3. Connect to your GKE Cluster
+
+Fetch credentials for `kubectl` to communicate with your cluster.
+
+```bash
+gcloud container clusters get-credentials $CLUSTER_NAME --region $CLUSTER_REGION
+```
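+
+To confirm that `kubectl` now points at the intended cluster, it may help to inspect the active context, the nodes, and the Kueue queues. This is a minimal sketch that assumes Kueue is installed as described in the [test environment](#test-environment). The node labels and queue names you see depend on how your cluster was provisioned.
+
+```bash
+# Show which cluster kubectl is currently targeting.
+kubectl config current-context
+
+# List nodes with their accelerator label as an extra column (empty on non-GPU nodes).
+kubectl get nodes -L cloud.google.com/gke-accelerator
+
+# List Kueue local queues in the current namespace; KUEUE_NAME should match one of them.
+kubectl get localqueues
+```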
+
+<a name="get-hf-token"></a>
+### 3.4. Get Hugging Face Token
+
+To access models through Hugging Face, you'll need a Hugging Face token.
+  1.  Create a [Hugging Face account](https://huggingface.co/) if you don't have one.
+  2.  For **gated models** such as Llama 3.1 70B, ensure you have requested and been granted access on Hugging Face before proceeding.
+  3.  Generate an Access Token: Go to **Your Profile > Settings > Access Tokens**.
+  4.  Select **New Token**.
+  5.  Specify a Name and a Role of at least `Read`.
+  6.  Select **Generate a token**.
+  7.  Copy the generated token to your clipboard. You'll use this later.
+
+
+<a name="setup-hf-secret"></a>
+### 3.5. Create Hugging Face Kubernetes Secret
+
+Create a Kubernetes Secret with your Hugging Face token to enable the pod to download model checkpoints from Hugging Face.
+
+```bash
+# Paste your Hugging Face token here
+export HF_TOKEN=<YOUR_HUGGINGFACE_TOKEN>
+
+kubectl create secret generic hf-secret \
+--from-literal=hf_api_token=${HF_TOKEN} \
+--dry-run=client -o yaml | kubectl apply -f -
+```
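+
+To verify that the Secret exists without printing the token, the following sketch lists the Secret and its key names. `kubectl describe` shows only the size of each value, not its contents.
+
+```bash
+# Confirm the Secret was created in the current namespace.
+kubectl get secret hf-secret
+
+# Show key names and value sizes only; the token itself is not displayed.
+kubectl describe secret hf-secret
+
+# Optionally remove the token from the current shell session once the Secret exists.
+unset HF_TOKEN
+```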
+
+<a name="run-the-recipe"></a>
+## 4. Run the Recipe
+
+[Back to Top](#table-of-contents)
+
+> [!NOTE]
+> After running the recipe with `helm install`, it can take **up to 30 minutes** for the deployment to become fully available. This is because the GKE node must first pull the Docker image and then download the model weights from Hugging Face.
+
+> [!TIP]
+> You can use the [NVIDIA Model Optimizer](https://github.com/NVIDIA/Model-Optimizer/tree/main/examples/llm_ptq) to quantize these models to FP8 for improved performance.
+
+<a name="supported-models"></a>
+### 4.1. Supported Models
+
+[Back to Top](#table-of-contents)
+
+This recipe supports the deployment of the following models:
+
+| Model Name | Hugging Face ID | Configuration File | Release Name Suffix |
+| :--- | :--- | :--- | :--- |
+| **Llama 3.1 70B** | `meta-llama/Llama-3.1-70B-Instruct` | `llama-3.1-70b.yaml` | `llama-3-1-70b` |
+
+<a name="deploy-model"></a>
+### 4.2. Deploy and Benchmark a Model
+
+[Back to Top](#table-of-contents)
+
+The recipe uses [`trtllm-bench`](https://github.com/NVIDIA/TensorRT-LLM/blob/main/docs/source/legacy/performance/perf-benchmarking.md), a command-line tool from NVIDIA for benchmarking the performance of TensorRT-LLM engines.
+
+1.  **Configure model-specific variables.** Choose a model from the [table above](#supported-models) and set the variables:
+
+    ```bash
+    # Example for Llama 3.1 70B
+    export HF_MODEL_ID="meta-llama/Llama-3.1-70B-Instruct"
+    export CONFIG_FILE="llama-3.1-70b.yaml"
+    export RELEASE_NAME="$USER-serving-llama-3-1-70b"
+    ```
+
+2.  **Install the helm chart:**
+
+    ```bash
+    cd $RECIPE_ROOT
+    helm install -f values.yaml \
+    --set-file workload_launcher=$REPO_ROOT/src/launchers/trtllm-launcher.sh \
+    --set-file serving_config=$REPO_ROOT/src/frameworks/a3mega/trtllm-configs/${CONFIG_FILE} \
+    --set queue=${KUEUE_NAME} \
+    --set "volumes.gcsMounts[0].bucketName=${GCS_BUCKET}" \
+    --set workload.model.name=${HF_MODEL_ID} \
+    --set workload.image=nvcr.io/nvidia/tensorrt-llm/release:${TRTLLM_VERSION} \
+    --set workload.framework=trtllm \
+    ${RELEASE_NAME} \
+    $REPO_ROOT/src/helm-charts/a3mega/trtllm-inference/single-node
+    ```
+
+3.  **Check the deployment status:**
+
+    ```bash
+    kubectl get deployment/${RELEASE_NAME}
+    ```
+
+    Wait until the `READY` column shows `1/1`. See the [Monitoring and Troubleshooting](#monitoring) section to view the deployment logs.
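+
+As an alternative to polling `kubectl get`, the commands below block until the rollout finishes or a timeout expires. This is a sketch that assumes `RELEASE_NAME` is still set from step 1 and that the Deployment shares the release name, as in step 3. The 30-minute timeout simply mirrors the startup estimate given at the start of this section.
+
+```bash
+# Summarize the state of the Helm release.
+helm status "${RELEASE_NAME}"
+
+# Show the user-supplied values (from -f and --set) used for this release.
+helm get values "${RELEASE_NAME}"
+
+# Wait for the Deployment to become available, giving up after 30 minutes.
+kubectl rollout status "deployment/${RELEASE_NAME}" --timeout=30m
+```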
+
+<a name="monitoring"></a>
+## 5. Monitoring and Troubleshooting
+
+[Back to Top](#table-of-contents)
+
+After the model is deployed with Helm as described [above](#run-the-recipe), use the following steps to monitor the deployment. The examples use the Deployment and Service names from the Llama 3.1 70B deployment instructions (`$USER-serving-llama-3-1-70b` and `$USER-serving-llama-3-1-70b-svc`). Substitute your own names if you used a different release name.
+
+<a name="check-status"></a>
+### 5.1. Check Deployment Status
+
+Check the status of your deployment. Replace the name if you deployed a different model.
+
+```bash
+# Example for Llama 3.1 70B
+kubectl get deployment/$USER-serving-llama-3-1-70b
+```
+
+Wait until the `READY` column shows `1/1`. If it shows `0/1`, the pod is still starting up.
+
+> [!NOTE]
+> In the GKE UI on Cloud Console, you might see a status of "Does not have minimum availability" during startup. This is normal and will resolve once the pod is ready.
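+
+If the Deployment stays at `0/1` for longer than expected, describing the Deployment and reviewing recent events often points to the cause, such as an image pull failure or an unschedulable pod. The sketch below assumes `RELEASE_NAME` is still set from [section 4.2](#deploy-model).
+
+```bash
+# Show conditions and recent events for the Deployment.
+kubectl describe "deployment/${RELEASE_NAME}"
+
+# List pods together with the node each one is scheduled on.
+kubectl get pods -o wide
+
+# List events in the current namespace, oldest first.
+kubectl get events --sort-by=.metadata.creationTimestamp
+```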
+
+<a name="view-logs"></a>
+### 5.2. View Logs
+
+To see the logs from the TensorRT-LLM container (useful for debugging), use the `-f` flag to follow the log stream:
+
+```bash
+kubectl logs -f deployment/$USER-serving-llama-3-1-70b
+```
+
+You should see logs showing the model being prepared, followed by the throughput benchmark run, similar to this:
+
+```text
+Running benchmark for nvidia/Llama3.1-70b with ISL=128, OSL=128, TP=8, EP=1, PP=1
+
+===========================================================
+= PYTORCH BACKEND
+===========================================================
+Model:			nvidia/Llama3.1-70b
+Model Path:		/ssd/nvidia/Llama3.1-70b
+TensorRT LLM Version:	1.2
+Dtype:			bfloat16
+KV Cache Dtype:		FP8
+Quantization:		FP8
+
+===========================================================
+= REQUEST DETAILS 
+===========================================================
+Number of requests:             1000
+Number of concurrent requests:  985.9849
+Average Input Length (tokens):  128.0000
+Average Output Length (tokens): 128.0000
+===========================================================
+= WORLD + RUNTIME INFORMATION 
+===========================================================
+TP Size:                8
+PP Size:                1
+EP Size:                1
+Max Runtime Batch Size: 2304
+Max Runtime Tokens:     4608
+Scheduling Policy:      GUARANTEED_NO_EVICT
+KV Memory Percentage:   85.00%
+Issue Rate (req/sec):   8.3913E+13
+
+===========================================================
+= PERFORMANCE OVERVIEW 
+===========================================================
+Request Throughput (req/sec):                     X.XX
+Total Output Throughput (tokens/sec):             X.XX
+Total Token Throughput (tokens/sec):              X.XX
+Total Latency (ms):                               X.XX
+Average request latency (ms):                     X.XX
+Per User Output Throughput [w/ ctx] (tps/user):   X.XX
+Per GPU Output Throughput (tps/gpu):              X.XX
+
+-- Request Latency Breakdown (ms) -----------------------
+
+[Latency] P50    : X.XX
+[Latency] P90    : X.XX
+[Latency] P95    : X.XX
+[Latency] P99    : X.XX
+[Latency] MINIMUM: X.XX
+[Latency] MAXIMUM: X.XX
+[Latency] AVERAGE: X.XX
+
+===========================================================
+= DATASET DETAILS
+===========================================================
+Dataset Path:         /ssd/token-norm-dist_llama3.1-70b_128_128_tp4.json
+Number of Sequences:  1000
+
+-- Percentiles statistics ---------------------------------
+
+        Input              Output           Seq. Length
+-----------------------------------------------------------
+MIN:   128.0000           128.0000           256.0000
+MAX:   128.0000           128.0000           256.0000
+AVG:   128.0000           128.0000           256.0000
+P50:   128.0000           128.0000           256.0000
+P90:   128.0000           128.0000           256.0000
+P95:   128.0000           128.0000           256.0000
+P99:   128.0000           128.0000           256.0000
+===========================================================
+```
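+
+Because the benchmark output is long, it can be convenient to save the log and extract only the summary. In this sketch the local file name `trtllm-bench.log` is arbitrary, `RELEASE_NAME` is assumed to be set as in [section 4.2](#deploy-model), and the `grep` pattern assumes the section header shown in the sample output above.
+
+```bash
+# Save the current log output to a local file (no -f, so the command returns).
+kubectl logs "deployment/${RELEASE_NAME}" > trtllm-bench.log
+
+# Print the performance summary: the header line plus the 8 lines that follow it.
+grep -A 8 "PERFORMANCE OVERVIEW" trtllm-bench.log
+```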
+
+<a name="cleanup"></a>
+## 6. Cleanup
+
+To avoid incurring further charges, clean up the resources you created.
+
+1.  **Uninstall the Helm Release:**
+
+    First, list your releases to get the deployed models:
+
+    ```bash
+    # list deployed models
+    helm list --filter $USER-serving-
+    ```
+
+    Then, uninstall the desired release:
+
+    ```bash
+    # uninstall the deployed model
+    helm uninstall <release_name>
+    ```
+    Replace `<release_name>` with a Helm release name from the list.
+
+2.  **Delete the Kubernetes Secret:**
+
+    ```bash
+    kubectl delete secret hf-secret --ignore-not-found=true
+    ```
+
+3.  (Optional) Delete the built Docker image from Artifact Registry if no longer needed.
+4.  (Optional) Delete Cloud Build logs.
+5.  (Optional) Clean up files in your GCS bucket if benchmarking was performed.
+6.  (Optional) Delete the provisioned [test environment](#test-environment), including the GKE cluster.
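+
+After cleanup, a few read-only commands can confirm what remains. This sketch assumes the environment variables from [section 3.2](#configure-vars) are still set. It only lists resources and deletes nothing.
+
+```bash
+# Should show only the header row once all recipe releases are uninstalled.
+helm list --filter "$USER-serving-"
+
+# The recipe's Deployment and Service should no longer appear in this list.
+kubectl get deployments,services
+
+# List what the recipe runs left in the log bucket before deciding what to delete.
+gcloud storage ls "gs://${GCS_BUCKET}"
+```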