diff --git a/tutorials/README.md b/tutorials/README.md
index 8a1fd9f..dfe2c57 100644
--- a/tutorials/README.md
+++ b/tutorials/README.md
@@ -16,6 +16,7 @@ Step-by-step walkthroughs covering adapter invocation, pipeline construction, an
 | [03_03_govt_rag_pipeline_loops.ipynb](notebooks/03_03_govt_rag_pipeline_loops.ipynb) | Complex RAG pipeline with retry loops for scope and answerability | 30 min | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/generative-computing/granite-switch/blob/main/tutorials/notebooks/03_03_govt_rag_pipeline_loops.ipynb) |
 | [04_compose_granite_switch.ipynb](notebooks/04_compose_granite_switch.ipynb) | Compose a checkpoint from adapter libraries | 15 min | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/generative-computing/granite-switch/blob/main/tutorials/notebooks/04_compose_granite_switch.ipynb) |
 | [05_alora_vs_lora_race.ipynb](notebooks/05_alora_vs_lora_race.ipynb) | ALORA vs LoRA race: side-by-side throughput comparison on a multi-step RAG pipeline | 20 min | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/generative-computing/granite-switch/blob/main/tutorials/notebooks/05_alora_vs_lora_race.ipynb) |
+| [06_granite_speech_demo.ipynb](notebooks/06_granite_speech_demo.ipynb) | Real-time voice assistant: Granite Speech STT + Granite Switch LLM + Granite Libraries validation, orchestrated by Mellea over WebRTC | 10 min | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/generative-computing/granite-switch/blob/main/tutorials/notebooks/06_granite_speech_demo.ipynb) |
 
 ## Guides
 
diff --git a/tutorials/notebooks/06_granite_speech_demo.ipynb b/tutorials/notebooks/06_granite_speech_demo.ipynb
new file mode 100644
index 0000000..990f816
--- /dev/null
+++ b/tutorials/notebooks/06_granite_speech_demo.ipynb
@@ -0,0 +1,330 @@
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": "# Granite Speech Demo — full stack in Colab\n\nSpin up a real-time, validated voice assistant powered by IBM Granite 4.1 — entirely inside a Colab notebook. One cell brings up both vLLM model servers (Granite Speech 4.1 STT + Granite Switch 4.1 LLM), the Pipecat backend, and the Next.js frontend, then prints a public URL you open in your browser to start talking.\n\n**Browser mic → WebRTC → Granite Speech STT → Mellea/Granite Switch LLM → Kokoro TTS → browser speaker.**\n\nThis notebook is a runnable companion to the [granite-speech-demo](https://github.com/generative-computing/mellea-demos/tree/main/2026-granite-speech) reference implementation.\n\n## What this demo is\n\nOne WebRTC conversation in which every layer of the Granite 4.1 release does something load-bearing: **Granite Speech 4.1** transcribes the audio (with keyword biasing for terms like \"Granite\" and \"Mellea\"); **Granite Switch 4.1** answers, hot-swapping LoRA adapters from inside a single checkpoint via control tokens; the **Granite Libraries** — twelve task-specific adapters spanning Core (explainability and validation), RAG, and Guardian (safety) — score and shape each response, with this demo using `requirement_check` to validate candidates against plain-English requirements (\"no markdown\", \"natural spoken cadence\", \"relevant to IBM\", \"no code\"); **Mellea** orchestrates the turn with its Instruct-Validate-Repair pattern, generating Best-of-N candidates in parallel and only sending one that passes every check to TTS. Validation is on by default, with a UI toggle for plain streaming if you want to feel the latency difference.\n\n## Prerequisites\n\n- **GPU runtime: A100 (Colab Pro) recommended.** L4 works. T4 will OOM — both Granite models won't fit.\n- **HuggingFace read token.** Free; create one at https://huggingface.co/settings/tokens. Add it as a Colab Secret named `HF_TOKEN` (sidebar → 🔑 → New secret). 
Used for two things: downloading the Granite model weights, *and* minting per-session WebRTC TURN credentials so audio reaches your browser.\n- **Browser:** Chrome, Edge, or Firefox. Safari may behave oddly with WebRTC.\n\n## How long this takes\n\n- **First run on a fresh runtime: ~8–10 min** (model downloads dominate).\n- **Subsequent runs with weights cached: ~3 min.**\n\n## What to do\n\n1. Set the `HF_TOKEN` Colab Secret.\n2. Switch the runtime to a GPU (Runtime → Change runtime type → A100/L4).\n3. **Runtime → Run all.**\n4. When the last cell prints a `*.trycloudflare.com` URL, open it, allow mic access, and start talking.\n\nIf anything goes wrong, scroll to the bottom — there's a troubleshooting section and a kill-switch cell."
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Cell 2 — Install dependencies (~3 min)\n",
+    "\n",
+    "Clones the repo, installs Python deps via `uv`, installs frontend deps via `npm`, and downloads the `cloudflared` binary used for the public tunnel."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": "import subprocess, os, shutil\n\ndef sh(cmd, **kwargs):\n    print(f\"\\n$ {cmd}\")\n    subprocess.run(cmd, shell=True, check=True, **kwargs)\n\n# Re-runnable: nuke any stale clones so we don't trip on existing dirs.\nshutil.rmtree(\"mellea-demos\", ignore_errors=True)\nshutil.rmtree(\"/tmp/granite-switch\", ignore_errors=True)\n\n# Colab's default Ubuntu repo has Node 12, which is too old for Next.js\n# (chokes on optional-chaining). Install Node 20 from NodeSource instead.\nsh(\"curl -fsSL https://deb.nodesource.com/setup_20.x | bash -\")\nsh(\"apt-get -qq install -y nodejs\")\n\nsh(\"git clone https://github.com/generative-computing/mellea-demos\")\nos.chdir(\"mellea-demos/2026-granite-speech\")\nprint(\"cwd:\", os.getcwd())\n\nsh(\"pip install -q uv\")\nsh(\"uv sync\")\n\n# IMPORTANT: uv sync creates .venv, but `uv pip install` by default targets\n# the system Python. Pin every subsequent install to the project venv.\nVENV_PY = os.path.abspath(\".venv/bin/python\")\nassert os.path.exists(VENV_PY), f\"venv missing: {VENV_PY}\"\n\n# The install order below is load-bearing. Each step's pins can override\n# the previous step's resolution; the final order leaves us with:\n#   - mellea 0.5.0 (provides register_embedded_adapter_model, missing in 0.4.2)\n#   - vllm 0.19.x with audio deps (Granite Speech needs librosa + soundfile)\n#   - granite_switch model architecture registered\n#   - transformers 5.5.1 (older versions truncate the requirement_check JSON;\n#     newer versions might or might not, so pin exactly what we tested)\n\n# 1. mellea 0.5.0 (0.4.2 release lacks APIs the demo uses)\nsh(f\"uv pip install --python {VENV_PY} 'mellea[all]==0.5.0'\")\n\n# 2. 
vllm + the right transformers floor + granite_switch model registration.\n#    The granite-switch repo's [vllm] extra pins vllm >=0.19.1,<0.20.0 and\n#    transformers >=5.5.1 — installing plain `pip install vllm` gives 0.21.0\n#    with an older transformers, which fails to recognize the architecture.\nsh(\"git clone https://github.com/generative-computing/granite-switch /tmp/granite-switch\")\nassert os.path.exists(\"/tmp/granite-switch/pyproject.toml\"), \"granite-switch clone failed\"\nsh(f\"uv pip install --python {VENV_PY} -e '/tmp/granite-switch[vllm]'\")\n\n# 3. vllm audio deps. We install librosa + soundfile directly instead of\n#    relying on `vllm[audio]` — uv sees vllm as already satisfied from step 2\n#    and skips re-resolving the [audio] extras, leaving librosa missing.\n#    Without these, /v1/chat/completions returns 500 with\n#    'Please install vllm[audio] for audio support' on any audio input.\nsh(f\"uv pip install --python {VENV_PY} librosa soundfile\")\n\n# 4. Final transformers pin. The earlier installs can leave us on 4.57.6\n#    (GPT2 tokenizer crashes on Granite Switch) or 5.0.0 (works for chat\n#    but truncates requirement_check JSON output). 
5.5.1 is what we tested.\nsh(f\"uv pip install --python {VENV_PY} 'transformers==5.5.1'\")\n\nsh(\"cd frontend && npm install --silent\")\nsh(\"wget -q https://github.com/cloudflare/cloudflared/releases/latest/download/cloudflared-linux-amd64 -O /usr/local/bin/cloudflared\")\nsh(\"chmod +x /usr/local/bin/cloudflared\")\n\n# Sanity checks — explicitly use the venv's Python so we're checking the right env.\nsh(f\"{VENV_PY} -c 'import vllm; print(\\\"vllm version:\\\", vllm.__version__)'\")\nsh(f\"{VENV_PY} -c 'import granite_switch.hf'\")\nsh(f\"{VENV_PY} -c 'import transformers; v = transformers.__version__; assert v == \\\"5.5.1\\\", \\\"got \\\" + v + \\\", wanted 5.5.1\\\"; print(\\\"transformers OK:\\\", v)'\")\nsh(f\"{VENV_PY} -c 'from mellea.backends.openai import OpenAIBackend; assert hasattr(OpenAIBackend, \\\"register_embedded_adapter_model\\\"), \\\"mellea version too old\\\"; print(\\\"mellea OK\\\")'\")\nsh(f\"{VENV_PY} -c 'import librosa, soundfile; print(\\\"vllm audio deps OK (librosa\\\", librosa.__version__, \\\"/ soundfile\\\", soundfile.__version__, \\\")\\\")'\")\n\nprint(\"\\n✅ Install complete\")"
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Cell 3 — Configure secrets (instant)\n",
+    "\n",
+    "Reads `HF_TOKEN` from Colab Secrets and exports it. Used for both HuggingFace model downloads and per-session TURN credential minting (see [TURN setup](https://turn.fastrtc.org/) — Cloudflare-backed, 10GB/mo free per HF token)."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "from google.colab import userdata\n",
+    "import os\n",
+    "\n",
+    "os.environ[\"HF_TOKEN\"] = userdata.get(\"HF_TOKEN\")\n",
+    "print(\"✅ HF_TOKEN configured — TURN credentials will be minted per-session\")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "source": [
+    "## Cell 4 — Configure the assistant (optional)\n",
+    "\n",
+    "The backend reads two env vars to customize what the assistant knows and how it behaves:\n",
+    "\n",
+    "- **`PROMPT_FILE`** — path to a `.txt` file with the system prompt. Defaults to [`prompts/granite.txt`](https://github.com/generative-computing/mellea-demos/blob/main/2026-granite-speech/prompts/granite.txt), which casts the assistant as Granite, IBM's real-time speech assistant.\n",
+    "- **`DOCUMENTS_DIR`** — path to a directory of `.txt` files. Each file becomes a grounding document the LLM can cite. The repo ships with [`docs/`](https://github.com/generative-computing/mellea-demos/tree/main/2026-granite-speech/docs) (Granite model cards, Mellea overview, demo architecture).\n",
+    "\n",
+    "Paths are resolved relative to the project root (`mellea-demos/2026-granite-speech/`).\n",
+    "\n",
+    "**To use your own:** edit the cell below before running it. Drop your prompt file and/or doc directory anywhere reachable from the runtime — e.g. upload via the Colab file browser, or `!wget` from a URL — then point the env vars at them."
+   ],
+   "metadata": {}
+  },
+  {
+   "cell_type": "code",
+   "metadata": {},
+   "source": [
+    "import os\n",
+    "\n",
+    "# Edit these to point at your own prompt or docs.\n",
+    "# Both paths are resolved relative to the project root if not absolute.\n",
+    "os.environ[\"PROMPT_FILE\"] = \"prompts/granite.txt\"\n",
+    "os.environ[\"DOCUMENTS_DIR\"] = \"docs\"\n",
+    "\n",
+    "print(f\"PROMPT_FILE   = {os.environ['PROMPT_FILE']}\")\n",
+    "print(f\"DOCUMENTS_DIR = {os.environ['DOCUMENTS_DIR']}\")"
+   ],
+   "outputs": [],
+   "execution_count": null
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Cell 5 — Launch vLLM model servers (~5–8 min cold, ~30s cached)\n",
+    "\n",
+    "Two vLLM processes:\n",
+    "- **Port 8083:** [`ibm-granite/granite-speech-4.1-2b`](https://huggingface.co/ibm-granite/granite-speech-4.1-2b) — STT.\n",
+    "- **Port 8000:** [`ibm-granite/granite-switch-4.1-3b-preview`](https://huggingface.co/ibm-granite/granite-switch-4.1-3b-preview) — chat LLM with `requirement_check` ALoRA intrinsics.\n",
+    "\n",
+    "Both run in the background; logs stream to `logs/vllm-*.log`. The cell blocks until both servers respond on `/v1/models`."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": "import os\nimport subprocess\nimport time\nimport urllib.request\nimport urllib.error\n\nos.makedirs(\"logs\", exist_ok=True)\n\nVENV_VLLM = os.path.abspath(\".venv/bin/vllm\")\nassert os.path.exists(VENV_VLLM), f\"vllm not installed in venv: {VENV_VLLM}\"\n\n# Pre-flight: kill any stale vllm processes from a prior failed run, then\n# verify the GPU has enough free memory before we try again.\nsubprocess.run(\"pkill -9 -f vllm || true\", shell=True)\ntime.sleep(3)\nfree_mem = subprocess.check_output(\n    [\"nvidia-smi\", \"--query-gpu=memory.free\", \"--format=csv,noheader,nounits\"]\n).decode().strip().splitlines()[0]\nfree_gib = int(free_mem) / 1024\nprint(f\"GPU free memory: {free_gib:.1f} GiB\")\nif free_gib < 22:\n    raise RuntimeError(\n        f\"Only {free_gib:.1f} GiB free on the GPU — need >=22. Something else is using it.\\n\"\n        \"Run `!nvidia-smi` in a new cell to see which process. Kill it with `!kill -9 <PID>`.\"\n    )\n\ndef _tail(path: str, n: int = 80) -> str:\n    try:\n        with open(path) as f:\n            return \"\".join(f.readlines()[-n:])\n    except FileNotFoundError:\n        return \"(log file missing)\"\n\ndef wait_for(url: str, name: str, proc: subprocess.Popen, log_path: str, timeout: int = 1200) -> None:\n    \"\"\"Poll until the URL returns 2xx. Bails out early if the process dies.\"\"\"\n    start = time.time()\n    last_err = None\n    while time.time() - start < timeout:\n        rc = proc.poll()\n        if rc is not None:\n            raise RuntimeError(\n                f\"{name} exited early with code {rc}. 
Last log lines:\\n\"\n                + \"-\" * 60 + \"\\n\" + _tail(log_path) + \"-\" * 60\n            )\n        try:\n            with urllib.request.urlopen(url, timeout=5) as r:\n                if 200 <= r.status < 300:\n                    elapsed = int(time.time() - start)\n                    print(f\"✅ {name} ready ({elapsed}s)\")\n                    return\n        except urllib.error.HTTPError as e:\n            # vllm returns 401 to unauth'd /v1/models polls when --api-key is set.\n            # The 401 proves the server is up and accepting requests, which is\n            # all we care about for readiness. Any HTTPError means the server\n            # is responding, so treat it as ready.\n            elapsed = int(time.time() - start)\n            print(f\"✅ {name} ready ({elapsed}s, status {e.code})\")\n            return\n        except (urllib.error.URLError, ConnectionError, TimeoutError) as e:\n            last_err = e\n        time.sleep(5)\n    raise TimeoutError(\n        f\"{name} did not become ready in {timeout}s. 
Last error: {last_err}.\\n\"\n        f\"Last log lines:\\n\" + \"-\" * 60 + \"\\n\" + _tail(log_path) + \"-\" * 60\n    )\n\n# Launch SEQUENTIALLY — wait for each to fully initialize before starting the next.\n# Parallel launch causes vllm's memory-profiling assertion to fire because\n# both processes are allocating/freeing GPU memory at the same time and each\n# sees the other's churn as 'unexpected' free-memory deltas.\nspeech_log = open(\"logs/vllm-speech.log\", \"w\")\nprint(\"⏳ Starting Granite Speech vLLM (downloads weights on first run, ~4 min)...\")\nspeech_proc = subprocess.Popen(\n    [\n        VENV_VLLM, \"serve\", \"ibm-granite/granite-speech-4.1-2b\",\n        \"--api-key\", \"token-abc123\",\n        \"--max-model-len\", \"2048\",\n        \"--gpu-memory-utilization\", \"0.4\",\n        \"--port\", \"8083\",\n    ],\n    stdout=speech_log, stderr=subprocess.STDOUT,\n)\nwait_for(\"http://127.0.0.1:8083/v1/models\", \"Granite Speech (STT)\", speech_proc, \"logs/vllm-speech.log\", timeout=1200)\n\nswitch_log = open(\"logs/vllm-switch.log\", \"w\")\nprint(\"⏳ Starting Granite Switch vLLM (downloads weights on first run, ~4 min)...\")\nswitch_proc = subprocess.Popen(\n    [\n        VENV_VLLM, \"serve\", \"ibm-granite/granite-switch-4.1-3b-preview\",\n        \"--gpu-memory-utilization\", \"0.4\",\n        # Cap context window so KV cache fits in our 0.4 GPU share. The default\n        # 131072 wants ~15 GiB of KV cache; voice turns need a tiny fraction of that.\n        \"--max-model-len\", \"8192\",\n        \"--port\", \"8000\",\n    ],\n    stdout=switch_log, stderr=subprocess.STDOUT,\n)\nwait_for(\"http://127.0.0.1:8000/v1/models\", \"Granite Switch (LLM)\", switch_proc, \"logs/vllm-switch.log\", timeout=1200)\n\nprint(\"✅ Both vLLM servers are up\")"
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Cell 6 — Launch backend + frontend (~30s)\n",
+    "\n",
+    "- **Pipecat backend** on port 7860 (FastAPI + SmallWebRTC signaling).\n",
+    "- **Next.js frontend** on port 3000 (proxies WebRTC signaling to the backend in-process).\n",
+    "\n",
+    "The backend reads `HF_TOKEN` and uses it to mint a TURN relay credential per session. The cloudflared tunnel carries only signaling, so WebRTC media reaches your browser through that TURN relay."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import os\n",
+    "import subprocess\n",
+    "import time\n",
+    "import urllib.request\n",
+    "import urllib.error\n",
+    "\n",
+    "VENV_PY = os.path.abspath(\".venv/bin/python\")\n",
+    "\n",
+    "# Build the frontend in production mode. Dev mode (`npm run dev`) tries to\n",
+    "# open a webpack-hmr WebSocket back through the cloudflared tunnel, which\n",
+    "# tunnels poorly and triggers dynamic-import failures that leave the chat\n",
+    "# UI blank. Prod mode serves a prebuilt bundle, so there is no HMR socket.\n",
+    "print(\"⏳ Building frontend (prod mode, ~30-60s)...\")\n",
+    "subprocess.run(\n",
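+    "    \"set -o pipefail; cd frontend && rm -rf .next && npm run build 2>&1 | tail -10\",\n",
+    "    shell=True, check=True, executable=\"/bin/bash\",  # pipefail: a failed build must not be masked by tail's exit code\n",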
+    ")\n",
+    "\n",
+    "backend_env = {**os.environ}\n",
+    "backend_env.setdefault(\"HOST\", \"127.0.0.1\")\n",
+    "backend_env.setdefault(\"PORT\", \"7860\")\n",
+    "# PROMPT_FILE and DOCUMENTS_DIR are set in the configuration cell above and\n",
+    "# inherited via os.environ.\n",
+    "\n",
+    "backend_log = open(\"logs/backend.log\", \"w\")\n",
+    "backend_proc = subprocess.Popen(\n",
+    "    [VENV_PY, \"-m\", \"granite_speech_demo.server\"],\n",
+    "    env=backend_env,\n",
+    "    stdout=backend_log, stderr=subprocess.STDOUT,\n",
+    ")\n",
+    "\n",
+    "frontend_env = {**os.environ, \"PIPECAT_BACKEND_URL\": \"http://127.0.0.1:7860\"}\n",
+    "frontend_log = open(\"logs/frontend.log\", \"w\")\n",
+    "frontend_proc = subprocess.Popen(\n",
+    "    [\"npm\", \"run\", \"start\"],\n",
+    "    cwd=\"frontend\",\n",
+    "    env=frontend_env,\n",
+    "    stdout=frontend_log, stderr=subprocess.STDOUT,\n",
+    ")\n",
+    "\n",
+    "wait_for(\"http://127.0.0.1:7860/api/ivr/config\", \"Pipecat backend\", backend_proc, \"logs/backend.log\", timeout=120)\n",
+    "wait_for(\"http://127.0.0.1:3000\", \"Next.js frontend\", frontend_proc, \"logs/frontend.log\", timeout=120)\n",
+    "print(\"✅ Backend + frontend are up\")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Cell 7 — Open the public URL and talk\n",
+    "\n",
+    "Starts a Cloudflare Quick Tunnel to expose `localhost:3000` on a public `*.trycloudflare.com` URL. The tunnel handles WebRTC *signaling* (HTTP/WebSocket); the *media* path goes through the TURN relay minted by the backend, so audio works even though the Colab runtime has no public IP.\n",
+    "\n",
+    "**One tunnel is enough.** The Next.js server proxies signaling requests to the backend over localhost through its API routes, so only port 3000 needs a public URL.\n",
+    "\n",
+    "**Heads up:** the first interaction will feel slow. There's one-time setup that runs when the environment and networking first spin up (TURN credentials, WebRTC negotiation, model warmup). Subsequent turns are much faster."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import re\n",
+    "import subprocess\n",
+    "import time\n",
+    "\n",
+    "tunnel_log_path = \"logs/cloudflared.log\"\n",
+    "tunnel_log = open(tunnel_log_path, \"w\")\n",
+    "tunnel_proc = subprocess.Popen(\n",
+    "    [\"cloudflared\", \"tunnel\", \"--url\", \"http://localhost:3000\", \"--no-autoupdate\"],\n",
+    "    stdout=tunnel_log, stderr=subprocess.STDOUT,\n",
+    ")\n",
+    "\n",
+    "url_re = re.compile(r\"https://[a-z0-9-]+\\.trycloudflare\\.com\")\n",
+    "public_url = None\n",
+    "deadline = time.time() + 60\n",
+    "while time.time() < deadline and public_url is None:\n",
+    "    time.sleep(2)\n",
+    "    with open(tunnel_log_path) as f:\n",
+    "        m = url_re.search(f.read())\n",
+    "    if m:\n",
+    "        public_url = m.group(0)\n",
+    "\n",
+    "if not public_url:\n",
+    "    raise RuntimeError(\"cloudflared did not print a public URL. See logs/cloudflared.log\")\n",
+    "\n",
+    "banner = \"\\n\".join([\n",
+    "    \"\",\n",
+    "    \"╔\" + \"═\" * 70 + \"╗\",\n",
+    "    \"║\" + \"  GRANITE SPEECH DEMO IS LIVE\".ljust(70) + \"║\",\n",
+    "    \"╠\" + \"═\" * 70 + \"╣\",\n",
+    "    \"║\" + f\"  {public_url}\".ljust(70) + \"║\",\n",
+    "    \"║\" + \"\".ljust(70) + \"║\",\n",
+    "    \"║\" + \"  1. Open the URL above in Chrome / Edge / Firefox\".ljust(70) + \"║\",\n",
+    "    \"║\" + \"  2. Allow microphone access when prompted\".ljust(70) + \"║\",\n",
+    "    \"║\" + \"  3. Click the mic button and start talking\".ljust(70) + \"║\",\n",
+    "    \"╚\" + \"═\" * 70 + \"╝\",\n",
+    "    \"\",\n",
+    "])\n",
+    "print(banner)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## If something goes wrong\n",
+    "\n",
+    "Each background process writes to a file in `logs/`:\n",
+    "\n",
+    "- `logs/vllm-speech.log` — Granite Speech STT server\n",
+    "- `logs/vllm-switch.log` — Granite Switch LLM server\n",
+    "- `logs/backend.log` — Pipecat backend (look here for TURN minting messages)\n",
+    "- `logs/frontend.log` — Next.js production server (`npm run start`)\n",
+    "- `logs/cloudflared.log` — Cloudflare tunnel (the public URL is in here)\n",
+    "\n",
+    "View one with `!tail -100 logs/vllm-speech.log` (or open the file from the Colab file browser).\n",
+    "\n",
+    "**Common failures:**\n",
+    "- *T4 OOM:* switch the runtime to A100 or L4. The two Granite models don't fit together on a T4, and Cell 5 refuses to start with less than 22 GiB of free GPU memory.\n",
+    "- *`HF_TOKEN` missing:* re-run Cell 3 after adding the secret. Without it, the backend falls back to STUN-only and audio likely won't connect through the cloudflared tunnel.\n",
+    "- *Cell 5 appears stuck while starting vLLM:* model weights are downloading. The cell waits up to 20 min for each server, so let it run.\n",
+    "- *Re-running cells without cleaning up:* old processes still hold the ports. Run the kill-switch cell below, then re-run the launch cells (Cell 5 onward).\n",
+    "\n",
+    "## Caveats\n",
+    "\n",
+    "- The `*.trycloudflare.com` URL is public for as long as this notebook runs. Anyone with the link can join the session.\n",
+    "- Colab kernels die after ~24h or when idle. Restart the notebook to get a fresh URL.\n",
+    "- One Colab session serves one user. Each reader runs their own copy of this notebook.\n",
+    "\n",
+    "## Kill switch — clean up before re-running\n",
+    "\n",
+    "Run this if you need to re-run any of the launch cells. It stops the tunnel, frontend, backend, and both vLLM processes.\n",
+    "\n",
+    "**This cell is gated** so \"Run all\" won't tear down the stack you just brought up. To actually run it, uncomment the `RUN_KILL_SWITCH = True` line at the top of the cell, then run the cell."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import subprocess\n",
+    "\n",
+    "# Uncomment the next line to actually run the kill switch.\n",
+    "# This guard exists so \"Run all\" doesn't tear down the stack you just brought up.\n",
+    "# RUN_KILL_SWITCH = True\n",
+    "\n",
+    "if not globals().get(\"RUN_KILL_SWITCH\"):\n",
+    "    print(\"Kill switch is disabled. Uncomment `RUN_KILL_SWITCH = True` above and re-run this cell to stop all processes.\")\n",
+    "else:\n",
+    "    # Stop tracked Popen handles from this kernel session.\n",
+    "    for name, p in [\n",
+    "        (\"cloudflared\", globals().get(\"tunnel_proc\")),\n",
+    "        (\"frontend\", globals().get(\"frontend_proc\")),\n",
+    "        (\"backend\", globals().get(\"backend_proc\")),\n",
+    "        (\"vllm-switch\", globals().get(\"switch_proc\")),\n",
+    "        (\"vllm-speech\", globals().get(\"speech_proc\")),\n",
+    "    ]:\n",
+    "        if p is not None and p.poll() is None:\n",
+    "            p.terminate()\n",
+    "            try:\n",
+    "                p.wait(timeout=10)\n",
+    "            except subprocess.TimeoutExpired:\n",
+    "                p.kill()\n",
+    "            print(f\"🛑 stopped {name}\")\n",
+    "        else:\n",
+    "            print(f\"   {name}: not tracked / already dead\")\n",
+    "\n",
+    "    # Also kill by name. This catches processes whose Popen handles got lost across\n",
+    "    # cell re-runs or kernel restarts. Without this, GPU memory stays held by\n",
+    "    # orphaned vllm processes and the next Cell 5 run fails with OOM at startup.\n",
+    "    # We run the frontend in prod mode (`npm run start` -> `next start`), so\n",
+    "    # match `next start` rather than `next dev`.\n",
+    "    for pattern in [\"vllm\", \"cloudflared tunnel\", \"granite_speech_demo.server\", \"next start\", \"node.*next\"]:\n",
+    "        subprocess.run(f\"pkill -9 -f '{pattern}' || true\", shell=True)\n",
+    "        print(f\"🧹 pkill -9 -f '{pattern}'\")\n",
+    "\n",
+    "    # Final safety net: free the ports the launch cells bind to. If a process\n",
+    "    # slipped past the name-based pkill above, this kills whatever is still\n",
+    "    # listening so re-running Cell 5 / Cell 6 doesn't fail with EADDRINUSE.\n",
+    "    #   3000 = Next.js frontend\n",
+    "    #   7860 = Pipecat backend\n",
+    "    #   8000 = Granite Switch vLLM\n",
+    "    #   8083 = Granite Speech vLLM\n",
+    "    for port in (3000, 7860, 8000, 8083):\n",
+    "        subprocess.run(f\"fuser -k {port}/tcp 2>/dev/null || true\", shell=True)\n",
+    "        print(f\"🔓 freed port {port}\")\n",
+    "\n",
+    "    print(\"\\nIf any vllm processes were running, GPU memory should now be freed.\")\n",
+    "    print(\"Run `!nvidia-smi` to confirm before re-running Cell 5.\")"
+   ]
+  }
+ ],
+ "metadata": {
+  "accelerator": "GPU",
+  "colab": {
+   "gpuType": "A100",
+   "provenance": []
+  },
+  "kernelspec": {
+   "display_name": "Python 3",
+   "name": "python3"
+  },
+  "language_info": {
+   "name": "python"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 0
+}
\ No newline at end of file