diff --git a/tutorials/README.md b/tutorials/README.md
index 8a1fd9f..dfe2c57 100644
--- a/tutorials/README.md
+++ b/tutorials/README.md
@@ -16,6 +16,7 @@ Step-by-step walkthroughs covering adapter invocation, pipeline construction, an
 | [03_03_govt_rag_pipeline_loops.ipynb](notebooks/03_03_govt_rag_pipeline_loops.ipynb) | Complex RAG pipeline with retry loops for scope and answerability | 30 min | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/generative-computing/granite-switch/blob/main/tutorials/notebooks/03_03_govt_rag_pipeline_loops.ipynb) |
 | [04_compose_granite_switch.ipynb](notebooks/04_compose_granite_switch.ipynb) | Compose a checkpoint from adapter libraries | 15 min | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/generative-computing/granite-switch/blob/main/tutorials/notebooks/04_compose_granite_switch.ipynb) |
 | [05_alora_vs_lora_race.ipynb](notebooks/05_alora_vs_lora_race.ipynb) | ALORA vs LoRA race: side-by-side throughput comparison on a multi-step RAG pipeline | 20 min | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/generative-computing/granite-switch/blob/main/tutorials/notebooks/05_alora_vs_lora_race.ipynb) |
+| [06_granite_speech_demo.ipynb](notebooks/06_granite_speech_demo.ipynb) | Real-time voice assistant: Granite Speech STT + Granite Switch LLM + Granite Libraries validation, orchestrated by Mellea over WebRTC | 10 min | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/generative-computing/granite-switch/blob/main/tutorials/notebooks/06_granite_speech_demo.ipynb) |
 
 ## Guides
 
diff --git a/tutorials/notebooks/06_granite_speech_demo.ipynb b/tutorials/notebooks/06_granite_speech_demo.ipynb
new file mode 100644
index 0000000..990f816
--- /dev/null
+++ b/tutorials/notebooks/06_granite_speech_demo.ipynb
@@ -0,0 +1,330 @@
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": "# Granite Speech Demo — full stack in Colab\n\nSpin up a real-time, validated voice assistant powered by IBM Granite 4.1 — entirely inside a Colab notebook. A single **Run all** brings up both vLLM model servers (Granite Speech 4.1 STT + Granite Switch 4.1 LLM), the Pipecat backend, and the Next.js frontend, then prints a public URL you open in your browser to start talking.\n\n**Browser mic → WebRTC → Granite Speech STT → Mellea/Granite Switch LLM → Kokoro TTS → browser speaker.**\n\nThis notebook is a runnable companion to the [granite-speech-demo](https://github.com/generative-computing/mellea-demos/tree/main/2026-granite-speech) reference implementation.\n\n## What this demo is\n\nOne WebRTC conversation in which every layer of the Granite 4.1 release does something load-bearing: **Granite Speech 4.1** transcribes the audio (with keyword biasing for terms like \"Granite\" and \"Mellea\"); **Granite Switch 4.1** answers, hot-swapping LoRA adapters from inside a single checkpoint via control tokens; the **Granite Libraries** — twelve task-specific adapters spanning Core (explainability and validation), RAG, and Guardian (safety) — score and shape each response, with this demo using `requirement_check` to validate candidates against plain-English requirements (\"no markdown\", \"natural spoken cadence\", \"relevant to IBM\", \"no code\"); **Mellea** orchestrates the turn with its Instruct-Validate-Repair pattern, generating Best-of-N candidates in parallel and only sending one that passes every check to TTS. Validation is on by default, with a UI toggle for plain streaming if you want to feel the latency difference.\n\n## Prerequisites\n\n- **GPU runtime: A100 (Colab Pro) recommended.** L4 works. T4 will OOM — both Granite models won't fit.\n- **HuggingFace read token.** Free; create one at https://huggingface.co/settings/tokens. Add it as a Colab Secret named `HF_TOKEN` (sidebar → 🔑 → New secret). Used for two things: downloading the Granite model weights, *and* minting per-session WebRTC TURN credentials so audio reaches your browser.\n- **Browser:** Chrome, Edge, or Firefox. Safari may behave oddly with WebRTC.\n\n## How long this takes\n\n- **First run on a fresh runtime: ~8–10 min** (model downloads dominate).\n- **Subsequent runs with weights cached: ~3 min.**\n\n## What to do\n\n1. Set the `HF_TOKEN` Colab Secret.\n2. Switch the runtime to a GPU (Runtime → Change runtime type → A100/L4).\n3. **Runtime → Run all.**\n4. When the last cell prints a `*.trycloudflare.com` URL, open it, allow mic access, and start talking.\n\nIf anything goes wrong, scroll to the bottom — there's a troubleshooting section and a kill-switch cell."
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Cell 2 — Install dependencies (~3 min)\n",
+    "\n",
+    "Clones the repo, installs Python deps via `uv`, installs frontend deps via `npm`, and downloads the `cloudflared` binary used for the public tunnel."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": "import subprocess, os, shutil\n\ndef sh(cmd, **kwargs):\n    print(f\"\\n$ {cmd}\")\n    subprocess.run(cmd, shell=True, check=True, **kwargs)\n\n# Re-runnable: nuke any stale clones so we don't trip on existing dirs.\nshutil.rmtree(\"mellea-demos\", ignore_errors=True)\nshutil.rmtree(\"/tmp/granite-switch\", ignore_errors=True)\n\n# Colab's default Ubuntu repo has Node 12, which is too old for Next.js\n# (chokes on optional-chaining). Install Node 20 from NodeSource instead.\nsh(\"curl -fsSL https://deb.nodesource.com/setup_20.x | bash -\")\nsh(\"apt-get -qq install -y nodejs\")\n\nsh(\"git clone https://github.com/generative-computing/mellea-demos\")\nos.chdir(\"mellea-demos/2026-granite-speech\")\nprint(\"cwd:\", os.getcwd())\n\nsh(\"pip install -q uv\")\nsh(\"uv sync\")\n\n# IMPORTANT: uv sync creates .venv, but `uv pip install` by default targets\n# the system Python. Pin every subsequent install to the project venv.\nVENV_PY = os.path.abspath(\".venv/bin/python\")\nassert os.path.exists(VENV_PY), f\"venv missing: {VENV_PY}\"\n\n# The install order below is load-bearing. Each step's pins can override\n# the previous step's resolution; the final order leaves us with:\n# - mellea 0.5.0 (provides register_embedded_adapter_model, missing in 0.4.2)\n# - vllm 0.19.x with audio deps (Granite Speech needs librosa + soundfile)\n# - granite_switch model architecture registered\n# - transformers 5.5.1 (older versions truncate the requirement_check JSON;\n# newer versions might or might not, so pin exactly what we tested)\n\n# 1. mellea 0.5.0 (0.4.2 release lacks APIs the demo uses)\nsh(f\"uv pip install --python {VENV_PY} 'mellea[all]==0.5.0'\")\n\n# 2. vllm + the right transformers floor + granite_switch model registration.\n# The granite-switch repo's [vllm] extra pins vllm >=0.19.1,<0.20.0 and\n# transformers >=5.5.1 — installing plain `pip install vllm` gives 0.21.0\n# with an older transformers, which fails to recognize the architecture.\nsh(\"git clone https://github.com/generative-computing/granite-switch /tmp/granite-switch\")\nassert os.path.exists(\"/tmp/granite-switch/pyproject.toml\"), \"granite-switch clone failed\"\nsh(f\"uv pip install --python {VENV_PY} -e '/tmp/granite-switch[vllm]'\")\n\n# 3. vllm audio deps. We install librosa + soundfile directly instead of\n# relying on `vllm[audio]` — uv sees vllm as already satisfied from step 2\n# and skips re-resolving the [audio] extras, leaving librosa missing.\n# Without these, /v1/chat/completions returns 500 with\n# 'Please install vllm[audio] for audio support' on any audio input.\nsh(f\"uv pip install --python {VENV_PY} librosa soundfile\")\n\n# 4. Final transformers pin. The earlier installs can leave us on 4.57.6\n# (GPT2 tokenizer crashes on Granite Switch) or 5.0.0 (works for chat\n# but truncates requirement_check JSON output). 5.5.1 is what we tested.\nsh(f\"uv pip install --python {VENV_PY} 'transformers==5.5.1'\")\n\nsh(\"cd frontend && npm install --silent\")\nsh(\"wget -q https://github.com/cloudflare/cloudflared/releases/latest/download/cloudflared-linux-amd64 -O /usr/local/bin/cloudflared\")\nsh(\"chmod +x /usr/local/bin/cloudflared\")\n\n# Sanity checks — explicitly use the venv's Python so we're checking the right env.\nsh(f\"{VENV_PY} -c 'import vllm; print(\\\"vllm version:\\\", vllm.__version__)'\")\nsh(f\"{VENV_PY} -c 'import granite_switch.hf'\")\nsh(f\"{VENV_PY} -c 'import transformers; v = transformers.__version__; assert v == \\\"5.5.1\\\", \\\"got \\\" + v + \\\", wanted 5.5.1\\\"; print(\\\"transformers OK:\\\", v)'\")\nsh(f\"{VENV_PY} -c 'from mellea.backends.openai import OpenAIBackend; assert hasattr(OpenAIBackend, \\\"register_embedded_adapter_model\\\"), \\\"mellea version too old\\\"; print(\\\"mellea OK\\\")'\")\nsh(f\"{VENV_PY} -c 'import librosa, soundfile; print(\\\"vllm audio deps OK (librosa\\\", librosa.__version__, \\\"/ soundfile\\\", soundfile.__version__, \\\")\\\")'\")\n\nprint(\"\\n✅ Install complete\")"
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Cell 3 — Configure secrets (instant)\n",
+    "\n",
+    "Reads `HF_TOKEN` from Colab Secrets and exports it. Used for both HuggingFace model downloads and per-session TURN credential minting (see [TURN setup](https://turn.fastrtc.org/) — Cloudflare-backed, 10GB/mo free per HF token)."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "from google.colab import userdata\n",
+    "import os\n",
+    "\n",
+    "os.environ[\"HF_TOKEN\"] = userdata.get(\"HF_TOKEN\")\n",
+    "print(\"✅ HF_TOKEN configured — TURN credentials will be minted per-session\")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "source": [
+    "## Cell 4 — Configure the assistant (optional)\n",
+    "\n",
+    "The backend reads two env vars to customize what the assistant knows and how it behaves:\n",
+    "\n",
+    "- **`PROMPT_FILE`** — path to a `.txt` file with the system prompt. Defaults to [`prompts/granite.txt`](https://github.com/generative-computing/mellea-demos/blob/main/2026-granite-speech/prompts/granite.txt), which casts the assistant as Granite, IBM's real-time speech assistant.\n",
+    "- **`DOCUMENTS_DIR`** — path to a directory of `.txt` files. Each file becomes a grounding document the LLM can cite. The repo ships with [`docs/`](https://github.com/generative-computing/mellea-demos/tree/main/2026-granite-speech/docs) (Granite model cards, Mellea overview, demo architecture).\n",
+    "\n",
+    "Paths are resolved relative to the project root (`mellea-demos/2026-granite-speech/`).\n",
+    "\n",
+    "**To use your own:** edit the cell below before running it. Drop your prompt file and/or doc directory anywhere reachable from the runtime — e.g. upload via the Colab file browser, or `!wget` from a URL — then point the env vars at them."
+   ],
+   "metadata": {}
+  },
+  {
+   "cell_type": "code",
+   "metadata": {},
+   "source": [
+    "import os\n",
+    "\n",
+    "# Edit these to point at your own prompt or docs.\n",
+    "# Both paths are resolved relative to the project root if not absolute.\n",
+    "os.environ[\"PROMPT_FILE\"] = \"prompts/granite.txt\"\n",
+    "os.environ[\"DOCUMENTS_DIR\"] = \"docs\"\n",
+    "\n",
+    "print(f\"PROMPT_FILE = {os.environ['PROMPT_FILE']}\")\n",
+    "print(f\"DOCUMENTS_DIR = {os.environ['DOCUMENTS_DIR']}\")"
+   ],
+   "outputs": [],
+   "execution_count": null
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Cell 5 — Launch vLLM model servers (~5–8 min cold, ~30s cached)\n",
+    "\n",
+    "Two vLLM processes:\n",
+    "- **Port 8083:** [`ibm-granite/granite-speech-4.1-2b`](https://huggingface.co/ibm-granite/granite-speech-4.1-2b) — STT.\n",
+    "- **Port 8000:** [`ibm-granite/granite-switch-4.1-3b-preview`](https://huggingface.co/ibm-granite/granite-switch-4.1-3b-preview) — chat LLM with `requirement_check` ALoRA intrinsics.\n",
+    "\n",
+    "Both run in the background; logs stream to `logs/vllm-*.log`. The cell blocks until both servers respond on `/v1/models`."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": "import os\nimport subprocess\nimport time\nimport urllib.request\nimport urllib.error\n\nos.makedirs(\"logs\", exist_ok=True)\n\nVENV_VLLM = os.path.abspath(\".venv/bin/vllm\")\nassert os.path.exists(VENV_VLLM), f\"vllm not installed in venv: {VENV_VLLM}\"\n\n# Pre-flight: kill any stale vllm processes from a prior failed run, then\n# verify the GPU has enough free memory before we try again.\nsubprocess.run(\"pkill -9 -f vllm || true\", shell=True)\ntime.sleep(3)\nfree_mem = subprocess.check_output(\n    [\"nvidia-smi\", \"--query-gpu=memory.free\", \"--format=csv,noheader,nounits\"]\n).decode().strip().splitlines()[0]\nfree_gib = int(free_mem) / 1024\nprint(f\"GPU free memory: {free_gib:.1f} GiB\")\nif free_gib < 22:\n    raise RuntimeError(\n        f\"Only {free_gib:.1f} GiB free on the GPU — need >=22. Something else is using it.\\n\"\n        \"Run `!nvidia-smi` in a new cell to see which process. Kill it with `!kill -9 <PID>`.\"\n    )\n\ndef _tail(path: str, n: int = 80) -> str:\n    try:\n        with open(path) as f:\n            return \"\".join(f.readlines()[-n:])\n    except FileNotFoundError:\n        return \"(log file missing)\"\n\ndef wait_for(url: str, name: str, proc: subprocess.Popen, log_path: str, timeout: int = 1200) -> None:\n    \"\"\"Poll until the URL responds (2xx, or any HTTP error status). Bails out early if the process dies.\"\"\"\n    start = time.time()\n    last_err = None\n    while time.time() - start < timeout:\n        rc = proc.poll()\n        if rc is not None:\n            raise RuntimeError(\n                f\"{name} exited early with code {rc}. Last log lines:\\n\"\n                + \"-\" * 60 + \"\\n\" + _tail(log_path) + \"-\" * 60\n            )\n        try:\n            with urllib.request.urlopen(url, timeout=5) as r:\n                if 200 <= r.status < 300:\n                    elapsed = int(time.time() - start)\n                    print(f\"✅ {name} ready ({elapsed}s)\")\n                    return\n        except urllib.error.HTTPError as e:\n            # vllm returns 401 to unauth'd /v1/models polls when --api-key is set.\n            # The 401 proves the server is up and accepting requests, which is\n            # all we care about for readiness. Any HTTPError means the server\n            # is responding, so treat it as ready.\n            elapsed = int(time.time() - start)\n            print(f\"✅ {name} ready ({elapsed}s, status {e.code})\")\n            return\n        except (urllib.error.URLError, ConnectionError, TimeoutError) as e:\n            last_err = e\n        time.sleep(5)\n    raise TimeoutError(\n        f\"{name} did not become ready in {timeout}s. Last error: {last_err}.\\n\"\n        f\"Last log lines:\\n\" + \"-\" * 60 + \"\\n\" + _tail(log_path) + \"-\" * 60\n    )\n\n# Launch SEQUENTIALLY — wait for each to fully initialize before starting the next.\n# Parallel launch causes vllm's memory-profiling assertion to fire because\n# both processes are allocating/freeing GPU memory at the same time and each\n# sees the other's churn as 'unexpected' free-memory deltas.\nspeech_log = open(\"logs/vllm-speech.log\", \"w\")\nprint(\"⏳ Starting Granite Speech vLLM (downloads weights on first run, ~4 min)...\")\nspeech_proc = subprocess.Popen(\n    [\n        VENV_VLLM, \"serve\", \"ibm-granite/granite-speech-4.1-2b\",\n        \"--api-key\", \"token-abc123\",\n        \"--max-model-len\", \"2048\",\n        \"--gpu-memory-utilization\", \"0.4\",\n        \"--port\", \"8083\",\n    ],\n    stdout=speech_log, stderr=subprocess.STDOUT,\n)\nwait_for(\"http://127.0.0.1:8083/v1/models\", \"Granite Speech (STT)\", speech_proc, \"logs/vllm-speech.log\", timeout=1200)\n\nswitch_log = open(\"logs/vllm-switch.log\", \"w\")\nprint(\"⏳ Starting Granite Switch vLLM (downloads weights on first run, ~4 min)...\")\nswitch_proc = subprocess.Popen(\n    [\n        VENV_VLLM, \"serve\", \"ibm-granite/granite-switch-4.1-3b-preview\",\n        \"--gpu-memory-utilization\", \"0.4\",\n        # Cap context window so KV cache fits in our 0.4 GPU share. The default\n        # 131072 wants ~15 GiB of KV cache; voice turns need a tiny fraction of that.\n        \"--max-model-len\", \"8192\",\n        \"--port\", \"8000\",\n    ],\n    stdout=switch_log, stderr=subprocess.STDOUT,\n)\nwait_for(\"http://127.0.0.1:8000/v1/models\", \"Granite Switch (LLM)\", switch_proc, \"logs/vllm-switch.log\", timeout=1200)\n\nprint(\"✅ Both vLLM servers are up\")"
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Cell 6 — Launch backend + frontend (~30s)\n",
+    "\n",
+    "- **Pipecat backend** on port 7860 (FastAPI + SmallWebRTC signaling).\n",
+    "- **Next.js frontend** on port 3000 (proxies WebRTC signaling to the backend in-process).\n",
+    "\n",
+    "The backend reads `HF_TOKEN` and uses it to mint a TURN relay credential per session — that's how WebRTC media reaches your browser through the cloudflared tunnel."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import os\n",
+    "import subprocess\n",
+    "import time\n",
+    "import urllib.request\n",
+    "import urllib.error\n",
+    "\n",
+    "VENV_PY = os.path.abspath(\".venv/bin/python\")\n",
+    "\n",
+    "# Build the frontend in production mode. Dev mode (`npm run dev`) tries to\n",
+    "# open a webpack-hmr WebSocket back through the cloudflared tunnel, which\n",
+    "# tunnels poorly and triggers dynamic-import failures that leave the chat\n",
+    "# UI blank. Prod mode is a static-served bundle — no HMR, no SSR weirdness.\n",
+    "print(\"⏳ Building frontend (prod mode, ~30-60s)...\")\n",
+    "subprocess.run(\n",
+    "    \"cd frontend && rm -rf .next && npm run build 2>&1 | tail -10\",\n",
+    "    shell=True, check=True,\n",
+    ")\n",
+    "\n",
+    "backend_env = {**os.environ}\n",
+    "backend_env.setdefault(\"HOST\", \"127.0.0.1\")\n",
+    "backend_env.setdefault(\"PORT\", \"7860\")\n",
+    "# PROMPT_FILE and DOCUMENTS_DIR are set in the configuration cell above and\n",
+    "# inherited via os.environ.\n",
+    "\n",
+    "backend_log = open(\"logs/backend.log\", \"w\")\n",
+    "backend_proc = subprocess.Popen(\n",
+    "    [VENV_PY, \"-m\", \"granite_speech_demo.server\"],\n",
+    "    env=backend_env,\n",
+    "    stdout=backend_log, stderr=subprocess.STDOUT,\n",
+    ")\n",
+    "\n",
+    "frontend_env = {**os.environ, \"PIPECAT_BACKEND_URL\": \"http://127.0.0.1:7860\"}\n",
+    "frontend_log = open(\"logs/frontend.log\", \"w\")\n",
+    "frontend_proc = subprocess.Popen(\n",
+    "    [\"npm\", \"run\", \"start\"],\n",
+    "    cwd=\"frontend\",\n",
+    "    env=frontend_env,\n",
+    "    stdout=frontend_log, stderr=subprocess.STDOUT,\n",
+    ")\n",
+    "\n",
+    "wait_for(\"http://127.0.0.1:7860/api/ivr/config\", \"Pipecat backend\", backend_proc, \"logs/backend.log\", timeout=120)\n",
+    "wait_for(\"http://127.0.0.1:3000\", \"Next.js frontend\", frontend_proc, \"logs/frontend.log\", timeout=120)\n",
+    "print(\"✅ Backend + frontend are up\")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Cell 7 — Open the public URL and talk\n",
+    "\n",
+    "Starts a Cloudflare Quick Tunnel to expose `localhost:3000` on a public `*.trycloudflare.com` URL. The tunnel handles WebRTC *signaling* (HTTP/WebSocket); the *media* path goes through the TURN relay minted by the backend, so audio works even though the Colab runtime has no public IP.\n",
+    "\n",
+    "**One tunnel is enough** — the frontend talks to the backend in-process via Next.js API routes.\n",
+    "\n",
+    "**Heads up:** the first interaction will feel slow. There's one-time setup that runs when the environment and networking first spin up (TURN credentials, WebRTC negotiation, model warmup). Subsequent turns are much faster."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import re\n",
+    "import subprocess\n",
+    "import time\n",
+    "\n",
+    "tunnel_log_path = \"logs/cloudflared.log\"\n",
+    "tunnel_log = open(tunnel_log_path, \"w\")\n",
+    "tunnel_proc = subprocess.Popen(\n",
+    "    [\"cloudflared\", \"tunnel\", \"--url\", \"http://localhost:3000\", \"--no-autoupdate\"],\n",
+    "    stdout=tunnel_log, stderr=subprocess.STDOUT,\n",
+    ")\n",
+    "\n",
+    "url_re = re.compile(r\"https://[a-z0-9-]+\\.trycloudflare\\.com\")\n",
+    "public_url = None\n",
+    "deadline = time.time() + 60\n",
+    "while time.time() < deadline and public_url is None:\n",
+    "    time.sleep(2)\n",
+    "    with open(tunnel_log_path) as f:\n",
+    "        m = url_re.search(f.read())\n",
+    "    if m:\n",
+    "        public_url = m.group(0)\n",
+    "\n",
+    "if not public_url:\n",
+    "    raise RuntimeError(\"cloudflared did not print a public URL. See logs/cloudflared.log\")\n",
+    "\n",
+    "banner = \"\\n\".join([\n",
+    "    \"\",\n",
+    "    \"╔\" + \"═\" * 70 + \"╗\",\n",
+    "    \"║\" + \" GRANITE SPEECH DEMO IS LIVE\".ljust(70) + \"║\",\n",
+    "    \"╠\" + \"═\" * 70 + \"╣\",\n",
+    "    \"║\" + f\" {public_url}\".ljust(70) + \"║\",\n",
+    "    \"║\" + \"\".ljust(70) + \"║\",\n",
+    "    \"║\" + \" 1. Open the URL above in Chrome / Edge / Firefox\".ljust(70) + \"║\",\n",
+    "    \"║\" + \" 2. Allow microphone access when prompted\".ljust(70) + \"║\",\n",
+    "    \"║\" + \" 3. Click the mic button and start talking\".ljust(70) + \"║\",\n",
+    "    \"╚\" + \"═\" * 70 + \"╝\",\n",
+    "    \"\",\n",
+    "])\n",
+    "print(banner)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## If something goes wrong\n",
+    "\n",
+    "Each background process writes to a file in `logs/`:\n",
+    "\n",
+    "- `logs/vllm-speech.log` — Granite Speech STT server\n",
+    "- `logs/vllm-switch.log` — Granite Switch LLM server\n",
+    "- `logs/backend.log` — Pipecat backend (look here for TURN minting messages)\n",
+    "- `logs/frontend.log` — Next.js production server\n",
+    "- `logs/cloudflared.log` — Cloudflare tunnel (the public URL is in here)\n",
+    "\n",
+    "View one with `!tail -100 logs/vllm-speech.log` (or open the file from the Colab file browser).\n",
+    "\n",
+    "**Common failures:**\n",
+    "- *T4 OOM:* switch the runtime to A100 or L4. Both Granite models won't fit on a T4.\n",
+    "- *`HF_TOKEN` missing:* re-run Cell 3 after adding the secret. Without it, the backend falls back to STUN-only and audio likely won't connect through the cloudflared tunnel.\n",
+    "- *Stuck \"waiting for vLLM\":* model weights are downloading. The cell waits up to 20 min — let it run.\n",
+    "- *Re-running cells without cleaning up:* old processes still hold the ports. Run the kill-switch cell below, then re-run from the top.\n",
+    "\n",
+    "## Caveats\n",
+    "\n",
+    "- The `*.trycloudflare.com` URL is public for as long as this notebook runs. Anyone with the link can join the session.\n",
+    "- Colab kernels die after ~24h or when idle. Restart the notebook to get a fresh URL.\n",
+    "- One Colab session serves one user. Each reader runs their own copy of this notebook.\n",
+    "\n",
+    "## Kill switch — clean up before re-running\n",
+    "\n",
+    "Run this if you need to re-run any of the launch cells. It stops the tunnel, frontend, backend, and both vLLM processes.\n",
+    "\n",
+    "**This cell is gated** so \"Run all\" won't tear down the stack you just brought up. To actually run it, uncomment the `RUN_KILL_SWITCH = True` line at the top of the cell, then run the cell."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import subprocess\n",
+    "\n",
+    "# Uncomment the next line to actually run the kill switch.\n",
+    "# This guard exists so \"Run all\" doesn't tear down the stack you just brought up.\n",
+    "# RUN_KILL_SWITCH = True\n",
+    "\n",
+    "if not globals().get(\"RUN_KILL_SWITCH\"):\n",
+    "    print(\"Kill switch is disabled. Uncomment `RUN_KILL_SWITCH = True` above and re-run this cell to stop all processes.\")\n",
+    "else:\n",
+    "    # Stop tracked Popen handles from this kernel session.\n",
+    "    for name, p in [\n",
+    "        (\"cloudflared\", globals().get(\"tunnel_proc\")),\n",
+    "        (\"frontend\", globals().get(\"frontend_proc\")),\n",
+    "        (\"backend\", globals().get(\"backend_proc\")),\n",
+    "        (\"vllm-switch\", globals().get(\"switch_proc\")),\n",
+    "        (\"vllm-speech\", globals().get(\"speech_proc\")),\n",
+    "    ]:\n",
+    "        if p is not None and p.poll() is None:\n",
+    "            p.terminate()\n",
+    "            try:\n",
+    "                p.wait(timeout=10)\n",
+    "            except Exception:\n",
+    "                p.kill()\n",
+    "            print(f\"🛑 stopped {name}\")\n",
+    "        else:\n",
+    "            print(f\" {name}: not tracked / already dead\")\n",
+    "\n",
+    "    # Also kill by name — catches processes whose Popen handles got lost across\n",
+    "    # cell re-runs or kernel restarts. Without this, GPU memory stays held by\n",
+    "    # zombie vllm processes and the next Cell 5 run fails with OOM at startup.\n",
+    "    # We run the frontend in prod mode (`npm run start` -> `next start`), so\n",
+    "    # match `next start` rather than `next dev`.\n",
+    "    for pattern in [\"vllm\", \"cloudflared tunnel\", \"granite_speech_demo.server\", \"next start\", \"node.*next\"]:\n",
+    "        subprocess.run(f\"pkill -9 -f '{pattern}' || true\", shell=True)\n",
+    "        print(f\"🧹 pkill -9 -f '{pattern}'\")\n",
+    "\n",
+    "    # Final safety net: free the ports the launch cells bind to. If a process\n",
+    "    # slipped past the name-based pkill above, this kills whatever is still\n",
+    "    # listening so re-running Cell 5 / Cell 6 doesn't fail with EADDRINUSE.\n",
+    "    # 3000 = Next.js frontend\n",
+    "    # 7860 = Pipecat backend\n",
+    "    # 8000 = Granite Switch vLLM\n",
+    "    # 8083 = Granite Speech vLLM\n",
+    "    for port in (3000, 7860, 8000, 8083):\n",
+    "        subprocess.run(f\"fuser -k {port}/tcp 2>/dev/null || true\", shell=True)\n",
+    "        print(f\"🔓 freed port {port}\")\n",
+    "\n",
+    "    print(\"\\nIf any vllm processes were running, GPU memory should now be freed.\")\n",
+    "    print(\"Run `!nvidia-smi` to confirm before re-running Cell 5.\")"
+   ]
+  }
+ ],
+ "metadata": {
+  "accelerator": "GPU",
+  "colab": {
+   "gpuType": "A100",
+   "provenance": []
+  },
+  "kernelspec": {
+   "display_name": "Python 3",
+   "name": "python3"
+  },
+  "language_info": {
+   "name": "python"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 0
+}
\ No newline at end of file