lightspeed-core · asimurka · May 19, 2026
diff --git a/Makefile b/Makefile
@@ -16,7 +16,6 @@ run: ## Run the service locally
 
 run-llama-stack: ## Start Llama Stack with enriched config (for local service mode)
 	uv run src/llama_stack_configuration.py -c $(CONFIG) -i $(LLAMA_STACK_CONFIG) -o $(LLAMA_STACK_CONFIG) && \
-	AZURE_API_KEY=$$(grep '^AZURE_API_KEY=' .env | cut -d'=' -f2-) \
 	uv run llama stack run $(LLAMA_STACK_CONFIG)
 
 test-unit: ## Run the unit tests

diff --git a/docs/providers.md b/docs/providers.md
@@ -92,51 +92,44 @@ azure_entra_id:
 
 #### Llama Stack Configuration Requirements
 
-Because Lightspeed builds on top of Llama Stack, certain configuration fields are required to satisfy the base Llama Stack schema. The config block for the Azure inference provider **must** include `api_key`, `api_base`, and `api_version` — Llama Stack will fail to start if any of these are missing.
+Because Lightspeed builds on top of Llama Stack, certain configuration fields are required to satisfy the base Llama Stack schema. The config block for the Azure inference provider **must** include `base_url` and `api_version`. When Entra ID authentication is used, `api_key` does not need to be configured, because the API key (an Entra ID access token) is acquired and passed automatically at runtime.
 
-**Important:** The `api_key` field must be set to `${env.AZURE_API_KEY}` exactly as shown below. This is not optional — Lightspeed uses this specific environment variable name as a placeholder for injection of the Entra ID access token. Using a different variable name will break the authentication flow.
+When `azure_entra_id` is configured in Lightspeed, config enrichment automatically sets `model_validation: false` on the `remote::azure` provider so Llama Stack can start without validating models against Azure at startup.
 
 ```yaml
 inference:
   - provider_id: azure
     provider_type: remote::azure
     config:
-      api_key: ${env.AZURE_API_KEY}    # Must be exactly this - placeholder for Entra ID token
-      api_base: ${env.AZURE_API_BASE}
+      # api_key: ${env.AZURE_API_KEY}  # Can be omitted when Entra ID is configured in LCORE
+      base_url: ${env.AZURE_API_BASE}
       api_version: 2025-01-01-preview
+      model_validation: false  # added automatically by Lightspeed enrichment
 ```
 
-**How it works:** At startup, Lightspeed acquires an Entra ID access token and stores it in the `AZURE_API_KEY` environment variable. When Llama Stack initializes, it reads the config, substitutes `${env.AZURE_API_KEY}` with the token value, and uses it to authenticate with Azure OpenAI. Llama Stack also calls `models.list()` during initialization to validate provider connectivity, which is why the token must be available before client initialization.
+**How it works:** Llama Stack defers Azure authentication to inference time. Lightspeed acquires Entra ID tokens at runtime and passes them via the `X-LlamaStack-Provider-Data` header (`azure_api_key`, `azure_api_base`).
 
 #### Access Token Lifecycle and Management
 
-**Library mode startup:**
+**Lightspeed startup (library and service mode):**
 1. Lightspeed reads your Entra ID configuration
-2. Acquires an initial access token from Microsoft Entra ID
-3. Stores the token in the `AZURE_API_KEY` environment variable
-4. **Then** initializes the Llama Stack library client
+2. Does not acquire or cache access tokens at startup; authentication is deferred until request time
+3. Initializes the Llama Stack client without Azure credentials; credentials are supplied later via `X-LlamaStack-Provider-Data` when an Azure model is used
 
-This ordering is critical because Llama Stack calls `models.list()` during initialization to validate provider connectivity. If the token is not set before client initialization, Azure requests will fail with authentication errors.
-
-**Service mode startup:**
-
-When running Llama Stack as a separate service, Lightspeed runs a pre-startup script that:
-1. Reads the Entra ID configuration
-2. Acquires an initial access token
-3. Writes the token to the `AZURE_API_KEY` environment variable
-4. **Then** Llama Stack service starts
-
-This initial token is used solely for the `models.list()` validation call during Llama Stack startup. After startup, Lightspeed manages token refresh independently and passes fresh tokens via request headers.
+**Llama Stack service startup (container mode):**
+1. Config enrichment sets `model_validation: false` on the Azure provider
+2. Llama Stack starts without authenticating models against Azure
+3. Lightspeed connects to this service at startup without Azure credentials; tokens are added only for Azure inference requests
 
 **During inference requests:**
 1. Before each request, Lightspeed checks if the token has expired
-2. If expired, a new token is automatically acquired and the environment variable is updated
-3. For library mode: the Llama Stack client is reloaded to pick up the new token
-4. For service mode: the token is passed via `X-LlamaStack-Provider-Data` request headers
+2. If expired, a new token is automatically acquired and cached in memory
+3. The token is passed via `X-LlamaStack-Provider-Data` (library and service mode)
 
 **Token security:**
 - Access tokens are wrapped in `SecretStr` to prevent accidental logging
-- Tokens are stored only in the `AZURE_API_KEY` environment variable (single source of truth)
+- Tokens are cached in memory by the `AzureEntraIDManager` singleton
+- Tokens are passed to Llama Stack via the `X-LlamaStack-Provider-Data` request header
 - Each Uvicorn worker maintains its own token lifecycle independently
 
 **Token validity:**

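Reviewer note (illustrative sketch, not part of this patch): the documentation above says the token travels in the `X-LlamaStack-Provider-Data` header under the keys `azure_api_key` and `azure_api_base`. The standalone snippet below shows one way such a header value could be assembled and merged with existing provider data. The function name `build_provider_data_header` and the example URL are hypothetical; only the header name and the two keys come from the diff.

```python
import json
from typing import Optional


def build_provider_data_header(
    token: str,
    base_url: Optional[str],
    existing_header: Optional[str] = None,
) -> Optional[dict[str, str]]:
    """Return a header dict carrying Azure credentials, or None if incomplete.

    Hypothetical helper: it mirrors the documented behaviour, not the
    project's actual implementation.
    """
    # Without both a token and a base URL there is nothing useful to send.
    if not token or base_url is None:
        return None

    # Preserve any provider data that is already present in the header.
    provider_data: dict[str, str] = {}
    if existing_header:
        try:
            parsed = json.loads(existing_header)
            if isinstance(parsed, dict):
                provider_data = parsed
        except json.JSONDecodeError:
            # Malformed existing data is dropped rather than propagated.
            provider_data = {}

    provider_data.update({"azure_api_key": token, "azure_api_base": base_url})
    return {"X-LlamaStack-Provider-Data": json.dumps(provider_data)}


header = build_provider_data_header(
    "example-token",
    "https://example.invalid/openai/v1",  # placeholder URL
    existing_header='{"other_key": "kept"}',
)
```

Returning `None` when either value is missing matches the documented idea that credentials are attached only when an Azure request can actually be authenticated.
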
diff --git a/docs/rag_guide.md b/docs/rag_guide.md
@@ -83,7 +83,6 @@ The script reads your `lightspeed-stack.yaml` configuration and enriches a base
 - `-c, --config`: Lightspeed config file (default: `lightspeed-stack.yaml`)
 - `-i, --input`: Input Llama Stack config (default: `run.yaml`)
 - `-o, --output`: Output enriched config (default: `run_.yaml`)
-- `-e, --env-file`: Path to .env file for AZURE_API_KEY (default: `.env`)
 
 > [!TIP]
 > Use this script to generate your initial `run.yaml` configuration, then manually customize as needed for your specific setup.

diff --git a/examples/azure-run.yaml b/examples/azure-run.yaml
@@ -22,9 +22,9 @@ providers:
   - provider_id: azure
     provider_type: remote::azure
     config: 
-      api_key: ${env.AZURE_API_KEY}
       base_url: https://ols-test.openai.azure.com/openai/v1
       api_version: 2024-02-15-preview
+      model_validation: false
   - provider_id: openai
     provider_type: remote::openai
     config:

diff --git a/scripts/llama-stack-entrypoint.sh b/scripts/llama-stack-entrypoint.sh
@@ -7,7 +7,6 @@ set -e
 INPUT_CONFIG="${LLAMA_STACK_CONFIG:-/opt/app-root/run.yaml}"
 ENRICHED_CONFIG="/opt/app-root/run.yaml"
 LIGHTSPEED_CONFIG="${LIGHTSPEED_CONFIG:-/opt/app-root/lightspeed-stack.yaml}"
-ENV_FILE="/opt/app-root/.env"
 
 # Enrich config if lightspeed config exists
 if [ -f "$LIGHTSPEED_CONFIG" ]; then
@@ -16,14 +15,7 @@ if [ -f "$LIGHTSPEED_CONFIG" ]; then
     python3 /opt/app-root/llama_stack_configuration.py \
         -c "$LIGHTSPEED_CONFIG" \
         -i "$INPUT_CONFIG" \
-        -o "$ENRICHED_CONFIG" \
-        -e "$ENV_FILE" 2>&1 || ENRICHMENT_FAILED=1
-
-    # Source .env if generated (contains AZURE_API_KEY)
-    if [ -f "$ENV_FILE" ]; then
-        # shellcheck source=/dev/null
-        set -a && . "$ENV_FILE" && set +a
-    fi
+        -o "$ENRICHED_CONFIG" 2>&1 || ENRICHMENT_FAILED=1
 
     if [ -f "$ENRICHED_CONFIG" ] && [ "$ENRICHMENT_FAILED" -eq 0 ]; then
         echo "Using enriched config: $ENRICHED_CONFIG"

diff --git a/src/app/endpoints/query.py b/src/app/endpoints/query.py
@@ -54,7 +54,6 @@
     is_context_length_error,
     prepare_input,
     store_query_results,
-    update_azure_token,
     validate_attachments_metadata,
     validate_model_provider_override,
 )
@@ -204,7 +203,7 @@ async def query_endpoint_handler(
         and AzureEntraIDManager().is_token_expired
         and AzureEntraIDManager().refresh_token()
     ):
-        client = await update_azure_token(client)
+        client = await AsyncLlamaStackClientHolder().update_azure_token()
 
     # Retrieve response using Responses API
     turn_summary = await retrieve_response(

diff --git a/src/app/endpoints/responses.py b/src/app/endpoints/responses.py
@@ -78,7 +78,6 @@
     handle_known_apistatus_errors,
     is_context_length_error,
     store_query_results,
-    update_azure_token,
     validate_model_provider_override,
 )
 from utils.quota import check_tokens_available, get_available_quotas
@@ -405,7 +404,7 @@ async def responses_endpoint_handler(
         and AzureEntraIDManager().is_token_expired
         and AzureEntraIDManager().refresh_token()
     ):
-        client = await update_azure_token(client)
+        client = await AsyncLlamaStackClientHolder().update_azure_token()
 
     input_text = (
         original_request.input

diff --git a/src/app/endpoints/streaming_query.py b/src/app/endpoints/streaming_query.py
@@ -92,7 +92,6 @@
     is_context_length_error,
     prepare_input,
     store_query_results,
-    update_azure_token,
     update_conversation_topic_summary,
     validate_attachments_metadata,
     validate_model_provider_override,
@@ -262,7 +261,7 @@ async def streaming_query_endpoint_handler(  # pylint: disable=too-many-locals
         and AzureEntraIDManager().is_token_expired
         and AzureEntraIDManager().refresh_token()
     ):
-        client = await update_azure_token(client)
+        client = await AsyncLlamaStackClientHolder().update_azure_token()
 
     request_id = get_suid()
 

diff --git a/src/app/main.py b/src/app/main.py
@@ -77,15 +77,6 @@ async def lifespan(_app: FastAPI) -> AsyncIterator[None]:
 
     initialize_sentry()
 
-    azure_config = configuration.configuration.azure_entra_id
-    if azure_config is not None:
-        AzureEntraIDManager().set_config(azure_config)
-        if not AzureEntraIDManager().refresh_token():
-            logger.warning(
-                "Failed to refresh Azure token at startup. "
-                "Token refresh will be retried on next Azure request."
-            )
-
     llama_stack_config = configuration.configuration.llama_stack
     await AsyncLlamaStackClientHolder().load(llama_stack_config)
     client = AsyncLlamaStackClientHolder().get_client()
@@ -104,6 +95,11 @@ async def lifespan(_app: FastAPI) -> AsyncIterator[None]:
         )
         raise
 
+    azure_entra_id_config = configuration.configuration.azure_entra_id
+    if azure_entra_id_config is not None:
+        AzureEntraIDManager().set_config(azure_entra_id_config)
+        azure_base_url = await AsyncLlamaStackClientHolder().get_azure_base_url()
+        AzureEntraIDManager().set_base_url(azure_base_url)
     logger.info("Registering MCP servers")
     await register_mcp_servers_async(logger, configuration.configuration)
     logger.info("App startup complete")

diff --git a/src/authorization/azure_token_manager.py b/src/authorization/azure_token_manager.py
@@ -1,6 +1,5 @@
 """Azure Entra ID token manager for Azure OpenAI authentication."""
 
-import os
 import time
 from typing import Optional
 
@@ -34,7 +33,13 @@ class AzureEntraIDManager(metaclass=Singleton):
     def __init__(self) -> None:
         """Initialize the token manager with empty state."""
         self._expires_on: int = 0
+        self._access_token: SecretStr = SecretStr("")
         self._entra_id_config: Optional[AzureEntraIdConfiguration] = None
+        self._azure_base_url: Optional[str] = None
+
+    def set_base_url(self, base_url: Optional[str]) -> None:
+        """Set the Azure API base."""
+        self._azure_base_url = base_url
 
     def set_config(self, azure_config: AzureEntraIdConfiguration) -> None:
         """Set the Azure Entra ID configuration."""
@@ -53,8 +58,24 @@ def is_token_expired(self) -> bool:
 
     @property
     def access_token(self) -> SecretStr:
-        """Return the access token from environment variable as SecretStr."""
-        return SecretStr(os.environ.get("AZURE_API_KEY", ""))
+        """Return the cached access token."""
+        return self._access_token
+
+    @property
+    def azure_base_url(self) -> Optional[str]:
+        """Return the cached Azure API base."""
+        return self._azure_base_url
+
+    def build_azure_provider_data(self) -> Optional[dict[str, str]]:
+        """Build azure_api_key and azure_api_base entries for provider data.
+
+        Returns:
+            Provider data dict when a token and base_url are available.
+        """
+        token = self.access_token.get_secret_value()
+        if not token or self.azure_base_url is None:
+            return None
+        return {"azure_api_key": token, "azure_api_base": self.azure_base_url}
 
     def refresh_token(self) -> bool:
         """Refresh the cached Azure access token.
@@ -76,9 +97,9 @@ def refresh_token(self) -> bool:
         return False
 
     def _update_access_token(self, token: str, expires_on: int) -> None:
-        """Update the token in env var and track expiration time."""
+        """Update the cached token and track expiration time."""
+        self._access_token = SecretStr(token)
         self._expires_on = expires_on - TOKEN_EXPIRATION_LEEWAY
-        os.environ["AZURE_API_KEY"] = token
         expiry_time = time.strftime(
             "%Y-%m-%d %H:%M:%S", time.localtime(self._expires_on)
         )

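Reviewer note (illustrative sketch, not part of this patch): the change above moves the token from the `AZURE_API_KEY` environment variable into an in-memory field and records the expiry minus `TOKEN_EXPIRATION_LEEWAY`. The snippet below models that caching pattern in isolation. The class name `TokenCache`, the leeway value of 30 seconds, and the exact expiry comparison are assumptions; the real `is_token_expired` body is not shown in the diff, and the real class wraps the token in `SecretStr`, which is omitted here to keep the sketch dependency-free.

```python
import time
from typing import Optional

# Assumed value for illustration; the real constant is defined elsewhere.
TOKEN_EXPIRATION_LEEWAY = 30  # seconds


class TokenCache:
    """Hypothetical in-memory token cache with an expiry safety margin."""

    def __init__(self) -> None:
        self._token: str = ""
        self._expires_on: int = 0  # epoch seconds; 0 means "no token yet"

    def update(self, token: str, expires_on: int) -> None:
        """Store the token and treat it as expired slightly early."""
        self._token = token
        self._expires_on = expires_on - TOKEN_EXPIRATION_LEEWAY

    def is_expired(self, now: Optional[float] = None) -> bool:
        """Report whether the cached token should be refreshed."""
        current = time.time() if now is None else now
        return current >= self._expires_on

    @property
    def token(self) -> str:
        return self._token


cache = TokenCache()
cache.update("example-token", expires_on=1_000)
```

Subtracting the leeway when the token is stored means a refresh is triggered shortly before the real expiry, so a request is less likely to go out with a token that lapses in flight.
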
diff --git a/src/client.py b/src/client.py
@@ -3,15 +3,21 @@
 import json
 import os
 import tempfile
-from typing import Optional
+from typing import Optional, cast
 
 import yaml
 from fastapi import HTTPException
 from llama_stack.core.library_client import AsyncLlamaStackAsLibraryClient
 from llama_stack_client import APIConnectionError, APIStatusError, AsyncLlamaStackClient
 
+from authorization.azure_token_manager import AzureEntraIDManager
 from configuration import configuration
-from llama_stack_configuration import YamlDumper, enrich_byok_rag, enrich_solr
+from llama_stack_configuration import (
+    YamlDumper,
+    enrich_azure_entra_id_inference,
+    enrich_byok_rag,
+    enrich_solr,
+)
 from log import get_logger
 from models.api.responses.error import ServiceUnavailableResponse
 from models.config import LlamaStackConfiguration
@@ -90,6 +96,12 @@ def _enrich_library_config(self, input_config_path: str) -> str:
         # Enrichment: Solr - enabled when "okp" appears in either inline or tool list
         enrich_solr(ls_config, config.rag.model_dump(), config.okp.model_dump())
 
+        # Enrichment: Azure Entra ID deferred auth
+        entra_id_config = (
+            config.azure_entra_id.model_dump() if config.azure_entra_id else None
+        )
+        enrich_azure_entra_id_inference(ls_config, entra_id_config)
+
         enriched_path = os.path.join(
             tempfile.gettempdir(), "llama_stack_enriched_config.yaml"
         )
@@ -211,23 +223,35 @@ async def check_model_available(self, model_id: str) -> tuple[bool, str]:
         )
         return False, f"Model {model_id} not found in model registry"
 
-    def update_provider_data(self, updates: dict[str, str]) -> AsyncLlamaStackClient:
-        """Update provider data headers for service client.
-
-        For use with service mode only.
-
-        Args:
-            updates: Key-value pairs to merge into provider data header.
+    async def update_azure_token(self) -> AsyncLlamaStackClient:
+        """Apply cached Azure credentials and replace the held client.
 
         Returns:
-            The updated client instance.
+            The new client instance assigned to this holder.
         """
-        if not self._lsc:
-            raise RuntimeError(
-                "AsyncLlamaStackClient has not been initialised. Ensure 'load(..)' has been called."
+        updates = AzureEntraIDManager().build_azure_provider_data()
+        if not updates:
+            return self.get_client()
+
+        if self.is_library_client:
+            if not self._config_path:
+                logger.warning("Cannot update Azure token: config path not set")
+                return self.get_client()
+
+            current_provider_data = dict(
+                cast(AsyncLlamaStackAsLibraryClient, self._lsc).provider_data or {}
+            )
+            current_provider_data.update(updates)
+            client = AsyncLlamaStackAsLibraryClient(
+                self._config_path, provider_data=current_provider_data
             )
+            await client.initialize()
+            self._lsc = client
+            return client
 
-        current_headers = self._lsc.default_headers or {}
+        # Service client mode
+        current_client = self.get_client()
+        current_headers = current_client.default_headers or {}
         provider_data_json = current_headers.get("X-LlamaStack-Provider-Data")
 
         try:
@@ -242,5 +266,32 @@ def update_provider_data(self, updates: dict[str, str]) -> AsyncLlamaStackClient
             "X-LlamaStack-Provider-Data": json.dumps(provider_data),
         }
 
-        self._lsc = self._lsc.copy(set_default_headers=updated_headers)  # type: ignore[arg-type]
-        return self._lsc
+        updated_client = current_client.copy(
+            set_default_headers=updated_headers  # type: ignore[arg-type]
+        )
+        self._lsc = updated_client
+        return updated_client
+
+    async def get_azure_base_url(self) -> Optional[str]:
+        """
+        Retrieve the Azure base_url endpoint from the remote Llama Stack provider configuration.
+
+        Returns:
+            Optional[str]: The Azure base_url if available, otherwise None.
+        """
+        if not self._lsc:
+            return None
+
+        try:
+            providers = await self._lsc.providers.list()
+        except (APIConnectionError, APIStatusError) as err:
+            logger.warning("Failed to list providers for Azure base_url: %s", err)
+            return None
+
+        for provider in providers:
+            if provider.provider_type != "remote::azure":
+                continue
+            base = provider.config.get("base_url")
+            if base is not None:
+                return str(base)
+        return None
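
Reviewer note (illustrative sketch, not part of this patch): the selection loop in `get_azure_base_url` can be exercised without a running Llama Stack by separating it from the `providers.list()` call. The snippet below does that with a hypothetical `ProviderInfo` stand-in for the objects the client returns; only the `provider_type` and `config` attributes and the `"remote::azure"` and `"base_url"` strings are taken from the diff.

```python
from dataclasses import dataclass, field
from typing import Any, Optional


@dataclass
class ProviderInfo:
    """Hypothetical stand-in for a provider entry returned by the client."""

    provider_type: str
    config: dict[str, Any] = field(default_factory=dict)


def find_azure_base_url(providers: list[ProviderInfo]) -> Optional[str]:
    """Return the first non-null base_url of a remote::azure provider."""
    for provider in providers:
        if provider.provider_type != "remote::azure":
            continue
        base = provider.config.get("base_url")
        if base is not None:
            return str(base)
    return None


providers = [
    ProviderInfo("remote::openai", {"base_url": "https://openai.example.invalid"}),
    ProviderInfo("remote::azure", {}),  # Azure provider without a base_url
    ProviderInfo("remote::azure", {"base_url": "https://azure.example.invalid/openai/v1"}),
]
azure_url = find_azure_base_url(providers)
```

As in the patched method, an Azure provider without `base_url` is skipped rather than ending the search, so a later Azure provider can still supply the value.
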