1 change: 0 additions & 1 deletion Makefile
@@ -16,7 +16,6 @@ run: ## Run the service locally

run-llama-stack: ## Start Llama Stack with enriched config (for local service mode)
uv run src/llama_stack_configuration.py -c $(CONFIG) -i $(LLAMA_STACK_CONFIG) -o $(LLAMA_STACK_CONFIG) && \
AZURE_API_KEY=$$(grep '^AZURE_API_KEY=' .env | cut -d'=' -f2-) \
uv run llama stack run $(LLAMA_STACK_CONFIG)

test-unit: ## Run the unit tests
41 changes: 17 additions & 24 deletions docs/providers.md
@@ -92,51 +92,44 @@ azure_entra_id:

#### Llama Stack Configuration Requirements

Because Lightspeed builds on top of Llama Stack, certain configuration fields are required to satisfy the base Llama Stack schema. The config block for the Azure inference provider **must** include `api_key`, `api_base`, and `api_version` — Llama Stack will fail to start if any of these are missing.
Because Lightspeed builds on top of Llama Stack, certain configuration fields are required to satisfy the base Llama Stack schema. The config block for the Azure inference provider **must** include `base_url` and `api_version`. When using Entra ID authentication, `api_key` is not required to be configured, since the API key is acquired and passed automatically at runtime.

**Important:** The `api_key` field must be set to `${env.AZURE_API_KEY}` exactly as shown below. This is not optional — Lightspeed uses this specific environment variable name as a placeholder for injection of the Entra ID access token. Using a different variable name will break the authentication flow.
When `azure_entra_id` is configured in Lightspeed, config enrichment automatically sets `model_validation: false` on the `remote::azure` provider so Llama Stack can start without validating models against Azure at startup.

```yaml
inference:
- provider_id: azure
provider_type: remote::azure
config:
api_key: ${env.AZURE_API_KEY} # Must be exactly this - placeholder for Entra ID token
api_base: ${env.AZURE_API_BASE}
# api_key: ${env.AZURE_API_KEY} # Can be omitted when Entra ID configured in LCORE
base_url: ${env.AZURE_API_BASE}
# Review comment (contributor author): change in attribute name (`api_base` became `base_url`)
api_version: 2025-01-01-preview
model_validation: false # added automatically by Lightspeed enrichment
```

**How it works:** At startup, Lightspeed acquires an Entra ID access token and stores it in the `AZURE_API_KEY` environment variable. When Llama Stack initializes, it reads the config, substitutes `${env.AZURE_API_KEY}` with the token value, and uses it to authenticate with Azure OpenAI. Llama Stack also calls `models.list()` during initialization to validate provider connectivity, which is why the token must be available before client initialization.
**How it works:** Llama Stack defers Azure authentication to inference time. Lightspeed acquires Entra ID tokens at runtime and passes them via the `X-LlamaStack-Provider-Data` header (`azure_api_key`, `azure_api_base`).
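
As a rough illustration of the header described above, the sketch below serializes the two provider-data keys into a JSON string and attaches it under `X-LlamaStack-Provider-Data`. The helper name `build_provider_data_header` and the sample values are assumptions made for this example; they are not part of the PR.

```python
import json


def build_provider_data_header(token: str, api_base: str) -> dict[str, str]:
    """Return request headers carrying Azure credentials as provider data.

    Hypothetical helper: only the header name and the two key names come
    from the documentation above; everything else is illustrative.
    """
    provider_data = {"azure_api_key": token, "azure_api_base": api_base}
    return {"X-LlamaStack-Provider-Data": json.dumps(provider_data)}


# Placeholder values, not real credentials or endpoints.
headers = build_provider_data_header("example-token", "https://example.invalid")
```

The header value is a single JSON object, so any other provider-data keys already in use would need to be merged into the same object rather than sent as a second header.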

#### Access Token Lifecycle and Management

**Library mode startup:**
**Lightspeed startup (library and service mode):**
1. Lightspeed reads your Entra ID configuration
2. Acquires an initial access token from Microsoft Entra ID
3. Stores the token in the `AZURE_API_KEY` environment variable
4. **Then** initializes the Llama Stack library client
2. Does not acquire or cache access tokens at startup—authentication is deferred until request time
3. Initializes the Llama Stack client without Azure credentials; credentials are supplied later via `X-LlamaStack-Provider-Data` when an Azure model is used

This ordering is critical because Llama Stack calls `models.list()` during initialization to validate provider connectivity. If the token is not set before client initialization, Azure requests will fail with authentication errors.

**Service mode startup:**

When running Llama Stack as a separate service, Lightspeed runs a pre-startup script that:
1. Reads the Entra ID configuration
2. Acquires an initial access token
3. Writes the token to the `AZURE_API_KEY` environment variable
4. **Then** Llama Stack service starts

This initial token is used solely for the `models.list()` validation call during Llama Stack startup. After startup, Lightspeed manages token refresh independently and passes fresh tokens via request headers.
**Llama Stack service startup (container mode):**
1. Config enrichment sets `model_validation: false` on the Azure provider
2. Llama Stack starts without authenticating models against Azure
3. Lightspeed connects to this service at startup without Azure credentials; tokens are added only for Azure inference requests
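
The enrichment step in the list above could look roughly like the following. This is a minimal sketch, not the PR's `enrich_azure_entra_id_inference` implementation (which is not shown in this diff): the function name `disable_azure_model_validation` is hypothetical, and the `providers.inference` layout of the config dict is an assumption based on the YAML examples.

```python
from typing import Any, Optional


def disable_azure_model_validation(
    ls_config: dict[str, Any], entra_id_config: Optional[dict[str, Any]]
) -> None:
    """Set model_validation: false on every remote::azure inference provider.

    Hypothetical stand-in for the enrichment step. It mutates ls_config in
    place and does nothing when Entra ID is not configured.
    """
    if entra_id_config is None:
        return
    # Assumed layout: {"providers": {"inference": [ {...}, ... ]}}
    providers = ls_config.get("providers", {}).get("inference", [])
    for provider in providers:
        if provider.get("provider_type") == "remote::azure":
            provider.setdefault("config", {})["model_validation"] = False
```

Other providers are left untouched, so a mixed configuration (for example Azure plus OpenAI) keeps its default validation behavior for the non-Azure entries.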

**During inference requests:**
1. Before each request, Lightspeed checks if the token has expired
2. If expired, a new token is automatically acquired and the environment variable is updated
3. For library mode: the Llama Stack client is reloaded to pick up the new token
4. For service mode: the token is passed via `X-LlamaStack-Provider-Data` request headers
2. If expired, a new token is automatically acquired and cached in memory
3. The token is passed via `X-LlamaStack-Provider-Data` (library and service mode)
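
The request-time steps above can be sketched as follows. The three member names (`is_token_expired`, `refresh_token`, `build_azure_provider_data`) mirror the manager shown elsewhere in this PR; the `TokenManager` protocol, the `provider_data_for_request` function, and the `FakeManager` stub are assumptions added so the example runs without Azure access. It is a simplification: the actual handlers only rebuild the client when a refresh has just succeeded.

```python
from typing import Optional, Protocol


class TokenManager(Protocol):
    """Subset of the manager interface used by this sketch."""

    @property
    def is_token_expired(self) -> bool: ...

    def refresh_token(self) -> bool: ...

    def build_azure_provider_data(self) -> Optional[dict[str, str]]: ...


def provider_data_for_request(manager: TokenManager) -> Optional[dict[str, str]]:
    """Refresh the token if it has expired, then return provider data or None."""
    if manager.is_token_expired and not manager.refresh_token():
        return None  # refresh failed, so no credentials can be attached
    return manager.build_azure_provider_data()


class FakeManager:
    """In-memory stand-in (hypothetical) that never contacts Entra ID."""

    def __init__(self) -> None:
        self.token = ""
        self.refresh_calls = 0

    @property
    def is_token_expired(self) -> bool:
        return not self.token

    def refresh_token(self) -> bool:
        self.refresh_calls += 1
        self.token = "fake-token"
        return True

    def build_azure_provider_data(self) -> Optional[dict[str, str]]:
        if not self.token:
            return None
        return {"azure_api_key": self.token, "azure_api_base": "https://example.invalid"}
```

With the stub, the first call triggers one refresh and later calls reuse the cached token until it is reported as expired again.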

**Token security:**
- Access tokens are wrapped in `SecretStr` to prevent accidental logging
- Tokens are stored only in the `AZURE_API_KEY` environment variable (single source of truth)
- Tokens are cached in `AzureEntraIDManager` singleton class
- Inference uses `X-LlamaStack-Provider-Data` headers
- Each Uvicorn worker maintains its own token lifecycle independently

**Token validity:**
1 change: 0 additions & 1 deletion docs/rag_guide.md
@@ -83,7 +83,6 @@ The script reads your `lightspeed-stack.yaml` configuration and enriches a base
- `-c, --config`: Lightspeed config file (default: `lightspeed-stack.yaml`)
- `-i, --input`: Input Llama Stack config (default: `run.yaml`)
- `-o, --output`: Output enriched config (default: `run_.yaml`)
- `-e, --env-file`: Path to .env file for AZURE_API_KEY (default: `.env`)

> [!TIP]
> Use this script to generate your initial `run.yaml` configuration, then manually customize as needed for your specific setup.
2 changes: 1 addition & 1 deletion examples/azure-run.yaml
@@ -22,9 +22,9 @@ providers:
- provider_id: azure
provider_type: remote::azure
config:
api_key: ${env.AZURE_API_KEY}
base_url: https://ols-test.openai.azure.com/openai/v1
api_version: 2024-02-15-preview
model_validation: false
- provider_id: openai
provider_type: remote::openai
config:
10 changes: 1 addition & 9 deletions scripts/llama-stack-entrypoint.sh
@@ -7,7 +7,6 @@ set -e
INPUT_CONFIG="${LLAMA_STACK_CONFIG:-/opt/app-root/run.yaml}"
ENRICHED_CONFIG="/opt/app-root/run.yaml"
LIGHTSPEED_CONFIG="${LIGHTSPEED_CONFIG:-/opt/app-root/lightspeed-stack.yaml}"
ENV_FILE="/opt/app-root/.env"

# Enrich config if lightspeed config exists
if [ -f "$LIGHTSPEED_CONFIG" ]; then
@@ -16,14 +15,7 @@ if [ -f "$LIGHTSPEED_CONFIG" ]; then
python3 /opt/app-root/llama_stack_configuration.py \
-c "$LIGHTSPEED_CONFIG" \
-i "$INPUT_CONFIG" \
-o "$ENRICHED_CONFIG" \
-e "$ENV_FILE" 2>&1 || ENRICHMENT_FAILED=1

# Source .env if generated (contains AZURE_API_KEY)
if [ -f "$ENV_FILE" ]; then
# shellcheck source=/dev/null
set -a && . "$ENV_FILE" && set +a
fi
-o "$ENRICHED_CONFIG" 2>&1 || ENRICHMENT_FAILED=1
# Review comment (contributor author): the API key is no longer passed via the .env file.


if [ -f "$ENRICHED_CONFIG" ] && [ "$ENRICHMENT_FAILED" -eq 0 ]; then
echo "Using enriched config: $ENRICHED_CONFIG"
3 changes: 1 addition & 2 deletions src/app/endpoints/query.py
@@ -54,7 +54,6 @@
is_context_length_error,
prepare_input,
store_query_results,
update_azure_token,
validate_attachments_metadata,
validate_model_provider_override,
)
@@ -204,7 +203,7 @@ async def query_endpoint_handler(
and AzureEntraIDManager().is_token_expired
and AzureEntraIDManager().refresh_token()
):
client = await update_azure_token(client)
client = await AsyncLlamaStackClientHolder().update_azure_token()

# Retrieve response using Responses API
turn_summary = await retrieve_response(
3 changes: 1 addition & 2 deletions src/app/endpoints/responses.py
@@ -78,7 +78,6 @@
handle_known_apistatus_errors,
is_context_length_error,
store_query_results,
update_azure_token,
validate_model_provider_override,
)
from utils.quota import check_tokens_available, get_available_quotas
@@ -405,7 +404,7 @@ async def responses_endpoint_handler(
and AzureEntraIDManager().is_token_expired
and AzureEntraIDManager().refresh_token()
):
client = await update_azure_token(client)
client = await AsyncLlamaStackClientHolder().update_azure_token()

input_text = (
original_request.input
3 changes: 1 addition & 2 deletions src/app/endpoints/streaming_query.py
@@ -92,7 +92,6 @@
is_context_length_error,
prepare_input,
store_query_results,
update_azure_token,
update_conversation_topic_summary,
validate_attachments_metadata,
validate_model_provider_override,
@@ -262,7 +261,7 @@ async def streaming_query_endpoint_handler( # pylint: disable=too-many-locals
and AzureEntraIDManager().is_token_expired
and AzureEntraIDManager().refresh_token()
):
client = await update_azure_token(client)
client = await AsyncLlamaStackClientHolder().update_azure_token()

request_id = get_suid()

14 changes: 5 additions & 9 deletions src/app/main.py
@@ -77,15 +77,6 @@ async def lifespan(_app: FastAPI) -> AsyncIterator[None]:

initialize_sentry()

azure_config = configuration.configuration.azure_entra_id
if azure_config is not None:
AzureEntraIDManager().set_config(azure_config)
if not AzureEntraIDManager().refresh_token():
logger.warning(
"Failed to refresh Azure token at startup. "
"Token refresh will be retried on next Azure request."
)

llama_stack_config = configuration.configuration.llama_stack
await AsyncLlamaStackClientHolder().load(llama_stack_config)
client = AsyncLlamaStackClientHolder().get_client()
@@ -104,6 +95,11 @@ async def lifespan(_app: FastAPI) -> AsyncIterator[None]:
)
raise

azure_entra_id_config = configuration.configuration.azure_entra_id
if azure_entra_id_config is not None:
AzureEntraIDManager().set_config(azure_entra_id_config)
azure_base_url = await AsyncLlamaStackClientHolder().get_azure_base_url()
AzureEntraIDManager().set_base_url(azure_base_url)
# Review comment (contributor author): in the FastAPI lifespan, only set up the manager's attributes and defer token acquisition to inference requests.

logger.info("Registering MCP servers")
await register_mcp_servers_async(logger, configuration.configuration)
logger.info("App startup complete")
31 changes: 26 additions & 5 deletions src/authorization/azure_token_manager.py
@@ -1,6 +1,5 @@
"""Azure Entra ID token manager for Azure OpenAI authentication."""

import os
import time
from typing import Optional

@@ -34,7 +33,13 @@ class AzureEntraIDManager(metaclass=Singleton):
def __init__(self) -> None:
"""Initialize the token manager with empty state."""
self._expires_on: int = 0
self._access_token: SecretStr = SecretStr("")
self._entra_id_config: Optional[AzureEntraIdConfiguration] = None
self._azure_base_url: Optional[str] = None

def set_base_url(self, base_url: Optional[str]) -> None:
"""Set the Azure API base."""
self._azure_base_url = base_url

def set_config(self, azure_config: AzureEntraIdConfiguration) -> None:
"""Set the Azure Entra ID configuration."""
@@ -53,8 +58,24 @@ def is_token_expired(self) -> bool:

@property
def access_token(self) -> SecretStr:
"""Return the access token from environment variable as SecretStr."""
return SecretStr(os.environ.get("AZURE_API_KEY", ""))
"""Return the cached access token."""
return self._access_token

@property
def azure_base_url(self) -> Optional[str]:
"""Return the cached Azure API base."""
return self._azure_base_url

def build_azure_provider_data(self) -> Optional[dict[str, str]]:
"""Build azure_api_key and azure_base_url entries for provider data.

Returns:
Provider data dict when a token and base_url are available.
"""
token = self.access_token.get_secret_value()
if not token or self.azure_base_url is None:
return None
return {"azure_api_key": token, "azure_api_base": self.azure_base_url}
# Review comment (contributor author): config and validator attribute names differ in Llama Stack (api_base vs base_url).


def refresh_token(self) -> bool:
"""Refresh the cached Azure access token.
@@ -76,9 +97,9 @@ def refresh_token(self) -> bool:
return False

def _update_access_token(self, token: str, expires_on: int) -> None:
"""Update the token in env var and track expiration time."""
"""Update the cached token and track expiration time."""
self._access_token = SecretStr(token)
self._expires_on = expires_on - TOKEN_EXPIRATION_LEEWAY
os.environ["AZURE_API_KEY"] = token
expiry_time = time.strftime(
"%Y-%m-%d %H:%M:%S", time.localtime(self._expires_on)
)
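
The expiry bookkeeping in `_update_access_token` can be illustrated with a small standalone cache. This is a sketch under stated assumptions: the class name `CachedToken`, the 30-second `LEEWAY_SECONDS` value, and the comparison used in `is_expired` are all hypothetical, since neither the real `TOKEN_EXPIRATION_LEEWAY` value nor the body of `is_token_expired` appears in this diff.

```python
import time
from typing import Optional

LEEWAY_SECONDS = 30  # assumed value; the real TOKEN_EXPIRATION_LEEWAY is not shown


class CachedToken:
    """In-memory token cache that treats tokens as expired slightly early."""

    def __init__(self) -> None:
        self._token = ""
        self._expires_on = 0  # epoch seconds; 0 means "no token yet"

    def update(self, token: str, expires_on: int) -> None:
        """Store the token and subtract the leeway from its expiry time."""
        self._token = token
        self._expires_on = expires_on - LEEWAY_SECONDS

    def is_expired(self, now: Optional[float] = None) -> bool:
        """Return True once the current time passes the adjusted expiry."""
        current = time.time() if now is None else now
        return current > self._expires_on
```

Subtracting the leeway at update time means a token is refreshed shortly before Azure would reject it, which avoids sending a request with a token that expires in flight.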
83 changes: 67 additions & 16 deletions src/client.py
@@ -3,15 +3,21 @@
import json
import os
import tempfile
from typing import Optional
from typing import Optional, cast

import yaml
from fastapi import HTTPException
from llama_stack.core.library_client import AsyncLlamaStackAsLibraryClient
from llama_stack_client import APIConnectionError, APIStatusError, AsyncLlamaStackClient

from authorization.azure_token_manager import AzureEntraIDManager
from configuration import configuration
from llama_stack_configuration import YamlDumper, enrich_byok_rag, enrich_solr
from llama_stack_configuration import (
YamlDumper,
enrich_azure_entra_id_inference,
enrich_byok_rag,
enrich_solr,
)
from log import get_logger
from models.api.responses.error import ServiceUnavailableResponse
from models.config import LlamaStackConfiguration
@@ -90,6 +96,12 @@ def _enrich_library_config(self, input_config_path: str) -> str:
# Enrichment: Solr - enabled when "okp" appears in either inline or tool list
enrich_solr(ls_config, config.rag.model_dump(), config.okp.model_dump())

# Enrichment: Azure Entra ID deferred auth
entra_id_config = (
config.azure_entra_id.model_dump() if config.azure_entra_id else None
)
enrich_azure_entra_id_inference(ls_config, entra_id_config)

enriched_path = os.path.join(
tempfile.gettempdir(), "llama_stack_enriched_config.yaml"
)
@@ -211,23 +223,35 @@ async def check_model_available(self, model_id: str) -> tuple[bool, str]:
)
return False, f"Model {model_id} not found in model registry"

def update_provider_data(self, updates: dict[str, str]) -> AsyncLlamaStackClient:
"""Update provider data headers for service client.

For use with service mode only.

Args:
updates: Key-value pairs to merge into provider data header.
async def update_azure_token(self) -> AsyncLlamaStackClient:
"""Apply cached Azure credentials and replace the held client.

Returns:
The updated client instance.
The new client instance assigned to this holder.
"""
if not self._lsc:
raise RuntimeError(
"AsyncLlamaStackClient has not been initialised. Ensure 'load(..)' has been called."
updates = AzureEntraIDManager().build_azure_provider_data()
if not updates:
return self.get_client()

if self.is_library_client:
if not self._config_path:
logger.warning("Cannot update Azure token: config path not set")
return self.get_client()

current_provider_data = dict(
cast(AsyncLlamaStackAsLibraryClient, self._lsc).provider_data or {}
)
current_provider_data.update(updates)
client = AsyncLlamaStackAsLibraryClient(
self._config_path, provider_data=current_provider_data
)
await client.initialize()
self._lsc = client
return client

current_headers = self._lsc.default_headers or {}
# Service client mode
current_client = self.get_client()
current_headers = current_client.default_headers or {}
provider_data_json = current_headers.get("X-LlamaStack-Provider-Data")

try:
@@ -242,5 +266,32 @@ def update_provider_data(self, updates: dict[str, str]) -> AsyncLlamaStackClient
"X-LlamaStack-Provider-Data": json.dumps(provider_data),
}

self._lsc = self._lsc.copy(set_default_headers=updated_headers) # type: ignore[arg-type]
return self._lsc
updated_client = current_client.copy(
set_default_headers=updated_headers # type: ignore[arg-type]
)
self._lsc = updated_client
return updated_client

async def get_azure_base_url(self) -> Optional[str]:
"""
Retrieve the Azure base_url endpoint from the remote Llama Stack provider configuration.

Returns:
Optional[str]: The Azure base_url if available, otherwise None.
"""
if not self._lsc:
return None

try:
providers = await self._lsc.providers.list()
except (APIConnectionError, APIStatusError) as err:
logger.warning("Failed to list providers for Azure base_url: %s", err)
return None

for provider in providers:
if provider.provider_type != "remote::azure":
continue
base = provider.config.get("base_url")
if base is not None:
return str(base)
return None
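
For the service-client branch of `update_azure_token`, the header merge can be sketched in isolation as below. The function name `merge_provider_data` is hypothetical, and the fallback to an empty dict on malformed or non-object JSON is an assumption, because the PR's own error handling is collapsed in this diff view.

```python
import json
from typing import Mapping


def merge_provider_data(
    headers: Mapping[str, str], updates: dict[str, str]
) -> dict[str, str]:
    """Return a copy of headers with updates merged into the provider-data JSON."""
    raw = headers.get("X-LlamaStack-Provider-Data")
    try:
        provider_data = json.loads(raw) if raw else {}
    except json.JSONDecodeError:
        provider_data = {}  # assumed fallback: discard an unparsable header
    if not isinstance(provider_data, dict):
        provider_data = {}  # assumed fallback: ignore JSON that is not an object
    provider_data.update(updates)
    return {**headers, "X-LlamaStack-Provider-Data": json.dumps(provider_data)}
```

Existing provider-data keys and unrelated headers are preserved; only the keys named in `updates` are added or overwritten.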