Skip to content

LCORE-1518: Skip model authentication against Azure#1766

Open
asimurka wants to merge 1 commit into
lightspeed-core:mainfrom
asimurka:azure_skip_model_availability
Open

LCORE-1518: Skip model authentication against Azure#1766
asimurka wants to merge 1 commit into
lightspeed-core:mainfrom
asimurka:azure_skip_model_availability

Conversation

@asimurka
Copy link
Copy Markdown
Contributor

@asimurka asimurka commented May 19, 2026

Description

Optimizes Azure Entra ID token handling by skipping startup authentication for Azure models. This allows to deffer token acquisition to inference requests allowing to remove prestart token acquisition and passing by environment variables.

Type of change

  • Refactor
  • New feature
  • Bug fix
  • CVE fix
  • Optimization
  • Documentation Update
  • Configuration Update
  • Bump-up service version
  • Bump-up dependent library
  • Bump-up library or tool used for development (does not change the final image)
  • CI configuration change
  • Konflux configuration change
  • Unit tests improvement
  • Integration tests improvement
  • End to end tests improvement
  • Benchmarks improvement

Tools used to create PR

Identify any AI code assistants used in this PR (for transparency and review context)

  • Assisted-by: Cursor

Related Tickets & Documents

Checklist before requesting a review

  • I have performed a self-review of my code.
  • PR has passed all pre-merge test jobs.
  • If it is a core feature, I have added thorough tests.

Testing

  • Please provide detailed steps to perform tests related to this code change.
  • How were the fix/results from this change verified? Please provide relevant screenshots or results.

Summary by CodeRabbit

  • Refactor
    • Azure authentication flow updated: tokens are now acquired and passed at runtime via secure headers instead of being configured at startup through environment variables.
    • Simplified configuration process—no longer requires .env files for Azure setup.
    • Model validation automatically disabled for Azure providers during configuration enrichment.

Review Change Stack

@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai Bot commented May 19, 2026

Warning

Rate limit exceeded

@asimurka has exceeded the limit for the number of commits that can be reviewed per hour. Please wait 48 minutes and 42 seconds before requesting another review.

You’ve run out of usage credits. Purchase more in the billing tab.

⌛ How to resolve this issue?

After the wait time has elapsed, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

We recommend that you space out your commits to avoid hitting the rate limit.

🚦 How do rate limits work?

CodeRabbit enforces hourly rate limits for each developer per organization.

Our paid plans have higher rate limits than the trial, open-source and free plans. In all cases, we re-allow further reviews after a brief timeout.

Please see our FAQ for further information.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: b33b6373-e2bd-4577-86c8-c39585ed5d82

📥 Commits

Reviewing files that changed from the base of the PR and between 3324de6 and ba728c8.

📒 Files selected for processing (22)
  • Makefile
  • docs/providers.md
  • docs/rag_guide.md
  • examples/azure-run.yaml
  • scripts/llama-stack-entrypoint.sh
  • src/app/endpoints/query.py
  • src/app/endpoints/responses.py
  • src/app/endpoints/streaming_query.py
  • src/app/main.py
  • src/authorization/azure_token_manager.py
  • src/client.py
  • src/llama_stack_configuration.py
  • src/utils/query.py
  • tests/e2e-prow/rhoai/manifests/lightspeed/llama-stack-openai.yaml
  • tests/e2e-prow/rhoai/manifests/lightspeed/llama-stack-prow.yaml
  • tests/unit/app/endpoints/test_query.py
  • tests/unit/app/endpoints/test_responses.py
  • tests/unit/app/endpoints/test_streaming_query.py
  • tests/unit/authorization/test_azure_token_manager.py
  • tests/unit/test_client.py
  • tests/unit/test_llama_stack_configuration.py
  • tests/unit/utils/test_query.py

Walkthrough

This PR refactors Azure Entra ID token management from startup-time file generation to runtime in-memory caching. It moves token acquisition to request time via the X-LlamaStack-Provider-Data header, updates configuration enrichment to set model_validation: false instead of generating tokens, and integrates the new flow through client holder methods, endpoint handlers, and startup initialization.

Changes

Azure Token Handling Refactoring: In-Memory Caching & Runtime Acquisition

Layer / File(s) Summary
Token Manager In-Memory Storage
src/authorization/azure_token_manager.py, tests/unit/authorization/test_azure_token_manager.py
AzureEntraIDManager replaces environment-variable-based token storage with in-memory caching. New _access_token and _azure_base_url cached state, properties (access_token, azure_base_url), set_base_url() method, and build_azure_provider_data() for header construction. Token refresh and expiration logic remain, but token updates no longer write to environment variables. Tests updated to validate in-memory state and provider data building.
Configuration Enrichment with Model Validation
src/llama_stack_configuration.py, tests/unit/test_llama_stack_configuration.py
New enrich_azure_entra_id_inference() sets model_validation: false on remote::azure providers when Entra ID is configured, replacing token-generation and .env writing. generate_configuration() signature removes env_file parameter. CLI no longer accepts --env-file or applies environment variable substitution to Lightspeed config. Tests verify enrichment behavior and confirm no .env file creation.
Client Holder Token Update & Azure Base URL Query
src/client.py, tests/unit/test_client.py
AsyncLlamaStackClientHolder adds update_azure_token() (async) to rebuild the held client with refreshed provider data in both library and service modes, updating headers as needed. Adds get_azure_base_url() to query the remote::azure provider. Library client enrichment extended to call enrich_azure_entra_id_inference(). Old update_provider_data() method removed. Tests validate token update and header deferral behavior.
Endpoint Handler Token Refresh Integration
src/app/endpoints/query.py, src/app/endpoints/responses.py, src/app/endpoints/streaming_query.py, src/utils/query.py, tests/unit/app/endpoints/test_query.py, tests/unit/app/endpoints/test_responses.py, tests/unit/app/endpoints/test_streaming_query.py, tests/unit/utils/test_query.py
Query, responses, and streaming_query handlers now refresh Azure tokens via AsyncLlamaStackClientHolder().update_azure_token() instead of a standalone helper. Standalone update_azure_token() async function removed from utils/query.py along with its imports and tests. All endpoint tests updated to mock the client holder method instead of module-level functions.
Startup & Runtime Integration
src/app/main.py, scripts/llama-stack-entrypoint.sh, tests/e2e-prow/rhoai/manifests/lightspeed/llama-stack-openai.yaml, tests/e2e-prow/rhoai/manifests/lightspeed/llama-stack-prow.yaml
App lifespan removes early Azure token refresh and instead configures AzureEntraIDManager with config and base URL retrieved post-client-load. Shell scripts and Kubernetes manifests stop passing .env file arguments to enrichment, simplifying bootstrap.
Documentation & Configuration Examples
docs/providers.md, docs/rag_guide.md, examples/azure-run.yaml
Providers guide updated to describe runtime token acquisition via headers and in-memory caching. RAG guide removes --env-file CLI reference. Azure example config adds model_validation: false and removes api_key field.
Makefile Simplification
Makefile
Removes inline AZURE_API_KEY environment variable injection from run-llama-stack target.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Suggested reviewers

  • manstis
  • ldjebran
  • tisnik
  • maorfr
🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title accurately summarizes the main change: deferring Azure model authentication from startup to runtime, removing pre-startup token acquisition and environment variable passing.
Docstring Coverage ✅ Passed Docstring coverage is 97.06% which is sufficient. The required threshold is 80.00%.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
✨ Simplify code
  • Create PR with simplified code

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 4

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
src/client.py (1)

257-263: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Harden provider-data JSON parsing before merge.

At Line 258, valid JSON that is not an object (for example, a list) will pass decoding and then fail at Line 262 on .update(...).

Proposed fix
-        try:
-            provider_data = json.loads(provider_data_json) if provider_data_json else {}
-        except (json.JSONDecodeError, TypeError):
-            provider_data = {}
+        try:
+            parsed = json.loads(provider_data_json) if provider_data_json else {}
+            provider_data = parsed if isinstance(parsed, dict) else {}
+        except (json.JSONDecodeError, TypeError):
+            provider_data = {}
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/client.py` around lines 257 - 263, The JSON parsing for provider data
currently assumes json.loads(provider_data_json) yields a dict and then calls
provider_data.update(updates); to fix, after decoding (and in the existing
except block) ensure the decoded value is a dict (use isinstance(provider_data,
dict)) and if it's not, replace it with an empty dict before calling
provider_data.update(updates); keep the existing exception handling for
JSONDecodeError/TypeError and apply this check where provider_data and
provider_data_json are used.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@docs/providers.md`:
- Around line 95-107: The docs use two different config keys—`base_url` in the
prose and `api_base` in the YAML—causing inconsistency with the actual schema;
pick the correct schema key (whichever the code expects for the `remote::azure`
provider) and update both the descriptive text and the example so they match
(for example, replace `api_base` with `base_url` or vice versa) and keep
references to `azure_entra_id`, `model_validation: false`, `provider_id: azure`,
and `provider_type: remote::azure` unchanged.
- Around line 114-118: Update the lifecycle text under "Lightspeed startup
(library and service mode)" to remove the claim that Lightspeed "Acquires an
initial access token from Microsoft Entra ID" and "Caches the token in memory
(`SecretStr`, per worker)" at startup; instead state that Lightspeed reads Entra
ID configuration at startup but defers authentication until request time,
acquiring and caching access tokens on demand per-request (and still
initializing the Llama Stack client with `X-LlamaStack-Provider-Data` as
before). Ensure the revised step explicitly reflects the runtime-only,
deferred-auth flow implemented in the PR.

In `@src/client.py`:
- Around line 245-250: When initializing the AsyncLlamaStackAsLibraryClient
inside the token-refresh path, wrap the await client.initialize() call in a
try/except that catches initialization failures (broad Exception or the specific
initialization error type) and translate them into a service-unavailable error
(e.g., raise or return a ServiceUnavailableError / HTTP 503) instead of letting
the exception bubble; set self._lsc only after successful initialization (leave
self._lsc unchanged on failure) and include the client
variable/AsyncLlamaStackAsLibraryClient.initialize in your handling so callers
receive a 503-style backend-unavailable response when initialization fails.

In `@tests/unit/app/endpoints/test_streaming_query.py`:
- Line 731: The test creates an unused mock "mock_updated_client" and sets
update_azure_token to return it but never asserts it's used; either remove the
unused mock_updated_client and its return configuration to simplify the test, or
if you intend to verify the updated client is applied, change the assertions to
check that get_client() (or subsequent operations that use the client)
returns/uses mock_updated_client after calling update_azure_token; update
references to AsyncLlamaStackClient, mock_updated_client, and update_azure_token
accordingly.

---

Outside diff comments:
In `@src/client.py`:
- Around line 257-263: The JSON parsing for provider data currently assumes
json.loads(provider_data_json) yields a dict and then calls
provider_data.update(updates); to fix, after decoding (and in the existing
except block) ensure the decoded value is a dict (use isinstance(provider_data,
dict)) and if it's not, replace it with an empty dict before calling
provider_data.update(updates); keep the existing exception handling for
JSONDecodeError/TypeError and apply this check where provider_data and
provider_data_json are used.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: 23accaba-033e-4848-bb78-61bb7f096894

📥 Commits

Reviewing files that changed from the base of the PR and between 559753d and 3324de6.

📒 Files selected for processing (22)
  • Makefile
  • docs/providers.md
  • docs/rag_guide.md
  • examples/azure-run.yaml
  • scripts/llama-stack-entrypoint.sh
  • src/app/endpoints/query.py
  • src/app/endpoints/responses.py
  • src/app/endpoints/streaming_query.py
  • src/app/main.py
  • src/authorization/azure_token_manager.py
  • src/client.py
  • src/llama_stack_configuration.py
  • src/utils/query.py
  • tests/e2e-prow/rhoai/manifests/lightspeed/llama-stack-openai.yaml
  • tests/e2e-prow/rhoai/manifests/lightspeed/llama-stack-prow.yaml
  • tests/unit/app/endpoints/test_query.py
  • tests/unit/app/endpoints/test_responses.py
  • tests/unit/app/endpoints/test_streaming_query.py
  • tests/unit/authorization/test_azure_token_manager.py
  • tests/unit/test_client.py
  • tests/unit/test_llama_stack_configuration.py
  • tests/unit/utils/test_query.py
💤 Files with no reviewable changes (4)
  • Makefile
  • tests/unit/utils/test_query.py
  • docs/rag_guide.md
  • src/utils/query.py
📜 Review details
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (5)
  • GitHub Check: E2E: library mode / ci / group 2
  • GitHub Check: E2E: library mode / ci / group 3
  • GitHub Check: E2E: server mode / ci / group 1
  • GitHub Check: E2E Tests for Lightspeed Evaluation job
  • GitHub Check: Konflux kflux-prd-rh02 / lightspeed-stack-on-pull-request
🧰 Additional context used
📓 Path-based instructions (3)
tests/**/*.py

📄 CodeRabbit inference engine (AGENTS.md)

tests/**/*.py: Use pytest for all unit and integration tests; do not use unittest
Use pytest.mark.asyncio marker for async tests

Files:

  • tests/unit/app/endpoints/test_query.py
  • tests/unit/app/endpoints/test_responses.py
  • tests/unit/test_llama_stack_configuration.py
  • tests/unit/app/endpoints/test_streaming_query.py
  • tests/unit/authorization/test_azure_token_manager.py
  • tests/unit/test_client.py
src/**/*.py

📄 CodeRabbit inference engine (AGENTS.md)

src/**/*.py: Use absolute imports for internal modules: from authentication import get_auth_dependency
Llama Stack imports: Use from llama_stack_client import AsyncLlamaStackClient
Check constants.py for shared constants before defining new ones
All modules must start with descriptive docstrings explaining purpose
Use logger = get_logger(__name__) from log.py for module logging
All functions must have complete type annotations for parameters and return types, use modern syntax (str | int), and include descriptive docstrings
Use snake_case with descriptive, action-oriented names for functions (get_, validate_, check_)
Avoid in-place parameter modification anti-patterns; return new data structures instead of modifying function parameters
Use async def for I/O operations and external API calls
Use standard log levels with clear purposes: debug() for diagnostic info, info() for program execution, warning() for unexpected events, error() for serious problems
All classes must have descriptive docstrings explaining purpose and use PascalCase with standard suffixes: Configuration, Error/Exception, Resolver, Interface
Abstract classes must use ABC with @abstractmethod decorators
Follow Google Python docstring conventions with required sections: Parameters, Returns, Raises, and Attributes for classes

Files:

  • src/app/endpoints/streaming_query.py
  • src/app/endpoints/responses.py
  • src/app/main.py
  • src/authorization/azure_token_manager.py
  • src/llama_stack_configuration.py
  • src/client.py
  • src/app/endpoints/query.py
src/app/**/*.py

📄 CodeRabbit inference engine (AGENTS.md)

src/app/**/*.py: FastAPI dependencies: Import from fastapi module for APIRouter, HTTPException, Request, status, Depends
Use FastAPI HTTPException with appropriate status codes for API endpoints and handle APIConnectionError from Llama Stack

Files:

  • src/app/endpoints/streaming_query.py
  • src/app/endpoints/responses.py
  • src/app/main.py
  • src/app/endpoints/query.py
🧠 Learnings (4)
📚 Learning: 2026-05-12T15:14:34.788Z
Learnt from: syedriko
Repo: lightspeed-core/lightspeed-stack PR: 1727
File: scripts/konflux_requirements.sh:9-15
Timestamp: 2026-05-12T15:14:34.788Z
Learning: In this repo, the `.konflux/` directory is committed/tracked and is guaranteed to exist in a fresh clone. Therefore, shell scripts that write output under `.konflux/` (e.g., create files like `.konflux/<...>`) should not waste effort by calling `mkdir -p .konflux` first. Only add directory-creation logic if the script may run in an environment/repo state where `.konflux/` might not be present.

Applied to files:

  • scripts/llama-stack-entrypoint.sh
📚 Learning: 2026-02-19T10:06:50.647Z
Learnt from: radofuchs
Repo: lightspeed-core/lightspeed-stack PR: 1181
File: tests/e2e-prow/rhoai/manifests/lightspeed/mock-jwks.yaml:32-34
Timestamp: 2026-02-19T10:06:50.647Z
Learning: In the rhoai tests under tests/e2e-prow/rhoai/manifests, avoid static ConfigMap definitions for mock-jwks-script and mcp-mock-server-script since these ConfigMaps are created dynamically by the pipeline.sh deployment script using 'oc create configmap'. Ensure there are no static ConfigMap resources for these names in the manifests. If such ConfigMaps are added in the future, coordinate with the pipeline to reflect dynamic creation or adjust tests to rely on the dynamic provisioning.

Applied to files:

  • tests/e2e-prow/rhoai/manifests/lightspeed/llama-stack-openai.yaml
  • tests/e2e-prow/rhoai/manifests/lightspeed/llama-stack-prow.yaml
📚 Learning: 2026-04-06T20:18:07.852Z
Learnt from: major
Repo: lightspeed-core/lightspeed-stack PR: 1463
File: src/app/endpoints/rlsapi_v1.py:266-271
Timestamp: 2026-04-06T20:18:07.852Z
Learning: In the lightspeed-stack codebase, within `src/app/endpoints/` inference/MCP endpoints, treat `tools: Optional[list[Any]]` in MCP tool definitions as an intentional, consistent typing pattern (used across `query`, `responses`, `streaming_query`, `rlsapi_v1`). Do not raise or suggest this as a typing issue during code review; changing it in isolation could break endpoint typing consistency across the codebase.

Applied to files:

  • src/app/endpoints/streaming_query.py
  • src/app/endpoints/responses.py
  • src/app/endpoints/query.py
📚 Learning: 2026-01-14T09:37:51.612Z
Learnt from: asimurka
Repo: lightspeed-core/lightspeed-stack PR: 988
File: src/app/endpoints/query.py:319-339
Timestamp: 2026-01-14T09:37:51.612Z
Learning: In the lightspeed-stack repository, when provider_id == "azure", the Azure provider with provider_type "remote::azure" is guaranteed to be present in the providers list. Therefore, avoid defensive StopIteration handling for next() when locating the Azure provider in providers within src/app/endpoints/query.py. This change applies specifically to this file (or nearby provider lookup code) and relies on the invariant that the Azure provider exists; if the invariant could be violated, keep the existing StopIteration handling.

Applied to files:

  • src/app/endpoints/query.py
🔇 Additional comments (17)
examples/azure-run.yaml (1)

27-27: LGTM!

src/app/main.py (1)

98-102: LGTM!

scripts/llama-stack-entrypoint.sh (1)

18-18: LGTM!

tests/e2e-prow/rhoai/manifests/lightspeed/llama-stack-openai.yaml (1)

163-163: LGTM!

tests/e2e-prow/rhoai/manifests/lightspeed/llama-stack-prow.yaml (1)

153-153: LGTM!

src/authorization/azure_token_manager.py (2)

36-42: LGTM!

Also applies to: 59-78


99-106: LGTM!

tests/unit/authorization/test_azure_token_manager.py (1)

56-61: LGTM!

Also applies to: 72-78, 80-92, 163-179

src/llama_stack_configuration.py (2)

46-78: LGTM!


582-583: LGTM!

Also applies to: 632-632

tests/unit/test_llama_stack_configuration.py (1)

15-15: LGTM!

Also applies to: 28-104, 527-527

tests/unit/test_client.py (1)

6-6: LGTM!

Also applies to: 13-13, 16-19, 102-178

src/app/endpoints/query.py (1)

199-207: LGTM!

src/app/endpoints/responses.py (1)

400-408: LGTM!

src/app/endpoints/streaming_query.py (1)

257-265: LGTM!

tests/unit/app/endpoints/test_query.py (1)

544-546: LGTM!

Also applies to: 568-568

tests/unit/app/endpoints/test_responses.py (1)

469-470: LGTM!

Also applies to: 485-485, 505-505

Comment thread docs/providers.md
Comment thread docs/providers.md Outdated
Comment thread src/client.py
Comment thread tests/unit/app/endpoints/test_streaming_query.py
@asimurka asimurka force-pushed the azure_skip_model_availability branch from 3324de6 to ba728c8 Compare May 19, 2026 08:16
Comment thread docs/providers.md
api_key: ${env.AZURE_API_KEY} # Must be exactly this - placeholder for Entra ID token
api_base: ${env.AZURE_API_BASE}
# api_key: ${env.AZURE_API_KEY} # Can be omitted when Entra ID configured in LCORE
base_url: ${env.AZURE_API_BASE}
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Change in attribute name

# shellcheck source=/dev/null
set -a && . "$ENV_FILE" && set +a
fi
-o "$ENRICHED_CONFIG" 2>&1 || ENRICHMENT_FAILED=1
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The API key is not passed by .env file anymore

Comment thread src/app/main.py
if azure_entra_id_config is not None:
AzureEntraIDManager().set_config(azure_entra_id_config)
azure_base_url = await AsyncLlamaStackClientHolder().get_azure_base_url()
AzureEntraIDManager().set_base_url(azure_base_url)
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In FastAPI lifespan just setup manager's attributes and defer the token acquisition to inference requests.

token = self.access_token.get_secret_value()
if not token or self.azure_base_url is None:
return None
return {"azure_api_key": token, "azure_api_base": self.azure_base_url}
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Config and validator attribute discrepancy in LLS (api_base vs base_url)

Comment thread src/utils/query.py
return _is_inout_shield(shield) or not is_output_shield(shield)


async def update_azure_token(
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Replaced to client holder

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant