Conductor is a platform for managing models, model runners, model configurations, and virtualizing combinations into virtual model runners exposed to the network through OpenAI, vLLM, Gemini, and Ollama APIs.
- Multi-tenant Architecture: Full tenant isolation with tenant-scoped data access
- Model Runner Endpoints: Define and manage first-class endpoint types for OpenAI, vLLM, Gemini, and Ollama model runners
- Model Definitions: Catalog your models with metadata like family, parameter size, and quantization
- Model Configurations: Create reusable configurations with pinned properties for embeddings and completions
- Virtual Model Runners: Combine endpoints and configurations into virtual endpoints with load balancing
- Configuration Pinning: Automatically inject model parameters into requests (like OllamaFlow)
- Session Affinity: Pin clients to specific backend endpoints based on IP address, API key, or custom headers to minimize context drops and model swapping
- Load Balancing: Round-robin, random, or first-available endpoint selection with weighted distribution and optional session affinity
- Health Checking: Automatic background health monitoring of endpoints with configurable thresholds
- RigMonitor Telemetry: Optionally enrich endpoint health with cached host CPU, memory, disk, network, GPU, and Ollama telemetry from RigMonitor sidecars
- Policy-Based Routing: Create first-class load-balancing policies that filter or rank endpoints using health, capacity, and RigMonitor metrics
- Explainable Routing: Simulate representative requests, inspect candidate elimination, review policy evidence, and persist routing explanations into request history
- Preflight Validation: Validate endpoints, model definitions, model configurations, load-balancing policies, and VMRs before saving them
- Effective Configuration Preview: Resolve the endpoint set, request permissions, policy attachment, model pinning, and session-affinity settings that a VMR will actually use
- Operational Metrics: Export Prometheus-friendly latency, denial, fallback, session-affinity, saturation, and telemetry-freshness signals
- Drain And Quarantine Controls: Keep endpoints visible for health diagnostics while intentionally excluding them from new routing
- Rate Limiting: Per-endpoint maximum parallel request limits with automatic capacity management
- Request History: Optional per-VMR request/response capture for debugging and auditing with configurable retention, redaction, and metadata-only retention modes
- React Dashboard: Full-featured UI for managing all entities including real-time health status
cd docker
docker compose up -dThe server will be available at http://localhost:9000 and the dashboard at http://localhost:9100.
The Compose file builds the server and dashboard from the local repository Dockerfiles.
- .NET 10 SDK
- Node.js 20+
cd src/Conductor.Server
dotnet runcd dashboard
npm install
npm run devConductor's automated tests use Touchstone so the same shared test cases can run through multiple hosts.
src/Test.Shared/contains the authoritative test definitions.src/Test.Xunit/exposes the shared suite through xUnit.src/Test.Nunit/exposes the same suite through NUnit.src/Test.Automated/runs the suite through the Touchstone console runner.
Common commands:
# Run framework-hosted tests
dotnet test src/Conductor.sln
# Run the console host
dotnet run --project src/Test.Automated/Test.Automated.csprojSee TESTING.md for the full testing guide.
Conductor ships lightweight SDKs for common management-plane workflows:
sdk/javascript/for Node.js and browser-adjacent toolingsdk/python/for Python automation and ops scripts
Both SDKs include helpers for:
- validation routes
- VMR effective configuration preview
- explain-routing simulations
- endpoint drain, resume, and quarantine actions
- request-history search, summary, detail, and bulk deletion
- observability metrics summary and text export
Conductor currently supports four model runner provider types in both the backend proxy and the dashboard:
| Provider Type | Runner Type in UI | Proxied API Shape | Notes |
|---|---|---|---|
| OpenAI | OpenAI |
OpenAI REST API | Supports OpenAI-style chat, embeddings, and model listing |
| vLLM | vLLM |
OpenAI-compatible REST API | First-class runner type in the UI; uses the OpenAI-compatible API surface |
| Gemini | Gemini |
Gemini REST API | Supports Gemini-style models/{model}:generateContent, streaming, embeddings, and model listing |
| Ollama | Ollama |
Ollama REST API | Supports Ollama-style /api/generate, /api/chat, and embeddings flows |
Conductor supports two authentication methods:
- Header-based: Include
x-tenant-id,x-email, andx-passwordheaders - Bearer Token: Include
Authorization: Bearer {token}header
Users have three permission levels:
| Permission | Description |
|---|---|
Global Admin (IsAdmin=true) |
Full cross-tenant access to all resources |
Tenant Admin (IsTenantAdmin=true) |
Can manage users and credentials within their own tenant |
| Standard User | Can only access model configurations, endpoints, runners, and virtual runners in their tenant |
- Global Admins can operate on any tenant by specifying
TenantIdin their requests - Tenant Admins have elevated privileges within their assigned tenant
- Standard Users have read/write access to non-administrative resources
| Entity | Prefix | API Endpoint |
|---|---|---|
| Administrator | admin_ |
/v1.0/administrators |
| Tenant | ten_ |
/v1.0/tenants |
| User | usr_ |
/v1.0/users |
| Credential | cred_ |
/v1.0/credentials |
| Model Runner Endpoint | mre_ |
/v1.0/modelrunnerendpoints |
| Model Definition | md_ |
/v1.0/modeldefinitions |
| Model Configuration | mc_ |
/v1.0/modelconfigurations |
| Load Balancing Policy | lbp_ |
/v1.0/loadbalancingpolicies |
| Virtual Model Runner | vmr_ |
/v1.0/virtualmodelrunners |
| Request History | req_ |
/v1.0/requesthistory |
| Request History Summary | - | /v1.0/requesthistory/summary |
| Observability Metrics | - | /v1.0/observability/metrics |
Model runner endpoints can optionally declare a RigMonitor sidecar configuration. Conductor collects that data during the normal endpoint health-check loop, caches it in memory, and exposes it through endpoint health and telemetry routes. The proxy path never performs live RigMonitor calls while handling client traffic.
Each ModelRunnerEndpoint can include a RigMonitor object with fields such as:
EnabledHostnameOverridePortUseSslTimeoutMsCollectDuringHealthCheckRequireReadyzHealthAffectedByRigMonitorMaxTelemetryAgeMsCapabilitiesRefreshIntervalMsTelemetryProfileTelemetrySelectors
Useful routes:
GET /v1.0/modelrunnerendpoints/healthGET /v1.0/modelrunnerendpoints/{id}/healthGET /v1.0/modelrunnerendpoints/{id}/rigmonitor
Load-balancing policies are tenant-scoped resources attached to a VMR by LoadBalancingPolicyId.
- Policy CRUD:
/v1.0/loadbalancingpolicies - Metrics catalog:
GET /v1.0/loadbalancingpolicies/metrics - VMR attachment: set
LoadBalancingPolicyIdon/v1.0/virtualmodelrunners
Policies combine:
Filters: hard eligibility checks such ashealth.isHealthy == trueorrig.gpu.available == trueRanking: weighted numeric comparisons such as lowest CPU, lowest GPU utilization, or fewest in-flight requestsFallbackMode: use the VMR's legacy load-balancing mode or fail closedTieBreaker: round-robin, random, or first available when scores are equal
Example policy payload:
{
"Name": "Lowest GPU Utilization",
"MaxTelemetryAgeMs": 30000,
"Filters": [
{ "Metric": "health.isHealthy", "Operator": "Equal", "ValueType": "Boolean", "Value": "true" },
{ "Metric": "health.hasCapacity", "Operator": "Equal", "ValueType": "Boolean", "Value": "true" },
{ "Metric": "rig.gpu.available", "Operator": "Equal", "ValueType": "Boolean", "Value": "true" }
],
"Ranking": [
{ "Metric": "rig.gpu.avgUtilizationPercent", "Direction": "Ascending", "Weight": 1.0 }
],
"FallbackMode": "UseLegacyLoadBalancingMode",
"TieBreaker": "RoundRobin",
"Active": true
}Example VMR attachment:
{
"Name": "GPU Chat VMR",
"BasePath": "/v1.0/api/gpu-chat/",
"LoadBalancingMode": "RoundRobin",
"LoadBalancingPolicyId": "lbp_example",
"ModelRunnerEndpointIds": ["mre_a", "mre_b"],
"Active": true
}The management plane now exposes first-class safety and explainability routes:
POST /v1.0/modelrunnerendpoints/validatePOST /v1.0/modeldefinitions/validatePOST /v1.0/modelconfigurations/validatePOST /v1.0/loadbalancingpolicies/validatePOST /v1.0/virtualmodelrunners/validateGET /v1.0/virtualmodelrunners/{id}/effectivePOST /v1.0/virtualmodelrunners/{id}/explain-routing
Recommended operator flow:
- Validate drafts before saving.
- Inspect the effective VMR preview to confirm endpoint coverage, request permissions, policy attachment, and model pinning.
- Use explain-routing with a representative request body when you need to understand why a request would route, mutate, reuse a session pin, or be denied.
Request-history detail responses also expose the structured routing decision when history is enabled for the VMR.
- Keep unauthenticated RigMonitor sidecars on trusted networks only.
TelemetryProfilenow defaults toFull; narrow it only if you need to reduce health-check telemetry cost.- Prefer
FallbackMode = UseLegacyLoadBalancingModefirst, then move selected VMRs toFailClosedonce telemetry freshness and sidecar reliability are proven. - Stale or missing telemetry can make a telemetry-dependent endpoint ineligible for policy evaluation.
Virtual model runners expose an API at their configured base path. For example, a VMR with base path /v1.0/api/my-vmr/ would expose:
- OpenAI API:
/v1.0/api/my-vmr/v1/chat/completions,/v1.0/api/my-vmr/v1/embeddings - vLLM API:
/v1.0/api/my-vmr/v1/chat/completions,/v1.0/api/my-vmr/v1/embeddings - Gemini API:
/v1.0/api/my-vmr/v1beta/models/gemini-2.5-flash:generateContent,/v1.0/api/my-vmr/v1beta/models/text-embedding-004:embedContent - Ollama API:
/v1.0/api/my-vmr/api/generate,/v1.0/api/my-vmr/api/chat
{
"Webserver": {
"Hostname": "localhost",
"Port": 9000,
"Ssl": false,
"Cors": {
"Enabled": true,
"AllowedOrigins": ["http://localhost:9100"],
"AllowedMethods": ["GET", "POST", "PUT", "DELETE", "OPTIONS"],
"AllowedHeaders": ["Content-Type", "Authorization", "x-tenant-id", "x-email", "x-password", "x-admin-apikey", "x-admin-email", "x-admin-password"],
"ExposedHeaders": [],
"AllowCredentials": false,
"MaxAgeSeconds": 86400
}
},
"Database": {
"Type": "Sqlite",
"Filename": "./conductor.db"
},
"Logging": {
"Servers": [],
"LogDirectory": "./logs/",
"LogFilename": "conductor.log",
"ConsoleLogging": true,
"MinimumSeverity": 0
},
"RequestHistory": {
"Enabled": true,
"Directory": "./request-history/",
"RetentionDays": 7,
"MetadataRetentionDays": 30,
"BodyRetentionDays": 7,
"CleanupIntervalMinutes": 60,
"CaptureRequestBody": true,
"CaptureResponseBody": true,
"RedactedHeaders": ["authorization", "x-password", "x-admin-password", "x-goog-api-key"],
"RedactedJsonFields": ["authorization", "api_key", "apikey", "password", "token", "bearertoken"],
"MaxRequestBodyBytes": 65536,
"MaxResponseBodyBytes": 65536
}
}- SQLite (default):
"Type": "Sqlite", "Filename": "./conductor.db" - PostgreSQL:
"Type": "PostgreSql", "ConnectionString": "Host=..." - SQL Server:
"Type": "SqlServer", "ConnectionString": "Server=..." - MySQL:
"Type": "MySql", "ConnectionString": "Server=..."
Cross-Origin Resource Sharing (CORS) can be enabled to allow browser-based applications to access the Conductor API.
| Property | Type | Default | Description |
|---|---|---|---|
Enabled |
bool | false |
Enable or disable CORS support |
AllowedOrigins |
string[] | [] |
List of allowed origins. Use ["*"] for all origins |
AllowedMethods |
string[] | ["GET", "POST", "PUT", "DELETE", "OPTIONS"] |
Allowed HTTP methods |
AllowedHeaders |
string[] | ["Content-Type", "Authorization", ...] |
Allowed request headers |
ExposedHeaders |
string[] | [] |
Headers exposed to the browser |
AllowCredentials |
bool | false |
Allow credentials (cookies, auth headers). Cannot be used with AllowedOrigins: ["*"] |
MaxAgeSeconds |
int | 86400 |
Preflight cache duration (0-86400 seconds) |
Example: Allow all origins (development)
{
"Webserver": {
"Cors": {
"Enabled": true,
"AllowedOrigins": ["*"]
}
}
}Example: Restrict to specific origins (production)
{
"Webserver": {
"Cors": {
"Enabled": true,
"AllowedOrigins": ["https://app.example.com", "https://admin.example.com"],
"AllowCredentials": true
}
}
}Request history captures request/response data for Virtual Model Runners with RequestHistoryEnabled set to true. This is useful for debugging, auditing, troubleshooting, and latency analysis. Each completed entry records total response time and time to first token/byte (FirstTokenTimeMs). For non-streaming responses, FirstTokenTimeMs is set to the same value as ResponseTimeMs.
| Property | Type | Default | Description |
|---|---|---|---|
Enabled |
bool | true |
Enable or disable request history globally |
Directory |
string | "./request-history/" |
Directory for storing request detail JSON files |
RetentionDays |
int | 30 |
Legacy retention knob used as a fallback when the newer retention settings are omitted |
MetadataRetentionDays |
int | 30 |
Number of days to retain searchable ledger metadata before cleanup (1-365) |
BodyRetentionDays |
int | 30 |
Number of days to retain request and response bodies inside detail files before they are scrubbed (1-365) |
CleanupIntervalMinutes |
int | 60 |
Interval between cleanup runs in minutes (1-1440) |
CaptureRequestBody |
bool | true |
Persist request bodies when request history is enabled for the VMR |
CaptureResponseBody |
bool | true |
Persist response bodies when request history is enabled for the VMR |
RedactedHeaders |
string[] | built-in sensitive headers | Header names redacted before persistence |
RedactedJsonFields |
string[] | built-in sensitive JSON fields | JSON field names redacted recursively before persistence |
MaxRequestBodyBytes |
int | 65536 |
Maximum request body bytes to capture (1-10485760) |
MaxResponseBodyBytes |
int | 65536 |
Maximum response body bytes to capture (1-10485760) |
Note: Request history must be enabled both globally (in conductor.json) and per-VMR (via the RequestHistoryEnabled property).
Captured request history entries include the VMR, routed model runner endpoint, matched model definition, matched model configuration, policy attachment, requested/effective model names, routing outcome, denial reason, mutation summary, HTTP status, body lengths, transfer type, total response time (ResponseTimeMs), and time to first token/byte (FirstTokenTimeMs).
When BodyRetentionDays is shorter than MetadataRetentionDays, Conductor scrubs request and response bodies from detail files while preserving the searchable routing and latency ledger.
The summary endpoint returns aggregated request counts grouped by time buckets, useful for charting request volume and success/failure rates over time.
GET /v1.0/requesthistory/summary?startUtc={ISO8601}&endUtc={ISO8601}&interval={hour|day}&vmrGuid={guid}
| Parameter | Type | Required | Description |
|---|---|---|---|
startUtc |
string | No | Start of time range (UTC, ISO 8601). Default: 1 hour ago |
endUtc |
string | No | End of time range (UTC, ISO 8601). Default: now |
interval |
string | No | Bucket interval: minute, 15minute, hour, 6hour, or day. Default: hour |
vmrGuid |
string | No | Filter by Virtual Model Runner GUID |
endpointGuid |
string | No | Filter by routed model runner endpoint GUID |
requestorUserGuid |
string | No | Filter by authenticated user GUID |
credentialGuid |
string | No | Filter by credential GUID |
loadBalancingPolicyGuid |
string | No | Filter by attached load-balancing policy GUID |
modelName |
string | No | Filter by requested or effective model |
mutationSummary |
string | No | Filter by mutation-summary substring |
denialReasonCode |
string | No | Filter by denial reason |
sessionAffinityOutcome |
string | No | Filter by session-affinity outcome |
statusClass |
string | No | Filter by status class such as 2xx, 4xx, or 5xx |
sourceIp |
string | No | Filter by requestor source IP |
httpStatus |
integer | No | Filter by exact HTTP status code |
Response:
{
"Data": [
{
"TimestampUtc": "2026-03-20T10:00:00Z",
"SuccessCount": 42,
"FailureCount": 3,
"TotalCount": 45
}
],
"StartUtc": "2026-03-20T10:00:00Z",
"EndUtc": "2026-03-20T11:00:00Z",
"Interval": "hour",
"TotalSuccess": 42,
"TotalFailure": 3,
"StatusClassCounts": {
"2xx": 42,
"5xx": 3
},
"DenialReasonCounts": {
"AllEndpointsAtCapacity": 2,
"PolicyRejected": 1
},
"SessionAffinityOutcomeCounts": {
"Hit": 20,
"Miss": 25
},
"TotalRequests": 45
}Success is defined as HTTP status 100-399; failure is HTTP status 400-599 or null (incomplete requests).
Model configurations can define pinned properties that are automatically merged into incoming requests:
{
"Name": "Low Temperature Config",
"PinnedCompletionsProperties": {
"temperature": 0.3,
"top_p": 0.9,
"max_tokens": 2048
},
"PinnedEmbeddingsProperties": {
"model": "text-embedding-ada-002"
}
}When a request comes through a virtual model runner, the pinned properties are merged with the request body, allowing you to enforce specific model parameters.
Model Runner Endpoints support comprehensive health checking with the following properties:
| Property | Type | Default | Description |
|---|---|---|---|
HealthCheckUrl |
string | / |
URL path appended to endpoint base URL for health checks |
HealthCheckMethod |
enum | GET |
HTTP method (GET or HEAD) |
HealthCheckIntervalMs |
int | 5000 |
Milliseconds between health checks |
HealthCheckTimeoutMs |
int | 5000 |
Timeout for health check requests |
HealthCheckExpectedStatusCode |
int | 200 |
Expected HTTP status code for healthy |
UnhealthyThreshold |
int | 2 |
Consecutive failures before marking unhealthy |
HealthyThreshold |
int | 2 |
Consecutive successes before marking healthy |
HealthCheckUseAuth |
bool | false |
Include API key (Bearer token) in health check requests |
MaxParallelRequests |
int | 4 |
Maximum concurrent requests (0 = unlimited) |
Weight |
int | 1 |
Relative weight for load balancing (1-1000) |
ServiceState |
enum | Normal |
Operator-controlled traffic state: Normal, Draining, or Quarantined |
Note for OpenAI and vLLM APIs: When using api.openai.com or another OpenAI-compatible backend that requires authentication for model listing, set HealthCheckUseAuth to true and HealthCheckUrl to /v1/models.
Note for Gemini API: When using generativelanguage.googleapis.com, set HealthCheckUseAuth to true and HealthCheckUrl to /v1beta/models. Gemini uses the x-goog-api-key header rather than bearer token authentication.
- Endpoints start in an unhealthy state and transition to healthy after meeting the
HealthyThreshold - Background tasks continuously monitor each active endpoint at the configured interval
- The proxy automatically excludes unhealthy endpoints from request routing
- Draining endpoints continue to be probed and remain available for already-pinned session-affinity traffic, but they do not receive new assignments
- Quarantined endpoints continue to be probed for diagnostics, but they are excluded from all routing, including pinned-session reuse
- When all endpoints are unhealthy, requests return
502 Bad Gateway - When all endpoints are at capacity, requests return
429 Too Many Requests - When all configured endpoints are quarantined or draining, requests are denied with an explicit service-state-specific error
- Each endpoint tracks in-flight requests in real-time
- The
MaxParallelRequestsproperty enforces a per-endpoint concurrency limit - Set to
0for unlimited concurrent requests - Requests are counted from start until the response completes (including streaming)
- The
Weightproperty influences endpoint selection in round-robin and random modes - Higher weight = more traffic directed to that endpoint
- Example: Endpoint A (weight=3) receives 3x more traffic than Endpoint B (weight=1)
Monitor endpoint health via the REST API:
# Health of all endpoints in tenant
GET /v1.0/modelrunnerendpoints/health
# Put an endpoint into maintenance drain mode
POST /v1.0/modelrunnerendpoints/{id}/drain
# Resume normal traffic
POST /v1.0/modelrunnerendpoints/{id}/resume
# Exclude an endpoint from all routing while keeping health visibility
POST /v1.0/modelrunnerendpoints/{id}/quarantine
# Health of endpoints for a specific VMR
GET /v1.0/virtualmodelrunners/{id}/healthResponse includes:
- Current health state (healthy/unhealthy)
- Operator-managed service state (
Normal,Draining,Quarantined) - In-flight request count
- Total uptime/downtime
- Uptime percentage
- Last check timestamp
- Last error message (if any)
The included Docker Compose setup uses local build contexts:
- Server:
src/Conductor.Server/Dockerfile - Dashboard:
dashboard/Dockerfile
# Build server
./build-server.sh # or build-server.bat on Windows
# Build dashboard
./build-dashboard.sh # or build-dashboard.bat on WindowsMIT License - see LICENSE.md for details.