## Context

We run ~270 API endpoints behind a shared Vercel Edge chokepoint in `server/gateway.ts` and currently have no reliable usage visibility:
- no per-endpoint req/s or p95 latency
- no stable per-customer or per-key usage attribution
- no long-term queryable request log
- no trustworthy upstream-provider usage view for ACLED / Wingbits / EIA / etc.
That makes capacity planning, enterprise support, and historical drill-downs unnecessarily hard.
This issue is to align on the implementation direction with @koala73 before building it.
## Decision Proposal

Use Axiom as the primary backend for this phase.

Key points:
- emit structured usage events from the Edge gateway via `fetch`
- keep usage telemetry fully separate from existing `console.log(...)`
- do not add Redis, S3, Parquet, or self-hosted ClickHouse in phase 1
- accept small best-effort loss from Edge -> Axiom direct ingest
- only add a tiny Railway relay later if direct Edge -> Axiom proves insufficient in canary
## Final Plan

- Instrument the chokepoint in `server/gateway.ts`.
  - Create a request-scoped usage context at the start of each request.
  - Finalize telemetry after the response status and duration are known.
- Add `server/_shared/usage-context.ts` backed by `AsyncLocalStorage`.
  - Store `request_id`, `route`, `domain`, `customer_id`, `user_id`, `api_key_id`, `tier`, and `auth_kind`.
  - This keeps deep provider helpers from needing the `Request` object threaded through every call.
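A minimal sketch of what `server/_shared/usage-context.ts` could look like, assuming Node's `AsyncLocalStorage` (available in the Vercel Edge runtime). The field names come from the list above; the function names and everything else are illustrative, not the final implementation:

```typescript
import { AsyncLocalStorage } from "node:async_hooks";

export interface UsageContext {
  request_id: string;
  route: string;
  domain?: string;
  customer_id?: string;
  user_id?: string;
  api_key_id?: string;
  tier?: string;
  auth_kind?: string;
}

const storage = new AsyncLocalStorage<UsageContext>();

// Called once at the top of the gateway for each request.
export function runWithUsageContext<T>(ctx: UsageContext, fn: () => T): T {
  return storage.run(ctx, fn);
}

// Deep provider helpers can read attribution fields without the
// Request object being threaded through every call.
export function getUsageContext(): UsageContext | undefined {
  return storage.getStore();
}
```

Any code running outside `runWithUsageContext` simply sees `undefined`, which keeps the helpers safe to call from background paths.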
- Add `server/_shared/usage.ts`.
  - Provide a dedicated `emitUsageEvents(events)` helper.
  - `fetch` directly to Axiom.
  - 1-2s timeout, no retry, no `console.log` inside, and not awaited on the request path.
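A sketch of `emitUsageEvents(events)` under those constraints. The Axiom ingest endpoint shape and the `AXIOM_TOKEN` env var are assumptions to verify against Axiom's ingest API docs; the properties that matter are the flag gate, the short timeout, the missing `await`, and the swallowed errors:

```typescript
type UsageEvent = Record<string, unknown>;

// Assumed endpoint shape; verify against Axiom's ingest API docs.
const INGEST_URL = "https://api.axiom.co/v1/datasets/wm_api_usage/ingest";

export function emitUsageEvents(events: UsageEvent[]): void {
  // No-op unless explicitly enabled and configured.
  if (process.env.USAGE_TELEMETRY !== "1") return;
  const token = process.env.AXIOM_TOKEN;
  if (!token || events.length === 0) return;

  // Intentionally not awaited: telemetry must never block the response path.
  fetch(INGEST_URL, {
    method: "POST",
    headers: {
      Authorization: `Bearer ${token}`,
      "Content-Type": "application/json",
    },
    body: JSON.stringify(events),
    signal: AbortSignal.timeout(1500), // within the 1-2s budget, no retry
  }).catch(() => {
    // Swallow failures: best-effort loss is accepted in phase 1,
    // and no console.log in here by design.
  });
}
```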
- Start with one Axiom dataset: `wm_api_usage`.
  - Use flat rows with `event_type` rather than nested arrays:
    - `event_type=request`
    - `event_type=upstream`
- Emit one `request` event per inbound API call with fields like:
  - `_time`, `event_type`, `request_id`
  - `domain`, `route`, `method`, `status`, `duration_ms`
  - `req_bytes`, `res_bytes` (best effort)
  - `customer_id`, `user_id`, `api_key_id`, `auth_kind`, `tier`
  - `country`, `ua_hash`, `cache_status`
- Emit one `upstream` event per real outbound provider call with fields like:
  - `_time`, `event_type`, `request_id`
  - `customer_id`, `route`, `tier`
  - `provider`, `operation`, `host`
  - `status`, `duration_ms`, `request_bytes`, `response_bytes`
  - `cache_status='miss' | 'fresh'`
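Expressed as TypeScript shapes (field names from the lists above; which fields are optional is an assumption to settle during implementation), the two flat row types could look like:

```typescript
interface BaseUsageEvent {
  _time: string; // ISO timestamp
  event_type: "request" | "upstream";
  request_id: string;
  route: string;
  customer_id?: string;
  tier?: string;
}

interface RequestEvent extends BaseUsageEvent {
  event_type: "request";
  domain: string;
  method: string;
  status: number;
  duration_ms: number;
  req_bytes?: number; // best effort
  res_bytes?: number; // best effort
  user_id?: string;
  api_key_id?: string;
  auth_kind?: string;
  country?: string;
  ua_hash?: string;
  cache_status?: string;
}

interface UpstreamEvent extends BaseUsageEvent {
  event_type: "upstream";
  provider: string; // e.g. "acled", "wingbits", "eia"
  operation?: string;
  host: string;
  status: number;
  duration_ms: number;
  request_bytes?: number;
  response_bytes?: number;
  cache_status: "miss" | "fresh";
}
```

Keeping both shapes flat (no nested arrays) is what makes per-field aggregation in Axiom straightforward later.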
- Instrument upstream calls in two passes.
  - First, add coverage in shared helpers like `server/_shared/redis.ts` on the fresh-fetch path only, plus a shared `trackedFetch()` helper for raw provider calls.
  - Then migrate the highest-cost bypasses first: ACLED, Wingbits, EIA / economic shared fetches, then the rest incrementally.
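One possible shape for that `trackedFetch()` helper, sketched with local stand-ins for the context and emit helpers described above and an injectable fetch so it stays testable. None of this is the final implementation; it only shows where the `upstream` event gets built:

```typescript
// Stand-ins for the helpers described elsewhere in this issue.
type Ctx = { request_id?: string; customer_id?: string; route?: string; tier?: string };
const getUsageContext = (): Ctx | undefined => undefined;
const emitted: Record<string, unknown>[] = [];
const emitUsageEvents = (events: Record<string, unknown>[]): void => {
  emitted.push(...events); // the real version fires to Axiom, not an array
};

// Wrap a raw provider call and emit one upstream event per real fetch.
// Only the fresh-fetch path routes through here, so cache_status is
// always 'miss' from this helper's point of view.
export async function trackedFetch(
  provider: string,
  input: string | URL,
  init?: RequestInit,
  fetchImpl: typeof fetch = fetch,
): Promise<Response> {
  const started = Date.now();
  const res = await fetchImpl(input, init);
  const ctx = getUsageContext();
  emitUsageEvents([
    {
      _time: new Date(started).toISOString(),
      event_type: "upstream",
      request_id: ctx?.request_id,
      customer_id: ctx?.customer_id,
      route: ctx?.route,
      tier: ctx?.tier,
      provider,
      host: new URL(input).host,
      status: res.status,
      duration_ms: Date.now() - started,
      cache_status: "miss",
    },
  ]);
  return res;
}
```

A thrown fetch (timeout, DNS) currently emits nothing; whether failed outbound calls should also produce an `upstream` event is worth deciding before the migration pass.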
- Keep existing `console.log` usage untouched.
  - Usage telemetry goes only to Axiom.
  - No log scraping.
- Keep phase 1 direct.
  - No Redis spool.
  - No object-store archive.
  - No Parquet generation from Edge.
  - Fallback path only if needed: a small Railway relay that batches to Axiom.
- Roll out behind `USAGE_TELEMETRY=1`.
  - ship dark to a test dataset
  - enable request events first
  - add upstream instrumentation next
  - build dashboards after real data lands
  - canary in prod and compare counts vs Vercel request volume
- Build initial dashboards in Axiom:
  - req/s and p95 by route
  - top customers by request count
  - top customers by upstream provider consumption
  - top providers by call volume and error rate
  - long-range drill-down by `customer_id` + `provider` + month
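As a starting point for the "req/s and p95 by route" panel, a query in Axiom's APL might look roughly like the following. The dataset and field names come from this plan, but the exact syntax is an assumption to verify against Axiom's APL documentation:

```
['wm_api_usage']
| where event_type == "request"
| summarize requests = count(), p95_ms = percentile(duration_ms, 95) by bin_auto(_time), route
```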
## Why this shape
This is the adversarial consensus version of two proposed approaches:
- keep the strongest part of the Axiom-first plan: one managed backend, direct Edge emission, no new infra by default
- keep the strongest part of the stricter observability plan: stable customer attribution, explicit upstream-provider tracking, and event shapes that can answer questions like:
- "which customer is burning provider capacity?"
- "how much did Acme call EIA last March?"
The main rejected idea is introducing Redis as a spool in phase 1. That adds moving parts before we know direct Edge -> Axiom is actually a problem.
## Blocking Questions Before Build

- What is the canonical `customer_id` for enterprise API traffic?
- Which request/query fields are explicitly allowed in telemetry vs must be redacted or omitted?
## Acceptance Criteria

- We can view live request rate, p95, top routes, top customers, and top providers.
- We can query historical usage by `customer_id`, `route`, and `provider` over prior months.
- Provider usage reflects real outbound calls, not just inbound request volume.
- Telemetry failures do not materially affect API availability or latency.
## Suggested Rollout

- Land `usage-context.ts` and `usage.ts` behind the `USAGE_TELEMETRY` flag, shipped dark to a test dataset.
- Enable `request` events first, then add `upstream` instrumentation.
- Build dashboards once real data lands, then canary in prod and compare counts against Vercel request volume.