Add Axiom-based API usage observability at the Edge gateway #3381

@SebastienMelki

Context

We run ~270 API endpoints behind a shared Vercel Edge chokepoint in server/gateway.ts and currently have no reliable usage visibility:

  • no per-endpoint req/s or p95 latency
  • no stable per-customer or per-key usage attribution
  • no long-term queryable request log
  • no trustworthy upstream-provider usage view for ACLED / Wingbits / EIA / etc.

That makes capacity planning, enterprise support, and historical drill-downs unnecessarily hard.

The goal of this issue is to align with @koala73 on the implementation direction before building it.

Decision Proposal

Use Axiom as the primary backend for this phase.

Key points:

  • emit structured usage events from the Edge gateway via fetch
  • keep usage telemetry fully separate from existing console.log(...)
  • do not add Redis, S3, Parquet, or self-hosted ClickHouse in phase 1
  • accept small best-effort loss from Edge -> Axiom direct ingest
  • only add a tiny Railway relay later if direct Edge -> Axiom proves insufficient in canary
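To make the "direct fetch, best-effort" idea concrete, here is a minimal sketch of what the emit helper could look like. Everything beyond the dataset name is an assumption: the config shape, the injected `fetchImpl` parameter (added only to make the helper testable), and the ingest URL, which follows Axiom's documented `/v1/datasets/{dataset}/ingest` endpoint but should be verified against the current Axiom docs before use.

```typescript
// Hypothetical config; only the dataset name comes from the plan.
export interface UsageSinkConfig {
  dataset: string;   // e.g. "wm_api_usage"
  token: string;     // Axiom API token, assumed to come from an env var
  timeoutMs: number; // the plan suggests 1-2s
  baseUrl?: string;  // assumed default: Axiom's public API host
}

export type UsageEvent = Record<string, string | number | boolean | null>;

interface IngestInit {
  method: string;
  headers: Record<string, string>;
  body: string;
  signal?: AbortSignal;
}

type FetchLike = (url: string, init: IngestInit) => Promise<unknown>;

// Pure helper so the request shape can be checked without network access.
export function buildIngestRequest(
  events: UsageEvent[],
  cfg: UsageSinkConfig,
): { url: string; init: IngestInit } {
  const base = cfg.baseUrl ?? "https://api.axiom.co";
  return {
    url: `${base}/v1/datasets/${encodeURIComponent(cfg.dataset)}/ingest`,
    init: {
      method: "POST",
      headers: {
        Authorization: `Bearer ${cfg.token}`,
        "Content-Type": "application/json",
      },
      body: JSON.stringify(events),
    },
  };
}

// Fire-and-forget: one attempt, bounded by a timeout, errors swallowed,
// no console.log. The returned promise never rejects, so the caller can
// ignore it or hand it to a platform waitUntil() hook if one is available.
export function emitUsageEvents(
  events: UsageEvent[],
  cfg: UsageSinkConfig,
  fetchImpl: FetchLike = fetch,
): Promise<void> {
  if (events.length === 0) return Promise.resolve();
  const { url, init } = buildIngestRequest(events, cfg);
  const controller = new AbortController();
  const timer = setTimeout(() => controller.abort(), cfg.timeoutMs);
  const done = (): void => clearTimeout(timer);
  let attempt: Promise<unknown>;
  try {
    attempt = fetchImpl(url, { ...init, signal: controller.signal });
  } catch {
    attempt = Promise.resolve(); // a synchronous throw is also swallowed
  }
  return attempt.then(done, done);
}
```

The gateway would call `emitUsageEvents(...)` without `await`, so a slow or failing ingest costs the request path nothing beyond starting the fetch.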

Final Plan

  1. Instrument the chokepoint in server/gateway.ts.

    • Create a request-scoped usage context at the start of each request.
    • Finalize telemetry after the response status and duration are known.
  2. Add server/_shared/usage-context.ts backed by AsyncLocalStorage.

    • Store request_id, route, domain, customer_id, user_id, api_key_id, tier, and auth_kind.
    • This keeps deep provider helpers from needing the Request object threaded through every call.
  3. Add server/_shared/usage.ts.

    • Provide a dedicated emitUsageEvents(events) helper.
    • Direct fetch to Axiom.
    • 1-2s timeout, no retry, no console.log inside, and not awaited on the request path.
  4. Start with one Axiom dataset: wm_api_usage.

    • Use flat rows with event_type rather than nested arrays.
    • event_type=request
    • event_type=upstream
  5. Emit one request event per inbound API call with fields like:

    • _time, event_type, request_id
    • domain, route, method, status, duration_ms
    • req_bytes, res_bytes (best effort)
    • customer_id, user_id, api_key_id, auth_kind, tier
    • country, ua_hash, cache_status
  6. Emit one upstream event per real outbound provider call with fields like:

    • _time, event_type, request_id
    • customer_id, route, tier
    • provider, operation, host
    • status, duration_ms, request_bytes, response_bytes
    • cache_status='miss' | 'fresh'
  7. Instrument upstream calls in two passes.

    • First, add coverage in shared helpers like server/_shared/redis.ts on the fresh-fetch path only, plus a shared trackedFetch() helper for raw provider calls.
    • Then migrate the highest-cost bypasses first: ACLED, Wingbits, EIA / economic shared fetches, then the rest incrementally.
  8. Keep existing console.log usage untouched.

    • Usage telemetry goes only to Axiom.
    • No log scraping.
  9. Keep phase 1 direct.

    • No Redis spool.
    • No object-store archive.
    • No Parquet generation from Edge.
    • Fallback path only if needed: a small Railway relay that batches to Axiom.
  10. Roll out behind USAGE_TELEMETRY=1.

    • ship dark to a test dataset
    • enable request events first
    • add upstream instrumentation next
    • build dashboards after real data lands
    • canary in prod and compare counts vs Vercel request volume
  11. Build initial dashboards in Axiom:

    • req/s and p95 by route
    • top customers by request count
    • top customers by upstream provider consumption
    • top providers by call volume and error rate
    • long-range drill-down by customer_id + provider + month
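As a rough illustration of steps 1, 2, and 6 of the plan, the request-scoped context and the two flat event shapes could look something like the sketch below. The field names come from the plan; the value types, the in-context `events` buffer, and the function names `runWithUsageContext`, `recordUpstream`, and `withUsage` are hypothetical and only meant to show how AsyncLocalStorage avoids threading the Request through provider helpers.

```typescript
import { AsyncLocalStorage } from "node:async_hooks";

// Fields from step 2 of the plan; the value types are assumptions.
export interface UsageContext {
  request_id: string;
  route: string;
  domain: string;
  customer_id: string | null;
  user_id: string | null;
  api_key_id: string | null;
  tier: string | null;
  auth_kind: string | null;
  // Hypothetical buffer: events are collected here and flushed once per request.
  events: Array<Record<string, unknown>>;
}

export interface UpstreamCall {
  provider: string;
  operation: string;
  host: string;
  status: number;
  duration_ms: number;
  cache_status: "miss" | "fresh";
}

const storage = new AsyncLocalStorage<UsageContext>();

export function runWithUsageContext<T>(ctx: UsageContext, fn: () => T): T {
  return storage.run(ctx, fn);
}

export function getUsageContext(): UsageContext | undefined {
  return storage.getStore();
}

// Adds one flat "upstream" row, taking attribution from the ambient context.
// Outside a request scope this is a no-op.
export function recordUpstream(call: UpstreamCall): void {
  const ctx = getUsageContext();
  if (!ctx) return;
  ctx.events.push({
    _time: new Date().toISOString(),
    event_type: "upstream",
    request_id: ctx.request_id,
    customer_id: ctx.customer_id,
    route: ctx.route,
    tier: ctx.tier,
    ...call,
  });
}

// Gateway wrapper: opens the context, then finalizes the "request" row once
// status and duration are known. `flush` is whatever sends events to Axiom.
export async function withUsage<R extends { status: number }>(
  base: Omit<UsageContext, "events">,
  method: string,
  handler: () => Promise<R>,
  flush: (events: UsageContext["events"]) => void,
): Promise<R> {
  const ctx: UsageContext = { ...base, events: [] };
  const started = Date.now();
  let status = 500; // assumed status when the handler throws
  try {
    const res = await runWithUsageContext(ctx, handler);
    status = res.status;
    return res;
  } finally {
    const { events, ...attribution } = ctx;
    events.push({
      _time: new Date().toISOString(),
      event_type: "request",
      method,
      status,
      duration_ms: Date.now() - started,
      ...attribution,
    });
    try {
      flush(events);
    } catch {
      // Telemetry must never break the request.
    }
  }
}
```

A shared helper such as the proposed trackedFetch() would then only need to time the outbound call and pass the result to `recordUpstream(...)`; it never has to see the inbound Request.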

Why this shape

This is the adversarial consensus version of two proposed approaches:

  • keep the strongest part of the Axiom-first plan: one managed backend, direct Edge emission, no new infra by default
  • keep the strongest part of the stricter observability plan: stable customer attribution, explicit upstream-provider tracking, and event shapes that can answer questions like:
    • "which customer is burning provider capacity?"
    • "how much did Acme call EIA last March?"
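To show why flat rows carrying customer_id and provider are enough to answer questions like these, here is a small in-memory sketch of the aggregation. In practice this would be a query run inside Axiom rather than application code; the `UpstreamRow` type and the `upstreamCallsByMonth` name are hypothetical, and `_time` is assumed to be an ISO 8601 UTC timestamp.

```typescript
// Minimal subset of the upstream event shape needed for this question.
interface UpstreamRow {
  _time: string; // assumed ISO 8601 UTC, e.g. "2025-03-14T09:30:00.000Z"
  event_type: "upstream";
  customer_id: string;
  provider: string;
}

// Counts upstream calls keyed by "customer|provider|YYYY-MM".
export function upstreamCallsByMonth(rows: UpstreamRow[]): Map<string, number> {
  const counts = new Map<string, number>();
  for (const row of rows) {
    const month = row._time.slice(0, 7); // "YYYY-MM" prefix of the timestamp
    const key = `${row.customer_id}|${row.provider}|${month}`;
    counts.set(key, (counts.get(key) ?? 0) + 1);
  }
  return counts;
}
```

Because every upstream row already carries its own attribution, no join against the request rows is needed for this kind of drill-down.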

The main rejected idea is introducing Redis as a spool in phase 1. That adds moving parts before we know whether direct Edge -> Axiom ingest is actually a problem.

Blocking Questions Before Build

  • What is the canonical customer_id for enterprise API traffic?

    • org/account
    • contract/customer record
    • something else
  • Which request/query fields are explicitly allowed in telemetry vs must be redacted or omitted?
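Whatever the answer to the second question turns out to be, an allowlist (rather than a blocklist) is probably the safer default, since new query parameters are then omitted until someone explicitly approves them. A minimal sketch, where the parameter names in `ALLOWED_QUERY_PARAMS` are placeholders and not a proposal for the real list:

```typescript
// Placeholder allowlist; the real contents are the open question above.
const ALLOWED_QUERY_PARAMS: ReadonlySet<string> = new Set(["country", "limit"]);

// Returns only the query parameters that are explicitly allowed in telemetry.
// Anything not on the list (API keys, free-text search, etc.) is dropped.
export function pickAllowedQuery(
  url: string,
  allowed: ReadonlySet<string> = ALLOWED_QUERY_PARAMS,
): Record<string, string> {
  const out: Record<string, string> = {};
  new URL(url).searchParams.forEach((value, key) => {
    if (allowed.has(key)) out[key] = value;
  });
  return out;
}
```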

Acceptance Criteria

  • We can view live request rate, p95, top routes, top customers, and top providers.
  • We can query historical usage by customer_id, route, and provider over prior months.
  • Provider usage reflects real outbound calls, not just inbound request volume.
  • Telemetry failures do not materially affect API availability or latency.

Suggested Rollout

  • Define event schema and redaction rules
  • Add usage-context.ts and usage.ts
  • Instrument gateway request events behind USAGE_TELEMETRY
  • Send to Axiom test dataset and validate event quality
  • Instrument shared upstream paths
  • Canary in production
  • Create first operational dashboards
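For the canary step, comparing Axiom event counts against Vercel request volume comes down to a simple loss ratio. A sketch of that arithmetic, with a hypothetical function name and no opinion on what threshold counts as "insufficient":

```typescript
// Fraction of platform-reported requests that have no matching request event.
// Returns 0 when there is no traffic or when Axiom reports at least as many
// events as the platform (e.g. because the two counting windows differ).
export function telemetryLossRatio(
  platformRequests: number,
  axiomRequestEvents: number,
): number {
  if (platformRequests <= 0) return 0;
  const missing = platformRequests - axiomRequestEvents;
  return Math.max(0, missing / platformRequests);
}
```

With made-up numbers, 1,000,000 platform requests and 998,500 request events give a ratio of 0.0015, i.e. 0.15% loss; whether that is acceptable is the call the canary is meant to inform.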
