Add Axiom-based API usage observability at the Edge gateway

## Context

We run ~270 API endpoints behind a shared Vercel Edge chokepoint in `server/gateway.ts` and currently have no reliable usage visibility:

- no per-endpoint req/s or p95 latency
- no stable per-customer or per-key usage attribution
- no long-term queryable request log
- no trustworthy view of upstream-provider usage (ACLED, Wingbits, EIA, etc.)

That makes capacity planning, enterprise support, and historical drill-downs unnecessarily hard.

This issue is to align on the implementation direction with @koala73 before building it.

## Decision Proposal

Use **Axiom as the primary backend** for this phase.

Key points:

- emit structured usage events from the Edge gateway via `fetch`
- keep usage telemetry fully separate from existing `console.log(...)`
- do **not** add Redis, S3, Parquet, or self-hosted ClickHouse in phase 1
- accept a small amount of event loss from best-effort direct Edge -> Axiom ingest
- only add a tiny Railway relay later **if** direct Edge -> Axiom proves insufficient in canary
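
The emission path in the key points above could look roughly like the following. This is a minimal sketch, not the final `usage.ts`: the `AxiomConfig` shape, the `buildIngestRequest` helper, and the injectable `fetchImpl` parameter are assumptions introduced for illustration. The URL follows Axiom's `/v1/datasets/{dataset}/ingest` endpoint and should be confirmed against the account's region before use.

```typescript
// Hypothetical config shape; how the token and dataset are supplied is not decided in this issue.
interface AxiomConfig {
  dataset: string;   // e.g. "wm_api_usage"
  token: string;     // Axiom API token with ingest permission
  timeoutMs: number; // the plan suggests 1-2s
}

type UsageEvent = Record<string, string | number | boolean | null>;

// Pure helper so the request shape can be checked without network access.
export function buildIngestRequest(
  events: UsageEvent[],
  cfg: AxiomConfig,
): { url: string; init: RequestInit } {
  return {
    url: `https://api.axiom.co/v1/datasets/${encodeURIComponent(cfg.dataset)}/ingest`,
    init: {
      method: "POST",
      headers: {
        Authorization: `Bearer ${cfg.token}`,
        "Content-Type": "application/json",
      },
      body: JSON.stringify(events), // flat rows, sent as one JSON array
    },
  };
}

// Best effort: one attempt, bounded by a timeout, no retry, no logging.
// The returned promise never rejects, so callers cannot be broken by telemetry.
export async function emitUsageEvents(
  events: UsageEvent[],
  cfg: AxiomConfig,
  fetchImpl: typeof fetch = fetch,
): Promise<void> {
  if (events.length === 0) return;
  const controller = new AbortController();
  const timer = setTimeout(() => controller.abort(), cfg.timeoutMs);
  try {
    const { url, init } = buildIngestRequest(events, cfg);
    await fetchImpl(url, { ...init, signal: controller.signal });
  } catch {
    // Drop the batch on network error or timeout.
  } finally {
    clearTimeout(timer);
  }
}
```

The gateway would call `emitUsageEvents(...)` without awaiting it on the request path. On Vercel Edge the returned promise would probably need to be passed to `waitUntil` so the runtime keeps it alive after the response has been sent.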

## Final Plan

1. Instrument the chokepoint in `server/gateway.ts`.
   - Create a request-scoped usage context at the start of each request.
   - Finalize telemetry after the response status and duration are known.

2. Add `server/_shared/usage-context.ts` backed by `AsyncLocalStorage`.
   - Store `request_id`, `route`, `domain`, `customer_id`, `user_id`, `api_key_id`, `tier`, and `auth_kind`.
   - This keeps deep provider helpers from needing the `Request` object threaded through every call.

3. Add `server/_shared/usage.ts`.
   - Provide a dedicated `emitUsageEvents(events)` helper.
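   - Send events with a direct `fetch` to Axiom.
   - Use a 1-2s timeout and no retry, never call `console.log` inside the helper, and do not await the send on the request path. On Edge the pending promise will likely need to be handed to `waitUntil` so it is not dropped once the response returns.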

4. Start with **one Axiom dataset**: `wm_api_usage`.
   - Use flat rows with `event_type` rather than nested arrays.
   - `event_type=request`
   - `event_type=upstream`

5. Emit one `request` event per inbound API call with fields like:
   - `_time`, `event_type`, `request_id`
   - `domain`, `route`, `method`, `status`, `duration_ms`
   - `req_bytes`, `res_bytes` (best effort)
   - `customer_id`, `user_id`, `api_key_id`, `auth_kind`, `tier`
   - `country`, `ua_hash`, `cache_status`

6. Emit one `upstream` event per real outbound provider call with fields like:
   - `_time`, `event_type`, `request_id`
   - `customer_id`, `route`, `tier`
   - `provider`, `operation`, `host`
   - `status`, `duration_ms`, `req_bytes`, `res_bytes` (same column names as the `request` event, so flat rows share columns)
   - `cache_status`: `'miss'` or `'fresh'`

7. Instrument upstream calls in two passes.
   - First, add coverage in shared helpers like `server/_shared/redis.ts` on the fresh-fetch path only, plus a shared `trackedFetch()` helper for raw provider calls.
   - Then migrate the highest-cost bypasses first: ACLED, Wingbits, EIA / economic shared fetches, then the rest incrementally.

8. Keep existing `console.log` usage untouched.
   - Usage telemetry goes only to Axiom.
   - No log scraping.

9. Keep phase 1 direct.
   - No Redis spool.
   - No object-store archive.
   - No Parquet generation from Edge.
   - Fallback path only if needed: a small Railway relay that batches to Axiom.

10. Roll out behind `USAGE_TELEMETRY=1`.
    - ship dark to a test dataset
    - enable request events first
    - add upstream instrumentation next
    - build dashboards after real data lands
    - canary in prod and compare counts vs Vercel request volume

11. Build initial dashboards in Axiom:
    - req/s and p95 by route
    - top customers by request count
    - top customers by upstream provider consumption
    - top providers by call volume and error rate
    - long-range drill-down by `customer_id + provider + month`
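
Steps 2, 6, and 7 of the plan might fit together roughly as below. This is a sketch under assumptions: `runWithUsageContext`, `getUsageContext`, the per-request `events` buffer, the `trackedFetch` signature, and the `status: 0` convention for calls that never received an HTTP response are all hypothetical. Byte counts and `cache_status` are left out for brevity. It also assumes the Edge runtime exposes `AsyncLocalStorage` through `node:async_hooks`.

```typescript
import { AsyncLocalStorage } from "node:async_hooks";

type UsageEvent = Record<string, string | number | null>;

// Fields from step 2; nullable where a request may be unauthenticated.
interface UsageContext {
  request_id: string;
  route: string;
  domain: string;
  customer_id: string | null;
  user_id: string | null;
  api_key_id: string | null;
  tier: string | null;
  auth_kind: string | null;
  events: UsageEvent[]; // hypothetical buffer, flushed once when the request ends
}

const storage = new AsyncLocalStorage<UsageContext>();

// Called once by the gateway; everything inside `fn` can read the context.
export function runWithUsageContext<T>(ctx: UsageContext, fn: () => T): T {
  return storage.run(ctx, fn);
}

// Returns undefined outside a request, e.g. in cron jobs or tests.
export function getUsageContext(): UsageContext | undefined {
  return storage.getStore();
}

// Wraps one outbound provider call and records one flat `upstream` row.
// Deep helpers call this without needing the inbound Request object.
export async function trackedFetch(
  provider: string,
  operation: string,
  url: string,
  init?: RequestInit,
  fetchImpl: typeof fetch = fetch,
): Promise<Response> {
  const ctx = getUsageContext();
  const started = Date.now();
  let status = 0; // 0 means no HTTP response (network error or abort)
  try {
    const res = await fetchImpl(url, init);
    status = res.status;
    return res;
  } finally {
    if (ctx) {
      ctx.events.push({
        _time: new Date(started).toISOString(),
        event_type: "upstream",
        request_id: ctx.request_id,
        customer_id: ctx.customer_id,
        route: ctx.route,
        tier: ctx.tier,
        provider,
        operation,
        host: new URL(url).host,
        status,
        duration_ms: Date.now() - started,
      });
    }
  }
}
```

In this shape the gateway would call `runWithUsageContext(ctx, () => handler(req))`, append the `request` row once status and duration are known, and send `ctx.events` to Axiom as a single batch.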

## Why this shape

This is the adversarial consensus version of two proposed approaches:

- keep the strongest part of the Axiom-first plan: one managed backend, direct Edge emission, no new infra by default
- keep the strongest part of the stricter observability plan: stable customer attribution, explicit upstream-provider tracking, and event shapes that can answer questions like:
  - "which customer is burning provider capacity?"
  - "how much did Acme call EIA last March?"

The main rejected idea is introducing Redis as a spool in phase 1. That adds moving parts before we know whether direct Edge -> Axiom ingest is actually a problem.

## Blocking Questions Before Build

- What is the canonical `customer_id` for enterprise API traffic?
  - org/account
  - contract/customer record
  - something else

- Which request/query fields are explicitly allowed in telemetry vs must be redacted or omitted?
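
Whatever the answer to the redaction question turns out to be, one way to enforce it is an allowlist applied just before emission. The sketch below is illustrative only: `toTelemetryRow` and the contents of `ALLOWED_FIELDS` are hypothetical, and the real list depends on the decision made here.

```typescript
type TelemetryValue = string | number | boolean | null;

// Hypothetical allowlist; the real set depends on how the question above is answered.
const ALLOWED_FIELDS: ReadonlySet<string> = new Set([
  "_time", "event_type", "request_id", "domain", "route", "method",
  "status", "duration_ms", "customer_id", "api_key_id", "tier", "auth_kind",
]);

// Keeps only allowlisted keys with scalar values. Everything else is dropped,
// so a newly added field stays out of telemetry until it is explicitly approved.
export function toTelemetryRow(
  candidate: Record<string, unknown>,
): Record<string, TelemetryValue> {
  const row: Record<string, TelemetryValue> = {};
  for (const [key, value] of Object.entries(candidate)) {
    if (!ALLOWED_FIELDS.has(key)) continue;
    if (
      value === null ||
      typeof value === "string" ||
      typeof value === "number" ||
      typeof value === "boolean"
    ) {
      row[key] = value;
    }
  }
  return row;
}
```

An allowlist fails closed: raw query strings or headers never reach Axiom unless someone adds them deliberately, which is easier to review than a growing list of redaction patterns.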

## Acceptance Criteria

- We can view live request rate, p95 latency, top routes, top customers, and top providers.
- We can query historical usage by `customer_id`, `route`, and `provider` over prior months.
- Provider usage reflects real outbound calls, not just inbound request volume.
- Telemetry failures do not materially affect API availability or latency.

## Suggested Rollout

- [ ] Define event schema and redaction rules
- [ ] Add `usage-context.ts` and `usage.ts`
- [ ] Instrument gateway request events behind `USAGE_TELEMETRY`
- [ ] Send to Axiom test dataset and validate event quality
- [ ] Instrument shared upstream paths
- [ ] Canary in production
- [ ] Create first operational dashboards

