Skip to content

fix(billing): harden concurrency safety and payment failure handling#491

Merged
GenerQAQ merged 1 commit intodevfrom
fix/billing-concurrency-safety
Mar 25, 2026
Merged

fix(billing): harden concurrency safety and payment failure handling#491
GenerQAQ merged 1 commit intodevfrom
fix/billing-concurrency-safety

Conversation

@GenerQAQ
Copy link
Copy Markdown
Contributor

Why we need this PR?

The billing system had several concurrency vulnerabilities and incomplete payment failure handling:

  1. Race conditions in usage counter increments (read-modify-write pattern) could cause lost updates under concurrent sync-usage invocations
  2. No distributed lock on PushMetrics allowed concurrent calls to double-report usage to Stripe
  3. Monthly reset checkpoint used a non-atomic check-then-set pattern, allowing duplicate resets
  4. Payment failures were only handled for promo-expired subscriptions, leaving general payment failures (card decline, insufficient funds) untracked
  5. Timing attack vulnerability in MetricsAuth token comparison

Describe your solution

Atomic DB operations (Supabase)

  • New increment_project_usage Postgres RPC: atomic INSERT ... ON CONFLICT DO UPDATE replaces read-then-write pattern for usage counters
  • New try_claim_monthly_reset Postgres RPC: atomic checkpoint claim ensures only one concurrent caller performs the reset
  • updateStorageUsage converted to use .upsert() with onConflict

Distributed lock (Go API)

  • Redis SetNX lock on PushMetrics (5min TTL) prevents concurrent Stripe double-reporting
  • SaveMetrics wrapped in DB transaction with SELECT FOR UPDATE + sorted lock ordering to prevent deadlocks

Payment status tracking

  • New payment_status column on organization_billing: ok | past_due | blocked
  • Metered invoice failure → blocked (hard-blocks all writes via excess=true on all metric tags)
  • Plan invoice failure → past_due + schedule downgrade to free
  • Payment success selectively clears only the matching failure type (metered clears blocked, plan clears past_due)

Safety improvements

  • downgrade_to metadata validated against allowlist ["free", "pro", "team"]
  • Failed monthly resets tracked in resetFailedOrgs — all subsequent steps skip those orgs
  • subtle.ConstantTimeCompare for MetricsAuth token check
  • Auth failure spans now consistently set authenticated=false

Implementation Tasks

  • Add payment_status column to organization_billing schema
  • Create increment_project_usage Postgres RPC function
  • Create try_claim_monthly_reset Postgres RPC function
  • Replace read-modify-write in incrementUsage with RPC call
  • Replace read-modify-write in updateStorageUsage with upsert
  • Replace hasOrgResetThisMonth/markOrgMonthAsReset with atomic tryClaimMonthlyReset
  • Add Redis distributed lock to PushMetrics handler
  • Wrap SaveMetrics in transaction with SELECT FOR UPDATE
  • Implement metered vs plan invoice failure differentiation
  • Add payment-blocked org enforcement (force excess=true)
  • Add resetFailedOrgs tracking and skip logic
  • Validate downgrade_to metadata against plan allowlist
  • Fix constant-time token comparison in MetricsAuth
  • Add authenticated=false span attributes on auth failure paths

Impact Areas

  • API Server
  • Dashboard
  • Client SDK (Python)
  • Client SDK (TypeScript)
  • Core Service
  • CLI Tool
  • Documentation
  • Other: ...

Checklist

  • Open your pull request against the dev branch.
  • All tests pass in available continuous integration systems (e.g., GitHub Actions).
  • Tests are added or modified as needed to cover code changes.

- Add payment_status field (ok/past_due/blocked) to organization_billing
- Replace read-modify-write with atomic Postgres RPCs (increment_project_usage, try_claim_monthly_reset)
- Add Redis distributed lock to PushMetrics to prevent concurrent Stripe double-reporting
- Wrap SaveMetrics in DB transaction with SELECT FOR UPDATE + consistent lock ordering
- Differentiate metered vs plan invoice failures in Stripe webhook
- Force excess=true for all projects of payment-blocked orgs
- Skip processing for orgs with failed monthly reset to prevent stale-counter billing
- Use constant-time comparison for MetricsAuth token check
- Validate downgrade_to metadata against plan allowlist
- Sync period_end/subscription_id on scheduled cancellations
@GenerQAQ GenerQAQ requested a review from a team as a code owner March 25, 2026 07:19
@GenerQAQ GenerQAQ merged commit cdd47ed into dev Mar 25, 2026
4 checks passed
@GenerQAQ GenerQAQ deleted the fix/billing-concurrency-safety branch March 25, 2026 07:35
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant