Skip to content

docs(design): scaling roadmap — multi-region, route scale-out, storage tier, coordinator#958

Open
bootjp wants to merge 5 commits into
mainfrom
design/scaling-roadmap-2026-06
Open

docs(design): scaling roadmap — multi-region, route scale-out, storage tier, coordinator#958
bootjp wants to merge 5 commits into
mainfrom
design/scaling-roadmap-2026-06

Conversation

@bootjp

@bootjp bootjp commented Jun 12, 2026

Copy link
Copy Markdown
Owner

Summary

Proposed roadmap doc covering 4 scaling subsystems elastickv will need within one growth cycle. Each subsystem ships its own *_proposed_*.md milestone doc when its work queues up — this file is the shared north star + sequencing constraint.

Subsystems:

  • §3 Routing (1M routes target): delta-watcher streaming, B-tree RouteEngine, batched splits
  • §4 Multi-region (2-3 region active-active): WAN raft tuning, per-region HLC ceiling with monotone merge, region-local catalog mirrors
  • §5 Storage (5-10 TB/shard): SST-ingest snapshot transfer, shared block cache + per-shard tuning, per-shard compactor, S3 snapshot offload
  • §6 Coordinator (5-10x QPS): per-group HLC ceiling (kill default-group SPOF), follower reads, cross-shard 2PC, partitioned lock resolver, leader-proxy circuit breaker

Cross-cutting: every monotone-merge across new boundaries reuses M2 hotspot-split's SetPhysicalCeiling + Observe primitive. Capability bit rollout follows the existing cap_migration_v2 contract.

Sequencing (§8): §6 M1 (per-group HLC) → §3 M1 (delta watcher) → §5 M1/M2 (SST ingest + shared cache) → §3 M2/M3 → §4 (regions) → §6 M2-M5 → §5 M4. §6 M1 first because it removes the default-group HLC SPOF that every later milestone otherwise inherits.

10 open questions in §10 — the apply-pipeline cost of per-group HLC, partitioned-region migration policy, and cross-shard txn lock ordering need design-level decisions before the first milestone PR opens.

Self-review

  1. Data loss — N/A doc-only; the doc itself preserves M2 hotspot-split's monotone-merge contract for every new boundary
  2. Concurrency / distributed failures — surfaced as OQ (per-group HLC propose cost, partitioned-region policy, cross-shard txn deadlock prevention)
  3. Performance — SLO targets quantified per subsystem (§3.1, §4.1, §5.1, §6.1)
  4. Data consistency — explicitly: every new HLC boundary uses the same SetPhysicalCeiling primitive; capability bits gate every wire change
  5. Test coverage — §9 table maps each subsystem to unit/property tests + Jepsen workloads (existing route-shuffle extended to 100k routes; new multi-region partition workload; existing workloads with 10x data)

Test plan

  • doc reads coherent end-to-end
  • sequencing graph in §8 has no cycles
  • cross-refs to existing M2 hotspot-split design verified
  • reviewer feedback on §10 open questions (these are the design-level decisions blocking first milestone PR)

Summary by CodeRabbit

  • Documentation

    • Added a comprehensive scaling roadmap detailing phased plans for multi-region active‑active, routing scale‑out, storage tier capacity/retention, coordinator/API gateway scaling, rollout sequencing, observability guidance, testing strategies, and open questions.
  • Tests

    • Improved learner-promotion test to use a robust retry/wait approach and explicit success criteria to better handle transient catch‑up conditions during node promotion.

…e tier, coordinator

elastickv today is sized for single DC, low-thousands of shards,
single-digit-TB per shard, and leader-bound serve paths. Each of
those four limits is reachable from the existing production envelope
within one growth cycle — and the four are coupled.

This roadmap captures the SLO targets, current-state breakage points,
per-subsystem milestone designs (high-level), cross-cutting concerns
(HLC ceiling composability, capability bits, observability), and the
sequencing constraint between them. Each milestone below ships its
own *_proposed_*.md doc when its work is queued — this file is the
shared north star.

Subsystems covered:

1. Routing / shard scale-out (target 1M routes)
   - M1 delta-watcher + streaming WatchCatalog (replace 100ms poll +
     full snapshot rebuild)
   - M2 B-tree RouteEngine + per-group index + COW history overlay
   - M3 BatchSplitRange for high-rate hotspot-split automation

2. Multi-region / cross-DC (target 2-3 region active-active)
   - M1 WAN raft tuning + region-aware membership
   - M2 per-region HLC ceiling with monotone merge on migration
     (reuses M2 hotspot-split SetPhysicalCeiling + Observe primitive)
   - M3 per-region catalog mirror via raft learners
   - M4 cross-region disaster recovery

3. Storage tier (target 5-10 TB per shard)
   - M1 SST-ingest snapshot transfer (kill the full-iter stream)
   - M2 shared block cache + per-shard pebble tuning + L0
     backpressure → ErrShardWriteBackpressure
   - M3 per-shard compactor jittered + 100k version cap + hot-key
     priority
   - M4 S3 snapshot offload for DR

4. Coordinator / API gateway (target 5-10x QPS)
   - M1 per-group HLC ceiling (kill default-group single point of
     failure); cap_per_group_hlc_v1
   - M2 follower-read path with max_staleness_ms contract
   - M3 cross-shard 2PC (lifts ErrCrossShardTransactionNotSupported)
   - M4 partitioned lock resolver (followers do the scan, leader
     applies)
   - M5 leader-proxy circuit breaker (kill 200-retry storm)

The sequencing (§8) puts §6 M1 first because it removes the
default-group HLC single point of failure that every later milestone
otherwise inherits. §3 M1 second because route-count growth feeds
both §4 (regions) and §6 M3 (cross-shard txn rebalancing freedom).

Cross-cutting invariant: every monotone-merge across the new
boundaries (per-shard, per-region) reuses the SetPhysicalCeiling +
Observe primitive that M2 hotspot-split (PR #945) specifies. The
capability bit rollout pattern follows the same M2 cap_migration_v2
contract.

10 open questions in §10 are reviewer-targets — particularly the
per-group HLC apply-pipeline cost (§10.1), the partitioned-region
migration policy (§10.2), and the cross-shard txn lock ordering
(§10.4) need design-level decisions before the first milestone PR
opens.
@coderabbitai

coderabbitai Bot commented Jun 12, 2026

Copy link
Copy Markdown

Review Change Stack

Warning

Review limit reached

@bootjp, we couldn't start this review because you've reached your PR review rate limit.

More reviews will be available in 53 minutes and 58 seconds. Learn how PR review limits work.

Your organization has used up its prepaid credits, and credit purchases are no longer available. Enable the review add-on in the billing tab to keep reviews running — you're only billed for reviews past your plan's rate limits ($0.25/file).

⌛ How to resolve this issue?

After more reviews become available, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

We recommend that you space out your commits to avoid hitting the rate limit.

🚦 How do rate limits work?

CodeRabbit enforces hourly rate limits for each developer per organization.

Our paid plans include higher PR review limits than trial, open-source, and free plans. In all cases, reviews become available again over time. During sustained high-volume PR review activity, CodeRabbit may temporarily slow when the next review becomes available.

Please see our Fair Usage Limits Policy for further information.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: d3096108-f694-41b4-ba01-ce7d8c86e698

📥 Commits

Reviewing files that changed from the base of the PR and between 2a8c266 and b44a61d.

📒 Files selected for processing (1)
  • docs/design/2026_06_12_proposed_scaling_roadmap.md
📝 Walkthrough

Walkthrough

This PR adds a comprehensive multi-subsystem scaling roadmap design document for elastickv spanning routing scale-out, multi-region active-active, storage tier, and coordinator improvements sequenced across milestones, and makes the learner promotion test more robust by retrying promotion on transient catch-up errors.

Changes

Multi-Subsystem Scaling Roadmap Design

Layer / File(s) Summary
Roadmap header, motivation, and overview
docs/design/2026_06_12_proposed_scaling_roadmap.md
Adds document metadata and introduces motivation, current scaling envelope, subsystem coupling, and sequencing needs (lines 1–178).
Routing scale-out milestones and SLOs
docs/design/2026_06_12_proposed_scaling_roadmap.md
Defines routing scale-out target SLOs and M1–M3 milestones: versioned diffs, indexed route engine, and batched catalog split/move proposals (lines 179–256).
Multi-region milestones and SLOs
docs/design/2026_06_12_proposed_scaling_roadmap.md
Defines multi-region target SLOs and M1–M4 milestones: WAN-safe Raft, per-region HLC with monotone-merge, catalog mirroring, and DR concepts (lines 257–343).
Storage tier milestones and SLOs
docs/design/2026_06_12_proposed_scaling_roadmap.md
Defines storage tier target SLOs and M1–M4 milestones: SST-level snapshot ingest, shared cache, sharded MVCC retention, and S3 snapshot offload (lines 344–412).
Coordinator/API gateway milestones and SLOs
docs/design/2026_06_12_proposed_scaling_roadmap.md
Defines coordinator target SLOs and M1–M5 milestones: per-group HLC renewal, follower reads, cross-shard 2PC, partitioned lock resolution, and bounded leader-proxy circuit-breaker (lines 413–510).
Cross-cutting invariants and observability
docs/design/2026_06_12_proposed_scaling_roadmap.md
Adds HLC/monotone-merge composability invariant, capability-bit rollout pattern, and observability metric constraints (lines 511–565).
Sequencing and rollout phases
docs/design/2026_06_12_proposed_scaling_roadmap.md
Provides recommended phased rollout order and dependency-driven flexibility for milestone ordering (lines 566–609).
Testing strategy and open questions
docs/design/2026_06_12_proposed_scaling_roadmap.md
Maps milestones to unit/property tests and Jepsen workloads, lists detailed open questions, and clarifies scope boundaries (lines 610–697).

Learner Promotion Test Robustness

Layer / File(s) Summary
Promotion retry with warm commit index
internal/raftengine/etcd/learner_test.go (lines 98–120)
Updates TestPromoteLearnerSwapsRoleToVoter to capture the leader’s warm proposal CommitIndex as the catch-up bar and retry PromoteLearner using require.Eventually, treating errPromoteLearnerNotCaughtUp as a transient wait condition.

Possibly related PRs

  • bootjp/elastickv#676: Introduced the learner and PromoteLearner test suite that this PR's test improvement builds upon.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Poem

🐰 A roadmap stitched in thoughtful lines,
Regions, routes, and storage that realigns,
When learners lag and need a shove,
A patient retry fits like a glove,
Hops forward — steady, strong, and kind.

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title accurately and specifically summarizes the main change — a design document introducing a scaling roadmap across four subsystems (multi-region, route scale-out, storage tier, coordinator).
Docstring Coverage ✅ Passed Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch design/scaling-roadmap-2026-06

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 2dda7d0b36

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment on lines +437 to +438
- Followers serve reads at a bounded staleness, with the staleness
guaranteed by the existing lease infrastructure. A follower-read

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Don't base follower reads on leader-only leases

When implementing §6 M2, followers cannot use the existing lease path: Coordinate.LeaseRead in kv/coordinator.go only returns the fast path when the engine state is StateLeader, and its secondary lease check also requires leader state. In this scenario the proposed follower path either remains leader-bound and misses the 5x read-QPS target, or bypasses the existing checks and serves reads without follower-visible quorum/closed-timestamp evidence. Please make this milestone depend on a follower-safe timestamp/read-index mechanism instead of the current leader-local lease infrastructure.

Useful? React with 👍 / 👎.

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request proposes a comprehensive scaling roadmap for elastickv, detailing milestones across routing scale-out, multi-region active-active support, storage tier optimizations, and coordinator throughput improvements. The reviewer provided valuable feedback on several architectural details: warning that spawning a goroutine per Raft group for HLC lease renewal will cause severe CPU thrashing at scale; highlighting a potential state machine stall if transaction status lookups are performed synchronously during lock-resolver delegation; pointing out a conflict between the proposed 100k version cap and the 30-minute retention window for high-throughput keys; and recommending atomic copy-on-write catalog updates to prevent readers from observing transiently inconsistent route states during delta application.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

Comment on lines +431 to +432
- `ShardedCoordinator.RunHLCLeaseRenewal` becomes a per-group loop
(one goroutine per group, each ticking 1 s independently).

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

Running a dedicated goroutine per Raft group for HLC lease renewal (ShardedCoordinator.RunHLCLeaseRenewal) will not scale well when the cluster grows to the targeted 100k–1M routes/groups. Spawning and waking up hundreds of thousands of goroutines every second will cause severe CPU thrashing and scheduler overhead.

Instead of a per-group loop with its own goroutine, consider managing the renewals via a centralized timer wheel or a single background worker goroutine that iterates over or batches the active groups led by the node.

Suggested change
- `ShardedCoordinator.RunHLCLeaseRenewal` becomes a per-group loop
(one goroutine per group, each ticking 1 s independently).
- ShardedCoordinator.RunHLCLeaseRenewal managed via a centralized timer wheel or a single background worker (rather than a dedicated goroutine per group) to avoid scheduler overhead at scale.

Comment on lines +468 to +469
- Net: the leader is no longer the single scan point — followers
do the scan, leaders just apply the resolved batches.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

When delegating the lock-resolver scan to a follower, the design needs to clarify where the primary transaction status lookup (which requires querying the primary shard, potentially on a remote node/region) is performed:

  1. If the follower performs the lookup before proposing the batch, the follower must execute remote RPCs.
  2. If the leader performs the lookup after receiving the proposed batch of expired locks, the leader must execute those RPCs.

In either case, performing these lookups synchronously within the Raft apply loop is a major hazard that can stall the state machine. The design should explicitly state that transaction status lookups must be performed asynchronously outside the Raft execution path, with only the final, fully-resolved commit/abort decisions proposed/applied to the log.

Suggested change
- Net: the leader is no longer the single scan point — followers
do the scan, leaders just apply the resolved batches.
- Net: the leader is no longer the single scan point — followers do the scan and asynchronously resolve transaction statuses against primary shards before proposing the final resolved batches.

Comment on lines +378 to +380
- Per-key version cap: from 1 M (`maxSnapshotVersionCount`)
down to 100 k AND a compactor cycle that prioritises hot keys
(closes the 555 writes/s 30-min hot-key cap regression).

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

Lowering the per-key version cap from 1M down to 100k (line 378) while maintaining a 30-minute retention window (line 374) introduces a conflict for high-throughput keys.

At 100k versions, any key written at more than 100,000 / 1800 ≈ 55 writes/second will exceed the cap within the 30-minute window. If the compactor enforces the 100k cap by discarding older versions, it will violate the 30-minute retention guarantee for those hot keys. The design should clarify how the system resolves the conflict between the hard version cap and the time-based retention window for high-write-rate keys.

Suggested change
- Per-key version cap: from 1 M (`maxSnapshotVersionCount`)
down to 100 k AND a compactor cycle that prioritises hot keys
(closes the 555 writes/s 30-min hot-key cap regression).
- Per-key version cap: dynamically scale or bound the cap based on the write rate to ensure the 30-min retention window is not violated for hot keys, while prioritizing hot-key compaction.

Comment on lines +201 to +202
- Watcher pulls deltas, applies under fine-grained per-route lock
(per-route version vector instead of a global slice rebuild).

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

Applying deltas under fine-grained per-route locks (lines 201-202) can lead to transiently inconsistent catalog states where some route updates are visible to readers while others are not.

Since the Composed-1 cross-version-read gate and the history ring (M2) rely on consistent point-in-time RouteHistorySnapshot views, any non-atomic application of deltas could cause routing anomalies or spurious validation failures. The delta application should be batched and committed atomically to the RouteEngine (e.g., via a copy-on-write version transition) so that readers always observe a fully consistent catalog version.

Suggested change
- Watcher pulls deltas, applies under fine-grained per-route lock
(per-route version vector instead of a global slice rebuild).
- Watcher pulls deltas, applies them to a copy-on-write version overlay, and publishes the new catalog version atomically to prevent readers from observing transiently inconsistent route states.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🧹 Nitpick comments (1)
docs/design/2026_06_12_proposed_scaling_roadmap.md (1)

612-613: 🏗️ Heavy lift

Add explicit HLC / mirror-consistency coverage to the multi-region test row.

The current plan only names WAN-tuning unit tests for internal/raftengine/etcd, but the real risk in §4.2 is the per-region HLC ceiling + monotone-merge + region-local catalog mirror contract. Please add direct kv/hlc.go / kv/coordinator.go property tests and a partition/stale-mirror case so the correctness invariant is exercised before rollout.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@docs/design/2026_06_12_proposed_scaling_roadmap.md` around lines 612 - 613,
Add explicit property tests and a partition/stale-mirror workload to exercise
the per-region HLC ceiling, monotone-merge, and region-local catalog mirror
contract: add unit/property tests in the kv/hlc and kv/coordinator modules that
assert HLC ceiling behavior and monotone-merge invariants under concurrent
writes, and implement a new multi-region jepsen-style workload (e.g.,
"stale-mirror-during-write" or "partition-stale-mirror") that injects network
partitions and simulated stale region-local catalog mirrors during writes to
validate mirror-consistency; update the §4 multi-region test row in the roadmap
to list these kv/hlc and kv/coordinator property tests and the new
partition/stale-mirror workload.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@docs/design/2026_06_12_proposed_scaling_roadmap.md`:
- Around line 565-591: The fenced block containing the phase roadmap (the block
starting with "Phase 1 (data-plane scale, no new region):" and listing §6 M1, §3
M1, etc.) lacks a language tag which triggers markdownlint MD040; fix by adding
a language identifier (e.g., change the opening ``` to ```text) so the fence is
explicitly marked as plain text and the lint warning is resolved.

---

Nitpick comments:
In `@docs/design/2026_06_12_proposed_scaling_roadmap.md`:
- Around line 612-613: Add explicit property tests and a partition/stale-mirror
workload to exercise the per-region HLC ceiling, monotone-merge, and
region-local catalog mirror contract: add unit/property tests in the kv/hlc and
kv/coordinator modules that assert HLC ceiling behavior and monotone-merge
invariants under concurrent writes, and implement a new multi-region
jepsen-style workload (e.g., "stale-mirror-during-write" or
"partition-stale-mirror") that injects network partitions and simulated stale
region-local catalog mirrors during writes to validate mirror-consistency;
update the §4 multi-region test row in the roadmap to list these kv/hlc and
kv/coordinator property tests and the new partition/stale-mirror workload.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: ec122dfd-8b2e-4c6a-b78b-06c9a0e7170f

📥 Commits

Reviewing files that changed from the base of the PR and between 688ba8a and 2e3c4f3.

📒 Files selected for processing (2)
  • docs/design/2026_06_12_proposed_scaling_roadmap.md
  • internal/raftengine/etcd/learner_test.go

Comment thread docs/design/2026_06_12_proposed_scaling_roadmap.md Outdated

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: df26571662

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment on lines +197 to +200
- Catalog rows already carry a `catalog_version` per
`RouteDescriptor` and version bumps; expose
`GetCatalogChanges(from_version, to_version) → []RouteDelta`
with `RouteDelta = {op: ADD|MODIFY|REMOVE, RouteDescriptor}`.

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Define delta storage before relying on per-route versions

This milestone depends on a per-route catalog_version, but the current catalog schema does not store one: distribution.RouteDescriptor only carries route id/range/group/state/parent, and CatalogStore.Save writes a single global CatalogVersionKey plus the changed route rows. With the existing storage, GetCatalogChanges(from_version, to_version) cannot discover ADD/MODIFY/REMOVE deltas by scanning route versions; it would have to reconstruct and diff full snapshots, reintroducing the 1M-route full-catalog cost this milestone is supposed to remove. Please add an explicit change log or per-route version schema to the roadmap before making delta fan-out the dependency.

Useful? React with 👍 / 👎.

Comment on lines +453 to +455
`served_at_ts < read_ts - max_staleness_ms`.
- Per-shard "follower ready" gate: follower must have applied at
least up to `max_applied_ts ≥ read_ts - max_staleness_ms`.

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Convert HLC timestamps before applying staleness budgets

If this formula is implemented literally, max_staleness_ms is subtracted from an encoded HLC timestamp, not from milliseconds. kv/hlc.go stores timestamps as physical_ms << HLCLogicalBits | logical, so a 1000 ms budget subtracts only 1000 logical ticks (usually less than one physical millisecond) and healthy follower reads would be rejected, missing the planned read-QPS gain. The contract should either compare physical milliseconds or define served_at_ts/read_ts as decoded millisecond values here.

Useful? React with 👍 / 👎.

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 2a8c2665ed

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment on lines +296 to +299
migration path) carry the source-region ceiling on the
`ExportRangeVersionsRequest` and the target region's
`SetPhysicalCeiling` applies max(local, imported) — the same
monotone-merge primitive M2 already specifies. This bounds the

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Carry the source HLC value during region cutover

If this cutover contract is implemented literally, the target only imports the source physical ceiling. That is not enough to preserve HLC monotonicity: SetPhysicalCeiling advances only the physical ceiling, while Observe is what advances the last issued packed HLC value. When the source has already issued (ceiling<<16)|n in the same physical millisecond, the target can start at (ceiling<<16)|0, moving timestamps backwards for the migrated route and weakening MVCC/OCC ordering. Please make the migration payload carry and observe the source max HLC, or advance the physical ceiling past it.

Useful? React with 👍 / 👎.

Comment on lines +397 to +399
- Periodic per-shard Pebble snapshot uploaded to an S3-compatible
bucket (the S3 adapter already speaks the protocol).
- Restore is `s3 fetch → pebble.Ingest`; combined with M1's

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Add the missing outbound S3 offload dependency

The current S3 adapter is an inbound S3-compatible server backed by the local KV coordinator, not an outbound client for writing to an external bucket. If M4 depends on reusing it as the offload path, DR snapshots would either be stored back inside the same cluster or have no upload path at all. This milestone should call out the required external S3 client/credentials/KMS integration instead of treating the existing adapter as sufficient.

Useful? React with 👍 / 👎.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant