Deterministic state syncs#2177
Conversation
There was a problem hiding this comment.
Claude Code Review
This repository is configured for manual code reviews. Comment @claude review to trigger a review and subscribe this PR to future pushes, or @claude review once for a one-time review.
Tip: disable this comment in your organization's Code Review settings.
There was a problem hiding this comment.
Pull request overview
Introduces a fork-gated “deterministic state sync” flow for Bor by resolving a cutoff-time → Heimdall height, then querying state-sync events “visible at” that fixed Heimdall height to avoid non-determinism from querying a moving Heimdall head.
Changes:
- Added Bor config fork flag
DeterministicStateSyncBlockand plumbing to include it in fork ID gathering. - Extended Heimdall client interfaces (HTTP/gRPC/in-process app + failover wrapper) with
GetBlockHeightByTimeandStateSyncEventsAtHeight, and switchedBor.CommitStatesto use them post-fork. - Added/updated tests and mocks to cover fork switching and URL formatting.
Reviewed changes
Copilot reviewed 18 out of 19 changed files in this pull request and generated 2 comments.
Show a summary per file
| File | Description |
|---|---|
consensus/bor/bor.go |
Switches state sync fetching logic to deterministic (height-anchored) mode post-fork. |
consensus/bor/heimdall.go |
Extends IHeimdallClient with the new deterministic state-sync APIs. |
consensus/bor/heimdall/client.go |
Implements the new endpoints for the HTTP Heimdall client + URL builders. |
consensus/bor/heimdallgrpc/state_sync.go |
Implements gRPC equivalents for height-anchored state sync and height-by-time lookup. |
consensus/bor/heimdallapp/state_sync.go |
Implements in-process HeimdallApp equivalents using keeper/query server. |
consensus/bor/heimdall/failover_client.go |
Extends failover endpoint interface and wrapper calls for the new APIs. |
consensus/bor/heimdall/metrics.go |
Adds request-type metrics for the new endpoints. |
consensus/bor/heimdall/state_sync_url_test.go |
Adds URL-format tests for the new HTTP paths (and validates RFC3339 formatting). |
params/config.go |
Adds DeterministicStateSyncBlock and IsDeterministicStateSync. |
params/config_test.go |
Unit tests for IsDeterministicStateSync. |
core/forkid/forkid.go |
Includes the new fork block in Bor fork gathering for fork ID computation. |
consensus/bor/bor_test.go |
Adds tests verifying pre-/post-fork behavior and failure semantics post-fork. |
tests/bor/mocks/IHeimdallClient.go |
Regenerated gomock for updated IHeimdallClient. |
consensus/bor/span_store_test.go |
Updates test mocks to satisfy the extended Heimdall client interface. |
eth/handler_bor_test.go |
Updates test mock to satisfy the extended Heimdall client interface. |
eth/ethconfig/config_test.go |
Updates test mock to satisfy the extended Heimdall client interface. |
consensus/bor/heimdall/failover_client_test.go |
Updates mock endpoint used by failover tests to satisfy the extended interface. |
go.mod |
Pins/upgrades github.com/0xPolygon/heimdall-v2 to a newer pseudo-version (via replace). |
go.sum |
Updates checksums for the new heimdall-v2 version. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
|
@claude review |
|
Claude finished @marcello33's task in 1m 50s —— View job Code Review: Deterministic State Syncs
Overall this is a well-structured PR. The fork-gating approach is correct, the error handling post-fork is properly strict (returning errors instead of silently swallowing), and the test coverage for the fork switch is good. Below are findings organized by severity. HIGH: HTTP client does client-side filtering that gRPC/app clients do notIn if e.Id >= fromID && e.RecordTime.Before(time.Unix(toTime, 0)) {But the gRPC implementation ( This means the three transport implementations have different behavior if the Heimdall endpoint returns events outside the expected range. For a deterministic state sync feature, all paths must produce identical results. Either:
The existing MEDIUM:
|
|
test |
Codecov Report❌ Patch coverage is ❌ Your patch check has failed because the patch coverage (24.71%) is below the target coverage (90.00%). You can increase the patch coverage or adjust the target coverage. Additional details and impacted files@@ Coverage Diff @@
## develop #2177 +/- ##
===========================================
- Coverage 51.93% 51.91% -0.02%
===========================================
Files 884 884
Lines 155446 155616 +170
===========================================
+ Hits 80731 80795 +64
- Misses 69509 69620 +111
+ Partials 5206 5201 -5
... and 19 files with indirect coverage changes
🚀 New features to boost your workflow:
|
|
@claude addressed your points.
Re-review. |
|
Claude finished @marcello33's task in 1m 14s —— View job Re-review: Deterministic State Syncs
All six items from the previous review have been addressed correctly:
Remaining finding: Sort inconsistency across transportsThe HTTP client sorts events after fetching ( sort.SliceStable(eventRecords, func(i, j int) bool {
return eventRecords[i].ID < eventRecords[j].ID
})But neither the gRPC client ( Either all three should sort, or none should (if the server guarantees ordering). Given this feature targets determinism, I'd recommend all three apply the same sort for defense-in-depth. Everything else looks good. The core design is sound and the error handling is correct. |
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 18 out of 19 changed files in this pull request and generated 3 comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
|
@claude addressed your points. Review once more and resolve the related open conversations if the comments are addressed. |
Code Review1 issue found. Checked for bugs and CLAUDE.md compliance. Bug: Missing global timeout on StateSyncEventsAtHeight HTTP client pagination loop File: consensus/bor/heimdall/client.go L310-313 StateSyncEventsAtHeight has no context.WithTimeout bounding its pagination loop. Every sibling method added in this PR wraps its loop with a 1-minute timeout (HTTP StateSyncEventsByTime at L365-368, gRPC StateSyncEventsAtHeight at state_sync.go L95-97, gRPC StateSyncEventsByTime at state_sync.go L161-163). This method is the only one missing it. If the Heimdall server consistently returns full pages, the loop runs indefinitely. The comment on L364-365 of StateSyncEventsByTime states the intent to match the gRPC implementation, suggesting this was an oversight. Suggested fix: Add a global timeout at the top of StateSyncEventsAtHeight: ctx, cancel := context.WithTimeout(ctx, 1*time.Minute) / defer cancel() Flagged per consensus-security.md (timeout handling for Heimdall RPC calls) and security-common.md (context propagation for operations that can block). |
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 18 out of 19 changed files in this pull request and generated 3 comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
Code reviewNo issues found. Checked for bugs and CLAUDE.md compliance. |
There was a problem hiding this comment.
LGTM — all review feedback addressed including the failover pagination timeout fix (1-minute budget set at MultiHeimdallClient level before callWithFailover, per the recommendation in the prior inline comments).
Extended reasoning...
Overview
19-file PR introducing deterministic state syncs for the Bor consensus layer. Core change: post-fork, CommitStates calls the new combined StateSyncEventsByTime endpoint (which resolves Heimdall height internally) instead of the non-deterministic (fromID, toTime) pattern. Changes span config (DeterministicStateSyncBlock), consensus logic, all three Heimdall transports (HTTP, gRPC, in-process app), the failover client, fork ID, and tests.
Security risks
No auth or crypto changes. The new API calls go to Heimdall endpoints (trusted internal dependency). The main risk is liveness: post-fork, Heimdall connectivity failures cause block proposals with empty state sync sets rather than hard crashes, matching the pre-fork resilience posture. This was an explicit design decision documented in code comments and accepted by the team.
Level of scrutiny
High scrutiny was appropriate and applied across 5+ review rounds. All raised issues were addressed:
- Sort consistency across HTTP/gRPC/app transports
- URL encoding with
url.Values(notfmt.Sprintf) - 1-minute global pagination timeout at
MultiHeimdallClientlevel (set beforecallWithFailoverapplies its 30s per-attempt cap) StateSyncEventsByTimemethod in all mocks- Comprehensive failover tests for all three new methods
DeterministicStateSyncBlockin startup banner, fork ID, and JSON serialization- err shadowing, context chain growth, and other code quality issues
Other factors
The feature is gated by DeterministicStateSyncBlock: nil on mainnet/amoy, so it cannot activate until explicitly scheduled. The author confirmed manual testing on kurtosis devnets with chaos testing (2-vs-2 partition) with zero errors and state roots matching Erigon. The 19:53 inline comments from the previous approver run were about an intermediate implementation; the final diff implements the correct fix (timeout at MultiHeimdallClient level), as described in those same comments as the recommended fix.
There was a problem hiding this comment.
My previous comment flagging a missing timeout in StateSyncEventsAtHeight was incorrect — the 1-minute context.WithTimeout is already present in the current diff at client.go:313. This PR is ready for human sign-off.
Extended reasoning...
Overview
This PR adds deterministic state sync support to Bor's consensus engine. It introduces DeterministicStateSyncBlock fork gating, a new StateSyncEventsByTime combined endpoint, StateSyncEventsAtHeight and GetBlockHeightByTime for backward-compatibility, and wires all three through the HTTP/gRPC/in-process-app transports plus the MultiHeimdallClient failover wrapper. 19 files changed.
Security risks
The PR touches consensus-critical code (CommitStates), Heimdall client authentication, and fork-ID peer compatibility. The error-handling choice (log + return empty post-fork, matching pre-fork behavior) was explicitly discussed and accepted. No auth bypass or injection risks identified.
Level of scrutiny
This is a high-scrutiny change. It modifies how validators derive state sync event sets — any divergence between validators produces consensus failures. The PR has gone through six rounds of review; all previously raised issues (filtering inconsistency, sort ordering, URL encoding, context shadowing, timeout propagation through callWithFailover, missing mock stubs) have been addressed. My last inline comment was a false positive — the fix was already in place.
Other factors
- Codecov shows 11.8% patch coverage on the new transport code; the gRPC and app-client paths are 0% covered by automated tests. SonarQube flagged 26.4% duplication, primarily from the repeated mock stubs across test files.
- The pre-existing dead code in
internalFetchWithTimeout(unconditional 30s override) was acknowledged but not fixed in this PR. go.moduses a pseudo-version replace directive forheimdall-v2; a TODO marks it for cleanup post-release.DeterministicStateSyncBlockisnilon all live networks, so the new path is not yet active in production.
A human reviewer should confirm the overall design and sign off before merge.
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 18 out of 19 changed files in this pull request and generated 1 comment.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
Code Review1 issue found. Checked for bugs and CLAUDE.md compliance. Bug: Per-attempt timeout (30s) silently caps the intended 1-minute pagination budgetFile: consensus/bor/heimdall/failover_client.go:113-122
Additionally, HeimdallClient.StateSyncEventsByTime (client.go:278) sets its own redundant 1-minute timeout inside the already-tighter 30s context - this inner timeout can never fire. Effect: With default settings, the paginated fetch for all pages must complete within 30s per failover attempt, not the intended 1 minute. During periods with many queued state sync events requiring multiple pages, this could cause premature timeouts, triggering unnecessary failover cascading and ultimately returning empty state syncs (the proposer misses a slot). Note: The test Suggested fix: Remove the 1-minute timeout from |
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 19 out of 20 changed files in this pull request and generated 1 comment.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
|
|
This PR is stale because it has been open 21 days with no activity. Remove stale label or comment or this will be closed in 14 days. |


Description
Implements the bor side of deterministic state syncs. Post-fork, Bor switches from the non-deterministic
(from_id, to_time)state sync query pattern to the combinedGetStateSyncsByTimeendpoint that resolves the Heimdall height internally and returns events in a single call. This ensures all validators derive identical state-sync sets regardless of when they query Heimdall.Changes
DeterministicStateSyncBlocktoBorConfigwithIsDeterministicStateSync()checkCommitStatesbranches on fork activation — post-fork uses the combinedStateSyncEventsByTimeendpoint. On errors, logs and returns empty (matching pre-fork resilience) instead of crashing block productionStateSyncEventsByTimeto all three transports (HTTP, gRPC, in-process app). Consistent sorting and retry behavior across all transports.MultiHeimdallClientwraps all new methods with the existing failover mechanism. Fixed flaky failover tests that used global hit counters instead of method-specific tracking.DeterministicStateSyncBlockincluded inGatherForks()for peer compatibilityChainConfig.Description()Activation
DeterministicStateSyncBlock:nil(disabled) on mainnet/amoy until HF is scheduledChanges
Breaking changes
HF needed.
Checklist
Cross repository changes
Testing
Manual tests
Tested on kurtosis devnets with chaos testing (2-vs-2 network partition) and self-heal recovery. Many rounds of chaos, zero errors, state roots match erigon.