RATIS-2548. Stabilize timing-sensitive Ratis tests by CRZbulabula · Pull Request #1475 · apache/ratis

CRZbulabula · 2026-05-29T05:37:02Z

What changes were proposed in this pull request?

This PR stabilizes several timing-sensitive tests by replacing fixed sleeps or immediate assertions with waits for the concrete condition each test needs.

The changes include:

Wait for restart futures before printing logs and continuing to log assertions in RaftBasicTests.
Keep the kill-leader restart scheduled before client messages are sent, and assert that the expected append messages are present in order while tolerating an extra retry/failover state-machine entry.
Track async append replies with List<CompletableFuture<RaftClientReply>> and join each reply directly.
Wait for commit index / state machine count in linearizable read tests instead of relying on fixed sleeps.
Wait for the transaction context map to become empty in RaftLogTruncateTests.
Pause the current leader before election stepDown and wait until a different leader is elected.
Allow more time for the install-snapshot follower next-index assertion on slow CI.
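
As an illustration of the general pattern behind these changes, here is a minimal sketch of polling for a concrete condition instead of sleeping for a fixed time. ConditionWaiter and waitFor are hypothetical names for this example only and are not the helpers used in the Ratis test code; the AtomicLong stands in for a value such as a commit index that a background thread advances.

```java
import java.time.Duration;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.TimeoutException;
import java.util.concurrent.atomic.AtomicLong;
import java.util.function.BooleanSupplier;

// Hypothetical helper for illustration; not part of the Ratis code base.
final class ConditionWaiter {
  private ConditionWaiter() {}

  /** Polls the condition until it holds, or throws once the timeout has elapsed. */
  static void waitFor(BooleanSupplier condition, Duration timeout, Duration pollInterval)
      throws InterruptedException, TimeoutException {
    final long deadline = System.nanoTime() + timeout.toNanos();
    while (!condition.getAsBoolean()) {
      // Overflow-safe deadline comparison for System.nanoTime().
      if (System.nanoTime() - deadline >= 0) {
        throw new TimeoutException("Condition not met within " + timeout);
      }
      Thread.sleep(pollInterval.toMillis());
    }
  }

  public static void main(String[] args) throws Exception {
    // Stand-in for a commit index that is advanced asynchronously.
    final AtomicLong commitIndex = new AtomicLong();
    CompletableFuture.runAsync(() -> {
      for (int i = 0; i < 5; i++) {
        commitIndex.incrementAndGet();
      }
    });

    // Wait for the state the assertion needs, rather than a guessed delay.
    waitFor(() -> commitIndex.get() >= 5, Duration.ofSeconds(5), Duration.ofMillis(10));
    System.out.println("commitIndex = " + commitIndex.get());
  }
}
```

On a fast machine the wait returns almost immediately, and on a slow one it keeps polling up to the timeout, so the timeout can be generous without slowing down the common case.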

Why are the changes needed?

These tests can fail on slower CI machines when asynchronous restart, log cleanup, state machine application, leadership transition, or install snapshot progress completes later than the fixed delay assumed by the test.

How was this patch tested?

mvn -pl ratis-test -am -Dtest=TestLinearizableReadRepliedIndexWithGrpc,TestRaftAsyncWithGrpc#testBasicAppendEntriesAsyncKillLeader,TestElectionCommandIntegrationWithGrpc,TestRaftLogTruncateWithGrpc,TestRaftWithGrpc test

mvn -pl ratis-test -am -Dtest=TestRaftReconfigurationWithSimulatedRpc#testKillLeaderDuringReconf test

szetszwo

@CRZbulabula , thanks for fixing the tests! Please see the comments inlined.

szetszwo · 2026-05-29T15:59:55Z

+    if (conf.isSingleMode(server.getId())) {
+      return true;
+    }


Let's do this change separately. Then, this PR changes only the test code.

Done. Removed the LeaderStateImpl change from this PR, so the current PR diff is test-only now. I will handle that leadership check separately.

szetszwo · 2026-05-29T16:20:18Z

+      if (killLeader) {
+        log.info("killAndRestart leader " + leader.getId());
+        killAndRestartLeader = killAndRestartServer(leader.getId(), 0, 4000, cluster, log);
      }


Wait for async append replies before injecting the kill-leader restart in RaftBasicTests.

Before this change, killLeader was at the beginning. This change moves it to the end, which makes the test easier to pass but does not fix a bug.

It is good to test killLeader before the client sends messages. So, let's not make this change?

Done. Restored the kill-leader timing so the leader restart is scheduled before client messages are sent. The final kill-leader assertion now checks that the expected messages appear in order, while allowing an extra state-machine entry produced by retry/failover.
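
An "in order, tolerating extra entries" assertion of this kind can be expressed as a subsequence check. The following is only a sketch under that assumption: InOrderCheck and containsInOrder are hypothetical names, plain strings stand in for state-machine entries, and the actual assertion in the PR may be written differently.

```java
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;
import java.util.Objects;

// Hypothetical helper for illustration; not taken from the PR diff.
final class InOrderCheck {
  private InOrderCheck() {}

  /**
   * Returns true if every element of expected occurs in actual in the same
   * relative order. Additional elements in actual are ignored.
   */
  static <T> boolean containsInOrder(List<T> actual, List<T> expected) {
    final Iterator<T> it = actual.iterator();
    for (T e : expected) {
      boolean found = false;
      // Advance through actual until the next expected element is seen.
      while (it.hasNext()) {
        if (Objects.equals(e, it.next())) {
          found = true;
          break;
        }
      }
      if (!found) {
        return false;
      }
    }
    return true;
  }

  public static void main(String[] args) {
    final List<String> expected = Arrays.asList("m0", "m1", "m2");
    // An extra entry, such as one produced by a retry, does not fail the check.
    System.out.println(containsInOrder(Arrays.asList("m0", "m1", "retry-entry", "m2"), expected));
    // Reordered entries do fail it.
    System.out.println(containsInOrder(Arrays.asList("m1", "m0", "m2"), expected));
  }
}
```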

szetszwo · 2026-05-29T16:28:53Z

+    int ret = shell.run("election", "pause", "-peers", sb.toString(), "-address",
+        leader.getPeer().getAddress());
+    Assertions.assertEquals(0, ret);
+
+    ret = shell.run("election", "stepDown", "-peers", sb.toString());


This change is good. Could you also remove the redundant toString() calls?

    int ret = shell.run("election", "pause", "-peers", sb, "-address", leader.getPeer().getAddress());
    Assertions.assertEquals(0, ret);
    ret = shell.run("election", "stepDown", "-peers", sb);
    Assertions.assertEquals(0, ret);

Done. Removed the redundant toString() calls.

szetszwo · 2026-05-29T16:55:38Z

+      Thread.sleep(cluster.getTimeoutMax().toIntExact(TimeUnit.MILLISECONDS) + 100);
+    } finally {
+      CompletableFuture.allOf(killAndRestartFollower, killAndRestartLeader).join();
    }
-    Thread.sleep(cluster.getTimeoutMax().toIntExact(TimeUnit.MILLISECONDS) + 100);
    log.info(cluster.printAllLogs());
-    killAndRestartFollower.join();
-    killAndRestartLeader.join();


Wait for restart futures before continuing to log assertions in RaftBasicTests.

You are right that we should join before printing the log.

How about we simply move cluster.printAllLogs() up? The try-finally makes the code harder to read.

     Thread.sleep(cluster.getTimeoutMax().toIntExact(TimeUnit.MILLISECONDS) + 100);
-    log.info(cluster.printAllLogs());
     killAndRestartFollower.join();
     killAndRestartLeader.join();
+    log.info(cluster.printAllLogs());

Done. Removed the try/finally restructuring and now join the restart futures before cluster.printAllLogs().

szetszwo · 2026-05-29T16:59:09Z

-            } else if (asyncReplyCount.incrementAndGet() == messages.length) {
-              f.complete(null);
-            }
+          CompletableFuture.allOf(asyncReplies.toArray(new CompletableFuture<?>[0])).join();


Since join() is called below, this allOf is not needed. Let's remove it.

BTW, changing

    final AtomicInteger asyncReplyCount = new AtomicInteger();
    final CompletableFuture<Void> f = new CompletableFuture<>();

to

final List<CompletableFuture<RaftClientReply>> asyncReplies = new ArrayList<>();

does make the code easier to understand (although the original code is also correct).

Done. Removed the redundant CompletableFuture.allOf(...).join() and kept the reply list cleanup.
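
For readers unfamiliar with the pattern discussed here, the following sketch shows replies collected in a list of futures and joined one by one, which is why a preceding allOf(...).join() adds nothing. AsyncReplyList, sendAsync, and sendAll are hypothetical stand-ins, and String replaces RaftClientReply so that the example does not depend on Ratis.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.CompletableFuture;

// Hypothetical sketch; String stands in for RaftClientReply.
final class AsyncReplyList {
  private AsyncReplyList() {}

  /** Stand-in for an asynchronous client send that completes with a reply. */
  static CompletableFuture<String> sendAsync(String message) {
    return CompletableFuture.supplyAsync(() -> "reply:" + message);
  }

  /** Sends all messages first, then joins each reply future in send order. */
  static List<String> sendAll(List<String> messages) {
    final List<CompletableFuture<String>> asyncReplies = new ArrayList<>();
    for (String m : messages) {
      asyncReplies.add(sendAsync(m));
    }

    final List<String> replies = new ArrayList<>();
    for (CompletableFuture<String> f : asyncReplies) {
      // join() blocks until this reply is complete and rethrows any failure,
      // so no separate counter or allOf(...) barrier is required.
      replies.add(f.join());
    }
    return replies;
  }

  public static void main(String[] args) {
    System.out.println(sendAll(Arrays.asList("m0", "m1", "m2")));
  }
}
```

Compared with a shared counter completing a single future, each reply stays individually inspectable, so a failed send surfaces at the join for that specific message.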

CRZbulabula · 2026-05-31T04:41:31Z

Thanks for the review. I updated the PR to address the inline comments:

Removed the LeaderStateImpl change from this PR; this PR is test-only now.
Restored the kill-leader timing so leader restart is scheduled before client messages are sent.
Kept the async reply list cleanup and removed the redundant allOf join.
Joined restart futures before cluster.printAllLogs() without the try/finally restructuring.
Removed the redundant toString() calls in the election command test.
For kill-leader append tests, the final assertion now verifies the expected messages appear in order while tolerating an extra retry/failover state-machine entry.

Local verification:

mvn -pl ratis-test -am -Dtest=TestLinearizableReadRepliedIndexWithGrpc,TestRaftAsyncWithGrpc#testBasicAppendEntriesAsyncKillLeader,TestElectionCommandIntegrationWithGrpc,TestRaftLogTruncateWithGrpc,TestRaftWithGrpc test

mvn -pl ratis-test -am -Dtest=TestRaftReconfigurationWithSimulatedRpc#testKillLeaderDuringReconf test

CRZbulabula · 2026-05-31T06:35:07Z

Pushed fb3ee4ee to address the latest CI failures.

Summary:

TestManualRestoreSnapshot now verifies that the restarted follower has restored and applied at least the saved snapshot index, instead of racing against the leader's current lastApplied value.
testAddServerForWaitReady now waits until the new server impls report running after the ADD configuration change, and always cleans up the START_COMPLETE injection in a finally block.
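
The finally-block cleanup mentioned above follows a common shape, sketched here under stated assumptions. InjectionCleanup, its INJECTIONS map, and runWithInjection are hypothetical and only model a static test-hook registry; they are not the Ratis injection API, and the key string is used purely as an example label.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical model of a static test-hook registry; not the Ratis API.
final class InjectionCleanup {
  private InjectionCleanup() {}

  /** Registered hooks, keyed by injection point name. */
  static final Map<String, Runnable> INJECTIONS = new ConcurrentHashMap<>();

  /** Installs a hook, runs the test body, and always removes the hook afterwards. */
  static void runWithInjection(String point, Runnable injected, Runnable testBody) {
    INJECTIONS.put(point, injected);
    try {
      testBody.run();
    } finally {
      // Runs even if the test body throws, so the hook cannot leak into later tests.
      INJECTIONS.remove(point);
    }
  }

  public static void main(String[] args) {
    try {
      runWithInjection("START_COMPLETE", () -> { }, () -> {
        throw new IllegalStateException("simulated test failure");
      });
    } catch (IllegalStateException e) {
      System.out.println("test body failed: " + e.getMessage());
    }
    System.out.println("hooks still registered: " + INJECTIONS.size());
  }
}
```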

Local verification:

mvn -pl ratis-examples -am -Dtest=TestManualRestoreSnapshot test
mvn -pl ratis-test -am -Dtest=TestLeaderElectionWithSimulatedRpc#testAddServerForWaitReady,TestLeaderElectionWithGrpc#testAddServerForWaitReady,TestLeaderElectionWithNetty#testAddServerForWaitReady test

GitHub Actions for fb3ee4ee are green; coverage is skipped as expected.

RATIS-2548. Stabilize timing-sensitive Ratis tests

ee03331

CRZbulabula mentioned this pull request May 29, 2026

RATIS-2529. Bound gRPC worker EventLoopGroup thread count. #1466


CRZbulabula added 2 commits May 29, 2026 15:30

RATIS-2548. Stabilize async append kill-leader test

83627dc

RATIS-2548. Preserve single-mode leadership checks

0a56010

szetszwo reviewed May 29, 2026

RATIS-2548. Address review comments

82718bc

CRZbulabula force-pushed the ratis-2548 branch from d20f735 to 82718bc May 31, 2026 04:58

CRZbulabula added 2 commits May 31, 2026 13:29

RATIS-2548. Stabilize test state machine snapshots

51b60c3

RATIS-2548. Stabilize snapshot and leader election tests

fb3ee4e
