RATIS-2548. Stabilize timing-sensitive Ratis tests #1475

Open
CRZbulabula wants to merge 6 commits into apache:master from CRZbulabula:ratis-2548

Contributor

@CRZbulabula CRZbulabula commented May 29, 2026

What changes were proposed in this pull request?

This PR stabilizes several timing-sensitive tests by replacing fixed sleeps or immediate assertions with waits for the concrete condition each test needs.

The changes include:

  • Wait for restart futures before printing logs and continuing to log assertions in RaftBasicTests.
  • Keep the kill-leader restart scheduled before client messages are sent, and assert that the expected append messages are present in order while tolerating an extra retry/failover state-machine entry.
  • Track async append replies with List<CompletableFuture<RaftClientReply>> and join each reply directly.
  • Wait for commit index / state machine count in linearizable read tests instead of relying on fixed sleeps.
  • Wait for the transaction context map to become empty in RaftLogTruncateTests.
  • Pause the current leader before election stepDown and wait until a different leader is elected.
  • Allow more time for the install-snapshot follower next-index assertion on slow CI.
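
The common pattern behind these changes is to poll for the condition a test needs instead of sleeping for a fixed interval. Below is a minimal sketch of that idea, not the code in this PR: `WaitForSketch`, `waitFor`, and the `commitIndex` counter are hypothetical names used only for illustration, and the real tests may rely on existing Ratis test utilities instead.

```java
import java.time.Duration;
import java.util.concurrent.TimeoutException;
import java.util.concurrent.atomic.AtomicLong;
import java.util.function.BooleanSupplier;

/** Hypothetical helper (not part of Ratis): poll a condition until it holds or a deadline passes. */
final class WaitForSketch {
  private WaitForSketch() {}

  static void waitFor(BooleanSupplier condition, Duration timeout, Duration pollInterval, String name)
      throws InterruptedException, TimeoutException {
    final long deadline = System.nanoTime() + timeout.toNanos();
    while (!condition.getAsBoolean()) {
      // Overflow-safe comparison of nanoTime values.
      if (System.nanoTime() - deadline >= 0) {
        throw new TimeoutException("Timed out waiting for " + name);
      }
      Thread.sleep(pollInterval.toMillis());
    }
  }

  public static void main(String[] args) throws Exception {
    // Stand-in for a commit index that another thread advances asynchronously.
    final AtomicLong commitIndex = new AtomicLong();
    final Thread applier = new Thread(() -> {
      for (int i = 0; i < 5; i++) {
        try {
          Thread.sleep(20);
        } catch (InterruptedException e) {
          Thread.currentThread().interrupt();
          return;
        }
        commitIndex.incrementAndGet();
      }
    });
    applier.start();

    // Returns as soon as the condition holds; fails with a named timeout otherwise.
    waitFor(() -> commitIndex.get() >= 5, Duration.ofSeconds(5), Duration.ofMillis(10), "commitIndex >= 5");
    applier.join();
    System.out.println("commitIndex=" + commitIndex.get());
  }
}
```

On a fast machine the wait returns almost immediately, while on slow CI it can use the whole timeout, so the timeout can be generous without slowing down the normal case.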

Why are the changes needed?

These tests can fail on slower CI machines when asynchronous restart, log cleanup, state machine application, leadership transition, or install snapshot progress completes later than the fixed delay assumed by the test.

How was this patch tested?

mvn -pl ratis-test -am -Dtest=TestLinearizableReadRepliedIndexWithGrpc,TestRaftAsyncWithGrpc#testBasicAppendEntriesAsyncKillLeader,TestElectionCommandIntegrationWithGrpc,TestRaftLogTruncateWithGrpc,TestRaftWithGrpc test

mvn -pl ratis-test -am -Dtest=TestRaftReconfigurationWithSimulatedRpc#testKillLeaderDuringReconf test

Contributor

@szetszwo szetszwo left a comment

@CRZbulabula, thanks for fixing the tests! Please see the inline comments.

Comment on lines +1157 to +1159
    if (conf.isSingleMode(server.getId())) {
      return true;
    }
Contributor

@szetszwo szetszwo May 29, 2026

Let's do this change separately. Then, this PR changes only the test code.

Contributor Author

Done. Removed the LeaderStateImpl change from this PR, so the current PR diff is test-only now. I will handle that leadership check separately.

Comment on lines 157 to 160
    if (killLeader) {
      log.info("killAndRestart leader " + leader.getId());
      killAndRestartLeader = killAndRestartServer(leader.getId(), 0, 4000, cluster, log);
    }
Contributor

Wait for async append replies before injecting the kill-leader restart in RaftBasicTests.

Before this change, killLeader happened at the beginning; this change moves it to the end. That makes the test easier to pass but does not fix a bug.

It is good to test killLeader before the client sends messages, so let's not make this change.

Contributor Author

Done. Restored the kill-leader timing so the leader restart is scheduled before client messages are sent. The final kill-leader assertion now checks that the expected messages appear in order, while allowing an extra state-machine entry produced by retry/failover.
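
One way to express "expected messages appear in order, with extra entries tolerated" is an ordered-subsequence check. The sketch below only illustrates that idea and is not the assertion used in this PR; `OrderedSubsequenceSketch` and `containsInOrder` are hypothetical names, and plain strings stand in for state-machine entries.

```java
import java.util.Arrays;
import java.util.List;

/** Hypothetical helper: checks that `expected` is an ordered subsequence of `actual`. */
final class OrderedSubsequenceSketch {
  private OrderedSubsequenceSketch() {}

  static <T> boolean containsInOrder(List<T> actual, List<T> expected) {
    int next = 0; // index of the next expected element still to be found
    for (T entry : actual) {
      if (next < expected.size() && expected.get(next).equals(entry)) {
        next++;
      }
      // Any other entry (for example a duplicate from a retry) is skipped.
    }
    return next == expected.size();
  }

  public static void main(String[] args) {
    final List<String> expected = Arrays.asList("m0", "m1", "m2");
    // "m1" was applied twice, as could happen after a retry during failover.
    final List<String> applied = Arrays.asList("m0", "m1", "m1", "m2");
    System.out.println(containsInOrder(applied, expected));
  }
}
```

A check like this still fails when a message is missing or reordered, so it is looser than an exact-equality assertion only with respect to extra entries.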

Comment on lines +156 to +160
    int ret = shell.run("election", "pause", "-peers", sb.toString(), "-address",
        leader.getPeer().getAddress());
    Assertions.assertEquals(0, ret);

    ret = shell.run("election", "stepDown", "-peers", sb.toString());
Contributor

This change is good. Could you also remove the redundant toString() calls?

    int ret = shell.run("election", "pause", "-peers", sb, "-address", leader.getPeer().getAddress());
    Assertions.assertEquals(0, ret);

    ret = shell.run("election", "stepDown", "-peers", sb);
    Assertions.assertEquals(0, ret);

Contributor Author

Done. Removed the redundant toString() calls.

Comment on lines -161 to -169
      Thread.sleep(cluster.getTimeoutMax().toIntExact(TimeUnit.MILLISECONDS) + 100);
    } finally {
      CompletableFuture.allOf(killAndRestartFollower, killAndRestartLeader).join();
    }
    Thread.sleep(cluster.getTimeoutMax().toIntExact(TimeUnit.MILLISECONDS) + 100);
    log.info(cluster.printAllLogs());
    killAndRestartFollower.join();
    killAndRestartLeader.join();
Contributor

Wait for restart futures before continuing to log assertions in RaftBasicTests.

You are right that we should join before printing the log.

How about we simply move cluster.printAllLogs() up? The try-finally makes the code harder to read.

     Thread.sleep(cluster.getTimeoutMax().toIntExact(TimeUnit.MILLISECONDS) + 100);
-    log.info(cluster.printAllLogs());
     killAndRestartFollower.join();
     killAndRestartLeader.join();
+    log.info(cluster.printAllLogs());

Contributor Author

Done. Removed the try/finally restructuring and now join the restart futures before cluster.printAllLogs().

    } else if (asyncReplyCount.incrementAndGet() == messages.length) {
      f.complete(null);
    }
    CompletableFuture.allOf(asyncReplies.toArray(new CompletableFuture<?>[0])).join();
Contributor

Since join() is called below, this allOf is not needed. Let's remove it.

BTW, changing

      final AtomicInteger asyncReplyCount = new AtomicInteger();
      final CompletableFuture<Void> f = new CompletableFuture<>();

to

      final List<CompletableFuture<RaftClientReply>> asyncReplies = new ArrayList<>();

does make the code easier to understand (although the original code is also correct).

Contributor Author

Done. Removed the redundant CompletableFuture.allOf(...).join() and kept the reply list cleanup.
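
For readers following this thread, here is a rough sketch of the list-of-futures pattern discussed above. It is an illustration under assumptions, not the test code: `Reply` is a hypothetical stand-in for RaftClientReply, and `sendAsync` fakes the asynchronous client call with `CompletableFuture.supplyAsync`.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;

final class AsyncRepliesSketch {
  private AsyncRepliesSketch() {}

  /** Hypothetical stand-in for RaftClientReply. */
  static final class Reply {
    final String message;
    final boolean success;

    Reply(String message, boolean success) {
      this.message = message;
      this.success = success;
    }
  }

  /** Hypothetical stand-in for an asynchronous client send. */
  static CompletableFuture<Reply> sendAsync(String message) {
    return CompletableFuture.supplyAsync(() -> new Reply(message, true));
  }

  static List<Reply> sendAllAndJoin(String... messages) {
    // Keep one future per request instead of a shared counter plus a single completion future.
    final List<CompletableFuture<Reply>> asyncReplies = new ArrayList<>();
    for (String m : messages) {
      asyncReplies.add(sendAsync(m));
    }
    // Joining each future already blocks until all of them are done,
    // so a preceding CompletableFuture.allOf(...).join() adds nothing.
    final List<Reply> replies = new ArrayList<>();
    for (CompletableFuture<Reply> f : asyncReplies) {
      replies.add(f.join());
    }
    return replies;
  }

  public static void main(String[] args) {
    for (Reply r : sendAllAndJoin("m0", "m1", "m2")) {
      System.out.println(r.message + " success=" + r.success);
    }
  }
}
```

Joining per future also surfaces the failure of a specific request at the point where that reply is inspected, which tends to make a failing test easier to diagnose.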

@CRZbulabula
Contributor Author

Thanks for the review. I updated the PR to address the inline comments:

  • Removed the LeaderStateImpl change from this PR; this PR is test-only now.
  • Restored the kill-leader timing so leader restart is scheduled before client messages are sent.
  • Kept the async reply list cleanup and removed the redundant allOf join.
  • Joined restart futures before cluster.printAllLogs() without the try/finally restructuring.
  • Removed the redundant toString() calls in the election command test.
  • For kill-leader append tests, the final assertion now verifies the expected messages appear in order while tolerating an extra retry/failover state-machine entry.

Local verification:

mvn -pl ratis-test -am -Dtest=TestLinearizableReadRepliedIndexWithGrpc,TestRaftAsyncWithGrpc#testBasicAppendEntriesAsyncKillLeader,TestElectionCommandIntegrationWithGrpc,TestRaftLogTruncateWithGrpc,TestRaftWithGrpc test

mvn -pl ratis-test -am -Dtest=TestRaftReconfigurationWithSimulatedRpc#testKillLeaderDuringReconf test

@CRZbulabula
Contributor Author

Pushed fb3ee4ee to address the latest CI failures.

Summary:

  • TestManualRestoreSnapshot now verifies that the restarted follower has restored and applied at least the saved snapshot index, instead of racing against the leader's current lastApplied value.
  • testAddServerForWaitReady now waits until the new server implementations report that they are running after the ADD configuration change, and always cleans up the START_COMPLETE injection in a finally block.
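
The finally-based cleanup mentioned in the second item can be sketched as follows. This is a generic illustration, not the Ratis injection API: `InjectionCleanupSketch`, its `INJECTIONS` map, and `runWithInjection` are hypothetical, and the string "START_COMPLETE" is used here only as an example key.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical registry of test-only hooks, used to illustrate cleanup in a finally block. */
final class InjectionCleanupSketch {
  private InjectionCleanupSketch() {}

  static final Map<String, Runnable> INJECTIONS = new ConcurrentHashMap<>();

  static void runWithInjection(String point, Runnable injected, Runnable testBody) {
    INJECTIONS.put(point, injected);
    try {
      testBody.run();
    } finally {
      // Runs even if testBody throws, so the hook cannot leak into later tests.
      INJECTIONS.remove(point);
    }
  }

  public static void main(String[] args) {
    try {
      runWithInjection("START_COMPLETE", () -> { }, () -> {
        throw new IllegalStateException("simulated test failure");
      });
    } catch (IllegalStateException e) {
      System.out.println("injections left after failure: " + INJECTIONS.size());
    }
  }
}
```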

Local verification:

  • mvn -pl ratis-examples -am -Dtest=TestManualRestoreSnapshot test
  • mvn -pl ratis-test -am -Dtest=TestLeaderElectionWithSimulatedRpc#testAddServerForWaitReady,TestLeaderElectionWithGrpc#testAddServerForWaitReady,TestLeaderElectionWithNetty#testAddServerForWaitReady test

GitHub Actions for fb3ee4ee are green; coverage is skipped as expected.
