RATIS-2548. Stabilize timing-sensitive Ratis tests#1475
Conversation
szetszwo
left a comment
There was a problem hiding this comment.
@CRZbulabula , thanks for fixing the tests! Please see the comments inlined.
| if (conf.isSingleMode(server.getId())) { | ||
| return true; | ||
| } |
There was a problem hiding this comment.
Let's do this change separately. Then, this PR changes only the test code.
There was a problem hiding this comment.
Done. Removed the LeaderStateImpl change from this PR, so the current PR diff is test-only now. I will handle that leadership check separately.
| if (killLeader) { | ||
| log.info("killAndRestart leader " + leader.getId()); | ||
| killAndRestartLeader = killAndRestartServer(leader.getId(), 0, 4000, cluster, log); | ||
| } |
There was a problem hiding this comment.
Wait for async append replies before injecting the kill-leader restart in RaftBasicTests.
Before this change, killLeader is in the beginning. This change moves it to the end. It makes the test easier to pass but not fixing a bug.
It is good to test killLeader before client sending messages. So, let's don't make this change?
There was a problem hiding this comment.
Done. Restored the kill-leader timing so the leader restart is scheduled before client messages are sent. The final kill-leader assertion now checks that the expected messages appear in order, while allowing an extra state-machine entry produced by retry/failover.
| int ret = shell.run("election", "pause", "-peers", sb.toString(), "-address", | ||
| leader.getPeer().getAddress()); | ||
| Assertions.assertEquals(0, ret); | ||
|
|
||
| ret = shell.run("election", "stepDown", "-peers", sb.toString()); |
There was a problem hiding this comment.
This change is good. Could you also remove the redundant toString() calls?
int ret = shell.run("election", "pause", "-peers", sb, "-address", leader.getPeer().getAddress());
Assertions.assertEquals(0, ret);
ret = shell.run("election", "stepDown", "-peers", sb);
Assertions.assertEquals(0, ret);There was a problem hiding this comment.
Done. Removed the redundant toString() calls.
| Thread.sleep(cluster.getTimeoutMax().toIntExact(TimeUnit.MILLISECONDS) + 100); | ||
| } finally { | ||
| CompletableFuture.allOf(killAndRestartFollower, killAndRestartLeader).join(); | ||
| } | ||
| Thread.sleep(cluster.getTimeoutMax().toIntExact(TimeUnit.MILLISECONDS) + 100); | ||
| log.info(cluster.printAllLogs()); | ||
| killAndRestartFollower.join(); | ||
| killAndRestartLeader.join(); |
There was a problem hiding this comment.
Wait for restart futures before continuing to log assertions in RaftBasicTests.
You are right that we should join before printing the log.
How about we simply move cluster.printAllLogs() up? The try-finally make the code harder to read.
Thread.sleep(cluster.getTimeoutMax().toIntExact(TimeUnit.MILLISECONDS) + 100);
- log.info(cluster.printAllLogs());
killAndRestartFollower.join();
killAndRestartLeader.join();
+ log.info(cluster.printAllLogs());There was a problem hiding this comment.
Done. Removed the try/finally restructuring and now join the restart futures before cluster.printAllLogs().
| } else if (asyncReplyCount.incrementAndGet() == messages.length) { | ||
| f.complete(null); | ||
| } | ||
| CompletableFuture.allOf(asyncReplies.toArray(new CompletableFuture<?>[0])).join(); |
There was a problem hiding this comment.
Since join() is called below. This allOf is not needed. Let's remove it.
BTW, changing
final AtomicInteger asyncReplyCount = new AtomicInteger();
final CompletableFuture<Void> f = new CompletableFuture<>();to
final List<CompletableFuture<RaftClientReply>> asyncReplies = new ArrayList<>();does make the code easier to understand (although the original code is also correct.)
There was a problem hiding this comment.
Done. Removed the redundant CompletableFuture.allOf(...).join() and kept the reply list cleanup.
|
Thanks for the review. I updated the PR to address the inline comments:
Local verification:
|
|
Pushed Summary:
Local verification:
GitHub Actions for |
What changes were proposed in this pull request?
This PR stabilizes several timing-sensitive tests by replacing fixed sleeps or immediate assertions with waits for the concrete condition each test needs.
The changes include:
RaftBasicTests.List<CompletableFuture<RaftClientReply>>and join each reply directly.RaftLogTruncateTests.Why are the changes needed?
These tests can fail on slower CI machines when asynchronous restart, log cleanup, state machine application, leadership transition, or install snapshot progress completes later than the fixed delay assumed by the test.
How was this patch tested?
mvn -pl ratis-test -am -Dtest=TestLinearizableReadRepliedIndexWithGrpc,TestRaftAsyncWithGrpc#testBasicAppendEntriesAsyncKillLeader,TestElectionCommandIntegrationWithGrpc,TestRaftLogTruncateWithGrpc,TestRaftWithGrpc testmvn -pl ratis-test -am -Dtest=TestRaftReconfigurationWithSimulatedRpc#testKillLeaderDuringReconf test