
[SPARK-56586][CONNECT][TESTS] Retry flaky python foreachBatch termination test #55786

Draft

LuciferYang wants to merge 14 commits into apache:master from LuciferYang:fix-foreach-batch-flaky-test

Conversation

@LuciferYang
Contributor

What changes were proposed in this pull request?

Why are the changes needed?

Does this PR introduce any user-facing change?

How was this patch tested?

Was this patch authored or co-authored using generative AI tooling?

zhengruifeng and others added 14 commits April 29, 2026 09:20
… test with retry + timeout

Wrap the body of "python foreachBatch process: process terminates after
query is stopped" with SparkFunSuite.retry(n = 2) and failAfter(2.minutes)
to bound the impact when the test hangs. The underlying hang (blocking
readInt against a Python foreachBatch worker that does not send its
response) is untouched; the wrapper is best-effort since failAfter uses
Thread.interrupt which cannot unblock a non-interruptible socket read.

Co-authored-by: Isaac
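
For orientation, a minimal sketch of the wrapping this commit describes. The suite name and the extracted runForeachBatchTerminationTestBody helper are illustrative assumptions; retry is the SparkFunSuite helper named above, and failAfter/SpanSugar come from ScalaTest:

```scala
import org.apache.spark.SparkFunSuite
import org.scalatest.concurrent.{Signaler, ThreadSignaler, TimeLimits}
import org.scalatest.time.SpanSugar._

class ForeachBatchTerminationSuite extends SparkFunSuite with TimeLimits {
  // failAfter needs a Signaler to deliver the interrupt to the test thread.
  implicit val signaler: Signaler = ThreadSignaler

  test("python foreachBatch process: process terminates after query is stopped") {
    // Up to 2 extra attempts, each capped at 2 minutes. Best-effort only:
    // failAfter relies on Thread.interrupt, which cannot unblock a
    // non-interruptible socket read.
    retry(2) {
      failAfter(2.minutes) {
        runForeachBatchTerminationTestBody()
      }
    }
  }

  // Hypothetical extraction of the original test body.
  private def runForeachBatchTerminationTestBody(): Unit = {
    // ... original test body ...
  }
}
```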
Refactor-only: keeps the diff vs master minimal by moving the body out
from under the retry/failAfter wrappers, avoiding re-indentation.

Co-authored-by: Isaac
The successful CI run (zhengruifeng/spark actions run 24756414638) shows
the test completing in 5.57s. A 1-minute cap gives ~10x margin while
keeping the 3-attempt retry budget at 3 minutes worst case.

Co-authored-by: Isaac
Replace failAfter(1.minute) with a fresh daemon thread + Thread.join(timeoutMillis).
The test hangs inside a blocking socket read (StreamingForeachBatchHelper.scala:172
dataIn.readInt()); Thread.interrupt, which failAfter uses, cannot unblock that, so
the previous wrap was ineffective. Running the body on a separate thread lets the
test thread proceed to a TimeoutException after 1 minute, letting retry fire.

Co-authored-by: Isaac
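
A sketch of the thread-and-join approach this commit describes. The helper name awaitTestBodyInNewThread is from this PR; the body below is a simplified illustration:

```scala
import java.util.concurrent.TimeoutException
import java.util.concurrent.atomic.AtomicReference

// Run the test body on a fresh daemon thread and give up after timeoutMillis.
// The test thread itself never enters the non-interruptible socket read, so it
// can always reach the TimeoutException and let the retry wrapper fire.
def awaitTestBodyInNewThread(timeoutMillis: Long)(body: => Unit): Unit = {
  val bodyError = new AtomicReference[Throwable](null)
  val runnable: Runnable = () => {
    try {
      body
    } catch {
      case t: Throwable => bodyError.set(t)
    }
  }
  val worker = new Thread(runnable, "foreach-batch-test-body")
  worker.setDaemon(true)
  worker.start()
  worker.join(timeoutMillis)
  if (worker.isAlive) {
    throw new TimeoutException(s"Test body still running after $timeoutMillis ms")
  }
  Option(bodyError.get()).foreach(e => throw e)
}
```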
Previous commit failed the sql/connect scalafmt check (inline try/catch
inside an anonymous Runnable). Switch to a SAM lambda with explicit
Runnable type and multi-line try/catch.

Co-authored-by: Isaac
Previous CI run showed the retry mechanism firing correctly (TimeoutException
on attempt 1, RETRY #1 and RETRY #2), but attempts 2 and 3 failed with
IllegalArgumentException because attempt 1's leaked thread kept q2 alive in
spark.streams.active, so re-creating a query with the same name failed.

Suffix query names with an atomic counter so each attempt uses fresh names,
and relax the "no running query" assertion to only check this attempt's
queries (a leaked query from a timed-out attempt cannot be synchronously
cleaned up). Drop the listener-count assertion since leaked listeners
pollute it on retry.

Co-authored-by: Isaac
Replace the AtomicInteger counter with System.nanoTime(); equivalent
uniqueness across retries without the extra class-level field.

Co-authored-by: Isaac
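
A sketch of the per-attempt naming and the relaxed assertion from the two commits above, in their final System.nanoTime() form (the name prefixes are illustrative):

```scala
// Fresh names per attempt: a query leaked by a timed-out attempt stays in
// spark.streams.active, so reusing the same name on retry would fail with
// IllegalArgumentException.
val attemptId = System.nanoTime()
val q1Name = s"foreachBatch_q1_$attemptId"
val q2Name = s"foreachBatch_q2_$attemptId"

// ... start the queries with .queryName(q1Name) / .queryName(q2Name) ...

// Attempt-scoped assertion: only require that this attempt's queries are no
// longer running; a query leaked by a previous timed-out attempt is tolerated
// because it cannot be cleaned up synchronously.
assert(!spark.streams.active.exists(q => q.name == q1Name || q.name == q2Name))
```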
SparkFunSuite.retry emits its "===== RETRY #N =====" line via log4j, which
in our setup only writes to target/unit-tests.log (visible only as a
downloaded artifact, not in the live job log). Replace with a small local
retry helper that prints to stdout and preserves the same semantics
(afterEach/beforeEach reset between attempts, up to maxAttempts total).

Co-authored-by: Isaac
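
A sketch of such a helper, assuming the suite mixes in BeforeAndAfterEach; the name retryWithVisibleLog is from this PR, the body is illustrative:

```scala
import scala.util.control.NonFatal

// Like SparkFunSuite.retry, but the retry notice goes to stdout so it shows up
// in the live CI job log instead of only in target/unit-tests.log.
private def retryWithVisibleLog(maxAttempts: Int)(body: => Unit): Unit = {
  var attempt = 1
  var done = false
  while (!done) {
    try {
      body
      done = true
    } catch {
      case NonFatal(e) if attempt < maxAttempts =>
        // scalastyle:off println
        println(s"===== RETRY #$attempt: ${e.getMessage} =====")
        // scalastyle:on println
        afterEach()  // reset suite state between attempts
        beforeEach()
        attempt += 1
    }
  }
}
```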
…ckets

Previous CI run showed the retry mechanism bounding the foreachBatch test at
3 minutes as designed, but the (up to 3) leaked worker threads each held a
running SparkConnectService, active streaming queries, and open Python
foreachBatch sockets. Downstream SparkConnectServiceE2ESuite tests then hung
for ~10 minutes each against that polluted session, burning the 150-min job
budget.

Plumb the SessionHolder to the retry wrapper so that on timeout we can call
streamingForeachBatchRunnerCleanerCache.cleanUpAll(), which eventually closes
the Python worker SocketChannels. That makes the hung dataIn.readInt() throw
AsynchronousCloseException, which unwinds the leaked thread through its
finally block (stops SparkConnectService, removes listeners). The wrapper
then joins the worker for up to 30s to let the cleanup complete before the
next retry attempt starts.

Co-authored-by: Isaac
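
A sketch of the timeout hook described above. sessionHolder and worker are assumed to be in scope (the SessionHolder plumbed from the test body and the daemon thread from awaitTestBodyInNewThread); the cleaner-cache call is the one named in this commit, everything around it is illustrative:

```scala
import java.util.concurrent.TimeoutException
import scala.util.control.NonFatal

// What the test supplies: closing the cleaner cache eventually closes the
// Python worker SocketChannels, so the hung dataIn.readInt() throws
// AsynchronousCloseException and the leaked thread unwinds through its
// finally block (stopping the service, removing listeners).
val onTimeout: () => Unit = () => {
  try {
    sessionHolder.streamingForeachBatchRunnerCleanerCache.cleanUpAll()
  } catch {
    case NonFatal(e) =>
      println(s"cleanUpAll() during timeout cleanup failed: $e") // do not swallow silently
  }
}

// Inside the timeout branch of awaitTestBodyInNewThread:
if (worker.isAlive) {
  onTimeout()             // unblock the non-interruptible socket read
  worker.join(30 * 1000L) // 30s grace for the leaked thread to unwind
  throw new TimeoutException("Test body timed out; cleaner cache closed, waited 30s grace")
}
```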
Adding the onTimeout parameter in the previous commit pushed the method
signature into a layout that scalafmt 3.8.6 rejects. Put the parameter list
back on a single line so scalafmt is happy again.

Generated-by: Claude Code (Anthropic Claude Opus 4.7)
…agnostics

Address feedback on the original change:

HIGH
- Snapshot baseline listeners before the body and capture the live
  SparkConnectService.server reference inside the body. The finally
  only stops the service when its identity still matches and only
  removes listeners this attempt registered, so a leaked finally from
  a previously timed-out attempt can no longer tear down the live
  service or strip listeners belonging to a concurrent attempt (see the
  sketch after this list).
- Restore an attempt-scoped variant of the listener-count assertion:
  exactly one new listener (the cleaner listener) should be registered
  per attempt over the captured baseline.

MEDIUM
- If the body finishes during the 30s post-cleanup grace window,
  attach its original error as the TimeoutException's cause instead
  of dropping it, so a slow assertion failure is not misreported as
  a pure hang.
- Raise the per-attempt cap from 1 minute to 2 minutes for slow-CI
  headroom while still strictly bounding the original 150-minute
  hang (worst case 3 * (2 min + 30s grace) ~= 7.5 minutes).
- Add a TODO next to retryWithVisibleLog flagging consolidation with
  SparkFunSuite.retry once that helper supports console-visible
  retry notices.

LOW
- Use NonFatal instead of Throwable in retryWithVisibleLog so fatal
  errors propagate.
- Dump the worker's name, state, and stack trace on timeout for
  post-mortem diagnosis.
- Drop the unused default for awaitTestBodyInNewThread's onTimeout
  parameter; the only caller always supplies a non-trivial cleanup.
- Print suppressed onTimeout cleanUpAll errors via println instead
  of swallowing them silently.
- Mention the 30s grace period in the TimeoutException message.
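
A sketch of the attempt-scoped guard described under HIGH. SparkConnectService.server is the reference named above; the stop() call, the listener bookkeeping, and the surrounding structure are illustrative:

```scala
// Snapshot the listeners that exist before this attempt runs.
val baselineListeners = spark.streams.listListeners().toSet

// The server instance this attempt itself started, captured inside the body.
val serviceStartedHere = SparkConnectService.server

try {
  // ... attempt body ...

  // Attempt-scoped listener check: exactly one new listener (the cleaner
  // listener) over the captured baseline.
  val added = spark.streams.listListeners().filterNot(baselineListeners.contains)
  assert(added.length == 1)
} finally {
  // Only tear down what this attempt owns: a finally block leaked by a
  // previously timed-out attempt must not stop the live service,
  if (SparkConnectService.server eq serviceStartedHere) {
    SparkConnectService.stop()
  }
  // and must not strip listeners registered by a concurrent attempt.
  spark.streams.listListeners()
    .filterNot(baselineListeners.contains)
    .foreach(spark.streams.removeListener)
}
```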
…on timeout

CI on the previous push hit a different hang mode than the original
fix targets: the worker thread was stuck inside

  StreamExecution.interruptAndAwaitExecutionThreadTermination
   -> Thread.join

(via query1.stop()), not inside dataIn.readInt(). Closing the cleaner
cache via onTimeout does not unblock that join, and the default
spark.sql.streaming.stopTimeout is 0 ("wait forever"), so query.stop()
hangs indefinitely.

Two changes:

- Wrap the test body in withSQLConf(STREAMING_STOP_TIMEOUT = 30s) so
  query.stop() falls through with a TimeoutException instead of waiting
  forever; the outer 2-minute attempt cap can then recover via retry.
- After onTimeout() in awaitTestBodyInNewThread, also call
  worker.interrupt() so any other interruptible blocking call
  (Thread.join, Object.wait, Thread.sleep) wakes up. onTimeout still
  handles non-interruptible socket reads via the cleaner-cache close.
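
A sketch of the two changes, assuming the test-side withSQLConf helper and the awaitTestBodyInNewThread wrapper from the earlier commits; worker and onTimeout are the names used there:

```scala
import org.apache.spark.sql.internal.SQLConf

// Change 1: bound query.stop(). With a 30s stop timeout,
// interruptAndAwaitExecutionThreadTermination gives up with a TimeoutException
// instead of joining forever (the default spark.sql.streaming.stopTimeout of 0
// means wait forever), so the outer 2-minute attempt cap can recover via retry.
withSQLConf(SQLConf.STREAMING_STOP_TIMEOUT.key -> "30s") {
  // ... test body ...
}

// Change 2: after the cleaner-cache cleanup, also interrupt the worker so any
// interruptible blocking call (Thread.join, Object.wait, Thread.sleep) wakes
// up; onTimeout still covers the non-interruptible socket read.
if (worker.isAlive) {
  onTimeout()
  worker.interrupt()
  worker.join(30 * 1000L)
}
```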

Also fixes the scalafmt issues that the previous push tripped:
- Reorder the SQLConf import to its alphabetical position.
- Shorten one comment line to fit under maxColumn = 98.
- Align the closing scalastyle:on comment indent inside the catch.
LuciferYang marked this pull request as draft on May 10, 2026 at 04:39
LuciferYang changed the title from "Test Fix foreach batch flaky test" to "[SPARK-56586][CONNECT][TESTS] Retry flaky python foreachBatch termination test" on May 10, 2026
