ci: fix Spark 4.0.2/JDK 21 flake by enabling per-suite dedicated JVMs #4360
Merged
…apache#4327) The workflow sets DEDICATED_JVM_SBT_TESTS for the V1/V2 Parquet and Orc source suites on Spark 4.0/JDK 21, expecting Spark's SparkParallelTestGrouping to fork a dedicated JVM per suite. But the same workflow also unconditionally sets SERIAL_SBT_TESTS=1, and project/SparkBuild.scala only installs SparkParallelTestGrouping when SERIAL_SBT_TESTS is unset. As a result every suite shares a single forked JVM, accumulated state from ~11k prior tests trips DebugFilesystem.assertNoOpenStreams() in afterEach, and the four problem suites are aborted with "There are 1 possibly leaked file streams." Drop SERIAL_SBT_TESTS only for the 4.0.2/JDK 21 matrix row so the grouping is installed and DEDICATED_JVM_SBT_TESTS actually takes effect. Override concurrentRestrictions to cap parallel forked test JVMs at 1 so peak memory stays within the 7 GB standard-runner budget (each test JVM has -Xmx2g).
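The interaction between the two environment variables described above can be modeled in a few lines of plain Scala. This is a simplified sketch under stated assumptions, not the contents of project/SparkBuild.scala: the object name `TestGroupingModel`, the shared group name, and the suite names `example.SuiteA` / `example.Other` are hypothetical and exist only for illustration.

```scala
// Simplified model of the gating described in the commit message.
// NOT the real SparkBuild.scala; all names below are illustrative assumptions.
object TestGroupingModel {
  // Hypothetical name for the shared group that non-dedicated suites fall into.
  val DefaultGroup: String = "default_test_group"

  /** Suites listed in DEDICATED_JVM_SBT_TESTS (comma-separated); empty if unset. */
  def dedicatedSuites(env: Map[String, String]): Set[String] =
    env.get("DEDICATED_JVM_SBT_TESTS")
      .map(_.split(",").map(_.trim).filter(_.nonEmpty).toSet)
      .getOrElse(Set.empty[String])

  /** The grouping is only installed when SERIAL_SBT_TESTS is absent. */
  def groupingInstalled(env: Map[String, String]): Boolean =
    !env.contains("SERIAL_SBT_TESTS")

  /** The JVM group a suite ends up in under this model. */
  def jvmGroupFor(suite: String, env: Map[String, String]): String =
    if (groupingInstalled(env) && dedicatedSuites(env).contains(suite)) suite
    else DefaultGroup

  def main(args: Array[String]): Unit = {
    val serial = Map(
      "SERIAL_SBT_TESTS" -> "1",
      "DEDICATED_JVM_SBT_TESTS" -> "example.SuiteA"
    )
    // With SERIAL_SBT_TESTS set, the dedicated list is ignored.
    println(jvmGroupFor("example.SuiteA", serial)) // default_test_group
    // Without it, the listed suite gets its own group (its own forked JVM).
    println(jvmGroupFor("example.SuiteA", serial - "SERIAL_SBT_TESTS")) // example.SuiteA
  }
}
```

In this model, setting `SERIAL_SBT_TESTS` makes the dedicated-suite list irrelevant, which matches the failure mode described: the list is parsed but never consulted.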
mbutrovich (Contributor) approved these changes on May 18, 2026 and left a comment:
Worth a shot! Approved pending CI. It's so odd that of the 4.x suites it's 4.0.2 that gives grief and not 4.1.
Which issue does this PR close?
Closes #4327.
Rationale for this change
The spark-sql-auto-sql_core-1/spark-4.0.2-jdk21 job has been failing frequently with the error "There are 1 possibly leaked file streams."

This is Spark's DebugFilesystem.assertNoOpenStreams() in SharedSparkSession.afterEach, retried via eventually for ~10s.

These four suites are flaky on JDK 21 even in apache/spark's own CI; the upstream workaround is DEDICATED_JVM_SBT_TESTS, which asks SBT to fork a dedicated JVM per listed suite. Comet's workflow already sets that env var for the Spark 4.0 row, but it has no effect because the workflow also unconditionally sets SERIAL_SBT_TESTS=1 (added in #4285 to cap peak memory on standard runners). In project/SparkBuild.scala, SparkParallelTestGrouping.settings is the only consumer of DEDICATED_JVM_SBT_TESTS and is installed only when SERIAL_SBT_TESTS is unset, so when SERIAL_SBT_TESTS is set the env var is read into an unused set and every suite runs in one shared forked JVM. Cross-suite state accumulation from ~11k prior tests is what trips the file-stream leak detection.

Evidence from a failing log (run 26004020697):
- Thread name readingParquetFooters-ForkJoinPool-12260-worker-1 (high pool counter = many prior suites in this JVM)
- Other suites (AnalysisConfOverrideSuite, TPCDSModifiedPlanStabilityWithStatsSuite, state-store suites) appear in the same JVM right before these Parquet/ORC suites

What changes are included in this PR?
- Drop SERIAL_SBT_TESTS=1 only for the Spark 4.0.2/JDK 21 matrix row, so Spark's SparkParallelTestGrouping is installed and DEDICATED_JVM_SBT_TESTS actually forks a separate JVM per listed suite. Other rows keep SERIAL_SBT_TESTS=1 so their memory profile is unchanged.
- Override Global / concurrentRestrictions to cap parallel forked test JVMs at 1. Spark's defaults would otherwise allow up to 4 in parallel (each -Xmx2g), which would exceed the 7 GB runner budget. The cap is a no-op for rows that still have SERIAL_SBT_TESTS=1.

How are these changes tested?
CI on this PR will exercise the new path. A successful run for spark-sql-auto-sql_core-1/spark-4.0.2-jdk21 confirms that the dedicated-JVM workaround is now in effect and the four problem suites no longer abort.
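For reference, a cap on parallel forked test JVMs could look roughly like the sbt fragment below. This is an assumption about the form of the override, written with sbt's standard `Tags.limit` and `Tags.ForkedTestGroup`; the PR's actual diff is not reproduced here, and how the setting is injected into Spark's build is not shown. The arithmetic behind the cap: four forked JVMs at -Xmx2g each permit 8 GB of heap alone, above the 7 GB runner budget, whereas a limit of 1 caps forked test-JVM heap at 2 GB.

```scala
// build.sbt-style fragment (sbt 1.x slash syntax); illustrative sketch only.
// Allow at most one forked test group, and therefore one forked test JVM,
// to run at any given time. Using := replaces any previously set restrictions.
Global / concurrentRestrictions := Seq(
  Tags.limit(Tags.ForkedTestGroup, 1)
)
```

With a limit of 1, dedicated suites still get their own JVMs, but those JVMs run one after another rather than side by side.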