ci: fix Spark 4.0.2/JDK 21 flake by enabling per-suite dedicated JVMs by andygrove · Pull Request #4360 · apache/datafusion-comet

andygrove · 2026-05-18T16:18:05Z

Which issue does this PR close?

Closes #4327.

Rationale for this change

The spark-sql-auto-sql_core-1/spark-4.0.2-jdk21 job has been failing frequently with:

[info] ParquetFileFormatV{1,2}Suite (or OrcSourceV{1,2}Suite) *** ABORTED ***
[info]   "There are 1 possibly leaked file streams." (SharedSparkSession.scala:189)

The message comes from Spark's DebugFilesystem.assertNoOpenStreams(), which SharedSparkSession.afterEach calls and retries via eventually for ~10s before aborting the suite.
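To make the failure mode concrete, here is a minimal sketch of the pattern: a registry of open streams plus a check that is retried until a deadline. It is not Spark's code. The `LeakCheckSketch` object, its `opened`/`closed` helpers, and the hand-rolled `retryUntil` loop (standing in for `eventually`) are all hypothetical names introduced for illustration.

```scala
import scala.collection.concurrent.TrieMap
import scala.concurrent.duration._

// Hypothetical stand-in for a filesystem wrapper that tracks open streams.
object LeakCheckSketch {
  private val open = TrieMap.empty[Long, Throwable]
  private val ids = new java.util.concurrent.atomic.AtomicLong()

  // Record a newly opened stream and remember where it was opened.
  def opened(): Long = {
    val id = ids.incrementAndGet()
    open.put(id, new Throwable("stream opened here"))
    id
  }

  // Forget a stream once it has been closed (possibly from another thread).
  def closed(id: Long): Unit = open.remove(id)

  // Fail if any stream is still registered as open.
  def assertNoOpenStreams(): Unit = {
    val n = open.size
    if (n > 0) {
      throw new IllegalStateException(s"There are $n possibly leaked file streams.")
    }
  }

  // Retry `check` until it passes or the timeout elapses; the last failure
  // is rethrown once no time is left.
  def retryUntil(timeout: FiniteDuration, interval: FiniteDuration)(check: => Unit): Unit = {
    val deadline = timeout.fromNow
    var passed = false
    while (!passed) {
      try {
        check
        passed = true
      } catch {
        case _: IllegalStateException if deadline.hasTimeLeft() =>
          Thread.sleep(interval.toMillis)
      }
    }
  }
}
```

The retry only helps when a stream is closed late by another thread. A stream that stays registered for the whole window still fails the check, which is what the aborted suites show.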

These four suites are flaky on JDK 21 even in apache/spark's own CI; the upstream workaround is DEDICATED_JVM_SBT_TESTS, which asks SBT to fork a dedicated JVM per listed suite. Comet's workflow already sets that env var for the Spark 4.0 row, but it has no effect because the workflow also unconditionally sets SERIAL_SBT_TESTS=1 (added in #4285 to cap peak memory on standard runners). In project/SparkBuild.scala:

if (!sys.env.contains("SERIAL_SBT_TESTS")) {
  allProjects.foreach(enable(SparkParallelTestGrouping.settings))
}

SparkParallelTestGrouping.settings is the only consumer of DEDICATED_JVM_SBT_TESTS, so when SERIAL_SBT_TESTS is set the env var is read into an unused set and every suite runs in one shared forked JVM. Cross-suite state accumulation from ~11k prior tests is what trips the file-stream leak detection.
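The interaction between the two variables can be modelled in a few lines. This is a simplified sketch, not the logic in SparkBuild.scala: `GroupingGateSketch`, `jvmGroup`, the "default" group label, the comma-separated list format, and the suite name in the usage note are assumptions made for illustration.

```scala
object GroupingGateSketch {
  // Parse the suite list; a comma-separated format is assumed here.
  def dedicatedSuites(env: Map[String, String]): Set[String] =
    env.get("DEDICATED_JVM_SBT_TESTS")
      .map(_.split(",").map(_.trim).filter(_.nonEmpty).toSet)
      .getOrElse(Set.empty)

  // Decide which forked-JVM group a suite lands in.
  def jvmGroup(env: Map[String, String], suite: String): String = {
    // The list is parsed regardless of SERIAL_SBT_TESTS ...
    val dedicated = dedicatedSuites(env)
    // ... but it is only consulted when the grouping is installed,
    // mirroring the `!sys.env.contains("SERIAL_SBT_TESTS")` check.
    val groupingInstalled = !env.contains("SERIAL_SBT_TESTS")
    if (groupingInstalled && dedicated.contains(suite)) suite // own JVM
    else "default"                                            // shared JVM
  }
}
```

With both variables set, `jvmGroup(Map("DEDICATED_JVM_SBT_TESTS" -> "example.ParquetFileFormatV1Suite", "SERIAL_SBT_TESTS" -> "1"), "example.ParquetFileFormatV1Suite")` returns "default". Removing `SERIAL_SBT_TESTS` from the map makes it return the suite name, so the suite gets its own group.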

Evidence from a failing log (run 26004020697):

thread-leak warning mentions readingParquetFooters-ForkJoinPool-12260-worker-1 (high pool counter = many prior suites in this JVM)
unrelated suites (AnalysisConfOverrideSuite, TPCDSModifiedPlanStabilityWithStatsSuite, state-store suites) appear in the same JVM right before these Parquet/ORC suites
no SBT fork-restart markers between unrelated suites
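The pool-counter observation relies on the JDK's default naming for ForkJoinPool worker threads, where the pool number increments for each pool created in the JVM. A small sketch of pulling that counter out of a log line follows; `PoolCounterSketch` and `poolCounter` are hypothetical helpers, not part of Spark or Comet.

```scala
object PoolCounterSketch {
  // Matches "...ForkJoinPool-<pool>-worker-<n>" anywhere in the input.
  private val PoolName = """ForkJoinPool-(\d+)-worker-(\d+)""".r.unanchored

  // Extract the pool counter, if the thread name follows the default pattern.
  def poolCounter(threadName: String): Option[Int] = threadName match {
    case PoolName(pool, _) => Some(pool.toInt)
    case _                 => None
  }
}
```

For the thread name quoted above, `poolCounter("readingParquetFooters-ForkJoinPool-12260-worker-1")` yields `Some(12260)`. A freshly forked JVM running a single suite would be expected to show a much smaller number.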

What changes are included in this PR?

Drop SERIAL_SBT_TESTS=1 only for the Spark 4.0.2/JDK 21 matrix row, so Spark's SparkParallelTestGrouping is installed and DEDICATED_JVM_SBT_TESTS actually forks a separate JVM per listed suite. Other rows keep SERIAL_SBT_TESTS=1 so their memory profile is unchanged.
Override Global / concurrentRestrictions to cap parallel forked test JVMs at 1. Spark's defaults would otherwise allow up to 4 in parallel (each -Xmx2g), which would exceed the 7 GB runner budget. The cap is a no-op for rows that still have SERIAL_SBT_TESTS=1.
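The cap in the second item could look roughly like the sbt fragment below. The PR's actual diff is not reproduced on this page, so this is an assumed shape: it uses sbt's built-in `Tags.ForkedTestGroup` tag and sbt 1.x slash syntax, and the exact placement in the build is not shown.

```scala
// Sketch of an sbt setting (assumed shape, not the PR's exact diff).
Global / concurrentRestrictions := Seq(
  // Allow at most one forked test group, i.e. one forked test JVM, at a time.
  Tags.limit(Tags.ForkedTestGroup, 1)
)
```

The memory reasoning is simple arithmetic: four parallel JVMs at -Xmx2g could reach 4 × 2 GB = 8 GB of heap alone, above the 7 GB runner budget, while a limit of 1 keeps the forked-test heap ceiling at 2 GB.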

How are these changes tested?

CI on this PR will exercise the new path. A successful run for spark-sql-auto-sql_core-1/spark-4.0.2-jdk21 confirms that the dedicated-JVM workaround is now in effect and the four problem suites no longer abort.

…apache#4327)

The workflow sets DEDICATED_JVM_SBT_TESTS for the V1/V2 Parquet and Orc source suites on Spark 4.0/JDK 21, expecting Spark's SparkParallelTestGrouping to fork a dedicated JVM per suite. But the same workflow also unconditionally sets SERIAL_SBT_TESTS=1, and project/SparkBuild.scala only installs SparkParallelTestGrouping when SERIAL_SBT_TESTS is unset.

As a result every suite shares a single forked JVM, accumulated state from ~11k prior tests trips DebugFilesystem.assertNoOpenStreams() in afterEach, and the four problem suites are aborted with "There are 1 possibly leaked file streams."

Drop SERIAL_SBT_TESTS only for the 4.0.2/JDK 21 matrix row so the grouping is installed and DEDICATED_JVM_SBT_TESTS actually takes effect. Override concurrentRestrictions to cap parallel forked test JVMs at 1 so peak memory stays within the 7 GB standard-runner budget (each test JVM has -Xmx2g).

mbutrovich

Worth a shot! Approved pending CI. It's so odd that of the 4.x suites it's 4.0.2 that gives grief and not 4.1.

mbutrovich approved these changes May 18, 2026

mbutrovich merged commit ffcebaa into apache:main May 18, 2026
190 of 192 checks passed

mbutrovich merged 1 commit into apache:main from andygrove:fix-4327-dedicated-jvm-spark-4
