ci: fix Spark 4.0.2/JDK 21 flake by enabling per-suite dedicated JVMs #4360

Merged
mbutrovich merged 1 commit into apache:main from andygrove:fix-4327-dedicated-jvm-spark-4 on May 18, 2026

@andygrove
Member

Which issue does this PR close?

Closes #4327.

Rationale for this change

The spark-sql-auto-sql_core-1/spark-4.0.2-jdk21 job has been failing frequently with:

[info] ParquetFileFormatV{1,2}Suite (or OrcSourceV{1,2}Suite) *** ABORTED ***
[info]   "There are 1 possibly leaked file streams." (SharedSparkSession.scala:189)

The message comes from Spark's DebugFilesystem.assertNoOpenStreams(), which SharedSparkSession.afterEach retries via eventually for about 10 s before the suite is aborted.
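To make the retry behaviour concrete, here is a minimal, self-contained model of a retry-until-deadline check. It is a sketch, not Spark's or ScalaTest's code: retryUntil, the openStreams counter, and the local assertNoOpenStreams are hypothetical stand-ins, and the exception type is an assumption.

```scala
import scala.annotation.tailrec
import scala.concurrent.duration._
import scala.util.control.NonFatal

object EventuallyModel {
  // Re-run `check` until it passes; once `timeout` has elapsed, rethrow the last failure.
  def retryUntil(timeout: FiniteDuration, interval: FiniteDuration)(check: () => Unit): Unit = {
    val deadline = timeout.fromNow
    @tailrec def loop(): Unit = {
      val failure: Option[Throwable] =
        try { check(); None } catch { case NonFatal(e) => Some(e) }
      failure match {
        case None => ()
        case Some(e) if deadline.isOverdue() => throw e
        case Some(_) =>
          Thread.sleep(interval.toMillis)
          loop()
      }
    }
    loop()
  }

  def main(args: Array[String]): Unit = {
    // Hypothetical stand-in for the open-stream registry: one stream that is never closed.
    val openStreams = 1
    def assertNoOpenStreams(): Unit =
      if (openStreams > 0)
        throw new IllegalStateException(s"There are $openStreams possibly leaked file streams.")

    // Short timings keep the demo fast; the description above mentions roughly 10 s of retries.
    try retryUntil(200.millis, 50.millis)(() => assertNoOpenStreams())
    catch { case e: IllegalStateException => println(s"aborted: ${e.getMessage}") }
  }
}
```

A stream that another thread closes within the window lets the check pass on a later attempt. A stream that stays open keeps failing until the deadline, which is the pattern behind the ABORTED lines above.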

These four suites are flaky on JDK 21 even in apache/spark's own CI; the upstream workaround is DEDICATED_JVM_SBT_TESTS, which asks SBT to fork a dedicated JVM per listed suite. Comet's workflow already sets that env var for the Spark 4.0 row, but it has no effect because the workflow also unconditionally sets SERIAL_SBT_TESTS=1 (added in #4285 to cap peak memory on standard runners). In project/SparkBuild.scala:

if (!sys.env.contains("SERIAL_SBT_TESTS")) {
  allProjects.foreach(enable(SparkParallelTestGrouping.settings))
}

SparkParallelTestGrouping.settings is the only consumer of DEDICATED_JVM_SBT_TESTS, so when SERIAL_SBT_TESTS is set the env var is read into an unused set and every suite runs in one shared forked JVM. Cross-suite state accumulation from ~11k prior tests is what trips the file-stream leak detection.
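The interaction between the two variables can be modelled in a few lines. This is a simplified sketch of the gating as described in this PR, not the code in project/SparkBuild.scala: GroupingModel, testGroup, the "default_test_group" label, and the short suite names are illustrative assumptions.

```scala
object GroupingModel {
  // Placeholder label for the single shared forked JVM.
  val DefaultGroup = "default_test_group"

  // Suites listed in DEDICATED_JVM_SBT_TESTS (comma-separated), or an empty set if unset.
  def dedicatedSuites(env: Map[String, String]): Set[String] =
    env.get("DEDICATED_JVM_SBT_TESTS")
      .map(_.split(",").map(_.trim).filter(_.nonEmpty).toSet)
      .getOrElse(Set.empty[String])

  // The group (forked JVM) a suite lands in. The dedicated list only matters
  // when the grouping is installed, i.e. when SERIAL_SBT_TESTS is absent.
  def testGroup(suite: String, env: Map[String, String]): String = {
    val groupingInstalled = !env.contains("SERIAL_SBT_TESTS")
    if (groupingInstalled && dedicatedSuites(env).contains(suite)) suite else DefaultGroup
  }

  def main(args: Array[String]): Unit = {
    val dedicated = "DEDICATED_JVM_SBT_TESTS" -> "ParquetFileFormatV1Suite,OrcSourceV1Suite"
    // Both variables set: the dedicated list is ignored.
    println(testGroup("ParquetFileFormatV1Suite", Map(dedicated, "SERIAL_SBT_TESTS" -> "1")))
    // prints "default_test_group"
    // SERIAL_SBT_TESTS dropped: the suite gets its own group.
    println(testGroup("ParquetFileFormatV1Suite", Map(dedicated)))
    // prints "ParquetFileFormatV1Suite"
  }
}
```

Note that the check is on the presence of SERIAL_SBT_TESTS, not its value, so in this model setting it to an empty string or 0 would still disable the grouping.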

Evidence from a failing log (run 26004020697):

  • thread-leak warning mentions readingParquetFooters-ForkJoinPool-12260-worker-1 (high pool counter = many prior suites in this JVM)
  • unrelated suites (AnalysisConfOverrideSuite, TPCDSModifiedPlanStabilityWithStatsSuite, state-store suites) appear in the same JVM right before these Parquet/ORC suites
  • no SBT fork-restart markers between unrelated suites
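For anyone checking other logs for the same signal, the pool counter can be pulled out of a thread name with a small helper. This is a sketch: PoolCounter and poolId are hypothetical names, and the name shape is assumed from the single example quoted above.

```scala
object PoolCounter {
  // Matches names shaped like "<prefix>-ForkJoinPool-<n>-worker-<m>" and captures <n>.
  private val PoolId = """.*-ForkJoinPool-(\d+)-worker-\d+""".r

  // Returns the pool counter, or None if the name has a different shape.
  def poolId(threadName: String): Option[Int] = threadName match {
    case PoolId(n) => Some(n.toInt)
    case _         => None
  }

  def main(args: Array[String]): Unit = {
    println(poolId("readingParquetFooters-ForkJoinPool-12260-worker-1")) // prints "Some(12260)"
  }
}
```

A counter in the thousands is consistent with the reading above that many suites had already run in the same JVM; a freshly forked JVM would be expected to show a much smaller value.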

What changes are included in this PR?

  • Drop SERIAL_SBT_TESTS=1 only for the Spark 4.0.2/JDK 21 matrix row, so Spark's SparkParallelTestGrouping is installed and DEDICATED_JVM_SBT_TESTS actually forks a separate JVM per listed suite. Other rows keep SERIAL_SBT_TESTS=1 so their memory profile is unchanged.
  • Override Global / concurrentRestrictions to cap parallel forked test JVMs at 1. Spark's defaults would otherwise allow up to 4 in parallel (each -Xmx2g), which would exceed the 7 GB runner budget. The cap is a no-op for rows that still have SERIAL_SBT_TESTS=1.
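The memory argument is simple arithmetic: 4 forked JVMs at -Xmx2g can claim 8 GB of heap alone, which is already over the 7 GB budget, while a cap of 1 limits forked test heap to 2 GB at a time. A cap of this kind can be expressed in sbt roughly as follows. This is a build.sbt-style sketch using sbt 1.x slash syntax, not the PR's actual diff; how the override is injected into Spark's build is not shown here.

```scala
// Allow at most one forked test group (one forked test JVM) to run at a time.
Global / concurrentRestrictions := Seq(
  Tags.limit(Tags.ForkedTestGroup, 1)
)
```

Tags.limit and Tags.ForkedTestGroup are part of sbt's standard task-tagging API. The limit only constrains tasks tagged as forked test groups, which fits the note above that it is a no-op for rows that still run everything in one shared JVM.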

How are these changes tested?

CI on this PR will exercise the new path. A successful run for spark-sql-auto-sql_core-1/spark-4.0.2-jdk21 confirms that the dedicated-JVM workaround is now in effect and the four problem suites no longer abort.

…apache#4327)

The workflow sets DEDICATED_JVM_SBT_TESTS for the V1/V2 Parquet and
Orc source suites on Spark 4.0/JDK 21, expecting Spark's
SparkParallelTestGrouping to fork a dedicated JVM per suite. But the
same workflow also unconditionally sets SERIAL_SBT_TESTS=1, and
project/SparkBuild.scala only installs SparkParallelTestGrouping when
SERIAL_SBT_TESTS is unset. As a result every suite shares a single
forked JVM, accumulated state from ~11k prior tests trips
DebugFilesystem.assertNoOpenStreams() in afterEach, and the four
problem suites are aborted with "There are 1 possibly leaked file
streams."

Drop SERIAL_SBT_TESTS only for the 4.0.2/JDK 21 matrix row so the
grouping is installed and DEDICATED_JVM_SBT_TESTS actually takes
effect. Override concurrentRestrictions to cap parallel forked test
JVMs at 1 so peak memory stays within the 7 GB standard-runner
budget (each test JVM has -Xmx2g).
Contributor

@mbutrovich mbutrovich left a comment

Worth a shot! Approved pending CI. It's so odd that of the 4.x suites it's 4.0.2 that gives grief and not 4.1.

@mbutrovich merged commit ffcebaa into apache:main on May 18, 2026
190 of 192 checks passed

Development

Successfully merging this pull request may close these issues.

Frequent CI failures for Spark 4.0.2 / JDK 21
