[SPARK-56535][BUILD] Fix CI & base image build issues #55432

Open
holdenk wants to merge 45 commits into apache:branch-3.5 from holdenk:SPARK-56535-fix-base-image-build

Conversation

@holdenk (Contributor) commented Apr 20, 2026

What changes were proposed in this pull request?

Update the base image build for the CI infra Dockerfile (dev/infra/Dockerfile) to a supported Ubuntu release, and automatically re-run apt-get update and retry when an apt-get install fails.

Why are the changes needed?

Two reasons:

  1. Ubuntu focal is EOL and we already use 22.04 directly in the GitHub Actions runners, so we need to migrate the test image to a non-EOL Ubuntu release. Right now the build fails whenever there is a cache miss, because apt-get install no longer succeeds against the EOL release.
  2. Docker layer caching means the apt-get update layer can be cached but stale, so a subsequent apt-get install can fail against package versions that have since been rotated out of the archive.
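
Concretely, the retry amounts to install steps of roughly the shape `RUN apt-get install -y <packages> || (apt-get update && apt-get install -y <packages>)`; this is an illustrative sketch rather than the exact diff, but it shows how a stale cached apt-get update layer gets refreshed before the install is retried.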

Does this PR introduce any user-facing change?

No, CI only.

How was this patch tested?

Running through CI

Was this patch authored or co-authored using generative AI tooling?

Auto-complete with Copilot was turned on, but none of its suggestions were useful except for some comments.

Claude was used to add resilient retry logic to the Docker operations in the JDBC integration test suites, to handle transient failures from Docker registries and daemons that have been flaky during the tests (added here instead of landing it in 4.x and backporting, since the classes have been rewritten in 4.x).

Claude was also used to suggest which versions to pin back for the roxygen2 issues during the build.

sfc-gh-hkarau and others added 7 commits April 17, 2026 16:01
…o that we don't get a partial cache fetch error.

Co-authored-by: Holden Karau <holden@pigscanfly.ca>
Co-authored-by: Holden Karau <holden@pigscanfly.ca>
Co-authored-by: Holden Karau <holden@pigscanfly.ca>
Co-authored-by: Holden Karau <holden@pigscanfly.ca>
Co-authored-by: Holden Karau <holden@pigscanfly.ca>
Co-authored-by: Holden Karau <holden@pigscanfly.ca>
Co-authored-by: Holden Karau <holden@pigscanfly.ca>
@holdenk force-pushed the SPARK-56535-fix-base-image-build branch from 7bb3ffe to 737dd17 on April 20, 2026 18:53
sfc-gh-hkarau and others added 5 commits April 20, 2026 12:04
Co-authored-by: Holden Karau <holden@pigscanfly.ca>
…k python packages that don't work in 3.9

Co-authored-by: Holden Karau <holden@pigscanfly.ca>
Co-authored-by: Holden Karau <holden@pigscanfly.ca>
…it and building from src fails

Co-authored-by: Holden Karau <holden@pigscanfly.ca>
Co-authored-by: Holden Karau <holden@pigscanfly.ca>
@holdenk changed the title from [WIP][SPARK-56535][BUILD] Fix base image build to [SPARK-56535][BUILD] Fix base image build on May 1, 2026
@holdenk marked this pull request as ready for review on May 1, 2026 18:53
@holdenk (Contributor, Author) commented May 1, 2026

CC @devin-petersohn, who probably has a good handle on old versions of Python: does this look reasonable-ish?

Comment thread on dev/infra/Dockerfile:

 # Image for building and testing Spark branches. Based on Ubuntu 22.04.
 # See also in https://hub.docker.com/_/ubuntu
-FROM ubuntu:focal-20221019
+FROM ubuntu:jammy
Contributor commented:

Should we pin this?

holdenk (Contributor, Author) replied:

I was going back and forth on this; given we do an apt-get update anyway, I personally think pinning it is actually counterproductive.

Co-authored-by: Holden Karau <holden@pigscanfly.ca>
@holdenk force-pushed the SPARK-56535-fix-base-image-build branch from 9abfefa to 6813916 on May 2, 2026 07:01
@holdenk changed the title from [SPARK-56535][BUILD] Fix base image build to [SPARK-56535][BUILD] Fix CI & base image build issues on May 6, 2026
sfc-gh-hkarau and others added 12 commits May 6, 2026 13:19
Docker Hub occasionally returns transient 5xx responses (e.g. 502 Bad
Gateway on manifest HEAD requests), which currently aborts suites like
PostgresKrbIntegrationSuite. Wrap the pull/inspect calls in an
exponential-backoff retry so flaky GitHub CI runs survive these blips.

https://claude.ai/code/session_01Py9jZBMMdNaCBJ4vvHc3kd

Co-authored-by: Claude <noreply@anthropic.com>
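
For readers unfamiliar with the pattern, an exponential-backoff retry looks roughly like the sketch below. The actual change lives in the Scala JDBC integration test suites; this Python version only illustrates the general shape, and the names (retry_with_backoff, max_attempts, base_delay) are made up for the example rather than taken from the diff.

    import random
    import time

    def retry_with_backoff(operation, max_attempts=4, base_delay=1.0):
        # Hypothetical helper: run `operation`, retrying with exponential backoff.
        # The real suite limits retries to transient Docker registry/daemon errors.
        for attempt in range(1, max_attempts + 1):
            try:
                return operation()
            except Exception as error:
                if attempt == max_attempts:
                    raise
                # 1s, 2s, 4s, ... plus a little jitter so parallel suites don't sync up.
                delay = base_delay * (2 ** (attempt - 1)) + random.uniform(0, 0.5)
                print(f"Attempt {attempt} failed ({error}); retrying in {delay:.1f}s")
                time.sleep(delay)

    # Usage sketch: wrap a flaky call such as an image pull or inspect.
    # retry_with_backoff(lambda: docker_client.pull("postgres:17"))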
mypy 0.991 (pinned for Python 3.8/3.9 support on this branch) crashes
during cache serialization on pydantic v2's recursive JsonValue type.
pydantic isn't a direct PySpark dep but gets pulled in transitively via
mlflow in the lint env. follow_imports = skip prevents mypy from
analyzing pydantic at all, sidestepping the assertion.
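(In mypy configuration terms this is a per-module override, e.g. a `[mypy-pydantic.*]` section containing `follow_imports = skip`; the exact section spelling in this branch's mypy config isn't shown here, so treat that as an assumption.)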

Co-authored-by: Claude <noreply@anthropic.com>
* Workaround roxygen2 'cannot set an attribute on a builtin' in create-rd.sh

When roxygen2 processes @family members for topics like dim.Rd, it
calls add_s3_metadata to mark s3 generics. For SparkR, the lookup
resolves to base R primitives (dim, nrow, ncol, ifelse, ...) that
SparkR registers S4 methods for. R disallows setting attributes on
builtins, so `class(val) <- c("s3generic", "function")` aborts with
"cannot set an attribute on a 'builtin'", failing the whole Rd build.

Monkey-patch roxygen2's internal add_s3_metadata in create-rd.sh to
swallow that specific error and return the primitive unchanged, so
documentation generation can proceed regardless of the installed
roxygen2 version.

* Skip cleanClosure for primitive functions in SparkR

When SparkR's RDD machinery wraps a user closure, cleanClosure() walks
the closure and calls environment(func) <- newEnv. For primitive
functions like `+`, `max`, `min`, recent R versions raise the warning
"setting environment(<primitive function>) is not possible and trying
it is deprecated", which can be promoted to an error and breaks
reduce/reduceByKey-style RDD ops (test_rdd.R count by values, maximum,
minimum).

Primitives have no R-level closure to clean, so return them unchanged.

---------

Co-authored-by: Claude <noreply@anthropic.com>
@aajisaka (Member) commented:

Thank you @holdenk. I didn't notice this PR while I was working on fixing the branch-3.5 build in #55764. Now I don't think it's useful, but just FYI.

holdenk and others added 2 commits May 11, 2026 15:26
* Fix remaining CI issues

- Correct version spec typo in dev/requirements.txt (=> -> >=) so pip
  can parse pyarrow and pandas constraints. The previous form aborted
  every pyspark-* build during the install step.
- Restore the roxygen2 add_s3_metadata workaround in R/create-rd.sh
  and the primitive-function bypass in R/pkg/R/utils.R. The linter
  job still calls R/install-dev.sh, so the SparkR-disable change left
  it failing on the original "cannot set an attribute on a 'builtin'"
  error.
- Use snake_case kwargs (error_class/message_parameters) in
  PySparkValueError raised from UserDefinedType.fromJson; the
  PySparkException constructor only accepts the snake_case names.
- Re-add # type: ignore[attr-defined] on typing.GenericAlias and
  typing._GenericAlias references so mypy 0.982 stops failing on
  attributes the stubs do not expose.

https://claude.ai/code/session_01Rd1fWuMdJ8seM8WknhbkxE
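
As a small illustration of the snake_case keyword fix above, the call shape is roughly the following; the error class and parameters are placeholders, not the values from the actual UserDefinedType.fromJson diff.

    from pyspark.errors import PySparkValueError

    # Placeholder error class and parameters, for illustration only;
    # the point is that the keywords must be snake_case.
    raise PySparkValueError(
        error_class="UNSUPPORTED_DATA_TYPE",
        message_parameters={"data_type": "example"},
    )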

* Drop jinja2<3.0.0 / markupsafe pins from requirements.txt

pandas>=1.3 requires jinja2>=3.0.0 for Styler. With jinja2<3.0.0
pinned in requirements.txt, the pyspark-pandas tests aborted with
"Pandas requires version '3.0.0' or newer of 'jinja2'" as soon as
they touched DataFrame.style.

The Sphinx<3.1.0 doc build still pins jinja2<3.0.0 and
markupsafe==2.0.1 in its own pip install step, so the docs job is
unaffected.

https://claude.ai/code/session_01Rd1fWuMdJ8seM8WknhbkxE

* Stop installing dev/requirements.txt for python3.8 in CI

python/run-tests.py was already changed to skip python3.8, so the
matching install step in build_and_test.yml is dead weight. Worse,
many packages in requirements.txt have dropped 3.8 support (mlflow
3.x, torch 2.1+, numpy 2.x, etc.), so pip's resolver burns minutes
backtracking and ultimately fails with exit code 1, taking the
whole pyspark-* matrix down.

Drop the python3.8 -m pip install line (and its companion pip list)
so the install step only runs for python3.9.

https://claude.ai/code/session_01Rd1fWuMdJ8seM8WknhbkxE

* Revert the GenericAlias # type: ignore in typehints.py

The ignores I added in 5dc31f4 are correct under mypy with
Python 3.11 semantics, but CI runs mypy under python3.9. There,
sys.version_info < (3, 11) is statically True, so mypy treats the
else branch as unreachable and never checks the typing.GenericAlias
/ typing._GenericAlias accesses. With warn_unused_ignores=True, the
ignore comments themselves become the lint error.

Drop them. The branch they protect is dead code from mypy 0.982's
perspective on Python 3.9.

https://claude.ai/code/session_01Rd1fWuMdJ8seM8WknhbkxE

* Make typehints.py's GenericAlias ignores tolerant of both mypy modes

The typing.GenericAlias / typing._GenericAlias accesses sit inside a
sys.version_info >= (3, 11) branch. Mypy under python_version 3.11+
needs the # type: ignore[attr-defined] (the stubs don't expose those
attrs); mypy under python_version 3.9 / 3.10 considers the branch
unreachable and flags the same ignores as unused.

Re-add the ignores and silence warn_unused_ignores for
pyspark.pandas.typedef.typehints so both modes pass. Locally
verified against --python-version 3.9 and --python-version 3.11.

https://claude.ai/code/session_01Rd1fWuMdJ8seM8WknhbkxE

* Resolve typing.GenericAlias via getattr to dodge mypy attr-defined

The static typing.GenericAlias / typing._GenericAlias accesses tripped
mypy across multiple Python versions:
  * python_version 3.11+ stubs don't expose them, so an
    attr-defined error fires.
  * python_version 3.9 / 3.10 treats the else branch as unreachable, so
    any # type: ignore[attr-defined] on those lines becomes "unused".

Resolve both classes through getattr with a safe fallback to type(None)
so mypy never sees the static attribute access, while the runtime on
Python 3.11+ still binds the real classes. Drop the previous
[mypy-pyspark.pandas.typedef.typehints] override that was tracking the
warn_unused_ignores half of this dance.

https://claude.ai/code/session_01Rd1fWuMdJ8seM8WknhbkxE
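
The getattr-based resolution described above boils down to something like the following sketch; the surrounding code in pyspark/pandas/typedef/typehints.py is more involved, so take the names and structure here as illustrative.

    import sys
    import typing

    if sys.version_info >= (3, 11):
        # Dynamic lookup, so mypy never sees a static attribute access on
        # names the typeshed stubs do not expose.
        GenericAlias = getattr(typing, "GenericAlias", type(None))
        _GenericAlias = getattr(typing, "_GenericAlias", type(None))
    else:
        # Older interpreters: bind a harmless placeholder class instead.
        GenericAlias = type(None)
        _GenericAlias = type(None)

    # Later checks can then use isinstance(tpe, (GenericAlias, _GenericAlias))
    # without tripping attr-defined or unused-ignore errors in either mypy mode.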

* Skip mypy follow-imports for sqlalchemy

mypy 0.982 raises INTERNAL ERROR while analyzing
sqlalchemy/engine/default.py:334, which it sees because sqlalchemy is
pulled in transitively (e.g. via mlflow). Skip following sqlalchemy
imports, mirroring the existing pydantic carve-out.

CI lint log:
  starting mypy annotations test...
  annotations failed mypy checks:
  /usr/local/lib/python3.9/dist-packages/sqlalchemy/engine/default.py:334:
    error: INTERNAL ERROR -- Please try using mypy master on GitHub
  version: 0.982

https://claude.ai/code/session_01Rd1fWuMdJ8seM8WknhbkxE

* Fix mypy data test failures

CI lint log identified 10 failures in pytest-mypy-plugins (mypy data
tests):

1. Cascade: many tests (test_session, test_udf, test_rdd) saw an
   unexpected output line:
     pyspark/sql/functions:71: error:
       Cannot determine type of "has_numpy"  [has-type]
   The "has_numpy = False" + try/import/has_numpy = True idiom in
   pyspark/sql/utils.py left mypy unable to infer has_numpy's type
   when functions.py imports it. Annotate has_numpy: bool = False so
   mypy can resolve it and the cascade goes away.

2. mypy 0.982 output drift on overload notes:
   - test_feature.yml::stringIndexerOverloads expected
     "def StringIndexer(self, ...)" in the overload notes, but
     mypy 0.982 prints "def __init__(self, ...)". Update the
     expected output.
   - test_functions.yml::varargFunctionsOverloads expected
     "def [ColumnOrName_] array(Union[...])" but mypy 0.982 includes
     the positional-only "__cols" param name. Update each note.

https://claude.ai/code/session_01Rd1fWuMdJ8seM8WknhbkxE
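
The has_numpy idiom and the annotation fix from point 1 read roughly like this minimal sketch (the real code is in pyspark/sql/utils.py; this shows the shape, not the exact diff):

    # Annotated upfront so mypy can infer the type even though the only
    # other assignment happens inside the try block below.
    has_numpy: bool = False
    try:
        import numpy  # noqa: F401
        has_numpy = True
    except ImportError:
        pass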

* Always skip R API doc generation in the docs build

SparkR is disabled in this fork's CI (precondition forces
sparkr=false), but the docs build still tried to run the SparkR API
doc generation because dev/is-changed.py -m sparkr returns true
whenever the branch touches R/* files (it did, for the roxygen2
workaround). create-docs.sh then tries to load pkgdown via SparkR,
and the R session has no pkgdown installed in its library path:

  Error in loadNamespace(x) : there is no package called 'pkgdown'
  R doc generation failed (RuntimeError)
  jekyll build aborts.

Force SKIP_RDOC=1 unconditionally so jekyll skips the R API doc
step. PySpark doc skipping stays gated on is-changed.py as before.

https://claude.ai/code/session_01Rd1fWuMdJ8seM8WknhbkxE

---------

Co-authored-by: Claude <noreply@anthropic.com>
* Fix processClosure failure on primitive functions in SparkR

`environment()` returns NULL for primitives, so `parent.env(environment(func))`
errored with "argument is not an environment" in test_rdd.R's "count by values
and keys". Skip the parent-env check for primitives — identity is sufficient —
and drop the redundant `ifelse(cond, TRUE, FALSE)`, which also dispatched
through SparkR's S4 ifelse generic.

https://claude.ai/code/session_01USAV7aq8egCyxnpD2rDaSC

* Always assign captured function to cleaned closure env

The earlier primitive fix exposed a latent bug: when `found` was true,
`break` skipped `assign(nodeChar, obj, envir = newEnv)`, so the captured
function was never added to the cleaned closure. This was harmless before
because the parent-env equality check rarely matched non-primitive
closures, but for primitives like `+` shared across closures it does
match and caused 'object combineFunc not found' on workers in
test_rdd.R countByValue. Skip only the recursive cleanClosure work when
already examined; always emit the binding into newEnv.

https://claude.ai/code/session_01USAV7aq8egCyxnpD2rDaSC

---------

Co-authored-by: Claude <noreply@anthropic.com>