[SPARK-56535][BUILD] Fix CI & base image build issues #55432

Open
holdenk wants to merge 45 commits into apache:branch-3.5 from holdenk:SPARK-56535-fix-base-image-build

Conversation

@holdenk (Contributor) commented Apr 20, 2026

What changes were proposed in this pull request?

Update the base image build for the CI infra Dockerfile (dev/infra/Dockerfile) to a supported Ubuntu release, and automatically re-run apt-get update and retry when an apt-get install fails.

Why are the changes needed?

Two reasons:

  1. Ubuntu focal is EOL and we already use 22.04 directly in the GitHub Actions runners, so we need to migrate the test image to a non-EOL Ubuntu release. Right now the build fails whenever there is a cache miss, because apt-get install no longer succeeds against the EOL release.
  2. Docker layer caching means the apt-get update layer can be cached but stale, so a subsequent apt-get install can fail against package versions that have since been rotated out of the archive.
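
Concretely, the retry amounts to install steps of roughly the shape `RUN apt-get install -y <packages> || (apt-get update && apt-get install -y <packages>)`; this is an illustrative sketch rather than the exact diff, but it shows how a stale cached apt-get update layer gets refreshed before the install is retried.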

Does this PR introduce any user-facing change?

No, CI only.

How was this patch tested?

Running through CI

Was this patch authored or co-authored using generative AI tooling?

Auto-complete with Copilot was turned on, but none of its suggestions were useful except for some comments.

Claude was used to add resilient retry logic to the Docker operations in the JDBC integration test suites, to handle transient failures from Docker registries and daemons that have been flaky during the tests (added here instead of landing it in 4.x and backporting, since the classes have been rewritten in 4.x).

Claude was also used to suggest which versions to pin back for the roxygen2 issues during the build.

sfc-gh-hkarau and others added 7 commits April 17, 2026 16:01
…o that we don't get a partial cache fetch error.

Co-authored-by: Holden Karau <holden@pigscanfly.ca>
Co-authored-by: Holden Karau <holden@pigscanfly.ca>
Co-authored-by: Holden Karau <holden@pigscanfly.ca>
Co-authored-by: Holden Karau <holden@pigscanfly.ca>
Co-authored-by: Holden Karau <holden@pigscanfly.ca>
Co-authored-by: Holden Karau <holden@pigscanfly.ca>
Co-authored-by: Holden Karau <holden@pigscanfly.ca>
@holdenk force-pushed the SPARK-56535-fix-base-image-build branch from 7bb3ffe to 737dd17 on April 20, 2026 18:53
sfc-gh-hkarau and others added 5 commits April 20, 2026 12:04
Co-authored-by: Holden Karau <holden@pigscanfly.ca>
…k python packages that don't work in 3.9

Co-authored-by: Holden Karau <holden@pigscanfly.ca>
Co-authored-by: Holden Karau <holden@pigscanfly.ca>
…it and building from src fails

Co-authored-by: Holden Karau <holden@pigscanfly.ca>
Co-authored-by: Holden Karau <holden@pigscanfly.ca>
@holdenk changed the title from [WIP][SPARK-56535][BUILD] Fix base image build to [SPARK-56535][BUILD] Fix base image build on May 1, 2026
@holdenk marked this pull request as ready for review on May 1, 2026 18:53
@holdenk (Contributor, Author) commented May 1, 2026

CC @devin-petersohn, who probably has a good handle on old versions of Python: does this look reasonable-ish?

Comment thread on dev/infra/Dockerfile:

 # Image for building and testing Spark branches. Based on Ubuntu 22.04.
 # See also in https://hub.docker.com/_/ubuntu
-FROM ubuntu:focal-20221019
+FROM ubuntu:jammy
Contributor commented:

Should we pin this?

holdenk (Contributor, Author) replied:

I was going back and forth on this; given we do an apt-get update anyway, I personally think pinning it is actually counterproductive.

Co-authored-by: Holden Karau <holden@pigscanfly.ca>
@holdenk force-pushed the SPARK-56535-fix-base-image-build branch from 9abfefa to 6813916 on May 2, 2026 07:01
@holdenk changed the title from [SPARK-56535][BUILD] Fix base image build to [SPARK-56535][BUILD] Fix CI & base image build issues on May 6, 2026
sfc-gh-hkarau and others added 12 commits May 6, 2026 13:19
Docker Hub occasionally returns transient 5xx responses (e.g. 502 Bad
Gateway on manifest HEAD requests), which currently aborts suites like
PostgresKrbIntegrationSuite. Wrap the pull/inspect calls in an
exponential-backoff retry so flaky GitHub CI runs survive these blips.

https://claude.ai/code/session_01Py9jZBMMdNaCBJ4vvHc3kd

Co-authored-by: Claude <noreply@anthropic.com>
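
For readers unfamiliar with the pattern, an exponential-backoff retry looks roughly like the sketch below. The actual change lives in the Scala JDBC integration test suites; this Python version only illustrates the general shape, and the names (retry_with_backoff, max_attempts, base_delay) are made up for the example rather than taken from the diff.

    import random
    import time

    def retry_with_backoff(operation, max_attempts=4, base_delay=1.0):
        # Hypothetical helper: run `operation`, retrying with exponential backoff.
        # The real suite limits retries to transient Docker registry/daemon errors.
        for attempt in range(1, max_attempts + 1):
            try:
                return operation()
            except Exception as error:
                if attempt == max_attempts:
                    raise
                # 1s, 2s, 4s, ... plus a little jitter so parallel suites don't sync up.
                delay = base_delay * (2 ** (attempt - 1)) + random.uniform(0, 0.5)
                print(f"Attempt {attempt} failed ({error}); retrying in {delay:.1f}s")
                time.sleep(delay)

    # Usage sketch: wrap a flaky call such as an image pull or inspect.
    # retry_with_backoff(lambda: docker_client.pull("postgres:17"))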
mypy 0.991 (pinned for Python 3.8/3.9 support on this branch) crashes
during cache serialization on pydantic v2's recursive JsonValue type.
pydantic isn't a direct PySpark dep but gets pulled in transitively via
mlflow in the lint env. follow_imports = skip prevents mypy from
analyzing pydantic at all, sidestepping the assertion.
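(In mypy configuration terms this is a per-module override, e.g. a `[mypy-pydantic.*]` section containing `follow_imports = skip`; the exact section spelling in this branch's mypy config isn't shown here, so treat that as an assumption.)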

Co-authored-by: Claude <noreply@anthropic.com>
* Workaround roxygen2 'cannot set an attribute on a builtin' in create-rd.sh

When roxygen2 processes @family members for topics like dim.Rd, it
calls add_s3_metadata to mark s3 generics. For SparkR, the lookup
resolves to base R primitives (dim, nrow, ncol, ifelse, ...) that
SparkR registers S4 methods for. R disallows setting attributes on
builtins, so `class(val) <- c("s3generic", "function")` aborts with
"cannot set an attribute on a 'builtin'", failing the whole Rd build.

Monkey-patch roxygen2's internal add_s3_metadata in create-rd.sh to
swallow that specific error and return the primitive unchanged, so
documentation generation can proceed regardless of the installed
roxygen2 version.

* Skip cleanClosure for primitive functions in SparkR

When SparkR's RDD machinery wraps a user closure, cleanClosure() walks
the closure and calls environment(func) <- newEnv. For primitive
functions like `+`, `max`, `min`, recent R versions raise the warning
"setting environment(<primitive function>) is not possible and trying
it is deprecated", which can be promoted to an error and breaks
reduce/reduceByKey-style RDD ops (test_rdd.R count by values, maximum,
minimum).

Primitives have no R-level closure to clean, so return them unchanged.

---------

Co-authored-by: Claude <noreply@anthropic.com>
@aajisaka (Member) commented:

Thank you @holdenk. I didn't notice this PR while I was working on fixing the branch-3.5 build in #55764. Now I don't think it's useful, but just FYI.

holdenk and others added 2 commits May 11, 2026 15:26
* Fix remaining CI issues

- Correct version spec typo in dev/requirements.txt (=> -> >=) so pip
  can parse pyarrow and pandas constraints. The previous form aborted
  every pyspark-* build during the install step.
- Restore the roxygen2 add_s3_metadata workaround in R/create-rd.sh
  and the primitive-function bypass in R/pkg/R/utils.R. The linter
  job still calls R/install-dev.sh, so the SparkR-disable change left
  it failing on the original "cannot set an attribute on a 'builtin'"
  error.
- Use snake_case kwargs (error_class/message_parameters) in
  PySparkValueError raised from UserDefinedType.fromJson; the
  PySparkException constructor only accepts the snake_case names.
- Re-add # type: ignore[attr-defined] on typing.GenericAlias and
  typing._GenericAlias references so mypy 0.982 stops failing on
  attributes the stubs do not expose.

https://claude.ai/code/session_01Rd1fWuMdJ8seM8WknhbkxE
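
As a small illustration of the snake_case keyword fix above, the call shape is roughly the following; the error class and parameters are placeholders, not the values from the actual UserDefinedType.fromJson diff.

    from pyspark.errors import PySparkValueError

    # Placeholder error class and parameters, for illustration only;
    # the point is that the keywords must be snake_case.
    raise PySparkValueError(
        error_class="UNSUPPORTED_DATA_TYPE",
        message_parameters={"data_type": "example"},
    )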

* Drop jinja2<3.0.0 / markupsafe pins from requirements.txt

pandas>=1.3 requires jinja2>=3.0.0 for Styler. With jinja2<3.0.0
pinned in requirements.txt, the pyspark-pandas tests aborted with
"Pandas requires version '3.0.0' or newer of 'jinja2'" as soon as
they touched DataFrame.style.

The Sphinx<3.1.0 doc build still pins jinja2<3.0.0 and
markupsafe==2.0.1 in its own pip install step, so the docs job is
unaffected.

https://claude.ai/code/session_01Rd1fWuMdJ8seM8WknhbkxE

* Stop installing dev/requirements.txt for python3.8 in CI

python/run-tests.py was already changed to skip python3.8, so the
matching install step in build_and_test.yml is dead weight. Worse,
many packages in requirements.txt have dropped 3.8 support (mlflow
3.x, torch 2.1+, numpy 2.x, etc.), so pip's resolver burns minutes
backtracking and ultimately fails with exit code 1, taking the
whole pyspark-* matrix down.

Drop the python3.8 -m pip install line (and its companion pip list)
so the install step only runs for python3.9.

https://claude.ai/code/session_01Rd1fWuMdJ8seM8WknhbkxE

* Revert the GenericAlias # type: ignore in typehints.py

The ignores I added in 5dc31f4 are correct under mypy with
Python 3.11 semantics, but CI runs mypy under python3.9. There,
sys.version_info < (3, 11) is statically True, so mypy treats the
else branch as unreachable and never checks the typing.GenericAlias
/ typing._GenericAlias accesses. With warn_unused_ignores=True, the
ignore comments themselves become the lint error.

Drop them. The branch they protect is dead code from mypy 0.982's
perspective on Python 3.9.

https://claude.ai/code/session_01Rd1fWuMdJ8seM8WknhbkxE

* Make typehints.py's GenericAlias ignores tolerant of both mypy modes

The typing.GenericAlias / typing._GenericAlias accesses sit inside a
sys.version_info >= (3, 11) branch. Mypy under python_version 3.11+
needs the # type: ignore[attr-defined] (the stubs don't expose those
attrs); mypy under python_version 3.9 / 3.10 considers the branch
unreachable and flags the same ignores as unused.

Re-add the ignores and silence warn_unused_ignores for
pyspark.pandas.typedef.typehints so both modes pass. Locally
verified against --python-version 3.9 and --python-version 3.11.

https://claude.ai/code/session_01Rd1fWuMdJ8seM8WknhbkxE

* Resolve typing.GenericAlias via getattr to dodge mypy attr-defined

The static typing.GenericAlias / typing._GenericAlias accesses tripped
mypy across multiple Python versions:
  * python_version 3.11+ stubs don't expose them, so an
    attr-defined error fires.
  * python_version 3.9 / 3.10 treats the else branch as unreachable, so
    any # type: ignore[attr-defined] on those lines becomes "unused".

Resolve both classes through getattr with a safe fallback to type(None)
so mypy never sees the static attribute access, while the runtime on
Python 3.11+ still binds the real classes. Drop the previous
[mypy-pyspark.pandas.typedef.typehints] override that was tracking the
warn_unused_ignores half of this dance.

https://claude.ai/code/session_01Rd1fWuMdJ8seM8WknhbkxE
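
The getattr-based resolution described above boils down to something like the following sketch; the surrounding code in pyspark/pandas/typedef/typehints.py is more involved, so take the names and structure here as illustrative.

    import sys
    import typing

    if sys.version_info >= (3, 11):
        # Dynamic lookup, so mypy never sees a static attribute access on
        # names the typeshed stubs do not expose.
        GenericAlias = getattr(typing, "GenericAlias", type(None))
        _GenericAlias = getattr(typing, "_GenericAlias", type(None))
    else:
        # Older interpreters: bind a harmless placeholder class instead.
        GenericAlias = type(None)
        _GenericAlias = type(None)

    # Later checks can then use isinstance(tpe, (GenericAlias, _GenericAlias))
    # without tripping attr-defined or unused-ignore errors in either mypy mode.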

* Skip mypy follow-imports for sqlalchemy

mypy 0.982 raises INTERNAL ERROR while analyzing
sqlalchemy/engine/default.py:334, which it sees because sqlalchemy is
pulled in transitively (e.g. via mlflow). Skip following sqlalchemy
imports, mirroring the existing pydantic carve-out.

CI lint log:
  starting mypy annotations test...
  annotations failed mypy checks:
  /usr/local/lib/python3.9/dist-packages/sqlalchemy/engine/default.py:334:
    error: INTERNAL ERROR -- Please try using mypy master on GitHub
  version: 0.982

https://claude.ai/code/session_01Rd1fWuMdJ8seM8WknhbkxE

* Fix mypy data test failures

CI lint log identified 10 failures in pytest-mypy-plugins (mypy data
tests):

1. Cascade: many tests (test_session, test_udf, test_rdd) saw an
   unexpected output line:
     pyspark/sql/functions:71: error:
       Cannot determine type of "has_numpy"  [has-type]
   The "has_numpy = False" + try/import/has_numpy = True idiom in
   pyspark/sql/utils.py left mypy unable to infer has_numpy's type
   when functions.py imports it. Annotate has_numpy: bool = False so
   mypy can resolve it and the cascade goes away.

2. mypy 0.982 output drift on overload notes:
   - test_feature.yml::stringIndexerOverloads expected
     "def StringIndexer(self, ...)" in the overload notes, but
     mypy 0.982 prints "def __init__(self, ...)". Update the
     expected output.
   - test_functions.yml::varargFunctionsOverloads expected
     "def [ColumnOrName_] array(Union[...])" but mypy 0.982 includes
     the positional-only "__cols" param name. Update each note.

https://claude.ai/code/session_01Rd1fWuMdJ8seM8WknhbkxE
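
The has_numpy idiom and the annotation fix from point 1 read roughly like this minimal sketch (the real code is in pyspark/sql/utils.py; this shows the shape, not the exact diff):

    # Annotated upfront so mypy can infer the type even though the only
    # other assignment happens inside the try block below.
    has_numpy: bool = False
    try:
        import numpy  # noqa: F401
        has_numpy = True
    except ImportError:
        pass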

* Always skip R API doc generation in the docs build

SparkR is disabled in this fork's CI (precondition forces
sparkr=false), but the docs build still tried to run the SparkR API
doc generation because dev/is-changed.py -m sparkr returns true
whenever the branch touches R/* files (it did, for the roxygen2
workaround). create-docs.sh then tries to load pkgdown via SparkR,
and the R session has no pkgdown installed in its library path:

  Error in loadNamespace(x) : there is no package called 'pkgdown'
  R doc generation failed (RuntimeError)
  jekyll build aborts.

Force SKIP_RDOC=1 unconditionally so jekyll skips the R API doc
step. PySpark doc skipping stays gated on is-changed.py as before.

https://claude.ai/code/session_01Rd1fWuMdJ8seM8WknhbkxE

---------

Co-authored-by: Claude <noreply@anthropic.com>
* Fix processClosure failure on primitive functions in SparkR

`environment()` returns NULL for primitives, so `parent.env(environment(func))`
errored with "argument is not an environment" in test_rdd.R's "count by values
and keys". Skip the parent-env check for primitives — identity is sufficient —
and drop the redundant `ifelse(cond, TRUE, FALSE)`, which also dispatched
through SparkR's S4 ifelse generic.

https://claude.ai/code/session_01USAV7aq8egCyxnpD2rDaSC

* Always assign captured function to cleaned closure env

The earlier primitive fix exposed a latent bug: when `found` was true,
`break` skipped `assign(nodeChar, obj, envir = newEnv)`, so the captured
function was never added to the cleaned closure. This was harmless before
because the parent-env equality check rarely matched non-primitive
closures, but for primitives like `+` shared across closures it does
match and caused 'object combineFunc not found' on workers in
test_rdd.R countByValue. Skip only the recursive cleanClosure work when
already examined; always emit the binding into newEnv.

https://claude.ai/code/session_01USAV7aq8egCyxnpD2rDaSC

---------

Co-authored-by: Claude <noreply@anthropic.com>