[SPARK-56535][BUILD] Fix CI & base image build issues #55432
Open
holdenk wants to merge 45 commits into
Conversation
…o that we don't get a partial cache fetch error.
Co-authored-by: Holden Karau <holden@pigscanfly.ca>
…k python packages that don't work in 3.9
…it and building from src fails
holdenk (Contributor, Author): CC @devin-petersohn, who's probably got a good handle on old versions of Python. Does this look reasonable-ish?
```diff
 # Image for building and testing Spark branches. Based on Ubuntu 22.04.
 # See also in https://hub.docker.com/_/ubuntu
-FROM ubuntu:focal-20221019
+FROM ubuntu:jammy
```
Contributor: Should we pin this?
holdenk (Contributor, Author): I was going back and forth on this. Given we do an apt-get update anyway, I personally think pinning it is actually counterproductive.
…hich sort of comes under the WTF view of package management, so let's do more explicit installs, and build up in such a way that the install actually works.
…ute on a 'builtin'
Docker Hub occasionally returns transient 5xx responses (e.g. 502 Bad Gateway on manifest HEAD requests), which currently aborts suites like PostgresKrbIntegrationSuite. Wrap the pull/inspect calls in an exponential-backoff retry so flaky GitHub CI runs survive these blips. https://claude.ai/code/session_01Py9jZBMMdNaCBJ4vvHc3kd Co-authored-by: Claude <noreply@anthropic.com>
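The change itself wraps the Scala docker-java pull/inspect calls, but the retry shape is simple. A minimal Python sketch of the exponential-backoff pattern; the helper name, attempt count, and exception types below are illustrative assumptions, not the suite's real API:

```python
import random
import time

def with_retries(op, attempts=4, base_delay=1.0, retryable=(IOError,)):
    # Run op(), retrying transient failures (e.g. a 502 on a manifest
    # HEAD request) with exponential backoff.
    for attempt in range(attempts):
        try:
            return op()
        except retryable:
            if attempt == attempts - 1:
                raise  # out of retries: surface the original error
            # Back off 1s, 2s, 4s, ... plus jitter so parallel CI jobs
            # don't hammer the registry in lockstep.
            time.sleep(base_delay * 2 ** attempt + random.uniform(0, 0.5))
```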
mypy 0.991 (pinned for Python 3.8/3.9 support on this branch) crashes during cache serialization on pydantic v2's recursive JsonValue type. pydantic isn't a direct PySpark dependency but gets pulled in transitively via mlflow in the lint env. Setting follow_imports = skip for pydantic prevents mypy from analyzing it at all, sidestepping the assertion.
Co-authored-by: Claude <noreply@anthropic.com>
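The carve-out itself is a one-stanza mypy config change. A sketch of the kind of stanza meant, using mypy's standard per-module section syntax; the exact config file and module pattern on this branch are assumptions:

```ini
# Tell mypy not to analyze pydantic at all; imports of it resolve to Any,
# which sidesteps the cache-serialization crash.
[mypy-pydantic.*]
follow_imports = skip
```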
* Workaround roxygen2 'cannot set an attribute on a builtin' in create-rd.sh

When roxygen2 processes @family members for topics like dim.Rd, it calls add_s3_metadata to mark S3 generics. For SparkR, the lookup resolves to base R primitives (dim, nrow, ncol, ifelse, ...) that SparkR registers S4 methods for. R disallows setting attributes on builtins, so `class(val) <- c("s3generic", "function")` aborts with "cannot set an attribute on a 'builtin'", failing the whole Rd build. Monkey-patch roxygen2's internal add_s3_metadata in create-rd.sh to swallow that specific error and return the primitive unchanged, so documentation generation can proceed regardless of the installed roxygen2 version (the pattern is sketched below).

* Skip cleanClosure for primitive functions in SparkR

When SparkR's RDD machinery wraps a user closure, cleanClosure() walks the closure and calls environment(func) <- newEnv. For primitive functions like `+`, `max`, `min`, recent R versions raise the warning "setting environment(<primitive function>) is not possible and trying it is deprecated", which can be promoted to an error and breaks reduce/reduceByKey-style RDD ops (test_rdd.R count by values, maximum, minimum). Primitives have no R-level closure to clean, so return them unchanged.

Co-authored-by: Claude <noreply@anthropic.com>
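The real patch is R code inlined in create-rd.sh, but the shape of the workaround is language-neutral: wrap a library internal so one specific error is swallowed and the input comes back unchanged. A hypothetical Python sketch of that pattern; every name below is an illustrative stand-in, not roxygen2's actual API:

```python
# Stand-in for the library internal that chokes on primitives.
def add_s3_metadata(val):
    raise TypeError("cannot set an attribute on a 'builtin'")

_original = add_s3_metadata

def _patched_add_s3_metadata(val):
    try:
        return _original(val)
    except TypeError as e:
        if "cannot set an attribute" in str(e):
            return val  # leave the primitive untouched; keep generating docs
        raise  # any other error still propagates

add_s3_metadata = _patched_add_s3_metadata
print(add_s3_metadata("dim"))  # proceeds instead of aborting the Rd build
```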
… for release we'll use conda anyway
* Fix remaining CI issues

- Correct a version spec typo in dev/requirements.txt (`=>` -> `>=`) so pip can parse the pyarrow and pandas constraints. The previous form aborted every pyspark-* build during the install step.
- Restore the roxygen2 add_s3_metadata workaround in R/create-rd.sh and the primitive-function bypass in R/pkg/R/utils.R. The linter job still calls R/install-dev.sh, so the SparkR-disable change left it failing on the original "cannot set an attribute on a 'builtin'" error.
- Use snake_case kwargs (error_class/message_parameters) in the PySparkValueError raised from UserDefinedType.fromJson; the PySparkException constructor only accepts the snake_case names.
- Re-add # type: ignore[attr-defined] on the typing.GenericAlias and typing._GenericAlias references so mypy 0.982 stops failing on attributes the stubs do not expose.

* Drop the jinja2<3.0.0 / markupsafe pins from requirements.txt

pandas>=1.3 requires jinja2>=3.0.0 for Styler. With jinja2<3.0.0 pinned in requirements.txt, the pyspark-pandas tests aborted with "Pandas requires version '3.0.0' or newer of 'jinja2'" as soon as they touched DataFrame.style. The Sphinx<3.1.0 doc build still pins jinja2<3.0.0 and markupsafe==2.0.1 in its own pip install step, so the docs job is unaffected.

* Stop installing dev/requirements.txt for python3.8 in CI

python/run-tests.py was already changed to skip python3.8, so the matching install step in build_and_test.yml is dead weight. Worse, many packages in requirements.txt have dropped 3.8 support (mlflow 3.x, torch 2.1+, numpy 2.x, etc.), so pip's resolver burns minutes backtracking and ultimately fails with exit code 1, taking the whole pyspark-* matrix down. Drop the python3.8 -m pip install line (and its companion pip list) so the install step only runs for python3.9.

* Revert the GenericAlias # type: ignore in typehints.py

The ignores I added in 5dc31f4 are correct under mypy with Python 3.11 semantics, but CI runs mypy under python3.9. There, sys.version_info < (3, 11) is statically True, so mypy treats the else branch as unreachable and never checks the typing.GenericAlias / typing._GenericAlias accesses. With warn_unused_ignores=True, the ignore comments themselves become the lint error. Drop them; the branch they protect is dead code from mypy 0.982's perspective on Python 3.9.

* Make typehints.py's GenericAlias ignores tolerant of both mypy modes

The typing.GenericAlias / typing._GenericAlias accesses sit inside a sys.version_info >= (3, 11) branch. Mypy under python_version 3.11+ needs the # type: ignore[attr-defined] (the stubs don't expose those attrs); mypy under python_version 3.9 / 3.10 considers the branch unreachable and flags the same ignores as unused. Re-add the ignores and silence warn_unused_ignores for pyspark.pandas.typedef.typehints so both modes pass. Locally verified against --python-version 3.9 and --python-version 3.11.

* Resolve typing.GenericAlias via getattr to dodge mypy attr-defined

The static typing.GenericAlias / typing._GenericAlias accesses tripped mypy across multiple Python versions:

- python_version 3.11+ stubs don't expose them, so an attr-defined error fires.
- python_version 3.9 / 3.10 treats the else branch as unreachable, so any # type: ignore[attr-defined] on those lines becomes "unused".

Resolve both classes through getattr with a safe fallback to type(None) so mypy never sees the static attribute access, while the runtime on Python 3.11+ still binds the real classes (see the sketch below). Drop the previous [mypy-pyspark.pandas.typedef.typehints] override that was tracking the warn_unused_ignores half of this dance.

* Skip mypy follow-imports for sqlalchemy

mypy 0.982 raises an INTERNAL ERROR while analyzing sqlalchemy/engine/default.py:334, which it sees because sqlalchemy is pulled in transitively (e.g. via mlflow). Skip following sqlalchemy imports, mirroring the existing pydantic carve-out. CI lint log:

starting mypy annotations test...
annotations failed mypy checks:
/usr/local/lib/python3.9/dist-packages/sqlalchemy/engine/default.py:334: error: INTERNAL ERROR -- Please try using mypy master on GitHub
version: 0.982

* Fix mypy data test failures

The CI lint log identified 10 failures in pytest-mypy-plugins (mypy data tests):

1. Cascade: many tests (test_session, test_udf, test_rdd) saw an unexpected output line:
   pyspark/sql/functions:71: error: Cannot determine type of "has_numpy" [has-type]
   The `has_numpy = False` + try/import/`has_numpy = True` idiom in pyspark/sql/utils.py left mypy unable to infer has_numpy's type when functions.py imports it. Annotate `has_numpy: bool = False` so mypy can resolve it and the cascade goes away (see the sketch below).
2. mypy 0.982 output drift on overload notes:
   - test_feature.yml::stringIndexerOverloads expected "def StringIndexer(self, ...)" in the overload notes, but mypy 0.982 prints "def __init__(self, ...)". Update the expected output.
   - test_functions.yml::varargFunctionsOverloads expected "def [ColumnOrName_] array(Union[...])" but mypy 0.982 includes the positional-only "__cols" param name. Update each note.

* Always skip R API doc generation in the docs build

SparkR is disabled in this fork's CI (the precondition forces sparkr=false), but the docs build still tried to run the SparkR API doc generation, because dev/is-changed.py -m sparkr returns true whenever the branch touches R/* files (it did, for the roxygen2 workaround). create-docs.sh then tries to load pkgdown via SparkR, and the R session has no pkgdown installed in its library path:

Error in loadNamespace(x) : there is no package called 'pkgdown'
R doc generation failed (RuntimeError)

and the jekyll build aborts. Force SKIP_RDOC=1 unconditionally so jekyll skips the R API doc step; PySpark doc skipping stays gated on is-changed.py as before.

https://claude.ai/code/session_01Rd1fWuMdJ8seM8WknhbkxE
Co-authored-by: Claude <noreply@anthropic.com>
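Two of the Python-side fixes above are small enough to sketch inline. First the has_numpy annotation; this mirrors the idiom the commit message describes, with the surrounding module elided:

```python
# Shape of the fix in pyspark/sql/utils.py: the explicit annotation lets
# mypy 0.982 resolve the flag's type when functions.py imports it.
has_numpy: bool = False
try:
    import numpy  # noqa: F401
    has_numpy = True
except ImportError:
    pass
```

Second, the getattr resolution for the GenericAlias classes, again mirroring the description above: mypy never sees a static attribute access, while Python 3.11+ still binds the real classes at runtime:

```python
import typing

# Fallback to type(None) keeps isinstance() checks safe on Pythons where
# the attributes don't exist.
GenericAlias = getattr(typing, "GenericAlias", type(None))
_GenericAlias = getattr(typing, "_GenericAlias", type(None))
```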
* Fix processClosure failure on primitive functions in SparkR

`environment()` returns NULL for primitives, so `parent.env(environment(func))` errored with "argument is not an environment" in test_rdd.R's "count by values and keys". Skip the parent-env check for primitives (identity is sufficient) and drop the redundant `ifelse(cond, TRUE, FALSE)`, which also dispatched through SparkR's S4 ifelse generic.

* Always assign captured function to cleaned closure env

The earlier primitive fix exposed a latent bug: when `found` was true, `break` skipped `assign(nodeChar, obj, envir = newEnv)`, so the captured function was never added to the cleaned closure. This was harmless before because the parent-env equality check rarely matched non-primitive closures, but for primitives like `+` shared across closures it does match, which caused 'object combineFunc not found' on workers in test_rdd.R countByValue. Skip only the recursive cleanClosure work when a function has already been examined; always emit the binding into newEnv (see the sketch below).

https://claude.ai/code/session_01USAV7aq8egCyxnpD2rDaSC
Co-authored-by: Claude <noreply@anthropic.com>
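The second fix above is a control-flow bug that is easier to see in a sketch. A hypothetical Python rendering, not the real R code in R/pkg/R/utils.R; the names and dict-based environments are illustrative only:

```python
def capture_binding(name, search_envs, new_env, examined):
    # Walk the enclosing environments (dicts here) looking for `name`.
    for env in search_envs:
        if name in env:
            obj = env[name]
            if name not in examined:
                examined.add(name)
                # ...recursively clean obj's own closure here (elided)...
            # The bug: an early `break` on the already-examined path ran
            # before this assignment, so a re-captured function (like a
            # shared `+`) never landed in the cleaned environment.
            new_env[name] = obj  # fix: always emit the binding
            break
    return new_env
```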
What changes were proposed in this pull request?
Update the base image in the CI infra Dockerfile to a supported Ubuntu release, and automatically run apt-get update and retry when apt-get install fails.
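A sketch of the retry-on-failure pattern this describes, in Dockerfile form since that is the file being changed; the package names are placeholders, not the image's actual list:

```dockerfile
# If the install fails (e.g. the cached package index has gone stale),
# refresh the index and try the same install once more.
RUN apt-get install -y build-essential curl || \
    (apt-get update && apt-get install -y build-essential curl)
```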
Why are the changes needed?
Two reasons: the pinned Ubuntu focal base image is no longer supported, and apt-get installs were failing in CI when the cached package index went stale.
Does this PR introduce any user-facing change?
No, CI only.
How was this patch tested?
Running through CI
Was this patch authored or co-authored using generative AI tooling?
Auto-complete with Copilot was turned on, but none of its suggestions were useful except for some comments.
Claude was used to add resilient retry logic to the Docker operations in the JDBC integration test suite, to handle transient failures from Docker registries and daemons, which have been flaky during tests (added here rather than in 4 and backported, since those classes have been rewritten in 4).
Claude was also used to suggest versions to pin back for the roxygen issues during the build.