feat(python/sedonadb): add DataFrame.union and union_distinct #965

Merged
jiayuasu merged 1 commit into apache:main from jiayuasu:feature/df-union
Jun 24, 2026
@jiayuasu

Member

Adds two relational set-union methods to the Python DataFrame, continuing the Ibis/DuckDB-style relational surface (after distinct/distinct_on in #961).

API

a.union(b)            # vertical concat, keeps duplicates   (SQL UNION ALL)
a.union_distinct(b)   # vertical concat, drops duplicates   (SQL UNION)
  • union keeps duplicate rows (matches SQL UNION ALL and PySpark's union).
  • union_distinct removes duplicate rows (SQL UNION).
  • Columns are matched by position; the inputs must share a schema. A mismatch surfaces as a SedonaError at call time (DataFusion builds the plan eagerly), e.g. "UNION queries have different number of columns".
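
The SQL semantics these bullets refer to can be checked with any SQL engine. The sketch below uses Python's standard-library `sqlite3` rather than sedonadb, so it illustrates the UNION ALL / UNION distinction and positional column matching in SQL itself, not the new methods. The table names `a`, `b`, and `c` are made up for the illustration.

```python
import sqlite3

# In-memory SQLite database, used only to illustrate the SQL semantics.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE a (x INTEGER)")
con.execute("CREATE TABLE b (x INTEGER)")
con.execute("CREATE TABLE c (y INTEGER)")
con.executemany("INSERT INTO a VALUES (?)", [(1,), (2,), (2,)])
con.executemany("INSERT INTO b VALUES (?)", [(2,), (3,)])
con.executemany("INSERT INTO c VALUES (?)", [(9,)])

# UNION ALL keeps every row from both inputs, duplicates included.
union_all = sorted(
    con.execute("SELECT x FROM a UNION ALL SELECT x FROM b").fetchall()
)

# UNION removes duplicates, both across inputs and within each input.
union_distinct = sorted(
    con.execute("SELECT x FROM a UNION SELECT x FROM b").fetchall()
)

# Columns are matched by position: the result takes the left input's name.
cursor = con.execute("SELECT x FROM a UNION ALL SELECT y FROM c")
positional_name = cursor.description[0][0]

print(union_all)        # [(1,), (2,), (2,), (2,), (3,)]
print(union_distinct)   # [(1,), (2,), (3,)]
print(positional_name)  # x
```

The last query shows why a purely positional union can surprise: the `y` values are returned under the name `x` without any warning.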

Naming note: following the convention shared by Spark and by SQL's UNION ALL / UNION split, bare union keeps duplicates and the _distinct variant dedupes. The docstrings state this prominently, since a SQL user might expect union to behave like bare UNION (which dedupes). A by-name variant (DataFusion's union_by_name) can follow later if wanted.

Implementation

| File | Change |
| --- | --- |
| python/sedonadb/src/dataframe.rs | InternalDataFrame::union / union_distinct: thin wrappers over DataFusion's DataFrame::union / union_distinct. |
| python/sedonadb/python/sedonadb/dataframe.py | DataFrame.union / union_distinct with isinstance validation. |
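
The layering described here (a public Python method that validates its argument, then delegates to an internal object) could look roughly like the sketch below. It is not the code from this PR: `InternalFrame` and `Frame` are hypothetical stand-ins for the Rust-backed `InternalDataFrame` and the Python `DataFrame`, rows are plain tuples, and raising `TypeError` for a non-frame argument is an assumption, since the PR does not state which exception type is used.

```python
class InternalFrame:
    """Hypothetical stand-in for the Rust-backed InternalDataFrame."""

    def __init__(self, rows):
        self.rows = list(rows)

    def union(self, other):
        # Vertical concatenation; duplicates are kept (UNION ALL).
        return InternalFrame(self.rows + other.rows)

    def union_distinct(self, other):
        # dict.fromkeys drops duplicates while keeping first-seen order.
        return InternalFrame(dict.fromkeys(self.rows + other.rows))


class Frame:
    """Hypothetical stand-in for the public Python DataFrame."""

    def __init__(self, impl):
        self._impl = impl

    def _check_other(self, other, method):
        # isinstance validation before touching the internal object.
        # TypeError is an assumption made for this sketch.
        if not isinstance(other, Frame):
            raise TypeError(
                f"{method}() expects a Frame, got {type(other).__name__}"
            )

    def union(self, other):
        self._check_other(other, "union")
        return Frame(self._impl.union(other._impl))

    def union_distinct(self, other):
        self._check_other(other, "union_distinct")
        return Frame(self._impl.union_distinct(other._impl))


a = Frame(InternalFrame([(1,), (1,)]))
b = Frame(InternalFrame([(1,), (2,)]))
print(a.union(b)._impl.rows)           # [(1,), (1,), (1,), (2,)]
print(a.union_distinct(b)._impl.rows)  # [(1,), (2,)]
```

Validating in the Python layer keeps the error message in Python terms instead of surfacing a lower-level conversion failure from the extension module.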

Test plan

9 tests in tests/expr/test_dataframe_union.py:

  • union keeps duplicates; multi-column; position-based column matching (result takes the left's names).
  • union_distinct dedupes across inputs and within each input.
  • Lazy return for both.
  • Errors: non-DataFrame argument (both methods); column-count mismatch.

Local: 9 unit + 29 doctests + full expr suite + ruff + cargo fmt --check + clippy -Dwarnings all clean.

Copilot AI left a comment

Contributor

Pull request overview

This PR extends the Python sedonadb.DataFrame relational API with set-union operations, mirroring SQL semantics and aligning with the existing Ibis/DuckDB-style surface.

Changes:

  • Add DataFrame.union(other) implementing SQL UNION ALL semantics (keeps duplicates).
  • Add DataFrame.union_distinct(other) implementing SQL UNION semantics (drops duplicates).
  • Add unit tests validating duplicate behavior, positional column matching, laziness, and key error cases.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated no comments.

| File | Description |
| --- | --- |
| python/sedonadb/python/sedonadb/dataframe.py | Adds public Python DataFrame.union / union_distinct methods with type validation and docstring examples. |
| python/sedonadb/src/dataframe.rs | Adds Rust InternalDataFrame::union / union_distinct thin wrappers over DataFusion. |
| python/sedonadb/tests/expr/test_dataframe_union.py | New tests covering correctness and error handling for the new APIs. |

@jiayuasu jiayuasu marked this pull request as ready for review June 22, 2026 18:24
@jiayuasu jiayuasu requested a review from paleolimbot June 22, 2026 18:24

@paleolimbot paleolimbot left a comment

Member

Thank you!

One suggestion for a test to verify what I think is the case but other than that this is good to go!

Comment on lines +82 to +88
def test_union_column_count_mismatch_raises(con):
    from sedonadb._lib import SedonaError

    a = con.create_data_frame(pd.DataFrame({"x": [1]}))
    b = con.create_data_frame(pd.DataFrame({"x": [1], "y": [2]}))
    with pytest.raises(SedonaError, match="different number of columns"):
        a.union(b)

Member

Can you also check that unioning pd.DataFrame({"x": [1]}) with pd.DataFrame({"y": [1]}) (i.e., same number of columns but different names) is an error? (If it is not an error, this should be opt-in because it's a footgun otherwise).

Can you ensure that this test and the one I suggested also apply to union_distinct()?

Member Author

Good call — it was not an error before (DataFusion's union aligns by position and silently takes the left's names), so I made matching column names the requirement rather than leaving the footgun in. Both union and union_distinct now raise a clear ValueError when the two inputs don't have the same column names in the same order; a positional union of differently-named columns is opt-in (align names first, e.g. a.union(b.select(col('y').alias('x')))).

Tests added/updated in d599900, parametrized over both methods: different-names raises, different-count raises, plus a test for the opt-in alignment path. Also rebased on latest main.
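
The name check described in this reply can be sketched in plain Python. This is an illustration of the rule (same column names, same order, otherwise ValueError), not the code added in d599900: `check_same_columns` and `rename_columns` are hypothetical helpers, the error message wording is invented, and column lists stand in for real DataFrame schemas.

```python
def check_same_columns(left_cols, right_cols, method="union"):
    """Hypothetical helper: require identical column names in identical order."""
    left, right = list(left_cols), list(right_cols)
    if left != right:
        raise ValueError(
            f"{method}() requires both inputs to have the same column names "
            f"in the same order; got {left} and {right}"
        )


def rename_columns(cols, mapping):
    """Hypothetical helper: the opt-in step of aligning names before the union."""
    return [mapping.get(name, name) for name in cols]


# Same count, different names: rejected instead of silently aligned by position.
try:
    check_same_columns(["x"], ["y"])
except ValueError as err:
    print(f"rejected: {err}")

# Same names, different order: also rejected.
try:
    check_same_columns(["x", "y"], ["y", "x"], method="union_distinct")
except ValueError as err:
    print(f"rejected: {err}")

# Opt-in positional union: align the names first, then the check passes.
aligned = rename_columns(["y"], {"y": "x"})
check_same_columns(["x"], aligned)
print(aligned)  # ['x']
```

Making the rename an explicit step means a positional union of differently-named columns only happens when the caller has asked for it.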

Two relational set-union methods, continuing the Ibis/DuckDB-style
relational surface:

- `df.union(other)` — vertical concatenation keeping duplicate rows
  (SQL `UNION ALL`; matches PySpark's `union`).
- `df.union_distinct(other)` — concatenation with duplicate rows removed
  (SQL `UNION`).

Both are thin wrappers over DataFusion's `DataFrame::union` /
`union_distinct`. To avoid the positional-union footgun (silently
aligning differently-named columns), both require the two inputs to have
the same column names in the same order and raise a clear ValueError
otherwise. A positional union of differently-named columns is opt-in:
align the names first (e.g. with `select`).

Naming follows the engine-neutral convention (SQL/Spark): `union` keeps
duplicates, `union_distinct` dedupes.

Tests: 12 in tests/expr/test_dataframe_union.py — union keeps dupes,
multi-column, union_distinct dedupes (across and within inputs), the
opt-in positional-alignment path, lazy return, non-DataFrame arg errors,
and (parametrized over both methods) the different-names and
different-count error paths.

@paleolimbot paleolimbot left a comment

Member

Thank you!

@jiayuasu jiayuasu merged commit f1fa759 into apache:main Jun 24, 2026
5 checks passed