feat(python/sedonadb): add DataFrame.union and union_distinct by jiayuasu · Pull Request #965 · apache/sedona-db

jiayuasu · 2026-06-16T05:33:28Z

Adds two relational set-union methods to the Python DataFrame, continuing the Ibis/DuckDB-style relational surface (after distinct/distinct_on in #961).

API

a.union(b)            # vertical concat, keeps duplicates   (SQL UNION ALL)
a.union_distinct(b)   # vertical concat, drops duplicates   (SQL UNION)

union keeps duplicate rows (matches SQL UNION ALL and PySpark's union).
union_distinct removes duplicate rows (SQL UNION).
Columns are matched by position; the inputs must share a schema. A mismatch surfaces as a SedonaError at call time (DataFusion builds the plan eagerly), e.g. "UNION queries have different number of columns".
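The SQL semantics that the two methods mirror can be sketched with the standard-library `sqlite3` module. This is only an illustration of `UNION ALL` versus `UNION` and of positional column matching in SQL; it does not use sedonadb, and the table and column names are made up for the example.

```python
import sqlite3

# In-memory database with two single-column tables whose column names differ.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE a (x INTEGER)")
con.execute("CREATE TABLE b (y INTEGER)")
con.executemany("INSERT INTO a VALUES (?)", [(1,), (2,), (2,)])
con.executemany("INSERT INTO b VALUES (?)", [(2,), (3,)])

# UNION ALL: plain vertical concatenation, duplicates are kept.
cur = con.execute("SELECT x FROM a UNION ALL SELECT y FROM b")
union_all_names = [d[0] for d in cur.description]  # names come from the left input
union_all_rows = sorted(row[0] for row in cur)

# UNION: concatenation followed by de-duplication across and within inputs.
cur = con.execute("SELECT x FROM a UNION SELECT y FROM b")
union_distinct_rows = sorted(row[0] for row in cur)

print(union_all_names)      # ['x']
print(union_all_rows)       # [1, 2, 2, 2, 3]
print(union_distinct_rows)  # [1, 2, 3]
```

Note that the columns `x` and `y` are combined purely by position and the result keeps the left-hand name, which is the behavior the review discussion later in this thread treats as a footgun.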

Naming note: the engine-neutral convention (SQL ALL-vs-not, Spark) is that bare union keeps duplicates and the _distinct variant dedupes — documented prominently in the docstrings since a SQL user might expect bare UNION (dedup) semantics from union. A by-name variant (DataFusion's union_by_name) can follow later if wanted.

Implementation

| File | Change |
| --- | --- |
| `python/sedonadb/src/dataframe.rs` | `InternalDataFrame::union` / `union_distinct`: thin wrappers over DataFusion's `DataFrame::union` / `union_distinct`. |
| `python/sedonadb/python/sedonadb/dataframe.py` | `DataFrame.union` / `union_distinct` with `isinstance` validation. |
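The layering described in the table (a Python method that validates its argument with `isinstance` and then delegates to a thin internal wrapper) might look roughly like the sketch below. `Frame` and `InternalFrame` are hypothetical stand-ins written for this example, not the sedonadb classes, and raising `TypeError` for a non-frame argument is an assumption; the PR text does not say which exception type is used.

```python
class InternalFrame:
    """Hypothetical stand-in for the Rust-backed internal data frame."""

    def __init__(self, rows):
        self.rows = list(rows)

    def union(self, other):
        # Keep every row from both inputs (UNION ALL).
        return InternalFrame(self.rows + other.rows)

    def union_distinct(self, other):
        # dict.fromkeys drops duplicates while preserving first-seen order.
        return InternalFrame(dict.fromkeys(self.rows + other.rows))


class Frame:
    """Hypothetical stand-in for the public Python wrapper."""

    def __init__(self, impl):
        self._impl = impl

    def _check_other(self, other, method):
        # Assumption: a TypeError is raised; the real exception type may differ.
        if not isinstance(other, Frame):
            raise TypeError(
                f"{method}() expects a Frame, got {type(other).__name__}"
            )

    def union(self, other):
        self._check_other(other, "union")
        return Frame(self._impl.union(other._impl))

    def union_distinct(self, other):
        self._check_other(other, "union_distinct")
        return Frame(self._impl.union_distinct(other._impl))


a = Frame(InternalFrame([(1,), (2,)]))
b = Frame(InternalFrame([(2,), (3,)]))
print(a.union(b)._impl.rows)           # [(1,), (2,), (2,), (3,)]
print(a.union_distinct(b)._impl.rows)  # [(1,), (2,), (3,)]
```

Validating in the Python layer keeps the error message readable for callers who pass, say, a pandas object by mistake, instead of surfacing a lower-level conversion failure.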

Test plan

9 tests in tests/expr/test_dataframe_union.py:

- union keeps duplicates; multi-column; position-based column matching (result takes the left's names).
- union_distinct dedupes across inputs and within each input.
- Lazy return for both.
- Errors: non-DataFrame argument (both methods); column-count mismatch.

Local: 9 unit + 29 doctests + full expr suite + ruff + cargo fmt --check + clippy -Dwarnings all clean.

Copilot

Pull request overview

This PR extends the Python sedonadb.DataFrame relational API with set-union operations, mirroring SQL semantics and aligning with the existing Ibis/DuckDB-style surface.

Changes:

- Add DataFrame.union(other) implementing SQL UNION ALL semantics (keeps duplicates).
- Add DataFrame.union_distinct(other) implementing SQL UNION semantics (drops duplicates).
- Add unit tests validating duplicate behavior, positional column matching, laziness, and key error cases.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated no comments.

| File | Description |
| --- | --- |
| python/sedonadb/python/sedonadb/dataframe.py | Adds public Python `DataFrame.union` / `union_distinct` methods with type validation and docstring examples. |
| python/sedonadb/src/dataframe.rs | Adds Rust `InternalDataFrame::union` / `union_distinct` thin wrappers over DataFusion. |
| python/sedonadb/tests/expr/test_dataframe_union.py | New tests covering correctness and error handling for the new APIs. |

paleolimbot

Thank you!

One suggestion for a test to verify what I think is the case but other than that this is good to go!

paleolimbot · 2026-06-23T20:04:08Z

+def test_union_column_count_mismatch_raises(con):
+    from sedonadb._lib import SedonaError
+
+    a = con.create_data_frame(pd.DataFrame({"x": [1]}))
+    b = con.create_data_frame(pd.DataFrame({"x": [1], "y": [2]}))
+    with pytest.raises(SedonaError, match="different number of columns"):
+        a.union(b)


Can you also check that unioning pd.DataFrame({"x": [1]}) with pd.DataFrame({"y": [1]}) (i.e., same number of columns but different names) is an error? (If it is not an error, this should be opt-in because it's a footgun otherwise).

Can you ensure that this test and the one I suggested also applies to union_distinct()?

jiayuasu · Jun 24, 2026

Good call: it was not an error before (DataFusion's union aligns by position and silently takes the left's names), so I made matching column names the requirement rather than leaving the footgun in. Both union and union_distinct now raise a clear ValueError when the two inputs don't have the same column names in the same order; a positional union of differently-named columns is opt-in (align names first, e.g. a.union(b.select(col('y').alias('x')))).

Tests added/updated in d599900, parametrized over both methods: different-names raises, different-count raises, plus a test for the opt-in alignment path. Also rebased on latest main.
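The stricter rule described in this reply (same column names in the same order, otherwise a `ValueError`, with renaming as the explicit opt-in) can be sketched in plain Python. The helper names and the error message wording below are hypothetical and chosen for illustration; they are not taken from the sedonadb source.

```python
def check_union_columns(left_cols, right_cols):
    """Raise ValueError unless both inputs have identical names in identical order."""
    left_cols, right_cols = list(left_cols), list(right_cols)
    if left_cols != right_cols:
        raise ValueError(
            "union requires the same column names in the same order; "
            f"got {left_cols} and {right_cols}"
        )


def union_rows(left_cols, left_rows, right_cols, right_rows, distinct=False):
    """Toy union over (columns, rows) pairs that applies the name check first."""
    check_union_columns(left_cols, right_cols)
    rows = list(left_rows) + list(right_rows)
    if distinct:
        # Drop duplicates while preserving first-seen order.
        rows = list(dict.fromkeys(rows))
    return list(left_cols), rows


# Same count, different names: rejected instead of silently aligned by position.
try:
    union_rows(["x"], [(1,)], ["y"], [(1,)])
except ValueError as exc:
    mismatch_error = str(exc)

# Opt-in path: the caller renames the right-hand column explicitly first.
renamed_cols = ["x"]  # stands in for selecting y aliased as x
cols, rows = union_rows(["x"], [(1,)], renamed_cols, [(1,)])
print(cols, rows)  # ['x'] [(1,), (1,)]
```

Comparing the full ordered name lists covers both error paths discussed in the review: a different number of columns and the same number with different names.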

Two relational set-union methods, continuing the Ibis/DuckDB-style relational surface:

- `df.union(other)`: vertical concatenation keeping duplicate rows (SQL `UNION ALL`; matches PySpark's `union`).
- `df.union_distinct(other)`: concatenation with duplicate rows removed (SQL `UNION`).

Both are thin wrappers over DataFusion's `DataFrame::union` / `union_distinct`. To avoid the positional-union footgun (silently aligning differently-named columns), both require the two inputs to have the same column names in the same order and raise a clear ValueError otherwise. A positional union of differently-named columns is opt-in: align the names first (e.g. with `select`).

Naming follows the engine-neutral convention (SQL/Spark): `union` keeps duplicates, `union_distinct` dedupes.

Tests: 12 in tests/expr/test_dataframe_union.py: union keeps dupes, multi-column, union_distinct dedupes (across and within inputs), the opt-in positional-alignment path, lazy return, non-DataFrame arg errors, and (parametrized over both methods) the different-names and different-count error paths.

paleolimbot

Thank you!

jiayuasu requested a review from Copilot June 16, 2026 07:29

Copilot started reviewing on behalf of jiayuasu June 16, 2026 07:30

Copilot AI reviewed Jun 16, 2026

jiayuasu marked this pull request as ready for review June 22, 2026 18:24

jiayuasu requested a review from paleolimbot June 22, 2026 18:24

paleolimbot approved these changes Jun 23, 2026

jiayuasu force-pushed the feature/df-union branch from bec6373 to d599900 Compare June 24, 2026 07:22

paleolimbot approved these changes Jun 24, 2026

jiayuasu merged commit f1fa759 into apache:main Jun 24, 2026
5 checks passed

jiayuasu merged 1 commit into apache:main from jiayuasu:feature/df-union
