feat(python/sedonadb): add DataFrame.union and union_distinct #965
Pull request overview
This PR extends the Python sedonadb.DataFrame relational API with set-union operations, mirroring SQL semantics and aligning with the existing Ibis/DuckDB-style surface.
Changes:
- Add `DataFrame.union(other)` implementing SQL `UNION ALL` semantics (keeps duplicates).
- Add `DataFrame.union_distinct(other)` implementing SQL `UNION` semantics (drops duplicates).
- Add unit tests validating duplicate behavior, positional column matching, laziness, and key error cases.
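As a rough illustration of the two SQL semantics these methods mirror, the sketch below uses Python's standard-library `sqlite3` rather than sedonadb itself. The table names `a` and `b` are made up for the example.

```python
import sqlite3

# Two tiny single-column tables; the table names are hypothetical.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE a (x INTEGER)")
con.execute("CREATE TABLE b (x INTEGER)")
con.executemany("INSERT INTO a VALUES (?)", [(1,), (2,)])
con.executemany("INSERT INTO b VALUES (?)", [(2,), (3,)])

# UNION ALL keeps every row from both inputs (the semantics described for union()).
union_all_rows = sorted(
    r[0] for r in con.execute("SELECT x FROM a UNION ALL SELECT x FROM b")
)

# UNION removes duplicate rows (the semantics described for union_distinct()).
union_rows = sorted(
    r[0] for r in con.execute("SELECT x FROM a UNION SELECT x FROM b")
)

print(union_all_rows)  # [1, 2, 2, 3]
print(union_rows)      # [1, 2, 3]
```

The value `2` appears in both inputs, so it shows up twice under `UNION ALL` and once under `UNION`.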
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated no comments.
| File | Description |
|---|---|
| python/sedonadb/python/sedonadb/dataframe.py | Adds public Python DataFrame.union / union_distinct methods with type validation and docstring examples. |
| python/sedonadb/src/dataframe.rs | Adds Rust InternalDataFrame::union / union_distinct thin wrappers over DataFusion. |
| python/sedonadb/tests/expr/test_dataframe_union.py | New tests covering correctness and error handling for the new APIs. |
paleolimbot left a comment
Thank you!
One suggestion for a test to verify what I think is the case but other than that this is good to go!
    def test_union_column_count_mismatch_raises(con):
        from sedonadb._lib import SedonaError

        a = con.create_data_frame(pd.DataFrame({"x": [1]}))
        b = con.create_data_frame(pd.DataFrame({"x": [1], "y": [2]}))
        with pytest.raises(SedonaError, match="different number of columns"):
            a.union(b)
Can you also check that unioning pd.DataFrame({"x": [1]}) with pd.DataFrame({"y": [1]}) (i.e., same number of columns but different names) is an error? (If it is not an error, this should be opt-in because it's a footgun otherwise).
Can you ensure that this test and the one I suggested also applies to union_distinct()?
Good call — it was not an error before (DataFusion's union aligns by position and silently takes the left's names), so I made matching column names the requirement rather than leaving the footgun in. Both union and union_distinct now raise a clear ValueError when the two inputs don't have the same column names in the same order; a positional union of differently-named columns is opt-in (align names first, e.g. a.union(b.select(col('y').alias('x')))).
Tests added/updated in d599900, parametrized over both methods: different-names raises, different-count raises, plus a test for the opt-in alignment path. Also rebased on latest main.
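The name check described in this reply could look roughly like the sketch below. It is not the code from the PR: `check_union_columns` is a hypothetical helper that works on plain lists of column names, and the error message wording is invented for illustration.

```python
from typing import Sequence


def check_union_columns(left: Sequence[str], right: Sequence[str]) -> None:
    """Raise ValueError unless both inputs have the same column names in the same order.

    Hypothetical helper for illustration only; not part of sedonadb.
    """
    if list(left) != list(right):
        raise ValueError(
            "union requires matching column names in the same order: "
            f"left has {list(left)}, right has {list(right)}"
        )


# Same names, same order: passes silently.
check_union_columns(["x", "y"], ["x", "y"])

# Same count but different names: rejected instead of being aligned by position.
try:
    check_union_columns(["x"], ["y"])
except ValueError as err:
    print(err)
```

Comparing the full ordered lists covers the different-names, different-count, and different-order cases with a single condition.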
Two relational set-union methods, continuing the Ibis/DuckDB-style relational surface:

- `df.union(other)`: vertical concatenation keeping duplicate rows (SQL `UNION ALL`; matches PySpark's `union`).
- `df.union_distinct(other)`: concatenation with duplicate rows removed (SQL `UNION`).

Both are thin wrappers over DataFusion's `DataFrame::union` / `union_distinct`. To avoid the positional-union footgun (silently aligning differently-named columns), both require the two inputs to have the same column names in the same order and raise a clear ValueError otherwise. A positional union of differently-named columns is opt-in: align the names first (e.g. with `select`).

Naming follows the engine-neutral convention (SQL/Spark): `union` keeps duplicates, `union_distinct` dedupes.

Tests (12, in tests/expr/test_dataframe_union.py): union keeps dupes, multi-column, union_distinct dedupes (across and within inputs), the opt-in positional-alignment path, lazy return, non-DataFrame arg errors, and (parametrized over both methods) the different-names and different-count error paths.
bec6373 to d599900
Adds two relational set-union methods to the Python `DataFrame`, continuing the Ibis/DuckDB-style relational surface (after `distinct` / `distinct_on` in #961).

API

- `union` keeps duplicate rows (matches SQL `UNION ALL` and PySpark's `union`).
- `union_distinct` removes duplicate rows (SQL `UNION`).
- […] `SedonaError` at call time (DataFusion builds the plan eagerly), e.g. "UNION queries have different number of columns".

Naming note: the engine-neutral convention (SQL ALL-vs-not, Spark) is that bare `union` keeps duplicates and the `_distinct` variant dedupes. This is documented prominently in the docstrings, since a SQL user might expect bare `UNION` (dedup) semantics from `union`. A by-name variant (DataFusion's `union_by_name`) can follow later if wanted.

Implementation

- `python/sedonadb/src/dataframe.rs`: `InternalDataFrame::union` / `union_distinct`, thin wrappers over DataFusion's `DataFrame::union` / `union_distinct`.
- `python/sedonadb/python/sedonadb/dataframe.py`: `DataFrame.union` / `union_distinct` with `isinstance` validation.

Test plan

9 tests in `tests/expr/test_dataframe_union.py`:

- `union` keeps duplicates; multi-column; position-based column matching (result takes the left's names).
- `union_distinct` dedupes across inputs and within each input.

Local: 9 unit + 29 doctests + full expr suite + `ruff` + `cargo fmt --check` + `clippy -Dwarnings` all clean.
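The position-based matching noted in the test plan is also how plain SQL compound queries behave. The small sketch below uses the standard-library `sqlite3` (not sedonadb or DataFusion) to show a union of differently named columns taking the left input's name without complaint:

```python
import sqlite3

con = sqlite3.connect(":memory:")

# The two sides use different column names; SQL unions align them by position.
cur = con.execute("SELECT 1 AS x UNION ALL SELECT 2 AS y")
column_names = [d[0] for d in cur.description]
rows = sorted(r[0] for r in cur.fetchall())

print(column_names)  # ['x']
print(rows)          # [1, 2]
```

The `y` label is dropped without any warning, which is the kind of silent alignment a by-name check guards against.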