feat(python/sedonadb): add DataFrame.intersect, intersect_distinct, except_ #1006
Open
jiayuasu wants to merge 1 commit into
Pull request overview
This PR completes the Python DataFrame relational set-operation surface by adding intersect, intersect_distinct, and except_, implemented as thin wrappers over the underlying DataFusion set-op APIs and guarded by a shared compatibility check to prevent accidental positional misalignment.
Changes:
- Add `DataFrame.intersect`, `DataFrame.intersect_distinct`, and `DataFrame.except_` to the Python API with docstrings and input validation.
- Extend the Rust `InternalDataFrame` pyclass with corresponding set-op methods.
- Add a dedicated pytest suite covering semantics (multiplicity vs distinct), multi-column behavior, laziness, and argument/compatibility errors.
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| python/sedonadb/src/dataframe.rs | Adds Rust-side InternalDataFrame wrappers for intersect/intersect_distinct/except_ via DataFusion. |
| python/sedonadb/python/sedonadb/dataframe.py | Adds the new Python DataFrame methods and generalizes the set-op compatibility guard used by union/intersect/except. |
| python/sedonadb/tests/expr/test_dataframe_set_ops.py | Introduces coverage for intersect/except semantics and error handling. |
…xcept_

Completes the relational set-op family (after distinct and union):

- `df.intersect(other)`: rows in both, preserving multiplicity (SQL `INTERSECT ALL`).
- `df.intersect_distinct(other)`: distinct rows in both (SQL `INTERSECT`).
- `df.except_(other)`: distinct rows in this DataFrame not in the other (SQL `EXCEPT`). The trailing underscore avoids the `except` keyword (matching DuckDB's Python API).

Thin wrappers over DataFusion's `intersect` / `intersect_distinct` / `except_distinct`. As with union, all three require the two inputs to have the same column names in the same order (shared `_check_set_op_compatible` guard, renamed from the union-only helper) so a positional set op can't silently misalign differently-named columns.

No `except_distinct` method: DataFusion's `EXCEPT ALL` does not preserve multiplicity (it returns the same distinct result as `EXCEPT`), so a separate ALL/distinct pair for except would be two methods that behave identically. `except_` therefore exposes the distinct semantics only, documented as such; an `except_all` can be added if/when the engine supports multiplicity-preserving `EXCEPT ALL`.

Tests (tests/expr/test_dataframe_set_ops.py): intersect multiplicity vs distinct, multi-column intersect/except, except distinct difference, lazy return, and (parametrized over the three methods) the different-names and non-DataFrame error paths.
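The compatibility rule in the commit message (same column names, same order) can be pictured with a minimal sketch. This is not the code from the PR: `check_set_op_compatible` below is a hypothetical stand-in that works on plain lists of column names, and the exception type and message wording are assumptions.

```python
from typing import Sequence


def check_set_op_compatible(left: Sequence[str], right: Sequence[str], op: str) -> None:
    """Hypothetical stand-in for the shared guard: require identical column
    names in identical order before a positional set operation runs."""
    if list(left) != list(right):
        # Assumed error type and wording; the real guard may differ.
        raise ValueError(
            f"{op} requires both inputs to have the same column names in the "
            f"same order; got {list(left)} and {list(right)}"
        )


# Same names, same order: accepted (returns None).
check_set_op_compatible(["id", "name"], ["id", "name"], "intersect")

# Same names, different order: rejected, because a positional set op would
# otherwise compare "id" values against "name" values.
try:
    check_set_op_compatible(["id", "name"], ["name", "id"], "except_")
except ValueError as err:
    print(err)
```

Comparing ordered lists rather than sets is the point of the check: a set comparison would accept reordered columns, which is exactly the silent misalignment the guard is meant to prevent.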
1a3294a to d16c2b5
Completes the relational set-op family on the Python `DataFrame`, after `distinct`/`distinct_on` (#961) and `union`/`union_distinct` (#965).

API

- `intersect` preserves multiplicity; `intersect_distinct` de-duplicates.
- `except_` uses a trailing underscore to avoid Python's `except` keyword (matching DuckDB's Python relational API).
- All three require both inputs to have the same column names in the same order (shared `_check_set_op_compatible` guard, generalized from the union-only helper introduced in feat(python/sedonadb): add DataFrame.union and union_distinct #965), so a positional set op can't silently misalign differently-named columns. A positional combination of differently-named columns is opt-in: align names first with `select`.

Why no `except_distinct`

DataFusion's `EXCEPT ALL` does not preserve multiplicity: `[1,1,1] EXCEPT ALL [1]` returns `[]` (verified, including via raw SQL), the same result as `EXCEPT`. So a separate ALL/distinct pair for `except` would be two public methods that behave identically today. `except_` therefore exposes the distinct semantics only (wired to DataFusion's `except_distinct`), documented as such. An `except_all` can be added if/when the engine gains multiplicity-preserving `EXCEPT ALL`.

(For `intersect`, `INTERSECT ALL` does work, so both variants are kept.)

Implementation

- `python/sedonadb/src/dataframe.rs`: `InternalDataFrame::intersect` / `intersect_distinct` / `except_`, thin wrappers over DataFusion's `intersect` / `intersect_distinct` / `except_distinct`.
- `python/sedonadb/python/sedonadb/dataframe.py`: `DataFrame.intersect` / `intersect_distinct` / `except_`; `_check_union_compatible` renamed to `_check_set_op_compatible` and reused by all set ops (union included).

Test plan

- `tests/expr/test_dataframe_set_ops.py`: `intersect` preserves multiplicity; `intersect_distinct` dedupes; multi-column intersect.
- `except_` distinct difference; multi-column except.
- Local: full set-op + union suite + 31 doctests + expr suite + `ruff` + `cargo fmt --check` + `clippy -Dwarnings` all clean.
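For readers who want the row-level semantics spelled out, here is a small pure-Python model of the three operations. It does not call sedonadb or DataFusion; the function names are made up for illustration, and rows are modeled as plain hashable values. `except_all_standard` is included only for contrast: it shows what a multiplicity-preserving `EXCEPT ALL` would return, which differs from the `[]` result this description reports for DataFusion.

```python
from collections import Counter


def intersect_all(left, right):
    """INTERSECT ALL: each row appears min(count_left, count_right) times."""
    return sorted((Counter(left) & Counter(right)).elements())


def intersect_distinct(left, right):
    """INTERSECT: each row present in both inputs appears exactly once."""
    return sorted(set(left) & set(right))


def except_distinct(left, right):
    """EXCEPT: distinct rows of left that do not occur in right at all."""
    return sorted(set(left) - set(right))


def except_all_standard(left, right):
    """Multiplicity-preserving EXCEPT ALL: max(count_left - count_right, 0) copies."""
    return sorted((Counter(left) - Counter(right)).elements())


print(intersect_all([1, 1, 2, 3], [1, 1, 1, 2]))       # [1, 1, 2]
print(intersect_distinct([1, 1, 2, 3], [1, 1, 1, 2]))  # [1, 2]
print(except_distinct([1, 1, 1], [1]))                 # []
print(except_all_standard([1, 1, 1], [1]))             # [1, 1]
```

In this model `except_` corresponds to `except_distinct`, while `intersect` and `intersect_distinct` correspond to the first two functions.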