
feat(python/sedonadb): add DataFrame.intersect, intersect_distinct, except_ #1006

Open
jiayuasu wants to merge 1 commit into apache:main from jiayuasu:feature/df-set-ops

@jiayuasu

Member

Completes the relational set-op family on the Python DataFrame, after distinct/distinct_on (#961) and union/union_distinct (#965).

API

a.intersect(b)            # rows in both, multiset      (SQL INTERSECT ALL)
a.intersect_distinct(b)   # distinct rows in both       (SQL INTERSECT)
a.except_(b)              # distinct rows in a not in b (SQL EXCEPT)
  • intersect preserves multiplicity; intersect_distinct de-duplicates.
  • except_ uses a trailing underscore to avoid Python's except keyword (matching DuckDB's Python relational API).
  • All three require the two inputs to have the same column names in the same order. The shared _check_set_op_compatible guard, generalized from the union-only helper introduced in #965, enforces this so that a positional set op can't silently misalign differently-named columns. Combining differently-named columns positionally is opt-in: align the names first with select.
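The three semantics listed above can be modelled in plain Python. This is only an illustrative sketch: it uses `collections.Counter` and `set` on lists of tuples rather than the sedonadb API, and the helper names are made up for the example. Results are sorted because a real engine does not guarantee row order.

```python
from collections import Counter

def intersect_all(a, b):
    """Multiset intersection: a row appears min(count_a, count_b) times."""
    return sorted((Counter(a) & Counter(b)).elements())

def intersect_distinct(a, b):
    """Set intersection: each common row appears exactly once."""
    return sorted(set(a) & set(b))

def except_distinct(a, b):
    """Distinct rows of a that do not occur in b at all."""
    return sorted(set(a) - set(b))

# Rows are tuples, standing in for two-column DataFrame rows.
a = [(1, "x"), (1, "x"), (2, "y"), (3, "z")]
b = [(1, "x"), (1, "x"), (1, "x"), (3, "z")]

print(intersect_all(a, b))       # [(1, 'x'), (1, 'x'), (3, 'z')]
print(intersect_distinct(a, b))  # [(1, 'x'), (3, 'z')]
print(except_distinct(a, b))     # [(2, 'y')]
```

The row `(1, "x")` occurs twice in `a` and three times in `b`, so the multiset intersection keeps two copies while the distinct variant keeps one.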

Why no except_distinct

DataFusion's EXCEPT ALL does not preserve multiplicity — [1,1,1] EXCEPT ALL [1] returns [] (verified, including via raw SQL), the same result as EXCEPT. So a separate ALL/distinct pair for except would be two public methods that behave identically today. except_ therefore exposes the distinct semantics only (wired to DataFusion's except_distinct), documented as such. An except_all can be added if/when the engine gains multiplicity-preserving EXCEPT ALL.

(For intersect, INTERSECT ALL does work, so both variants are kept.)
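To make the gap concrete, the sketch below contrasts a multiplicity-preserving EXCEPT ALL (multiset difference, as in standard SQL) with the distinct EXCEPT, using the `[1,1,1]` / `[1]` input from the description. It is a plain-Python model with hypothetical helper names, not DataFusion or sedonadb code.

```python
from collections import Counter

def except_all_multiset(a, b):
    """Multiset difference: a row appears max(count_a - count_b, 0) times."""
    return sorted((Counter(a) - Counter(b)).elements())

def except_distinct(a, b):
    """Distinct rows of a that do not occur in b at all."""
    return sorted(set(a) - set(b))

a, b = [1, 1, 1], [1]

print(except_all_multiset(a, b))  # [1, 1]
print(except_distinct(a, b))      # []
```

According to the description, DataFusion's EXCEPT ALL currently returns the second result for this input, which is why only the distinct form is exposed.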

Implementation

| File | Change |
| --- | --- |
| python/sedonadb/src/dataframe.rs | InternalDataFrame::intersect / intersect_distinct / except_: thin wrappers over DataFusion's intersect / intersect_distinct / except_distinct. |
| python/sedonadb/python/sedonadb/dataframe.py | DataFrame.intersect / intersect_distinct / except_; _check_union_compatible renamed to _check_set_op_compatible and reused by all set ops (union included). |
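A guard of this kind might look roughly like the following. This is a hypothetical sketch, not the code in the PR: the function name, its signature (plain lists of column names), the `ValueError` type, and the message wording are all assumptions made for illustration.

```python
def check_set_op_compatible(left_names, right_names, op_name):
    """Hypothetical stand-in for the shared guard.

    Requires identical column names in identical order, so a positional
    set op never pairs up columns that have different names.
    """
    left, right = list(left_names), list(right_names)
    if left != right:
        raise ValueError(
            f"{op_name}() requires the same column names in the same order; "
            f"got {left} and {right}"
        )

# Same names, same order: passes silently.
check_set_op_compatible(["id", "geom"], ["id", "geom"], "intersect")

# Same names, different order: rejected instead of being combined positionally.
try:
    check_set_op_compatible(["id", "geom"], ["geom", "id"], "except_")
except ValueError as err:
    print(err)
```

Comparing the ordered lists, rather than sets of names, is what catches the reordered case.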

Test plan

tests/expr/test_dataframe_set_ops.py:

  • intersect preserves multiplicity; intersect_distinct dedupes; multi-column intersect.
  • except_ distinct difference; multi-column except.
  • Lazy return for all three.
  • Parametrized over the three methods: different-column-names raises; non-DataFrame argument raises.

Local: full set-op + union suite + 31 doctests + expr suite + ruff + cargo fmt --check + clippy -Dwarnings all clean.

Copilot AI left a comment

Contributor

Pull request overview

This PR completes the Python DataFrame relational set-operation surface by adding intersect, intersect_distinct, and except_, implemented as thin wrappers over the underlying DataFusion set-op APIs and guarded by a shared compatibility check to prevent accidental positional misalignment.

Changes:

  • Add DataFrame.intersect, DataFrame.intersect_distinct, and DataFrame.except_ to the Python API with docstrings and input validation.
  • Extend the Rust InternalDataFrame pyclass with corresponding set-op methods.
  • Add a dedicated pytest suite covering semantics (multiplicity vs distinct), multi-column behavior, laziness, and argument/compatibility errors.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 1 comment.

| File | Description |
| --- | --- |
| python/sedonadb/src/dataframe.rs | Adds Rust-side InternalDataFrame wrappers for intersect/intersect_distinct/except_ via DataFusion. |
| python/sedonadb/python/sedonadb/dataframe.py | Adds the new Python DataFrame methods and generalizes the set-op compatibility guard used by union/intersect/except. |
| python/sedonadb/tests/expr/test_dataframe_set_ops.py | Introduces coverage for intersect/except semantics and error handling. |

Comment thread python/sedonadb/python/sedonadb/dataframe.py Outdated
…xcept_

Completes the relational set-op family (after distinct and union):

- `df.intersect(other)` — rows in both, preserving multiplicity
  (SQL `INTERSECT ALL`).
- `df.intersect_distinct(other)` — distinct rows in both (SQL `INTERSECT`).
- `df.except_(other)` — distinct rows in this DataFrame not in the other
  (SQL `EXCEPT`). The trailing underscore avoids the `except` keyword
  (matching DuckDB's Python API).

Thin wrappers over DataFusion's `intersect` / `intersect_distinct` /
`except_distinct`. As with union, all three require the two inputs to
have the same column names in the same order (shared
`_check_set_op_compatible` guard, renamed from the union-only helper) so
a positional set op can't silently misalign differently-named columns.

No `except_distinct` method: DataFusion's `EXCEPT ALL` does not preserve
multiplicity (it returns the same distinct result as `EXCEPT`), so a
separate ALL/distinct pair for except would be two methods that behave
identically. `except_` therefore exposes the distinct semantics only,
documented as such; an `except_all` can be added if/when the engine
supports multiplicity-preserving `EXCEPT ALL`.

Tests: tests/expr/test_dataframe_set_ops.py — intersect multiplicity vs
distinct, multi-column intersect/except, except distinct difference,
lazy return, and (parametrized over the three methods) the
different-names and non-DataFrame error paths.

@jiayuasu force-pushed the feature/df-set-ops branch from 1a3294a to d16c2b5 (June 25, 2026 05:00)
@jiayuasu requested a review from paleolimbot (June 25, 2026 06:54)
@jiayuasu marked this pull request as ready for review (June 25, 2026 06:54)