Skip to content

[SPARK-56798][SQL][DOCS] Clarify streaming CDC emission timing and netChanges scope#55776

Closed
gengliangwang wants to merge 2 commits into
apache:masterfrom
gengliangwang:SPARK-cdc-streaming-doc-clarify
Closed

[SPARK-56798][SQL][DOCS] Clarify streaming CDC emission timing and netChanges scope#55776
gengliangwang wants to merge 2 commits into
apache:masterfrom
gengliangwang:SPARK-cdc-streaming-doc-clarify

Conversation

@gengliangwang
Copy link
Copy Markdown
Member

@gengliangwang gengliangwang commented May 9, 2026

What changes were proposed in this pull request?

Address two follow-up review threads on PR #55637 (streaming CDC netChanges) by clarifying the streaming behavior in the Changelog Javadoc.

The previous paragraph read as if the emission lag were a netChanges-specific property; in fact carry-over removal and update detection use append-mode Aggregate keyed on _commit_timestamp and have the same lag as the netChanges transformWithState timer. The paragraph also did not set expectations for what streaming netChanges actually collapses in practice.

Replaced the existing single paragraph with a bulleted list:

  • Output is buffered until the watermark advances past the commit. When a micro-batch ingests a commit, that commit's output rows are buffered in state and not emitted in the same batch. They are emitted by a later micro-batch -- whichever one advances the watermark past the commit's _commit_timestamp. The last commit's output is emitted when the source terminates.
  • netChanges only merges changes that are buffered together. When each row identity appears in at most one commit within any buffered window, the streaming output is the same as computeUpdates. Cross-commit merging only happens when several commits touch the same row before the earliest one's output has been released. For full-range collapse, use a batch read.

This is a sub-task of SPARK-55668.

Why are the changes needed?

Spelling out the emission timing and the practical netChanges scope prevents adopters from forming wrong expectations about what streaming netChanges does for typical CDC workloads. Anchoring the lag on watermark progression (not commit count) and the netChanges merge condition on row-identity occurrences within the buffered window (not on changes within a single commit) keeps the doc consistent with what CdcNetChangesStatefulProcessor actually implements -- including the cases where multiple distinct commits share a micro-batch or the same row identity is touched in multiple commits before the older one's output has been released.

Does this PR introduce any user-facing change?

Documentation only. No behavior change.

How was this patch tested?

Doc-only change. Xdoclint:html,syntax,accessibility is clean on Changelog.java (errors limited to expected "cannot find symbol" without classpath). No code changed; existing CDC test suites unaffected.

Was this patch authored or co-authored using generative AI tooling?

Generated-by: Claude opus-4-7

…tChanges scope

Address two follow-up review threads on PR apache#55637 (streaming CDC netChanges):

- The "held back" paragraph was worded as if the one-commit emission lag
  were a netChanges-specific property. It is not -- carry-over removal and
  update detection use append-mode `Aggregate` keyed on `_commit_timestamp`
  and have the same lag as the netChanges `transformWithState` timer.
- Set realistic expectations for streaming netChanges: for typical CDC
  sources that produce at most one change per row per commit, the
  streaming output equals what `computeUpdates` would produce, because
  only one commit's changes are buffered at a time. Cross-commit merging
  only kicks in when several commits touch the same row before the older
  one's output is emitted. Direct users to a batch read for full-range
  collapse.

Both points are now stated up-front in plain language, with a bulleted
list and short bold labels for scannability.
@gengliangwang
Copy link
Copy Markdown
Member Author

cc @johanl-db

* past them, or the source terminates.
* computation. Two streaming-specific behaviors to be aware of:
* <ul>
* <li><b>Output is delayed by one commit.</b> When a micro-batch ingests a
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Output is delayed until a later micro-batch advances the watermark.

"Output is delayed by one commit" is not accurate. The delaying is from streaming watermark/stateful append semantics. The contract actually allows a micro-batch has multiple distinct commits. Earlier commit in a batch won't be output because other commit in the same batch.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch -- updated in 0be721e. Replaced "delayed by one commit" / "by the next micro-batch" with "buffered until the watermark advances past the commit" / "by a later micro-batch -- whichever one advances the watermark past the commit's _commit_timestamp". This anchors the lag on watermark progression rather than commit count, so the multiple-distinct-commits-per-batch case is no longer implicitly excluded.

* same batch. They are emitted by the next micro-batch -- the one that
* ingests the following commit. The last commit's output is emitted
* when the source terminates.</li>
* <li><b>netChanges only merges changes that are buffered together.</b> For
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The condition in “For a typical CDC source that produces at most one change per row per commit ... the streaming output is the same as computeUpdates” is not sufficient. Producing at most one change per row per commit only rules out multiple changes for the same row within a single commit; it does not guarantee that the same row will not be touched again by a later commit before the older buffered output has been emitted. The implementation keeps the first and last event in CdcNetChangesStatefulProcessor until the event-time timer fires and clears the state, so changes from multiple commits can still be merged whenever they fall into the same buffered window. Consider rewording this to say that streaming netChanges behaves like computeUpdates only when each row identity appears in at most one commit within any buffered window.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Adopted your framing in 0be721e -- the bullet now says "When each row identity appears in at most one commit within any buffered window, the streaming output is the same as computeUpdates." The condition is now on row-identity occurrences within the buffer window rather than on changes within a single commit, which matches what CdcNetChangesStatefulProcessor actually does. Thanks.

…ge-window wording

Address @viirya's review comments on PR apache#55776:

- "Output is delayed by one commit" implied a 1:1 commit-to-batch
  cadence. The contract allows multiple distinct commits with equal
  `_commit_timestamp` within a single micro-batch, so the lag is really
  driven by streaming watermark / append-mode semantics. Reword the
  bullet header and body to anchor the lag on watermark progression
  past the commit's timestamp.
- "At most one change per row per commit" was too weak a condition for
  streaming-output-matches-`computeUpdates`. The accurate condition is
  on row-identity occurrences within the buffered window (not on
  changes within a single commit). Replace with viirya's framing:
  "each row identity appears in at most one commit within any buffered
  window".
Copy link
Copy Markdown
Member

@viirya viirya left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for addressing the comments. Before merging, maybe it is better to update the PR description too. The PR description still uses "Output is delayed by one commit".

@gengliangwang
Copy link
Copy Markdown
Member Author

@johanl-db @viirya Thanks for the review.
I am merging this doc only change to master/4.x/4.2. CC 4.2 release manager @huaxingao

gengliangwang added a commit that referenced this pull request May 11, 2026
…tChanges scope

### What changes were proposed in this pull request?

Address two follow-up review threads on PR #55637 (streaming CDC netChanges) by clarifying the streaming behavior in the `Changelog` Javadoc.

The previous paragraph read as if the emission lag were a netChanges-specific property; in fact carry-over removal and update detection use append-mode `Aggregate` keyed on `_commit_timestamp` and have the same lag as the netChanges `transformWithState` timer. The paragraph also did not set expectations for what streaming netChanges actually collapses in practice.

Replaced the existing single paragraph with a bulleted list:

- **Output is buffered until the watermark advances past the commit.** When a micro-batch ingests a commit, that commit's output rows are buffered in state and not emitted in the same batch. They are emitted by a later micro-batch -- whichever one advances the watermark past the commit's `_commit_timestamp`. The last commit's output is emitted when the source terminates.
- **netChanges only merges changes that are buffered together.** When each row identity appears in at most one commit within any buffered window, the streaming output is the same as `computeUpdates`. Cross-commit merging only happens when several commits touch the same row before the earliest one's output has been released. For full-range collapse, use a batch read.

This is a sub-task of SPARK-55668.

### Why are the changes needed?

Spelling out the emission timing and the practical netChanges scope prevents adopters from forming wrong expectations about what streaming netChanges does for typical CDC workloads. Anchoring the lag on watermark progression (not commit count) and the netChanges merge condition on row-identity occurrences within the buffered window (not on changes within a single commit) keeps the doc consistent with what `CdcNetChangesStatefulProcessor` actually implements -- including the cases where multiple distinct commits share a micro-batch or the same row identity is touched in multiple commits before the older one's output has been released.

### Does this PR introduce _any_ user-facing change?

Documentation only. No behavior change.

### How was this patch tested?

Doc-only change. `Xdoclint:html,syntax,accessibility` is clean on `Changelog.java` (errors limited to expected "cannot find symbol" without classpath). No code changed; existing CDC test suites unaffected.

### Was this patch authored or co-authored using generative AI tooling?

Generated-by: Claude opus-4-7

Closes #55776 from gengliangwang/SPARK-cdc-streaming-doc-clarify.

Authored-by: Gengliang Wang <gengliang@apache.org>
Signed-off-by: Gengliang Wang <gengliang@apache.org>
(cherry picked from commit 5d47d6e)
Signed-off-by: Gengliang Wang <gengliang@apache.org>
gengliangwang added a commit that referenced this pull request May 11, 2026
…tChanges scope

### What changes were proposed in this pull request?

Address two follow-up review threads on PR #55637 (streaming CDC netChanges) by clarifying the streaming behavior in the `Changelog` Javadoc.

The previous paragraph read as if the emission lag were a netChanges-specific property; in fact carry-over removal and update detection use append-mode `Aggregate` keyed on `_commit_timestamp` and have the same lag as the netChanges `transformWithState` timer. The paragraph also did not set expectations for what streaming netChanges actually collapses in practice.

Replaced the existing single paragraph with a bulleted list:

- **Output is buffered until the watermark advances past the commit.** When a micro-batch ingests a commit, that commit's output rows are buffered in state and not emitted in the same batch. They are emitted by a later micro-batch -- whichever one advances the watermark past the commit's `_commit_timestamp`. The last commit's output is emitted when the source terminates.
- **netChanges only merges changes that are buffered together.** When each row identity appears in at most one commit within any buffered window, the streaming output is the same as `computeUpdates`. Cross-commit merging only happens when several commits touch the same row before the earliest one's output has been released. For full-range collapse, use a batch read.

This is a sub-task of SPARK-55668.

### Why are the changes needed?

Spelling out the emission timing and the practical netChanges scope prevents adopters from forming wrong expectations about what streaming netChanges does for typical CDC workloads. Anchoring the lag on watermark progression (not commit count) and the netChanges merge condition on row-identity occurrences within the buffered window (not on changes within a single commit) keeps the doc consistent with what `CdcNetChangesStatefulProcessor` actually implements -- including the cases where multiple distinct commits share a micro-batch or the same row identity is touched in multiple commits before the older one's output has been released.

### Does this PR introduce _any_ user-facing change?

Documentation only. No behavior change.

### How was this patch tested?

Doc-only change. `Xdoclint:html,syntax,accessibility` is clean on `Changelog.java` (errors limited to expected "cannot find symbol" without classpath). No code changed; existing CDC test suites unaffected.

### Was this patch authored or co-authored using generative AI tooling?

Generated-by: Claude opus-4-7

Closes #55776 from gengliangwang/SPARK-cdc-streaming-doc-clarify.

Authored-by: Gengliang Wang <gengliang@apache.org>
Signed-off-by: Gengliang Wang <gengliang@apache.org>
(cherry picked from commit 5d47d6e)
Signed-off-by: Gengliang Wang <gengliang@apache.org>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants