[SPARK-56798][SQL][DOCS] Clarify streaming CDC emission timing and netChanges scope#55776
[SPARK-56798][SQL][DOCS] Clarify streaming CDC emission timing and netChanges scope#55776gengliangwang wants to merge 2 commits into
Conversation
…tChanges scope Address two follow-up review threads on PR apache#55637 (streaming CDC netChanges): - The "held back" paragraph was worded as if the one-commit emission lag were a netChanges-specific property. It is not -- carry-over removal and update detection use append-mode `Aggregate` keyed on `_commit_timestamp` and have the same lag as the netChanges `transformWithState` timer. - Set realistic expectations for streaming netChanges: for typical CDC sources that produce at most one change per row per commit, the streaming output equals what `computeUpdates` would produce, because only one commit's changes are buffered at a time. Cross-commit merging only kicks in when several commits touch the same row before the older one's output is emitted. Direct users to a batch read for full-range collapse. Both points are now stated up-front in plain language, with a bulleted list and short bold labels for scannability.
|
cc @johanl-db |
| * past them, or the source terminates. | ||
| * computation. Two streaming-specific behaviors to be aware of: | ||
| * <ul> | ||
| * <li><b>Output is delayed by one commit.</b> When a micro-batch ingests a |
There was a problem hiding this comment.
Output is delayed until a later micro-batch advances the watermark.
"Output is delayed by one commit" is not accurate. The delaying is from streaming watermark/stateful append semantics. The contract actually allows a micro-batch has multiple distinct commits. Earlier commit in a batch won't be output because other commit in the same batch.
There was a problem hiding this comment.
Good catch -- updated in 0be721e. Replaced "delayed by one commit" / "by the next micro-batch" with "buffered until the watermark advances past the commit" / "by a later micro-batch -- whichever one advances the watermark past the commit's _commit_timestamp". This anchors the lag on watermark progression rather than commit count, so the multiple-distinct-commits-per-batch case is no longer implicitly excluded.
| * same batch. They are emitted by the next micro-batch -- the one that | ||
| * ingests the following commit. The last commit's output is emitted | ||
| * when the source terminates.</li> | ||
| * <li><b>netChanges only merges changes that are buffered together.</b> For |
There was a problem hiding this comment.
The condition in “For a typical CDC source that produces at most one change per row per commit ... the streaming output is the same as computeUpdates” is not sufficient. Producing at most one change per row per commit only rules out multiple changes for the same row within a single commit; it does not guarantee that the same row will not be touched again by a later commit before the older buffered output has been emitted. The implementation keeps the first and last event in CdcNetChangesStatefulProcessor until the event-time timer fires and clears the state, so changes from multiple commits can still be merged whenever they fall into the same buffered window. Consider rewording this to say that streaming netChanges behaves like computeUpdates only when each row identity appears in at most one commit within any buffered window.
There was a problem hiding this comment.
Adopted your framing in 0be721e -- the bullet now says "When each row identity appears in at most one commit within any buffered window, the streaming output is the same as computeUpdates." The condition is now on row-identity occurrences within the buffer window rather than on changes within a single commit, which matches what CdcNetChangesStatefulProcessor actually does. Thanks.
…ge-window wording Address @viirya's review comments on PR apache#55776: - "Output is delayed by one commit" implied a 1:1 commit-to-batch cadence. The contract allows multiple distinct commits with equal `_commit_timestamp` within a single micro-batch, so the lag is really driven by streaming watermark / append-mode semantics. Reword the bullet header and body to anchor the lag on watermark progression past the commit's timestamp. - "At most one change per row per commit" was too weak a condition for streaming-output-matches-`computeUpdates`. The accurate condition is on row-identity occurrences within the buffered window (not on changes within a single commit). Replace with viirya's framing: "each row identity appears in at most one commit within any buffered window".
viirya
left a comment
There was a problem hiding this comment.
Thanks for addressing the comments. Before merging, maybe it is better to update the PR description too. The PR description still uses "Output is delayed by one commit".
|
@johanl-db @viirya Thanks for the review. |
…tChanges scope ### What changes were proposed in this pull request? Address two follow-up review threads on PR #55637 (streaming CDC netChanges) by clarifying the streaming behavior in the `Changelog` Javadoc. The previous paragraph read as if the emission lag were a netChanges-specific property; in fact carry-over removal and update detection use append-mode `Aggregate` keyed on `_commit_timestamp` and have the same lag as the netChanges `transformWithState` timer. The paragraph also did not set expectations for what streaming netChanges actually collapses in practice. Replaced the existing single paragraph with a bulleted list: - **Output is buffered until the watermark advances past the commit.** When a micro-batch ingests a commit, that commit's output rows are buffered in state and not emitted in the same batch. They are emitted by a later micro-batch -- whichever one advances the watermark past the commit's `_commit_timestamp`. The last commit's output is emitted when the source terminates. - **netChanges only merges changes that are buffered together.** When each row identity appears in at most one commit within any buffered window, the streaming output is the same as `computeUpdates`. Cross-commit merging only happens when several commits touch the same row before the earliest one's output has been released. For full-range collapse, use a batch read. This is a sub-task of SPARK-55668. ### Why are the changes needed? Spelling out the emission timing and the practical netChanges scope prevents adopters from forming wrong expectations about what streaming netChanges does for typical CDC workloads. Anchoring the lag on watermark progression (not commit count) and the netChanges merge condition on row-identity occurrences within the buffered window (not on changes within a single commit) keeps the doc consistent with what `CdcNetChangesStatefulProcessor` actually implements -- including the cases where multiple distinct commits share a micro-batch or the same row identity is touched in multiple commits before the older one's output has been released. ### Does this PR introduce _any_ user-facing change? Documentation only. No behavior change. ### How was this patch tested? Doc-only change. `Xdoclint:html,syntax,accessibility` is clean on `Changelog.java` (errors limited to expected "cannot find symbol" without classpath). No code changed; existing CDC test suites unaffected. ### Was this patch authored or co-authored using generative AI tooling? Generated-by: Claude opus-4-7 Closes #55776 from gengliangwang/SPARK-cdc-streaming-doc-clarify. Authored-by: Gengliang Wang <gengliang@apache.org> Signed-off-by: Gengliang Wang <gengliang@apache.org> (cherry picked from commit 5d47d6e) Signed-off-by: Gengliang Wang <gengliang@apache.org>
…tChanges scope ### What changes were proposed in this pull request? Address two follow-up review threads on PR #55637 (streaming CDC netChanges) by clarifying the streaming behavior in the `Changelog` Javadoc. The previous paragraph read as if the emission lag were a netChanges-specific property; in fact carry-over removal and update detection use append-mode `Aggregate` keyed on `_commit_timestamp` and have the same lag as the netChanges `transformWithState` timer. The paragraph also did not set expectations for what streaming netChanges actually collapses in practice. Replaced the existing single paragraph with a bulleted list: - **Output is buffered until the watermark advances past the commit.** When a micro-batch ingests a commit, that commit's output rows are buffered in state and not emitted in the same batch. They are emitted by a later micro-batch -- whichever one advances the watermark past the commit's `_commit_timestamp`. The last commit's output is emitted when the source terminates. - **netChanges only merges changes that are buffered together.** When each row identity appears in at most one commit within any buffered window, the streaming output is the same as `computeUpdates`. Cross-commit merging only happens when several commits touch the same row before the earliest one's output has been released. For full-range collapse, use a batch read. This is a sub-task of SPARK-55668. ### Why are the changes needed? Spelling out the emission timing and the practical netChanges scope prevents adopters from forming wrong expectations about what streaming netChanges does for typical CDC workloads. Anchoring the lag on watermark progression (not commit count) and the netChanges merge condition on row-identity occurrences within the buffered window (not on changes within a single commit) keeps the doc consistent with what `CdcNetChangesStatefulProcessor` actually implements -- including the cases where multiple distinct commits share a micro-batch or the same row identity is touched in multiple commits before the older one's output has been released. ### Does this PR introduce _any_ user-facing change? Documentation only. No behavior change. ### How was this patch tested? Doc-only change. `Xdoclint:html,syntax,accessibility` is clean on `Changelog.java` (errors limited to expected "cannot find symbol" without classpath). No code changed; existing CDC test suites unaffected. ### Was this patch authored or co-authored using generative AI tooling? Generated-by: Claude opus-4-7 Closes #55776 from gengliangwang/SPARK-cdc-streaming-doc-clarify. Authored-by: Gengliang Wang <gengliang@apache.org> Signed-off-by: Gengliang Wang <gengliang@apache.org> (cherry picked from commit 5d47d6e) Signed-off-by: Gengliang Wang <gengliang@apache.org>
What changes were proposed in this pull request?
Address two follow-up review threads on PR #55637 (streaming CDC netChanges) by clarifying the streaming behavior in the
ChangelogJavadoc.The previous paragraph read as if the emission lag were a netChanges-specific property; in fact carry-over removal and update detection use append-mode
Aggregatekeyed on_commit_timestampand have the same lag as the netChangestransformWithStatetimer. The paragraph also did not set expectations for what streaming netChanges actually collapses in practice.Replaced the existing single paragraph with a bulleted list:
_commit_timestamp. The last commit's output is emitted when the source terminates.computeUpdates. Cross-commit merging only happens when several commits touch the same row before the earliest one's output has been released. For full-range collapse, use a batch read.This is a sub-task of SPARK-55668.
Why are the changes needed?
Spelling out the emission timing and the practical netChanges scope prevents adopters from forming wrong expectations about what streaming netChanges does for typical CDC workloads. Anchoring the lag on watermark progression (not commit count) and the netChanges merge condition on row-identity occurrences within the buffered window (not on changes within a single commit) keeps the doc consistent with what
CdcNetChangesStatefulProcessoractually implements -- including the cases where multiple distinct commits share a micro-batch or the same row identity is touched in multiple commits before the older one's output has been released.Does this PR introduce any user-facing change?
Documentation only. No behavior change.
How was this patch tested?
Doc-only change.
Xdoclint:html,syntax,accessibilityis clean onChangelog.java(errors limited to expected "cannot find symbol" without classpath). No code changed; existing CDC test suites unaffected.Was this patch authored or co-authored using generative AI tooling?
Generated-by: Claude opus-4-7