[SPARK-56798][SQL][DOCS] Clarify streaming CDC emission timing and netChanges scope #55776
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -71,10 +71,21 @@ | |
| * </ul> | ||
| * <p> | ||
| * Streaming reads support carry-over removal, update detection, and net change | ||
| * computation. Net change collapses are kept in the state store keyed by row identity; | ||
| * row identities only touched in the latest observed commit are held back until either a | ||
| * later commit (with strictly greater `_commit_timestamp`) advances the global watermark | ||
| * past them, or the source terminates. | ||
| * computation. Two streaming-specific behaviors to be aware of: | ||
| * <ul> | ||
| * <li><b>Output is delayed by one commit.</b> When a micro-batch ingests a | ||
| * commit, that commit's output rows are buffered and not emitted in the | ||
| * same batch. They are emitted by the next micro-batch -- the one that | ||
| * ingests the following commit. The last commit's output is emitted | ||
| * when the source terminates.</li> | ||
| * <li><b>netChanges only merges changes that are buffered together.</b> For | ||
|
Member
The condition in “For a typical CDC source that produces at most one change per row per commit ... the streaming output is the same as computeUpdates” is not sufficient. Producing at most one change per row per commit only rules out multiple changes for the same row within a single commit; it does not guarantee that the same row will not be touched again by a later commit before the older buffered output has been emitted. The implementation keeps the first and last event in CdcNetChangesStatefulProcessor until the event-time timer fires and clears the state, so changes from multiple commits can still be merged whenever they fall into the same buffered window. Consider rewording this to say that streaming netChanges behaves like computeUpdates only when each row identity appears in at most one commit within any buffered window.
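To illustrate the point above, here is a minimal toy model in Java. It is a sketch under stated assumptions, not the code of CdcNetChangesStatefulProcessor: the class `NetChangeWindowSketch`, its methods `onChange` and `fireTimers`, and the rule that a row's timer follows its latest commit timestamp are all hypothetical. It only shows that "one change per row per commit" still allows two commits to merge when no timer fires between them.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Toy model of first/last buffering per row identity; NOT Spark's processor.
class NetChangeWindowSketch {
    private static final class Buffered {
        final String firstOp;
        String lastOp;
        long lastCommitTs;
        Buffered(String op, long ts) { firstOp = op; lastOp = op; lastCommitTs = ts; }
    }

    // Sorted map so the emitted order is stable.
    private final Map<String, Buffered> state = new TreeMap<>();

    /** Records one change; each call models a commit touching the row exactly once. */
    void onChange(String rowId, long commitTs, String op) {
        Buffered b = state.get(rowId);
        if (b == null) {
            state.put(rowId, new Buffered(op, commitTs));
        } else {
            b.lastOp = op;             // keep the first op, overwrite the last
            b.lastCommitTs = commitTs; // assumed: the timer moves with the latest commit
        }
    }

    /** Emits and clears rows whose latest commit is strictly behind the watermark. */
    List<String> fireTimers(long watermark) {
        List<String> out = new ArrayList<>();
        Iterator<Map.Entry<String, Buffered>> it = state.entrySet().iterator();
        while (it.hasNext()) {
            Map.Entry<String, Buffered> e = it.next();
            Buffered b = e.getValue();
            if (b.lastCommitTs < watermark) {
                out.add(e.getKey() + ":" + b.firstOp + "->" + b.lastOp);
                it.remove(); // state is cleared once the row is emitted
            }
        }
        return out;
    }

    public static void main(String[] args) {
        NetChangeWindowSketch merged = new NetChangeWindowSketch();
        merged.onChange("row1", 1L, "insert");       // commit 1
        merged.onChange("row1", 2L, "update");       // commit 2, no timer fired in between
        System.out.println(merged.fireTimers(3L));   // [row1:insert->update]

        NetChangeWindowSketch separate = new NetChangeWindowSketch();
        separate.onChange("row1", 1L, "insert");
        System.out.println(separate.fireTimers(2L)); // [row1:insert->insert]
        separate.onChange("row1", 3L, "update");
        System.out.println(separate.fireTimers(4L)); // [row1:update->update]
    }
}
```

In the first scenario each commit contributes exactly one change for `row1`, yet the output is a single merged record. Only the second scenario, where the timer fires between the two commits, yields per-commit output.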
||
| * a typical CDC source that produces at most one change per row per | ||
| * commit, only one commit's changes are buffered at a time per row, so | ||
| * the streaming output is the same as {@code computeUpdates}. Multiple | ||
| * commits' changes are merged only when those commits touch the same | ||
| * row before the older one's output has been emitted. For full-range | ||
| * collapse, use a batch read.</li> | ||
| * </ul> | ||
| * <p> | ||
| * <b>Pushdown contract.</b> When any post-processing pass applies (carry-over | ||
| * removal, update detection, or netChanges), Spark only pushes predicates | ||
|
|
||
Output is delayed until a later micro-batch advances the watermark.
"Output is delayed by one commit" is not accurate. The delay comes from streaming watermark and stateful append semantics. The contract actually allows a micro-batch to contain multiple distinct commits, and an earlier commit in a batch is not emitted just because a later commit is in the same batch.
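A small toy model may help show why "one commit" is the wrong unit. This is a sketch, not Spark's implementation: the class `WatermarkDelaySketch` and its methods are hypothetical, and it assumes that the watermark visible to a micro-batch is the highest commit timestamp seen in earlier batches and that a commit is emitted only once it is strictly behind that watermark.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeSet;

// Toy model of watermark-gated emission; NOT Spark's implementation.
class WatermarkDelaySketch {
    private final TreeSet<Long> buffered = new TreeSet<>(); // commit timestamps awaiting emission
    private long watermark = Long.MIN_VALUE;                // watermark visible to the current batch
    private long maxSeen = Long.MIN_VALUE;                  // highest commit timestamp ingested so far

    /** Ingests the commits of one micro-batch and returns the commits it emits. */
    List<Long> processBatch(long... commitTimestamps) {
        for (long ts : commitTimestamps) {
            buffered.add(ts);
            maxSeen = Math.max(maxSeen, ts);
        }
        // Emit only commits strictly behind the watermark computed from EARLIER batches.
        List<Long> emitted = new ArrayList<>(buffered.headSet(watermark));
        buffered.removeAll(emitted);
        // Assumption: the advanced watermark becomes visible from the next batch on.
        watermark = maxSeen;
        return emitted;
    }

    /** Source termination: flush everything still buffered. */
    List<Long> terminate() {
        List<Long> emitted = new ArrayList<>(buffered);
        buffered.clear();
        return emitted;
    }

    public static void main(String[] args) {
        WatermarkDelaySketch s = new WatermarkDelaySketch();
        System.out.println(s.processBatch(1L, 2L)); // [] : commit 1 is held even though commit 2 shares the batch
        System.out.println(s.processBatch(3L));     // [1]
        System.out.println(s.processBatch());       // [2] : an empty batch still sees the advanced watermark
        System.out.println(s.terminate());          // [3]
    }
}
```

Under these assumptions, commits 1 and 2 arrive in the same micro-batch but leave in different ones, so the delay is tied to watermark advancement across batches rather than to a fixed one-commit lag.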