This is an automated email from the ASF dual-hosted git repository.
gengliangwang pushed a commit to branch branch-4.2
in repository https://gitbox.apache.org/repos/asf/spark.git
The following commit(s) were added to refs/heads/branch-4.2 by this push:
new 23d271e91e4f [SPARK-56798][SQL][DOCS] Clarify streaming CDC emission
timing and netChanges scope
23d271e91e4f is described below
commit 23d271e91e4f421d7cccc6e3ca42af88fefd34b6
Author: Gengliang Wang <[email protected]>
AuthorDate: Mon May 11 11:36:50 2026 -0700
[SPARK-56798][SQL][DOCS] Clarify streaming CDC emission timing and
netChanges scope
### What changes were proposed in this pull request?
Address two follow-up review threads on PR #55637 (streaming CDC
netChanges) by clarifying the streaming behavior in the `Changelog` Javadoc.
The previous paragraph read as if the emission lag were a
netChanges-specific property; in fact carry-over removal and update detection
use append-mode `Aggregate` keyed on `_commit_timestamp` and have the same lag
as the netChanges `transformWithState` timer. The paragraph also did not set
expectations for what streaming netChanges actually collapses in practice.
Replaced the existing single paragraph with a bulleted list:
- **Output is buffered until the watermark advances past the commit.** When
a micro-batch ingests a commit, that commit's output rows are buffered in state
and not emitted in the same batch. They are emitted by a later micro-batch --
whichever one advances the watermark past the commit's `_commit_timestamp`. The
last commit's output is emitted when the source terminates.
- **netChanges only merges changes that are buffered together.** When each
row identity appears in at most one commit within any buffered window, the
streaming output is the same as `computeUpdates`. Cross-commit merging only
happens when several commits touch the same row before the earliest one's
output has been released. For full-range collapse, use a batch read.
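
The emission timing in the first bullet can be illustrated with a small toy model. This is only a sketch under stated assumptions, not Spark's implementation: it does not use any Spark API, it assumes each micro-batch ingests exactly one commit, and it assumes the watermark is simply the largest `_commit_timestamp` seen so far. The class and method names (`WatermarkBufferSketch`, `simulate`) are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.SortedMap;
import java.util.TreeMap;

// Toy model of "output is buffered until the watermark advances past the commit".
// Assumptions (not taken from Spark): one commit per micro-batch, and the
// watermark equals the maximum commit timestamp observed so far.
final class WatermarkBufferSketch {

    static List<String> simulate(long[] commitTimestamps) {
        List<String> log = new ArrayList<>();
        // Buffered commit output, ordered by commit timestamp.
        TreeMap<Long, String> buffered = new TreeMap<>();
        long watermark = Long.MIN_VALUE;

        for (int batch = 0; batch < commitTimestamps.length; batch++) {
            long ts = commitTimestamps[batch];
            // Ingest the commit: its output goes into state, not to the sink.
            buffered.put(ts, "commit@" + ts);
            watermark = Math.max(watermark, ts);

            // Release only commits whose timestamp is strictly below the watermark.
            // headMap is exclusive of its bound, so the newest commit stays buffered.
            SortedMap<Long, String> ready = buffered.headMap(watermark);
            for (String commit : ready.values()) {
                log.add("batch " + batch + " emits " + commit);
            }
            ready.clear(); // the view is backed by the map, so this drops released state
        }

        // Source termination flushes whatever is still buffered.
        for (String commit : buffered.values()) {
            log.add("termination emits " + commit);
        }
        return log;
    }

    public static void main(String[] args) {
        for (String line : simulate(new long[] {10L, 20L, 30L})) {
            System.out.println(line);
        }
    }
}
```

With commit timestamps 10, 20, and 30, the model emits commit 10 in batch 1, commit 20 in batch 2, and commit 30 only at termination, which matches the one-step lag and the final flush described above.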
This is a sub-task of SPARK-55668.
### Why are the changes needed?
Spelling out the emission timing and the practical netChanges scope
prevents adopters from forming wrong expectations about what streaming
netChanges does for typical CDC workloads. Anchoring the lag on watermark
progression (not commit count) and the netChanges merge condition on
row-identity occurrences within the buffered window (not on changes within a
single commit) keeps the doc consistent with what
`CdcNetChangesStatefulProcessor` actually implements -- including the cases
wher [...]
### Does this PR introduce _any_ user-facing change?
Documentation only. No behavior change.
### How was this patch tested?
Doc-only change. `-Xdoclint:html,syntax,accessibility` is clean on
`Changelog.java` (errors limited to expected "cannot find symbol" without
classpath). No code changed; existing CDC test suites unaffected.
### Was this patch authored or co-authored using generative AI tooling?
Generated-by: Claude opus-4-7
Closes #55776 from gengliangwang/SPARK-cdc-streaming-doc-clarify.
Authored-by: Gengliang Wang <[email protected]>
Signed-off-by: Gengliang Wang <[email protected]>
(cherry picked from commit 5d47d6e4af0ac267b68bdcbb5df83ac5b393bb73)
Signed-off-by: Gengliang Wang <[email protected]>
---
.../apache/spark/sql/connector/catalog/Changelog.java | 19 +++++++++++++++----
1 file changed, 15 insertions(+), 4 deletions(-)
diff --git a/sql/catalyst/src/main/java/org/apache/spark/sql/connector/catalog/Changelog.java b/sql/catalyst/src/main/java/org/apache/spark/sql/connector/catalog/Changelog.java
index 2ef0846ab800..2c1dc896c1ba 100644
--- a/sql/catalyst/src/main/java/org/apache/spark/sql/connector/catalog/Changelog.java
+++ b/sql/catalyst/src/main/java/org/apache/spark/sql/connector/catalog/Changelog.java
@@ -71,10 +71,21 @@ import org.apache.spark.sql.util.CaseInsensitiveStringMap;
* </ul>
* <p>
* Streaming reads support carry-over removal, update detection, and net change
- * computation. Net change collapses are kept in the state store keyed by row identity;
- * row identities only touched in the latest observed commit are held back until either a
- * later commit (with strictly greater `_commit_timestamp`) advances the global watermark
- * past them, or the source terminates.
+ * computation. Two streaming-specific behaviors to be aware of:
+ * <ul>
+ * <li><b>Output is buffered until the watermark advances past the commit.</b>
+ * When a micro-batch ingests a commit, that commit's output rows are
+ * buffered in state and not emitted in the same batch. They are emitted
+ * by a later micro-batch -- whichever one advances the watermark past
+ * the commit's {@code _commit_timestamp}. The last commit's output is
+ * emitted when the source terminates.</li>
+ * <li><b>netChanges only merges changes that are buffered together.</b>
+ * When each row identity appears in at most one commit within any
+ * buffered window, the streaming output is the same as
+ * {@code computeUpdates}. Cross-commit merging only happens when
+ * several commits touch the same row before the earliest one's output
+ * has been released. For full-range collapse, use a batch read.</li>
+ * </ul>
* <p>
* <b>Pushdown contract.</b> When any post-processing pass applies (carry-over
* removal, update detection, or netChanges), Spark only pushes predicates