(spark) branch branch-4.2 updated: [SPARK-56798][SQL][DOCS] Clarify streaming CDC emission timing and netChanges scope

gengliang Mon, 11 May 2026 11:37:27 -0700

This is an automated email from the ASF dual-hosted git repository.

gengliangwang pushed a commit to branch branch-4.2
in repository https://gitbox.apache.org/repos/asf/spark.git



The following commit(s) were added to refs/heads/branch-4.2 by this push:
     new 23d271e91e4f [SPARK-56798][SQL][DOCS] Clarify streaming CDC emission 
timing and netChanges scope
23d271e91e4f is described below

commit 23d271e91e4f421d7cccc6e3ca42af88fefd34b6
Author: Gengliang Wang <[email protected]>
AuthorDate: Mon May 11 11:36:50 2026 -0700

    [SPARK-56798][SQL][DOCS] Clarify streaming CDC emission timing and 
netChanges scope
    
    ### What changes were proposed in this pull request?
    
    Address two follow-up review threads on PR #55637 (streaming CDC 
netChanges) by clarifying the streaming behavior in the `Changelog` Javadoc.
    
    The previous paragraph read as if the emission lag were a 
netChanges-specific property; in fact carry-over removal and update detection 
use append-mode `Aggregate` keyed on `_commit_timestamp` and have the same lag 
as the netChanges `transformWithState` timer. The paragraph also did not set 
expectations for what streaming netChanges actually collapses in practice.
    
    Replaced the existing single paragraph with a bulleted list:
    
    - **Output is buffered until the watermark advances past the commit.** When 
a micro-batch ingests a commit, that commit's output rows are buffered in state 
and not emitted in the same batch. They are emitted by a later micro-batch -- 
whichever one advances the watermark past the commit's `_commit_timestamp`. The 
last commit's output is emitted when the source terminates.
    - **netChanges only merges changes that are buffered together.** When each 
row identity appears in at most one commit within any buffered window, the 
streaming output is the same as `computeUpdates`. Cross-commit merging only 
happens when several commits touch the same row before the earliest one's 
output has been released. For full-range collapse, use a batch read.
    
    This is a sub-task of SPARK-55668.
    
    ### Why are the changes needed?
    
    Spelling out the emission timing and the practical netChanges scope 
prevents adopters from forming wrong expectations about what streaming 
netChanges does for typical CDC workloads. Anchoring the lag on watermark 
progression (not commit count) and the netChanges merge condition on 
row-identity occurrences within the buffered window (not on changes within a 
single commit) keeps the doc consistent with what 
`CdcNetChangesStatefulProcessor` actually implements -- including the cases 
wher [...]
    
    ### Does this PR introduce _any_ user-facing change?
    
    Documentation only. No behavior change.
    
    ### How was this patch tested?
    
    Doc-only change. `Xdoclint:html,syntax,accessibility` is clean on 
`Changelog.java` (errors limited to expected "cannot find symbol" without 
classpath). No code changed; existing CDC test suites unaffected.
    
    ### Was this patch authored or co-authored using generative AI tooling?
    
    Generated-by: Claude opus-4-7
    
    Closes #55776 from gengliangwang/SPARK-cdc-streaming-doc-clarify.
    
    Authored-by: Gengliang Wang <[email protected]>
    Signed-off-by: Gengliang Wang <[email protected]>
    (cherry picked from commit 5d47d6e4af0ac267b68bdcbb5df83ac5b393bb73)
    Signed-off-by: Gengliang Wang <[email protected]>
---
 .../apache/spark/sql/connector/catalog/Changelog.java | 19 +++++++++++++++----
 1 file changed, 15 insertions(+), 4 deletions(-)

diff --git 
a/sql/catalyst/src/main/java/org/apache/spark/sql/connector/catalog/Changelog.java
 
b/sql/catalyst/src/main/java/org/apache/spark/sql/connector/catalog/Changelog.java
index 2ef0846ab800..2c1dc896c1ba 100644
--- 
a/sql/catalyst/src/main/java/org/apache/spark/sql/connector/catalog/Changelog.java
+++ 
b/sql/catalyst/src/main/java/org/apache/spark/sql/connector/catalog/Changelog.java
@@ -71,10 +71,21 @@ import org.apache.spark.sql.util.CaseInsensitiveStringMap;
  * </ul>
  * <p>
  * Streaming reads support carry-over removal, update detection, and net change
- * computation. Net change collapses are kept in the state store keyed by row 
identity;
- * row identities only touched in the latest observed commit are held back 
until either a
- * later commit (with strictly greater `_commit_timestamp`) advances the 
global watermark
- * past them, or the source terminates.
+ * computation. Two streaming-specific behaviors to be aware of:
+ * <ul>
+ *   <li><b>Output is buffered until the watermark advances past the 
commit.</b>
+ *       When a micro-batch ingests a commit, that commit's output rows are
+ *       buffered in state and not emitted in the same batch. They are emitted
+ *       by a later micro-batch -- whichever one advances the watermark past
+ *       the commit's {@code _commit_timestamp}. The last commit's output is
+ *       emitted when the source terminates.</li>
+ *   <li><b>netChanges only merges changes that are buffered together.</b>
+ *       When each row identity appears in at most one commit within any
+ *       buffered window, the streaming output is the same as
+ *       {@code computeUpdates}. Cross-commit merging only happens when
+ *       several commits touch the same row before the earliest one's output
+ *       has been released. For full-range collapse, use a batch read.</li>
+ * </ul>
  * <p>
  * <b>Pushdown contract.</b> When any post-processing pass applies (carry-over
  * removal, update detection, or netChanges), Spark only pushes predicates


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

(spark) branch branch-4.2 updated: [SPARK-56798][SQL][DOCS] Clarify streaming CDC emission timing and netChanges scope

Reply via email to