singhpk234 opened a new pull request, #12988:
URL: https://github.com/apache/iceberg/pull/12988

   ### About the change
   
Make _maxRecordsPerMicrobatch_ a soft limit. A hard limit forces us to read a 
file partially whenever the file holds more records than the remaining budget, 
but there is no clean place to cut: a record-count cutoff does not line up with 
a row-group boundary or anything else we can incorporate into our scan tasks, 
so splitting at it is very difficult to define. It is better to make this a 
soft limit: when including a file, if it fits within the remaining limit that 
is fine; otherwise include the whole file anyway and end that particular 
micro-batch there (see the sketch below).
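
   A minimal standalone sketch of that admission rule, under stated assumptions: `ScanTask` is a stand-in for Iceberg's `FileScanTask`, and `selectTasksForBatch` is a hypothetical helper, not the actual method touched by this PR.
   
   ```java
   import java.util.ArrayDeque;
   import java.util.ArrayList;
   import java.util.Deque;
   import java.util.List;
   
   public class SoftLimitSketch {
   
     // Stand-in for Iceberg's FileScanTask; only the record count matters here.
     record ScanTask(String path, long recordCount) {}
   
     static List<ScanTask> selectTasksForBatch(Deque<ScanTask> pending, long maxRecords) {
       List<ScanTask> batch = new ArrayList<>();
       long remaining = maxRecords;
       while (!pending.isEmpty()) {
         ScanTask next = pending.peek();
         if (next.recordCount() <= remaining || batch.isEmpty()) {
           // Soft limit: the first file is always admitted whole, even when it
           // alone exceeds maxRecords, so the stream can never get stuck on it.
           batch.add(pending.poll());
           remaining -= next.recordCount();
         } else {
           break; // defer this file to the next micro-batch
         }
       }
       return batch;
     }
   
     public static void main(String[] args) {
       Deque<ScanTask> pending = new ArrayDeque<>(List.of(
           new ScanTask("f1.parquet", 500),
           new ScanTask("f2.parquet", 2_000), // bigger than the limit on its own
           new ScanTask("f3.parquet", 100)));
       while (!pending.isEmpty()) {
         // With maxRecords = 1000: prints [f1], then [f2] (limit overshot), then [f3].
         System.out.println(selectTasksForBatch(pending, 1_000));
       }
     }
   }
   ```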
   
   This change is motivated by two major factors: 
   1. The stream can currently get stuck, leading to poor UX: 
https://github.com/apache/iceberg/pull/12217#discussion_r1962211721
   2. A soft limit is what other solutions enforce as well, e.g. Delta 
([doc](https://docs.databricks.com/aws/en/structured-streaming/delta-lake)): 
   
    **_maxBytesPerTrigger: How much data gets processed in each micro-batch. 
This option sets a “soft max”, meaning that a batch processes approximately 
this amount of data and may process more than the limit in order to make the 
streaming query move forward in cases when the smallest input unit is larger 
than this limit. This is not set by default._**
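    
    For illustration, a sketch of how a user would set the per-batch row cap on an Iceberg stream after this change. The option key is taken here as `streaming-max-rows-per-micro-batch` (per `SparkReadOptions`; treat the exact key and the `db.events` table name as assumptions of this sketch):
    
    ```java
    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.SparkSession;
    
    public class IcebergStreamRead {
      public static void main(String[] args) throws Exception {
        SparkSession spark = SparkSession.builder()
            .appName("iceberg-soft-limit-demo")
            .getOrCreate();
    
        // With this PR the row cap is a soft limit: a single file larger than
        // 1000 rows is still read in full instead of stalling the query.
        Dataset<Row> stream = spark.readStream()
            .format("iceberg")
            .option("streaming-max-rows-per-micro-batch", "1000")
            .load("db.events"); // hypothetical table name
    
        stream.writeStream()
            .format("console")
            .start()
            .awaitTermination();
      }
    }
    ```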
    
    
    ### Testing done 
    
    Modified the existing UT that reproduces the stuck stream so that it now passes.
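    
    A minimal JUnit 5 sketch of that assertion, written against the hypothetical `SoftLimitSketch` helper from the sketch above rather than the actual UT modified in this PR:
    
    ```java
    import static org.junit.jupiter.api.Assertions.assertEquals;
    
    import java.util.ArrayDeque;
    import java.util.Deque;
    import java.util.List;
    import org.junit.jupiter.api.Test;
    
    // Assumes SoftLimitSketch from the sketch above is in the same package.
    class SoftLimitSketchTest {
    
      @Test
      void oversizedFileStillAdvancesTheStream() {
        Deque<SoftLimitSketch.ScanTask> pending = new ArrayDeque<>(List.of(
            new SoftLimitSketch.ScanTask("big.parquet", 2_000)));
    
        // Under a hard limit of 1000 this batch would stay empty forever (the
        // stuck-stream case); under the soft limit the whole file is admitted.
        List<SoftLimitSketch.ScanTask> batch =
            SoftLimitSketch.selectTasksForBatch(pending, 1_000);
    
        assertEquals(1, batch.size());
        assertEquals("big.parquet", batch.get(0).path());
      }
    }
    ```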
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

