singhpk234 opened a new pull request, #12988: URL: https://github.com/apache/iceberg/pull/12988
### About the change

Make `maxRecordsPerMicrobatch` a soft limit. A hard limit cannot always be honored: when, for example, a single file contains more records than `maxRecordsPerMicrobatch`, honoring the limit would require reading the file partially. That cutoff does not fall on a row-group boundary or any split point we can encode in our scan tasks (as if splitting at the record-count cutoff), so the boundary is very difficult to define. It is better to make this a soft limit: when including a file, if it fits within the limit, fine; otherwise include the whole file and end that particular microbatch. A short illustrative sketch of this admission logic is included at the end of this description.

This change is motivated by two major factors:

1. The stream can currently get stuck, leading to poor UX: https://github.com/apache/iceberg/pull/12217#discussion_r1962211721
2. A soft limit is what other solutions enforce as well. For example, Delta ([doc](https://docs.databricks.com/aws/en/structured-streaming/delta-lake)):

> **_maxBytesPerTrigger: How much data gets processed in each micro-batch. This option sets a “soft max”, meaning that a batch processes approximately this amount of data and may process more than the limit in order to make the streaming query move forward in cases when the smallest input unit is larger than this limit. This is not set by default._**

### Testing done

Modified the existing UT that mimics the stuck behavior so that it now passes.
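For illustration, here is a minimal sketch of the soft-limit admission logic described above. This is not the actual implementation; `StreamFile`, `planMicrobatch`, and the field names are simplified, hypothetical stand-ins for Iceberg's scan-task planning:

```java
import java.util.ArrayList;
import java.util.List;

public class SoftLimitSketch {

  // Hypothetical stand-in for a data file with a known record count.
  record StreamFile(String path, long recordCount) {}

  // Admits whole files into a microbatch under a soft record limit:
  // a file is never split, and the file that crosses the limit is still
  // included before the batch is closed, so the stream always progresses.
  static List<StreamFile> planMicrobatch(List<StreamFile> pending, long maxRecordsPerMicrobatch) {
    List<StreamFile> batch = new ArrayList<>();
    long records = 0;
    for (StreamFile file : pending) {
      batch.add(file); // include the whole file; partial reads are not planned
      records += file.recordCount();
      if (records > maxRecordsPerMicrobatch) {
        break; // soft limit exceeded: keep this file but end the microbatch here
      }
    }
    return batch;
  }

  public static void main(String[] args) {
    List<StreamFile> pending =
        List.of(new StreamFile("a.parquet", 120), new StreamFile("b.parquet", 40));
    // With maxRecordsPerMicrobatch = 100, a hard limit would admit nothing and
    // the stream would stall; the soft limit admits a.parquet and moves forward.
    System.out.println(planMicrobatch(pending, 100));
  }
}
```

With a hard limit, a file larger than the limit could never be admitted and the query would stop making progress, which is exactly the stuck-stream case linked above; the soft limit trades a bounded overshoot for guaranteed forward progress.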