Re: [PR] Spark: Structured Streaming read limit support follow-up [iceberg]

via GitHub Sat, 17 May 2025 12:01:31 -0700


wypoon commented on PR #12260:
URL: https://github.com/apache/iceberg/pull/12260#issuecomment-2888536801


   Hi @huaxingao, thank you for reviewing this!
   You are correct that the tests would pass with the original implementation, 
except for `testReadStreamWithCompositeReadLimit`, which would fail due to the 
bug in `SparkMicroBatchStream::getDefaultReadLimit()`.
   As I understand it, in `SparkMicroBatchStream`, we implement 
`getDefaultReadLimit()` (using what is set by configuration options), and Spark 
calls `latestOffset(Offset, ReadLimit)` with that `ReadLimit`. In principle, 
Spark can call `latestOffset(Offset, ReadLimit)` with any `ReadLimit` and 
`SparkMicroBatchStream` should respond according to that `ReadLimit`, but if 
Spark calls `latestOffset(Offset, ReadLimit)` with the `ReadLimit` given by 
`getDefaultReadLimit()`, then there is no difference in behavior between the 
original implementation (where we ignore the `ReadLimit` passed in and just use 
the `ReadLimit` corresponding to the configuration options) and the one in this 
PR. However, technically, we should not assume that and instead use the 
`ReadLimit` passed in.
   Do you know if and how Spark would pass in a different `ReadLimit` than what 
is returned by `getDefaultReadLimit()`?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org

Re: [PR] Spark: Structured Streaming read limit support follow-up [iceberg]

Reply via email to