jiayuasu opened a new pull request, #2783:
URL: https://github.com/apache/sedona/pull/2783

   ## Did you read the Contributor Guide?
   
   - Yes, I have read the [Contributor Rules](https://sedona.apache.org/latest/community/rule/) and [Contributor Developer Guide](https://sedona.apache.org/latest/community/develop/)
   
   ## Is this PR related to a ticket?
   
   - Yes, and the PR name follows the format `[GH-XXX] my subject`. Closes #2781
   
   ## What changes were proposed in this PR?
   
   When Spark splits a PBF file into multiple partitions, the last partition may start inside the final data block, where no more `OSMData` block boundaries exist. `HeaderFinder.find()` correctly returns `-1` in this case, but `OsmPartitionReader` did not check for it. The unchecked value produced an invalid seek position (`file.start + (-1)`) and a bogus stream length limit, which caused `DataInputStream.readFully()` to throw `EOFException`.
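
The failure mode can be reproduced in isolation. The sketch below is not Sedona's code: `findMarker` is a hypothetical stand-in for the header search, and the byte layout is an invented toy layout rather than real PBF framing. It only illustrates how an unchecked `-1` turns into a seek position one byte before the partition start and ends in `EOFException`.

```java
import java.io.ByteArrayInputStream;
import java.io.DataInputStream;
import java.io.EOFException;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class UncheckedOffsetDemo {
  static final byte[] MARKER = "OSMData".getBytes(StandardCharsets.US_ASCII);

  // Hypothetical stand-in for the header search: index of MARKER in buf, or -1.
  static int findMarker(byte[] buf) {
    for (int i = 0; i + MARKER.length <= buf.length; i++) {
      if (Arrays.equals(Arrays.copyOfRange(buf, i, i + MARKER.length), MARKER)) {
        return i;
      }
    }
    return -1;
  }

  // Toy file: the marker followed by 64 payload bytes of 0x01 (assumed layout).
  static byte[] toyFile() {
    byte[] file = new byte[MARKER.length + 64];
    Arrays.fill(file, (byte) 1);
    System.arraycopy(MARKER, 0, file, 0, MARKER.length);
    return file;
  }

  // Unguarded path: use the search result as an offset without checking for -1.
  // `partitionStart` is expected to lie inside the payload.
  static String readUnguarded(byte[] file, int partitionStart) {
    byte[] tail = Arrays.copyOfRange(file, partitionStart, file.length);
    int offset = findMarker(tail);          // -1: no header left in this partition
    int seek = partitionStart + offset;     // one byte before the partition start
    try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(file))) {
      in.skipBytes(seek);
      int claimedSize = in.readInt();       // payload bytes misread as a block size
      in.readFully(new byte[claimedSize]);  // asks for far more bytes than remain
      return "ok";
    } catch (EOFException e) {
      return "EOFException";
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }
}
```

Calling `UncheckedOffsetDemo.readUnguarded(UncheckedOffsetDemo.toyFile(), 20)` takes the `EOFException` branch, because the four payload bytes read as a size claim about 16 MB while only a few dozen bytes remain.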
   
   The fix: when `findOffset()` returns `-1`, close the stream and return an 
empty iterator. This is safe because the previous partition already reads 
through the complete last block (blocks are always read in full, regardless of 
partition boundaries).
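
As a rough illustration of the guard described above (a sketch only, not the actual `OsmPartitionReader` change): the class name, the `open` method, and the use of strings as stand-ins for decoded blocks are all assumptions made for this example.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.util.Collections;
import java.util.Iterator;
import java.util.List;

class GuardedPartitionOpen {
  // Hypothetical helper: `headerOffset` is whatever the header search returned
  // for this partition, and `blocks` stands in for the blocks it would decode.
  static Iterator<String> open(InputStream stream, long headerOffset, List<String> blocks) {
    if (headerOffset < 0) {
      // No block starts in this partition: release the stream and yield nothing.
      try {
        stream.close();
      } catch (IOException e) {
        throw new UncheckedIOException(e);
      }
      return Collections.emptyIterator();
    }
    // Otherwise continue with normal reading from the located header.
    return blocks.iterator();
  }
}
```

Returning an empty iterator rather than throwing keeps the partition valid from Spark's point of view; it simply contributes no rows.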
   
   ## How was this patch tested?
   
   Added a test that forces small partition splits (`maxPartitionBytes=100000`) 
on the existing `monaco-latest.osm.pbf` test file. This creates a final 
partition that starts inside the last PBF block with no remaining `OSMData` 
header, which triggers the `EOFException` without the fix.
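
To see why a small split size produces such a partition, the arithmetic can be sketched as follows. The file length and header offsets are made-up numbers, not measurements of `monaco-latest.osm.pbf`, and the sketch assumes each split is exactly the configured size, which simplifies how Spark actually sizes splits.

```java
import java.util.ArrayList;
import java.util.List;

class SplitSketch {
  // Start offsets of fixed-size splits over a file of `fileLength` bytes.
  static List<Long> splitStarts(long fileLength, long splitBytes) {
    List<Long> starts = new ArrayList<>();
    for (long s = 0; s < fileLength; s += splitBytes) {
      starts.add(s);
    }
    return starts;
  }

  // True if at least one block header begins in [start, end).
  static boolean hasHeaderIn(long start, long end, long[] headerOffsets) {
    for (long h : headerOffsets) {
      if (h >= start && h < end) {
        return true;
      }
    }
    return false;
  }
}
```

With an assumed 250,000-byte file, 100,000-byte splits, and headers at offsets 0, 90,000 and 180,000, the last split covers `[200000, 250000)` and contains no header, which is the situation the test provokes.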
   
   ## Did this PR include necessary documentation updates?
   
   - No, this PR does not affect any public API, so there is no need to change the documentation.

