jiayuasu opened a new pull request, #2783: URL: https://github.com/apache/sedona/pull/2783
## Did you read the Contributor Guide?

- Yes, I have read the [Contributor Rules](https://sedona.apache.org/latest/community/rule/) and [Contributor Developer Guide](https://sedona.apache.org/latest/community/develop/)

## Is this PR related to a ticket?

- Yes, and the PR name follows the format `[GH-XXX] my subject`. Closes #2781

## What changes were proposed in this PR?

When Spark splits a PBF file into multiple partitions, the last partition may start inside the final data block, where no more `OSMData` block boundaries exist. `HeaderFinder.find()` correctly returns `-1` in this case, but `OsmPartitionReader` did not check for it. The result was an invalid seek position (`file.start + (-1)`) and a bogus stream length limit, which caused `DataInputStream.readFully()` to throw `EOFException`.

The fix: when `findOffset()` returns `-1`, close the stream and return an empty iterator. This is safe because the previous partition already reads through the complete last block (blocks are always read in full, regardless of partition boundaries).

## How was this patch tested?

Added a test that forces small partition splits (`maxPartitionBytes=100000`) on the existing `monaco-latest.osm.pbf` test file. This creates a final partition that starts inside the last PBF block with no remaining `OSMData` header, which triggers the `EOFException` without the fix.

## Did this PR include necessary documentation updates?

- No, this PR does not affect any public API, so there is no need to change the documentation.
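
For readers who want to see the guard from "What changes were proposed in this PR?" in isolation, here is a minimal, self-contained Java sketch. It is not Sedona's code. The class `PartitionGuardSketch`, both method signatures, and the toy file layout (a plain `OSMData` marker followed by payload bytes, rather than real length-prefixed PBF blob headers) are assumptions made for illustration only. The one thing it mirrors from the PR is the control flow: when the header search returns `-1`, the reader yields an empty iterator instead of seeking.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Iterator;
import java.util.List;

// Hypothetical stand-in for the partition reader logic; not the Sedona classes.
final class PartitionGuardSketch {
    // Toy block marker. Real PBF blocks use a length-prefixed BlobHeader instead.
    static final byte[] MARKER = "OSMData".getBytes(StandardCharsets.US_ASCII);

    /** Offset of the next marker relative to {@code from}, or -1 if none remains. */
    static int findOffset(byte[] file, int from) {
        for (int i = from; i + MARKER.length <= file.length; i++) {
            boolean match = true;
            for (int j = 0; j < MARKER.length; j++) {
                if (file[i + j] != MARKER[j]) {
                    match = false;
                    break;
                }
            }
            if (match) {
                return i - from;
            }
        }
        return -1;
    }

    /** Start positions of the blocks whose marker begins inside [start, start + length). */
    static Iterator<Integer> readPartition(byte[] file, int start, int length) {
        int offset = findOffset(file, start);
        if (offset == -1) {
            // The guard: no header at or after `start`, so the previous partition
            // already covered the final block. Do not seek to start + (-1).
            return Collections.emptyIterator();
        }
        List<Integer> blockStarts = new ArrayList<>();
        int end = start + length;
        int pos = start + offset;
        while (pos < end) {
            blockStarts.add(pos);
            int next = findOffset(file, pos + MARKER.length);
            if (next == -1) {
                break; // that was the last block in the file
            }
            pos = pos + MARKER.length + next;
        }
        return blockStarts.iterator();
    }

    public static void main(String[] args) {
        // Two toy blocks; the second one extends past the last split boundary.
        byte[] file = ("OSMData" + "aaaa" + "OSMData" + "bbbbbbbb")
                .getBytes(StandardCharsets.US_ASCII);
        int splitSize = 10;
        for (int start = 0; start < file.length; start += splitSize) {
            int length = Math.min(splitSize, file.length - start);
            List<Integer> blocks = new ArrayList<>();
            readPartition(file, start, length).forEachRemaining(blocks::add);
            System.out.println("partition@" + start + " -> blocks " + blocks);
        }
    }
}
```

In this toy layout the final split starts inside the second block's payload, so it has no header of its own and contributes nothing, while each block is still reported exactly once by the split that contains its marker.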
