[I] Spark pbf reader: java.io.EOFException [sedona]

via GitHub Tue, 24 Mar 2026 13:37:21 -0700


jornfranke opened a new issue, #2781:
URL: https://github.com/apache/sedona/issues/2781


   Hi,
   
   The Spark PBF reader generally works for me, but it fails on the following file:
   https://download.geofabrik.de/europe/malta-latest.osm.pbf
   
   I receive the following exception:
   ```
   org.apache.spark.SparkException: Job aborted due to stage failure: Task 2 in stage 3.0 failed 4 times, most recent failure: Lost task 2.3 in stage 3.0 (TID 7) (100.100.206.138 executor 1): java.lang.RuntimeException: java.io.EOFException
        at org.apache.sedona.sql.datasources.osmpbf.iterators.PrimitiveGroupIterator.next(PrimitiveGroupIterator.java:59)
        at org.apache.sedona.sql.datasources.osmpbf.iterators.PbfIterator.readNextBlock(PbfIterator.java:60)
        at org.apache.sedona.sql.datasources.osmpbf.iterators.PbfIterator.<init>(PbfIterator.java:35)
        at org.apache.sedona.sql.datasources.osm.OsmPartitionReader.apply(OsmPartitionReader.scala:55)
        at org.apache.sedona.sql.datasources.osm.OsmPartitionReader.apply(OsmPartitionReader.scala:39)
        at org.apache.spark.sql.execution.datasources.FileFormat$$anon$1.apply(FileFormat.scala:155)
        at org.apache.spark.sql.execution.datasources.FileFormat$$anon$1.apply(FileFormat.scala:140)
        at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.org$apache$spark$sql$execution$datasources$FileScanRDD$$anon$$readCurrentFile(FileScanRDD.scala:217)
        at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:279)
        at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:129)
        at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
        at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.sort_addToSorter_0$(Unknown Source)
        at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
        at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
        at org.apache.spark.sql.execution.WholeStageCodegenEvaluatorFactory$WholeStageCodegenPartitionEvaluator$$anon$1.hasNext(WholeStageCodegenEvaluatorFactory.scala:43)
        at org.apache.spark.sql.execution.datasources.FileFormatWriter$.executeTask(FileFormatWriter.scala:385)
        at org.apache.spark.sql.execution.datasources.WriteFilesExec.$anonfun$doExecuteWrite$1(WriteFiles.scala:100)
        at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2(RDD.scala:893)
        at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2$adapted(RDD.scala:893)
        at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:367)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:331)
        at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:93)
        at org.apache.spark.TaskContext.runTaskWithListeners(TaskContext.scala:166)
        at org.apache.spark.scheduler.Task.run(Task.scala:141)
        at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$4(Executor.scala:637)
        at org.apache.spark.util.SparkErrorUtils.tryWithSafeFinally(SparkErrorUtils.scala:64)
        at org.apache.spark.util.SparkErrorUtils.tryWithSafeFinally$(SparkErrorUtils.scala:61)
        at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:95)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:640)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:750)
   Caused by: java.io.EOFException
        at java.io.DataInputStream.readFully(DataInputStream.java:197)
        at java.io.DataInputStream.readFully(DataInputStream.java:169)
        at org.apache.sedona.sql.datasources.osmpbf.iterators.PrimitiveGroupIterator.next(PrimitiveGroupIterator.java:50)
        ... 32 more
   ```
   
   It is the only file on Geofabrik that causes issues; all others work. I use Apache Sedona 1.7.2 (the cluster cannot currently be upgraded to JDK 17) on Spark 3.5. I tested the file in QGIS 4.0, and it does not seem to cause errors there.
   
   Any idea how I can fix this, or is the problem really in the file itself?
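
One way to narrow down whether the file itself is at fault is to walk its outer block framing independently of Sedona. The sketch below is a minimal, standalone check and not part of Sedona: it relies only on the OSM PBF container layout (a 4-byte big-endian length, a `BlobHeader` message with `type` in field 1 and `datasize` in field 3, then `datasize` bytes of blob data). The function names (`read_varint`, `parse_blob_header`, `scan_pbf`) are made up for illustration. It does not decompress or decode the blobs, so a clean pass only shows that the file is not truncated at the block level; it says nothing about how the reader splits the file across tasks.

```python
import struct


def read_varint(buf, pos):
    """Decode one protobuf varint from buf at pos; return (value, new_pos)."""
    result = 0
    shift = 0
    while True:
        if pos >= len(buf):
            raise ValueError("truncated varint")
        byte = buf[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return result, pos
        shift += 7


def parse_blob_header(buf):
    """Return (type, datasize) from a serialized BlobHeader message."""
    pos = 0
    blob_type = None
    datasize = None
    while pos < len(buf):
        key, pos = read_varint(buf, pos)
        field, wire = key >> 3, key & 0x7
        if wire == 0:  # varint field
            value, pos = read_varint(buf, pos)
            if field == 3:
                datasize = value
        elif wire == 2:  # length-delimited field (string or bytes)
            length, pos = read_varint(buf, pos)
            value = buf[pos:pos + length]
            if len(value) < length:
                raise ValueError("truncated length-delimited field")
            pos += length
            if field == 1:
                blob_type = value.decode("utf-8")
        else:
            raise ValueError(f"unexpected wire type {wire}")
    if blob_type is None or datasize is None:
        raise ValueError("BlobHeader is missing type or datasize")
    return blob_type, datasize


def scan_pbf(stream):
    """Walk every fileblock and return a list of (offset, type, datasize).

    Raises EOFError if the stream ends in the middle of a block.
    """
    blocks = []
    offset = 0
    while True:
        prefix = stream.read(4)
        if not prefix:
            return blocks  # clean end of file on a block boundary
        if len(prefix) < 4:
            raise EOFError(f"truncated length prefix at offset {offset}")
        (header_len,) = struct.unpack(">I", prefix)
        header = stream.read(header_len)
        if len(header) < header_len:
            raise EOFError(f"truncated BlobHeader at offset {offset}")
        blob_type, datasize = parse_blob_header(header)
        blob = stream.read(datasize)
        if len(blob) < datasize:
            raise EOFError(f"truncated blob at offset {offset}")
        blocks.append((offset, blob_type, datasize))
        offset += 4 + header_len + datasize
```

Calling `scan_pbf` on the downloaded file opened in binary mode (`open(path, "rb")`) lists the block offsets and sizes. If it finishes without an `EOFError`, the container framing is intact, which would point toward the reader rather than the file.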
   
   


