gaoshihang opened a new issue, #9497:
URL: https://github.com/apache/iceberg/issues/9497

   ### Apache Iceberg version
   
   1.4.3 (latest release)
   
   ### Query engine
   
   Spark
   
   ### Please describe the bug 🐞
   
   We have a two-level Parquet list with the following schema:
   
![image](https://github.com/apache/iceberg/assets/20013931/ceddb1ba-76fc-4b0f-afa3-9fab3f01b583)
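
   For readers of the archive, here is a sketch of what that schema likely is in Parquet message syntax: the legacy two-level form puts the repeated element directly inside the list group, with no intermediate `list`/`element` wrapper. The field names below are taken from the column path `[a, array]` in the error; the message name and outer nullability are assumptions.

   ```java
   import org.apache.parquet.schema.MessageType;
   import org.apache.parquet.schema.MessageTypeParser;

   public class TwoLevelListSchema {
     public static void main(String[] args) {
       // Legacy two-level list: the repeated field IS the element, with no
       // intermediate "list"/"element" group (that would be the three-level form).
       // Field names follow the error below ([a, array]); the rest is assumed.
       MessageType twoLevelList =
           MessageTypeParser.parseMessageType(
               "message spark_schema {\n"
                   + "  optional group a (LIST) {\n"
                   + "    repeated int32 array;\n"
                   + "  }\n"
                   + "}");
       System.out.println(twoLevelList);
     }
   }
   ```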
   
   If this array is empty ([]), and we use the `add_files` procedure to add this Parquet file to a table, then querying the table throws this exception:
   ```
   Caused by: org.apache.parquet.io.ParquetDecodingException: Can't read value in column [a, array] repeated int32 array = 2 at value 4 out of 4 in current page. repetition level: -1, definition level: -1
        at org.apache.iceberg.parquet.PageIterator.handleRuntimeException(PageIterator.java:220)
        at org.apache.iceberg.parquet.PageIterator.nextInteger(PageIterator.java:141)
        at org.apache.iceberg.parquet.ColumnIterator.nextInteger(ColumnIterator.java:121)
        at org.apache.iceberg.parquet.ColumnIterator$2.next(ColumnIterator.java:41)
        at org.apache.iceberg.parquet.ColumnIterator$2.next(ColumnIterator.java:38)
        at org.apache.iceberg.parquet.ParquetValueReaders$UnboxedReader.read(ParquetValueReaders.java:246)
        at org.apache.iceberg.parquet.ParquetValueReaders$RepeatedReader.read(ParquetValueReaders.java:467)
        at org.apache.iceberg.parquet.ParquetValueReaders$OptionReader.read(ParquetValueReaders.java:419)
        at org.apache.iceberg.parquet.ParquetValueReaders$StructReader.read(ParquetValueReaders.java:745)
        at org.apache.iceberg.parquet.ParquetReader$FileIterator.next(ParquetReader.java:130)
        at org.apache.iceberg.io.FilterIterator.advance(FilterIterator.java:65)
        at org.apache.iceberg.io.FilterIterator.hasNext(FilterIterator.java:49)
        at org.apache.iceberg.spark.source.BaseReader.next(BaseReader.java:129)
        at org.apache.spark.sql.execution.datasources.v2.PartitionIterator.hasNext(DataSourceRDD.scala:93)
        at org.apache.spark.sql.execution.datasources.v2.MetricsIterator.hasNext(DataSourceRDD.scala:130)
        at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
        at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
        at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
        at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
        at org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:759)
        at org.apache.spark.sql.execution.SparkPlan.$anonfun$getByteArrayRdd$1(SparkPlan.scala:349)
        at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2(RDD.scala:898)
        at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2$adapted(RDD.scala:898)
        at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:337)
        at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
        at org.apache.spark.scheduler.Task.run(Task.scala:131)
        at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:506)
        at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1462)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:509)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:750)
   Caused by: org.apache.parquet.io.ParquetDecodingException: could not read int
        at org.apache.parquet.column.values.plain.PlainValuesReader$IntegerPlainValuesReader.readInteger(PlainValuesReader.java:114)
        at org.apache.iceberg.parquet.PageIterator.nextInteger(PageIterator.java:139)
        ... 32 more
   Caused by: java.io.EOFException
        at org.apache.parquet.bytes.SingleBufferInputStream.read(SingleBufferInputStream.java:52)
        at org.apache.parquet.bytes.LittleEndianDataInputStream.readInt(LittleEndianDataInputStream.java:347)
        at org.apache.parquet.column.values.plain.PlainValuesReader$IntegerPlainValuesReader.readInteger(PlainValuesReader.java:112)
        ... 33 more
   ```
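
   For context, the reproduction path is roughly the following (a sketch, not the exact job: it assumes an Iceberg catalog named `my_catalog`, a hypothetical table `db.t`, and a pre-existing two-level-list Parquet file at a hypothetical path; the `add_files` call follows the documented procedure syntax):

   ```java
   import org.apache.spark.sql.SparkSession;

   public class AddFilesEmptyArrayRepro {
     public static void main(String[] args) {
       SparkSession spark =
           SparkSession.builder().appName("iceberg-add-files-empty-array").getOrCreate();

       // Hypothetical target table whose schema matches the Parquet file.
       spark.sql("CREATE TABLE IF NOT EXISTS my_catalog.db.t (a array<int>) USING iceberg");

       // Register the existing two-level-list Parquet file(s), which contain rows
       // where a = [], without rewriting them.
       spark.sql(
           "CALL my_catalog.system.add_files("
               + "table => 'db.t', "
               + "source_table => '`parquet`.`/path/to/two_level_list_files`')");

       // Reading the added data back is where the exception above is thrown.
       spark.sql("SELECT * FROM my_catalog.db.t").show();
     }
   }
   ```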
   
   I read the code in iceberg-parquet, and it seems like this do-while loop will never exit:
   
![image](https://github.com/apache/iceberg/assets/20013931/2e74bbc0-4daa-4570-a91b-717ffdc65324)
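
   For readers without the screenshot: per the trace, this is the loop in `ParquetValueReaders$RepeatedReader.read`. A tiny standalone model of the suspected shape (illustrative only, not Iceberg code): a read-then-check do-while performs at least one element read per list, so an empty list still consumes a value that the page never contained, which would be consistent with the `EOFException` while decoding values above.

   ```java
   import java.util.ArrayDeque;
   import java.util.ArrayList;
   import java.util.Deque;
   import java.util.List;

   public class ReadThenCheckModel {
     public static void main(String[] args) {
       // Model of a value page: it holds the element values for every list in the
       // page. An empty list contributes no element values, only a level triple.
       Deque<Integer> pageValues = new ArrayDeque<>();  // page written from []

       List<Integer> list = new ArrayList<>();
       // Read-then-check: the body runs at least once, so even for an empty list
       // we pop a value the page does not have -- the analogue of the
       // "could not read int" / EOFException in the trace above.
       do {
         list.add(pageValues.pop());  // throws NoSuchElementException here
       } while (!pageValues.isEmpty());
     }
   }
   ```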
   

