wypoon commented on issue #11221:
URL: https://github.com/apache/iceberg/issues/11221#issuecomment-2379964488
In Iceberg 1.1, a different bug occurs when reading the Iceberg table; the read fails altogether due to:
```
ERROR org.apache.iceberg.spark.source.BaseReader - Error reading file(s): file:/Users/wypoon/tmp/downloads/ENGESC-26958/lgim_test_data.parq
java.lang.IndexOutOfBoundsException: index: 1016, length: 1 (expected: range(0, 1016))
    at org.apache.arrow.memory.ArrowBuf.checkIndexD(ArrowBuf.java:318)
    at org.apache.arrow.memory.ArrowBuf.chk(ArrowBuf.java:305)
    at org.apache.arrow.memory.ArrowBuf.getByte(ArrowBuf.java:507)
    at org.apache.arrow.vector.BitVectorHelper.setBit(BitVectorHelper.java:85)
    at org.apache.arrow.vector.DecimalVector.setBigEndian(DecimalVector.java:216)
    at org.apache.iceberg.arrow.vectorized.parquet.DecimalVectorUtil.setBigEndian(DecimalVectorUtil.java:31)
    at org.apache.iceberg.arrow.vectorized.parquet.VectorizedDictionaryEncodedParquetValuesReader$FixedLengthDecimalDictEncodedReader.nextVal(VectorizedDictionaryEncodedParquetValuesReader.java:146)
    at org.apache.iceberg.arrow.vectorized.parquet.VectorizedDictionaryEncodedParquetValuesReader$BaseDictEncodedReader.nextBatch(VectorizedDictionaryEncodedParquetValuesReader.java:67)
    at org.apache.iceberg.arrow.vectorized.parquet.VectorizedParquetDefinitionLevelReader$FixedLengthDecimalReader.nextDictEncodedVal(VectorizedParquetDefinitionLevelReader.java:513)
    at org.apache.iceberg.arrow.vectorized.parquet.VectorizedParquetDefinitionLevelReader$BaseReader.nextDictEncodedBatch(VectorizedParquetDefinitionLevelReader.java:356)
    at org.apache.iceberg.arrow.vectorized.parquet.VectorizedPageIterator$FixedLengthDecimalPageReader.nextDictEncodedVal(VectorizedPageIterator.java:421)
    at org.apache.iceberg.arrow.vectorized.parquet.VectorizedPageIterator$BagePageReader.nextBatch(VectorizedPageIterator.java:186)
    at org.apache.iceberg.arrow.vectorized.parquet.VectorizedColumnIterator$FixedLengthDecimalBatchReader.nextBatchOf(VectorizedColumnIterator.java:213)
    at org.apache.iceberg.arrow.vectorized.parquet.VectorizedColumnIterator$BatchReader.nextBatch(VectorizedColumnIterator.java:77)
    at org.apache.iceberg.arrow.vectorized.VectorizedArrowReader.read(VectorizedArrowReader.java:146)
    at org.apache.iceberg.spark.data.vectorized.ColumnarBatchReader$ColumnBatchLoader.readDataToColumnVectors(ColumnarBatchReader.java:123)
    at org.apache.iceberg.spark.data.vectorized.ColumnarBatchReader$ColumnBatchLoader.loadDataToColumnBatch(ColumnarBatchReader.java:98)
    at org.apache.iceberg.spark.data.vectorized.ColumnarBatchReader.read(ColumnarBatchReader.java:72)
    at org.apache.iceberg.spark.data.vectorized.ColumnarBatchReader.read(ColumnarBatchReader.java:44)
    at org.apache.iceberg.parquet.VectorizedParquetReader$FileIterator.next(VectorizedParquetReader.java:147)
    at org.apache.iceberg.spark.source.BaseReader.next(BaseReader.java:130)
    at org.apache.spark.sql.execution.datasources.v2.PartitionIterator.hasNext(DataSourceRDD.scala:119)
    at org.apache.spark.sql.execution.datasources.v2.MetricsIterator.hasNext(DataSourceRDD.scala:156)
    at org.apache.spark.sql.execution.datasources.v2.DataSourceRDD$$anon$1.$anonfun$hasNext$1(DataSourceRDD.scala:63)
    at org.apache.spark.sql.execution.datasources.v2.DataSourceRDD$$anon$1.$anonfun$hasNext$1$adapted(DataSourceRDD.scala:63)
    at scala.Option.exists(Option.scala:376)
    at org.apache.spark.sql.execution.datasources.v2.DataSourceRDD$$anon$1.hasNext(DataSourceRDD.scala:63)
    at org.apache.spark.sql.execution.datasources.v2.DataSourceRDD$$anon$1.advanceToNextIter(DataSourceRDD.scala:97)
    at org.apache.spark.sql.execution.datasources.v2.DataSourceRDD$$anon$1.hasNext(DataSourceRDD.scala:63)
    at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
    at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
    at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.columnartorow_nextBatch_0$(Unknown Source)
    at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
    at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
    at org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:760)
    at org.apache.spark.sql.execution.SparkPlan.$anonfun$getByteArrayRdd$1(SparkPlan.scala:364)
    at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2(RDD.scala:890)
    at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2$adapted(RDD.scala:890)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:365)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:329)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
    at org.apache.spark.scheduler.Task.run(Task.scala:136)
    ...
```
As far as I can tell, this has never worked correctly.
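The out-of-bounds access is one byte past the end of the `DecimalVector`'s validity buffer (`getByte` at index 1016 on a buffer of capacity 1016, reached from `BitVectorHelper.setBit`), which suggests the vector is not sized for the number of values being decoded. For anyone without access to the original file, a minimal spark-shell sketch along these lines should exercise the same code path; the catalog/table names and the decimal precision are illustrative assumptions, not taken from the original report:
```scala
// Hypothetical repro sketch; assumes a Spark session with the Iceberg runtime
// jar on the classpath and an Iceberg catalog named "local" configured.
// Names and types are illustrative, not from the original report.

// A decimal with precision > 18 is written to Parquet as FIXED_LEN_BYTE_ARRAY,
// which is the fixed-length decimal path in the stack trace above.
spark.sql("CREATE TABLE local.db.decimal_repro (amount DECIMAL(38, 10)) USING iceberg")

// Low-cardinality values encourage Parquet dictionary encoding, so the read
// should go through FixedLengthDecimalDictEncodedReader.
spark.sql(
  "INSERT INTO local.db.decimal_repro " +
    "SELECT CAST(id % 10 AS DECIMAL(38, 10)) FROM range(100000)")

// Iceberg's vectorized Arrow read path is on by default for Parquet; scanning
// the table drives VectorizedDictionaryEncodedParquetValuesReader.
spark.sql("SELECT sum(amount) FROM local.db.decimal_repro").show()
```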