wypoon opened a new issue, #11221: URL: https://github.com/apache/iceberg/issues/11221
### Apache Iceberg version

main (development)

### Query engine

Spark

### Please describe the bug 🐞

The bug is present in Iceberg 1.2 and later (and is in main).

A customer uses Impala to write Parquet data into an Iceberg table. We have a sample Parquet file written by Impala. It contains a single decimal(38, 0) column.

```
$ parquet-tools meta impala_test_data.parq
file:        file:/home/systest/impala_test_data.parq
creator:     impala version 4.0.0.2024.0.18.1-1 (build 1fe5a71a0498831cf13e5b30ca5e431d69da58bd)

file schema: schema
--------------------------------------------------------------------------------
id:          OPTIONAL FIXED_LEN_BYTE_ARRAY L:DECIMAL(38,0) R:0 D:1

row group 1: RC:435942 TS:2755159 OFFSET:4
--------------------------------------------------------------------------------
id:           FIXED_LEN_BYTE_ARRAY SNAPPY DO:4 FPO:243092 SZ:2755159/7058047/2.56 VC:435942 ENC:PLAIN_DICTIONARY,PLAIN,RLE ST:[min: 224798, max: 5431555, num_nulls: 0]
```

Spark is able to read the Parquet file correctly (note this is Spark's own Parquet read path, not Spark Iceberg support):

```
scala> val df = spark.read.parquet("/user/systest/decimal_test/impala_test_data.parq")
df: org.apache.spark.sql.DataFrame = [id: decimal(38,0)]

scala> df.count()
res0: Long = 435942

scala> df.show()
+-------+
|     id|
+-------+
|3025050|
|1401270|
| 505425|
| 479647|
|5061822|
|4170450|
|3307794|
| 683409|
|3205921|
|3261299|
|1596856|
|5260644|
|4865400|
|4737157|
|4808919|
|4032370|
|5183774|
|4119261|
|1911171|
| 782928|
+-------+
only showing top 20 rows
```

However, when we create an Iceberg table, add the file to it (using the Spark `add_files` procedure), and then query the table, we get

```
scala> spark.sql("select * from test_iceberg").show(80, false)
+-------+
|id     |
+-------+
|3025050|
|0      |
|0      |
|0      |
|0      |
|0      |
|0      |
|0      |
|0      |
|0      |
|0      |
|0      |
|0      |
|0      |
|0      |
|0      |
|1401270|
|0      |
|0      |
|0      |
|0      |
|0      |
|0      |
|0      |
|0      |
|0      |
|0      |
|0      |
|0      |
|0      |
|0      |
|0      |
|505425 |
|0      |
|0      |
|0      |
|0      |
|0      |
|0      |
|0      |
|0      |
|0      |
|0      |
|0      |
|0      |
|0      |
|0      |
|0      |
|479647 |
|0      |
|0      |
|0      |
|0      |
|0      |
|0      |
|0      |
|0      |
|0      |
|0      |
|0      |
|0      |
|0      |
|0      |
|0      |
|5061822|
|0      |
|0      |
|0      |
|0      |
|0      |
|0      |
|0      |
|0      |
|0      |
|0      |
|0      |
|0      |
|0      |
|0      |
|0      |
+-------+
only showing top 80 rows
```

As we can see, the values do show up, but there are 15 zeros that show up after each value.

This is using vectorized Parquet read, as `read.parquet.vectorization.enabled` is true by default. When I set it to false in table properties for the table and query it again, the results are correct. Thus the bug is in the vectorized read path.

Note that when I write the DataFrame from reading the original Parquet file back out into another Iceberg table (with the same schema), the file written by Spark has a different encoding:

```
$ parquet-tools meta spark_iceberg_data.parquet
file:        file:/home/systest/spark_iceberg_data.parquet
creator:     parquet-mr version 1.13.1 (build db4183109d5b734ec5930d870cdae161e408ddba)
extra:       iceberg.schema = {"type":"struct","schema-id":0,"fields":[{"id":1,"name":"id","required":false,"type":"decimal(38, 0)"}]}

file schema: table
--------------------------------------------------------------------------------
id:          OPTIONAL FIXED_LEN_BYTE_ARRAY L:DECIMAL(38,0) R:0 D:1

row group 1: RC:435942 TS:6975886 OFFSET:4
--------------------------------------------------------------------------------
id:           FIXED_LEN_BYTE_ARRAY SNAPPY DO:0 FPO:4 SZ:2685449/6975886/2.60 VC:435942 ENC:PLAIN,RLE,BIT_PACKED ST:[min: 224798, max: 5431555, num_nulls: 0]
```

### Willingness to contribute

- [ ] I can contribute a fix for this bug independently
- [X] I would be willing to contribute a fix for this bug with guidance from the Iceberg community
- [ ] I cannot contribute a fix for this bug at this time

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at: us...@infra.apache.org

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org