[I] Spark vectorized read of Parquet produces incorrect result for a decimal column [iceberg]

via GitHub Fri, 27 Sep 2024 12:57:03 -0700


wypoon opened a new issue, #11221:
URL: https://github.com/apache/iceberg/issues/11221


   ### Apache Iceberg version
   
   main (development)
   
   ### Query engine
   
   Spark
   
   ### Please describe the bug 🐞
   
   The bug is present in Iceberg 1.2 and later (and is in main).
   A customer uses Impala to write Parquet data into an Iceberg table. We have a sample Parquet file written by Impala. It contains a single decimal(38, 0) column.
   ```
   $ parquet-tools meta impala_test_data.parq
   file:        file:/home/systest/impala_test_data.parq
   creator:     impala version 4.0.0.2024.0.18.1-1 (build 1fe5a71a0498831cf13e5b30ca5e431d69da58bd)
   
   file schema: schema
   --------------------------------------------------------------------------------
   id:          OPTIONAL FIXED_LEN_BYTE_ARRAY L:DECIMAL(38,0) R:0 D:1
   
   row group 1: RC:435942 TS:2755159 OFFSET:4
   --------------------------------------------------------------------------------
   id:           FIXED_LEN_BYTE_ARRAY SNAPPY DO:4 FPO:243092 SZ:2755159/7058047/2.56 VC:435942 ENC:PLAIN_DICTIONARY,PLAIN,RLE ST:[min: 224798, max: 5431555, num_nulls: 0]
   ```
   Spark is able to read the Parquet file correctly (note that this is Spark's own Parquet read path, not the Iceberg Spark integration):
   ```
   scala> val df = spark.read.parquet("/user/systest/decimal_test/impala_test_data.parq")
   df: org.apache.spark.sql.DataFrame = [id: decimal(38,0)]
   
   
   scala> df.count()
   res0: Long = 435942                                                          
   
   
   scala> df.show()
   +-------+
   |     id|
   +-------+
   |3025050|
   |1401270|
   | 505425|
   | 479647|
   |5061822|
   |4170450|
   |3307794|
   | 683409|
   |3205921|
   |3261299|
   |1596856|
   |5260644|
   |4865400|
   |4737157|
   |4808919|
   |4032370|
   |5183774|
   |4119261|
   |1911171|
   | 782928|
   +-------+
   only showing top 20 rows
   
   ```
   However, when we create an Iceberg table, add the file to it (using the Spark `add_files` procedure), and then query the table, we get:
   ```
   scala> spark.sql("select * from test_iceberg").show(80, false)
   +-------+
   |id     |
   +-------+
   |3025050|
   |0      |
   |0      |
   |0      |
   |0      |
   |0      |
   |0      |
   |0      |
   |0      |
   |0      |
   |0      |
   |0      |
   |0      |
   |0      |
   |0      |
   |0      |
   |1401270|
   |0      |
   |0      |
   |0      |
   |0      |
   |0      |
   |0      |
   |0      |
   |0      |
   |0      |
   |0      |
   |0      |
   |0      |
   |0      |
   |0      |
   |0      |
   |505425 |
   |0      |
   |0      |
   |0      |
   |0      |
   |0      |
   |0      |
   |0      |
   |0      |
   |0      |
   |0      |
   |0      |
   |0      |
   |0      |
   |0      |
   |0      |
   |479647 |
   |0      |
   |0      |
   |0      |
   |0      |
   |0      |
   |0      |
   |0      |
   |0      |
   |0      |
   |0      |
   |0      |
   |0      |
   |0      |
   |0      |
   |0      |
   |5061822|
   |0      |
   |0      |
   |0      |
   |0      |
   |0      |
   |0      |
   |0      |
   |0      |
   |0      |
   |0      |
   |0      |
   |0      |
   |0      |
   |0      |
   |0      |
   +-------+
   only showing top 80 rows
   
   ```
   As we can see, the correct values do show up, but each one is followed by 15 rows of zeros.
   This is with vectorized Parquet reads, as `read.parquet.vectorization.enabled` is true by default.
   When I set it to false in the table properties and query the table again, the results are correct. Thus the bug is in the vectorized read path.
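   
   The reproduction and the workaround can be sketched in spark-shell as follows. This is only a sketch under stated assumptions: the catalog name `spark_catalog`, the `default` namespace, and the use of the directory `/user/systest/decimal_test` as the `add_files` source (assumed to contain only the Impala-written file) are illustrative and do not come from the session above.
   
   ```scala
   // Run in spark-shell with the Iceberg runtime and SQL extensions configured.
   // Assumption: spark_catalog is an Iceberg-enabled session catalog.
   spark.sql("CREATE TABLE spark_catalog.default.test_iceberg (id decimal(38, 0)) USING iceberg")
   
   // Register the Impala-written Parquet file with the table via add_files.
   // Assumption: the source directory holds only impala_test_data.parq.
   spark.sql("""
     CALL spark_catalog.system.add_files(
       table => 'default.test_iceberg',
       source_table => '`parquet`.`/user/systest/decimal_test`')
   """)
   
   // Vectorized read is the default; this is the query that returned the zeros above.
   spark.sql("select * from spark_catalog.default.test_iceberg").show(80, false)
   
   // Workaround: disable vectorized Parquet reads for this table, then query again.
   spark.sql("""
     ALTER TABLE spark_catalog.default.test_iceberg
     SET TBLPROPERTIES ('read.parquet.vectorization.enabled' = 'false')
   """)
   spark.sql("select * from spark_catalog.default.test_iceberg").show(80, false)
   ```
   
   Setting the property on the table keeps the change scoped to this one table; other tables continue to use the vectorized reader.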
   
   Note that when I write the DataFrame read from the original Parquet file back out into another Iceberg table (with the same schema), the file written by Spark has a different encoding:
   ```
   $ parquet-tools meta spark_iceberg_data.parquet
   file:        file:/home/systest/spark_iceberg_data.parquet
   creator:     parquet-mr version 1.13.1 (build db4183109d5b734ec5930d870cdae161e408ddba)
   extra:       iceberg.schema = {"type":"struct","schema-id":0,"fields":[{"id":1,"name":"id","required":false,"type":"decimal(38, 0)"}]}
   
   file schema: table
   --------------------------------------------------------------------------------
   id:          OPTIONAL FIXED_LEN_BYTE_ARRAY L:DECIMAL(38,0) R:0 D:1
   
   row group 1: RC:435942 TS:6975886 OFFSET:4
   --------------------------------------------------------------------------------
   id:           FIXED_LEN_BYTE_ARRAY SNAPPY DO:0 FPO:4 SZ:2685449/6975886/2.60 VC:435942 ENC:PLAIN,RLE,BIT_PACKED ST:[min: 224798, max: 5431555, num_nulls: 0]
   ```
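   
   For reference, the write-back step described above might look like the following in spark-shell. This is a sketch, not the exact commands that were run: the target table name `test_iceberg_spark` and the `spark_catalog.default` namespace are hypothetical.
   
   ```scala
   // Read the Impala-written file through Spark's own Parquet reader.
   val df = spark.read.parquet("/user/systest/decimal_test/impala_test_data.parq")
   
   // Write the rows into a new Iceberg table with the same schema.
   // Assumption: test_iceberg_spark is an illustrative name and does not exist yet.
   df.writeTo("spark_catalog.default.test_iceberg_spark").using("iceberg").create()
   ```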
   
   
   ### Willingness to contribute
   
   - [ ] I can contribute a fix for this bug independently
   - [X] I would be willing to contribute a fix for this bug with guidance from the Iceberg community
   - [ ] I cannot contribute a fix for this bug at this time


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org
