[I] Duplicate file name in Iceberg's metadata [iceberg]

via GitHub Mon, 30 Oct 2023 09:15:41 -0700


github-raphael-douyere opened a new issue, #8953:
URL: https://github.com/apache/iceberg/issues/8953


   ### Apache Iceberg version
   
   1.3.1
   
   ### Query engine
   
   Spark
   
   ### Please describe the bug 🐞
   
   While writing data to an Iceberg table using Spark Streaming 3.4.1 / Iceberg 
1.3.1 / EMR 6.13, we observe multiple entries in the table's metadata for a 
single file (same path and name).
   
   ```
   +-------+-----------------------------------------------------------------------------------------------------------------+-----------+-------+---------+------------+------------------+
   |content|file_path                                                                                                        |file_format|spec_id|partition|record_count|file_size_in_bytes|
   +-------+-----------------------------------------------------------------------------------------------------------------+-----------+-------+---------+------------+------------------+
   |0      |s3://<redacted>/table/data/time_hour=2023-10-12-16/00030-20515-7162e9b9-8d49-4d04-a828-6725e75da400-00001.parquet|PARQUET    |0      |{471424} |1176385     |52215529          |
   |0      |s3://<redacted>/table/data/time_hour=2023-10-12-16/00030-20515-7162e9b9-8d49-4d04-a828-6725e75da400-00001.parquet|PARQUET    |0      |{471424} |1152053     |51648666          |
   +-------+-----------------------------------------------------------------------------------------------------------------+-----------+-------+---------+------------+------------------+
   ```
   
   This causes issues when reading the data with Athena, but not when reading 
with Spark (or when opening the Parquet files directly with parquet-cli). 
   We also see that the two occurrences of the file belong to different 
snapshots:
   ```
   +------+-------------------+---------------+--------------------+-----------------------------------------------------------------------------------------------------------------+
   |status|snapshot_id        |sequence_number|file_sequence_number|file_path                                                                                                        |
   +------+-------------------+---------------+--------------------+-----------------------------------------------------------------------------------------------------------------+
   |1     |5798287735063119103|605            |605                 |s3://<redacted>/table/data/time_hour=2023-10-12-16/00030-20515-7162e9b9-8d49-4d04-a828-6725e75da400-00001.parquet|
   |1     |48372161143873894  |604            |604                 |s3://<redacted>/table/data/time_hour=2023-10-12-16/00030-20515-7162e9b9-8d49-4d04-a828-6725e75da400-00001.parquet|
   +------+-------------------+---------------+--------------------+-----------------------------------------------------------------------------------------------------------------+
   ```
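
For anyone who wants to check their own table for the same symptom, here is a minimal sketch of the check in plain Python. It takes rows shaped like the output above (`status`, `snapshot_id`, `file_path`) and reports every path that more than one snapshot recorded as added. The function name `paths_added_more_than_once` and the table name in the comment are hypothetical, and the Spark query in the comment is only an assumption about how such rows might be fetched from the `entries` metadata table.

```python
from collections import defaultdict

# Iceberg manifest entry status codes: 0 = EXISTING, 1 = ADDED, 2 = DELETED
ADDED = 1


def paths_added_more_than_once(entries):
    """Return {file_path: [snapshot_id, ...]} for every path that was
    recorded as ADDED by more than one distinct snapshot."""
    added_by = defaultdict(list)
    for entry in entries:
        if entry["status"] == ADDED:
            added_by[entry["file_path"]].append(entry["snapshot_id"])
    return {path: ids for path, ids in added_by.items() if len(set(ids)) > 1}


# Rows copied from the output above. With Spark they might instead come from
# something like this (hypothetical catalog/table name, not from the report):
#   spark.sql("SELECT status, snapshot_id, data_file.file_path AS file_path "
#             "FROM my_catalog.db.my_table.entries").collect()
path = (
    "s3://<redacted>/table/data/time_hour=2023-10-12-16/"
    "00030-20515-7162e9b9-8d49-4d04-a828-6725e75da400-00001.parquet"
)
entries = [
    {"status": 1, "snapshot_id": 5798287735063119103, "file_path": path},
    {"status": 1, "snapshot_id": 48372161143873894, "file_path": path},
]

duplicates = paths_added_more_than_once(entries)
for dup_path, snapshot_ids in duplicates.items():
    print(dup_path, snapshot_ids)
```

Filtering on the ADDED status keeps the check focused on the case reported here: a file that normally enters the table once but shows up as newly added in two snapshots.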
   
   This seems very similar to https://github.com/apache/iceberg/issues/8427 and 
https://github.com/apache/iceberg/issues/8609.


