hguercan opened a new issue, #13763:
URL: https://github.com/apache/iceberg/issues/13763

   ### Apache Iceberg version
   
   None
   
   ### Query engine
   
   None
   
   ### Please describe the bug 🐞
   
   Hello everyone, 
   
   Our architecture streams data via Kafka and ingests it through Kafka Connect using the iceberg-sink-connector (Iceberg 1.9.2) into Azure Blob Storage, against a Polaris Catalog in version 1.0.0-incubating. Maintenance is done via DBX (Databricks) Spark 3.5.2, also with Iceberg version 3.5.2, with `partial-progress.enabled`. We recently changed that Spark option to true; before that we did not see the issue.
   
   When working with Spark there is no issue, but when we import the data into Snowflake the following error message appears:
   
   `SQL execution error: Duplicate file path found in the Iceberg metadata 
snapshot. Please check that your Iceberg metadata generation is producing valid 
manifest files and refresh to a newer snapshot once fixed. File path: 
'<irrelevant-prefix>created_at_day=2025-08-06/country_code=<country_code>/entity_number=<irrelevant>/00001-xxxxxxxxx*-.parquet`.
 
   
   Drilling down from the snapshot to the manifest file, we indeed see a duplicate entry for the file:
   
```json
{"status":0,"snapshot_id":{"long":7127004696002716753},"sequence_number":{"long":59331},"file_sequence_number":{"long":59331},"data_file":{"content":0,"file_path":"abfss://<masked-path-prefix>/xxxx/xxx/<table-name>/data/created_at_day=2025-08-05/country_code=<country_code>/entity_number=<irrelevant>/00001-1754381332739-c55fd420-b673-45ec-be45-35408bf4e650-00001.parquet","file_format":"PARQUET","partition":{"created_at_day":{"int":20305},"country_code":{"string":"<country_code>"},"entity_number":{"string":"<irrelevant>"}},"record_count":307,"file_size_in_bytes":25282,"column_sizes":{"array":[]},"value_counts":{"array":[]},"null_value_counts":{"array":[]},"nan_value_counts":{"array":[]},"lower_bounds":{"array":[]},"upper_bounds":{"array":[]},"key_metadata":null,"split_offsets":{"array":[4]},"equality_ids":null,"sort_order_id":{"int":0},"referenced_data_file":null}}
{"status":0,"snapshot_id":{"long":1026760103416329420},"sequence_number":{"long":59330},"file_sequence_number":{"long":59330},"data_file":{"content":0,"file_path":"abfss://<masked-path-prefix>/xxxx/xxx/<table-name>/data/created_at_day=2025-08-05/country_code=<country_code>/entity_number=<irrelevant>/00001-1754381332739-c55fd420-b673-45ec-be45-35408bf4e650-00001.parquet","file_format":"PARQUET","partition":{"created_at_day":{"int":20305},"country_code":{"string":"<country_code>"},"entity_number":{"string":"<irrelevant>"}},"record_count":307,"file_size_in_bytes":25282,"column_sizes":{"array":[]},"value_counts":{"array":[]},"null_value_counts":{"array":[]},"nan_value_counts":{"array":[]},"lower_bounds":{"array":[]},"upper_bounds":{"array":[]},"key_metadata":null,"split_offsets":{"array":[4]},"equality_ids":null,"sort_order_id":{"int":0},"referenced_data_file":null}}
```
   I masked some parts, but the relevant point is that the file path, record_count and file_size_in_bytes are identical. What differs are the snapshot_id, sequence_number and file_sequence_number.
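   For anyone hitting the same symptom, the duplicate can be confirmed programmatically. A minimal sketch in Python, assuming the manifest entries have been exported to JSON lines (as in the dump above; the paths and the helper name here are placeholders for illustration):

```python
import json
from collections import Counter

def find_duplicate_data_files(manifest_json_lines):
    """Return file paths referenced by more than one live manifest entry.

    Manifest-entry status codes: 0 = EXISTING, 1 = ADDED, 2 = DELETED;
    only non-deleted entries are counted.
    """
    counts = Counter()
    for line in manifest_json_lines:
        entry = json.loads(line)
        if entry["status"] != 2:
            counts[entry["data_file"]["file_path"]] += 1
    return [path for path, n in counts.items() if n > 1]

# Two entries referencing the same parquet file, as in the dump above
# (fields trimmed and paths shortened for illustration):
entries = [
    '{"status": 0, "snapshot_id": 1, "sequence_number": 59331, "data_file": {"file_path": "data/p=1/00001-a.parquet"}}',
    '{"status": 0, "snapshot_id": 2, "sequence_number": 59330, "data_file": {"file_path": "data/p=1/00001-a.parquet"}}',
]
print(find_duplicate_data_files(entries))  # -> ['data/p=1/00001-a.parquet']
```

   A non-empty result reproduces exactly the condition Snowflake rejects: one data file listed by two live entries in the same snapshot.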
   
   Running our maintenance job again with manifest_rewrite while partial-progress.enabled was still true did not solve the issue. After disabling that option, a single run of manifest_rewrite fixed it.
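   To make the effect of that fix concrete, here is an illustrative sketch (this is NOT Iceberg's actual rewrite algorithm) of the invariant a successful manifest rewrite restores: each live data file appears exactly once, and where duplicates existed, only the entry with the latest file_sequence_number survives:

```python
def dedupe_entries(entries):
    """Keep one entry per data-file path, preferring the highest
    file_sequence_number. Illustrative only -- a sketch of the
    one-entry-per-file invariant, not Iceberg's implementation."""
    best = {}
    for e in entries:
        path = e["file_path"]
        if path not in best or e["file_sequence_number"] > best[path]["file_sequence_number"]:
            best[path] = e
    return list(best.values())

# Mirrors the duplicate above: same path, sequence numbers 59331 and 59330.
entries = [
    {"file_path": "data/00001-a.parquet", "file_sequence_number": 59331},
    {"file_path": "data/00001-a.parquet", "file_sequence_number": 59330},
    {"file_path": "data/00002-b.parquet", "file_sequence_number": 59330},
]
deduped = dedupe_entries(entries)  # 2 entries; 00001-a keeps 59331
```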
   
   
   ### Willingness to contribute
   
   - [ ] I can contribute a fix for this bug independently
   - [x] I would be willing to contribute a fix for this bug with guidance from 
the Iceberg community
   - [ ] I cannot contribute a fix for this bug at this time


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

