hguercan opened a new issue, #13763:
URL: https://github.com/apache/iceberg/issues/13763
### Apache Iceberg version
None
### Query engine
None
### Please describe the bug 🐞
Hello everyone,
Our architecture streams data via Kafka and ingests it through Kafka Connect with the iceberg-sink-connector (Iceberg 1.9.2) into our Azure Blob Storage, against a Polaris catalog in version 1.0.0-incubating. Maintenance is done via DBX Spark in version 3.5.2, also with Iceberg version 3.5.2, with partial-progress.enabled. We recently changed that Spark option to true; before that, we did not see the issue.
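For reference, our data-file compaction call looks roughly like the following sketch (catalog and table names are placeholders, not our real ones):

```sql
-- Iceberg Spark procedure for compacting data files; 'partial-progress.enabled'
-- commits intermediate file groups instead of a single commit at the end.
CALL polaris.system.rewrite_data_files(
  table => 'db.events',
  options => map('partial-progress.enabled', 'true')
);
```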
When working with Spark there is no issue, but when we import the data into Snowflake the following error pops up:
```
SQL execution error: Duplicate file path found in the Iceberg metadata snapshot. Please check that your Iceberg metadata generation is producing valid manifest files and refresh to a newer snapshot once fixed. File path: '<irrelevant-prefix>created_at_day=2025-08-06/country_code=<country_code>/entity_number=<irrelevant>/00001-xxxxxxxxx*-.parquet'
```
Walking down from the snapshot to the manifest file, we can indeed see a duplicate entry for the file:
```json
{"status":0,"snapshot_id":{"long":7127004696002716753},"sequence_number":{"long":59331},"file_sequence_number":{"long":59331},"data_file":{"content":0,"file_path":"abfss://<masked-path-prefix>/xxxx/xxx/<table-name>/data/created_at_day=2025-08-05/country_code=<country_code>/entity_number=<irrelevant>/00001-1754381332739-c55fd420-b673-45ec-be45-35408bf4e650-00001.parquet","file_format":"PARQUET","partition":{"created_at_day":{"int":20305},"country_code":{"string":"<country_code>"},"entity_number":{"string":"<irrelevant>"}},"record_count":307,"file_size_in_bytes":25282,"column_sizes":{"array":[]},"value_counts":{"array":[]},"null_value_counts":{"array":[]},"nan_value_counts":{"array":[]},"lower_bounds":{"array":[]},"upper_bounds":{"array":[]},"key_metadata":null,"split_offsets":{"array":[4]},"equality_ids":null,"sort_order_id":{"int":0},"referenced_data_file":null}}
{"status":0,"snapshot_id":{"long":1026760103416329420},"sequence_number":{"long":59330},"file_sequence_number":{"long":59330},"data_file":{"content":0,"file_path":"abfss://<masked-path-prefix>/xxxx/xxx/<table-name>/data/created_at_day=2025-08-05/country_code=<country_code>/entity_number=<irrelevant>/00001-1754381332739-c55fd420-b673-45ec-be45-35408bf4e650-00001.parquet","file_format":"PARQUET","partition":{"created_at_day":{"int":20305},"country_code":{"string":"<country_code>"},"entity_number":{"string":"<irrelevant>"}},"record_count":307,"file_size_in_bytes":25282,"column_sizes":{"array":[]},"value_counts":{"array":[]},"null_value_counts":{"array":[]},"nan_value_counts":{"array":[]},"lower_bounds":{"array":[]},"upper_bounds":{"array":[]},"key_metadata":null,"split_offsets":{"array":[4]},"equality_ids":null,"sort_order_id":{"int":0},"referenced_data_file":null}}
```
I masked some parts, but the relevant point is that the file_path, record_count, and file_size_in_bytes are identical. What differs is the snapshot_id, sequence_number, and file_sequence_number.
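For anyone hitting the same thing, duplicates like this can also be surfaced directly from Spark via the table's entries metadata table (placeholder names again):

```sql
-- 'entries' lists the manifest entries of the current snapshot;
-- status 0 = EXISTING, 1 = ADDED, 2 = DELETED.
SELECT data_file.file_path, count(*) AS entry_count
FROM polaris.db.events.entries
WHERE status < 2           -- ignore deleted entries
GROUP BY data_file.file_path
HAVING count(*) > 1;
```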
Running our maintenance job again with a manifest rewrite while partial-progress.enabled was still true did not solve the issue. After disabling it, a single run of the manifest rewrite resolved the issue.
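Concretely, the step that cleaned things up corresponds to a call like this (again a sketch with placeholder names), run after partial progress was disabled in the compaction step above:

```sql
-- Rewrite (coalesce) the table's manifest files.
CALL polaris.system.rewrite_manifests(table => 'db.events');
```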
### Willingness to contribute
- [ ] I can contribute a fix for this bug independently
- [x] I would be willing to contribute a fix for this bug with guidance from
the Iceberg community
- [ ] I cannot contribute a fix for this bug at this time