SreeramGarlapati opened a new issue, #2507:
URL: https://github.com/apache/iceberg-rust/issues/2507

   ### Apache Iceberg Rust version
   
   `main` (observed against the v0.9 line; same code present on `main` at the 
time of filing).
   
   ### Describe the bug
   
   `FastAppendAction::commit` with `with_check_duplicate(true)` (the default) 
does not detect duplicate `file_path` values **within the batch being 
appended**. Two `DataFile` entries carrying the same `file_path` in a single 
`add_data_files(...)` call are accepted, written into the manifest, and 
committed without error.
   
   The root cause is in `SnapshotProducer::validate_duplicate_files` 
(`crates/iceberg/src/transaction/snapshot.rs`):
   
   ```rust
   let new_files: HashSet<&str> = self
       .added_data_files
       .iter()
       .map(|df| df.file_path.as_str())
       .collect();
   ```
   
   Collecting into a `HashSet` silently dedupes the batch before the check 
runs. The subsequent loop then only compares the (already-deduped) set against 
entries in existing manifests — it never observes that the batch itself 
contained the same path twice. The pre-existing fix for #1394 only addressed 
the cross-snapshot side of the check; the intra-batch side has the same shape 
of bug.
   
   ### Impact
   
   Once the duplicate-bearing snapshot is committed:
   
   - A scan reads the same physical file twice, returning duplicate rows.
   - `added_files_count` / `added_records` in the snapshot summary are inflated 
relative to physical storage.
   - Nothing in the commit path surfaces the violation — corruption is silent.
   
   This is reachable from any upstream file-discovery or retry path that can 
hand the same path to `add_data_files` more than once.
   
   ### To Reproduce
   
   ```rust
   let table = make_v2_minimal_table();
   let tx = Transaction::new(&table);
   
   let mk = |size, records| DataFileBuilder::default()
       .content(DataContentType::Data)
       .file_path(\"test/dup.parquet\".to_string()) // identical path
       .file_format(DataFileFormat::Parquet)
       .file_size_in_bytes(size)
       .record_count(records)
       .partition_spec_id(table.metadata().default_partition_spec_id())
       .partition(Struct::from_iter([Some(Literal::long(1))]))
       .build()
       .unwrap();
   
   let action = tx
       .fast_append()
       .with_check_duplicate(true)
       .add_data_files(vec![mk(100, 10), mk(200, 20)]);
   
   // Today: returns Ok and writes a manifest with two entries pointing at the 
same file.
   // Expected: returns Err(DataInvalid).
   Arc::new(action).commit(&table).await.unwrap();
   ```
   
   ### Expected behavior
   
   `validate_duplicate_files` should reject any commit whose batch contains the 
same `file_path` more than once, with `ErrorKind::DataInvalid` and a message 
listing the offending paths.
   
   ### Willingness to contribute
   
   Yes — fix and tests prepared, PR incoming.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to