Tmonster opened a new issue, #13751:
URL: https://github.com/apache/iceberg/issues/13751

   ### Query engine
   
   DuckDB-Iceberg
   
   ### Question
   
   Hi Iceberg team,
   
   I am currently implementing merge-on-read delete support in DuckDB-Iceberg. I have a simple test case working in DuckDB, but I am encountering errors when reading the table again in Spark. The table can be read with python-iceberg just fine.
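   For context, the pyiceberg read that succeeds looks roughly like this (a minimal sketch; the catalog name and table identifier are placeholders for my local setup):
   ```python
   from pyiceberg.catalog import load_catalog

   # Placeholder catalog/table names; the real table is the
   # s3://warehouse/default/test_delete_table shown in the errors below.
   catalog = load_catalog("default")
   table = catalog.load_table("default.test_delete_table")
   # Merge-on-read: the position deletes are applied during the scan.
   print(table.scan().to_arrow())
   ```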
   
   Below are the errors I am encountering and what I have tried.
   
   - When writing a simple DuckDB positional-deletes file (attached in the zip), I get the following error:
   ```
   25/08/06 10:18:04 ERROR BaseDeleteLoader: Failed to process 
GenericDeleteFile{content=position_deletes, 
file_path=s3://warehouse/default/test_delete_table/data/0aa20b80-434e-4f65-becc-64bcd7ff9795-deletes.parquet,
 file_format=PARQUET, spec_id=0, partition=PartitionData{}, record_count=5, 
file_size_in_bytes=963, column_sizes=null, value_counts=null, 
null_value_counts=null, nan_value_counts=null, 
lower_bounds=org.apache.iceberg.SerializableByteBufferMap@53de5f70, 
upper_bounds=org.apache.iceberg.SerializableByteBufferMap@53de5f70, 
key_metadata=null, split_offsets=null, equality_ids=null, sort_order_id=null, 
data_sequence_number=2, file_sequence_number=2, first_row_id=null, 
referenced_data_file=s3://warehouse/default/test_delete_table/data/01987e74-d022-7109-8be2-7aa0a313d9fa.parquet,
 content_offset=null, content_size_in_bytes=null}
   java.lang.IllegalArgumentException: Missing required field: file_path
        at 
org.apache.iceberg.data.parquet.BaseParquetReaders$ReadBuilder.defaultReader(BaseParquetReaders.java:269)
        at 
org.apache.iceberg.data.parquet.BaseParquetReaders$ReadBuilder.struct(BaseParquetReaders.java:252)
        at 
org.apache.iceberg.data.parquet.BaseParquetReaders$ReadBuilder.message(BaseParquetReaders.java:219)
        at 
org.apache.iceberg.data.parquet.BaseParquetReaders$ReadBuilder.message(BaseParquetReaders.java:207)
        at 
org.apache.iceberg.parquet.TypeWithSchemaVisitor.visit(TypeWithSchemaVisitor.java:48)
   ```
   This was using file `duckdb-deletes-no-required.parquet`
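   For reference, this is how I am inspecting the Parquet schema of the attached files; pyarrow prints the `field_id` annotations, which (as far as I understand) are what Iceberg matches columns by:
   ```python
   import pyarrow.parquet as pq

   # Print the raw Parquet schema, including field_id annotations.
   # My understanding is that Iceberg resolves columns by field ID,
   # not by name, so a missing/unset field_id could explain the error.
   print(pq.ParquetFile("duckdb-deletes-no-required.parquet").schema)
   ```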
   
   
   
   I thought this might be because DuckDB Parquet files don't label `file_path` as required, so I tried writing a Parquet file with a required `file_path` field. I got the same error as above:
   ```
   25/08/06 10:03:07 WARN CheckAllocator: More than one 
DefaultAllocationManager on classpath. Choosing first found
   25/08/06 10:03:07 ERROR BaseDeleteLoader: Failed to process 
GenericDeleteFile{content=position_deletes, 
file_path=s3://warehouse/default/test_delete_table/data/353d5bc1-38d4-4fab-b9a6-7a665453e3e1-deletes.parquet,
 file_format=PARQUET, spec_id=0, partition=PartitionData{}, record_count=5, 
file_size_in_bytes=967, column_sizes=null, value_counts=null, 
null_value_counts=null, nan_value_counts=null, 
lower_bounds=org.apache.iceberg.SerializableByteBufferMap@10b77050, 
upper_bounds=org.apache.iceberg.SerializableByteBufferMap@10b77050 ....
   java.lang.IllegalArgumentException: Missing required field: file_path
        at 
org.apache.iceberg.data.parquet.BaseParquetReaders$ReadBuilder.defaultReader(BaseParquetReaders.java:269)
       at 
org.apache.iceberg.data.parquet.BaseParquetReaders$ReadBuilder.struct(BaseParquetReaders.java:252)
   ```
   This was using file `duckdb-deletes-required-columns.parquet`. 
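   In case it is relevant: here is a minimal sketch of how I can produce a position-delete Parquet file whose schema carries the field IDs the Iceberg spec reserves for position deletes (2147483546 for `file_path`, 2147483545 for `pos`); the row values are placeholders:
   ```python
   import pyarrow as pa
   import pyarrow.parquet as pq

   # Iceberg spec: position delete files use the reserved field IDs
   #   file_path -> 2147483546, pos -> 2147483545
   schema = pa.schema([
       pa.field("file_path", pa.string(), nullable=False,
                metadata={"PARQUET:field_id": "2147483546"}),
       pa.field("pos", pa.int64(), nullable=False,
                metadata={"PARQUET:field_id": "2147483545"}),
   ])
   # Placeholder rows: delete positions 0-4 of one data file.
   table = pa.table(
       {
           "file_path": ["s3://warehouse/default/test_delete_table/data/<data-file>.parquet"] * 5,
           "pos": list(range(5)),
       },
       schema=schema,
   )
   pq.write_table(table, "deletes-with-field-ids.parquet")
   ```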
   
   
   I then tried writing the positional delete file contents using Spark and uploading the resulting file to the location where the original DuckDB delete file was placed. I thought that at the very least this should **not** error. However, the resulting error is even more confusing:
   ```
   25/08/06 10:10:25 ERROR BaseDeleteLoader: Failed to process 
GenericDeleteFile{content=position_deletes, 
file_path=s3://warehouse/default/test_delete_table/data/353d5bc1-38d4-4fab-b9a6-7a665453e3e1-deletes.parquet,
 file_format=PARQUET, spec_id=0, partition=PartitionData{}, record_count=5, 
file_size_in_bytes=967, column_sizes=null, value_counts=null, 
null_value_counts=null, nan_value_counts=null, 
lower_bounds=org.apache.iceberg.SerializableByteBufferMap@10b77050, 
upper_bounds=org.apache.iceberg.SerializableByteBufferMap@10b77050, 
key_metadata=null, split_offsets=null, equality_ids=null, sort_order_id=null, 
data_sequence_number=2, file_sequence_number=2, first_row_id=null, 
referenced_data_file=s3://warehouse/default/test_delete_table/data/01987e63-6f21-7e85-995f-d09c8e68524d.parquet,
 content_offset=null, content_size_in_bytes=null}
   java.lang.RuntimeException: 
org.apache.iceberg.parquet.ParquetIO$ParquetInputFile@41172525 is not a Parquet 
file. Expected magic number at tail, but found [0, 0, 0, 25]
        at 
org.apache.iceberg.shaded.org.apache.parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:602)
        at 
org.apache.iceberg.shaded.org.apache.parquet.hadoop.ParquetFileReader.<init>(ParquetFileReader.java:934)
        at 
org.apache.iceberg.shaded.org.apache.parquet.hadoop.ParquetFileReader.<init>(ParquetFileReader.java:925)
        at 
org.apache.iceberg.shaded.org.apache.parquet.hadoop.ParquetFileReader.open(ParquetFileReader.java:710)
        at org.apache.iceberg.parquet.ReadConf.newReader(ReadConf.java:194)
        at org.apache.iceberg.parquet.ReadConf.<init>(ReadConf.java:76)
        at org.apache.iceberg.parquet.ParquetReader.init(ParquetReader.java:71)
        at 
org.apache.iceberg.parquet.ParquetReader.iterator(ParquetReader.java:91)
   ```
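   One thing I have not ruled out for this last case: my understanding is that Iceberg opens delete files using the `file_size_in_bytes` recorded in the manifest (967 here), so overwriting the file in place with a Spark-written file of a different length could make the footer lookup land at the wrong offset. A quick check, with a local copy of the uploaded file (path is a placeholder):
   ```python
   import os

   # If the actual size differs from the recorded file_size_in_bytes=967,
   # a reader that trusts the recorded length would look for the Parquet
   # footer magic at the wrong offset.
   print(os.path.getsize("spark-written-deletes.parquet"))
   ```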
   
   All three delete files are zipped here.
   
[delete-files.zip](https://github.com/user-attachments/files/21614486/delete-files.zip)
   
   Each of the attached files can be read with a normal Spark session.
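   (By a normal Spark session I mean a plain Parquet read that bypasses Iceberg, along these lines; the path points at a local copy of an attached file:)
   ```python
   # Plain Parquet read through an existing SparkSession `spark`;
   # no Iceberg metadata is involved here.
   df = spark.read.parquet("delete-files/duckdb-deletes-no-required.parquet")
   df.printSchema()
   df.show()
   ```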
   
   Any advice as to what I might be missing would be appreciated. I find it odd that python-iceberg can read the position delete files fine while Spark cannot; this leads me to believe that Spark assumes some field or Parquet schema element exists.
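   For completeness, the Spark read that fails is just a table scan through the Iceberg catalog, roughly (catalog and session setup omitted):
   ```python
   # Scanning through Iceberg is what triggers BaseDeleteLoader and the
   # errors shown above.
   spark.table("default.test_delete_table").show()
   ```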
   
   Thanks,
   Tom Ebergen
   

