Tmonster opened a new issue, #13751:
URL: https://github.com/apache/iceberg/issues/13751
### Query engine
DuckDB-Iceberg
### Question
Hi Iceberg team,
I am currently implementing merge-on-read delete support in DuckDB-Iceberg.
I have a simple test case working in DuckDB, but I am encountering errors when
reading the table again in Spark. The table can be read with python-iceberg
just fine.
Below are the errors I am encountering and what I have tried.
- When writing a simple positional-deletes file with DuckDB (the file is
attached in the zip below), Spark fails with:
```
25/08/06 10:18:04 ERROR BaseDeleteLoader: Failed to process
GenericDeleteFile{content=position_deletes,
file_path=s3://warehouse/default/test_delete_table/data/0aa20b80-434e-4f65-becc-64bcd7ff9795-deletes.parquet,
file_format=PARQUET, spec_id=0, partition=PartitionData{}, record_count=5,
file_size_in_bytes=963, column_sizes=null, value_counts=null,
null_value_counts=null, nan_value_counts=null,
lower_bounds=org.apache.iceberg.SerializableByteBufferMap@53de5f70,
upper_bounds=org.apache.iceberg.SerializableByteBufferMap@53de5f70,
key_metadata=null, split_offsets=null, equality_ids=null, sort_order_id=null,
data_sequence_number=2, file_sequence_number=2, first_row_id=null,
referenced_data_file=s3://warehouse/default/test_delete_table/data/01987e74-d022-7109-8be2-7aa0a313d9fa.parquet,
content_offset=null, content_size_in_bytes=null}
java.lang.IllegalArgumentException: Missing required field: file_path
  at org.apache.iceberg.data.parquet.BaseParquetReaders$ReadBuilder.defaultReader(BaseParquetReaders.java:269)
  at org.apache.iceberg.data.parquet.BaseParquetReaders$ReadBuilder.struct(BaseParquetReaders.java:252)
  at org.apache.iceberg.data.parquet.BaseParquetReaders$ReadBuilder.message(BaseParquetReaders.java:219)
  at org.apache.iceberg.data.parquet.BaseParquetReaders$ReadBuilder.message(BaseParquetReaders.java:207)
  at org.apache.iceberg.parquet.TypeWithSchemaVisitor.visit(TypeWithSchemaVisitor.java:48)
```
This was using the file `duckdb-deletes-no-required.parquet`.
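For reference, the delete file follows the two-column positional-delete layout (`file_path`, `pos`). Its Parquet schema can be inspected like this (a minimal sketch; the file name comes from the attached zip and pyarrow is assumed to be available):

```python
import pyarrow.parquet as pq

# Print the Parquet schema of the DuckDB-written positional delete file.
# Per the Iceberg spec, a position delete file carries a file_path (string)
# column and a pos (long) column.
print(pq.read_schema("duckdb-deletes-no-required.parquet"))
```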
I thought this might be because DuckDB Parquet files don't mark the
`file_path` column as required, so I tried writing a Parquet file with a
required `file_path` field. I got the same error as above:
```
25/08/06 10:03:07 WARN CheckAllocator: More than one
DefaultAllocationManager on classpath. Choosing first found
25/08/06 10:03:07 ERROR BaseDeleteLoader: Failed to process
GenericDeleteFile{content=position_deletes,
file_path=s3://warehouse/default/test_delete_table/data/353d5bc1-38d4-4fab-b9a6-7a665453e3e1-deletes.parquet,
file_format=PARQUET, spec_id=0, partition=PartitionData{}, record_count=5,
file_size_in_bytes=967, column_sizes=null, value_counts=null,
null_value_counts=null, nan_value_counts=null,
lower_bounds=org.apache.iceberg.SerializableByteBufferMap@10b77050,
upper_bounds=org.apache.iceberg.SerializableByteBufferMap@10b77050 ....
java.lang.IllegalArgumentException: Missing required field: file_path
  at org.apache.iceberg.data.parquet.BaseParquetReaders$ReadBuilder.defaultReader(BaseParquetReaders.java:269)
  at org.apache.iceberg.data.parquet.BaseParquetReaders$ReadBuilder.struct(BaseParquetReaders.java:252)
```
This was using the file `duckdb-deletes-required-columns.parquet`.
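Roughly, that second attempt produced a file like the one sketched below, written here with pyarrow rather than DuckDB so it is easy to reproduce. The referenced data-file path and the row positions are placeholders, not the real values:

```python
import pyarrow as pa
import pyarrow.parquet as pq

# Both columns are marked required (non-nullable), mirroring the
# duckdb-deletes-required-columns.parquet attempt.
schema = pa.schema([
    pa.field("file_path", pa.string(), nullable=False),
    pa.field("pos", pa.int64(), nullable=False),
])
# Placeholder data-file path and positions.
rows = [{"file_path": "s3://warehouse/default/test_delete_table/data/<data-file>.parquet",
         "pos": i} for i in range(5)]
pq.write_table(pa.Table.from_pylist(rows, schema=schema),
               "duckdb-deletes-required-columns.parquet")
```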
I then tried writing the positional-delete file contents using Spark and
uploading the resulting file to the location where the original DuckDB delete
file was placed. I thought that at the very least this should **not** error.
However, the resulting error is even more confusing:
```
25/08/06 10:10:25 ERROR BaseDeleteLoader: Failed to process
GenericDeleteFile{content=position_deletes,
file_path=s3://warehouse/default/test_delete_table/data/353d5bc1-38d4-4fab-b9a6-7a665453e3e1-deletes.parquet,
file_format=PARQUET, spec_id=0, partition=PartitionData{}, record_count=5,
file_size_in_bytes=967, column_sizes=null, value_counts=null,
null_value_counts=null, nan_value_counts=null,
lower_bounds=org.apache.iceberg.SerializableByteBufferMap@10b77050,
upper_bounds=org.apache.iceberg.SerializableByteBufferMap@10b77050,
key_metadata=null, split_offsets=null, equality_ids=null, sort_order_id=null,
data_sequence_number=2, file_sequence_number=2, first_row_id=null,
referenced_data_file=s3://warehouse/default/test_delete_table/data/01987e63-6f21-7e85-995f-d09c8e68524d.parquet,
content_offset=null, content_size_in_bytes=null}
java.lang.RuntimeException:
org.apache.iceberg.parquet.ParquetIO$ParquetInputFile@41172525 is not a Parquet
file. Expected magic number at tail, but found [0, 0, 0, 25]
  at org.apache.iceberg.shaded.org.apache.parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:602)
  at org.apache.iceberg.shaded.org.apache.parquet.hadoop.ParquetFileReader.<init>(ParquetFileReader.java:934)
  at org.apache.iceberg.shaded.org.apache.parquet.hadoop.ParquetFileReader.<init>(ParquetFileReader.java:925)
  at org.apache.iceberg.shaded.org.apache.parquet.hadoop.ParquetFileReader.open(ParquetFileReader.java:710)
  at org.apache.iceberg.parquet.ReadConf.newReader(ReadConf.java:194)
  at org.apache.iceberg.parquet.ReadConf.<init>(ReadConf.java:76)
  at org.apache.iceberg.parquet.ParquetReader.init(ParquetReader.java:71)
  at org.apache.iceberg.parquet.ParquetReader.iterator(ParquetReader.java:91)
```
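For completeness, the Spark-side write for this third attempt was along the lines of the sketch below (not the exact code I ran; the data-file path, positions, and output directory are placeholders):

```python
from pyspark.sql import SparkSession
from pyspark.sql.types import LongType, StringType, StructField, StructType

spark = SparkSession.builder.getOrCreate()

# Write the same (file_path, pos) rows as a plain Parquet file with Spark,
# then upload the single part file over the original DuckDB delete file.
schema = StructType([
    StructField("file_path", StringType(), nullable=False),
    StructField("pos", LongType(), nullable=False),
])
rows = [("s3://warehouse/default/test_delete_table/data/<data-file>.parquet", i)
        for i in range(5)]
spark.createDataFrame(rows, schema).coalesce(1) \
    .write.mode("overwrite").parquet("/tmp/spark-deletes")
```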
All three delete files are zipped here.
[delete-files.zip](https://github.com/user-attachments/files/21614486/delete-files.zip)
Each of the attached files can be read with a normal Spark session.
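For example, a direct Parquet read of any of the attached files succeeds (a minimal check; the file name is one of the three in the zip):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
# Reading the delete file directly as plain Parquet works without error.
spark.read.parquet("duckdb-deletes-no-required.parquet").show(truncate=False)
```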
Any advice on what I might be missing would be appreciated. I find it odd
that python-iceberg can read the position delete files fine while Spark
cannot, which leads me to believe Spark expects some field or Parquet schema
element that is not being written.
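For reference, the python-iceberg read that works is essentially the following (the catalog name and configuration come from my local setup and are assumptions here):

```python
from pyiceberg.catalog import load_catalog

# Load the table through PyIceberg and materialize it; the position deletes
# are applied without error.
catalog = load_catalog("default")
table = catalog.load_table("default.test_delete_table")
print(table.scan().to_arrow())
```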
Thanks,
Tom Ebergen