arnaudbriche opened a new issue, #305: URL: https://github.com/apache/iceberg-go/issues/305
### Apache Iceberg version

main (development)

### Please describe the bug 🐞

I'm trying to use the package to create and maintain Iceberg tables from independently generated Parquet files on S3. I'm using the various builders to create and persist the Avro and JSON metadata files to S3. I'm hitting an issue with ManifestV2, more specifically the DataFile structure.

It looks like this package uses static Avro schema definitions in JSON format; here is the schema for ManifestEntryV2:

```json
{
  "type": "record",
  "name": "manifest_entry",
  "fields": [
    {"name": "status", "type": "int", "field-id": 0},
    {"name": "snapshot_id", "type": ["null", "long"], "field-id": 1},
    {"name": "sequence_number", "type": ["null", "long"], "field-id": 3},
    {"name": "file_sequence_number", "type": ["null", "long"], "field-id": 4},
    {
      "name": "data_file",
      "type": {
        "type": "record",
        "name": "r2",
        "fields": [
          {"name": "content", "type": "int", "doc": "Type of content stored by the data file", "field-id": 134},
          {"name": "file_path", "type": "string", "doc": "Location URI with FS scheme", "field-id": 100},
          {"name": "file_format", "type": "string", "doc": "File format name: avro, orc, or parquet", "field-id": 101},
          {
            "name": "partition",
            "type": {
              "type": "record",
              "name": "r102",
              "fields": [
                {"field-id": 1000, "name": "VendorID", "type": ["null", "int"]},
                {"field-id": 1001, "name": "tpep_pickup_datetime", "type": ["null", {"type": "int", "logicalType": "date"}]}
              ]
            },
            "field-id": 102
          },
          {"name": "record_count", "type": "long", "doc": "Number of records in the file", "field-id": 103},
          {"name": "file_size_in_bytes", "type": "long", "doc": "Total file size in bytes", "field-id": 104},
          {
            "name": "column_sizes",
            "type": ["null", {
              "type": "array",
              "items": {
                "type": "record",
                "name": "k117_v118",
                "fields": [
                  {"name": "key", "type": "int", "field-id": 117},
                  {"name": "value", "type": "long", "field-id": 118}
                ]
              },
              "logicalType": "map"
            }],
            "doc": "Map of column id to total size on disk",
            "field-id": 108
          },
          {
            "name": "value_counts",
            "type": ["null", {
              "type": "array",
              "items": {
                "type": "record",
                "name": "k119_v120",
                "fields": [
                  {"name": "key", "type": "int", "field-id": 119},
                  {"name": "value", "type": "long", "field-id": 120}
                ]
              },
              "logicalType": "map"
            }],
            "doc": "Map of column id to total count, including null and NaN",
            "field-id": 109
          },
          {
            "name": "null_value_counts",
            "type": ["null", {
              "type": "array",
              "items": {
                "type": "record",
                "name": "k121_v122",
                "fields": [
                  {"name": "key", "type": "int", "field-id": 121},
                  {"name": "value", "type": "long", "field-id": 122}
                ]
              },
              "logicalType": "map"
            }],
            "doc": "Map of column id to null value count",
            "field-id": 110
          },
          {
            "name": "nan_value_counts",
            "type": ["null", {
              "type": "array",
              "items": {
                "type": "record",
                "name": "k138_v139",
                "fields": [
                  {"name": "key", "type": "int", "field-id": 138},
                  {"name": "value", "type": "long", "field-id": 139}
                ]
              },
              "logicalType": "map"
            }],
            "doc": "Map of column id to number of NaN values in the column",
            "field-id": 137
          },
          {
            "name": "lower_bounds",
            "type": ["null", {
              "type": "array",
              "items": {
                "type": "record",
                "name": "k126_v127",
                "fields": [
                  {"name": "key", "type": "int", "field-id": 126},
                  {"name": "value", "type": "bytes", "field-id": 127}
                ]
              },
              "logicalType": "map"
            }],
            "doc": "Map of column id to lower bound",
            "field-id": 125
          },
          {
            "name": "upper_bounds",
            "type": ["null", {
              "type": "array",
              "items": {
                "type": "record",
                "name": "k129_v130",
                "fields": [
                  {"name": "key", "type": "int", "field-id": 129},
                  {"name": "value", "type": "bytes", "field-id": 130}
                ]
              },
              "logicalType": "map"
            }],
            "doc": "Map of column id to upper bound",
            "field-id": 128
          },
          {"name": "key_metadata", "type": ["null", "bytes"], "doc": "Encryption key metadata blob", "field-id": 131},
          {"name": "split_offsets", "type": ["null", {"type": "array", "items": "long", "element-id": 133}], "doc": "Splittable offsets", "field-id": 132},
          {"name": "equality_ids", "type": ["null", {"type": "array", "items": "int", "element-id": 136}], "doc": "Field ids used to determine row equality for delete files", "field-id": 135},
          {"name": "sort_order_id", "type": ["null", "int"], "doc": "Sort order ID", "field-id": 140}
        ]
      },
      "field-id": 2
    }
  ]
}
```

The part that is causing the issue is this:

```json
{
  "name": "partition",
  "type": {
    "type": "record",
    "name": "r102",
    "fields": [
      {"field-id": 1000, "name": "VendorID", "type": ["null", "int"]},
      {"field-id": 1001, "name": "tpep_pickup_datetime", "type": ["null", {"type": "int", "logicalType": "date"}]}
    ]
  },
  "field-id": 102
}
```

This is clearly not the right schema type for `partition` according to the spec; it looks more like an example copied from the docs. Here's what the spec says about the `partition` field:

| v1 | v2 | v3 | Field id, name | Type | Description |
| -- | -- | -- | -- | -- | -- |
| required | required | required | **102 partition** | `struct<...>` | Partition data tuple, schema based on the partition spec output using partition field ids for the struct field ids |

This is not entirely clear to me, but it sounds like the type is a dynamically generated Avro record type, and I don't see how that can be implemented with the current static Avro schema approach. As expected, my experiment fails with the following error message: `"Data: PartitionData: avro: missing required field VendorID"`.