arnaudbriche opened a new issue, #305: URL: https://github.com/apache/iceberg-go/issues/305
### Apache Iceberg version

main (development)

### Please describe the bug 🐞

I'm trying to use the package to create and maintain Iceberg tables from independently generated Parquet files on S3. I'm using the various builders to create and persist the Avro and JSON metadata files to S3. I'm hitting an issue with ManifestV2, more specifically the DataFile structure.

It looks like this package uses static Avro schema definitions in JSON format; here is the schema for ManifestEntryV2:

```json
{
  "type": "record",
  "name": "manifest_entry",
  "fields": [
    {"name": "status", "type": "int", "field-id": 0},
    {"name": "snapshot_id", "type": ["null", "long"], "field-id": 1},
    {"name": "sequence_number", "type": ["null", "long"], "field-id": 3},
    {"name": "file_sequence_number", "type": ["null", "long"], "field-id": 4},
    {
      "name": "data_file",
      "type": {
        "type": "record",
        "name": "r2",
        "fields": [
          {"name": "content", "type": "int", "doc": "Type of content stored by the data file", "field-id": 134},
          {"name": "file_path", "type": "string", "doc": "Location URI with FS scheme", "field-id": 100},
          {"name": "file_format", "type": "string", "doc": "File format name: avro, orc, or parquet", "field-id": 101},
          {
            "name": "partition",
            "type": {
              "type": "record",
              "name": "r102",
              "fields": [
                {"field-id": 1000, "name": "VendorID", "type": ["null", "int"]},
                {"field-id": 1001, "name": "tpep_pickup_datetime", "type": ["null", {"type": "int", "logicalType": "date"}]}
              ]
            },
            "field-id": 102
          },
          {"name": "record_count", "type": "long", "doc": "Number of records in the file", "field-id": 103},
          {"name": "file_size_in_bytes", "type": "long", "doc": "Total file size in bytes", "field-id": 104},
          {
            "name": "column_sizes",
            "type": ["null", {
              "type": "array",
              "items": {
                "type": "record",
                "name": "k117_v118",
                "fields": [
                  {"name": "key", "type": "int", "field-id": 117},
                  {"name": "value", "type": "long", "field-id": 118}
                ]
              },
              "logicalType": "map"
            }],
            "doc": "Map of column id to total size on disk",
            "field-id": 108
          },
          {
            "name": "value_counts",
            "type": ["null", {
              "type": "array",
              "items": {
                "type": "record",
                "name": "k119_v120",
                "fields": [
                  {"name": "key", "type": "int", "field-id": 119},
                  {"name": "value", "type": "long", "field-id": 120}
                ]
              },
              "logicalType": "map"
            }],
            "doc": "Map of column id to total count, including null and NaN",
            "field-id": 109
          },
          {
            "name": "null_value_counts",
            "type": ["null", {
              "type": "array",
              "items": {
                "type": "record",
                "name": "k121_v122",
                "fields": [
                  {"name": "key", "type": "int", "field-id": 121},
                  {"name": "value", "type": "long", "field-id": 122}
                ]
              },
              "logicalType": "map"
            }],
            "doc": "Map of column id to null value count",
            "field-id": 110
          },
          {
            "name": "nan_value_counts",
            "type": ["null", {
              "type": "array",
              "items": {
                "type": "record",
                "name": "k138_v139",
                "fields": [
                  {"name": "key", "type": "int", "field-id": 138},
                  {"name": "value", "type": "long", "field-id": 139}
                ]
              },
              "logicalType": "map"
            }],
            "doc": "Map of column id to number of NaN values in the column",
            "field-id": 137
          },
          {
            "name": "lower_bounds",
            "type": ["null", {
              "type": "array",
              "items": {
                "type": "record",
                "name": "k126_v127",
                "fields": [
                  {"name": "key", "type": "int", "field-id": 126},
                  {"name": "value", "type": "bytes", "field-id": 127}
                ]
              },
              "logicalType": "map"
            }],
            "doc": "Map of column id to lower bound",
            "field-id": 125
          },
          {
            "name": "upper_bounds",
            "type": ["null", {
              "type": "array",
              "items": {
                "type": "record",
                "name": "k129_v130",
                "fields": [
                  {"name": "key", "type": "int", "field-id": 129},
                  {"name": "value", "type": "bytes", "field-id": 130}
                ]
              },
              "logicalType": "map"
            }],
            "doc": "Map of column id to upper bound",
            "field-id": 128
          },
          {"name": "key_metadata", "type": ["null", "bytes"], "doc": "Encryption key metadata blob", "field-id": 131},
          {"name": "split_offsets", "type": ["null", {"type": "array", "items": "long", "element-id": 133}], "doc": "Splittable offsets", "field-id": 132},
          {"name": "equality_ids", "type": ["null", {"type": "array", "items": "int", "element-id": 136}], "doc": "Field ids used to determine row equality for delete files", "field-id": 135},
          {"name": "sort_order_id", "type": ["null", "int"], "doc": "Sort order ID", "field-id": 140}
        ]
      },
      "field-id": 2
    }
  ]
}
```

The part that is causing the issue is this:

```json
{
  "name": "partition",
  "type": {
    "type": "record",
    "name": "r102",
    "fields": [
      {"field-id": 1000, "name": "VendorID", "type": ["null", "int"]},
      {"field-id": 1001, "name": "tpep_pickup_datetime", "type": ["null", {"type": "int", "logicalType": "date"}]}
    ]
  },
  "field-id": 102
}
```

This is clearly not the right schema type for `partition` according to the spec; it looks more like an example copied from the docs. Here's what the spec says about the `partition` field:

| v1 | v2 | v3 | Field id, name | Type | Description |
| -- | -- | -- | -- | -- | -- |
| required | required | required | **102 partition** | `struct<...>` | Partition data tuple, schema based on the partition spec output using partition field ids for the struct field ids |

This is not entirely clear to me, but it sounds like the type is a dynamically generated Avro record type, and I don't see how that can be implemented with the current static Avro schema approach. As expected, my experiment fails with the following error message: `"Data: PartitionData: avro: missing required field VendorID"`.