a-agmon commented on issue #338: URL: https://github.com/apache/iceberg-rust/issues/338#issuecomment-2095268461
Thanks, @Fokko and @zeodtr, for the clarifications and explanations! I think that it's important to fix this in the next release, as the current situation is that the Rust API currently does not support Iceberg tables generated using the > 1.5.0 Iceberg Java library, which seems to be quite a limitation. We agree that we must use `field-id` for field resolution, but also that this might require a lot of work because the Avro Rust library at the moment uses `field-name` for serde. I have tried to implement a resolution that incorporates both requirements: it uses `field-id` when reading the Avro manifest files, but also does not make significant changes in the code flow and the usage of the Avro lib. It's not the most elegant solution, but seems to resolve the issue along these lines, at least as I tested. The solution consists of 3 stages: 1. reading the Manifest schema to a map of field-id -> field-name 2. reading the actual avro file and using its schema, and map it to map of field-id -> field-value 3. iterate the schema, and compose the avro records by using field-id to get the value read from the file. Here is just the main flow in the `parse_with_version()` function ```rust FormatVersion::V2 => { // 1. get a hashmap that maps the field_id to the field_name in the Manifest's schema let manifest_file_schema = Self::get_record_schema(MANIFEST_LIST_AVRO_SCHEMA_V2.clone())?; let manifest_file_schema_fields = Self::get_manifest_schema_fields_map(manifest_file_schema, true)?; // 2. get a hashmap that maps field_name to field_id in the schema of the read avro file let reader = Reader::new(bs)?; let file_schema = Self::get_record_schema(reader.writer_schema().clone())?; let file_schema_fields: HashMap<String, String> = Self::get_manifest_schema_fields_map(file_schema, false)?; // 3. get a vec of records from the read avro file . // each record is a hashmap of field_id and field_value let file_records = reader.collect::<std::result::Result<Vec<Value>, _>>()?; let file_records_values_map = Self::get_avro_records_as_map(file_records, file_schema_fields)?; // 4. for each record (manifest file) in the Avro file records maps, // traverse the schema of the manifest file: for each field id in the schema, get the field value from the record let manifest_records: Vec<Value> = file_records_values_map .into_iter() .map(|file_record_fields| { let fields_values: Vec<_> = manifest_file_schema_fields .iter() .filter_map(|(schem_field_id, schem_field_name)| { file_record_fields .get(schem_field_id) .map(|value| (schem_field_name.clone(), value.clone())) }) .collect(); Value::Record(fields_values) }) .collect(); let values = Value::Array(manifest_records); let manifest = from_value::<_serde::ManifestListV2>(&values)?; manifest.try_into(partition_type_provider) } ``` Please let me know what you think, (Its just a suggestion, ofc, its also possible to re-write this without the Avro lib) I'm also posting this in Slack for visibility as I think its sufficiently important. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org --------------------------------------------------------------------- To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org