royantman commented on issue #13218:
URL: https://github.com/apache/iceberg/issues/13218#issuecomment-3124695543

   I assume your external Parquet files were written using a "normal" Parquet 
writer, so the schema in their footer is missing field ids.
   
   The current Java code adds field ids to the MessageType with 
`getParquetTypeWithIds()`, based on the Iceberg table's 
`schema.name-mapping.default` property, and uses the result to convert 
(rename) and prune (delete) the schema. However, the MessageType object that is 
passed to the `ParquetMetrics.metrics()` method is the schema without ids.
   
   Inside `ParquetMetrics.metrics()`, fields that have no field id are 
skipped because of:
   ```java
   if (id == null) {
      continue;
   }
   ```
   Forking the library and calling `metrics()` with `parquetTypeWithIds` instead 
of `messageType` should give you the expected statistics.
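
   To illustrate the effect, here is a minimal standalone sketch. It does not 
use any Iceberg or Parquet classes: `Column`, `collectValueCounts` and 
`applyNameMapping` are hypothetical names that only imitate the skip shown above 
and the idea of assigning ids by column name.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class FieldIdMetricsSketch {

    // Hypothetical stand-in for a Parquet column. fieldId is null when the
    // writer did not store field ids in the footer schema.
    static final class Column {
        final String name;
        final Integer fieldId;
        final long valueCount;

        Column(String name, Integer fieldId, long valueCount) {
            this.name = name;
            this.fieldId = fieldId;
            this.valueCount = valueCount;
        }
    }

    // Imitates the skip: a column without an id contributes no statistics.
    static Map<Integer, Long> collectValueCounts(List<Column> columns) {
        Map<Integer, Long> counts = new LinkedHashMap<>();
        for (Column column : columns) {
            if (column.fieldId == null) {
                continue;
            }
            counts.put(column.fieldId, column.valueCount);
        }
        return counts;
    }

    // Assigns ids by column name, loosely analogous to applying a name mapping.
    static List<Column> applyNameMapping(List<Column> columns, Map<String, Integer> nameToId) {
        List<Column> result = new ArrayList<>();
        for (Column column : columns) {
            result.add(new Column(column.name, nameToId.get(column.name), column.valueCount));
        }
        return result;
    }

    public static void main(String[] args) {
        List<Column> footerSchema = new ArrayList<>();
        footerSchema.add(new Column("id", null, 100L));
        footerSchema.add(new Column("name", null, 100L));

        Map<String, Integer> nameToId = new LinkedHashMap<>();
        nameToId.put("id", 1);
        nameToId.put("name", 2);

        // Schema without ids: every column is skipped.
        System.out.println(collectValueCounts(footerSchema)); // prints {}
        // Schema with ids applied: statistics are kept per field id.
        System.out.println(collectValueCounts(applyNameMapping(footerSchema, nameToId))); // prints {1=100, 2=100}
    }
}
```

   In this sketch, passing the id-less schema yields an empty result, while 
passing the schema with ids applied yields one entry per column.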
   
   If you are willing to use the `apache/iceberg-go` library to perform 
`AddFiles()`, it works well and even populates 
`schema.name-mapping.default` for you (saving it to the table properties if it 
does not already exist).
   
   1) It seems strange to me that the two language libraries are not aligned.
   2) I really don't understand why the Java library doesn't pass the schema 
with ids, given that it already translates it and that without ids no 
statistics are collected (besides row counts).


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

