zhongyujiang commented on PR #6118:
URL: https://github.com/apache/iceberg/pull/6118#issuecomment-1305081026

   @rdblue sure.
   When collecting metrics from the Parquet footer, Iceberg 
[converts](https://github.com/apache/iceberg/blob/167a8ccd7c578296c40f8fc61c90135e71cf1183/parquet/src/main/java/org/apache/iceberg/parquet/ParquetUtil.java#L107)
 the file's MessageType to an Iceberg Schema and 
[uses](https://github.com/apache/iceberg/blob/167a8ccd7c578296c40f8fc61c90135e71cf1183/core/src/main/java/org/apache/iceberg/MetricsUtil.java#L56)
 this schema to look up the column name that each field ID maps to, and then 
uses that column name to get the corresponding metrics mode. 
   
   However, Iceberg escapes special characters in field names when 
converting an Iceberg Schema to a Parquet MessageType, and those escaped names 
cannot be restored when converting a Parquet MessageType back to an Iceberg 
Schema. In other words, we are currently using the escaped column names to look 
up their metrics modes, which may produce incorrect results because 
MetricsConfig does not recognize the escaped names. 
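
   To illustrate the mismatch, here is a minimal, self-contained sketch. It 
does not call any Iceberg API: `escape`, `metricsMode`, the column name 
`user-id`, and the `truncate(16)` fallback are all hypothetical stand-ins 
chosen for illustration, and the escaping rule is a simplification rather than 
the exact one Iceberg applies. 

```java
import java.util.HashMap;
import java.util.Map;

class EscapedNameLookupSketch {
  // Hypothetical stand-in for the escaping applied when writing Parquet:
  // any character other than a letter, digit, or underscore becomes _x<HEX>.
  static String escape(String name) {
    StringBuilder sb = new StringBuilder();
    for (char c : name.toCharArray()) {
      if (Character.isLetterOrDigit(c) || c == '_') {
        sb.append(c);
      } else {
        sb.append("_x").append(Integer.toHexString(c).toUpperCase());
      }
    }
    return sb.toString();
  }

  // Hypothetical stand-in for a per-column metrics mode lookup that falls
  // back to an assumed default when the column name is not configured.
  static String metricsMode(Map<String, String> columnModes, String column) {
    return columnModes.getOrDefault(column, "truncate(16)");
  }

  public static void main(String[] args) {
    // The user configures a mode under the original column name.
    Map<String, String> columnModes = new HashMap<>();
    columnModes.put("user-id", "full");

    String original = "user-id";
    String fromFooter = escape(original); // name as seen in the file schema

    System.out.println(fromFooter);                           // user_x2Did
    System.out.println(metricsMode(columnModes, original));   // full
    System.out.println(metricsMode(columnModes, fromFooter)); // truncate(16)
  }
}
```

   The lookup with the escaped name silently falls back to the default, which 
is the kind of incorrect result described above. 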
   
   The ORC path does not have this problem because special characters are not 
escaped when converting to an ORC schema, and ORC 
[itself](https://lists.apache.org/thread/93xbnbs0mr0zxx4fzvrz10m5mmd4qb5w) can 
handle any UTF-8 characters in column names.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

