zhangbutao commented on code in PR #10449: URL: https://github.com/apache/iceberg/pull/10449#discussion_r1635743782
########## mr/src/main/java/org/apache/iceberg/mr/mapreduce/IcebergInputFormat.java: ########## @@ -125,11 +125,9 @@ public List<InputSplit> getSplits(JobContext context) { } String schemaStr = conf.get(InputFormatConfig.READ_SCHEMA); if (schemaStr != null) { - scan.project(SchemaParser.fromJson(schemaStr)); - } - String[] selectedColumns = conf.getStrings(InputFormatConfig.SELECTED_COLUMNS); - if (selectedColumns != null) { - scan.select(selectedColumns); + scan = scan.project(SchemaParser.fromJson(schemaStr)); + } else if (conf.getStrings(InputFormatConfig.SELECTED_COLUMNS) != null) { Review Comment: @pvary Your comment made me think further. :) > We allow the user to set columns and schema, but behind the scenes we decide to use only one of them? I don't think we should allow user to set the` projected columns`, as this is a internal query optimization in engine like hive/spark. This conf related to `projected columns `should not be exposed to the end user. So, before this comment, what i have said was based on that user can not set the conf about ` projected columns`. In hive, the conf related to iceberg ` projected columns` are `iceberg.mr.read.schema` & `iceberg.mr.selected.columns`, and `iceberg.mr.selected.columns` will be always overwritten by `hive.io.file.readcolumn.names`, Check https://github.com/apache/iceberg/blob/282fa73b2c2b51b1d519a91e8bb05b2d2bdf3cf4/mr/src/main/java/org/apache/iceberg/mr/hive/HiveIcebergInputFormat.java#L99-L100 . And `hive.io.file.readcolumn.names` starts with prefix **hive** but not a valid propertity in `HiveConf.java`, so user can not set the value of `hive.io.file.readcolumn.names`. https://github.com/apache/hive/blob/33cadc5b498b57f779cddc9bf4e3f8aef2d9f6dc/ql/src/java/org/apache/hadoop/hive/ql/processors/SetProcessor.java#L254-L255 ``` 0: jdbc:hive2://127.0.0.1:10000/default> set hive.io.file.readcolumn.names=id; Error: Error while processing statement: hive configuration hive.io.file.readcolumn.names does not exists. (state=42000,code=1) ``` That is to say, user can not change the vlaue of `hive.io.file.readcolumn.names` & `iceberg.mr.selected.columns`, and i think this behavior is reasonable. For `iceberg.mr.read.schema` which is used by project shcemas, hive does not actually set the value internally & always null. So, if user don't set & change its value, hive will never use it. But now, your comments made me think that user may can change the value of `iceberg.mr.read.schema`. if user set the value of `iceberg.mr.read.schema`, it can be changed by normal and will lead to the side effect when initializing iceberg reader(`iceberg.mr.read.schema` is chosen as a priority over `iceberg.mr.selected.columns`), check: https://github.com/apache/iceberg/blob/282fa73b2c2b51b1d519a91e8bb05b2d2bdf3cf4/mr/src/main/java/org/apache/iceberg/mr/mapreduce/IcebergInputFormat.java#L490 **The last, what i want to say is that we should not allow to set the conf related project columns.** I think we can remove the use of `iceberg.mr.read.schema` as it actually not be used by hive internal & it can lead to side effect. If you have any other good suggesstion, please comment here. Many thanks. Also cc @deniskuzZ -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org --------------------------------------------------------------------- To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org