Re: [PR] Hive: Return new scan after applying column project parameter [iceberg]

via GitHub Tue, 11 Jun 2024 19:50:10 -0700


zhangbutao commented on code in PR #10449:
URL: https://github.com/apache/iceberg/pull/10449#discussion_r1635743782



##########
mr/src/main/java/org/apache/iceberg/mr/mapreduce/IcebergInputFormat.java:
##########
@@ -125,11 +125,9 @@ public List<InputSplit> getSplits(JobContext context) {
     }
     String schemaStr = conf.get(InputFormatConfig.READ_SCHEMA);
     if (schemaStr != null) {
-      scan.project(SchemaParser.fromJson(schemaStr));
-    }
-    String[] selectedColumns = 
conf.getStrings(InputFormatConfig.SELECTED_COLUMNS);
-    if (selectedColumns != null) {
-      scan.select(selectedColumns);
+      scan = scan.project(SchemaParser.fromJson(schemaStr));
+    } else if (conf.getStrings(InputFormatConfig.SELECTED_COLUMNS) != null) {

Review Comment:
   @pvary Your comment made me think further. :)
   
   > We allow the user to set columns and schema, but behind the scenes we 
decide to use only one of them?
   
   I don't think we should allow user to set the` projected columns`, as this 
is a internal query optimization in engine like hive/spark. This conf related 
to `projected columns `should not be exposed to the end user.
   
   So, before this comment, what i have said was based on that user can not set 
the conf about ` projected columns`. 
   
   In hive, the conf related to iceberg ` projected columns` are 
`iceberg.mr.read.schema` & `iceberg.mr.selected.columns`, and 
`iceberg.mr.selected.columns` will be always overwritten by 
`hive.io.file.readcolumn.names`, Check 
https://github.com/apache/iceberg/blob/282fa73b2c2b51b1d519a91e8bb05b2d2bdf3cf4/mr/src/main/java/org/apache/iceberg/mr/hive/HiveIcebergInputFormat.java#L99-L100
 . And `hive.io.file.readcolumn.names` starts with prefix **hive** but not a 
valid propertity in `HiveConf.java`, so user can not set the value of 
`hive.io.file.readcolumn.names`.
   
https://github.com/apache/hive/blob/33cadc5b498b57f779cddc9bf4e3f8aef2d9f6dc/ql/src/java/org/apache/hadoop/hive/ql/processors/SetProcessor.java#L254-L255
   ```
   0: jdbc:hive2://127.0.0.1:10000/default> set 
hive.io.file.readcolumn.names=id;
   Error: Error while processing statement: hive configuration 
hive.io.file.readcolumn.names does not exists. (state=42000,code=1)
   ```
   
   That is to say, user can not change the vlaue of 
`hive.io.file.readcolumn.names` & `iceberg.mr.selected.columns`, and i think 
this behavior is reasonable.
   
   For `iceberg.mr.read.schema` which is used by project shcemas, hive does not 
actually set the value internally & always null. So, if user don't set & change 
its value, hive will never use it.
   
   But now, your comments made me think that user may can change the value of 
`iceberg.mr.read.schema`. if user set the value of `iceberg.mr.read.schema`, it 
can be changed by normal and will lead to the side effect when initializing 
iceberg reader(`iceberg.mr.read.schema`  is chosen as a priority over 
`iceberg.mr.selected.columns`), check: 
https://github.com/apache/iceberg/blob/282fa73b2c2b51b1d519a91e8bb05b2d2bdf3cf4/mr/src/main/java/org/apache/iceberg/mr/mapreduce/IcebergInputFormat.java#L490
   
   **The last, what i want to say is that we should not allow to set the conf 
related project columns.** I think we can remove the use of 
`iceberg.mr.read.schema` as it actually not be used by hive internal & it can 
lead to side effect.
   
   If you have any other good suggesstion, please comment here. Many thanks.
   
   Also cc @deniskuzZ 
   
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org

Re: [PR] Hive: Return new scan after applying column project parameter [iceberg]

Reply via email to