rdblue commented on code in PR #12634:
URL: https://github.com/apache/iceberg/pull/12634#discussion_r2337932655


##########
parquet/src/main/java/org/apache/iceberg/parquet/PruneColumns.java:
##########
@@ -90,11 +90,11 @@ public Type struct(StructType expected, GroupType struct, 
List<Type> fields) {
       Type originalField = struct.getType(i);
       Type field = fields.get(i);
       Integer fieldId = getId(originalField);
-      if (fieldId != null && selectedIds.contains(fieldId)) {
-        filteredFields.add(originalField);
-      } else if (field != null) {
-        filteredFields.add(originalField);
+      if (field != null) {
+        filteredFields.add(field);
         hasChange = true;
+      } else if (fieldId != null && selectedIds.contains(fieldId)) {

Review Comment:
   I think you're right that the current way that this class is called by 
`ParquetSchemaUtil.pruneColumns` will always pass the struct IDs in. Here's 
that method:
   
   ```java
     public static MessageType pruneColumns(MessageType fileSchema, Schema 
expectedSchema) {
       // column order must match the incoming type, so it doesn't matter that 
the ids are unordered
       Set<Integer> selectedIds = TypeUtil.getProjectedIds(expectedSchema);
       return (MessageType)
           TypeWithSchemaVisitor.visit(
               expectedSchema.asStruct(), fileSchema, new 
PruneColumns(selectedIds));
     }
   ```
   
   The call to `TypeUtil.getProjectedIds` calls 
`getIdsInternal(schema.asStruct(), true /*includeStructIds*/)`. If we keep the 
previous order of the `if` blocks as I suggested, then this call would always 
produce the original struct, which doesn't fix column pruning. However, I think 
we do want to keep the original order of the blocks and fix the problem by 
setting `includeStructIds` to `false`. That way we don't incorrectly select 
entire structs unless they are explicitly selected.
   
   I think that the behavior of selecting the entire struct when its field ID 
is passed in is correct, for the reason I gave in my comment above. I don't 
think that we should change the behavior of the visitor just because we know 
how it is called. That creates a dependency for correct behavior on how it is 
called, which we don't want to assume. We could add other uses of this visitor 
in the future and it should still behave predictably if we reuse it and don't 
make the same choices in how it is called.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to