rdblue commented on code in PR #12634:
URL: https://github.com/apache/iceberg/pull/12634#discussion_r2337932655
##########
parquet/src/main/java/org/apache/iceberg/parquet/PruneColumns.java:
##########
@@ -90,11 +90,11 @@ public Type struct(StructType expected, GroupType struct,
List<Type> fields) {
Type originalField = struct.getType(i);
Type field = fields.get(i);
Integer fieldId = getId(originalField);
- if (fieldId != null && selectedIds.contains(fieldId)) {
- filteredFields.add(originalField);
- } else if (field != null) {
- filteredFields.add(originalField);
+ if (field != null) {
+ filteredFields.add(field);
hasChange = true;
+ } else if (fieldId != null && selectedIds.contains(fieldId)) {
Review Comment:
I think you're right that the current way that this class is called by
`ParquetSchemaUtil.pruneColumns` will always pass the struct IDs in. Here's
that method:
```java
public static MessageType pruneColumns(MessageType fileSchema, Schema
expectedSchema) {
// column order must match the incoming type, so it doesn't matter that
the ids are unordered
Set<Integer> selectedIds = TypeUtil.getProjectedIds(expectedSchema);
return (MessageType)
TypeWithSchemaVisitor.visit(
expectedSchema.asStruct(), fileSchema, new
PruneColumns(selectedIds));
}
```
The call to `TypeUtil.getProjectedIds` calls
`getIdsInternal(schema.asStruct(), true /*includeStructIds*/)`. If we keep the
previous order of the `if` blocks as I suggested, then this call would always
produce the original struct, which doesn't fix column pruning. However, I think
we do want to keep the original order of the blocks and fix the problem by
setting `includeStructIds` to `false`. That way we don't incorrectly select
entire structs unless they are explicitly selected.
I think that the behavior of selecting the entire struct when its field ID
is passed in is correct, for the reason I gave in my comment above. I don't
think that we should change the behavior of the visitor just because we know
how it is called. That creates a dependency for correct behavior on how it is
called, which we don't want to assume. We could add other uses of this visitor
in the future and it should still behave predictably if we reuse it and don't
make the same choices in how it is called.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]