amogh-jahagirdar commented on code in PR #12634:
URL: https://github.com/apache/iceberg/pull/12634#discussion_r2017066766
##########
parquet/src/test/java/org/apache/iceberg/parquet/TestPruneColumns.java:
##########
@@ -248,7 +269,16 @@ public void testStructElementName() {
StructType.of(
NestedField.required(4, "y", DoubleType.get()),
NestedField.required(5, "z", DoubleType.get()))),
- NestedField.optional(6, "struct_name_2", StructType.of()));
+ NestedField.optional(6, "struct_name_2", StructType.of()),
+ NestedField.optional(
+ 10,
+ "struct_name_3",
+ StructType.of(
+ NestedField.required(11, "x", DoubleType.get()),
+ NestedField.optional(
+ 12,
+ "nested_struct_1",
+ StructType.of(NestedField.required(13, "y",
BinaryType.get()))))));
Review Comment:
Since this structure is a bit deeply nested, could you add/update the above
comment to indicate which fields we expect to be projected?
##########
parquet/src/main/java/org/apache/iceberg/parquet/PruneColumns.java:
##########
@@ -90,11 +90,11 @@ public Type struct(StructType expected, GroupType struct,
List<Type> fields) {
Type originalField = struct.getType(i);
Type field = fields.get(i);
Integer fieldId = getId(originalField);
- if (fieldId != null && selectedIds.contains(fieldId)) {
- filteredFields.add(originalField);
Review Comment:
My thoughts on why this is correct:
This check being performed first prevented column pruning on deeply nested
fields because `selectedIds` contains the IDs of structs along with any further
nested fields being projected. This check would pass first and we would just
add the whole `originalField` which includes fields that may not be needed to
read to satisfy a read.
In the change, by checking `field` is not null first we're checking the
pruned fields of the struct since this visitor works in a preorder fashion
where a structure is visited and then those contents are passed here in that
List<Type>. If a pruned field exists then we add that field (and not the
original field since that would not be pruned).
OK I think I'm convinced that this change is correct but I'd like @rdblue
take on this.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]