yihua commented on code in PR #18126:
URL: https://github.com/apache/hudi/pull/18126#discussion_r3035648942


##########
hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/SparkHoodieTableFileIndex.scala:
##########


Review Comment:
   **Line 258:** 🤖 The struct-parent prefix matching 
(`partCol.startsWith(logicalRef + ".")`) is still too broad — this was flagged 
in the previous review cycle and remains unfixed.
   
   If a struct `nested_record` has both a partition field 
(`nested_record.level`) and a non-partition field (`nested_record.nested_int`), 
a filter like `nested_record.nested_int = 10` would be misclassified as a 
partition-pruning predicate because its only reference (`nested_record`) 
prefix-matches the partition column `nested_record.level`. The subsequent 
`GetStructField` transform (line 306) would fail to bind 
`nested_record.nested_int` (not in `partitionFieldNames`), leaving an 
unresolved `GetStructField` node in `transformedPredicate`. When 
`InterpretedPredicate.eval()` encounters it against an `InternalRow` of 
partition values, it will throw at runtime — the catch block (line 276) only 
handles `NoSuchMethodException | IllegalArgumentException`, not evaluation 
errors.
   
   Suggested fix: instead of matching on the struct parent, resolve the full 
dot-path of each `GetStructField` chain in the filter and check whether that 
full path exists in `partitionColumnNames`. A helper like 
`getPartitionColumnPath` (already defined at line 298) can be reused for this 
purpose:
   ```scala
   def referencesPartitionColumn(expr: Expression): Boolean = expr match {
     case g: GetStructField =>
       getPartitionColumnPath(g).exists(partitionColumnNames.contains)
     case a: AttributeReference =>
       partitionColumnNames.exists(partCol => resolve(logicalRefName(a.name), partCol))
     case _ => expr.children.forall(referencesPartitionColumn)
   }
   ```
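   
   For illustration, here is a minimal, self-contained sketch (plain Scala, no Spark dependencies; `byParentPrefix` and `byFullPath` are hypothetical stand-ins for the current and suggested matching) of why the parent-prefix check misclassifies the filter above:
   ```scala
   // Partition columns tracked as dot-paths (values assumed for illustration).
   val partitionColumnNames = Set("nested_record.level")

   // Current behaviour: match any partition column whose path starts with the
   // struct parent referenced by the filter.
   def byParentPrefix(logicalRef: String): Boolean =
     partitionColumnNames.exists(p => p == logicalRef || p.startsWith(logicalRef + "."))

   // Suggested behaviour: resolve the full dot-path of the GetStructField chain
   // and require an exact match against the partition columns.
   def byFullPath(fullPath: String): Boolean =
     partitionColumnNames.contains(fullPath)

   byParentPrefix("nested_record")        // true: `nested_record.nested_int = 10` misclassified
   byFullPath("nested_record.nested_int") // false: correctly treated as a data filter
   byFullPath("nested_record.level")      // true: partition pruning still applies
   ```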



##########
hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/HoodieFileIndex.scala:
##########


Review Comment:
   **Line 196:** 🤖 The cache is consumed once and then immediately cleared 
(lines 200-202). If Spark calls `listFiles` more than once during a single 
query execution — which can happen with Adaptive Query Execution (AQE) 
re-planning, or with subqueries — the second call would see 
`partitionFilters.isEmpty && cachedPrunedSlices.isEmpty`, fall through to the 
`else` branch (line 186), and call `filterFileSlices` with the empty 
`partitionFilters` it received. An empty `partitionFilters` is precisely the 
nested-partition-column symptom this cache was introduced to work around, so 
`filterFileSlices` would return all partitions, causing a full table scan on 
the second execution.
   
   Consider either:
   1. Not clearing the cache after a single read — instead, clear it only when 
the file-index is refreshed or a new `filterFileSlices` call with 
`isPartitionPruned=true` repopulates it, or
   2. Keeping a separate `cacheConsumedForCurrentQuery` flag that prevents 
re-clearing within the same logical query execution.
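   
   As a rough sketch of option 1 (class and method names here are hypothetical, not from the PR), the cache could survive repeated `listFiles` calls and be dropped only on an explicit refresh:
   ```scala
   // Illustrative sketch: the cached pruned slices outlive a single read and
   // are cleared only when the file index is refreshed or repopulated.
   class PrunedSliceCache[T] {
     private var cached: Option[Seq[T]] = None

     // Called when filterFileSlices runs with isPartitionPruned = true.
     def populate(slices: Seq[T]): Unit = cached = Some(slices)

     // Called from listFiles; deliberately does NOT clear, so a second
     // AQE-driven call with empty partitionFilters still sees the pruned result.
     def get: Option[Seq[T]] = cached

     // Called only when the file index itself is refreshed.
     def invalidate(): Unit = cached = None
   }
   ```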



##########
hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/SparkHoodieTableFileIndex.scala:
##########


Review Comment:
   **Line 219:** 🤖 The `IsNotNull`/`IsNull` rewriting uses the same 
overly-broad struct-parent prefix matching. `IsNotNull(nested_record)` is 
rewritten to `Literal(true)` whenever `nested_record` is a prefix of ANY 
partition field name. This is semantically correct for partition pruning 
(partition values parsed from paths are never null), but the problem is 
upstream: the filter containing `IsNotNull(nested_record)` AND a non-partition 
predicate like `nested_record.nested_int = 10` would have been misclassified as 
a partition filter (see comment on line 258). The `IsNotNull` would be 
rewritten to `true`, but the `GetStructField` for `nested_record.nested_int` 
would remain unresolved, causing an evaluation failure.
   
   This is a consequence of the same root cause (prefix matching at line 
188/258), so fixing the filter classification would also fix this.



##########
hudi-spark-datasource/hudi-spark3-common/src/main/scala/org/apache/spark/sql/hudi/analysis/Spark3HoodiePruneFileSourcePartitions.scala:
##########


Review Comment:
   **Line 117:** 🤖 The prefix check 
`partitionSchema.names.exists(_.startsWith(logicalName + "."))` is dead code 
when receiving the nested `partitionSchemaForSpark`. The nested schema's 
top-level names are struct names like `"nested_record"`, never dot-paths like 
`"nested_record.level"`. The `contains` check on the line above already matches 
`"nested_record"` directly. Consider removing the prefix branch or adding a 
comment explaining it's a defensive fallback.
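   
   To make the point concrete, a hypothetical trace with the nested schema's top-level names (values assumed for illustration):
   ```scala
   // With the nested partitionSchemaForSpark, top-level names are struct
   // names, never dot-paths.
   val names = Seq("nested_record")
   val logicalName = "nested_record"

   names.contains(logicalName)                   // true: already matched by the contains check
   names.exists(_.startsWith(logicalName + ".")) // false: no top-level name carries a dot suffix
   ```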



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
