yihua commented on code in PR #18384:
URL: https://github.com/apache/hudi/pull/18384#discussion_r3035811428


##########
hudi-common/src/main/java/org/apache/hudi/common/table/HoodieTableConfig.java:
##########
@@ -324,6 +324,16 @@ public static final String getDefaultPayloadClassName() {
       .withDocumentation("When enabled, populates all meta fields. When disabled, no meta fields are populated "
           + "and incremental queries will not be functional. This is only meant to be used for append only/immutable data for batch processing");
 
+  public static final ConfigProperty<String> META_FIELDS_TO_EXCLUDE = ConfigProperty
+      .key("hoodie.meta.fields.to.exclude")
+      .noDefaultValue()
+      .markAdvanced()
+      .withDocumentation("Comma-separated list of Hudi meta field names to exclude from population. "
+          + "Excluded fields remain in the schema but are written as empty strings. "

Review Comment:
   🤖 The documentation here says excluded fields are written as "empty 
strings", but the PR description and tests indicate they are written as `null`. 
Could you align the documentation with the implementation? Using `null` is the 
correct approach for storage savings.
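   As a concrete suggestion (wording entirely up to you), the documentation fragment could read:

   ```java
         .withDocumentation("Comma-separated list of Hudi meta field names to exclude from population. "
             + "Excluded fields remain in the table schema but are written as null, "
             + "so no value is stored for them in the data files.")
   ```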



##########
hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/io/storage/HoodieSparkParquetWriter.java:
##########
@@ -95,11 +108,28 @@ protected void updateRecordMetadata(InternalRow row,
                                       UTF8String recordKey,

Review Comment:
   🤖 nit: to avoid duplicating the update logic in the `if/else` blocks, you 
could normalize the `populateIndividualMetaFields` array at the beginning of 
the method. For example, if it's `null`, create an all-true array, and then 
just have a single code path that uses the flags.
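   Something along these lines (a sketch only; the class and method names here are illustrative, not the actual `HoodieSparkParquetWriter` API):

   ```java
   import java.util.Arrays;

   class MetaFieldFlagsSketch {
     // A null flags array means "populate everything"; normalizing it up front
     // leaves a single code path downstream instead of duplicated if/else blocks.
     static boolean[] normalize(boolean[] populateIndividualMetaFields, int numMetaFields) {
       if (populateIndividualMetaFields == null) {
         boolean[] allTrue = new boolean[numMetaFields];
         Arrays.fill(allTrue, true);
         return allTrue;
       }
       return populateIndividualMetaFields;
     }
   }
   ```

   The writer would call `normalize(...)` once in its constructor and then index the flags unconditionally.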



##########
hudi-common/src/main/java/org/apache/hudi/common/table/HoodieTableConfig.java:
##########
@@ -324,6 +324,16 @@ public static final String getDefaultPayloadClassName() {
       .withDocumentation("When enabled, populates all meta fields. When disabled, no meta fields are populated "
           + "and incremental queries will not be functional. This is only meant to be used for append only/immutable data for batch processing");
 
+  public static final ConfigProperty<String> META_FIELDS_TO_EXCLUDE = ConfigProperty

Review Comment:
   🤖 This is a critical point. Without a table version upgrade, older writers 
could read the table config, ignore the new exclusion property, and write data 
with all meta-fields populated. A version bump seems necessary to enforce the 
new contract for all writers.



##########
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieWriteConfig.java:
##########
@@ -1738,6 +1741,43 @@ public boolean populateMetaFields() {
     return getBooleanOrDefault(HoodieTableConfig.POPULATE_META_FIELDS);
   }
 
+  public Set<String> getMetaFieldsToExclude() {
+    String value = getString(HoodieTableConfig.META_FIELDS_TO_EXCLUDE);
+    if (value == null || value.trim().isEmpty()) {
+      return Collections.emptySet();
+    }
+    Set<String> excluded = new HashSet<>();
+    for (String field : value.split(",")) {
+      String trimmed = field.trim();
+      if (!trimmed.isEmpty()) {
+        excluded.add(trimmed);
+      }
+    }
+    return excluded;
+  }
+
+  /**

Review Comment:
   🤖 I wonder if this could be made more robust against future changes to meta 
fields. The current implementation hard-codes the field names and their ordinal 
positions. To avoid divergence, maybe you could iterate over the 
`HoodieRecord.HOODIE_META_COLUMNS` list to build the flags array?
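   Roughly like this (a sketch; the column-name list is inlined here as a stand-in for `HoodieRecord.HOODIE_META_COLUMNS`, same names and order):

   ```java
   import java.util.Arrays;
   import java.util.List;
   import java.util.Set;

   class MetaFieldPopulationSketch {
     // Stand-in for HoodieRecord.HOODIE_META_COLUMNS.
     static final List<String> META_COLUMNS = Arrays.asList(
         "_hoodie_commit_time", "_hoodie_commit_seqno", "_hoodie_record_key",
         "_hoodie_partition_path", "_hoodie_file_name");

     // Deriving the flags from the list keeps the array size and ordinals in
     // sync with the meta-column definitions, with no hard-coded positions.
     static boolean[] populationFlags(Set<String> excluded) {
       boolean[] flags = new boolean[META_COLUMNS.size()];
       for (int i = 0; i < flags.length; i++) {
         flags[i] = !excluded.contains(META_COLUMNS.get(i));
       }
       return flags;
     }
   }
   ```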



##########
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieWriteConfig.java:
##########
@@ -1738,6 +1741,43 @@ public boolean populateMetaFields() {
     return getBooleanOrDefault(HoodieTableConfig.POPULATE_META_FIELDS);
   }
 
+  public Set<String> getMetaFieldsToExclude() {
+    String value = getString(HoodieTableConfig.META_FIELDS_TO_EXCLUDE);
+    if (value == null || value.trim().isEmpty()) {
+      return Collections.emptySet();
+    }
+    Set<String> excluded = new HashSet<>();
+    for (String field : value.split(",")) {
+      String trimmed = field.trim();
+      if (!trimmed.isEmpty()) {
+        excluded.add(trimmed);
+      }
+    }
+    return excluded;
+  }
+
+  /**
+   * Returns a boolean[5] indexed by meta field ordinal.
+   * true = populate this meta field, false = write empty string.
+   */
+  public boolean[] getMetaFieldPopulationFlags() {
+    boolean[] flags = new boolean[5];
+    if (!populateMetaFields()) {
+      return flags; // all false
+    }
+    Set<String> excluded = getMetaFieldsToExclude();
+    if (excluded.isEmpty()) {
+      Arrays.fill(flags, true);
+      return flags;
+    }
+    flags[0] = !excluded.contains(HoodieRecord.COMMIT_TIME_METADATA_FIELD);
+    flags[1] = !excluded.contains(HoodieRecord.COMMIT_SEQNO_METADATA_FIELD);
+    flags[2] = !excluded.contains(HoodieRecord.RECORD_KEY_METADATA_FIELD);
+    flags[3] = !excluded.contains(HoodieRecord.PARTITION_PATH_METADATA_FIELD);
+    flags[4] = !excluded.contains(HoodieRecord.FILENAME_METADATA_FIELD);
+    return flags;

Review Comment:
   🤖 That's a good clarifying question. The `OPERATION_METADATA_FIELD` is 
generally treated separately from the other five meta-fields since it serves a 
core functional purpose for MOR tables. Excluding it could break 
write-ahead-log processing, so I think it's correct to keep it out of scope for 
this feature.



##########
hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/io/storage/row/HoodieRowCreateHandle.java:
##########
@@ -176,15 +178,16 @@ private void writeRow(InternalRow row) {
       //          over again)
       UTF8String[] metaFields = new UTF8String[5];
      UTF8String recordKey = row.getUTF8String(HoodieRecord.RECORD_KEY_META_FIELD_ORD);
-      metaFields[3] = row.getUTF8String(HoodieRecord.PARTITION_PATH_META_FIELD_ORD);
-      // This is the only meta-field that is generated dynamically, hence conversion b/w
-      // [[String]] and [[UTF8String]] is unavoidable if preserveHoodieMetadata is false
-      metaFields[1] = shouldPreserveHoodieMetadata ? row.getUTF8String(HoodieRecord.COMMIT_SEQNO_METADATA_FIELD_ORD)
-          : UTF8String.fromString(seqIdGenerator.apply(GLOBAL_SEQ_NO.getAndIncrement()));
-      metaFields[0] = shouldPreserveHoodieMetadata ? row.getUTF8String(HoodieRecord.COMMIT_TIME_METADATA_FIELD_ORD)
-          : commitTime;
-      metaFields[2] = recordKey;
-      metaFields[4] = fileName;
+      metaFields[0] = populateField[0]
+          ? (shouldPreserveHoodieMetadata ? row.getUTF8String(HoodieRecord.COMMIT_TIME_METADATA_FIELD_ORD) : commitTime)
+          : null;
+      metaFields[1] = populateField[1]
+          ? (shouldPreserveHoodieMetadata ? row.getUTF8String(HoodieRecord.COMMIT_SEQNO_METADATA_FIELD_ORD)
+              : UTF8String.fromString(seqIdGenerator.apply(GLOBAL_SEQ_NO.getAndIncrement())))
+          : null;
+      metaFields[2] = populateField[2] ? recordKey : null;
+      metaFields[3] = populateField[3] ? row.getUTF8String(HoodieRecord.PARTITION_PATH_META_FIELD_ORD) : null;
+      metaFields[4] = populateField[4] ? fileName : null;

Review Comment:
   🤖 This is a great architectural question. I think keeping the fields in the 
table schema (but writing nulls) is the right call. It avoids a breaking schema 
evolution and gives users the flexibility to re-include a field later without a 
complex backfill or migration. It's a safer, more reversible approach.



##########
hudi-client/hudi-spark-client/src/main/scala/org/apache/hudi/HoodieDatasetBulkInsertHelper.scala:
##########
@@ -110,11 +110,13 @@ object HoodieDatasetBulkInsertHelper
                 sparkKeyGenerator
               }
 
+          val populateIndividualMetaFields = config.getMetaFieldPopulationFlags

Review Comment:
   🤖 It looks like the logic here only handles exclusion for record key and 
partition path. What happens if a user wants to exclude `_hoodie_commit_time`? 
It seems this field will get an empty string instead of null.
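   One way to make the behavior uniform would be to apply the flags to all five fields in one place, so any excluded field (including `_hoodie_commit_time`) comes out as null. A sketch, not the actual helper in `HoodieDatasetBulkInsertHelper`:

   ```java
   class MetaFieldProjectionSketch {
     // values holds the five computed meta-field values in ordinal order;
     // any field whose flag is false is emitted as null rather than "".
     static String[] applyFlags(String[] values, boolean[] populateFlags) {
       String[] out = new String[values.length];
       for (int i = 0; i < values.length; i++) {
         out[i] = populateFlags[i] ? values[i] : null;
       }
       return out;
     }
   }
   ```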



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
