yihua commented on code in PR #18384:
URL: https://github.com/apache/hudi/pull/18384#discussion_r3035811428
##########
hudi-common/src/main/java/org/apache/hudi/common/table/HoodieTableConfig.java:
##########
@@ -324,6 +324,16 @@ public static final String getDefaultPayloadClassName() {
      .withDocumentation("When enabled, populates all meta fields. When disabled, no meta fields are populated "
          + "and incremental queries will not be functional. This is only meant to be used for append only/immutable data for batch processing");
+  public static final ConfigProperty<String> META_FIELDS_TO_EXCLUDE = ConfigProperty
+ .key("hoodie.meta.fields.to.exclude")
+ .noDefaultValue()
+ .markAdvanced()
+      .withDocumentation("Comma-separated list of Hudi meta field names to exclude from population. "
+          + "Excluded fields remain in the schema but are written as empty strings. "
Review Comment:
🤖 The documentation here says excluded fields are written as "empty
strings", but the PR description and tests indicate they are written as `null`.
Could you align the documentation with the implementation? Using `null` is the
correct approach for storage savings.
##########
hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/io/storage/HoodieSparkParquetWriter.java:
##########
@@ -95,11 +108,28 @@ protected void updateRecordMetadata(InternalRow row, UTF8String recordKey,
Review Comment:
🤖 nit: to avoid duplicating the update logic in the `if/else` blocks, you
could normalize the `populateIndividualMetaFields` array at the beginning of
the method. For example, if it's `null`, create an all-true array, and then
just have a single code path that uses the flags.
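A minimal sketch of what that normalization could look like (class and method names here are illustrative, not actual Hudi code; only the null-to-all-true idea from the comment above is assumed):

```java
import java.util.Arrays;

// Illustrative sketch: normalize a possibly-null per-field flag array up front,
// so the rest of the method needs only a single code path that reads the flags.
public class MetaFieldFlags {
  public static final int NUM_META_FIELDS = 5;

  // A null flags array means "populate everything"; expand it once to an
  // all-true array instead of branching on null at every use site.
  public static boolean[] normalize(boolean[] flags) {
    if (flags == null) {
      boolean[] allTrue = new boolean[NUM_META_FIELDS];
      Arrays.fill(allTrue, true);
      return allTrue;
    }
    return flags;
  }
}
```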
##########
hudi-common/src/main/java/org/apache/hudi/common/table/HoodieTableConfig.java:
##########
@@ -324,6 +324,16 @@ public static final String getDefaultPayloadClassName() {
      .withDocumentation("When enabled, populates all meta fields. When disabled, no meta fields are populated "
          + "and incremental queries will not be functional. This is only meant to be used for append only/immutable data for batch processing");
+  public static final ConfigProperty<String> META_FIELDS_TO_EXCLUDE = ConfigProperty
Review Comment:
🤖 This is a critical point. Without a table version upgrade, older writers
could read the table config, ignore the new exclusion property, and write data
with all meta-fields populated. A version bump seems necessary to enforce the
new contract for all writers.
##########
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieWriteConfig.java:
##########
@@ -1738,6 +1741,43 @@ public boolean populateMetaFields() {
return getBooleanOrDefault(HoodieTableConfig.POPULATE_META_FIELDS);
}
+ private Set<String> getMetaFieldsToExclude() {
+ String value = getString(HoodieTableConfig.META_FIELDS_TO_EXCLUDE);
+ if (value == null || value.trim().isEmpty()) {
+ return Collections.emptySet();
+ }
+ Set<String> excluded = new HashSet<>();
+ for (String field : value.split(",")) {
+ String trimmed = field.trim();
+ if (!trimmed.isEmpty()) {
+ excluded.add(trimmed);
+ }
+ }
+ return excluded;
+ }
+
+ /**
Review Comment:
🤖 I wonder if this could be made more robust against future changes to meta
fields. The current implementation hard-codes the field names and their ordinal
positions. To avoid divergence, maybe you could iterate over the
`HoodieRecord.HOODIE_META_COLUMNS` list to build the flags array?
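A sketch of the suggested approach (the column-name constants mirror Hudi's standard meta field names, but the class and method here are illustrative stand-ins, not the actual `HoodieWriteConfig` code):

```java
import java.util.Arrays;
import java.util.List;
import java.util.Set;

// Illustrative sketch: derive the population flags by iterating the canonical
// ordered list of meta columns instead of hard-coding each ordinal. The list
// below stands in for HoodieRecord.HOODIE_META_COLUMNS; real code would
// reference that constant so new meta fields are picked up automatically.
public class FlagsFromColumns {
  static final List<String> META_COLUMNS = Arrays.asList(
      "_hoodie_commit_time", "_hoodie_commit_seqno", "_hoodie_record_key",
      "_hoodie_partition_path", "_hoodie_file_name");

  public static boolean[] populationFlags(Set<String> excluded) {
    boolean[] flags = new boolean[META_COLUMNS.size()];
    for (int i = 0; i < META_COLUMNS.size(); i++) {
      // The flag index is the field's position in the canonical list, so the
      // ordinal mapping can never drift from the list itself.
      flags[i] = !excluded.contains(META_COLUMNS.get(i));
    }
    return flags;
  }
}
```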
##########
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieWriteConfig.java:
##########
@@ -1738,6 +1741,43 @@ public boolean populateMetaFields() {
return getBooleanOrDefault(HoodieTableConfig.POPULATE_META_FIELDS);
}
+ public Set<String> getMetaFieldsToExclude() {
+ String value = getString(HoodieTableConfig.META_FIELDS_TO_EXCLUDE);
+ if (value == null || value.trim().isEmpty()) {
+ return Collections.emptySet();
+ }
+ Set<String> excluded = new HashSet<>();
+ for (String field : value.split(",")) {
+ String trimmed = field.trim();
+ if (!trimmed.isEmpty()) {
+ excluded.add(trimmed);
+ }
+ }
+ return excluded;
+ }
+
+ /**
+ * Returns a boolean[5] indexed by meta field ordinal.
+ * true = populate this meta field, false = write empty string.
+ */
+ public boolean[] getMetaFieldPopulationFlags() {
+ boolean[] flags = new boolean[5];
+ if (!populateMetaFields()) {
+ return flags; // all false
+ }
+ Set<String> excluded = getMetaFieldsToExclude();
+ if (excluded.isEmpty()) {
+ Arrays.fill(flags, true);
+ return flags;
+ }
+ flags[0] = !excluded.contains(HoodieRecord.COMMIT_TIME_METADATA_FIELD);
+ flags[1] = !excluded.contains(HoodieRecord.COMMIT_SEQNO_METADATA_FIELD);
+ flags[2] = !excluded.contains(HoodieRecord.RECORD_KEY_METADATA_FIELD);
+ flags[3] = !excluded.contains(HoodieRecord.PARTITION_PATH_METADATA_FIELD);
+ flags[4] = !excluded.contains(HoodieRecord.FILENAME_METADATA_FIELD);
+ return flags;
Review Comment:
🤖 That's a good clarifying question. The `OPERATION_METADATA_FIELD` is
generally treated separately from the other five meta-fields since it serves a
core functional purpose for MOR tables. Excluding it could break
write-ahead-log processing, so I think it's correct to keep it out of scope for
this feature.
##########
hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/io/storage/row/HoodieRowCreateHandle.java:
##########
@@ -176,15 +178,16 @@ private void writeRow(InternalRow row) {
// over again)
UTF8String[] metaFields = new UTF8String[5];
     UTF8String recordKey = row.getUTF8String(HoodieRecord.RECORD_KEY_META_FIELD_ORD);
-    metaFields[3] = row.getUTF8String(HoodieRecord.PARTITION_PATH_META_FIELD_ORD);
-    // This is the only meta-field that is generated dynamically, hence conversion b/w
-    // [[String]] and [[UTF8String]] is unavoidable if preserveHoodieMetadata is false
-    metaFields[1] = shouldPreserveHoodieMetadata ? row.getUTF8String(HoodieRecord.COMMIT_SEQNO_METADATA_FIELD_ORD)
-        : UTF8String.fromString(seqIdGenerator.apply(GLOBAL_SEQ_NO.getAndIncrement()));
-    metaFields[0] = shouldPreserveHoodieMetadata ? row.getUTF8String(HoodieRecord.COMMIT_TIME_METADATA_FIELD_ORD)
-        : commitTime;
-    metaFields[2] = recordKey;
-    metaFields[4] = fileName;
+    metaFields[0] = populateField[0]
+        ? (shouldPreserveHoodieMetadata ? row.getUTF8String(HoodieRecord.COMMIT_TIME_METADATA_FIELD_ORD) : commitTime)
+        : null;
+    metaFields[1] = populateField[1]
+        ? (shouldPreserveHoodieMetadata ? row.getUTF8String(HoodieRecord.COMMIT_SEQNO_METADATA_FIELD_ORD)
+            : UTF8String.fromString(seqIdGenerator.apply(GLOBAL_SEQ_NO.getAndIncrement())))
+        : null;
+    metaFields[2] = populateField[2] ? recordKey : null;
+    metaFields[3] = populateField[3] ? row.getUTF8String(HoodieRecord.PARTITION_PATH_META_FIELD_ORD) : null;
+    metaFields[4] = populateField[4] ? fileName : null;
Review Comment:
🤖 This is a great architectural question. I think keeping the fields in the
table schema (but writing nulls) is the right call. It avoids a breaking schema
evolution and gives users the flexibility to re-include a field later without a
complex backfill or migration. It's a safer, more reversible approach.
##########
hudi-client/hudi-spark-client/src/main/scala/org/apache/hudi/HoodieDatasetBulkInsertHelper.scala:
##########
@@ -110,11 +110,13 @@ object HoodieDatasetBulkInsertHelper
sparkKeyGenerator
}
+ val populateIndividualMetaFields = config.getMetaFieldPopulationFlags
Review Comment:
🤖 It looks like the logic here only handles exclusion for record key and
partition path. What happens if a user wants to exclude `_hoodie_commit_time`?
It seems this field will get an empty string instead of null.
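A sketch of applying the flags uniformly so every excluded field becomes null rather than an empty string (class and method names are illustrative, not the actual helper; only the flags-to-null behavior described in this review is assumed):

```java
// Illustrative sketch: given the per-field population flags and the values the
// writer would otherwise emit, null out every excluded field in one pass so no
// field can silently fall back to an empty string.
public class UniformMetaFields {
  public static String[] applyFlags(boolean[] populate, String[] values) {
    String[] out = new String[values.length];
    for (int i = 0; i < values.length; i++) {
      // Excluded fields stay null; nulls take no value storage in columnar
      // formats, which is the storage saving this feature targets.
      out[i] = populate[i] ? values[i] : null;
    }
    return out;
  }
}
```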
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]