yihua commented on code in PR #18379:
URL: https://github.com/apache/hudi/pull/18379#discussion_r3035867831
##########
hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/io/storage/row/HoodieRowParquetWriteSupport.java:
##########
@@ -217,10 +217,14 @@ private void consumeField(String field, int index, Runnable writer) {
}
private void writeFields(InternalRow row, StructType schema, ValueWriter[] fieldWriters) {
- for (int i = 0; i < row.numFields(); i++) {
+ for (int i = 0; i < schema.fields().length; i++) {
Review Comment:
🤖 Changing the loop bound from `row.numFields()` to `schema.fields().length`
— is this an intentional fix for a known issue, or incidental to the config
injector feature? In normal operation these should match, but if there's ever a
mismatch (e.g., schema evolution edge case), the old code would fail on
`schema.fields()[i]` and the new code would fail on `row.isNullAt(i)`. Since
`fieldWriters` is already sized from `schema.fields()`, iterating over
`schema.fields().length` is arguably more consistent, but it would be good to
know the motivation.
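To illustrate the indexing point in plain Java (not Hudi code — `rowValues`, `schemaFields`, and `fieldWriters` below are stand-ins for `InternalRow`, the `StructType` fields, and the writer array): since `fieldWriters` is sized from the schema, bounding the loop by the schema length can never index past `fieldWriters`, even if the row carries extra trailing fields.

```java
// Hedged analogy only; the real method operates on Spark's InternalRow/StructType.
class BoundsDemo {
  static int writeNonNullFields(Object[] rowValues, String[] schemaFields, Runnable[] fieldWriters) {
    int written = 0;
    for (int i = 0; i < schemaFields.length; i++) { // schema-bounded, like the new code
      if (rowValues[i] != null) {                   // mirrors !row.isNullAt(i)
        fieldWriters[i].run();
        written++;
      }
    }
    return written;
  }
}
```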
##########
hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/io/storage/row/HoodieRowParquetWriteSupport.java:
##########
@@ -217,10 +217,14 @@ private void consumeField(String field, int index, Runnable writer) {
}
private void writeFields(InternalRow row, StructType schema, ValueWriter[] fieldWriters) {
- for (int i = 0; i < row.numFields(); i++) {
+ for (int i = 0; i < schema.fields().length; i++) {
int index = i;
if (!row.isNullAt(i)) {
- consumeField(schema.fields()[i].name(), index, () -> fieldWriters[index].write(row, index));
+ try {
+ try {
Review Comment:
🤖 nit: this `catch (ClassCastException e) { throw e; }` is a no-op —
catching an exception only to rethrow it unchanged has no effect. Could you
remove the try-catch block?
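A minimal stand-alone demonstration of why this pattern is a no-op (plain Java, using `NumberFormatException` rather than the `ClassCastException` in the diff): catching and immediately rethrowing the same exception changes neither the exception object, nor its stack trace, nor the control flow.

```java
class RethrowDemo {
  // Variant with the no-op catch, shaped like the code under review.
  static int parseWithRethrow(String s) {
    try {
      return Integer.parseInt(s);
    } catch (NumberFormatException e) {
      throw e; // no-op: same exception object propagates either way
    }
  }

  // Behaviorally identical variant without the try-catch.
  static int parsePlain(String s) {
    return Integer.parseInt(s);
  }
}
```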
##########
hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/io/storage/HoodieSparkFileWriterFactory.java:
##########
@@ -57,16 +62,32 @@ protected HoodieFileWriter newParquetFileWriter(
if (compressionCodecName.isEmpty()) {
compressionCodecName = null;
}
- HoodieRowParquetWriteSupport writeSupport = getHoodieRowParquetWriteSupport(storage.getConf(), schema,
- config, enableBloomFilter(populateMetaFields, config));
+
Review Comment:
🤖 The config injection block (lines 65-79) is copy-pasted identically across
`HoodieSparkFileWriterFactory`, `HoodieRowDataFileWriterFactory`, and
`HoodieAvroFileWriterFactory`. Have you considered extracting this into a
shared utility method (e.g., on `HoodieParquetConfigInjector` itself or a
helper class) to avoid the duplication?
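One possible shape for the shared helper, sketched with stub types so it compiles on its own (`StorageConfStub`, `HoodieConfStub`, `PathStub`, and the method name `applyInjector` are all hypothetical; the real code would use Hudi's `StorageConfiguration`, `HoodieConfig`, `StoragePath`, and `Pair` from hudi-common):

```java
import java.util.AbstractMap.SimpleImmutableEntry;
import java.util.Map;

// Stub stand-ins so the sketch is self-contained; not Hudi's real types.
class StorageConfStub { }
class HoodieConfStub { }
class PathStub { }

interface ConfigInjectorSketch {
  Map.Entry<StorageConfStub, HoodieConfStub> withProps(PathStub path, StorageConfStub storageConf, HoodieConfStub hoodieConfig);

  // Hypothetical shared entry point: each factory calls this one method instead
  // of repeating the lookup-and-apply block. Injector discovery is elided here.
  static Map.Entry<StorageConfStub, HoodieConfStub> applyInjector(
      ConfigInjectorSketch injector, PathStub path, StorageConfStub storageConf, HoodieConfStub hoodieConfig) {
    if (injector == null) {
      // Nothing configured: pass the inputs through unchanged.
      return new SimpleImmutableEntry<>(storageConf, hoodieConfig);
    }
    return injector.withProps(path, storageConf, hoodieConfig);
  }
}
```

A static method on `HoodieParquetConfigInjector` itself (as suggested above) would keep the null-check and apply logic in one place across the three factories.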
##########
hudi-common/src/main/java/org/apache/hudi/io/HoodieParquetConfigInjector.java:
##########
@@ -0,0 +1,62 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements. See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership. The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.io;
+
+import org.apache.hudi.common.config.HoodieConfig;
+import org.apache.hudi.common.util.collection.Pair;
+import org.apache.hudi.storage.StorageConfiguration;
+import org.apache.hudi.storage.StoragePath;
+
+/**
+ * A pluggable interface that all parquet-based writers (Spark/Flink) will invoke before creating write support
+ * or parquet file writer objects.
+ * <p>
+ * This allows users to inject custom configurations into the Parquet writer pipeline at runtime, enabling
+ * fine-grained control over Parquet file properties such as bloom filters, compression settings, encoding
+ * options, and other advanced Parquet configurations.
+ * <p>
+ * Implementations of this interface can modify both the storage configuration (e.g., Hadoop Configuration)
+ * and the Hudi-specific configuration before the Parquet writer is created.
+ * <p>
+ * Example use cases:
+ * <ul>
+ *   <li>Enabling column-specific Parquet bloom filters</li>
+ *   <li>Setting custom compression codecs per file or partition</li>
+ *   <li>Adjusting page sizes or row group sizes based on data characteristics</li>
+ *   <li>Injecting custom metadata into Parquet files</li>
+ * </ul>
+ *
+ * @since 1.2.0
+ */
+public interface HoodieParquetConfigInjector {
+
+  /**
+   * Injects custom configurations into the Parquet writer pipeline.
+   * <p>
+   * This method is invoked before creating the Parquet write support and writer objects, allowing
+   * implementations to modify both the storage-level and Hudi-level configurations.
+   *
+   * @param path the file path where the Parquet file will be written
+   * @param storageConf the storage configuration (e.g., Hadoop Configuration) that will be used by the writer
+   * @param hoodieConfig the Hudi configuration containing write settings and table properties
+   * @return a pair containing the potentially modified storage configuration and Hudi configuration.
+   *         Both configurations will be used to create the Parquet writer.
+   */
+  Pair<StorageConfiguration, HoodieConfig> withProps(StoragePath path, StorageConfiguration storageConf, HoodieConfig hoodieConfig);
+}
Review Comment:
🤖 The interface contract doesn't specify whether implementations are allowed
to mutate the input `storageConf` and `hoodieConfig` in place, or whether they
must return new copies. The test implementations (e.g.,
`DisableDictionaryInjector`) mutate the input directly and return it. Since the
factory code does `HoodieConfig hoodieConfig = config` (a reference copy),
mutations by the injector will also affect the caller's original `config`
object. It might be worth documenting the expected behavior — or defensively
copying before calling the injector.
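The defensive-copy option could look like the following plain-Java sketch, where `java.util.Properties` stands in for the caller's config-backed property bag (the helper name `injectWithCopy` and the use of `UnaryOperator` are illustrative, not Hudi API):

```java
import java.util.Properties;
import java.util.function.UnaryOperator;

class DefensiveCopyDemo {
  static Properties injectWithCopy(Properties callerConfig, UnaryOperator<Properties> injector) {
    Properties copy = new Properties();
    copy.putAll(callerConfig);   // shallow copy: the injector gets its own bag
    return injector.apply(copy); // injector may mutate freely without leaking back
  }
}
```

With a copy in place, an injector that flips a property (e.g., disabling dictionary encoding) changes only the returned config, and the caller's original object is untouched.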
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]