yihua commented on code in PR #18396:
URL: https://github.com/apache/hudi/pull/18396#discussion_r3036286741


##########
hudi-client/hudi-flink-client/src/main/java/org/apache/hudi/table/action/commit/BaseFlinkCommitActionExecutor.java:
##########
@@ -167,13 +167,13 @@ protected Iterator<List<WriteStatus>> 
handleUpsertPartition(
         // and append instead of UPDATE.
         return handleInsert(fileIdHint, recordItr);
       } else if (this.writeHandle instanceof HoodieWriteMergeHandle) {
-        return handleUpdate(partitionPath, fileIdHint, recordItr);
+        return handleUpdate(partitionPath, fileIdHint, -1L, recordItr);

Review Comment:
   🤖 nit: `-1L` is a magic sentinel here (meaning "unknown update count"). 
Could you extract a named constant like `UNKNOWN_NUM_UPDATES = -1L` and use it 
in both call sites? Makes the intent a lot clearer when reading the call.



##########
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/io/MergeContext.java:
##########
@@ -0,0 +1,61 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *      http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.io;
+
+import org.apache.hudi.common.model.HoodieRecord;
+
+import java.util.Iterator;
+
+/**
+ * Context for merge handle creation, carrying incoming data and 
characteristics
+ */
+public class MergeContext<T> {
+
+  /**
+   * The estimated number of incoming update and delete records to merge
+   * A value of -1 means unknown.
+   */
+  private final long numIncomingUpdates;
+
+  /**
+   * Iterator over the incoming records to be merged.
+   */
+  private final Iterator<HoodieRecord<T>> recordItr;
+
+  private MergeContext(long numIncomingUpdates, Iterator<HoodieRecord<T>> 
recordItr) {
+    this.numIncomingUpdates = numIncomingUpdates;
+    this.recordItr = recordItr;
+  }
+
+  public static <T> MergeContext<T> create(long numIncomingUpdates, 
Iterator<HoodieRecord<T>> recordItr) {
+    return new MergeContext<>(numIncomingUpdates, recordItr);
+  }
+
+  public static <T> MergeContext<T> create(Iterator<HoodieRecord<T>> 
recordItr) {
+    return new MergeContext<>(-1L, recordItr);
+  }
+

Review Comment:
   🤖 nit: the `-1L` sentinel for "unknown" is repeated here and in `BucketInfo` 
(and both Javadoc blocks). Could you extract it as a named constant like 
`public static final long UNKNOWN_NUM_UPDATES = -1L;` on `MergeContext` so all 
callers can reference `MergeContext.UNKNOWN_NUM_UPDATES` rather than scattering 
the magic value?



##########
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/io/MergeContext.java:
##########
@@ -0,0 +1,61 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *      http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.io;
+
+import org.apache.hudi.common.model.HoodieRecord;
+
+import java.util.Iterator;
+
+/**
+ * Context for merge handle creation, carrying incoming data and 
characteristics
+ */
+public class MergeContext<T> {
+
+  /**
+   * The estimated number of incoming update and delete records to merge
+   * A value of -1 means unknown.
+   */
+  private final long numIncomingUpdates;
+
+  /**
+   * Iterator over the incoming records to be merged.
+   */
+  private final Iterator<HoodieRecord<T>> recordItr;
+
+  private MergeContext(long numIncomingUpdates, Iterator<HoodieRecord<T>> 
recordItr) {
+    this.numIncomingUpdates = numIncomingUpdates;
+    this.recordItr = recordItr;
+  }
+
+  public static <T> MergeContext<T> create(long numIncomingUpdates, 
Iterator<HoodieRecord<T>> recordItr) {
+    return new MergeContext<>(numIncomingUpdates, recordItr);
+  }
+
+  public static <T> MergeContext<T> create(Iterator<HoodieRecord<T>> 
recordItr) {
+    return new MergeContext<>(-1L, recordItr);
+  }
+
+  public long getNumIncomingUpdates() {
+    return numIncomingUpdates;
+  }
+
+  public Iterator<HoodieRecord<T>> getRecordItr() {
+    return recordItr;
+  }

Review Comment:
   🤖 nit: `getRecordItr()` uses the abbreviated form — could you spell it out 
as `getRecordIterator()` for a bit more clarity at call sites?



##########
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/io/HoodieAbstractMergeHandle.java:
##########
@@ -64,16 +67,18 @@ public abstract class HoodieAbstractMergeHandle<T, I, K, O> 
extends HoodieWriteH
    * @param config Hoodie writer configs.
    * @param instantTime current instant time.
    * @param hoodieTable an instance of {@link HoodieTable}
+   * @param mergeContext context carrying incoming data to merge and 
characteristics

Review Comment:
   🤖 nit: the `@param mergeContext` Javadoc is missing a trailing period — same 
pattern appears across several constructors in the PR. Could you add one for 
consistency with the surrounding Javadoc style?



##########
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/io/HoodieWriteMergeHandle.java:
##########
@@ -137,7 +137,7 @@ public HoodieWriteMergeHandle(HoodieWriteConfig config, 
String instantTime, Hood
                                 Map<String, HoodieRecord<T>> keyToNewRecords, 
String partitionPath, String fileId,
                                 HoodieBaseFile dataFileToBeMerged, 
TaskContextSupplier taskContextSupplier,
                                 Option<BaseKeyGenerator> keyGeneratorOpt) {
-    super(config, instantTime, hoodieTable, partitionPath, fileId, 
taskContextSupplier, dataFileToBeMerged, keyGeneratorOpt,
+    super(config, instantTime, hoodieTable, 
MergeContext.create(Collections.emptyIterator()), partitionPath, fileId, 
taskContextSupplier, dataFileToBeMerged, keyGeneratorOpt,

Review Comment:
   🤖 nit: `MergeContext.create(Collections.emptyIterator())` here feels a bit 
misleading — this constructor receives records via `keyToNewRecords` directly 
and never uses the iterator from the context. A brief comment clarifying why 
the empty context is passed (or an overload that doesn't require one) might 
save the next reader a double-take.



##########
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/io/HoodieAbstractMergeHandle.java:
##########
@@ -46,6 +46,9 @@
 @Slf4j
 public abstract class HoodieAbstractMergeHandle<T, I, K, O> extends 
HoodieWriteHandle<T, I, K, O> implements HoodieMergeHandle<T, I, K, O> {
 
+  // The number of incoming update records based on tagging; -1 means unknown
+  @Getter
+  protected long numIncomingUpdates = -1L;

Review Comment:
   🤖 nit: could you extract `-1L` into a named constant (e.g. 
`UNKNOWN_NUM_INCOMING_UPDATES`)? The comment explains the sentinel well, but a 
constant makes it self-documenting at every use site too.



##########
hudi-client/hudi-spark-client/src/test/java/org/apache/hudi/client/TestUpdateSchemaEvolution.java:
##########
@@ -122,7 +124,7 @@ private void 
assertSchemaEvolutionOnUpdateResult(WriteStatus insertResult, Hoodi
     jsc.parallelize(Arrays.asList(1)).map(x -> {
       Executable executable = () -> {
         HoodieWriteMergeHandle mergeHandle = new 
HoodieWriteMergeHandle(updateTable.getConfig(), "101", updateTable,
-            updateRecords.iterator(), updateRecords.get(0).getPartitionPath(), 
insertResult.getFileId(), supplier, Option.empty());
+            MergeContext.create(updateRecords.size(), (Iterator) 
updateRecords.iterator()), updateRecords.get(0).getPartitionPath(), 
insertResult.getFileId(), supplier, Option.empty());

Review Comment:
   🤖 nit: the raw `(Iterator)` cast here (and in a few spots in 
`TestMergeHandle.java`) is a little awkward — would it be possible to type the 
list as `List<HoodieRecord<T>>` upstream so the cast can be avoided, or at 
least use `@SuppressWarnings("unchecked")` alongside a typed cast like 
`(Iterator<HoodieRecord<T>>)`?



##########
hudi-client/hudi-java-client/src/main/java/org/apache/hudi/table/action/commit/JavaUpsertPartitioner.java:
##########
@@ -191,7 +191,7 @@ private void assignInserts(WorkloadProfile profile, 
HoodieEngineContext context)
             } else {
               recordsPerBucket.add(totalUnassignedInserts - (insertBuckets - 
1) * insertRecordsPerBucket);
             }
-            BucketInfo bucketInfo = new BucketInfo(BucketType.INSERT, 
FSUtils.createNewFileIdPfx(), partitionPath);
+            BucketInfo bucketInfo = new BucketInfo(BucketType.INSERT, 
FSUtils.createNewFileIdPfx(), partitionPath, 0L);

Review Comment:
   🤖 nit: the `0L` passed as `numUpdates` for an INSERT bucket reads a bit like 
a magic number — would a named constant (e.g. `NO_UPDATES = 0L`) or even just a 
comment clarify that inserts carry no update count?



##########
hudi-client/hudi-spark-client/src/test/java/org/apache/hudi/table/action/commit/TestUpsertPartitioner.java:
##########
@@ -392,30 +404,47 @@ private HoodieWriteConfig.Builder 
makeHoodieClientConfigBuilder() {
 
   @Test
   void testMapAndListBasedSparkBucketInfoGetter() {
-    List<BucketInfo> bucketInfos = Arrays.asList(new 
BucketInfo(BucketType.UPDATE, "bucket1", "partition1"),
-        new BucketInfo(BucketType.UPDATE, "bucket2", "partition2"));
+    List<BucketInfo> bucketInfos = Arrays.asList(new 
BucketInfo(BucketType.UPDATE, "bucket1", "partition1", 42),
+        new BucketInfo(BucketType.UPDATE, "bucket2", "partition2", 99));
     Map<Integer, BucketInfo> bucketInfoMap = new HashMap<>();
     bucketInfoMap.put(0, bucketInfos.get(0));
     bucketInfoMap.put(1, bucketInfos.get(1));
     MapBasedSparkBucketInfoGetter getter = new 
MapBasedSparkBucketInfoGetter(bucketInfoMap);
     ListBasedSparkBucketInfoGetter listGetter = new 
ListBasedSparkBucketInfoGetter(bucketInfos);
     assertEquals(bucketInfos.get(0), getter.getBucketInfo(0));
     assertEquals(bucketInfos.get(0), listGetter.getBucketInfo(0));
+    assertEquals(42, getter.getBucketInfo(0).getNumUpdates());
+    assertEquals(42, listGetter.getBucketInfo(0).getNumUpdates());
     assertEquals(bucketInfos.get(1), getter.getBucketInfo(1));
     assertEquals(bucketInfos.get(1), listGetter.getBucketInfo(1));
+    assertEquals(99, getter.getBucketInfo(1).getNumUpdates());
+    assertEquals(99, listGetter.getBucketInfo(1).getNumUpdates());

Review Comment:
   🤖 nit: the `-1L` sentinel is hardcoded here (and again a few lines below in 
`testInsertOverwriteBucketInfoGetter`). If `BucketInfo` exposes a named 
constant for "unknown" (e.g., `UNKNOWN_NUM_UPDATES`), could the test reference 
that instead so the magic value doesn't have to be re-explained in every 
assertion message?



##########
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/io/HoodieConcatHandle.java:
##########
@@ -72,10 +72,10 @@ public class HoodieConcatHandle<T, I, K, O> extends 
HoodieWriteMergeHandle<T, I,
   private final Iterator<HoodieRecord<T>> recordItr;
 
   public HoodieConcatHandle(HoodieWriteConfig config, String instantTime, 
HoodieTable<T, I, K, O> hoodieTable,
-                            Iterator<HoodieRecord<T>> recordItr, String 
partitionPath, String fileId,
+                            MergeContext<T> mergeContext, String 
partitionPath, String fileId,
                             TaskContextSupplier taskContextSupplier, 
Option<BaseKeyGenerator> keyGeneratorOpt) {
-    super(config, instantTime, hoodieTable, Collections.emptyIterator(), 
partitionPath, fileId, taskContextSupplier, keyGeneratorOpt);
-    this.recordItr = recordItr;
+    super(config, instantTime, hoodieTable, 
MergeContext.create(mergeContext.getNumIncomingUpdates(), 
Collections.emptyIterator()), partitionPath, fileId, taskContextSupplier, 
keyGeneratorOpt);

Review Comment:
   🤖 nit: the inline `MergeContext.create(...)` nested inside `super(...)` 
makes this already-long line quite hard to parse. Since Java prevents 
pre-`super` statements, it might be worth considering an overload that accepts 
the count and iterator directly to avoid the double wrapping here.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to