slfan1989 commented on code in PR #13831:
URL: https://github.com/apache/iceberg/pull/13831#discussion_r2295303950


##########
flink/v2.0/flink/src/test/java/org/apache/iceberg/flink/maintenance/operator/TestDeleteFilesProcessor.java:
##########
@@ -74,27 +79,172 @@ void testDeleteMissingFile() throws Exception {
     Path dummyFile =
        FileSystems.getDefault().getPath(table.location().substring(5), DUMMY_FILE_NAME);
 
-    deleteFile(tableLoader(), dummyFile.toString());
+    deleteFile(tableLoader(), dummyFile.toString(), true);
 
     assertThat(listFiles(table)).isEqualTo(TABLE_FILES);
   }
 
   @Test
   void testInvalidURIScheme() throws Exception {
-    deleteFile(tableLoader(), "wrong://");
+    deleteFile(tableLoader(), "wrong://", false);
 
     assertThat(listFiles(table)).isEqualTo(TABLE_FILES);
   }
 
-  private void deleteFile(TableLoader tableLoader, String fileName) throws Exception {
-    tableLoader().open();
+  @Test
+  void testDeleteNonExistentFile() throws Exception {
+    String nonexistentFile = "nonexistentFile.txt";
+
+    deleteFile(tableLoader(), nonexistentFile, true);
+
+    assertThat(listFiles(table)).isEqualTo(TABLE_FILES);
+  }
+
+  @Test
+  void testDeleteLargeFile() throws Exception {

Review Comment:
   @pvary This is a very good question! In most practical scenarios, the metadata files of Iceberg tables are not very large. However, given our internal use cases, I wrote this unit test to guard against a potential risk.
   
   We plan to provide feature data for the company's internal AI team. According to user feedback, these feature tables may contain hundreds of columns and must support multiple concurrent write jobs. This combination of high dimensionality and high concurrency could cause metadata files to grow rapidly.
   
   Therefore, we need to strike a balance between two strategies:
   
   - More, smaller metadata files: beneficial for concurrent writes, but they increase metadata management overhead.
   - Fewer, larger metadata files: fewer files to track, but individual operations may see higher latency.
   
   Currently, the default value of MANIFEST_TARGET_SIZE_BYTES in our system is 
8 MB:
   
   ```java
   public static final String MANIFEST_TARGET_SIZE_BYTES = "commit.manifest.target-size-bytes";
   public static final long MANIFEST_TARGET_SIZE_BYTES_DEFAULT = 8 * 1024 * 1024; // 8 MB
   ```
   
   We are now evaluating the appropriate value for this parameter. One suggestion is to increase it to 256 MB. Next, we will run performance tests comparing throughput and latency under different configurations to determine the optimal setting.
   
   I’d really like to hear about your experiences with metadata management. If you could share some practical approaches and insights, that would be fantastic.
   
   cc: @mxm @Guosmilesmile 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

