gaborkaszab commented on code in PR #12132:
URL: https://github.com/apache/iceberg/pull/12132#discussion_r1935324663
##########
core/src/main/java/org/apache/iceberg/CatalogUtil.java:
##########
@@ -108,6 +112,31 @@ public static void dropTableData(FileIO io, TableMetadata metadata) {
     LOG.info("Manifests to delete: {}", Joiner.on(", ").join(manifestsToDelete));
 
+    // Collect all metadata files and extract historical statistics files
+    for (TableMetadata.MetadataLogEntry previousFile : metadata.previousFiles()) {
+      metadataToDelete.add(previousFile.file());
+
+      // Skip missing metadata files
+      if (!io.newInputFile(previousFile.file()).exists()) {

Review Comment:
   If it doesn't exist, there is no need to add it to `metadataToDelete` on L117. I'd move that line after this check.



##########
core/src/main/java/org/apache/iceberg/CatalogUtil.java:
##########
@@ -108,6 +112,31 @@ public static void dropTableData(FileIO io, TableMetadata metadata) {
     LOG.info("Manifests to delete: {}", Joiner.on(", ").join(manifestsToDelete));
 
+    // Collect all metadata files and extract historical statistics files
+    for (TableMetadata.MetadataLogEntry previousFile : metadata.previousFiles()) {
+      metadataToDelete.add(previousFile.file());
+
+      // Skip missing metadata files
+      if (!io.newInputFile(previousFile.file()).exists()) {
+        LOG.warn("Skipping missing metadata file: {}", previousFile.file());
+        continue;
+      }
+
+      TableMetadata previousMetadata = TableMetadataParser.read(io, previousFile.file());
+      statisticsToDelete.addAll(previousMetadata.statisticsFiles());
+      partitionStatsToDelete.addAll(previousMetadata.partitionStatisticsFiles());
+    }
+
+    // Process the latest metadata file
+    metadataToDelete.add(metadata.metadataFileLocation());
+    if (io.newInputFile(metadata.metadataFileLocation()).exists()) {
+      TableMetadata latestMetadata = TableMetadataParser.read(io, metadata.metadataFileLocation());

Review Comment:
   Do we need an extra I/O here to read `latestMetadata`? We already have it loaded in `metadata`.



##########
core/src/main/java/org/apache/iceberg/CatalogUtil.java:
##########
@@ -108,6 +112,31 @@ public static void dropTableData(FileIO io, TableMetadata metadata) {
     LOG.info("Manifests to delete: {}", Joiner.on(", ").join(manifestsToDelete));
 
+    // Collect all metadata files and extract historical statistics files
+    for (TableMetadata.MetadataLogEntry previousFile : metadata.previousFiles()) {

Review Comment:
   This for loop can issue a lot of storage operations sequentially. We could use a `Tasks.foreach()` to run them in parallel.



##########
core/src/main/java/org/apache/iceberg/CatalogUtil.java:
##########
@@ -108,6 +112,31 @@ public static void dropTableData(FileIO io, TableMetadata metadata) {
     LOG.info("Manifests to delete: {}", Joiner.on(", ").join(manifestsToDelete));
 
+    // Collect all metadata files and extract historical statistics files
+    for (TableMetadata.MetadataLogEntry previousFile : metadata.previousFiles()) {
+      metadataToDelete.add(previousFile.file());
+
+      // Skip missing metadata files
+      if (!io.newInputFile(previousFile.file()).exists()) {
+        LOG.warn("Skipping missing metadata file: {}", previousFile.file());
+        continue;
+      }
+
+      TableMetadata previousMetadata = TableMetadataParser.read(io, previousFile.file());

Review Comment:
   I wonder if we need the existence check on L120 at all. We could just call this read() and handle the exception if the file doesn't exist. This would reduce the number of operations made on the storage.
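   Putting the last two comments together, parallel iteration via `Tasks.foreach()` plus relying on the read failure instead of a separate `exists()` call, a rough sketch of what I have in mind (untested; the concurrent sets, `ThreadPools.getWorkerPool()`, and mapping missing files to `NotFoundException` are my assumptions, not something the PR already does):
   ```java
   import java.util.Set;
   import org.apache.iceberg.PartitionStatisticsFile;
   import org.apache.iceberg.StatisticsFile;
   import org.apache.iceberg.TableMetadata;
   import org.apache.iceberg.TableMetadataParser;
   import org.apache.iceberg.exceptions.NotFoundException;
   import org.apache.iceberg.relocated.com.google.common.collect.Sets;
   import org.apache.iceberg.util.Tasks;
   import org.apache.iceberg.util.ThreadPools;

   // Sketch only: assumes io, metadata, and LOG from dropTableData's scope.
   // The sets must be thread-safe once the loop runs in parallel.
   Set<String> metadataToDelete = Sets.newConcurrentHashSet();
   Set<StatisticsFile> statisticsToDelete = Sets.newConcurrentHashSet();
   Set<PartitionStatisticsFile> partitionStatsToDelete = Sets.newConcurrentHashSet();

   Tasks.foreach(metadata.previousFiles())
       .executeWith(ThreadPools.getWorkerPool()) // assumption: reuse the shared worker pool
       .noRetry()
       .suppressFailureWhenFinished()
       .run(
           previousFile -> {
             metadataToDelete.add(previousFile.file());
             try {
               // read() already fails for a missing file, so no separate exists() call
               TableMetadata previousMetadata =
                   TableMetadataParser.read(io, previousFile.file());
               statisticsToDelete.addAll(previousMetadata.statisticsFiles());
               partitionStatsToDelete.addAll(previousMetadata.partitionStatisticsFiles());
             } catch (NotFoundException e) {
               // assumption: the FileIO surfaces missing files as NotFoundException
               LOG.warn("Skipping missing metadata file: {}", previousFile.file());
             }
           });
   ```
   Even if parallelism turns out to be overkill here, the try/catch half alone saves one storage call per metadata file.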
##########
core/src/main/java/org/apache/iceberg/CatalogUtil.java:
##########
@@ -108,6 +112,31 @@ public static void dropTableData(FileIO io, TableMetadata metadata) {
     LOG.info("Manifests to delete: {}", Joiner.on(", ").join(manifestsToDelete));
 
+    // Collect all metadata files and extract historical statistics files

Review Comment:
   I found a `ReachableFileUtil.metadataFileLocations()` function that has a recursive mode: it not only iterates `previousFiles()` but also opens those metadata files and walks their `previousFiles()` too. I wonder in what cases the recursive run would give different results compared to the non-recursive one, but I guess it was implemented for some reason :) I think we should check whether this iteration is enough here, or whether we should also do a recursive one so we don't miss any metadata locations (see the second sketch below these comments).



##########
core/src/test/java/org/apache/iceberg/hadoop/TestCatalogUtilDropTable.java:
##########
@@ -129,6 +130,81 @@ public void dropTableDataDeletesExpectedFiles() throws IOException {
         .containsAll(partitionStatsLocations);
   }
 
+  @Test
+  public void dropTableDataDeletesAllPuffinFiles() throws IOException {

Review Comment:
   Maybe `dropTableDataDeletesStatsFromAllMetadataFiles`?



##########
core/src/test/java/org/apache/iceberg/hadoop/TestCatalogUtilDropTable.java:
##########
@@ -129,6 +130,81 @@ public void dropTableDataDeletesExpectedFiles() throws IOException {
         .containsAll(partitionStatsLocations);
   }
 
+  @Test

Review Comment:
   I might be wrong here, but I don't think all the catalog implementations use this function to purge files. I haven't found any usage in RESTCatalog, for instance. Adding this to CatalogTests would also result in running it for non-affected catalogs IMO.



##########
core/src/test/java/org/apache/iceberg/hadoop/TestCatalogUtilDropTable.java:
##########
@@ -129,6 +130,81 @@ public void dropTableDataDeletesExpectedFiles() throws IOException {
         .containsAll(partitionStatsLocations);
   }
 
+  @Test
+  public void dropTableDataDeletesAllPuffinFiles() throws IOException {
+    table.newFastAppend().appendFile(FILE_A).commit();
+
+    StatisticsFile oldStatisticsFile =
+        writeStatsFile(
+            table.currentSnapshot().snapshotId(),
+            table.currentSnapshot().sequenceNumber(),
+            tableLocation + "/metadata/" + UUID.randomUUID() + ".stats",
+            table.io());
+    table
+        .updateStatistics()
+        .setStatistics(oldStatisticsFile.snapshotId(), oldStatisticsFile)

Review Comment:
   This function is deprecated. The overload without the snapshotId param should be used. This goes for the other occurrences below too.
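   For the deprecation note right above, the replacement call would look roughly like this (just a sketch; the non-deprecated overload takes the snapshot ID from the `StatisticsFile` itself):
   ```java
   // non-deprecated overload: the snapshot ID comes from oldStatisticsFile.snapshotId()
   table.updateStatistics().setStatistics(oldStatisticsFile).commit();
   ```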
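   And on the `ReachableFileUtil` question above: my guess is the recursive mode matters because `previousFiles()` is capped by `write.metadata.previous-versions-max`, so entries that fell off the current metadata log may still be reachable from the older metadata files' own logs. The recursive variant needs a `Table` handle, which `dropTableData()` does not have (it only gets a `FileIO` and a `TableMetadata`), so the snippet below is only to illustrate the call shape, not a drop-in change:
   ```java
   import java.util.Set;
   import org.apache.iceberg.ReachableFileUtil;
   import org.apache.iceberg.Table;

   // hypothetical helper: recursive=true also walks previousFiles() of every
   // previous metadata file, not just the current metadata's log
   static Set<String> allMetadataLocations(Table table) {
     return ReachableFileUtil.metadataFileLocations(table, true);
   }
   ```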
##########
core/src/main/java/org/apache/iceberg/CatalogUtil.java:
##########
@@ -108,6 +112,31 @@ public static void dropTableData(FileIO io, TableMetadata metadata) {
     LOG.info("Manifests to delete: {}", Joiner.on(", ").join(manifestsToDelete));
 
+    // Collect all metadata files and extract historical statistics files
+    for (TableMetadata.MetadataLogEntry previousFile : metadata.previousFiles()) {
+      metadataToDelete.add(previousFile.file());
+
+      // Skip missing metadata files
+      if (!io.newInputFile(previousFile.file()).exists()) {
+        LOG.warn("Skipping missing metadata file: {}", previousFile.file());
+        continue;
+      }
+
+      TableMetadata previousMetadata = TableMetadataParser.read(io, previousFile.file());
+      statisticsToDelete.addAll(previousMetadata.statisticsFiles());
+      partitionStatsToDelete.addAll(previousMetadata.partitionStatisticsFiles());
+    }
+
+    // Process the latest metadata file

Review Comment:
   Instead of having dedicated code to take care of the latest metadata file, can't we simply initialize `metadataToDelete` (and the stats sets) with these values? (The existence check and the I/O read probably aren't needed anyway; see my comment below.)
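   A sketch of the initialization I mean (variable names follow the PR's code; using `Sets.newHashSet()` from the relocated Guava is an assumption on my side):
   ```java
   import java.util.Set;
   import org.apache.iceberg.PartitionStatisticsFile;
   import org.apache.iceberg.StatisticsFile;
   import org.apache.iceberg.relocated.com.google.common.collect.Sets;

   // Seed the delete sets from the already-loaded latest metadata, so the
   // dedicated "process the latest metadata file" block (and its exists()/read()
   // round-trips) becomes unnecessary.
   Set<String> metadataToDelete = Sets.newHashSet(metadata.metadataFileLocation());
   Set<StatisticsFile> statisticsToDelete = Sets.newHashSet(metadata.statisticsFiles());
   Set<PartitionStatisticsFile> partitionStatsToDelete =
       Sets.newHashSet(metadata.partitionStatisticsFiles());
   ```
   With that, this whole section reduces to the loop over `previousFiles()`.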