gaborkaszab commented on code in PR #12132:
URL: https://github.com/apache/iceberg/pull/12132#discussion_r1935324663
##########
core/src/main/java/org/apache/iceberg/CatalogUtil.java:
##########
@@ -108,6 +112,31 @@ public static void dropTableData(FileIO io, TableMetadata metadata) {
     LOG.info("Manifests to delete: {}", Joiner.on(", ").join(manifestsToDelete));
 
+    // Collect all metadata files and extract historical statistics files
+    for (TableMetadata.MetadataLogEntry previousFile : metadata.previousFiles()) {
+      metadataToDelete.add(previousFile.file());
+
+      // Skip missing metadata files
+      if (!io.newInputFile(previousFile.file()).exists()) {

Review Comment:
   If it doesn't exist, there is no need to add it to `metadataToDelete` on L117. I'd move that line after this check.



##########
core/src/main/java/org/apache/iceberg/CatalogUtil.java:
##########
@@ -108,6 +112,31 @@ public static void dropTableData(FileIO io, TableMetadata metadata) {
     LOG.info("Manifests to delete: {}", Joiner.on(", ").join(manifestsToDelete));
 
+    // Collect all metadata files and extract historical statistics files
+    for (TableMetadata.MetadataLogEntry previousFile : metadata.previousFiles()) {
+      metadataToDelete.add(previousFile.file());
+
+      // Skip missing metadata files
+      if (!io.newInputFile(previousFile.file()).exists()) {
+        LOG.warn("Skipping missing metadata file: {}", previousFile.file());
+        continue;
+      }
+
+      TableMetadata previousMetadata = TableMetadataParser.read(io, previousFile.file());
+      statisticsToDelete.addAll(previousMetadata.statisticsFiles());
+      partitionStatsToDelete.addAll(previousMetadata.partitionStatisticsFiles());
+    }
+
+    // Process the latest metadata file
+    metadataToDelete.add(metadata.metadataFileLocation());
+    if (io.newInputFile(metadata.metadataFileLocation()).exists()) {
+      TableMetadata latestMetadata = TableMetadataParser.read(io, metadata.metadataFileLocation());

Review Comment:
   Do we need an extra I/O here to read `latestMetadata`? We already have it loaded in `metadata`.



##########
core/src/main/java/org/apache/iceberg/CatalogUtil.java:
##########
@@ -108,6 +112,31 @@ public static void dropTableData(FileIO io, TableMetadata metadata) {
     LOG.info("Manifests to delete: {}", Joiner.on(", ").join(manifestsToDelete));
 
+    // Collect all metadata files and extract historical statistics files
+    for (TableMetadata.MetadataLogEntry previousFile : metadata.previousFiles()) {

Review Comment:
   This for loop can issue a lot of storage operations sequentially. We could use a `Tasks.foreach()` to run them in parallel.



##########
core/src/main/java/org/apache/iceberg/CatalogUtil.java:
##########
@@ -108,6 +112,31 @@ public static void dropTableData(FileIO io, TableMetadata metadata) {
     LOG.info("Manifests to delete: {}", Joiner.on(", ").join(manifestsToDelete));
 
+    // Collect all metadata files and extract historical statistics files
+    for (TableMetadata.MetadataLogEntry previousFile : metadata.previousFiles()) {
+      metadataToDelete.add(previousFile.file());
+
+      // Skip missing metadata files
+      if (!io.newInputFile(previousFile.file()).exists()) {
+        LOG.warn("Skipping missing metadata file: {}", previousFile.file());
+        continue;
+      }
+
+      TableMetadata previousMetadata = TableMetadataParser.read(io, previousFile.file());

Review Comment:
   I wonder if we need the existence check on L120 at all. We could just call this read() and handle the exception if the file doesn't exist. This would reduce the number of operations made on the storage.
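   Putting the last two comments together, parallel iteration via `Tasks.foreach()` plus relying on the read failure instead of a separate `exists()` call, a rough sketch of what I have in mind (untested; the concurrent sets, `ThreadPools.getWorkerPool()`, and mapping missing files to `NotFoundException` are my assumptions, not something the PR already does):
   ```java
   import java.util.Set;
   import org.apache.iceberg.PartitionStatisticsFile;
   import org.apache.iceberg.StatisticsFile;
   import org.apache.iceberg.TableMetadata;
   import org.apache.iceberg.TableMetadataParser;
   import org.apache.iceberg.exceptions.NotFoundException;
   import org.apache.iceberg.relocated.com.google.common.collect.Sets;
   import org.apache.iceberg.util.Tasks;
   import org.apache.iceberg.util.ThreadPools;

   // Sketch only: assumes io, metadata, and LOG from dropTableData's scope.
   // The sets must be thread-safe once the loop runs in parallel.
   Set<String> metadataToDelete = Sets.newConcurrentHashSet();
   Set<StatisticsFile> statisticsToDelete = Sets.newConcurrentHashSet();
   Set<PartitionStatisticsFile> partitionStatsToDelete = Sets.newConcurrentHashSet();

   Tasks.foreach(metadata.previousFiles())
       .executeWith(ThreadPools.getWorkerPool()) // assumption: reuse the shared worker pool
       .noRetry()
       .suppressFailureWhenFinished()
       .run(
           previousFile -> {
             metadataToDelete.add(previousFile.file());
             try {
               // read() already fails for a missing file, so no separate exists() call
               TableMetadata previousMetadata =
                   TableMetadataParser.read(io, previousFile.file());
               statisticsToDelete.addAll(previousMetadata.statisticsFiles());
               partitionStatsToDelete.addAll(previousMetadata.partitionStatisticsFiles());
             } catch (NotFoundException e) {
               // assumption: the FileIO surfaces missing files as NotFoundException
               LOG.warn("Skipping missing metadata file: {}", previousFile.file());
             }
           });
   ```
   Even if parallelism turns out to be overkill here, the try/catch half alone saves one storage call per metadata file.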
##########
core/src/main/java/org/apache/iceberg/CatalogUtil.java:
##########
@@ -108,6 +112,31 @@ public static void dropTableData(FileIO io, TableMetadata metadata) {
     LOG.info("Manifests to delete: {}", Joiner.on(", ").join(manifestsToDelete));
 
+    // Collect all metadata files and extract historical statistics files

Review Comment:
   I found a `ReachableFileUtil.metadataFileLocations()` function that has a recursive mode: it not only iterates `previousFiles()` but also opens those metadata files and walks their `previousFiles()` too. I wonder in what cases the recursive run would give different results compared to the non-recursive one, but I guess it was implemented for some reason :) I think we should check whether this iteration is enough here, or whether we should also do a recursive one so we don't miss any metadata locations (see the second sketch below these comments).



##########
core/src/test/java/org/apache/iceberg/hadoop/TestCatalogUtilDropTable.java:
##########
@@ -129,6 +130,81 @@ public void dropTableDataDeletesExpectedFiles() throws IOException {
         .containsAll(partitionStatsLocations);
   }
 
+  @Test
+  public void dropTableDataDeletesAllPuffinFiles() throws IOException {

Review Comment:
   Maybe `dropTableDataDeletesStatsFromAllMetadataFiles`?



##########
core/src/test/java/org/apache/iceberg/hadoop/TestCatalogUtilDropTable.java:
##########
@@ -129,6 +130,81 @@ public void dropTableDataDeletesExpectedFiles() throws IOException {
         .containsAll(partitionStatsLocations);
   }
 
+  @Test

Review Comment:
   I might be wrong here, but I don't think all the catalog implementations use this function to purge files. I haven't found any usage in RESTCatalog, for instance. Adding this to CatalogTests would also result in running it for non-affected catalogs IMO.



##########
core/src/test/java/org/apache/iceberg/hadoop/TestCatalogUtilDropTable.java:
##########
@@ -129,6 +130,81 @@ public void dropTableDataDeletesExpectedFiles() throws IOException {
         .containsAll(partitionStatsLocations);
   }
 
+  @Test
+  public void dropTableDataDeletesAllPuffinFiles() throws IOException {
+    table.newFastAppend().appendFile(FILE_A).commit();
+
+    StatisticsFile oldStatisticsFile =
+        writeStatsFile(
+            table.currentSnapshot().snapshotId(),
+            table.currentSnapshot().sequenceNumber(),
+            tableLocation + "/metadata/" + UUID.randomUUID() + ".stats",
+            table.io());
+    table
+        .updateStatistics()
+        .setStatistics(oldStatisticsFile.snapshotId(), oldStatisticsFile)

Review Comment:
   This function is deprecated. The overload without the snapshotId param should be used. This goes for the other occurrences below too.
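   For the deprecation note right above, the replacement call would look roughly like this (just a sketch; the non-deprecated overload takes the snapshot ID from the `StatisticsFile` itself):
   ```java
   // non-deprecated overload: the snapshot ID comes from oldStatisticsFile.snapshotId()
   table.updateStatistics().setStatistics(oldStatisticsFile).commit();
   ```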
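   And on the `ReachableFileUtil` question above: my guess is the recursive mode matters because `previousFiles()` is capped by `write.metadata.previous-versions-max`, so entries that fell off the current metadata log may still be reachable from the older metadata files' own logs. The recursive variant needs a `Table` handle, which `dropTableData()` does not have (it only gets a `FileIO` and a `TableMetadata`), so the snippet below is only to illustrate the call shape, not a drop-in change:
   ```java
   import java.util.Set;
   import org.apache.iceberg.ReachableFileUtil;
   import org.apache.iceberg.Table;

   // hypothetical helper: recursive=true also walks previousFiles() of every
   // previous metadata file, not just the current metadata's log
   static Set<String> allMetadataLocations(Table table) {
     return ReachableFileUtil.metadataFileLocations(table, true);
   }
   ```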
##########
core/src/main/java/org/apache/iceberg/CatalogUtil.java:
##########
@@ -108,6 +112,31 @@ public static void dropTableData(FileIO io, TableMetadata metadata) {
     LOG.info("Manifests to delete: {}", Joiner.on(", ").join(manifestsToDelete));
 
+    // Collect all metadata files and extract historical statistics files
+    for (TableMetadata.MetadataLogEntry previousFile : metadata.previousFiles()) {
+      metadataToDelete.add(previousFile.file());
+
+      // Skip missing metadata files
+      if (!io.newInputFile(previousFile.file()).exists()) {
+        LOG.warn("Skipping missing metadata file: {}", previousFile.file());
+        continue;
+      }
+
+      TableMetadata previousMetadata = TableMetadataParser.read(io, previousFile.file());
+      statisticsToDelete.addAll(previousMetadata.statisticsFiles());
+      partitionStatsToDelete.addAll(previousMetadata.partitionStatisticsFiles());
+    }
+
+    // Process the latest metadata file

Review Comment:
   Instead of having dedicated code to take care of the latest metadata file, can't we simply initialize `metadataToDelete` (and the stats sets) with these values? (The existence check and the I/O read probably aren't needed anyway; see my comment below.)
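   A sketch of the initialization I mean (variable names follow the PR's code; using `Sets.newHashSet()` from the relocated Guava is an assumption on my side):
   ```java
   import java.util.Set;
   import org.apache.iceberg.PartitionStatisticsFile;
   import org.apache.iceberg.StatisticsFile;
   import org.apache.iceberg.relocated.com.google.common.collect.Sets;

   // Seed the delete sets from the already-loaded latest metadata, so the
   // dedicated "process the latest metadata file" block (and its exists()/read()
   // round-trips) becomes unnecessary.
   Set<String> metadataToDelete = Sets.newHashSet(metadata.metadataFileLocation());
   Set<StatisticsFile> statisticsToDelete = Sets.newHashSet(metadata.statisticsFiles());
   Set<PartitionStatisticsFile> partitionStatsToDelete =
       Sets.newHashSet(metadata.partitionStatisticsFiles());
   ```
   With that, this whole section reduces to the loop over `previousFiles()`.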