gaborkaszab commented on PR #12132: URL: https://github.com/apache/iceberg/pull/12132#issuecomment-2634011059
> @gaborkaszab @ebyhr I did some more search and found that it seems Iceberg already have mechanism for clean up unreferenced statistics and partition statistics files as part of the [delete-orphan-files procedure](https://iceberg.apache.org/docs/nightly/maintenance/#delete-orphan-files). This unit test in spark also tested the scenario.
>
> https://github.com/apache/iceberg/blob/9feca0c306b9f49382c4b9bab39daddf5a81712c/spark/v3.5/spark-extensions/src/test/java/org/apache/iceberg/spark/extensions/TestRemoveOrphanFilesProcedure.java#L443
>
> In such case, do we still really need to handle this issue?

I know, remove_orphan_files basically drops everything that is not referenced by the latest metadata of the table. However, I'm not sure how much we should rely on that procedure to clean up the stats files.

Orphan files are mostly supposed to be created after some failure when trying to commit files into the table. Replacing a stats file with another seems like a normal operation that shouldn't leave orphan files behind. With how stats work now, we have just added another way to create orphan files.

What I don't know is whether this is intentional or a side effect that no one has taken care of so far. In general, the table format leaving unreferenced stats files behind seems like the latter to me, TBH. Hence I said this would be worth discussing on the dev@ list, to reach a wider audience and see if anyone has deeper insights.

I cc'd @ajantha-bhat because he has had a lot of exposure to the partition stats development recently. He might know more.

-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For queries about this service, please contact Infrastructure at: us...@infra.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org