gaborkaszab commented on PR #12132:
URL: https://github.com/apache/iceberg/pull/12132#issuecomment-2634011059

   > @gaborkaszab @ebyhr I did some more search and found that it seems Iceberg
   > already have mechanism for clean up unreferenced statistics and partition
   > statistics files as part of the [delete-orphan-files
   > procedure](https://iceberg.apache.org/docs/nightly/maintenance/#delete-orphan-files).
   > This unit test in spark also tested the scenario.
   >
   > https://github.com/apache/iceberg/blob/9feca0c306b9f49382c4b9bab39daddf5a81712c/spark/v3.5/spark-extensions/src/test/java/org/apache/iceberg/spark/extensions/TestRemoveOrphanFilesProcedure.java#L443
   >
   > In such case, do we still really need to handle this issue?
   
   I know, remove_orphan_files basically drops everything that the latest
metadata of the table no longer references. However, I'm not sure how much we
should rely on that procedure to clean up the stats files. Orphan files are
mostly supposed to be the result of a failure while committing files to the
table. Replacing one stats file with another seems like a normal operation
that shouldn't leave orphan files behind.
   With how stats work now, we have just added another way to create orphan
files. What I don't know is whether this is intentional or a side effect that
no one has taken care of so far. In general, though, a table format leaving
unreferenced stats files behind looks like the latter to me, TBH.
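
   To make the concern concrete, here is a rough toy model of the situation.
It is not the Iceberg API: `ToyTable`, `set_stats`, `orphan_files` and the
file paths are all made up for illustration. It only shows that a plain
replacement of a stats file, with no failure involved, leaves the old file
unreferenced on storage until something like remove_orphan_files runs.

```python
# Toy model only: this is NOT the Iceberg API. All names are hypothetical.
from dataclasses import dataclass, field


@dataclass
class ToyTable:
    # snapshot id -> path of the stats file the current metadata references
    stats_by_snapshot: dict = field(default_factory=dict)
    # every file physically present under the table location
    storage: set = field(default_factory=set)

    def set_stats(self, snapshot_id, path):
        """Register a stats file for a snapshot.

        A previously registered file for the same snapshot is not deleted;
        it merely stops being referenced by the metadata.
        """
        self.storage.add(path)
        self.stats_by_snapshot[snapshot_id] = path

    def orphan_files(self):
        """Files on storage that the current metadata does not reference."""
        return self.storage - set(self.stats_by_snapshot.values())


table = ToyTable()
table.set_stats(1, "stats/snap-1-v1.puffin")
table.set_stats(1, "stats/snap-1-v2.puffin")  # normal replacement, no failure
print(sorted(table.orphan_files()))  # ['stats/snap-1-v1.puffin']
```

   In this sketch the first stats file becomes an orphan as a side effect of a
successful operation, which is the behaviour I'm questioning above.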
   
   Hence I said this would be worth discussing on the dev@ list, to reach a
wider audience and see if anyone has deeper insights. I cc'd @ajantha-bhat
because he has had a lot of exposure to the partition stats development
recently. He might know more.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

