github-raphael-douyere commented on issue #8953: URL: https://github.com/apache/iceberg/issues/8953#issuecomment-1831463021
@Fokko I don't know how to build a simple, reproducible setup. We hit the issue at a rate of ~10 files per week with an app producing hundreds of files per hour.

@amogh-jahagirdar Yes, I know that the file name is not only the query id. But I think the other elements (`taskId` and `partitionId`) can definitely repeat. What I'm not sure about is the `fileCount` part. I think it is kept in memory but resets when the app is restarted (i.e. it is not part of the state).

So my point is: with a UUID this can't happen (barring a UUID collision), because any collision in the other parts of the file name is absorbed by the unique part. Another fix could be to keep the `operationId` but add a UUID as well. This would make the file names a little longer, but that is probably fine to avoid data loss issues.
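
To make the suspected collision concrete, here is a minimal Python sketch. It is not Iceberg's code: the `FileNameFactory` class, the name layout, the `.parquet` suffix, and the `op-1` value are all hypothetical. It also assumes that `partitionId`, `taskId`, and `operationId` come back unchanged after a restart while the in-memory `fileCount` starts over, which is exactly the part I'm unsure about.

```python
import itertools
import uuid


class FileNameFactory:
    """Hypothetical stand-in for a writer's data file name generator."""

    def __init__(self, partition_id, task_id, operation_id, with_uuid=False):
        self.partition_id = partition_id
        self.task_id = task_id
        self.operation_id = operation_id
        self.with_uuid = with_uuid
        # In-memory counter: it starts again at 1 whenever a new factory is
        # built, which models the counter being lost on application restart.
        self._file_count = itertools.count(1)

    def new_file_name(self):
        parts = [
            f"{self.partition_id:05d}",
            str(self.task_id),
            self.operation_id,
            f"{next(self._file_count):05d}",
        ]
        if self.with_uuid:
            # Proposed fix: keep the operation id and append a random UUID.
            parts.append(str(uuid.uuid4()))
        return "-".join(parts) + ".parquet"


def names_across_restart(with_uuid):
    """Return the first file name produced before and after a simulated restart."""
    before = FileNameFactory(0, 1, "op-1", with_uuid).new_file_name()
    # Assumption: the restarted app gets the same partition, task and operation ids.
    after = FileNameFactory(0, 1, "op-1", with_uuid).new_file_name()
    return before, after
```

Under these assumptions, `names_across_restart(False)` returns the same name twice, so the second write would overwrite the first file, while `names_across_restart(True)` returns two different names that share the same prefix.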