Re: [I] Duplicate file name in Iceberg's metadata [iceberg]

2024-01-02 Thread via GitHub
rdblue closed issue #8953: Duplicate file name in Iceberg's metadata URL: https://github.com/apache/iceberg/issues/8953 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsub

Re: [I] Duplicate file name in Iceberg's metadata [iceberg]

2023-12-07 Thread via GitHub
amogh-jahagirdar commented on issue #8953: URL: https://github.com/apache/iceberg/issues/8953#issuecomment-1846634916 I also want to verify why fileCount doesn't really cover the uniqueness right now; the only other way would be if it's in a different thread (and both threads just end up ha

Re: [I] Duplicate file name in Iceberg's metadata [iceberg]

2023-12-07 Thread via GitHub
amogh-jahagirdar commented on issue #8953: URL: https://github.com/apache/iceberg/issues/8953#issuecomment-1846628756 OK, I have now actually looked at the history of these changes: https://github.com/apache/iceberg/pull/5214 was never merged but was followed by https://github.com/apache/iceberg/pull/

Re: [I] Duplicate file name in Iceberg's metadata [iceberg]

2023-12-05 Thread via GitHub
amogh-jahagirdar commented on issue #8953: URL: https://github.com/apache/iceberg/issues/8953#issuecomment-1841135980 Thanks for the details. Yeah, I agree a UUID would of course essentially guarantee uniqueness; I'm just not sure of all the implications of changing the output paths. There
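
To illustrate the trade-off discussed in this comment, here is a minimal Python sketch. The naming scheme, the function names `counter_name` and `uuid_name`, and the sample values are hypothetical and chosen only for illustration; they are not taken from Iceberg's code. It shows that a name built from deterministic inputs plus a counter repeats whenever every input repeats, while a name containing a random UUID does not depend on those inputs being distinct.

```python
import uuid


def counter_name(partition_id: int, task_id: int, operation_id: str, file_count: int) -> str:
    # Hypothetical scheme: every component is deterministic, so identical
    # inputs always yield an identical file name.
    return f"{partition_id:05d}-{task_id}-{operation_id}-{file_count:05d}.parquet"


def uuid_name(partition_id: int, task_id: int) -> str:
    # Hypothetical alternative: a fresh random UUID is embedded in each name.
    return f"{partition_id:05d}-{task_id}-{uuid.uuid4()}.parquet"


# Two writers that happen to share every input (assumed scenario, for
# example a counter that starts over while the other ids stay the same).
first = counter_name(0, 12, "query-abc", 1)
second = counter_name(0, 12, "query-abc", 1)
collides = first == second

# With a UUID in the name, the same inputs still give different names.
u1 = uuid_name(0, 12)
u2 = uuid_name(0, 12)
```

The sketch says nothing about the cost of changing output paths, which is the open question raised in the comment.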

Re: [I] Duplicate file name in Iceberg's metadata [iceberg]

2023-11-29 Thread via GitHub
github-raphael-douyere commented on issue #8953: URL: https://github.com/apache/iceberg/issues/8953#issuecomment-1831463021 @Fokko I don't know how to build a simple and reproducible setup. We had the issue at a rate of ~10 files per week with an app producing hundreds of files per hour.

Re: [I] Duplicate file name in Iceberg's metadata [iceberg]

2023-11-28 Thread via GitHub
amogh-jahagirdar commented on issue #8953: URL: https://github.com/apache/iceberg/issues/8953#issuecomment-1830774809 Yeah, +1 to what Fokko said: if we can get a minimal test case that reproduces the problem, that would be helpful. > After looking a bit more, we think this https

Re: [I] Duplicate file name in Iceberg's metadata [iceberg]

2023-11-28 Thread via GitHub
Fokko commented on issue #8953: URL: https://github.com/apache/iceberg/issues/8953#issuecomment-1830756897 > After looking a bit more, we think this https://github.com/apache/iceberg/pull/5214 introduced the bug. @github-raphael-douyere Can you elaborate on why you think this is the c

Re: [I] Duplicate file name in Iceberg's metadata [iceberg]

2023-11-28 Thread via GitHub
cccs-jory commented on issue #8953: URL: https://github.com/apache/iceberg/issues/8953#issuecomment-1830747479 We observed the same behaviour in Spark 3.4 with Iceberg 1.3 as well. Additionally we tested with Spark 3.5 and Iceberg 1.4 and have the same problem. Our job stops and re

Re: [I] Duplicate file name in Iceberg's metadata [iceberg]

2023-11-16 Thread via GitHub
github-raphael-douyere commented on issue #8953: URL: https://github.com/apache/iceberg/issues/8953#issuecomment-1814229365 After looking a bit more, we think this https://github.com/apache/iceberg/pull/5214 introduced the bug.

Re: [I] Duplicate file name in Iceberg's metadata [iceberg]

2023-11-12 Thread via GitHub
github-raphael-douyere commented on issue #8953: URL: https://github.com/apache/iceberg/issues/8953#issuecomment-1807614244 Not sure it helps, but I need to mention that our streaming app is restarted a lot. Maybe everything works fine when everything is kept in memory, but on restart some elements a

Re: [I] Duplicate file name in Iceberg's metadata [iceberg]

2023-11-09 Thread via GitHub
amogh-jahagirdar commented on issue #8953: URL: https://github.com/apache/iceberg/issues/8953#issuecomment-1805005236 That being said, I do see there have been a few issues reported that are very similar as @github-raphael-douyere pointed out. I'm looking into this

Re: [I] Duplicate file name in Iceberg's metadata [iceberg]

2023-11-09 Thread via GitHub
amogh-jahagirdar commented on issue #8953: URL: https://github.com/apache/iceberg/issues/8953#issuecomment-1805003446 It looks like Spark creates a data writer per task, so we should be good there https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execu

Re: [I] Duplicate file name in Iceberg's metadata [iceberg]

2023-11-09 Thread via GitHub
amogh-jahagirdar commented on issue #8953: URL: https://github.com/apache/iceberg/issues/8953#issuecomment-1805001659 Looking at the code, this shouldn't happen but would need to check more deeply. We create an `OutputFileFactory` per writer, https://github.com/apache/iceberg/blob/main/spa

Re: [I] Duplicate file name in Iceberg's metadata [iceberg]

2023-11-06 Thread via GitHub
github-raphael-douyere commented on issue #8953: URL: https://github.com/apache/iceberg/issues/8953#issuecomment-1794358478 We enabled S3 versioning on the bucket and can see a file name being used twice by two distinct micro-batches. So it is not a case of task retry inside Spark.
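
A check for this kind of reuse can be sketched in a few lines of Python, assuming the data file paths have already been collected into a list (how they are exported from the table or the bucket is not shown here). The helper name `find_duplicate_paths` and the sample paths are hypothetical.

```python
from collections import Counter
from typing import Dict, Iterable


def find_duplicate_paths(paths: Iterable[str]) -> Dict[str, int]:
    # Count every path and keep only those that occur more than once.
    counts = Counter(paths)
    return {path: n for path, n in counts.items() if n > 1}


# Hypothetical sample: the first path is recorded by two different writes.
sample_paths = [
    "s3://bucket/table/data/00000-12-abc-00001.parquet",
    "s3://bucket/table/data/00001-13-abc-00001.parquet",
    "s3://bucket/table/data/00000-12-abc-00001.parquet",
]
duplicates = find_duplicate_paths(sample_paths)
```

Any path returned by the helper was recorded more than once and is worth comparing against the object's version history.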

Re: [I] Duplicate file name in Iceberg's metadata [iceberg]

2023-10-31 Thread via GitHub
Fokko commented on issue #8953: URL: https://github.com/apache/iceberg/issues/8953#issuecomment-1787411953 Slack conversation for reference that provides some more interesting details: https://apache-iceberg.slack.com/archives/C025PH0G1D4/p1698676018510089

[I] Duplicate file name in Iceberg's metadata [iceberg]

2023-10-30 Thread via GitHub
github-raphael-douyere opened a new issue, #8953: URL: https://github.com/apache/iceberg/issues/8953 ### Apache Iceberg version 1.3.1 ### Query engine Spark ### Please describe the bug 🐞 While writing data to an Iceberg table using Spark Streaming 3.4.1 / Ic