rdblue closed issue #8953: Duplicate file name in Iceberg's metadata
URL: https://github.com/apache/iceberg/issues/8953
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsub
amogh-jahagirdar commented on issue #8953:
URL: https://github.com/apache/iceberg/issues/8953#issuecomment-1846634916
I also want to verify why fileCount doesn't really cover the uniqueness
right now, the only other way would be if it's in a different thread (and both
threads just end up ha
amogh-jahagirdar commented on issue #8953:
URL: https://github.com/apache/iceberg/issues/8953#issuecomment-1846628756
Ok I actually looked at the history of these changes now
https://github.com/apache/iceberg/pull/5214 was never merged but followed by
https://github.com/apache/iceberg/pull/
amogh-jahagirdar commented on issue #8953:
URL: https://github.com/apache/iceberg/issues/8953#issuecomment-1841135980
Thanks for the details, yeah I agree a UUID would of course essentially
guarantee uniqueness, I'm just not sure of all the implications of changing the
output paths. There
github-raphael-douyere commented on issue #8953:
URL: https://github.com/apache/iceberg/issues/8953#issuecomment-1831463021
@Fokko I don't know how to have a simple and reproducible setup. We had
the issue at a rate of ~10 files per week with an app producing hundreds of
files per hour.
amogh-jahagirdar commented on issue #8953:
URL: https://github.com/apache/iceberg/issues/8953#issuecomment-1830774809
Yeah +1 to what Fokko said, if we can get a minimal setup reproduction test
case of the problem that would be helpful.
> After looking a bit more, we think this
https
Fokko commented on issue #8953:
URL: https://github.com/apache/iceberg/issues/8953#issuecomment-1830756897
> After looking a bit more, we think this
https://github.com/apache/iceberg/pull/5214 introduced the bug.
@github-raphael-douyere Can you elaborate on why you think this is the c
cccs-jory commented on issue #8953:
URL: https://github.com/apache/iceberg/issues/8953#issuecomment-1830747479
We observed the same behaviour in Spark 3.4 with Iceberg 1.3 as well.
Additionally we tested with Spark 3.5 and Iceberg 1.4 and have the same
problem.
Our job stops and re
github-raphael-douyere commented on issue #8953:
URL: https://github.com/apache/iceberg/issues/8953#issuecomment-1814229365
After looking a bit more, we think this
https://github.com/apache/iceberg/pull/5214 introduced the bug.
github-raphael-douyere commented on issue #8953:
URL: https://github.com/apache/iceberg/issues/8953#issuecomment-1807614244
Not sure it helps but I need to mention that our streaming app is restarted
a lot. Maybe everything works fine when everything is kept in memory but on restart
some elements a
amogh-jahagirdar commented on issue #8953:
URL: https://github.com/apache/iceberg/issues/8953#issuecomment-1805005236
That being said, I do see there have been a few issues reported that are
very similar as @github-raphael-douyere pointed out. I'm looking into this
amogh-jahagirdar commented on issue #8953:
URL: https://github.com/apache/iceberg/issues/8953#issuecomment-1805003446
It looks like Spark creates a data writer per task, so we should be good
there
https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execu
amogh-jahagirdar commented on issue #8953:
URL: https://github.com/apache/iceberg/issues/8953#issuecomment-1805001659
Looking at the code, this shouldn't happen but would need to check more
deeply. We create an `OutputFileFactory` per writer,
https://github.com/apache/iceberg/blob/main/spa
github-raphael-douyere commented on issue #8953:
URL: https://github.com/apache/iceberg/issues/8953#issuecomment-1794358478
We enabled S3 versioning on the bucket and can see a file name being used 2
times by 2 distinct micro-batches. So it is not a case of task retry inside
Spark.
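The observation above (one file name reused by two distinct micro-batches) can be illustrated with a minimal sketch. It is not Iceberg's actual code: `CounterNamer`, its fields, and the name layout are hypothetical stand-ins, and the premise that two writers could end up with the same partition ID, task ID, and operation ID (for example after a restart) is an assumption, not something the thread establishes. The sketch only shows that a per-instance counter cannot distinguish two such writers, while a random UUID component would.

```java
import java.util.UUID;
import java.util.concurrent.atomic.AtomicInteger;

public class FileNameSketch {

    // Hypothetical per-writer name generator; NOT Iceberg's OutputFileFactory.
    static final class CounterNamer {
        private final int partitionId;
        private final long taskId;
        private final String operationId;
        // The counter is local to this instance, so it restarts at 0 for every new writer.
        private final AtomicInteger fileCount = new AtomicInteger(0);

        CounterNamer(int partitionId, long taskId, String operationId) {
            this.partitionId = partitionId;
            this.taskId = taskId;
            this.operationId = operationId;
        }

        // Name built only from the identifiers and the per-instance counter.
        String next() {
            return String.format("%05d-%d-%s-%05d.parquet",
                    partitionId, taskId, operationId, fileCount.incrementAndGet());
        }

        // Same layout plus a random UUID, which makes a collision practically impossible.
        String nextWithUuid() {
            return String.format("%05d-%d-%s-%05d-%s.parquet",
                    partitionId, taskId, operationId, fileCount.incrementAndGet(),
                    UUID.randomUUID());
        }
    }

    public static void main(String[] args) {
        // Assumption: two writers (e.g. from two micro-batches) receive identical identifiers.
        CounterNamer first = new CounterNamer(0, 1L, "op");
        CounterNamer second = new CounterNamer(0, 1L, "op");

        // Counter-only names collide because both counters start from zero.
        System.out.println(first.next().equals(second.next()));

        // UUID-suffixed names differ even though the identifiers are identical.
        System.out.println(first.nextWithUuid().equals(second.nextWithUuid()));
    }
}
```

If the identifiers are in fact always distinct across micro-batches, this collision cannot occur and the cause would have to lie elsewhere.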
Fokko commented on issue #8953:
URL: https://github.com/apache/iceberg/issues/8953#issuecomment-1787411953
Slack conversation for reference that provides some more interesting
details: https://apache-iceberg.slack.com/archives/C025PH0G1D4/p1698676018510089
github-raphael-douyere opened a new issue, #8953:
URL: https://github.com/apache/iceberg/issues/8953
### Apache Iceberg version
1.3.1
### Query engine
Spark
### Please describe the bug 🐞
While writing data to an Iceberg table using Spark Streaming 3.4.1 / Ic