aiss93 commented on issue #11800:
URL: https://github.com/apache/iceberg/issues/11800#issuecomment-2571751817
Thank you for your reply @szehon-ho
I actually get your point regarding the example 2)
The idea I was suggesting is to centralize the vision of each replicated
task : For e
szehon-ho commented on issue #11800:
URL: https://github.com/apache/iceberg/issues/11800#issuecomment-2564342146
I think there are two 'not matched' cases here:
1) entries in A not matched in B
2) entries in B not matched in A.
Case (1) is doable.
For case (2), it is harder bec
szehon-ho commented on issue #11800:
URL: https://github.com/apache/iceberg/issues/11800#issuecomment-2557495242
Hm, but for the "not matched" check, you need to check each replicated partition
against all the split partitions, so I think it defeats the point of
splitting.
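
To make the "not matched" problem concrete, here is a minimal pure-Python sketch. It is not Spark or Iceberg code: the helper names (`inner_join`, `full_outer_join`), the sample rows, and the way A is cut into two splits are all assumptions made up for illustration. It only shows that joining each split of A against a full copy of B gives the right answer for an inner join, but emits spurious "unmatched B" rows for a full outer join.

```python
from collections import Counter

def inner_join(left, right):
    """Inner join of two lists of (key, value) rows on key."""
    return [(lk, lv, rv) for lk, lv in left for rk, rv in right if lk == rk]

def full_outer_join(left, right):
    """Naive full outer join of two lists of (key, value) rows on key."""
    out = []
    matched_right = set()
    for lk, lv in left:
        hit = False
        for i, (rk, rv) in enumerate(right):
            if lk == rk:
                out.append((lk, lv, rv))
                matched_right.add(i)
                hit = True
        if not hit:
            out.append((lk, lv, None))      # left row with no match
    for i, (rk, rv) in enumerate(right):
        if i not in matched_right:
            out.append((rk, None, rv))      # right row with no match
    return out

# Hypothetical data: A is the big side, B is the small side.
a = [(1, "a1"), (2, "a2"), (3, "a3"), (4, "a4")]
b = [(1, "b1"), (4, "b4"), (9, "b9")]

# Split A into two tasks; every task sees a full copy of B (replication).
a_splits = [a[:2], a[2:]]

# Inner join: the union of the per-split results equals the global result.
inner_split = [row for s in a_splits for row in inner_join(s, b)]
assert Counter(inner_split) == Counter(inner_join(a, b))

# Full outer join: each task decides "unmatched in B" on its own,
# without seeing the other splits of A, so it over-reports.
outer_expected = full_outer_join(a, b)
outer_split = [row for s in a_splits for row in full_outer_join(s, b)]
extra = Counter(outer_split) - Counter(outer_expected)
print(sorted(extra.items(), key=lambda kv: kv[0][0]))
```

In this sketch the extra rows are B rows that matched only in the other split (keys 1 and 4) plus a duplicate of the B row that matches nowhere (key 9). Deciding correctly whether a B row is unmatched needs information from every split of A, which is the cross-split check described above.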
--
This is an aut
aiss93 commented on issue #11800:
URL: https://github.com/apache/iceberg/issues/11800#issuecomment-2552388682
I don't know if it makes sense regarding Spark/Iceberg internals. If we
consider the following example
Table A huge partition to split Table B partition to
replicate
szehon-ho commented on issue #11800:
URL: https://github.com/apache/iceberg/issues/11800#issuecomment-2551982279
Hm, not sure I get it. Can you give an example? SPJ happens at the planning
stage, so we have to decide which partitions from each side to put together.
aiss93 commented on issue #11800:
URL: https://github.com/apache/iceberg/issues/11800#issuecomment-2551600689
Yeah, I see the problem now. Thanks.
I'm not an Iceberg/Spark internals expert, but wouldn't it be possible to
compute a flag per partition to know if there was at least one match
szehon-ho commented on issue #11800:
URL: https://github.com/apache/iceberg/issues/11800#issuecomment-2550666384
Yeah, I'm not sure how to solve the problem. What do you mean by salting?
To explain the current problem: if you have two sides [A] and [B], the
algorithm splits A and duplicates B
aiss93 commented on issue #11800:
URL: https://github.com/apache/iceberg/issues/11800#issuecomment-2550630071
Thank you for your fast reply @szehon-ho.
Are there any plans to make it work for full outer joins as well? I'd like
to work on this issue if possible.
What do you suggest
szehon-ho commented on issue #11800:
URL: https://github.com/apache/iceberg/issues/11800#issuecomment-2550367687
Yes, unfortunately that optimization is a bit limited: it splits the big side
and replicates the small side, so it is only correct for inner joins. I think
in this case, you hav
aiss93 commented on issue #11800:
URL: https://github.com/apache/iceberg/issues/11800#issuecomment-2548591198
@szehon-ho I saw that video you made on this topic during the Iceberg
Summit. Do you see anything missing in the configuration?
Thank you for your help.
aiss93 opened a new issue, #11800:
URL: https://github.com/apache/iceberg/issues/11800
### Query engine
- I'm using an AWS Glue interactive session with Glue version 5.0.
- Spark 3.5.2
- Iceberg 1.6.1
### Question
Hi
I have two s3 data sources, full_time_series