Re: [I] [SPJ] Skweded partitions harm merge performances [iceberg]

2025-01-05 Thread via GitHub
aiss93 commented on issue #11800: URL: https://github.com/apache/iceberg/issues/11800#issuecomment-2571751817 Thank you for your reply @szehon-ho. I actually get your point regarding the example 2). The idea I was suggesting is to centralize the vision of each replicated task: For e

Re: [I] [SPJ] Skweded partitions harm merge performances [iceberg]

2024-12-28 Thread via GitHub
szehon-ho commented on issue #11800: URL: https://github.com/apache/iceberg/issues/11800#issuecomment-2564342146 I think there are two 'not matched' cases here: 1) entries in A not matched in B, 2) entries in B not matched in A. Case (1) is doable. For case (2), it is harder bec

Re: [I] [SPJ] Skweded partitions harm merge performances [iceberg]

2024-12-20 Thread via GitHub
szehon-ho commented on issue #11800: URL: https://github.com/apache/iceberg/issues/11800#issuecomment-2557495242 Hm, but for the "not matched" check, you need to check each replicated partition against all the split partitions, so it defeats the point of splitting, I think.

Re: [I] [SPJ] Skweded partitions harm merge performances [iceberg]

2024-12-18 Thread via GitHub
aiss93 commented on issue #11800: URL: https://github.com/apache/iceberg/issues/11800#issuecomment-2552388682 I don't know if it makes sense regarding Spark/Iceberg internals. If we consider the following example Table A huge partition to split Table B partition to replicate

Re: [I] [SPJ] Skweded partitions harm merge performances [iceberg]

2024-12-18 Thread via GitHub
szehon-ho commented on issue #11800: URL: https://github.com/apache/iceberg/issues/11800#issuecomment-2551982279 Hm, not sure I get it, can you give an example? SPJ happens at the planning stage, where we have to decide which partitions from each side to put together.

Re: [I] [SPJ] Skweded partitions harm merge performances [iceberg]

2024-12-18 Thread via GitHub
aiss93 commented on issue #11800: URL: https://github.com/apache/iceberg/issues/11800#issuecomment-2551600689 Yeah I see the problem now. Thanks. I'm not an iceberg/spark internals expert, but wouldn't it be possible to compute a flag per partition to know if there was at least one match

Re: [I] [SPJ] Skweded partitions harm merge performances [iceberg]

2024-12-18 Thread via GitHub
szehon-ho commented on issue #11800: URL: https://github.com/apache/iceberg/issues/11800#issuecomment-2550666384 Yeah, I'm not sure how to solve the problem. What do you mean by salting? To explain the current problem: if you have two sides [A] and [B], the algorithm splits A and duplicates B

Re: [I] [SPJ] Skweded partitions harm merge performances [iceberg]

2024-12-18 Thread via GitHub
aiss93 commented on issue #11800: URL: https://github.com/apache/iceberg/issues/11800#issuecomment-2550630071 Thank you for your fast reply @szehon-ho. Are there any plans to make it work for full outer joins as well? I'd like to work on this issue if possible. What do you suggest

Re: [I] [SPJ] Skweded partitions harm merge performances [iceberg]

2024-12-17 Thread via GitHub
szehon-ho commented on issue #11800: URL: https://github.com/apache/iceberg/issues/11800#issuecomment-2550367687 Yes, unfortunately that optimization is a bit limited: it splits the big side and replicates the small side, so it is only correct for inner joins. I think in this case, you hav
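
To illustrate the point above (splitting the big side and replicating the small side is only safe for inner joins), here is a minimal plain-Python sketch. It is a toy model, not the Spark or Iceberg planner: the helper names (`inner_join`, `full_outer_join`, `split_and_replicate`) and the sample rows are hypothetical and exist only for this illustration.

```python
from collections import Counter

# Rows are (key, value) tuples. All names here are hypothetical helpers,
# not Spark or Iceberg APIs.

def inner_join(left, right):
    """Emit (key, left_value, right_value) for every matching pair."""
    return [(k, lv, rv) for k, lv in left for rk, rv in right if k == rk]

def full_outer_join(left, right):
    """Inner join plus unmatched rows from both sides, padded with None."""
    left_keys = {k for k, _ in left}
    right_keys = {k for k, _ in right}
    rows = inner_join(left, right)
    rows += [(k, lv, None) for k, lv in left if k not in right_keys]
    rows += [(k, None, rv) for k, rv in right if k not in left_keys]
    return rows

def split_and_replicate(join, big, small, n_splits):
    """Split the big side into chunks and join each chunk with a full copy
    of the small side, as a toy model of skew handling."""
    out = []
    for i in range(n_splits):
        out.extend(join(big[i::n_splits], small))
    return out

big = [(1, "a1"), (2, "a2"), (2, "a3"), (2, "a4")]
small = [(1, "b1"), (3, "b2")]

inner_whole = inner_join(big, small)
inner_split = split_and_replicate(inner_join, big, small, 2)

outer_whole = full_outer_join(big, small)
outer_split = split_and_replicate(full_outer_join, big, small, 2)

# Inner join: the split result is the same multiset of rows.
print(Counter(inner_split) == Counter(inner_whole))
# Full outer join: each chunk sees its own copy of the small side, so
# unmatched small-side rows are emitted once per chunk, and a row that
# matches in one chunk can be reported as unmatched in another.
print(sorted(Counter(outer_split) - Counter(outer_whole), key=repr))
```

In this toy model each chunk decides "not matched" using only its own slice of the big side, which is why the replicated side produces extra rows under a full outer join.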

Re: [I] [SPJ] Skweded partitions harm merge performances [iceberg]

2024-12-17 Thread via GitHub
aiss93 commented on issue #11800: URL: https://github.com/apache/iceberg/issues/11800#issuecomment-2548591198 @szehon-ho I saw the video you made on this topic during the Iceberg Summit. Do you see anything missing in the configuration? Thank you for your help.

[I] [SPJ] Skweded partitions harm merge performances [iceberg]

2024-12-17 Thread via GitHub
aiss93 opened a new issue, #11800: URL: https://github.com/apache/iceberg/issues/11800 ### Query engine - I'm using AWS Glue interactive session with glue version 5.0. - Spark 3.5.2 - iceberg 1.6.1 ### Question Hi I have two s3 data sources, full_time_series