aiss93 commented on issue #11800: URL: https://github.com/apache/iceberg/issues/11800#issuecomment-2552388682
I don't know if this makes sense given the Spark/Iceberg internals. Consider the following example:

<table>
<tr><th>Table A: huge partition to split</th><th>Table B: partition to replicate</th></tr>
<tr><td>

| date | id | value |
| ----------- | ----------- | ----------- |
| 10/10/2024 | 1 | a |
| 10/10/2024 | 2 | b |
| 10/10/2024 | 3 | c |
| 10/10/2024 | 4 | d |
| 10/10/2024 | 5 | e |
| 10/10/2024 | 6 | f |

</td><td>

| date | id | value |
| ----------- | ----------- | ----------- |
| 10/10/2024 | 7 | x |
| 10/10/2024 | 8 | y |

</td></tr>
</table>

After the partition split we get the following:

<table>
<tr><th>Table A: split partition</th><th>Table B: replicated partition</th></tr>
<tr>
<td>

| | | |
| ----------- | ----------- | ----------- |
| 10/10/2024 | 1 | a |
| 10/10/2024 | 2 | b |

</td><td>

| | | |
| ----------- | ----------- | ----------- |
| 10/10/2024 | 7 | x |
| 10/10/2024 | 8 | y |

</td>
</tr>
<tr>
<td>

| | | |
| ----------- | ----------- | ----------- |
| 10/10/2024 | 3 | c |
| 10/10/2024 | 4 | d |

</td><td>

| | | |
| ----------- | ----------- | ----------- |
| 10/10/2024 | 7 | x |
| 10/10/2024 | 8 | y |

</td>
</tr>
<tr>
<td>

| | | |
| ----------- | ----------- | ----------- |
| 10/10/2024 | 5 | e |
| 10/10/2024 | 6 | f |

</td><td>

| | | |
| ----------- | ----------- | ----------- |
| 10/10/2024 | 7 | x |
| 10/10/2024 | 8 | y |

</td>
</tr>
</table>

If the merge has a `WHEN NOT MATCHED THEN INSERT *` clause then, as you explained above, every replicated copy of the table B partition will run the insert, and we end up with duplicates: here rows 7 and 8 would be inserted three times, once per split.

The idea I was suggesting consists of:
- Computing an aggregated boolean that tells whether the `not matched` check is true for all of these replicated partitions.
- If this flag is true, assigning only one replicated partition to execute the insert statement.
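To make the suggestion concrete, here is a small plain-Python simulation of the example above. It is only a sketch and not how Spark or Iceberg implement anything: the merge condition (equality on `date` and `id`), the per-row granularity of the flag, and all function names (`split_partition`, `naive_inserts`, `aggregated_inserts`) are assumptions made for illustration.

```python
from typing import NamedTuple


class Row(NamedTuple):
    date: str
    id: int
    value: str


def split_partition(rows: list[Row], size: int) -> list[list[Row]]:
    """Cut one large partition into consecutive chunks of at most `size` rows."""
    return [rows[i:i + size] for i in range(0, len(rows), size)]


def is_matched(src: Row, chunk: list[Row]) -> bool:
    # Assumed merge condition: equality on (date, id).
    return any(t.date == src.date and t.id == src.id for t in chunk)


def naive_inserts(splits: list[list[Row]], source: list[Row]) -> list[Row]:
    """Every replica inserts the rows it sees as unmatched: duplicates appear."""
    return [src for chunk in splits for src in source
            if not is_matched(src, chunk)]


def aggregated_inserts(splits: list[list[Row]], source: list[Row]) -> list[Row]:
    """Insert a source row once, and only if no split matched it."""
    out = []
    for src in source:
        # Aggregated flag: AND of the per-split "not matched" checks.
        not_matched_everywhere = all(not is_matched(src, chunk)
                                     for chunk in splits)
        if not_matched_everywhere:
            # Emitted a single time, as if one designated replica did the insert.
            out.append(src)
    return out


table_a = [Row("10/10/2024", i, v) for i, v in enumerate("abcdef", start=1)]
table_b = [Row("10/10/2024", 7, "x"), Row("10/10/2024", 8, "y")]
splits = split_partition(table_a, 2)  # three splits of two rows each

print(len(naive_inserts(splits, table_b)))       # 6: each row inserted 3 times
print(len(aggregated_inserts(splits, table_b)))  # 2: each row inserted once
```

The flag also covers the case where a source row matches in only one split: the other splits see it as unmatched locally, but the aggregated check is false, so nothing is inserted for that row.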