Re: [I] [SPJ] Skweded partitions harm merge performances [iceberg]

via GitHub Wed, 18 Dec 2024 14:27:38 -0800


aiss93 commented on issue #11800:
URL: https://github.com/apache/iceberg/issues/11800#issuecomment-2552388682


   I don't know whether this makes sense given Spark/Iceberg internals, but 
consider the following example:
   
   
   <table>
   <tr><th>Table A: huge partition to split</th><th>Table B: partition to replicate</th></tr>
   <tr><td>
   
   | date      | id | value |
   | ----------- | ----------- | ----------- |
   | 10/10/2024      | 1       | a |
   | 10/10/2024   | 2        | b |
   | 10/10/2024   | 3        | c |
   | 10/10/2024   | 4        | d |
   | 10/10/2024   | 5        | e |
   | 10/10/2024   | 6        | f |
   
   </td><td>
   
   | date      | id | value |
   | ----------- | ----------- | ----------- |
   | 10/10/2024      | 7       | x |
   | 10/10/2024   | 8       | y|
   
   </td></tr> </table>
   
   After the partition split, we get the following:
   
   <table>
   <tr><th>Table A: split partition</th><th>Table B: replicated partition</th></tr>
   <tr>
   <td>
   
   | | | |
   | ----------- | ----------- | ----------- |
   | 10/10/2024      | 1       | a |
   | 10/10/2024   | 2       | b|
   </td><td>
   
   | | | |
   | ----------- | ----------- | ----------- |
   | 10/10/2024      | 7       | x |
   | 10/10/2024   | 8       | y|
   </td>
   </tr>
   
   <tr>
   <td>
   
   | | | |
   | ----------- | ----------- | ----------- |
   | 10/10/2024      | 3       | c|
   | 10/10/2024   | 4       | d|
   </td><td>
   
   | | | |
   | ----------- | ----------- | ----------- |
   | 10/10/2024      | 7       | x |
   | 10/10/2024   | 8       | y|
   </td>
   </tr>
   
   
   <tr>
   <td>
   
   | | | |
   | ----------- | ----------- | ----------- |
   | 10/10/2024      | 5     | e |
   | 10/10/2024   | 6       | f|
   </td><td>
   
   | | | |
   | ----------- | ----------- | ----------- |
   | 10/10/2024      | 7       | x |
   | 10/10/2024   | 8       | y|
   </td>
   </tr>
   
   </table>
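
To make the split concrete, here is a minimal plain-Python sketch of the tables above. It only models the example and says nothing about how Spark or Iceberg would actually plan the join; `split_and_replicate` is a hypothetical helper name, and the contiguous chunking is an assumption.

```python
# Hypothetical model of the example tables; not Spark/Iceberg code.

def split_and_replicate(target_rows, source_rows, num_splits):
    """Cut the large partition into contiguous chunks and pair each
    chunk with a full copy of the small partition."""
    chunk_size = -(-len(target_rows) // num_splits)  # ceiling division
    chunks = [
        target_rows[i:i + chunk_size]
        for i in range(0, len(target_rows), chunk_size)
    ]
    return [(chunk, list(source_rows)) for chunk in chunks]


# Table A: the huge partition (ids 1..6). Table B: the small one (ids 7, 8).
table_a = [("10/10/2024", i, v) for i, v in zip(range(1, 7), "abcdef")]
table_b = [("10/10/2024", 7, "x"), ("10/10/2024", 8, "y")]

tasks = split_and_replicate(table_a, table_b, num_splits=3)
for a_chunk, b_copy in tasks:
    # Show only the ids handled by each task.
    print([row[1] for row in a_chunk], [row[1] for row in b_copy])
```

Each of the three tasks sees two rows of table A and every row of table B, which is the layout shown in the second table.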
   
   
   With a `when not matched then insert *` clause, as you explained above, each 
replicated copy of the table B partition performs the insert, so we end up with 
duplicates. The idea I was suggesting is to:
   - Compute an aggregated boolean that tells whether the `not matched` check is 
true in all of these replicated partitions.
   - If this flag is true, assign only one replicated partition to execute the 
insert statement.
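
The two steps above can be sketched in plain Python. This is only an illustration of the idea on the example data, under the assumption that rows are matched on `id`. The function names `naive_inserts` and `flagged_inserts` are hypothetical, and nothing here reflects existing Spark or Iceberg behavior.

```python
# Hypothetical simulation of the proposal; not Spark/Iceberg code.
DATE = "10/10/2024"

# Three tasks: one chunk of table A each, plus a full copy of table B.
table_b = [(DATE, 7, "x"), (DATE, 8, "y")]
tasks = [
    ([(DATE, 1, "a"), (DATE, 2, "b")], list(table_b)),
    ([(DATE, 3, "c"), (DATE, 4, "d")], list(table_b)),
    ([(DATE, 5, "e"), (DATE, 6, "f")], list(table_b)),
]


def naive_inserts(tasks):
    """Every task inserts the source rows it could not match locally."""
    inserted = []
    for target_chunk, source_copy in tasks:
        target_ids = {row_id for _, row_id, _ in target_chunk}
        inserted += [row for row in source_copy if row[1] not in target_ids]
    return inserted


def flagged_inserts(tasks):
    """Insert a source row once, and only if it is unmatched in every task."""
    if not tasks:
        return []
    chunk_ids = [{row_id for _, row_id, _ in chunk} for chunk, _ in tasks]
    source_rows = tasks[0][1]  # every task holds the same replicated copy
    result = []
    for row in source_rows:
        # Aggregated boolean: the "not matched" check holds in all replicas.
        not_matched_everywhere = all(row[1] not in ids for ids in chunk_ids)
        if not_matched_everywhere:
            result.append(row)  # a single designated task does the insert
    return result


naive = naive_inserts(tasks)
deduplicated = flagged_inserts(tasks)
print(len(naive), len(deduplicated))
```

On this data the naive version inserts each table B row three times, while the flagged version inserts each row once.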


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org
