[PR] Core, Spark 3.5: Support file and partition delete granularity [iceberg]

via GitHub Tue, 26 Dec 2023 12:35:22 -0800


aokolnychyi opened a new pull request, #9384:
URL: https://github.com/apache/iceberg/pull/9384


   This PR adds support for file and partition delete granularity, allowing 
users to pick between the two.
   
   Under partition granularity, delete writers are allowed to group deletes for 
different data files into one delete file. This strategy tends to reduce the 
total number of delete files in the table. However, it may cause job planning 
to assign irrelevant deletes to some data files. Readers filter out all 
irrelevant deletes, but doing so adds overhead, which can be mitigated via 
caching.
   
   Under file granularity, delete writers always organize deletes by their 
target data file, creating a separate delete file for each referenced data file. 
This strategy ensures that job planning does not assign irrelevant deletes to 
data files. However, it also increases the total number of delete files in the 
table and may require more aggressive delete file compaction.
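
   The difference between the two granularities can be illustrated with a 
small, self-contained sketch. It is not the code from this PR and does not use 
the Iceberg API: `PositionDelete`, `Granularity`, `groupIntoDeleteFiles`, and 
`irrelevantDeletes` are hypothetical names invented for illustration, and each 
map entry merely stands in for one delete file.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class DeleteGranularitySketch {

  // Hypothetical, simplified position delete: (partition, data file, row position).
  static final class PositionDelete {
    final String partition;
    final String dataFile;
    final long position;

    PositionDelete(String partition, String dataFile, long position) {
      this.partition = partition;
      this.dataFile = dataFile;
      this.position = position;
    }
  }

  enum Granularity { FILE, PARTITION }

  // Groups deletes into "delete files"; each map entry models one delete file.
  // PARTITION: one group per partition, possibly spanning several data files.
  // FILE: one group per referenced data file.
  static Map<String, List<PositionDelete>> groupIntoDeleteFiles(
      List<PositionDelete> deletes, Granularity granularity) {
    Map<String, List<PositionDelete>> deleteFiles = new LinkedHashMap<>();
    for (PositionDelete delete : deletes) {
      String key = granularity == Granularity.FILE
          ? delete.partition + "/" + delete.dataFile
          : delete.partition;
      deleteFiles.computeIfAbsent(key, k -> new ArrayList<>()).add(delete);
    }
    return deleteFiles;
  }

  // Counts the deletes in one delete file that do not target the given data
  // file, i.e. the entries a reader of that data file would have to filter out.
  static long irrelevantDeletes(List<PositionDelete> deleteFile, String dataFile) {
    return deleteFile.stream().filter(d -> !d.dataFile.equals(dataFile)).count();
  }

  public static void main(String[] args) {
    List<PositionDelete> deletes = List.of(
        new PositionDelete("dt=2023-12-26", "data-a.parquet", 0L),
        new PositionDelete("dt=2023-12-26", "data-a.parquet", 7L),
        new PositionDelete("dt=2023-12-26", "data-b.parquet", 3L));

    Map<String, List<PositionDelete>> byPartition =
        groupIntoDeleteFiles(deletes, Granularity.PARTITION);
    Map<String, List<PositionDelete>> byFile =
        groupIntoDeleteFiles(deletes, Granularity.FILE);

    System.out.println("partition granularity: " + byPartition.size() + " delete file(s)");
    System.out.println("file granularity: " + byFile.size() + " delete file(s)");
    System.out.println("irrelevant deletes seen by data-a.parquet (partition): "
        + irrelevantDeletes(byPartition.get("dt=2023-12-26"), "data-a.parquet"));
  }
}
```

   In this toy input, partition granularity yields one delete file that a 
reader of `data-a.parquet` must filter (one entry belongs to `data-b.parquet`), 
while file granularity yields two delete files with no irrelevant entries.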
   
   Currently, this configuration is only applicable to position deletes.
   
   Each granularity has its own benefits and drawbacks and should be picked 
based on the use case. Regardless of the chosen granularity, regular delete 
compaction remains necessary. It is also possible to use one granularity for 
ingestion and another one for table maintenance.
   
   **After**
   
   ```
   Benchmark                                                                                         Mode  Cnt  Score   Error  Units
   ParquetWritersBenchmark.writeUnpartitionedClusteredPositionDeleteWriterFileGranularity              ss    5  2.751 ± 0.097   s/op
   ParquetWritersBenchmark.writeUnpartitionedClusteredPositionDeleteWriterPartitionGranularity         ss    5  2.329 ± 0.114   s/op
   ParquetWritersBenchmark.writeUnpartitionedFanoutPositionDeleteWriterFileGranularity                 ss    5  3.602 ± 0.085   s/op
   ParquetWritersBenchmark.writeUnpartitionedFanoutPositionDeleteWriterPartitionGranularity            ss    5  3.098 ± 0.110   s/op
   ParquetWritersBenchmark.writeUnpartitionedFanoutPositionDeleteWriterShuffledFileGranularity         ss    5  3.561 ± 0.108   s/op
   ParquetWritersBenchmark.writeUnpartitionedFanoutPositionDeleteWriterShuffledPartitionGranularity    ss    5  3.587 ± 0.142   s/op
   ```
   
   **Before**
   
   ```
   Benchmark                                                                     Mode  Cnt  Score   Error  Units
   ParquetWritersBenchmark.writeUnpartitionedClusteredPositionDeleteWriter         ss    5  2.279 ± 0.107   s/op
   ParquetWritersBenchmark.writeUnpartitionedFanoutPositionDeleteWriter            ss    5  3.052 ± 0.075   s/op
   ParquetWritersBenchmark.writeUnpartitionedFanoutPositionDeleteWriterShuffled    ss    5  3.645 ± 0.081   s/op
   ```


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

