ZhangYu0123 opened a new issue #4164: URL: https://github.com/apache/incubator-doris/issues/4164
**Describe problems**

At present, Compaction in BE is divided into two parts: Base and Cumulative. Newly imported overlapping rowsets are first merged by the Cumulative Compaction process, and the resulting non-overlapping rowsets are then merged by the Base Compaction process. When there are frequent, continuous small-batch writes, a large number of small non-overlapping rowsets are handed to the Base Compaction process. This causes the following problems:
(1) In Base Compaction, each merge only adds a small batch of file data, so the merge efficiency is very low: merging many small files takes a long time, Base Compaction runs more often, and write amplification is serious.
(2) Too many small files exist at the same time, so a query has to read too many files, which hurts query efficiency.

**Solution**

**base compaction process**
The base compaction process is unchanged; the existing process is kept, and the versions from the Base version up to the Cumulative Point are merged.

**cumulative compaction process**

_**1 Cumulative Point calculation (changed part)**_
The Cumulative Point is calculated once at initialization. Traverse the rowset collection in version order; if the following conditions are satisfied, the Cumulative Point moves down, otherwise the traversal stops:
- The current rowset is a non-overlapping type.
- The rowset's disk size is greater than base_compaction_delta_ratio (2%) of the base version, or greater than or equal to min_base_compaction_promotion_size (2G).

**2 Select the rowsets to be merged by Cumulative Compaction**

_2.1 Build the candidate set candidate_rowsets (consistent with the original process)_
- Select all versions greater than or equal to the Cumulative Point as candidates, and sort them.
- Collect rowset versions from the candidate set; stop collecting when the compaction_score is greater than or equal to 1000 or the traversal reaches the end.
- When candidate_rowsets contains a rowset with a delete condition, only the versions before it are collected, and the Cumulative Point is moved past the delete rowset.

_**2.2 Build input_rowsets (changed part)**_
(1) Definitions:
- s_total is the total disk size of the candidate set candidate_rowsets.
- s1 is the disk size of the first version in candidate_rowsets.
- level_size(n) maps a size n onto one of the segments [1G, 2G), [512m, 1G), [256m, 512m), [128m, 256m), [64m, 128m); these segments are derived from min_base_compaction_promotion_size.

(2) Selection loop (see the sketch after this section):
- When s_total is greater than or equal to base_compaction_delta_ratio (2%) of the base version, or greater than or equal to min_base_compaction_promotion_size (2G), or s_total <= smallest_compaction_limit (64m), candidate_rowsets is assigned to input_rowsets directly and the loop exits.
- When level_size(s_total - s1) >= level_size(s1), candidate_rowsets is assigned to input_rowsets directly and the loop exits.
- When level_size(s_total - s1) < level_size(s1), remove the first version from candidate_rowsets and repeat from the first step.

(3) Check input_rowsets (consistent with the original process):
- When the compaction_score of input_rowsets is less than the minimum score of 5, input_rowsets is cleared and no merge is performed.
- When input_rowsets is empty, an additional time-based check is applied: if no merge has been performed for more than 1 day and candidate_rowsets contains an overlapping rowset, candidate_rowsets is assigned to input_rowsets directly and the selection exits.
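To make the selection loop in 2.2 easier to follow, here is a minimal C++ sketch of the level mapping and the drop-the-first-version loop. The names (`level_size`, `pick_input_rowsets`) and constants are illustrative only, not the actual BE implementation, and the score/time checks from step (3) are omitted.

```cpp
#include <algorithm>
#include <cstdint>
#include <numeric>
#include <vector>

// Configuration values named in the proposal (sizes in bytes).
constexpr int64_t kMinBasePromotionSize    = 2LL * 1024 * 1024 * 1024; // 2G
constexpr int64_t kSmallestCompactionLimit = 64LL * 1024 * 1024;       // 64m
constexpr double  kBaseCompactionDeltaRatio = 0.02;                    // 2%

// level_size(n): map a size onto the lower bound of its segment
// [1G, 2G), [512m, 1G), [256m, 512m), [128m, 256m), [64m, 128m),
// derived by repeatedly halving min_base_compaction_promotion_size.
int64_t level_size(int64_t size) {
    int64_t level = kMinBasePromotionSize / 2; // start at 1G
    while (level >= kSmallestCompactionLimit) {
        if (size >= level) {
            return level;
        }
        level /= 2;
    }
    return 0;
}

// Selection loop from 2.2: `sizes` holds the disk sizes of the sorted
// candidate rowsets, `base_size` is the size of the base version.
// Returns the sizes of the rowsets chosen as input_rowsets.
std::vector<int64_t> pick_input_rowsets(std::vector<int64_t> sizes,
                                        int64_t base_size) {
    // Promotion threshold: 2% of the base version or 2G, whichever is
    // reached first ("either condition" in the text above).
    const int64_t promotion = std::min<int64_t>(
        kMinBasePromotionSize,
        static_cast<int64_t>(base_size * kBaseCompactionDeltaRatio));

    while (!sizes.empty()) {
        const int64_t s_total =
            std::accumulate(sizes.begin(), sizes.end(), int64_t{0});
        const int64_t s1 = sizes.front();

        // Big enough to be promoted past the Cumulative Point, or already
        // below the smallest compaction limit: take the whole candidate set.
        if (s_total >= promotion || s_total <= kSmallestCompactionLimit) {
            return sizes;
        }
        // The remaining candidates together reach the level of the first
        // one, so merging them is worthwhile: take the whole candidate set.
        if (level_size(s_total - s1) >= level_size(s1)) {
            return sizes;
        }
        // Otherwise drop the first (largest) candidate and retry.
        sizes.erase(sizes.begin());
    }
    return {}; // nothing qualified; skip cumulative compaction this round
}
```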
**3 Execute compaction (consistent with the original process)**

_**4 Cumulative Point update (changed part)**_
When the disk size of the newly generated rowset is greater than base_compaction_delta_ratio (2%) of the base version rowset, or greater than or equal to min_base_compaction_promotion_size (2G), the Cumulative Point moves forward; otherwise it does not move.

**Evaluation of the merge effect**

1. Configuration values
- smallest_compaction_limit: 64m
- min_base_compaction_promotion_size: 2G
- base_compaction_delta_ratio: 2%

2. File number growth trend

_**3. Comparison of the two strategies**_
This covers both base compaction and cumulative compaction. Assume that each import writes a file of size d, a cumulative compaction is performed after every q loads, and a base compaction is performed after every k cumulative compactions. The import of size d is executed n times in total, and k is assumed to be greater than 4.

3.1 Old strategy (a worked numeric example is given at the end of this issue)
- Maximum number of existing files = k + q
- Number of Base Compactions: p = n / (k * q)
- Base IO write volume = d * q * k * p * (p + 1) / 2 = d * n * (n + k * q) / (2 * k * q)
- Cumulative IO write volume = d * n

3.2 New strategy

3.2.1 Number of files:
a) When q * d > 2G: maximum number of existing files = k + 6
b) When q * d < 2G: maximum number of existing files = k * q * d / 2G + 6
This avoids the explosion of files caused by many small batches.

3.2.2 Base IO write amplification:
a) When d * q * k >= 8G:
- Number of Base Compactions: p = n / (k * q)
- Base IO write volume = d * q * k * p * (p + 1) / 2 = d * n * (n + k * q) / (2 * k * q)
**_In this case, Base Compaction IO is basically the same as with the previous strategy._**
b) When d * q * k < 8G and the data volume is less than 100G:
- Number of Base Compactions: p = log_1.08(n), which no longer depends on k
- Base IO write volume = d * (1 - n / 1.08) / (1 - 1.08) = 18.51 * n * d - 20 * d
**_In this case, Base Compaction IO is at the O(n) level, which is significantly better than the previous O(n^2) behavior. The 8% comes from the base compaction trigger that merges once more than 5 versions (including the base) have accumulated; the worst case is evaluated here._**
c) When d * q * k < 8G and the data volume is greater than 100G:
- Number of Base Compactions: p = n * d / 8G
- Base IO write volume = 8G * p * (p + 1) / 2 = d * n * (d * n + 8G) / (16G)
_**In this case, Base Compaction IO is of the same order as with the previous strategy; it increases by a factor of about 8G / (d * q * k).**_

3.2.3 Cumulative IO write amplification:
- When q * d > 2G: Cumulative IO write volume = n * d
- When q * d < 2G: Cumulative IO write volume = (log(2G / (q * d)) + 1) * n * d

3.3 Conclusion
**_When q * d < 2G, Cumulative IO pays an extra log(2G / (q * d)) write amplification, and the number of files is reduced by k - k * q * d / 2G. When k * q * d <= 8G, Base IO write amplification is greatly reduced. When q * d > 2G, the behavior is the same as before._**
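As a sanity check on the old-strategy formulas in 3.1, the following standalone C++ snippet plugs in illustrative parameter values (d, q, k, and n below are assumptions, not numbers taken from the evaluation) and prints how Base Compaction write amplification grows linearly in p = n / (k * q), i.e. quadratically in n in total IO, which is exactly what the new per-level strategy avoids when q * d < 2G.

```cpp
#include <cstdio>
#include <initializer_list>

int main() {
    // Illustrative parameters (not from the issue): 64 MB written per load,
    // a cumulative compaction every q = 10 loads, a base compaction every
    // k = 10 cumulative compactions.
    const double d = 64.0; // MB per load
    const double q = 10.0;
    const double k = 10.0;

    for (double n : {1000.0, 10000.0, 100000.0}) { // total number of loads
        const double p = n / (k * q);                        // base compactions
        const double base_io = d * q * k * p * (p + 1) / 2;  // = d*n*(n+k*q)/(2*k*q)
        const double cumu_io = d * n;                        // cumulative IO
        std::printf("n=%8.0f  base IO=%14.0f MB  cumulative IO=%12.0f MB  "
                    "base write amplification=%.1fx\n",
                    n, base_io, cumu_io, base_io / (d * n));
    }
    return 0;
}
```

With these assumed numbers, the base write amplification factor (n + k * q) / (2 * k * q) grows from about 5.5x at n = 1000 to about 500x at n = 100000, illustrating the O(n^2) total IO of the old strategy described in 3.1.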