ZhangYu0123 opened a new issue #4164:
URL: https://github.com/apache/incubator-doris/issues/4164


   **Describe problems**
   At present, Compaction in BE is divided into two parts: Base and Cumulative. Incremental overlapping rowsets are first merged by the Cumulative Compaction process, and the resulting non-overlapping rowsets are then merged by the Base Compaction process. When there are frequent, continuous small-batch writes, a large number of small non-overlapping rowsets flow into the Base Compaction process. This causes the following problems:
   (1) Each Base Compaction run only merges a small batch of file data, so its merge efficiency is very low (merging many small files takes a long time, Base Compaction runs more often, and the write amplification is serious).
   (2) Too many small files exist at the same time, so too many files have to be read during a query, which hurts query efficiency.
    
   **Solution**
   **base compaction process**
   The base compaction process is unchanged: it keeps the existing flow and merges the versions from the Base up to the Cumulative Point.
   
   **cumulative compaction process**
   _**1  Cumulative Point calculation (changing part)**_
   The Cumulative Point is calculated once, at initialization.
   Traverse the rowset collection in version order. As long as the following conditions are satisfied, the Cumulative Point moves down past the rowset; otherwise stop (a sketch follows the list):
   - The current rowset is a non-overlapping type.
   - The rowset's disk size is greater than base_compaction_delta_ratio (2%) of the base version, or greater than or equal to min_base_compaction_promotion_size (2G).
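
   A minimal C++ sketch of this initialization, assuming a simplified rowset struct and illustrative constant and function names (none of them are the real BE identifiers):

```cpp
#include <cstdint>
#include <vector>

// Illustrative constants from this proposal; the names are assumptions, not real config keys.
constexpr int64_t kMinBaseCompactionPromotionSize = 2LL << 30;  // 2G
constexpr double  kBaseCompactionDeltaRatio = 0.02;             // 2% of the base version size

// Simplified stand-in for a rowset's metadata (not the real Doris structure).
struct Rowset {
    int64_t end_version = 0;
    int64_t data_disk_size = 0;  // bytes on disk
    bool overlapping = false;    // freshly loaded rowsets are overlapping
};

// A rowset qualifies to stay below the cumulative point when it is large enough
// relative to the base version, or large enough in absolute terms.
bool reaches_promotion_size(const Rowset& rs, int64_t base_size) {
    return rs.data_disk_size > base_size * kBaseCompactionDeltaRatio ||
           rs.data_disk_size >= kMinBaseCompactionPromotionSize;
}

// Executed once at initialization: walk the rowsets in version order
// (rowsets[0] is the base version) and stop at the first rowset that is still
// overlapping or too small; the cumulative point ends up right after the last
// qualifying rowset.
int64_t calc_cumulative_point(const std::vector<Rowset>& rowsets) {
    if (rowsets.empty()) return 0;
    const int64_t base_size = rowsets[0].data_disk_size;
    int64_t point = rowsets[0].end_version + 1;
    for (size_t i = 1; i < rowsets.size(); ++i) {
        if (rowsets[i].overlapping || !reaches_promotion_size(rowsets[i], base_size)) break;
        point = rowsets[i].end_version + 1;  // move the cumulative point down past this rowset
    }
    return point;
}
```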
   
   2  Select the rowsets to be merged by Cumulative Compaction
   2.1 Build the candidate_rowsets collection (consistent with the original process; a rough sketch follows the list)
   - The candidate set consists of all versions greater than or equal to the Cumulative Point, sorted by version.
   - Collect rowset versions from the candidate set, and stop collecting when the accumulated compaction_score reaches 1000 or when the traversal reaches the end.
   - When a candidate rowset carries a delete condition, only the versions before it are collected, and the Cumulative Point is moved past the deleted rowset.
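
   For reference, a rough sketch of this (unchanged) collection step, with a hypothetical compaction_score field and the delete-predicate handling omitted:

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Minimal stand-in; only the fields this step needs (names are illustrative).
struct Candidate {
    int64_t start_version = 0;
    int64_t compaction_score = 0;  // e.g. the number of segments in the rowset
};

// Take every rowset at or above the cumulative point, sort by version, and keep
// collecting until the accumulated compaction_score reaches 1000 or the list ends.
std::vector<Candidate> collect_candidates(std::vector<Candidate> above_point) {
    std::sort(above_point.begin(), above_point.end(),
              [](const Candidate& a, const Candidate& b) { return a.start_version < b.start_version; });
    std::vector<Candidate> collected;
    int64_t score = 0;
    for (const Candidate& c : above_point) {
        if (score >= 1000) break;  // score limit reached
        collected.push_back(c);
        score += c.compaction_score;
    }
    // The delete-predicate handling (truncate before the delete version and move
    // the cumulative point past it) is left out of this sketch.
    return collected;
}
```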
   
   _**2.2 Create input_rowsets (changing part)**_
   (1) Definitions:
   - s_total: the total disk size of all rowsets in the candidate set candidate_rowsets.
   - s1: the disk size of the first (oldest) version in candidate_rowsets.
   - level_size(n): maps a size n to the disk-size segment it falls into, i.e. [1G~2G), [512m~1G), [256m~512m), [128m~256m), [64m~128m); these segments are derived from min_base_compaction_promotion_size (one possible reading is sketched below).
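
   One plausible reading of level_size in code, assuming it returns the lower bound of the segment a size falls into and that sizes below 64m map to 0:

```cpp
#include <cstdint>

constexpr int64_t kMinBaseCompactionPromotionSize = 2LL << 30;   // 2G
constexpr int64_t kSmallestCompactionLimit        = 64LL << 20;  // 64m

// Map a size to the lower bound of the segment it falls into: [1G~2G) -> 1G,
// [512m~1G) -> 512m, ..., [64m~128m) -> 64m; anything smaller maps to 0.
// The bounds come from repeatedly halving min_base_compaction_promotion_size.
int64_t level_size(int64_t size) {
    for (int64_t level = kMinBaseCompactionPromotionSize / 2;  // 1G, the top level
         level >= kSmallestCompactionLimit; level /= 2) {
        if (size >= level) return level;
    }
    return 0;
}
```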
   
   (2) Implementation process (sketched in code after this list):
   - When s_total is greater than or equal to base_compaction_delta_ratio (2%) of the base version, or greater than or equal to min_base_compaction_promotion_size (2G), or s_total <= smallest_compaction_limit (64m), candidate_rowsets is assigned directly to input_rowsets and the loop exits.
   - When level_size(s_total - s1) >= level_size(s1), candidate_rowsets is assigned directly to input_rowsets and the loop exits.
   - When level_size(s_total - s1) < level_size(s1), remove the first version from candidate_rowsets, go back to the first step, and repeat.
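
   A hedged sketch of this selection loop, operating only on rowset sizes for brevity; the function and constant names are assumptions, and level_size is the mapping sketched above:

```cpp
#include <cstdint>
#include <numeric>
#include <vector>

constexpr int64_t kMinBaseCompactionPromotionSize = 2LL << 30;   // 2G
constexpr int64_t kSmallestCompactionLimit        = 64LL << 20;  // 64m
constexpr double  kBaseCompactionDeltaRatio       = 0.02;        // 2%

// Same level_size mapping as in the previous sketch.
int64_t level_size(int64_t size) {
    for (int64_t level = kMinBaseCompactionPromotionSize / 2;
         level >= kSmallestCompactionLimit; level /= 2) {
        if (size >= level) return level;
    }
    return 0;
}

// candidate_sizes holds the disk sizes of candidate_rowsets in version order.
// Returns the index of the first candidate to keep: input_rowsets is
// candidates[start .. end); if start == candidates.size(), nothing is selected.
size_t pick_input_rowsets(const std::vector<int64_t>& candidate_sizes, int64_t base_size) {
    size_t start = 0;
    while (start < candidate_sizes.size()) {
        const int64_t s_total = std::accumulate(candidate_sizes.begin() + start,
                                                candidate_sizes.end(), int64_t{0});
        const int64_t s1 = candidate_sizes[start];
        // Big enough to be promoted as a whole, or already tiny: take everything left.
        if (s_total >= base_size * kBaseCompactionDeltaRatio ||
            s_total >= kMinBaseCompactionPromotionSize ||
            s_total <= kSmallestCompactionLimit) {
            break;
        }
        // The remaining candidates together reach the first rowset's level: merge them.
        if (level_size(s_total - s1) >= level_size(s1)) {
            break;
        }
        ++start;  // the first rowset is one level ahead of the rest: skip it and retry
    }
    return start;
}
```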
   
   (3) Check input_rowsets (consistent with the original process; a sketch follows the list)
   - When the compaction_score of input_rowsets is less than the minimum score of 5, input_rowsets is cleared and no merge is performed.
   - When input_rowsets is empty, a time-based check is applied as a supplement: if no merge has been performed for more than 1 day and there is an overlapping rowset in candidate_rowsets, candidate_rowsets is assigned directly to input_rowsets and the selection exits.
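
   A small sketch of these checks; the score threshold of 5 and the one-day interval come from the text above, everything else is an illustrative name:

```cpp
#include <cstdint>
#include <vector>

// Minimal stand-in (illustrative names only).
struct Rowset {
    int64_t compaction_score = 0;
    bool overlapping = false;
};

constexpr int64_t kMinCumulativeCompactionScore = 5;
constexpr int64_t kOneDaySeconds = 24 * 3600;

// Apply the two checks above to the chosen input_rowsets.
void check_input_rowsets(std::vector<Rowset>& input_rowsets,
                         const std::vector<Rowset>& candidate_rowsets,
                         int64_t seconds_since_last_merge) {
    int64_t score = 0;
    for (const Rowset& rs : input_rowsets) score += rs.compaction_score;

    // Below the minimum score of 5: clear the selection, no merge this round.
    if (score < kMinCumulativeCompactionScore) input_rowsets.clear();

    // Time-based supplement: nothing selected, no merge for over a day, and an
    // overlapping rowset is still waiting -> merge the whole candidate set.
    if (input_rowsets.empty() && seconds_since_last_merge > kOneDaySeconds) {
        for (const Rowset& rs : candidate_rowsets) {
            if (rs.overlapping) { input_rowsets = candidate_rowsets; break; }
        }
    }
}
```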
   
   3 Execute compaction (consistent with the original process)
   
   _**4 Cumulative Point update  (changing part)**_
   When the disk size of the newly generated rowset is greater than base_compaction_delta_ratio (2%) of the base version rowset, or greater than or equal to min_base_compaction_promotion_size (2G), the Cumulative Point moves forward past it; otherwise it does not move.
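
   As a sketch, this update is the same promotion test applied to the compaction output (illustrative names again):

```cpp
#include <cstdint>

constexpr int64_t kMinBaseCompactionPromotionSize = 2LL << 30;  // 2G
constexpr double  kBaseCompactionDeltaRatio = 0.02;             // 2%

// After a cumulative compaction, advance the cumulative point past the output
// rowset only if the output is large enough to be handed over to base compaction.
int64_t update_cumulative_point(int64_t current_point,
                                int64_t output_end_version,
                                int64_t output_disk_size,
                                int64_t base_disk_size) {
    const bool promoted = output_disk_size > base_disk_size * kBaseCompactionDeltaRatio ||
                          output_disk_size >= kMinBaseCompactionPromotionSize;
    return promoted ? output_end_version + 1 : current_point;
}
```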
   
   
   **Evaluation of the merge effect**
   1. Configuration variable selection
   smallest_compaction_limit: 64m
   min_base_compaction_promotion_size: 2G
   base_compaction_delta_ratio: 2%
   
   2. File number growth trend
   
![image](https://user-images.githubusercontent.com/67053339/88299940-836ad980-cd35-11ea-9a54-96d9998abcad.png)
   
   _**3. Comparison of the two strategies**_
   This covers both base compaction and cumulative compaction. Assume that the file size of each import is d, that a cumulative compaction is performed after every q rounds of data load, and that a base compaction is performed after every k cumulative compactions. The import of size d is executed n times. It is assumed that k is greater than 4.
   
   3.1 Old strategy
   Maximum number of existing files = k + q
   Base Compaction times p = n / (k * q)
   Base IO write data volume = d * q * k * p * (p + 1) / 2 = d * n * (n + k * q) / (2 * k * q)
   Cumulative IO write data volume = d * n
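
   To make these formulas concrete, a small self-check with hypothetical numbers (d = 64m per load, q = 10, k = 10, n = 10000; the values are arbitrary and only for illustration):

```cpp
#include <cstdio>

int main() {
    // Hypothetical workload, chosen only to put numbers into the formulas above.
    const double GiB = 1024.0 * 1024 * 1024;
    const double d = 64 * 1024.0 * 1024;  // 64m written per load
    const double q = 10;                  // loads per cumulative compaction
    const double k = 10;                  // cumulative compactions per base compaction
    const double n = 10000;               // total loads, i.e. 625 GiB of data in total

    const double p = n / (k * q);                        // 100 base compactions
    const double base_io = d * q * k * p * (p + 1) / 2;  // = d*n*(n + k*q) / (2*k*q)
    const double cumu_io = d * n;

    std::printf("max files     = %.0f\n", k + q);             // 20
    std::printf("base IO       = %.1f GiB\n", base_io / GiB); // 31562.5, about 50x amplification
    std::printf("cumulative IO = %.1f GiB\n", cumu_io / GiB); // 625.0
    return 0;
}
```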
   
   3.2 New strategy
   3.2.1 Number of files:
   a) When q*d > 2G:
   Maximum number of existing files = k + 6
   b) When q*d < 2G:
   Maximum number of existing files = k * q * d / 2G + 6
   This avoids the excessive file count caused by many small batches.
   
   3.2.2 Base IO write amplification:
   a) When d*q*k >= 8G:
   Base Compaction times p = n / (k * q)
   Base IO write data volume = d * q * k * p * (p + 1) / 2 = d * n * (n + k * q) / (2 * k * q)
   
   **_In this case, Base Compaction IO is basically the same as with the previous strategy._**
   
   b) When d*q*k < 8G and the data volume is less than 100G:
   Base Compaction times p = log_1.08(n) (the logarithm of n to base 1 + 8%); this no longer depends on k
   Base IO write data volume = d * (1 - n / 1.08) / (1 - 1.08) ≈ 18.51 * n * d - 20 * d
   
   **_In this case, Base Compaction IO is at the O(n) level, which is significantly better than the previous O(n^2) situation. The 8% comes from the base compaction requirement that at least 5 versions (including the base) be merged: with each promoted rowset at least 2% of the base, the base grows by at least 4 * 2% = 8% per base compaction, which is the worst case evaluated here._**
   
   c) When d*q*k < 8G and the data volume is greater than 100G:
   Base Compaction times p = n * d / 8G
   Base IO write data volume = 8G * p * (p + 1) / 2 = d * n * (n * d + 8G) / (2 * 8G)
   
   _**In this case, the Base Compaction IO is basically of the same order as with the previous strategy; each base compaction merges 8G / (d*q*k) times more incremental data and runs correspondingly less often.**_
   
   3.2.3 Cumulative IO write amplification:
   When q*d > 2G:
   Cumulative IO write data volume = n * d
   When q*d < 2G:
   Cumulative IO write data volume = (log(2G / (q*d)) + 1) * n * d
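
   Plugging the same hypothetical numbers into the new-strategy formulas; the log is read as log base 2 here since the level sizes double, which is an assumption on my part:

```cpp
#include <cmath>
#include <cstdio>

int main() {
    // Same hypothetical workload as in the old-strategy example above.
    const double GiB = 1024.0 * 1024 * 1024;
    const double d = 64 * 1024.0 * 1024;   // 64m per load, so q*d = 640m < 2G
    const double q = 10, k = 10, n = 10000;
    const double promo = 2 * GiB;          // min_base_compaction_promotion_size
    const double base_step = 8 * GiB;      // the 8G threshold from 3.2.2

    // 3.2.1 b): maximum number of files.
    const double files = k * q * d / promo + 6;                      // ~9.1 (old strategy: k + q = 20)

    // 3.2.2 c): d*q*k = 6.25G < 8G and total data = 625 GiB > 100G.
    const double p = n * d / base_step;                              // ~78 base compactions
    const double base_io = base_step * p * (p + 1) / 2;              // ~24726.6 GiB (old: ~31562.5 GiB)

    // 3.2.3: q*d < 2G; the log is taken as base 2 because the level sizes double.
    const double cumu_io = (std::log2(promo / (q * d)) + 1) * n * d; // ~1673.8 GiB (old: 625 GiB)

    std::printf("max files     ~= %.1f\n", files);
    std::printf("base IO       ~= %.1f GiB\n", base_io / GiB);
    std::printf("cumulative IO ~= %.1f GiB\n", cumu_io / GiB);
    return 0;
}
```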
   
   3.3 Conclusion:
   **_When q*d < 2G, Cumulative IO pays an extra factor of log(2G / (q*d)) in write amplification and the number of files is reduced by about k - k*q*d / 2G.
   When k*q*d <= 8G, Base IO write amplification is greatly reduced.
   When q*d > 2G, the effect is the same as before._**
   

