With the widespread use of the new load framework, the existing compaction 
strategies no longer work in some scenarios. This document focuses on the 
problems that the new load framework brings to the compaction logic and how to 
improve it.


## Problem


In the new load framework, the loaded data forms a series of `Memtables` in 
memory. When the size of a memtable reaches the threshold (default is 100MB), 
it is written to disk as a `Segment`. Each load batch corresponds to a 
`Version`. When a load batch is relatively large, or a row of a table is 
large, a single load may generate thousands of segments.


In the compaction logic, at least one version is selected for each compaction. 
Compaction is an external sort that opens a `RowBlock` for each segment, with 
1024 rows per RowBlock. So a RowBlock occupies roughly (1024 * row size) bytes 
of memory.


Assuming that a Compaction has 1000 Segments and each row is 4KB in size, the 
RowBlocks alone take up about 4GB of memory. When multiple Compactions are 
running at the same time, the system may run out of memory (OOM).
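
As a rough illustration of the arithmetic above, the following sketch (not the 
actual Doris code; the helper name is hypothetical) shows how the RowBlock 
footprint of one compaction can be computed:

```cpp
#include <cstdint>
#include <cstdio>

// Hypothetical helper: memory needed by the RowBlocks of one compaction.
// One RowBlock is opened per segment; each holds `rows_per_rowblock` rows.
int64_t estimate_rowblock_memory(int64_t num_segments,
                                 int64_t rows_per_rowblock,
                                 int64_t row_size_bytes) {
    return num_segments * rows_per_rowblock * row_size_bytes;
}

int main() {
    // The example from the text: 1000 segments, 1024 rows per RowBlock, 4KB rows.
    int64_t bytes = estimate_rowblock_memory(1000, 1024, 4 * 1024);
    std::printf("estimated memory: %.2f GB\n", bytes / (1024.0 * 1024 * 1024));
    // Prints roughly 3.91 GB, i.e. the ~4GB figure quoted above.
    return 0;
}
```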


## Solution


This proposal aims to ensure that Compaction can run stably with less memory by 
estimating and limiting the amount of memory it uses. The work is divided into 
the following three steps.


### Compaction ratio statistic


To estimate the amount of memory used by a Compaction, the key is to estimate 
the in-memory size of a row. We can simply use the ratio of a memtable's size 
in memory to the size of the file it is written to on disk as the compaction 
ratio. With this ratio, the size of the data file on disk, and the number of 
rows in the file, we can calculate the approximate memory occupancy of a single 
row of data.
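
A minimal sketch of this estimate, assuming the memtable size and flushed file 
size are recorded at flush time (the function names are illustrative, not the 
actual Doris API):

```cpp
#include <cstdint>

// Ratio of a memtable's in-memory size to the on-disk size of the segment it
// produced. In practice this would be recorded at flush time and averaged.
double compaction_ratio(int64_t memtable_bytes_in_mem,
                        int64_t flushed_file_bytes_on_disk) {
    return static_cast<double>(memtable_bytes_in_mem) / flushed_file_bytes_on_disk;
}

// Approximate in-memory size of one row, derived from the file's on-disk
// size, its row count, and the compaction ratio.
int64_t estimate_row_memory_size(int64_t file_bytes_on_disk,
                                 int64_t num_rows,
                                 double ratio) {
    return static_cast<int64_t>(file_bytes_on_disk * ratio / num_rows);
}
```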


### Supporting compaction within a version


Currently, a Compaction must include at least one whole version, and if there 
are too many Segments in a single version, it still consumes a lot of memory. 
So we need to support compaction over a subset of the segments within a single 
version.
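
One possible way to choose such a subset, sketched here only as an 
illustration under the assumption that a per-compaction memory budget and 
per-segment row-size estimates (from the previous step) are available:

```cpp
#include <cstdint>
#include <vector>

struct SegmentMeta {
    int64_t id;
    int64_t estimated_row_memory_bytes;  // from the compaction ratio step
};

// Greedily take segments in order until the next one would exceed the budget.
// At least one segment is always taken so compaction can make progress.
std::vector<SegmentMeta> pick_segments_within_budget(
        const std::vector<SegmentMeta>& segments,
        int64_t rows_per_rowblock,
        int64_t memory_budget_bytes) {
    std::vector<SegmentMeta> picked;
    int64_t used = 0;
    for (const auto& seg : segments) {
        int64_t cost = seg.estimated_row_memory_bytes * rows_per_rowblock;
        if (!picked.empty() && used + cost > memory_budget_bytes) break;
        picked.push_back(seg);
        used += cost;
    }
    return picked;
}
```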


### Limiting Compaction memory usage


With the previous two steps, it is possible to estimate and limit the memory 
usage of a single Compaction. Finally, we need an overall limit to ensure that 
the memory overhead stays within a reasonable range when multiple Compactions 
are running at the same time.
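
A minimal sketch of such an overall limit, assuming a process-wide reservation 
counter that each compaction checks before it starts (names and structure are 
illustrative, not the actual Doris implementation):

```cpp
#include <cstdint>
#include <mutex>

class CompactionMemLimiter {
public:
    explicit CompactionMemLimiter(int64_t total_limit_bytes)
        : _limit(total_limit_bytes) {}

    // Try to reserve memory for one compaction; return false if the global
    // budget would be exceeded, so the caller can retry later or shrink the
    // segment subset it plans to compact.
    bool try_reserve(int64_t bytes) {
        std::lock_guard<std::mutex> lock(_mutex);
        if (_used + bytes > _limit) return false;
        _used += bytes;
        return true;
    }

    // Release the reservation once the compaction finishes or is cancelled.
    void release(int64_t bytes) {
        std::lock_guard<std::mutex> lock(_mutex);
        _used -= bytes;
    }

private:
    std::mutex _mutex;
    int64_t _limit;
    int64_t _used = 0;
};
```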

