[A closing paragraph of the original message is unrecoverable here; it mentions Doris, kudu, and MemRowSet.]


----------------The following is from Baidu translation------------------


Some ideas for a Doris dataset cache:
a. How to accumulate batches
1. Where: batches may be held at the Frontend (FE) or the Backend (BE).
2. Size: a batch covers 1 minute or 10,000 records, and the datasets of 3 
batches are kept for temporary deduplication and column compression.
3. Data safety: batches in memory can be written to an on-disk cache in real 
time, like a checkpoint. The on-disk cache is used only to recover from 
downtime and does not participate in computation.
4. Participate in AP: the batches participate in data analysis and statistics.
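
To make points 2 and 3 above concrete, here is a minimal sketch of such a 
buffer. It is not Doris code: the class BatchBuffer, the function 
replay_checkpoint, and all parameter names are hypothetical, and the JSON-lines 
checkpoint format is just an assumption for illustration.

```python
import json
import time
from collections import deque


class BatchBuffer:
    """Hypothetical in-memory batch buffer; not a Doris API."""

    def __init__(self, checkpoint_path, max_rows=10_000, max_seconds=60.0,
                 keep_batches=3, clock=time.monotonic):
        self.checkpoint_path = checkpoint_path
        self.max_rows = max_rows
        self.max_seconds = max_seconds
        self.clock = clock
        # Only the newest `keep_batches` sealed batches stay in memory.
        # Handing an evicted batch to real storage is out of scope here.
        self.sealed = deque(maxlen=keep_batches)
        self.open_rows = []
        self.opened_at = clock()

    def append(self, row):
        # Persist first, so a crash cannot lose a row that is already visible.
        with open(self.checkpoint_path, "a", encoding="utf-8") as f:
            f.write(json.dumps(row) + "\n")
        self.open_rows.append(row)
        # The age check runs only on append; a real system would use a timer.
        too_big = len(self.open_rows) >= self.max_rows
        too_old = self.clock() - self.opened_at >= self.max_seconds
        if too_big or too_old:
            self.seal()

    def seal(self):
        # Close the open batch and start a new one.
        if self.open_rows:
            self.sealed.append(self.open_rows)
        self.open_rows = []
        self.opened_at = self.clock()

    def visible_rows(self):
        # Queries read memory only; the checkpoint file is never consulted.
        rows = [r for batch in self.sealed for r in batch]
        rows.extend(self.open_rows)
        return rows


def replay_checkpoint(path):
    """Reload rows after a restart; used for recovery only."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]
```

The sketch never truncates the checkpoint file, so a real design would also 
need to drop checkpoint entries once a batch has been flushed to storage.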


b. Advantages of accumulating batches
1. Real-time data: because new data is in memory, it can participate in 
computation as soon as it arrives. From generation to visibility of the data, 
millisecond-level latency can be achieved. This is the ultimate goal of the 
real-time data warehouses and data lakes on the market, where competition is 
intense.
2. Deduplication and consolidation: data can be deduplicated within a short 
time window, which reduces deduplication pressure and makes data consolidation 
during copy-on-write easier.
3. Data compression: compressing the columns after a batch has accumulated 
should give a higher compression ratio.
4. Historical data: historical data is computed separately, and its results 
are merged with the results computed on real-time data.
5. JDBC: at present, the performance of JDBC operations on Doris is low. With 
a proper design, batching can improve JDBC performance.
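
As an illustration of point 2 above, deduplication inside one batch could look 
roughly like the sketch below. The function dedup_batch and the field names 
"id" and "ts" are assumptions made up for this example, not anything Doris 
defines.

```python
def dedup_batch(rows, key="id", version="ts"):
    """Keep only the newest row per key within one batch.

    `key` and `version` are hypothetical field names. On a tie in
    `version`, the row that arrived later wins.
    """
    latest = {}
    for row in rows:
        k = row[key]
        if k not in latest or row[version] >= latest[k][version]:
            latest[k] = row
    return list(latest.values())
```

Because only one batch window is scanned, the cost stays proportional to the 
batch size rather than to the whole table.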


c. Disadvantages of accumulating batches
1. Dirty reads: because real-time data cannot be deduplicated against the 
historical data right after it arrives, dirty reads can occur. However, for 
CDC data with no modifications or deletions, dirty reads do not occur, and 
some scenarios can tolerate a certain amount of data deviation.
2. Storage and compute separation: batching conflicts with the separation of 
storage and compute. Solution: real-time data can be computed at the storage 
node, historical data can be computed at the compute node, and the final 
results can be merged.
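
The solution in point 2 can be sketched as two partial aggregations whose 
results are merged at the end. This is only an illustration under my own 
assumptions: partial_aggregate and merge_partials are hypothetical helpers, 
and the aggregate is limited to count and sum, which can be merged by simple 
addition.

```python
from collections import defaultdict


def partial_aggregate(rows, group_key, value_key):
    """Compute (count, sum) per group on one node's data."""
    acc = defaultdict(lambda: [0, 0])
    for row in rows:
        slot = acc[row[group_key]]
        slot[0] += 1
        slot[1] += row[value_key]
    return {g: (c, s) for g, (c, s) in acc.items()}


def merge_partials(*partials):
    """Combine per-node partial results into the final result."""
    merged = defaultdict(lambda: [0, 0])
    for partial in partials:
        for g, (c, s) in partial.items():
            merged[g][0] += c
            merged[g][1] += s
    return {g: (c, s) for g, (c, s) in merged.items()}


# Historical rows would be aggregated on the compute node,
# real-time rows on the storage node that holds the batches.
historical = [{"city": "bj", "amt": 10}, {"city": "sh", "amt": 5}]
realtime = [{"city": "bj", "amt": 3}]
final = merge_partials(
    partial_aggregate(historical, "city", "amt"),
    partial_aggregate(realtime, "city", "amt"),
)
```

Note that if a real-time row is an update of a row already present in the 
historical data, this merge counts it twice, which is exactly the dirty read 
described in point 1.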
