Some ideas on a Doris dataset cache (Dataset Cache)

Thu, 04 Aug 2022 19:26:56 -0700

[Closing paragraph of the original Chinese text, not recoverable from the archive; it mentions Doris, Kudu, and MemRowSet.]

----------------The following is from Baidu translation------------------

Some ideas of Doris dataset cache:
a. How to accumulate batches
1. Where: batches may be held at the Frontend (FE) or the Backend (BE).
2. Size: close a batch after 1 minute or 10,000 rows, and keep 3 batches of
datasets in memory for short-term deduplication and column compression.
3. Data safety: batches in memory can be written to an on-disk cache in real
time, like a checkpoint. The on-disk copy is used only to recover from
downtime and does not take part in computation.
4. Participation in AP: the in-memory batches take part in data analysis and
statistics.
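
The batching rules in section a can be sketched as follows. This is only an
illustration under stated assumptions, not Doris code: the class name
BatchAccumulator, its parameters, and the JSON-lines checkpoint format are all
hypothetical, and an in-memory StringIO stands in for the on-disk cache.

```python
import io
import json
import time
from collections import deque


class BatchAccumulator:
    """Hypothetical in-memory batch buffer; not a Doris API."""

    def __init__(self, max_rows=10000, max_age_s=60.0, keep_batches=3,
                 checkpoint=None, clock=time.monotonic):
        self.max_rows = max_rows
        self.max_age_s = max_age_s
        self.checkpoint = checkpoint      # writable text file object, or None
        self.clock = clock
        self.current = []                 # rows of the open batch
        self.opened_at = None             # time the open batch got its first row
        self.sealed = deque(maxlen=keep_batches)  # most recent sealed batches

    def append(self, row):
        now = self.clock()
        # Seal an over-age batch before accepting the new row.
        if self.current and now - self.opened_at >= self.max_age_s:
            self.seal()
        if not self.current:
            self.opened_at = now
        if self.checkpoint is not None:
            # Written for crash recovery only; queries never read it.
            self.checkpoint.write(json.dumps(row) + "\n")
            self.checkpoint.flush()
        self.current.append(row)
        if len(self.current) >= self.max_rows:
            self.seal()

    def seal(self):
        if self.current:
            self.sealed.append(self.current)
            self.current = []
            self.opened_at = None

    def visible_rows(self):
        """Rows a query may scan: sealed batches plus the open one."""
        return [r for batch in self.sealed for r in batch] + list(self.current)


# Usage with a fake clock and an in-memory stand-in for the disk cache.
fake_now = [0.0]
disk = io.StringIO()
acc = BatchAccumulator(max_rows=3, checkpoint=disk, clock=lambda: fake_now[0])
for i in range(3):
    acc.append({"id": i})      # third row reaches max_rows and seals batch 1
acc.append({"id": 3})
fake_now[0] = 61.0
acc.append({"id": 4})          # open batch is over 60 s old, so it is sealed first
```

Rows in the open batch are already returned by visible_rows(), which is what
lets new data take part in queries before a batch is sealed. The time limit is
checked only when a row arrives; a real implementation would presumably also
need a timer so that an idle batch is sealed.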

b. Advantages of accumulating batches
1. Real-time data: because new data is held in memory, it can take part in
computation as soon as it arrives. The delay from data generation to
visibility can be at the millisecond level. This is the ultimate goal that the
real-time data warehouses and data lakes on the market are all racing toward.
2. Data deduplication and consolidation: data can be deduplicated within a
short time window, which reduces deduplication pressure and makes data
consolidation during Copy-on-Write more effective.
3. Data compression: compressing columns after a batch has accumulated should
give a higher compression ratio.
4. Historical data: historical data is computed separately, and its results
are then merged with the results computed on real-time data.
5. JDBC: at present, JDBC operations on Doris perform poorly. With a proper
design, batching can improve JDBC performance.
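
As a small illustration of point 2 above, deduplication inside one batch can be
as simple as keeping the newest row per key. The function name and the "id",
"ts", and "qty" column names are assumptions made for this sketch, not part of
Doris.

```python
def dedup_latest(rows, key="id", version="ts"):
    """Keep one row per key: the one with the highest version value."""
    latest = {}
    for row in rows:
        k = row[key]
        # On equal versions the later-arriving row wins.
        if k not in latest or row[version] >= latest[k][version]:
            latest[k] = row
    return list(latest.values())


batch = [
    {"id": 1, "ts": 10, "qty": 5},
    {"id": 2, "ts": 11, "qty": 7},
    {"id": 1, "ts": 12, "qty": 6},   # newer version of id 1
]
compacted = dedup_latest(batch)
```

Only the compacted rows would then need to be compressed and merged into
storage, which is where the reduced deduplication pressure comes from.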

c. Disadvantages of accumulating batches
1. Dirty reads: because real-time data cannot be deduplicated against
historical data as soon as it arrives, dirty reads can occur. However, for CDC
data with no modifications or deletions, dirty reads do not occur, and some
scenarios can tolerate a certain amount of data deviation.
2. Storage-compute separation: batching conflicts with the separation of
storage and compute. Solution: compute on real-time data at the storage node,
compute on historical data at the compute node, and merge the final results.
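
One way to picture the proposed solution is with mergeable partial aggregates:
each side computes a partial result over its own data, and only the partials
are combined. The Partial type and aggregate function below are hypothetical
names for this sketch, and the two input lists merely stand in for the
historical and real-time data.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Partial:
    """Partial aggregate state that can be merged without the raw rows."""
    count: int = 0
    total: float = 0.0

    def merge(self, other):
        return Partial(self.count + other.count, self.total + other.total)

    @property
    def mean(self):
        return self.total / self.count if self.count else None


def aggregate(values):
    values = list(values)
    return Partial(len(values), float(sum(values)))


historical = aggregate([10, 20, 30])   # stands in for the compute node's result
realtime = aggregate([40])             # stands in for the storage node's result
combined = historical.merge(realtime)
```

Because the two partials are merged without comparing rows across them, a key
that appears in both the historical and the real-time data is counted twice.
That is the dirty-read caveat from point 1 above.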
