[GitHub] [incubator-doris] Youngwb opened a new issue #3930: [Proposal] Doris support version column for REPLACE aggregate type

GitBox Tue, 23 Jun 2020 00:43:04 -0700


Youngwb opened a new issue #3930:
URL: https://github.com/apache/incubator-doris/issues/3930



   ### Background
   Doris currently uses the REPLACE aggregate type to update data, but the 
replacement order is not guaranteed for rows imported in the same batch. To get 
a deterministic result, the user has to make sure that no two rows in a batch 
share the same key columns, which is very inconvenient. To solve this problem, 
we can use a **version** column to specify the replacement order.
   
   ### Goal
   The user specifies a **version column** when creating the table, and Doris 
relies on this column when updating columns of REPLACE type. A row with a larger 
version value can replace the data of a row with a smaller version value, but 
never the other way around.
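
   The intended semantics can be sketched in a few lines of Python. This is 
only an illustration, not Doris code: the `Row` type and `merge` function are 
hypothetical names, and letting the later row win when two versions are equal 
is an assumption, since the proposal does not say how ties are handled.

```python
from dataclasses import dataclass


@dataclass
class Row:
    key: tuple     # key columns, e.g. (id, date, group_id)
    version: int   # version column, aggregated with MAX
    keyword: str   # a REPLACE column
    clicks: int    # a SUM column


def merge(old: Row, new: Row) -> Row:
    """Aggregate two rows that share the same key columns."""
    if old.key != new.key:
        raise ValueError("rows must share the same key columns")
    # REPLACE only takes effect when the incoming version is not smaller.
    # Assumption: on equal versions the later row wins.
    keyword = new.keyword if new.version >= old.version else old.keyword
    return Row(
        key=old.key,
        version=max(old.version, new.version),  # MAX aggregation
        keyword=keyword,
        clicks=old.clicks + new.clicks,          # SUM is order-independent
    )
```

With this rule, merging a version 1 row and a version 2 row gives the same 
`keyword` whichever of the two arrives first.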
   
   ### Create Table Interface
   ```
   CREATE TABLE `test` (
   `id` bigint(20) NOT NULL,
   `date` date NOT NULL,
   `group_id` bigint(20) NOT NULL,
   `version` int MAX NOT NULL,
   `keyword` varchar(128) REPLACE NOT NULL,
   `clicks` bigint(20) SUM NULL DEFAULT "0" ,
   `cost` bigint(20) SUM NULL DEFAULT "0" 
   ) ENGINE=OLAP
   AGGREGATE KEY(`id`, `date`, `group_id`)
   DISTRIBUTED BY HASH(`id`) BUCKETS 16
   PROPERTIES (
     "replace_version_column" = "version"
   );
   ```
   When creating a table, the user simply adds the **replace_version_column** 
property in PROPERTIES to identify the version column. The version column must 
use the MAX aggregation type, so that only the largest version value is 
retained for rows with the same key columns.
   
   ### Query
   When a query does not contain a REPLACE column, the original logic is 
followed. When a query contains REPLACE columns, BE needs to add the version 
column that the REPLACE columns depend on to the columns it reads, and compare 
the version values when rows are aggregated. This can be done by extending the 
**Reader return columns**. In FE, **isPreAggregation** is OFF because a REPLACE 
column is a value column in StorageEngine, which means the storage engine has 
to aggregate the data before returning it to the scan node, so we can guarantee 
that rows with the same key columns are aggregated in Reader.
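
   As a rough sketch of the return-column extension, assuming the Reader is 
handed a plain list of column names: the function name and its parameters are 
hypothetical and do not correspond to the actual BE interface.

```python
def extend_return_columns(requested, replace_columns, version_column):
    """Return the columns the Reader should read for a query.

    requested:       column names the query asks for, in order
    replace_columns: set of column names with the REPLACE aggregate type
    version_column:  name given by "replace_version_column"
    """
    columns = list(requested)
    needs_version = any(name in replace_columns for name in columns)
    # Only add the version column when a REPLACE column is actually read
    # and the query did not already ask for it.
    if needs_version and version_column not in columns:
        columns.append(version_column)
    return columns
```

A query on `id` and `keyword` would then read `version` as well, while a query 
on `id` and `clicks` is left unchanged.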
   
   
   ### Compaction
   Base and Cumulative Compaction use Reader to aggregate data, and they use 
all tablet columns as return columns. So, as in query processing, Reader can 
apply REPLACE based on the version column.
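
   A simplified model of version-aware merging during compaction is shown 
below. It assumes each rowset is a list of `(key, version, value)` tuples 
already sorted by key, and it only tracks a single REPLACE value per row. The 
`compact` function is a hypothetical name, not the Doris Reader.

```python
import heapq
from itertools import groupby


def compact(rowsets):
    """Merge key-sorted rowsets, keeping the highest version per key.

    rowsets: list of lists of (key, version, value), each sorted by key.
    """
    # Merge the sorted inputs by key only, so rows with equal keys
    # from different rowsets become adjacent.
    merged = heapq.merge(*rowsets, key=lambda row: row[0])
    output = []
    for _, group in groupby(merged, key=lambda row: row[0]):
        # Keep the row whose version is largest, regardless of
        # which rowset it came from.
        output.append(max(group, key=lambda row: row[1]))
    return output
```

Because the choice depends only on the version value, the result does not 
depend on the order in which the rowsets were written.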
   
   ### Load
   Within one batch load, Doris uses one or more **MemTable**s. We need to 
ensure that, for rows with the same key columns inside one MemTable, columns of 
REPLACE type are replaced according to the version column. The replacement 
order across different MemTables does not need to be guaranteed at load time, 
because Query and Compaction guarantee it.
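
   The per-MemTable rule could look roughly like the toy class below. It is a 
dictionary-based stand-in, not the real MemTable, and it assumes that a row 
with an equal version overwrites the existing one.

```python
class MemTable:
    """Toy in-memory table: one (version, value) pair per key."""

    def __init__(self):
        self.rows = {}  # key -> (version, value)

    def insert(self, key, version, value):
        current = self.rows.get(key)
        # Replace only if the incoming version is not smaller.
        # Assumption: equal versions overwrite.
        if current is None or version >= current[0]:
            self.rows[key] = (version, value)

    def flush(self):
        """Return the rows sorted by key, as they would be written out."""
        return sorted(self.rows.items())
```

Two MemTables of the same batch may still each hold a row for the same key. 
That conflict is left to be resolved when the data is read or compacted.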
   
   ### RollUp
   If a rollup contains a column of REPLACE type, we need either the user to 
add the replace version column to the rollup or Doris to add the column 
automatically.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscr...@doris.apache.org
For additional commands, e-mail: commits-h...@doris.apache.org
