This is an automated email from the ASF dual-hosted git repository.

zykkk pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/doris.git
The following commit(s) were added to refs/heads/master by this push: new aa1704c50b5 [doc](data-model) update data-model doc (#24941) aa1704c50b5 is described below commit aa1704c50b5077b00aee6ac0652f2d5c09c76c05 Author: Guangdong Liu <liug...@gmail.com> AuthorDate: Mon Oct 9 10:08:16 2023 +0800 [doc](data-model) update data-model doc (#24941) --- docs/en/docs/data-table/data-model.md | 357 +++++++++++++++------------ docs/zh-CN/docs/data-table/data-model.md | 399 +++++++++++++++++-------------- 2 files changed, 430 insertions(+), 326 deletions(-) diff --git a/docs/en/docs/data-table/data-model.md b/docs/en/docs/data-table/data-model.md index 77c7c771606..4e6cae8c668 100644 --- a/docs/en/docs/data-table/data-model.md +++ b/docs/en/docs/data-table/data-model.md @@ -50,22 +50,24 @@ We illustrate what aggregation model is and how to use it correctly with practic Assume that the business has the following data table schema: -|ColumnName|Type|AggregationType|Comment| -|---|---|---|---| -| userid | LARGEINT | | user id| -| date | DATE | | date of data filling| -| City | VARCHAR (20) | | User City| -| age | SMALLINT | | User age| -| sex | TINYINT | | User gender| -| Last_visit_date | DATETIME | REPLACE | Last user access time| -| Cost | BIGINT | SUM | Total User Consumption| -| max dwell time | INT | MAX | Maximum user residence time| -| min dwell time | INT | MIN | User minimum residence time| - -The corresponding CREATE TABLE statement would be as follows (omitting the Partition and Distribution information): +| ColumnName | Type | AggregationType | Comment | +|-----------------|--------------|-----------------|-----------------------------| +| userid | LARGEINT | | user id | +| date | DATE | | date of data filling | +| City | VARCHAR (20) | | User City | +| age | SMALLINT | | User age | +| sex | TINYINT | | User gender | +| Last_visit_date | DATETIME | REPLACE | Last user access time | +| Cost | BIGINT | SUM | Total User Consumption | +| max dwell time | INT | MAX | Maximum user residence time | +| min dwell time | INT | MIN | User minimum residence time | + +The corresponding to CREATE TABLE statement would be as follows (omitting the Partition and Distribution information): ``` -CREATE TABLE IF NOT EXISTS example_db.example_tbl +CREATE DATABASE IF NOT EXISTS example_db; + +CREATE TABLE IF NOT EXISTS example_db.example_tbl_agg1 ( `user_id` LARGEINT NOT NULL COMMENT "user id", `date` DATE NOT NULL COMMENT "data import time", @@ -102,21 +104,21 @@ If these aggregation methods cannot meet the requirements, you can choose to use Suppose that you have the following import data (raw data): -|user\_id|date|city|age|sex|last\_visit\_date|cost|max\_dwell\_time|min\_dwell\_time| -|---|---|---|---|---|---|---|---|---| -| 10000 | 2017-10-01 | Beijing | 20 | 0 | 2017-10-01 06:00 | 20 | 10 | 10| -| 10000 | 2017-10-01 | Beijing | 20 | 0 | 2017-10-01 07:00 | 15 | 2 | 2| -| 10001 | 2017-10-01 | Beijing | 30 | 1 | 2017-10-01 17:05:45 | 2 | 22 | 22| -| 10002 | 2017-10-02 | Shanghai | 20 | 1 | 2017-10-02 12:59:12 | 200 | 5 | 5| -| 10003 | 2017-10-02 | Guangzhou | 32 | 0 | 2017-10-02 11:20:00 | 30 | 11 | 11| -| 10004 | 2017-10-01 | Shenzhen | 35 | 0 | 2017-10-01 10:00:15 | 100 | 3 | 3| -| 10004 | 2017-10-03 | Shenzhen | 35 | 0 | 2017-10-03 10:20:22 | 11 | 6 | 6| +| user\_id | date | city | age | sex | last\_visit\_date | cost | max\_dwell\_time | min\_dwell\_time | +|----------|------------|-----------|-----|-----|---------------------|------|------------------|------------------| +| 10000 | 2017-10-01 | Beijing 
| 20 | 0 | 2017-10-01 06:00 | 20 | 10 | 10 | +| 10000 | 2017-10-01 | Beijing | 20 | 0 | 2017-10-01 07:00 | 15 | 2 | 2 | +| 10001 | 2017-10-01 | Beijing | 30 | 1 | 2017-10-01 17:05:45 | 2 | 22 | 22 | +| 10002 | 2017-10-02 | Shanghai | 20 | 1 | 2017-10-02 12:59:12 | 200 | 5 | 5 | +| 10003 | 2017-10-02 | Guangzhou | 32 | 0 | 2017-10-02 11:20:00 | 30 | 11 | 11 | +| 10004 | 2017-10-01 | Shenzhen | 35 | 0 | 2017-10-01 10:00:15 | 100 | 3 | 3 | +| 10004 | 2017-10-03 | Shenzhen | 35 | 0 | 2017-10-03 10:20:22 | 11 | 6 | 6 | And you can import data with the following sql: ```sql -insert into example_db.example_tbl_agg values +insert into example_db.example_tbl_agg1 values (10000,"2017-10-01","Beijing",20,0,"2017-10-01 06:00:00",20,10,10), (10000,"2017-10-01","Beijing",20,0,"2017-10-01 07:00:00",15,2,2), (10001,"2017-10-01","Beijing",30,1,"2017-10-01 17:05:45",2,22,22), @@ -128,28 +130,28 @@ insert into example_db.example_tbl_agg values Assume that this is a table recording the user behaviors when visiting a certain commodity page. The first row of data, for example, is explained as follows: -| Data | Description| -|---|---| -| 10000 | User id, each user uniquely identifies id| -| 2017-10-01 | Data storage time, accurate to date| -| Beijing | User City| -| 20 | User Age| -| 0 | Gender male (1 for female)| -| 2017-10-01 06:00 | User's time to visit this page, accurate to seconds| -| 20 | Consumption generated by the user's current visit| -| 10 | User's visit, time to stay on the page| -| 10 | User's current visit, time spent on the page (redundancy)| +| Data | Description | +|------------------|-----------------------------------------------------------| +| 10000 | User id, each user uniquely identifies id | +| 2017-10-01 | Data storage time, accurate to date | +| Beijing | User City | +| 20 | User Age | +| 0 | Gender male (1 for female) | +| 2017-10-01 06:00 | User's time to visit this page, accurate to seconds | +| 20 | Consumption generated by the user's current visit | +| 10 | User's visit, time to stay on the page | +| 10 | User's current visit, time spent on the page (redundancy) | After this batch of data is imported into Doris correctly, it will be stored in Doris as follows: -|user\_id|date|city|age|sex|last\_visit\_date|cost|max\_dwell\_time|min\_dwell\_time| -|---|---|---|---|---|---|---|---|---| -| 10000 | 2017-10-01 | Beijing | 20 | 0 | 2017-10-01 07:00 | 35 | 10 | 2| -| 10001 | 2017-10-01 | Beijing | 30 | 1 | 2017-10-01 17:05:45 | 2 | 22 | 22| -| 10002 | 2017-10-02 | Shanghai | 20 | 1 | 2017-10-02 12:59:12 | 200 | 5 | 5| -| 10003 | 2017-10-02 | Guangzhou | 32 | 0 | 2017-10-02 11:20:00 | 30 | 11 | 11| -| 10004 | 2017-10-01 | Shenzhen | 35 | 0 | 2017-10-01 10:00:15 | 100 | 3 | 3| -| 10004 | 2017-10-03 | Shenzhen | 35 | 0 | 2017-10-03 10:20:22 | 11 | 6 | 6| +| user\_id | date | city | age | sex | last\_visit\_date | cost | max\_dwell\_time | min\_dwell\_time | +|----------|------------|-----------|-----|-----|---------------------|------|------------------|------------------| +| 10000 | 2017-10-01 | Beijing | 20 | 0 | 2017-10-01 07:00 | 35 | 10 | 2 | +| 10001 | 2017-10-01 | Beijing | 30 | 1 | 2017-10-01 17:05:45 | 2 | 22 | 22 | +| 10002 | 2017-10-02 | Shanghai | 20 | 1 | 2017-10-02 12:59:12 | 200 | 5 | 5 | +| 10003 | 2017-10-02 | Guangzhou | 32 | 0 | 2017-10-02 11:20:00 | 30 | 11 | 11 | +| 10004 | 2017-10-01 | Shenzhen | 35 | 0 | 2017-10-01 10:00:15 | 100 | 3 | 3 | +| 10004 | 2017-10-03 | Shenzhen | 35 | 0 | 2017-10-03 10:20:22 | 11 | 6 | 6 | As you can see, the data of User 10000 have been 
aggregated to one row, while those of other users remain the same. The explanation for the aggregated data of User 10000 is as follows (the first 5 columns remain unchanged, so it starts with Column 6 `last_visit_date`): @@ -169,37 +171,58 @@ After aggregation, Doris only stores the aggregated data. In other words, the de Here is a modified version of the table schema in Example 1: -| ColumnName | Type | AggregationType | Comment | -| --------------- | ------------ | --------------- | ------------------------------------------------------------ | -| user_id | LARGEINT | | User ID | -| date | DATE | | Date when the data are imported | -| timestamp | DATETIME | | Date and time when the data are imported (with second-level accuracy) | -| city | VARCHAR (20) | | User location city | -| age | SMALLINT | | User age | -| sex | TINYINT | | User gender | -| last visit date | DATETIME | REPLACE | Last visit time of the user | -| cost | BIGINT | SUM | Total consumption of the user | -| max_dwell_time | INT | MAX | Maximum user dwell time | -| min_dwell_time | INT | MIN | Minimum user dwell time | +| ColumnName | Type | AggregationType | Comment | +|-----------------|--------------|-----------------|-----------------------------------------------------------------------| +| user_id | LARGEINT | | User ID | +| date | DATE | | Date when the data are imported | +| timestamp | DATETIME | | Date and time when the data are imported (with second-level accuracy) | +| city | VARCHAR (20) | | User location city | +| age | SMALLINT | | User age | +| sex | TINYINT | | User gender | +| last visit date | DATETIME | REPLACE | Last visit time of the user | +| cost | BIGINT | SUM | Total consumption of the user | +| max_dwell_time | INT | MAX | Maximum user dwell time | +| min_dwell_time | INT | MIN | Minimum user dwell time | A new column `timestamp` has been added to record the date and time when the data are imported (with second-level accuracy). 
+```sql +CREATE TABLE IF NOT EXISTS example_db.example_tbl_agg2 +( + `user_id` LARGEINT NOT NULL COMMENT "用户id", + `date` DATE NOT NULL COMMENT "数据灌入日期时间", + `timestamp` DATETIME NOT NULL COMMENT "数据灌入日期时间戳", + `city` VARCHAR(20) COMMENT "用户所在城市", + `age` SMALLINT COMMENT "用户年龄", + `sex` TINYINT COMMENT "用户性别", + `last_visit_date` DATETIME REPLACE DEFAULT "1970-01-01 00:00:00" COMMENT "用户最后一次访问时间", + `cost` BIGINT SUM DEFAULT "0" COMMENT "用户总消费", + `max_dwell_time` INT MAX DEFAULT "0" COMMENT "用户最大停留时间", + `min_dwell_time` INT MIN DEFAULT "99999" COMMENT "用户最小停留时间" +) +AGGREGATE KEY(`user_id`, `date`, `timestamp` ,`city`, `age`, `sex`) +DISTRIBUTED BY HASH(`user_id`) BUCKETS 1 +PROPERTIES ( +"replication_allocation" = "tag.location.default: 1" +); +``` + Suppose that the import data are as follows: -| user_id | date | timestamp | city | age | sex | last\_visit\_date | cost | max\_dwell\_time | min\_dwell\_time | -| ------- | ---------- | ------------------- | --------- | ---- | ---- | ------------------- | ---- | ---------------- | ---------------- | -| 10000 | 2017-10-01 | 2017-10-01 08:00:05 | Beijing | 20 | 0 | 2017-10-01 06:00 | 20 | 10 | 10 | -| 10000 | 2017-10-01 | 2017-10-01 09:00:05 | Beijing | 20 | 0 | 2017-10-01 07:00 | 15 | 2 | 2 | -| 10001 | 2017-10-01 | 2017-10-01 18:12:10 | Beijing | 30 | 1 | 2017-10-01 17:05:45 | 2 | 22 | 22 | -| 10002 | 2017-10-02 | 2017-10-02 13:10:00 | Shanghai | 20 | 1 | 2017-10-02 12:59:12 | 200 | 5 | 5 | -| 10003 | 2017-10-02 | 2017-10-02 13:15:00 | Guangzhou | 32 | 0 | 2017-10-02 11:20:00 | 30 | 11 | 11 | -| 10004 | 2017-10-01 | 2017-10-01 12:12:48 | Shenzhen | 35 | 0 | 2017-10-01 10:00:15 | 100 | 3 | 3 | -| 10004 | 2017-10-03 | 2017-10-03 12:38:20 | Shenzhen | 35 | 0 | 2017-10-03 10:20:22 | 11 | 6 | 6 | +| user_id | date | timestamp | city | age | sex | last\_visit\_date | cost | max\_dwell\_time | min\_dwell\_time | +|---------|------------|---------------------|-----------|-----|-----|---------------------|------|------------------|------------------| +| 10000 | 2017-10-01 | 2017-10-01 08:00:05 | Beijing | 20 | 0 | 2017-10-01 06:00 | 20 | 10 | 10 | +| 10000 | 2017-10-01 | 2017-10-01 09:00:05 | Beijing | 20 | 0 | 2017-10-01 07:00 | 15 | 2 | 2 | +| 10001 | 2017-10-01 | 2017-10-01 18:12:10 | Beijing | 30 | 1 | 2017-10-01 17:05:45 | 2 | 22 | 22 | +| 10002 | 2017-10-02 | 2017-10-02 13:10:00 | Shanghai | 20 | 1 | 2017-10-02 12:59:12 | 200 | 5 | 5 | +| 10003 | 2017-10-02 | 2017-10-02 13:15:00 | Guangzhou | 32 | 0 | 2017-10-02 11:20:00 | 30 | 11 | 11 | +| 10004 | 2017-10-01 | 2017-10-01 12:12:48 | Shenzhen | 35 | 0 | 2017-10-01 10:00:15 | 100 | 3 | 3 | +| 10004 | 2017-10-03 | 2017-10-03 12:38:20 | Shenzhen | 35 | 0 | 2017-10-03 10:20:22 | 11 | 6 | 6 | And you can import data with the following sql: ```sql -insert into example_db.example_tbl values +insert into example_db.example_tbl_agg2 values (10000,"2017-10-01","2017-10-01 08:00:05","Beijing",20,0,"2017-10-01 06:00:00",20,10,10), (10000,"2017-10-01","2017-10-01 09:00:05","Beijing",20,0,"2017-10-01 07:00:00",15,2,2), (10001,"2017-10-01","2017-10-01 18:12:10","Beijing",30,1,"2017-10-01 17:05:45",2,22,22), @@ -211,15 +234,15 @@ insert into example_db.example_tbl values After importing, this batch of data will be stored in Doris as follows: -| user_id | date | timestamp | city | age | sex | last\_visit\_date | cost | max\_dwell\_time | min\_dwell\_time | -| ------- | ---------- | ------------------- | --------- | ---- | ---- | ------------------- | ---- | ---------------- | ---------------- | -| 10000 | 
2017-10-01 | 2017-10-01 08:00:05 | Beijing | 20 | 0 | 2017-10-01 06:00 | 20 | 10 | 10 | -| 10000 | 2017-10-01 | 2017-10-01 09:00:05 | Beijing | 20 | 0 | 2017-10-01 07:00 | 15 | 2 | 2 | -| 10001 | 2017-10-01 | 2017-10-01 18:12:10 | Beijing | 30 | 1 | 2017-10-01 17:05:45 | 2 | 22 | 22 | -| 10002 | 2017-10-02 | 2017-10-02 13:10:00 | Shanghai | 20 | 1 | 2017-10-02 12:59:12 | 200 | 5 | 5 | -| 10003 | 2017-10-02 | 2017-10-02 13:15:00 | Guangzhou | 32 | 0 | 2017-10-02 11:20:00 | 30 | 11 | 11 | -| 10004 | 2017-10-01 | 2017-10-01 12:12:48 | Shenzhen | 35 | 0 | 2017-10-01 10:00:15 | 100 | 3 | 3 | -| 10004 | 2017-10-03 | 2017-10-03 12:38:20 | Shenzhen | 35 | 0 | 2017-10-03 10:20:22 | 11 | 6 | 6 | +| user_id | date | timestamp | city | age | sex | last\_visit\_date | cost | max\_dwell\_time | min\_dwell\_time | +|---------|------------|---------------------|-----------|-----|-----|---------------------|------|------------------|------------------| +| 10000 | 2017-10-01 | 2017-10-01 08:00:05 | Beijing | 20 | 0 | 2017-10-01 06:00 | 20 | 10 | 10 | +| 10000 | 2017-10-01 | 2017-10-01 09:00:05 | Beijing | 20 | 0 | 2017-10-01 07:00 | 15 | 2 | 2 | +| 10001 | 2017-10-01 | 2017-10-01 18:12:10 | Beijing | 30 | 1 | 2017-10-01 17:05:45 | 2 | 22 | 22 | +| 10002 | 2017-10-02 | 2017-10-02 13:10:00 | Shanghai | 20 | 1 | 2017-10-02 12:59:12 | 200 | 5 | 5 | +| 10003 | 2017-10-02 | 2017-10-02 13:15:00 | Guangzhou | 32 | 0 | 2017-10-02 11:20:00 | 30 | 11 | 11 | +| 10004 | 2017-10-01 | 2017-10-01 12:12:48 | Shenzhen | 35 | 0 | 2017-10-01 10:00:15 | 100 | 3 | 3 | +| 10004 | 2017-10-03 | 2017-10-03 12:38:20 | Shenzhen | 35 | 0 | 2017-10-03 10:20:22 | 11 | 6 | 6 | As you can see, the stored data are exactly the same as the import data. No aggregation has ever happened. This is because, the newly added `timestamp` column results in **difference of Keys** among the rows. That is to say, as long as the Keys of the rows are not identical in the import data, Doris can save the complete detailed data even in the Aggregate Model. @@ -227,41 +250,41 @@ As you can see, the stored data are exactly the same as the import data. 
No aggr Based on Example 1, suppose that you have the following data stored in Doris: -| user_id | date | city | age | sex | last\_visit\_date | cost | max\_dwell\_time | min\_dwell\_time | -| ------- | ---------- | --------- | ---- | ---- | ------------------- | ---- | ---------------- | ---------------- | -| 10000 | 2017-10-01 | Beijing | 20 | 0 | 2017-10-01 07:00 | 35 | 10 | 2 | -| 10001 | 2017-10-01 | Beijing | 30 | 1 | 2017-10-01 17:05:45 | 2 | 22 | 22 | -| 10002 | 2017-10-02 | Shanghai | 20 | 1 | 2017-10-02 12:59:12 | 200 | 5 | 5 | -| 10003 | 2017-10-02 | Guangzhou | 32 | 0 | 2017-10-02 11:20:00 | 30 | 11 | 11 | -| 10004 | 2017-10-01 | Shenzhen | 35 | 0 | 2017-10-01 10:00:15 | 100 | 3 | 3 | -| 10004 | 2017-10-03 | Shenzhen | 35 | 0 | 2017-10-03 10:20:22 | 11 | 6 | 6 | +| user_id | date | city | age | sex | last\_visit\_date | cost | max\_dwell\_time | min\_dwell\_time | +|---------|------------|-----------|-----|-----|---------------------|------|------------------|------------------| +| 10000 | 2017-10-01 | Beijing | 20 | 0 | 2017-10-01 07:00 | 35 | 10 | 2 | +| 10001 | 2017-10-01 | Beijing | 30 | 1 | 2017-10-01 17:05:45 | 2 | 22 | 22 | +| 10002 | 2017-10-02 | Shanghai | 20 | 1 | 2017-10-02 12:59:12 | 200 | 5 | 5 | +| 10003 | 2017-10-02 | Guangzhou | 32 | 0 | 2017-10-02 11:20:00 | 30 | 11 | 11 | +| 10004 | 2017-10-01 | Shenzhen | 35 | 0 | 2017-10-01 10:00:15 | 100 | 3 | 3 | +| 10004 | 2017-10-03 | Shenzhen | 35 | 0 | 2017-10-03 10:20:22 | 11 | 6 | 6 | Now you need to import a new batch of data: -| user_id | date | city | age | sex | last\_visit\_date | cost | max\_dwell\_time | min\_dwell\_time | -| ------- | ---------- | -------- | ---- | ---- | ------------------- | ---- | ---------------- | ---------------- | -| 10004 | 2017-10-03 | Shenzhen | 35 | 0 | 2017-10-03 11:22:00 | 44 | 19 | 19 | -| 10005 | 2017-10-03 | Changsha | 29 | 1 | 2017-10-03 18:11:02 | 3 | 1 | 1 | +| user_id | date | city | age | sex | last\_visit\_date | cost | max\_dwell\_time | min\_dwell\_time | +|---------|------------|----------|-----|-----|---------------------|------|------------------|------------------| +| 10004 | 2017-10-03 | Shenzhen | 35 | 0 | 2017-10-03 11:22:00 | 44 | 19 | 19 | +| 10005 | 2017-10-03 | Changsha | 29 | 1 | 2017-10-03 18:11:02 | 3 | 1 | 1 | And you can import data with the following sql: ```sql -insert into example_db.example_tbl_agg values +insert into example_db.example_tbl_agg1 values (10004,"2017-10-03","Shenzhen",35,0,"2017-10-03 11:22:00",44,19,19), (10005,"2017-10-03","Changsha",29,1,"2017-10-03 18:11:02",3,1,1); ``` After importing, the data stored in Doris will be updated as follows: -| user_id | date | city | age | sex | last\_visit\_date | cost | max\_dwell\_time | min\_dwell\_time | -| ------- | ---------- | --------- | ---- | ---- | ------------------- | ---- | ---------------- | ---------------- | -| 10000 | 2017-10-01 | Beijing | 20 | 0 | 2017-10-01 07:00 | 35 | 10 | 2 | -| 10001 | 2017-10-01 | Beijing | 30 | 1 | 2017-10-01 17:05:45 | 2 | 22 | 22 | -| 10002 | 2017-10-02 | Shanghai | 20 | 1 | 2017-10-02 12:59:12 | 200 | 5 | 5 | -| 10003 | 2017-10-02 | Guangzhou | 32 | 0 | 2017-10-02 11:20:00 | 30 | 11 | 11 | -| 10004 | 2017-10-01 | Shenzhen | 35 | 0 | 2017-10-01 10:00:15 | 100 | 3 | 3 | -| 10004 | 2017-10-03 | Shenzhen | 35 | 0 | 2017-10-03 11:22:00 | 55 | 19 | 6 | -| 10005 | 2017-10-03 | Changsha | 29 | 1 | 2017-10-03 18:11:02 | 3 | 1 | 1 | +| user_id | date | city | age | sex | last\_visit\_date | cost | max\_dwell\_time | min\_dwell\_time | 
+|---------|------------|-----------|-----|-----|---------------------|------|------------------|------------------| +| 10000 | 2017-10-01 | Beijing | 20 | 0 | 2017-10-01 07:00 | 35 | 10 | 2 | +| 10001 | 2017-10-01 | Beijing | 30 | 1 | 2017-10-01 17:05:45 | 2 | 22 | 22 | +| 10002 | 2017-10-02 | Shanghai | 20 | 1 | 2017-10-02 12:59:12 | 200 | 5 | 5 | +| 10003 | 2017-10-02 | Guangzhou | 32 | 0 | 2017-10-02 11:20:00 | 30 | 11 | 11 | +| 10004 | 2017-10-01 | Shenzhen | 35 | 0 | 2017-10-01 10:00:15 | 100 | 3 | 3 | +| 10004 | 2017-10-03 | Shenzhen | 35 | 0 | 2017-10-03 11:22:00 | 55 | 19 | 6 | +| 10005 | 2017-10-03 | Changsha | 29 | 1 | 2017-10-03 18:11:02 | 3 | 1 | 1 | As you can see, the existing data and the newly imported data of User 10004 have been aggregated. Meanwhile, the new data of User 10005 have been added. @@ -278,6 +301,7 @@ At different stages, data will be aggregated to varying degrees. For example, wh AGG_STATE cannot be used as a key column, and when creating a table, you need to declare the signature of the aggregation function. Users do not need to specify a length or default value. The actual storage size of the data depends on the function implementation. CREATE TABLE + ```sql set enable_agg_state=true; create table aggstate( @@ -291,11 +315,10 @@ properties("replication_num" = "1"); ``` -"agg_state" is used to declare the data type as "agg_state," and "max_by/group_concat" are the signatures of aggregation functions. +"agg_state" is used to declare the data type as "agg_state," and "sum/group_concat" are the signatures of aggregation functions. Please note that "agg_state" is a data type, similar to "int," "array," or "string." - "agg_state" can only be used in conjunction with the [state](../sql-manual/sql-functions/combinators/state.md)/[merge](../sql-manual/sql-functions/combinators/merge.md)/[union](../sql-manual/sql-functions/combinators/union.md) function combinators. "agg_state" represents an intermediate result of an aggregation function. For example, with the aggregation function "sum," "agg_state" can represent the intermediate state of summing values like sum(1, 2, 3, 4, 5), rather than the final result. @@ -309,20 +332,23 @@ insert into aggstate values(1,sum_state(3),group_concat_state('c')); ``` At this point, the table contains only one row. Please note that the table below is for illustrative purposes and cannot be selected/displayed directly: -| k1 | k2 | k3 | -| --------------- | ----------- | --------------- | -| 1 | sum(1,2,3) | group_concat_state(a,b,c) | + +| k1 | k2 | k3 | +|----|------------|---------------------------| +| 1 | sum(1,2,3) | group_concat_state(a,b,c) | Insert another record. ```sql insert into aggstate values(2,sum_state(4),group_concat_state('d')); ``` + The table's structure at this moment is... -| k1 | k2 | k3 | -| --------------- | ----------- | --------------- | -| 1 | sum(1,2,3) | group_concat_state(a,b,c) | -| 2 | sum(4) | group_concat_state(d) | + +| k1 | k2 | k3 | +|----|------------|---------------------------| +| 1 | sum(1,2,3) | group_concat_state(a,b,c) | +| 2 | sum(4) | group_concat_state(d) | We can use the 'merge' operation to combine multiple states and return the final result calculated by the aggregation function. @@ -334,9 +360,10 @@ mysql> select sum_merge(k2) from aggstate; | 10 | +---------------+ ``` -`sum_merge` will first combine sum(1,2,3) and sum(4) into sum(1,2,3,4), and return the calculated result. +`sum_merge` will first combine sum(1,2,3) and sum(4) into sum(1,2,3,4), and return the calculated result. 
Because `group_concat` has a specific order requirement, the result is not stable. + ``` mysql> select group_concat_merge(k3) from aggstate; +------------------------+ @@ -347,17 +374,21 @@ mysql> select group_concat_merge(k3) from aggstate; ``` If you do not want the final aggregation result, you can use 'union' to combine multiple intermediate aggregation results and generate a new intermediate result. + ```sql insert into aggstate select 3,sum_union(k2),group_concat_union(k3) from aggstate ; ``` + The table's structure at this moment is... -| k1 | k2 | k3 | -| --------------- | ----------- | --------------- | -| 1 | sum(1,2,3) | group_concat_state(a,b,c) | -| 2 | sum(4) | group_concat_state(d) | -| 3 | sum(1,2,3,4) | group_concat_state(a,b,c,d) | + +| k1 | k2 | k3 | +|----|--------------|-----------------------------| +| 1 | sum(1,2,3) | group_concat_state(a,b,c) | +| 2 | sum(4) | group_concat_state(d) | +| 3 | sum(1,2,3,4) | group_concat_state(a,b,c,d) | You can achieve this through a query. + ``` mysql> select sum_merge(k2) , group_concat_merge(k3)from aggstate; +---------------+------------------------+ @@ -373,18 +404,26 @@ mysql> select sum_merge(k2) , group_concat_merge(k3)from aggstate where k1 != 2; | 16 | c,b,a,d,c,b,a | +---------------+------------------------+ ``` + Users can perform more detailed aggregation function operations using `agg_state`. Please note that `agg_state` comes with a certain performance overhead. ## Unique Model -In some multi-dimensional analysis scenarios, users are highly concerned about how to ensure the uniqueness of the Key, that is, how to create uniqueness constraints for the Primary Key. Therefore, we introduce the Unique Model. Prior to Doris 1.2, the Unique Model was essentially a special case of the Aggregate Model and a simplified representation of table schema. The Aggregate Model is implemented by Merge on Read, so it might not deliver high performance in some aggregation queries ( [...] +In some multidimensional analysis scenarios, users are highly concerned about how to ensure the uniqueness of the Key, +that is, how to create uniqueness constraints for the Primary Key. Therefore, we introduce the Unique Model. Prior to Doris 1.2, +the Unique Model was essentially a special case of the Aggregate Model and a simplified representation of table schema. +The Aggregate Model is implemented by Merge on Read, so it might not deliver high performance in some aggregation queries +(see the [Limitations of Aggregate Model](#limitations-of-aggregate-model) section). In Doris 1.2, +we have introduced a new implementation for the Unique Model--Merge on Write, which can help achieve optimal query performance. +For now, Merge on Read and Merge on Write will coexist in the Unique Model for a while, but in the future, +we plan to make Merge on Write the default implementation of the Unique Model. The following will illustrate the two implementations with examples. ### Merge on Read ( Same Implementation as Aggregate Model) | ColumnName | Type | IsKey | Comment | -| ------------- | ------------- | ----- | ---------------------- | +|---------------|---------------|-------|------------------------| | user_id | BIGINT | Yes | User ID | | username | VARCHAR (50) | Yes | Username | | city | VARCHAR (20) | No | User location city | @@ -397,7 +436,7 @@ In some multi-dimensional analysis scenarios, users are highly concerned about h This is a typical user basic information table. There is no aggregation requirement for such data. 
The only concern is to ensure the uniqueness of the primary key. (The primary key here is user_id + username). The CREATE TABLE statement for the above table is as follows: ``` -CREATE TABLE IF NOT EXISTS example_db.example_tbl +CREATE TABLE IF NOT EXISTS example_db.example_tbl_unique ( `user_id` LARGEINT NOT NULL COMMENT "User ID", `username` VARCHAR (50) NOT NULL COMMENT "Username", @@ -418,7 +457,7 @@ PROPERTIES ( This is the same table schema and the CREATE TABLE statement as those of the Aggregate Model: | ColumnName | Type | AggregationType | Comment | -| ------------- | ------------- | --------------- | ---------------------- | +|---------------|---------------|-----------------|------------------------| | user_id | BIGINT | | User ID | | username | VARCHAR (50) | | Username | | city | VARCHAR (20) | REPLACE | User location city | @@ -429,7 +468,7 @@ This is the same table schema and the CREATE TABLE statement as those of the Agg | register_time | DATETIME | REPLACE | User registration time | ``` -CREATE TABLE IF NOT EXISTS example_db.example_tbl +CREATE TABLE IF NOT EXISTS example_db.example_tbl_agg3 ( `user_id` LARGEINT NOT NULL COMMENT "User ID", `username` VARCHAR (50) NOT NULL COMMENT "Username", @@ -464,10 +503,10 @@ In Doris 1.2.0, as a new feature, Merge on Write is disabled by default, and use > 1. It is recommended to use version 1.2.4 or above, as this version has > fixed some bugs and stability issues. > 2. Add the configuration item "disable_storage_page_cache=false" to the > be.conf file. Failure to add this configuration item may have a significant > impact on data load performance. -Take the previous table as an example, the corresponding CREATE TABLE statement should be: +Take the previous table as an example, the corresponding to CREATE TABLE statement should be: ``` -CREATE TABLE IF NOT EXISTS example_db.example_tbl +CREATE TABLE IF NOT EXISTS example_db.example_tbl_unique_merge_on_write ( `user_id` LARGEINT NOT NULL COMMENT "User ID", `username` VARCHAR (50) NOT NULL COMMENT "Username", @@ -490,7 +529,7 @@ The table schema produced by the above statement will be different from that of | ColumnName | Type | AggregationType | Comment | -| ------------- | ------------- | --------------- | ---------------------- | +|---------------|---------------|-----------------|------------------------| | user_id | BIGINT | | User ID | | username | VARCHAR (50) | | Username | | city | VARCHAR (20) | NONE | User location city | @@ -512,10 +551,10 @@ On a Unique table with the Merge on Write option enabled, during the import stag ## Duplicate Model -In some multi-dimensional analysis scenarios, there is no need for primary keys or data aggregation. For these cases, we introduce the Duplicate Model to. Here is an example: +In some multidimensional analysis scenarios, there is no need for primary keys or data aggregation. For these cases, we introduce the Duplicate Model to. 
Here is an example: | ColumnName | Type | SortKey | Comment | -| ---------- | -------------- | ------- | -------------- | +|------------|----------------|---------|----------------| | timstamp | DATETIME | Yes | Log time | | type | INT | Yes | Log type | | error_code | INT | Yes | Error code | @@ -523,10 +562,10 @@ In some multi-dimensional analysis scenarios, there is no need for primary keys | op_id | BIGINT | No | Operator ID | | op_time | DATETIME | No | Operation time | -The corresponding CREATE TABLE statement is as follows: +The corresponding to CREATE TABLE statement is as follows: ``` -CREATE TABLE IF NOT EXISTS example_db.example_tbl +CREATE TABLE IF NOT EXISTS example_db.example_tbl_duplicate ( `timestamp` DATETIME NOT NULL COMMENT "Log time", `type` INT NOT NULL COMMENT "Log type", @@ -542,9 +581,13 @@ PROPERTIES ( ); ``` -Different from the Aggregate and Unique Models, the Duplicate Model stores the data as they are and executes no aggregation. Even if there are two identical rows of data, they will both be retained. The DUPLICATE KEY in the CREATE TABLE statement is only used to specify based on which columns the data are sorted. (A more appropriate name than DUPLICATE KEY would be SORTING COLUMN, but it is named as such to specify the data model used. For more information, see [Prefix Index](https://dor [...] +Different from the Aggregate and Unique Models, the Duplicate Model stores the data as they are and executes no aggregation. Even if there are two identical rows of data, they will both be retained. +The DUPLICATE KEY in the CREATE TABLE statement is only used to specify based on which columns the data are sorted. +(A more appropriate name than DUPLICATE KEY would be SORTING COLUMN, but it is named as such to specify the data model used. For more information, +see [Prefix Index](./index/index-overview.md).) For the choice of DUPLICATE KEY, we recommend the first 2-4 columns. -The Duplicate Model is suitable for storing raw data without aggregation requirements or primary key uniqueness constraints. For more usage scenarios, see the [Limitations of Aggregate Model](#Limitations of Aggregate Model) section. +The Duplicate Model is suitable for storing raw data without aggregation requirements or primary key uniqueness constraints. +For more usage scenarios, see the [Limitations of Aggregate Model](#limitations-of-aggregate-model) section. ### Duplicate Model Without SORTING COLUMN (Since Doris 2.0) @@ -558,10 +601,10 @@ When users do not need SORTING COLUMN or Prefix Index, they can configure the fo Then, when creating the default model, the sorting column will no longer be specified, and no prefix index will be created for the table to reduce additional overhead in importing and storing. -The corresponding CREATE TABLE statement is as follows: +The corresponding to CREATE TABLE statement is as follows: ```sql -CREATE TABLE IF NOT EXISTS example_db.example_tbl +CREATE TABLE IF NOT EXISTS example_db.example_tbl_duplicate_without_keys_by_default ( `timestamp` DATETIME NOT NULL COMMENT "日志时间", `type` INT NOT NULL COMMENT "日志类型", @@ -576,7 +619,7 @@ PROPERTIES ( "enable_duplicate_without_keys_by_default" = "true" ); -MySQL > desc example_tbl; +MySQL > desc example_tbl_duplicate_without_keys_by_default; +------------+---------------+------+-------+---------+-------+ | Field | Type | Null | Key | Default | Extra | +------------+---------------+------+-------+---------+-------+ @@ -599,7 +642,7 @@ The Aggregate Model only presents the aggregated data. 
That means we have to ens Suppose that you have the following table schema: | ColumnName | Type | AggregationType | Comment | -| ---------- | -------- | --------------- | ------------------------------- | +|------------|----------|-----------------|---------------------------------| | user\_id | LARGEINT | | User ID | | date | DATE | | Date when the data are imported | | cost | BIGINT | SUM | Total user consumption | @@ -609,14 +652,14 @@ Assume that there are two batches of data that have been imported into the stora **batch 1** | user\_id | date | cost | -| -------- | ---------- | ---- | +|----------|------------|------| | 10001 | 2017-11-20 | 50 | | 10002 | 2017-11-21 | 39 | **batch 2** | user\_id | date | cost | -| -------- | ---------- | ---- | +|----------|------------|------| | 10001 | 2017-11-20 | 1 | | 10001 | 2017-11-21 | 5 | | 10003 | 2017-11-22 | 22 | @@ -624,7 +667,7 @@ Assume that there are two batches of data that have been imported into the stora As you can see, data about User 10001 in these two import batches have not yet been aggregated. However, in order to ensure that users can only query the aggregated data as follows: | user\_id | date | cost | -| -------- | ---------- | ---- | +|----------|------------|------| | 10001 | 2017-11-20 | 51 | | 10001 | 2017-11-21 | 5 | | 10002 | 2017-11-21 | 39 | @@ -651,14 +694,14 @@ For the previous example: **batch 1** | user\_id | date | cost | -| -------- | ---------- | ---- | +|----------|------------|------| | 10001 | 2017-11-20 | 50 | | 10002 | 2017-11-21 | 39 | **batch 2** | user\_id | date | cost | -| -------- | ---------- | ---- | +|----------|------------|------| | 10001 | 2017-11-20 | 1 | | 10001 | 2017-11-21 | 5 | | 10003 | 2017-11-22 | 22 | @@ -666,37 +709,47 @@ For the previous example: Since the final aggregation result is: | user\_id | date | cost | -| -------- | ---------- | ---- | +|----------|------------|------| | 10001 | 2017-11-20 | 51 | | 10001 | 2017-11-21 | 5 | | 10002 | 2017-11-21 | 39 | | 10003 | 2017-11-22 | 22 | -The correct result of `select count (*) from table;` should be **4**. But if the model only scans the `user_id` column and operates aggregation upon query, the final result will be **3** (10001, 10002, 10003). And if it does not operate aggregation, the final result will be **5** (a total of five rows in two batches). Apparently, both results are wrong. +The correct result of `select count (*) from table;` should be **4**. But if the model only scans the `user_id` +column and operates aggregation upon query, the final result will be **3** (10001, 10002, 10003). +And if it does not operate aggregation, the final result will be **5** (a total of five rows in two batches). Apparently, both results are wrong. -In order to get the correct result, we must read both the `user_id` and `date` column, and **performs aggregation** when querying. That is to say, in the `count (*)` query, Doris must scan all AGGREGATE KEY columns (in this case, `user_id` and `date`) and aggregate them to get the semantically correct results. That means if there are many aggregated columns, `count (*)` queries could involve scanning large amounts of data. +In order to get the correct result, we must read both the `user_id` and `date` column, and **performs aggregation** when querying. +That is to say, in the `count (*)` query, Doris must scan all AGGREGATE KEY columns (in this case, `user_id` and `date`) +and aggregate them to get the semantically correct results. 
That means if there are many aggregated columns, `count (*)` queries could involve scanning large amounts of data. -Therefore, if you need to perform frequent `count (*)` queries, we recommend that you simulate `count (*)` by adding a column of value 1 and aggregation type SUM. In this way, the table schema in the previous example will be modified as follows: +Therefore, if you need to perform frequent `count (*)` queries, we recommend that you simulate `count (*)` by adding a +column of value 1 and aggregation type SUM. In this way, the table schema in the previous example will be modified as follows: | ColumnName | Type | AggregationType | Comment | -| ---------- | ------ | --------------- | ------------------------------- | +|------------|--------|-----------------|---------------------------------| | user ID | BIGINT | | User ID | | date | DATE | | Date when the data are imported | | Cost | BIGINT | SUM | Total user consumption | | count | BIGINT | SUM | For count queries | -The above adds a count column, the value of which will always be **1**, so the result of `select count (*) from table;`is equivalent to that of `select sum (count) from table;` The latter is much more efficient than the former. However, this method has its shortcomings, too. That is, it requires that users will not import rows with the same values in the AGGREGATE KEY columns. Otherwise, `select sum (count) from table;` can only express the number of rows of the originally imported data [...] +The above adds a count column, the value of which will always be **1**, so the result of `select count (*) from table;` +is equivalent to that of `select sum (count) from table;` The latter is much more efficient than the former. However, +this method has its shortcomings, too. That is, it requires that users will not import rows with the same values in the +AGGREGATE KEY columns. Otherwise, `select sum (count) from table;` can only express the number of rows of the originally imported data, instead of the semantics of `select count (*) from table;` -Another method is to add a `cound` column of value 1 but aggregation type of REPLACE. Then `select sum (count) from table;` and `select count (*) from table;` could produce the same results. Moreover, this method does not require the absence of same AGGREGATE KEY columns in the import data. +Another method is to add a `cound` column of value 1 but aggregation type of REPLACE. Then `select sum (count) from table;` +and `select count (*) from table;` could produce the same results. Moreover, this method does not require the absence of same AGGREGATE KEY columns in the import data. ### Merge on Write of Unique Model -The Merge on Write implementation in the Unique Model does not impose the same limitation as the Aggregate Model. In Merge on Write, the model adds a `delete bitmap` for each imported rowset to mark the data being overwritten or deleted. With the previous example, after Batch 1 is imported, the data status will be as follows: +The Merge on Write implementation in the Unique Model does not impose the same limitation as the Aggregate Model. +In Merge on Write, the model adds a `delete bitmap` for each imported rowset to mark the data being overwritten or deleted. 
With the previous example, after Batch 1 is imported, the data status will be as follows: **batch 1** | user_id | date | cost | delete bit | -| ------- | ---------- | ---- | ---------- | +|---------|------------|------|------------| | 10001 | 2017-11-20 | 50 | false | | 10002 | 2017-11-21 | 39 | false | @@ -705,36 +758,40 @@ After Batch 2 is imported, the duplicate rows in the first batch will be marked **batch 1** | user_id | date | cost | delete bit | -| ------- | ---------- | ---- | ---------- | +|---------|------------|------|------------| | 10001 | 2017-11-20 | 50 | **true** | | 10002 | 2017-11-21 | 39 | false | **batch 2** | user\_id | date | cost | delete bit | -| -------- | ---------- | ---- | ---------- | +|----------|------------|------|------------| | 10001 | 2017-11-20 | 1 | false | | 10001 | 2017-11-21 | 5 | false | | 10003 | 2017-11-22 | 22 | false | -In queries, all data marked `true` in the `delete bitmap` will not be read, so there is no need for data aggregation. Since there are 4 valid rows in the above data, the query result should also be 4. This also enables minimum overhead since it only scans one column of data. +In queries, all data marked `true` in the `delete bitmap` will not be read, so there is no need for data aggregation. +Since there are 4 valid rows in the above data, the query result should also be 4. This also enables minimum overhead since it only scans one column of data. In the test environment, `count(*)` queries in Merge on Write of the Unique Model deliver 10 times higher performance than that of the Aggregate Model. ### Duplicate Model -The Duplicate Model does not impose the same limitation as the Aggregate Model because it does not involve aggregation semantics. For any columns, it can return the semantically correct results in `count (*)` queries. +The Duplicate Model does not impose the same limitation as the Aggregate Model because it does not involve aggregation semantics. +For any columns, it can return the semantically correct results in `count (*)` queries. ## Key Columns -For the Duplicate, Aggregate, and Unique Models, the Key columns will be specified when the table is created, but there exist some differences: In the Duplicate Model, the Key columns of the table can be regarded as just "sorting columns", but not unique identifiers. In Aggregate and Unique Models, the Key columns are both "sorting columns" and "unique identifier columns". +For the Duplicate, Aggregate, and Unique Models, the Key columns will be specified when the table is created, +but there exist some differences: In the Duplicate Model, the Key columns of the table can be regarded as just "sorting columns", +but not unique identifiers. In Aggregate and Unique Models, the Key columns are both "sorting columns" and "unique identifier columns". ## Suggestions for Choosing Data Model Since the data model was established when the table was built, and **irrevocable thereafter, it is very important to select the appropriate data model**. 1. The Aggregate Model can greatly reduce the amount of data scanned and query computation by pre-aggregation. Thus, it is very suitable for report query scenarios with fixed patterns. But this model is unfriendly to `count (*)` queries. Meanwhile, since the aggregation method on the Value column is fixed, semantic correctness should be considered in other types of aggregation queries. -2. The Unique Model guarantees the uniqueness of primary key for scenarios requiring unique primary key. 
The downside is that it cannot exploit the advantage brought by pre-aggregation such as ROLLUP in queries. - 1. Users who have high performance requirements for aggregate queries are recommended to use the newly added Merge on Write implementation since version 1.2. +2. The Unique Model guarantees the uniqueness of primary key for scenarios requiring a unique primary key. The downside is that it cannot exploit the advantage brought by pre-aggregation such as ROLLUP in queries. + 1. Users who have high-performance requirements for aggregate queries are recommended to use the newly added Merge on Write implementation since version 1.2. 2. The Unique Model only supports entire-row updates. If you require primary key uniqueness as well as partial updates of certain columns (such as loading multiple source tables into one Doris table), you can consider using the Aggregate Model, while setting the aggregate type of the non-primary key columns to REPLACE_IF_NOT_NULL. See [CREATE TABLE Manual](../sql-manual/sql-reference/Data-Definition-Statements/Create/CREATE-TABLE.md) for more details. 3. The Duplicate Model is suitable for ad-hoc queries of any dimensions. Although it may not be able to take advantage of the pre-aggregation feature, it is not limited by what constrains the Aggregate Model and can give full play to the advantage of columnar storage (reading only the relevant columns, but not all Key columns). diff --git a/docs/zh-CN/docs/data-table/data-model.md b/docs/zh-CN/docs/data-table/data-model.md index b0b48539d02..e3bff1d29ee 100644 --- a/docs/zh-CN/docs/data-table/data-model.md +++ b/docs/zh-CN/docs/data-table/data-model.md @@ -51,22 +51,24 @@ Doris 的数据模型主要分为3类: 假设业务有如下数据表模式: -| ColumnName | Type | AggregationType | Comment | -| --------------- | ----------- | --------------- | -------------------- | -| user_id | LARGEINT | | 用户id | -| date | DATE | | 数据灌入日期 | -| city | VARCHAR(20) | | 用户所在城市 | -| age | SMALLINT | | 用户年龄 | -| sex | TINYINT | | 用户性别 | +| ColumnName | Type | AggregationType | Comment | +|-----------------|-------------|-----------------|------------| +| user_id | LARGEINT | | 用户id | +| date | DATE | | 数据灌入日期 | +| city | VARCHAR(20) | | 用户所在城市 | +| age | SMALLINT | | 用户年龄 | +| sex | TINYINT | | 用户性别 | | last_visit_date | DATETIME | REPLACE | 用户最后一次访问时间 | -| cost | BIGINT | SUM | 用户总消费 | -| max_dwell_time | INT | MAX | 用户最大停留时间 | -| min_dwell_time | INT | MIN | 用户最小停留时间 | +| cost | BIGINT | SUM | 用户总消费 | +| max_dwell_time | INT | MAX | 用户最大停留时间 | +| min_dwell_time | INT | MIN | 用户最小停留时间 | 如果转换成建表语句则如下(省略建表语句中的 Partition 和 Distribution 信息) ```sql -CREATE TABLE IF NOT EXISTS example_db.example_tbl +CREATE DATABASE IF NOT EXISTS example_db; + +CREATE TABLE IF NOT EXISTS example_db.example_tbl_agg1 ( `user_id` LARGEINT NOT NULL COMMENT "用户id", `date` DATE NOT NULL COMMENT "数据灌入日期时间", @@ -105,19 +107,20 @@ PROPERTIES ( 假设我们有以下导入数据(原始数据): -| user_id | date | city | age | sex | last_visit_date | cost | max_dwell_time | min_dwell_time | -| ------- | ---------- | ---- | ---- | ---- | ------------------- | ---- | -------------- | -------------- | -| 10000 | 2017-10-01 | 北京 | 20 | 0 | 2017-10-01 06:00:00 | 20 | 10 | 10 | -| 10000 | 2017-10-01 | 北京 | 20 | 0 | 2017-10-01 07:00:00 | 15 | 2 | 2 | -| 10001 | 2017-10-01 | 北京 | 30 | 1 | 2017-10-01 17:05:45 | 2 | 22 | 22 | -| 10002 | 2017-10-02 | 上海 | 20 | 1 | 2017-10-02 12:59:12 | 200 | 5 | 5 | -| 10003 | 2017-10-02 | 广州 | 32 | 0 | 2017-10-02 11:20:00 | 30 | 11 | 11 | -| 10004 | 2017-10-01 | 深圳 | 35 | 0 | 2017-10-01 10:00:15 | 100 | 3 | 3 
| -| 10004 | 2017-10-03 | 深圳 | 35 | 0 | 2017-10-03 10:20:22 | 11 | 6 | 6 | +| user_id | date | city | age | sex | last_visit_date | cost | max_dwell_time | min_dwell_time | +|---------|------------|------|-----|-----|---------------------|------|----------------|----------------| +| 10000 | 2017-10-01 | 北京 | 20 | 0 | 2017-10-01 06:00:00 | 20 | 10 | 10 | +| 10000 | 2017-10-01 | 北京 | 20 | 0 | 2017-10-01 07:00:00 | 15 | 2 | 2 | +| 10001 | 2017-10-01 | 北京 | 30 | 1 | 2017-10-01 17:05:45 | 2 | 22 | 22 | +| 10002 | 2017-10-02 | 上海 | 20 | 1 | 2017-10-02 12:59:12 | 200 | 5 | 5 | +| 10003 | 2017-10-02 | 广州 | 32 | 0 | 2017-10-02 11:20:00 | 30 | 11 | 11 | +| 10004 | 2017-10-01 | 深圳 | 35 | 0 | 2017-10-01 10:00:15 | 100 | 3 | 3 | +| 10004 | 2017-10-03 | 深圳 | 35 | 0 | 2017-10-03 10:20:22 | 11 | 6 | 6 | 通过sql导入数据: + ```sql -insert into example_db.example_tbl values +insert into example_db.example_tbl_agg1 values (10000,"2017-10-01","北京",20,0,"2017-10-01 06:00:00",20,10,10), (10000,"2017-10-01","北京",20,0,"2017-10-01 07:00:00",15,2,2), (10001,"2017-10-01","北京",30,1,"2017-10-01 17:05:45",2,22,22), @@ -129,28 +132,28 @@ insert into example_db.example_tbl values 我们假设这是一张记录用户访问某商品页面行为的表。我们以第一行数据为例,解释如下: -| 数据 | 说明 | -| ------------------- | -------------------------------------- | -| 10000 | 用户id,每个用户唯一识别id | -| 2017-10-01 | 数据入库时间,精确到日期 | -| 北京 | 用户所在城市 | -| 20 | 用户年龄 | -| 0 | 性别男(1 代表女性) | -| 2017-10-01 06:00:00 | 用户本次访问该页面的时间,精确到秒 | -| 20 | 用户本次访问产生的消费 | -| 10 | 用户本次访问,驻留该页面的时间 | +| 数据 | 说明 | +|---------------------|---------------------| +| 10000 | 用户id,每个用户唯一识别id | +| 2017-10-01 | 数据入库时间,精确到日期 | +| 北京 | 用户所在城市 | +| 20 | 用户年龄 | +| 0 | 性别男(1 代表女性) | +| 2017-10-01 06:00:00 | 用户本次访问该页面的时间,精确到秒 | +| 20 | 用户本次访问产生的消费 | +| 10 | 用户本次访问,驻留该页面的时间 | | 10 | 用户本次访问,驻留该页面的时间(冗余) | 那么当这批数据正确导入到 Doris 中后,Doris 中最终存储如下: -| user_id | date | city | age | sex | last_visit_date | cost | max_dwell_time | min_dwell_time | -| ------- | ---------- | ---- | ---- | ---- | ------------------- | ---- | -------------- | -------------- | -| 10000 | 2017-10-01 | 北京 | 20 | 0 | 2017-10-01 07:00:00 | 35 | 10 | 2 | -| 10001 | 2017-10-01 | 北京 | 30 | 1 | 2017-10-01 17:05:45 | 2 | 22 | 22 | -| 10002 | 2017-10-02 | 上海 | 20 | 1 | 2017-10-02 12:59:12 | 200 | 5 | 5 | -| 10003 | 2017-10-02 | 广州 | 32 | 0 | 2017-10-02 11:20:00 | 30 | 11 | 11 | -| 10004 | 2017-10-01 | 深圳 | 35 | 0 | 2017-10-01 10:00:15 | 100 | 3 | 3 | -| 10004 | 2017-10-03 | 深圳 | 35 | 0 | 2017-10-03 10:20:22 | 11 | 6 | 6 | +| user_id | date | city | age | sex | last_visit_date | cost | max_dwell_time | min_dwell_time | +|---------|------------|------|-----|-----|---------------------|------|----------------|----------------| +| 10000 | 2017-10-01 | 北京 | 20 | 0 | 2017-10-01 07:00:00 | 35 | 10 | 2 | +| 10001 | 2017-10-01 | 北京 | 30 | 1 | 2017-10-01 17:05:45 | 2 | 22 | 22 | +| 10002 | 2017-10-02 | 上海 | 20 | 1 | 2017-10-02 12:59:12 | 200 | 5 | 5 | +| 10003 | 2017-10-02 | 广州 | 32 | 0 | 2017-10-02 11:20:00 | 30 | 11 | 11 | +| 10004 | 2017-10-01 | 深圳 | 35 | 0 | 2017-10-01 10:00:15 | 100 | 3 | 3 | +| 10004 | 2017-10-03 | 深圳 | 35 | 0 | 2017-10-03 10:20:22 | 11 | 6 | 6 | 可以看到,用户 10000 只剩下了一行**聚合后**的数据。而其余用户的数据和原始数据保持一致。这里先解释下用户 10000 聚合后的数据: @@ -172,37 +175,59 @@ insert into example_db.example_tbl values 接示例1,我们将表结构修改如下: -| ColumnName | Type | AggregationType | Comment | -| --------------- | ----------- | --------------- | ---------------------- | -| user_id | LARGEINT | | 用户id | -| date | DATE | | 数据灌入日期 | +| ColumnName | Type | AggregationType | Comment | 
+|-----------------|-------------|-----------------|-------------| +| user_id | LARGEINT | | 用户id | +| date | DATE | | 数据灌入日期 | | timestamp | DATETIME | | 数据灌入时间,精确到秒 | -| city | VARCHAR(20) | | 用户所在城市 | -| age | SMALLINT | | 用户年龄 | -| sex | TINYINT | | 用户性别 | -| last_visit_date | DATETIME | REPLACE | 用户最后一次访问时间 | -| cost | BIGINT | SUM | 用户总消费 | -| max_dwell_time | INT | MAX | 用户最大停留时间 | -| min_dwell_time | INT | MIN | 用户最小停留时间 | +| city | VARCHAR(20) | | 用户所在城市 | +| age | SMALLINT | | 用户年龄 | +| sex | TINYINT | | 用户性别 | +| last_visit_date | DATETIME | REPLACE | 用户最后一次访问时间 | +| cost | BIGINT | SUM | 用户总消费 | +| max_dwell_time | INT | MAX | 用户最大停留时间 | +| min_dwell_time | INT | MIN | 用户最小停留时间 | 即增加了一列 `timestamp`,记录精确到秒的数据灌入时间。 同时,将`AGGREGATE KEY`设置为`AGGREGATE KEY(user_id, date, timestamp, city, age, sex)` +```sql +CREATE TABLE IF NOT EXISTS example_db.example_tbl_agg2 +( + `user_id` LARGEINT NOT NULL COMMENT "用户id", + `date` DATE NOT NULL COMMENT "数据灌入日期时间", + `timestamp` DATETIME NOT NULL COMMENT "数据灌入日期时间戳", + `city` VARCHAR(20) COMMENT "用户所在城市", + `age` SMALLINT COMMENT "用户年龄", + `sex` TINYINT COMMENT "用户性别", + `last_visit_date` DATETIME REPLACE DEFAULT "1970-01-01 00:00:00" COMMENT "用户最后一次访问时间", + `cost` BIGINT SUM DEFAULT "0" COMMENT "用户总消费", + `max_dwell_time` INT MAX DEFAULT "0" COMMENT "用户最大停留时间", + `min_dwell_time` INT MIN DEFAULT "99999" COMMENT "用户最小停留时间" +) +AGGREGATE KEY(`user_id`, `date`, `timestamp` ,`city`, `age`, `sex`) +DISTRIBUTED BY HASH(`user_id`) BUCKETS 1 +PROPERTIES ( +"replication_allocation" = "tag.location.default: 1" +); +``` + 导入数据如下: -| user_id | date | timestamp | city | age | sex | last_visit_date | cost | max_dwell_time | min_dwell_time | -| ------- | ---------- | ------------------- | ---- | ---- | ---- | ------------------- | ---- | -------------- | -------------- | -| 10000 | 2017-10-01 | 2017-10-01 08:00:05 | 北京 | 20 | 0 | 2017-10-01 06:00:00 | 20 | 10 | 10 | -| 10000 | 2017-10-01 | 2017-10-01 09:00:05 | 北京 | 20 | 0 | 2017-10-01 07:00:00 | 15 | 2 | 2 | -| 10001 | 2017-10-01 | 2017-10-01 18:12:10 | 北京 | 30 | 1 | 2017-10-01 17:05:45 | 2 | 22 | 22 | -| 10002 | 2017-10-02 | 2017-10-02 13:10:00 | 上海 | 20 | 1 | 2017-10-02 12:59:12 | 200 | 5 | 5 | -| 10003 | 2017-10-02 | 2017-10-02 13:15:00 | 广州 | 32 | 0 | 2017-10-02 11:20:00 | 30 | 11 | 11 | -| 10004 | 2017-10-01 | 2017-10-01 12:12:48 | 深圳 | 35 | 0 | 2017-10-01 10:00:15 | 100 | 3 | 3 | -| 10004 | 2017-10-03 | 2017-10-03 12:38:20 | 深圳 | 35 | 0 | 2017-10-03 10:20:22 | 11 | 6 | 6 | +| user_id | date | timestamp | city | age | sex | last_visit_date | cost | max_dwell_time | min_dwell_time | +|---------|------------|---------------------|------|-----|-----|---------------------|------|----------------|----------------| +| 10000 | 2017-10-01 | 2017-10-01 08:00:05 | 北京 | 20 | 0 | 2017-10-01 06:00:00 | 20 | 10 | 10 | +| 10000 | 2017-10-01 | 2017-10-01 09:00:05 | 北京 | 20 | 0 | 2017-10-01 07:00:00 | 15 | 2 | 2 | +| 10001 | 2017-10-01 | 2017-10-01 18:12:10 | 北京 | 30 | 1 | 2017-10-01 17:05:45 | 2 | 22 | 22 | +| 10002 | 2017-10-02 | 2017-10-02 13:10:00 | 上海 | 20 | 1 | 2017-10-02 12:59:12 | 200 | 5 | 5 | +| 10003 | 2017-10-02 | 2017-10-02 13:15:00 | 广州 | 32 | 0 | 2017-10-02 11:20:00 | 30 | 11 | 11 | +| 10004 | 2017-10-01 | 2017-10-01 12:12:48 | 深圳 | 35 | 0 | 2017-10-01 10:00:15 | 100 | 3 | 3 | +| 10004 | 2017-10-03 | 2017-10-03 12:38:20 | 深圳 | 35 | 0 | 2017-10-03 10:20:22 | 11 | 6 | 6 | 通过sql导入数据: + ```sql -insert into example_db.example_tbl values +insert into example_db.example_tbl_agg2 values 
(10000,"2017-10-01","2017-10-01 08:00:05","北京",20,0,"2017-10-01 06:00:00",20,10,10), (10000,"2017-10-01","2017-10-01 09:00:05","北京",20,0,"2017-10-01 07:00:00",15,2,2), (10001,"2017-10-01","2017-10-01 18:12:10","北京",30,1,"2017-10-01 17:05:45",2,22,22), @@ -214,15 +239,15 @@ insert into example_db.example_tbl values 那么当这批数据正确导入到 Doris 中后,Doris 中最终存储如下: -| user_id | date | timestamp | city | age | sex | last_visit_date | cost | max_dwell_time | min_dwell_time | -| ------- | ---------- | ------------------- | ---- | ---- | ---- | ------------------- | ---- | -------------- | -------------- | -| 10000 | 2017-10-01 | 2017-10-01 08:00:05 | 北京 | 20 | 0 | 2017-10-01 06:00:00 | 20 | 10 | 10 | -| 10000 | 2017-10-01 | 2017-10-01 09:00:05 | 北京 | 20 | 0 | 2017-10-01 07:00:00 | 15 | 2 | 2 | -| 10001 | 2017-10-01 | 2017-10-01 18:12:10 | 北京 | 30 | 1 | 2017-10-01 17:05:45 | 2 | 22 | 22 | -| 10002 | 2017-10-02 | 2017-10-02 13:10:00 | 上海 | 20 | 1 | 2017-10-02 12:59:12 | 200 | 5 | 5 | -| 10003 | 2017-10-02 | 2017-10-02 13:15:00 | 广州 | 32 | 0 | 2017-10-02 11:20:00 | 30 | 11 | 11 | -| 10004 | 2017-10-01 | 2017-10-01 12:12:48 | 深圳 | 35 | 0 | 2017-10-01 10:00:15 | 100 | 3 | 3 | -| 10004 | 2017-10-03 | 2017-10-03 12:38:20 | 深圳 | 35 | 0 | 2017-10-03 10:20:22 | 11 | 6 | 6 | +| user_id | date | timestamp | city | age | sex | last_visit_date | cost | max_dwell_time | min_dwell_time | +|---------|------------|---------------------|------|-----|-----|---------------------|------|----------------|----------------| +| 10000 | 2017-10-01 | 2017-10-01 08:00:05 | 北京 | 20 | 0 | 2017-10-01 06:00:00 | 20 | 10 | 10 | +| 10000 | 2017-10-01 | 2017-10-01 09:00:05 | 北京 | 20 | 0 | 2017-10-01 07:00:00 | 15 | 2 | 2 | +| 10001 | 2017-10-01 | 2017-10-01 18:12:10 | 北京 | 30 | 1 | 2017-10-01 17:05:45 | 2 | 22 | 22 | +| 10002 | 2017-10-02 | 2017-10-02 13:10:00 | 上海 | 20 | 1 | 2017-10-02 12:59:12 | 200 | 5 | 5 | +| 10003 | 2017-10-02 | 2017-10-02 13:15:00 | 广州 | 32 | 0 | 2017-10-02 11:20:00 | 30 | 11 | 11 | +| 10004 | 2017-10-01 | 2017-10-01 12:12:48 | 深圳 | 35 | 0 | 2017-10-01 10:00:15 | 100 | 3 | 3 | +| 10004 | 2017-10-03 | 2017-10-03 12:38:20 | 深圳 | 35 | 0 | 2017-10-03 10:20:22 | 11 | 6 | 6 | 我们可以看到,存储的数据,和导入数据完全一样,没有发生任何聚合。这是因为,这批数据中,因为加入了 `timestamp` 列,所有行的 Key 都**不完全相同**。也就是说,只要保证导入的数据中,每一行的 Key 都不完全相同,那么即使在聚合模型下,Doris 也可以保存完整的明细数据。 @@ -230,40 +255,41 @@ insert into example_db.example_tbl values 接示例1。假设现在表中已有数据如下: -| user_id | date | city | age | sex | last_visit_date | cost | max_dwell_time | min_dwell_time | -| ------- | ---------- | ---- | ---- | ---- | ------------------- | ---- | -------------- | -------------- | -| 10000 | 2017-10-01 | 北京 | 20 | 0 | 2017-10-01 07:00:00 | 35 | 10 | 2 | -| 10001 | 2017-10-01 | 北京 | 30 | 1 | 2017-10-01 17:05:45 | 2 | 22 | 22 | -| 10002 | 2017-10-02 | 上海 | 20 | 1 | 2017-10-02 12:59:12 | 200 | 5 | 5 | -| 10003 | 2017-10-02 | 广州 | 32 | 0 | 2017-10-02 11:20:00 | 30 | 11 | 11 | -| 10004 | 2017-10-01 | 深圳 | 35 | 0 | 2017-10-01 10:00:15 | 100 | 3 | 3 | -| 10004 | 2017-10-03 | 深圳 | 35 | 0 | 2017-10-03 10:20:22 | 11 | 6 | 6 | +| user_id | date | city | age | sex | last_visit_date | cost | max_dwell_time | min_dwell_time | +|---------|------------|------|-----|-----|---------------------|------|----------------|----------------| +| 10000 | 2017-10-01 | 北京 | 20 | 0 | 2017-10-01 07:00:00 | 35 | 10 | 2 | +| 10001 | 2017-10-01 | 北京 | 30 | 1 | 2017-10-01 17:05:45 | 2 | 22 | 22 | +| 10002 | 2017-10-02 | 上海 | 20 | 1 | 2017-10-02 12:59:12 | 200 | 5 | 5 | +| 10003 | 2017-10-02 | 广州 | 32 | 0 | 2017-10-02 11:20:00 | 30 | 
11 | 11 | +| 10004 | 2017-10-01 | 深圳 | 35 | 0 | 2017-10-01 10:00:15 | 100 | 3 | 3 | +| 10004 | 2017-10-03 | 深圳 | 35 | 0 | 2017-10-03 10:20:22 | 11 | 6 | 6 | 我们再导入一批新的数据: -| user_id | date | city | age | sex | last_visit_date | cost | max_dwell_time | min_dwell_time | -| ------- | ---------- | ---- | ---- | ---- | ------------------- | ---- | -------------- | -------------- | -| 10004 | 2017-10-03 | 深圳 | 35 | 0 | 2017-10-03 11:22:00 | 44 | 19 | 19 | -| 10005 | 2017-10-03 | 长沙 | 29 | 1 | 2017-10-03 18:11:02 | 3 | 1 | 1 | +| user_id | date | city | age | sex | last_visit_date | cost | max_dwell_time | min_dwell_time | +|---------|------------|------|-----|-----|---------------------|------|----------------|----------------| +| 10004 | 2017-10-03 | 深圳 | 35 | 0 | 2017-10-03 11:22:00 | 44 | 19 | 19 | +| 10005 | 2017-10-03 | 长沙 | 29 | 1 | 2017-10-03 18:11:02 | 3 | 1 | 1 | 通过sql导入数据: + ```sql -insert into example_db.example_tbl values +insert into example_db.example_tbl_agg1 values (10004,"2017-10-03","深圳",35,0,"2017-10-03 11:22:00",44,19,19), (10005,"2017-10-03","长沙",29,1,"2017-10-03 18:11:02",3,1,1); ``` 那么当这批数据正确导入到 Doris 中后,Doris 中最终存储如下: -| user_id | date | city | age | sex | last_visit_date | cost | max_dwell_time | min_dwell_time | -| ------- | ---------- | ---- | ---- | ---- | ------------------- | ---- | -------------- | -------------- | -| 10000 | 2017-10-01 | 北京 | 20 | 0 | 2017-10-01 07:00:00 | 35 | 10 | 2 | -| 10001 | 2017-10-01 | 北京 | 30 | 1 | 2017-10-01 17:05:45 | 2 | 22 | 22 | -| 10002 | 2017-10-02 | 上海 | 20 | 1 | 2017-10-02 12:59:12 | 200 | 5 | 5 | -| 10003 | 2017-10-02 | 广州 | 32 | 0 | 2017-10-02 11:20:00 | 30 | 11 | 11 | -| 10004 | 2017-10-01 | 深圳 | 35 | 0 | 2017-10-01 10:00:15 | 100 | 3 | 3 | -| 10004 | 2017-10-03 | 深圳 | 35 | 0 | 2017-10-03 11:22:00 | 55 | 19 | 6 | -| 10005 | 2017-10-03 | 长沙 | 29 | 1 | 2017-10-03 18:11:02 | 3 | 1 | 1 | +| user_id | date | city | age | sex | last_visit_date | cost | max_dwell_time | min_dwell_time | +|---------|------------|------|-----|-----|---------------------|------|----------------|----------------| +| 10000 | 2017-10-01 | 北京 | 20 | 0 | 2017-10-01 07:00:00 | 35 | 10 | 2 | +| 10001 | 2017-10-01 | 北京 | 30 | 1 | 2017-10-01 17:05:45 | 2 | 22 | 22 | +| 10002 | 2017-10-02 | 上海 | 20 | 1 | 2017-10-02 12:59:12 | 200 | 5 | 5 | +| 10003 | 2017-10-02 | 广州 | 32 | 0 | 2017-10-02 11:20:00 | 30 | 11 | 11 | +| 10004 | 2017-10-01 | 深圳 | 35 | 0 | 2017-10-01 10:00:15 | 100 | 3 | 3 | +| 10004 | 2017-10-03 | 深圳 | 35 | 0 | 2017-10-03 11:22:00 | 55 | 19 | 6 | +| 10005 | 2017-10-03 | 长沙 | 29 | 1 | 2017-10-03 18:11:02 | 3 | 1 | 1 | 可以看到,用户 10004 的已有数据和新导入的数据发生了聚合。同时新增了 10005 用户的数据。 @@ -281,6 +307,7 @@ insert into example_db.example_tbl values 用户不需要指定长度和默认值。实际存储的数据大小与函数实现有关。 建表 + ```sql set enable_agg_state=true; create table aggstate( @@ -293,7 +320,7 @@ distributed BY hash(k1) buckets 3 properties("replication_num" = "1"); ``` -其中agg_state用于声明数据类型为agg_state,max_by/group_concat为聚合函数的签名。 +其中agg_state用于声明数据类型为agg_state,sum/group_concat为聚合函数的签名。 注意agg_state是一种数据类型,同int/array/string agg_state只能配合[state](../sql-manual/sql-functions/combinators/state.md) @@ -309,20 +336,23 @@ insert into aggstate values(1,sum_state(2),group_concat_state('b')); insert into aggstate values(1,sum_state(3),group_concat_state('c')); ``` 此时表只有一行 ( 注意,下面的表只是示意图,不是真的可以select显示出来) -| k1 | k2 | k3 | -| --------------- | ----------- | --------------- | -| 1 | sum(1,2,3) | group_concat_state(a,b,c) | + +| k1 | k2 | k3 | +|----|------------|---------------------------| +| 1 | sum(1,2,3) | 
group_concat_state(a,b,c) | 再插入一条数据 ```sql insert into aggstate values(2,sum_state(4),group_concat_state('d')); ``` + 此时表的结构为 -| k1 | k2 | k3 | -| --------------- | ----------- | --------------- | -| 1 | sum(1,2,3) | group_concat_state(a,b,c) | -| 2 | sum(4) | group_concat_state(d) | + +| k1 | k2 | k3 | +|----|------------|---------------------------| +| 1 | sum(1,2,3) | group_concat_state(a,b,c) | +| 2 | sum(4) | group_concat_state(d) | 我们可以通过merge操作来合并多个state,并且返回最终聚合函数计算的结果 @@ -334,9 +364,10 @@ mysql> select sum_merge(k2) from aggstate; | 10 | +---------------+ ``` -`sum_merge` 会先把sum(1,2,3) 和 sum(4) 合并成 sum(1,2,3,4) ,并返回计算的结果。 +`sum_merge` 会先把sum(1,2,3) 和 sum(4) 合并成 sum(1,2,3,4) ,并返回计算的结果。 因为group_concat对于顺序有要求,所以结果是不稳定的。 + ``` mysql> select group_concat_merge(k3) from aggstate; +------------------------+ @@ -347,17 +378,20 @@ mysql> select group_concat_merge(k3) from aggstate; ``` 如果不想要聚合的最终结果,可以使用union来合并多个聚合的中间结果,生成一个新的中间结果。 + ```sql insert into aggstate select 3,sum_union(k2),group_concat_union(k3) from aggstate ; ``` 此时的表结构为 -| k1 | k2 | k3 | -| --------------- | ----------- | --------------- | -| 1 | sum(1,2,3) | group_concat_state(a,b,c) | -| 2 | sum(4) | group_concat_state(d) | -| 3 | sum(1,2,3,4) | group_concat_state(a,b,c,d) | + +| k1 | k2 | k3 | +|----|--------------|-----------------------------| +| 1 | sum(1,2,3) | group_concat_state(a,b,c) | +| 2 | sum(4) | group_concat_state(d) | +| 3 | sum(1,2,3,4) | group_concat_state(a,b,c,d) | 可以通过查询 + ``` mysql> select sum_merge(k2) , group_concat_merge(k3)from aggstate; +---------------+------------------------+ @@ -373,31 +407,36 @@ mysql> select sum_merge(k2) , group_concat_merge(k3)from aggstate where k1 != 2; | 16 | c,b,a,d,c,b,a | +---------------+------------------------+ ``` + 用户可以通过agg_state做出跟细致的聚合函数操作。 注意 agg_state 存在一定的性能开销 ## Unique 模型 -在某些多维分析场景下,用户更关注的是如何保证 Key 的唯一性,即如何获得 Primary Key 唯一性约束。因此,我们引入了 Unique 数据模型。在1.2版本之前,该模型本质上是聚合模型的一个特例,也是一种简化的表结构表示方式。由于聚合模型的实现方式是读时合并(merge on read),因此在一些聚合查询上性能不佳(参考后续章节[聚合模型的局限性](#聚合模型的局限性)的描述),在1.2版本我们引入了Unique模型新的实现方式,写时合并(merge on write),通过在写入时做一些额外的工作,实现了最优的查询性能。写时合并将在未来替换读时合并成为Unique模型的默认实现方式,两者将会短暂的共存一段时间。下面将对两种实现方式分别举例进行说明。 +在某些多维分析场景下,用户更关注的是如何保证 Key 的唯一性,即如何获得 Primary Key 唯一性约束。 +因此,我们引入了 Unique 数据模型。在1.2版本之前,该模型本质上是聚合模型的一个特例,也是一种简化的表结构表示方式。 +由于聚合模型的实现方式是读时合并(merge on read),因此在一些聚合查询上性能不佳(参考后续章节[聚合模型的局限性](#聚合模型的局限性)的描述), +在1.2版本我们引入了Unique模型新的实现方式,写时合并(merge on write),通过在写入时做一些额外的工作,实现了最优的查询性能。 +写时合并将在未来替换读时合并成为Unique模型的默认实现方式,两者将会短暂的共存一段时间。下面将对两种实现方式分别举例进行说明。 ### 读时合并(与聚合模型相同的实现方式) -| ColumnName | Type | IsKey | Comment | -| ------------- | ------------ | ----- | ------------ | -| user_id | BIGINT | Yes | 用户id | -| username | VARCHAR(50) | Yes | 用户昵称 | -| city | VARCHAR(20) | No | 用户所在城市 | -| age | SMALLINT | No | 用户年龄 | -| sex | TINYINT | No | 用户性别 | -| phone | LARGEINT | No | 用户电话 | -| address | VARCHAR(500) | No | 用户住址 | -| register_time | DATETIME | No | 用户注册时间 | +| ColumnName | Type | IsKey | Comment | +|---------------|--------------|-------|---------| +| user_id | BIGINT | Yes | 用户id | +| username | VARCHAR(50) | Yes | 用户昵称 | +| city | VARCHAR(20) | No | 用户所在城市 | +| age | SMALLINT | No | 用户年龄 | +| sex | TINYINT | No | 用户性别 | +| phone | LARGEINT | No | 用户电话 | +| address | VARCHAR(500) | No | 用户住址 | +| register_time | DATETIME | No | 用户注册时间 | 这是一个典型的用户基础信息表。这类数据没有聚合需求,只需保证主键唯一性。(这里的主键为 user_id + username)。那么我们的建表语句如下: ```sql -CREATE TABLE IF NOT EXISTS example_db.example_tbl +CREATE TABLE IF NOT EXISTS example_db.example_tbl_unique ( `user_id` LARGEINT NOT NULL COMMENT 
"用户id", `username` VARCHAR(50) NOT NULL COMMENT "用户昵称", @@ -417,21 +456,21 @@ PROPERTIES ( 而这个表结构,完全同等于以下使用聚合模型描述的表结构: -| ColumnName | Type | AggregationType | Comment | -| ------------- | ------------ | --------------- | ------------ | -| user_id | BIGINT | | 用户id | -| username | VARCHAR(50) | | 用户昵称 | -| city | VARCHAR(20) | REPLACE | 用户所在城市 | -| age | SMALLINT | REPLACE | 用户年龄 | -| sex | TINYINT | REPLACE | 用户性别 | -| phone | LARGEINT | REPLACE | 用户电话 | -| address | VARCHAR(500) | REPLACE | 用户住址 | -| register_time | DATETIME | REPLACE | 用户注册时间 | +| ColumnName | Type | AggregationType | Comment | +|---------------|--------------|-----------------|---------| +| user_id | BIGINT | | 用户id | +| username | VARCHAR(50) | | 用户昵称 | +| city | VARCHAR(20) | REPLACE | 用户所在城市 | +| age | SMALLINT | REPLACE | 用户年龄 | +| sex | TINYINT | REPLACE | 用户性别 | +| phone | LARGEINT | REPLACE | 用户电话 | +| address | VARCHAR(500) | REPLACE | 用户住址 | +| register_time | DATETIME | REPLACE | 用户注册时间 | 及建表语句: ```sql -CREATE TABLE IF NOT EXISTS example_db.example_tbl +CREATE TABLE IF NOT EXISTS example_db.example_tbl_agg3 ( `user_id` LARGEINT NOT NULL COMMENT "用户id", `username` VARCHAR(50) NOT NULL COMMENT "用户昵称", @@ -469,7 +508,7 @@ Unique模型的写时合并实现,与聚合模型就是完全不同的两种 仍然以上面的表为例,建表语句为 ```sql -CREATE TABLE IF NOT EXISTS example_db.example_tbl +CREATE TABLE IF NOT EXISTS example_db.example_tbl_unique_merge_on_write ( `user_id` LARGEINT NOT NULL COMMENT "用户id", `username` VARCHAR(50) NOT NULL COMMENT "用户昵称", @@ -490,18 +529,19 @@ PROPERTIES ( 使用这种建表语句建出来的表结构,与聚合模型就完全不同了: -| ColumnName | Type | AggregationType | Comment | -| ------------- | ------------ | --------------- | ------------ | -| user_id | BIGINT | | 用户id | -| username | VARCHAR(50) | | 用户昵称 | -| city | VARCHAR(20) | NONE | 用户所在城市 | -| age | SMALLINT | NONE | 用户年龄 | -| sex | TINYINT | NONE | 用户性别 | -| phone | LARGEINT | NONE | 用户电话 | -| address | VARCHAR(500) | NONE | 用户住址 | -| register_time | DATETIME | NONE | 用户注册时间 | +| ColumnName | Type | AggregationType | Comment | +|---------------|--------------|-----------------|---------| +| user_id | BIGINT | | 用户id | +| username | VARCHAR(50) | | 用户昵称 | +| city | VARCHAR(20) | NONE | 用户所在城市 | +| age | SMALLINT | NONE | 用户年龄 | +| sex | TINYINT | NONE | 用户性别 | +| phone | LARGEINT | NONE | 用户电话 | +| address | VARCHAR(500) | NONE | 用户住址 | +| register_time | DATETIME | NONE | 用户注册时间 | -在开启了写时合并选项的Unique表上,数据在导入阶段就会去将被覆盖和被更新的数据进行标记删除,同时将新的数据写入新的文件。在查询的时候,所有被标记删除的数据都会在文件级别被过滤掉,读取出来的数据就都是最新的数据,消除掉了读时合并中的数据聚合过程,并且能够在很多情况下支持多种谓词的下推。因此在许多场景都能带来比较大的性能提升,尤其是在有聚合查询的情况下。 +在开启了写时合并选项的Unique表上,数据在导入阶段就会去将被覆盖和被更新的数据进行标记删除,同时将新的数据写入新的文件。在查询的时候, +所有被标记删除的数据都会在文件级别被过滤掉,读取出来的数据就都是最新的数据,消除掉了读时合并中的数据聚合过程,并且能够在很多情况下支持多种谓词的下推。因此在许多场景都能带来比较大的性能提升,尤其是在有聚合查询的情况下。 【注意】 1. 
新的Merge-on-write实现默认关闭,且只能在建表时通过指定property的方式打开。 @@ -514,19 +554,19 @@ PROPERTIES ( 在某些多维分析场景下,数据既没有主键,也没有聚合需求。因此,我们引入 Duplicate 数据模型来满足这类需求。举例说明。 -| ColumnName | Type | SortKey | Comment | -| ---------- | ------------- | ------- | ------------ | -| timestamp | DATETIME | Yes | 日志时间 | -| type | INT | Yes | 日志类型 | -| error_code | INT | Yes | 错误码 | -| error_msg | VARCHAR(1024) | No | 错误详细信息 | -| op_id | BIGINT | No | 负责人id | -| op_time | DATETIME | No | 处理时间 | +| ColumnName | Type | SortKey | Comment | +|------------|---------------|---------|---------| +| timestamp | DATETIME | Yes | 日志时间 | +| type | INT | Yes | 日志类型 | +| error_code | INT | Yes | 错误码 | +| error_msg | VARCHAR(1024) | No | 错误详细信息 | +| op_id | BIGINT | No | 负责人id | +| op_time | DATETIME | No | 处理时间 | 建表语句如下: ```sql -CREATE TABLE IF NOT EXISTS example_db.example_tbl +CREATE TABLE IF NOT EXISTS example_db.example_tbl_duplicate ( `timestamp` DATETIME NOT NULL COMMENT "日志时间", `type` INT NOT NULL COMMENT "日志类型", @@ -542,9 +582,11 @@ PROPERTIES ( ); ``` -这种数据模型区别于 Aggregate 和 Unique 模型。数据完全按照导入文件中的数据进行存储,不会有任何聚合。即使两行数据完全相同,也都会保留。 而在建表语句中指定的 DUPLICATE KEY,只是用来指明底层数据按照那些列进行排序。(更贴切的名称应该为 “Sorted Column”,这里取名 “DUPLICATE KEY” 只是用以明确表示所用的数据模型。关于 “Sorted Column”的更多解释,可以参阅[前缀索引](./index/prefix-index.md))。在 DUPLICATE KEY 的选择上,我们建议适当的选择前 2-4 列就可以。 +这种数据模型区别于 Aggregate 和 Unique 模型。数据完全按照导入文件中的数据进行存储,不会有任何聚合。即使两行数据完全相同,也都会保留。 +而在建表语句中指定的 DUPLICATE KEY,只是用来指明底层数据按照那些列进行排序。(更贴切的名称应该为 “Sorted Column”, +这里取名 “DUPLICATE KEY” 只是用以明确表示所用的数据模型。关于 “Sorted Column”的更多解释,可以参阅[前缀索引](./index/index-overview.md))。在 DUPLICATE KEY 的选择上,我们建议适当的选择前 2-4 列就可以。 -这种数据模型适用于既没有聚合需求,又没有主键唯一性约束的原始数据的存储。更多使用场景,可参阅**聚合模型的局限性**小节。 +这种数据模型适用于既没有聚合需求,又没有主键唯一性约束的原始数据的存储。更多使用场景,可参阅[聚合模型的局限性](#聚合模型的局限性)小节。 <version since="2.0"> @@ -563,7 +605,7 @@ PROPERTIES ( 建表语句如下: ```sql -CREATE TABLE IF NOT EXISTS example_db.example_tbl +CREATE TABLE IF NOT EXISTS example_db.example_tbl_duplicate_without_keys_by_default ( `timestamp` DATETIME NOT NULL COMMENT "日志时间", `type` INT NOT NULL COMMENT "日志类型", @@ -578,7 +620,7 @@ PROPERTIES ( "enable_duplicate_without_keys_by_default" = "true" ); -MySQL > desc example_tbl; +MySQL > desc example_tbl_duplicate_without_keys_by_default; +------------+---------------+------+-------+---------+-------+ | Field | Type | Null | Key | Default | Extra | +------------+---------------+------+-------+---------+-------+ @@ -601,10 +643,10 @@ MySQL > desc example_tbl; 假设表结构如下: -| ColumnName | Type | AggregationType | Comment | -| ---------- | -------- | --------------- | ------------ | -| user_id | LARGEINT | | 用户id | -| date | DATE | | 数据灌入日期 | +| ColumnName | Type | AggregationType | Comment | +|------------|----------|-----------------|---------| +| user_id | LARGEINT | | 用户id | +| date | DATE | | 数据灌入日期 | | cost | BIGINT | SUM | 用户总消费 | 假设存储引擎中有如下两个已经导入完成的批次的数据: @@ -612,14 +654,14 @@ MySQL > desc example_tbl; **batch 1** | user_id | date | cost | -| ------- | ---------- | ---- | +|---------|------------|------| | 10001 | 2017-11-20 | 50 | | 10002 | 2017-11-21 | 39 | **batch 2** | user_id | date | cost | -| ------- | ---------- | ---- | +|---------|------------|------| | 10001 | 2017-11-20 | 1 | | 10001 | 2017-11-21 | 5 | | 10003 | 2017-11-22 | 22 | @@ -627,7 +669,7 @@ MySQL > desc example_tbl; 可以看到,用户 10001 分属在两个导入批次中的数据还没有聚合。但是为了保证用户只能查询到如下最终聚合后的数据: | user_id | date | cost | -| ------- | ---------- | ---- | +|---------|------------|------| | 10001 | 2017-11-20 | 51 | | 10001 | 2017-11-21 | 5 | | 10002 | 2017-11-21 | 39 | @@ -643,7 +685,7 @@ SELECT MIN(cost) 
FROM table; 得到的结果是 5,而不是 1。 -同时,这种一致性保证,在某些查询中,会极大的降低查询效率。 +同时,这种一致性保证,在某些查询中,会极大地降低查询效率。 我们以最基本的 count(*) 查询为例: @@ -651,21 +693,22 @@ SELECT MIN(cost) FROM table; SELECT COUNT(*) FROM table; ``` -在其他数据库中,这类查询都会很快的返回结果。因为在实现上,我们可以通过如“导入时对行进行计数,保存 count 的统计信息”,或者在查询时“仅扫描某一列数据,获得 count 值”的方式,只需很小的开销,即可获得查询结果。但是在 Doris 的聚合模型中,这种查询的开销**非常大**。 +在其他数据库中,这类查询都会很快地返回结果。因为在实现上,我们可以通过如“导入时对行进行计数,保存 count 的统计信息”,或者在查询时“仅扫描某一列数据, +获得 count 值”的方式,只需很小的开销,即可获得查询结果。但是在 Doris 的聚合模型中,这种查询的开销**非常大**。 我们以刚才的数据为例: **batch 1** | user_id | date | cost | -| ------- | ---------- | ---- | +|---------|------------|------| | 10001 | 2017-11-20 | 50 | | 10002 | 2017-11-21 | 39 | **batch 2** | user_id | date | cost | -| ------- | ---------- | ---- | +|---------|------------|------| | 10001 | 2017-11-20 | 1 | | 10001 | 2017-11-21 | 5 | | 10003 | 2017-11-22 | 22 | @@ -673,7 +716,7 @@ SELECT COUNT(*) FROM table; 因为最终的聚合结果为: | user_id | date | cost | -| ------- | ---------- | ---- | +|---------|------------|------| | 10001 | 2017-11-20 | 51 | | 10001 | 2017-11-21 | 5 | | 10002 | 2017-11-21 | 39 | @@ -685,14 +728,16 @@ SELECT COUNT(*) FROM table; 因此,当业务上有频繁的 count(*) 查询时,我们建议用户通过增加一个**值恒为 1 的,聚合类型为 SUM 的列来模拟 count(\*)**。如刚才的例子中的表结构,我们修改如下: -| ColumnName | Type | AggregateType | Comment | -| ---------- | ------ | ------------- | ------------- | -| user_id | BIGINT | | 用户id | -| date | DATE | | 数据灌入日期 | -| cost | BIGINT | SUM | 用户总消费 | +| ColumnName | Type | AggregateType | Comment | +|------------|--------|---------------|-----------| +| user_id | BIGINT | | 用户id | +| date | DATE | | 数据灌入日期 | +| cost | BIGINT | SUM | 用户总消费 | | count | BIGINT | SUM | 用于计算count | -增加一个 count 列,并且导入数据中,该列值**恒为 1**。则 `select count(*) from table;` 的结果等价于 `select sum(count) from table;`。而后者的查询效率将远高于前者。不过这种方式也有使用限制,就是用户需要自行保证,不会重复导入 AGGREGATE KEY 列都相同的行。否则,`select sum(count) from table;` 只能表述原始导入的行数,而不是 `select count(*) from table;` 的语义。 +增加一个 count 列,并且导入数据中,该列值**恒为 1**。则 `select count(*) from table;` 的结果等价于 `select sum(count) from table;`。 +而后者的查询效率将远高于前者。不过这种方式也有使用限制,就是用户需要自行保证,不会重复导入 AGGREGATE KEY 列都相同地行。 +否则,`select sum(count) from table;` 只能表述原始导入的行数,而不是 `select count(*) from table;` 的语义。 另一种方式,就是 **将如上的 `count` 列的聚合类型改为 REPLACE,且依然值恒为 1**。那么 `select sum(count) from table;` 和 `select count(*) from table;` 的结果将是一致的。并且这种方式,没有导入重复行的限制。 @@ -703,7 +748,7 @@ Unique模型的写时合并实现没有聚合模型的局限性,还是以刚 **batch 1** | user_id | date | cost | delete bit | -| ------- | ---------- | ---- | ---------- | +|---------|------------|------|------------| | 10001 | 2017-11-20 | 50 | false | | 10002 | 2017-11-21 | 39 | false | @@ -712,19 +757,20 @@ Unique模型的写时合并实现没有聚合模型的局限性,还是以刚 **batch 1** | user_id | date | cost | delete bit | -| ------- | ---------- | ---- | ---------- | +|---------|------------|------|------------| | 10001 | 2017-11-20 | 50 | **true** | | 10002 | 2017-11-21 | 39 | false | **batch 2** | user_id | date | cost | delete bit | -| ------- | ---------- | ---- | ---------- | +|---------|------------|------|------------| | 10001 | 2017-11-20 | 1 | false | | 10001 | 2017-11-21 | 5 | false | | 10003 | 2017-11-22 | 22 | false | -在查询时,所有在delete bitmap中被标记删除的数据都不会读出来,因此也无需进行做任何数据聚合,上述数据中有效的行数为4行,查询出的结果也应该是4行,也就可以采取开销最小的方式来获取结果,即前面提到的“仅扫描某一列数据,获得 count 值”的方式。 +在查询时,所有在delete bitmap中被标记删除的数据都不会读出来,因此也无需进行做任何数据聚合,上述数据中有效地行数为4行, +查询出的结果也应该是4行,也就可以采取开销最小的方式来获取结果,即前面提到的“仅扫描某一列数据,获得 count 值”的方式。 在测试环境中,count(*) 查询在Unique模型的写时合并实现上的性能,相比聚合模型有10倍以上的提升。 @@ -733,7 +779,8 @@ Unique模型的写时合并实现没有聚合模型的局限性,还是以刚 Duplicate 模型没有聚合模型的这个局限性。因为该模型不涉及聚合语意,在做 count(*) 查询时,任意选择一列查询,即可得到语意正确的结果。 
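为了把上面几段关于 count(*) 的讨论串起来,下面给出一个示意性的小例子(其中表名 `example_tbl_count`、分桶数、副本设置和具体数值都是演示用的假设,并非文档中已有的表):在聚合模型上增加一个值恒为 1、聚合类型为 SUM 的 `count` 列,用 `sum(count)` 代替 `count(*)`。

```sql
-- 演示用的假设表:在聚合模型上用恒为 1 的 count 列模拟 count(*)
CREATE TABLE IF NOT EXISTS example_db.example_tbl_count
(
    `user_id` LARGEINT NOT NULL COMMENT "用户id",
    `date`    DATE     NOT NULL COMMENT "数据灌入日期",
    `cost`    BIGINT   SUM DEFAULT "0" COMMENT "用户总消费",
    `count`   BIGINT   SUM DEFAULT "1" COMMENT "用于计算count,导入时恒为 1"
)
AGGREGATE KEY(`user_id`, `date`)
DISTRIBUTED BY HASH(`user_id`) BUCKETS 1
PROPERTIES ("replication_num" = "1");

-- 导入时保证 count 列恒为 1(这里的数值对应上文聚合后的结果,各行 Key 互不相同)
insert into example_db.example_tbl_count values
(10001, "2017-11-20", 51, 1),
(10001, "2017-11-21", 5,  1),
(10002, "2017-11-21", 39, 1),
(10003, "2017-11-22", 22, 1);

-- 与 select count(*) from example_db.example_tbl_count; 语义等价,但开销小得多
select sum(`count`) from example_db.example_tbl_count;
```

如上文所述,这种写法要求不会重复导入 AGGREGATE KEY 完全相同的行,否则 `sum(count)` 表示的是原始导入的行数,不再等价于 `count(*)`;若把 `count` 列的聚合类型改为 REPLACE 且值仍恒为 1,则没有这个限制。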
## key 列 -Duplicate、Aggregate、Unique 模型,都会在建表指定 key 列,然而实际上是有所区别的:对于 Duplicate 模型,表的key列, +可以认为只是 “排序列”,并非起到唯一标识的作用。而 Aggregate、Unique 模型这种聚合类型的表,key 列是兼顾 “排序列” 和 “唯一标识列”,是真正意义上的“ key 列”。 ## 数据模型的选择建议