This is an automated email from the ASF dual-hosted git repository.

luzhijing pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/doris-website.git
The following commit(s) were added to refs/heads/master by this push:
     new 74602a782dc Add three blogs (#279)
74602a782dc is described below

commit 74602a782dcfa85bf74859a215100b8b62608200
Author: Hu Yanjun <100749531+httpshir...@users.noreply.github.com>
AuthorDate: Wed Aug 2 18:00:33 2023 +0800

    Add three blogs (#279)
---
 blog/Data_Update.md            | 226 +++++++++++++++++++++++++++++++++++++++++
 blog/Poly.md                   | 105 +++++++++++++++++++
 blog/Zhihu.md                  | 140 +++++++++++++++++++++++++
 static/images/Dataupdate_1.png | Bin 0 -> 146321 bytes
 static/images/Dataupdate_2.png | Bin 0 -> 40204 bytes
 static/images/Dataupdate_3.png | Bin 0 -> 30428 bytes
 static/images/Poly_1.png       | Bin 0 -> 286095 bytes
 static/images/Poly_2.png       | Bin 0 -> 413181 bytes
 static/images/Poly_3.png       | Bin 0 -> 316024 bytes
 static/images/Zhihu_1.png      | Bin 0 -> 176681 bytes
 static/images/Zhihu_2.png      | Bin 0 -> 139038 bytes
 static/images/Zhihu_3.png      | Bin 0 -> 192461 bytes
 static/images/Zhihu_4.png      | Bin 0 -> 158507 bytes
 static/images/Zhihu_5.png      | Bin 0 -> 65366 bytes
 static/images/Zhihu_6.png      | Bin 0 -> 49858 bytes
 static/images/Zhihu_7.png      | Bin 0 -> 96180 bytes
 static/images/Zhihu_8.png      | Bin 0 -> 80053 bytes
 17 files changed, 471 insertions(+)

diff --git a/blog/Data_Update.md b/blog/Data_Update.md
new file mode 100644
index 00000000000..d8bbd04511f
--- /dev/null
+++ b/blog/Data_Update.md
@@ -0,0 +1,226 @@

---
{
    'title': 'Is Your Latest Data Really the Latest? Check the Data Update Mechanism of Your Database',
    'summary': "This is about how to support both row update and partial column update in a database in a way that is simple in execution and efficient in data quality guarantee.",
    'date': '2023-07-24',
    'author': 'Apache Doris',
    'tags': ['Tech Sharing'],
}
---

<!--
Licensed to the Apache Software Foundation (ASF) under one
or more contributor license agreements. See the NOTICE file
distributed with this work for additional information
regarding copyright ownership. The ASF licenses this file
to you under the Apache License, Version 2.0 (the
"License"); you may not use this file except in compliance
with the License. You may obtain a copy of the License at

  http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing,
software distributed under the License is distributed on an
"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
KIND, either express or implied. See the License for the
specific language governing permissions and limitations
under the License.
-->

In databases, a data update adds, deletes, or modifies data. Timely data updates are an important part of high-quality data services.

Technically speaking, there are two types of data updates: you either update a whole row (**Row Update**) or just update part of the columns (**Partial Column Update**). Many databases support both, but in different ways. This post is about an approach that is simple in execution and efficient in guaranteeing data quality.

As an open-source analytic database, Apache Doris supports both Row Update and Partial Column Update with one data model: the **Unique Key Model**. It is where you put data that doesn't need to be aggregated. In the Unique Key Model, you can specify one column or the combination of several columns as the Unique Key (a.k.a. Primary Key). For one Unique Key, there will always be one row of data: the newly ingested data record replaces the old. That's how data updates work.
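For reference, the example table used in the Row Update section below can be declared as a Unique Key table roughly like this (a minimal sketch; the bucket count and replication settings are illustrative, and the full DDL with a Sequence Column appears later in this post):

```SQL
-- Rows sharing the same (user_id, date, group_id) collapse to one row: the latest write wins
CREATE TABLE test_table
(
    user_id     bigint,
    date        date,
    group_id    bigint,
    modify_date date,
    keyword     VARCHAR(128)
)
UNIQUE KEY(user_id, date, group_id)
DISTRIBUTED BY HASH (user_id) BUCKETS 32
PROPERTIES ("replication_num" = "1");
```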
The idea is straightforward, but in real-life implementation, it happens that the latest data does not arrive last or doesn't even get written at all, so I'm going to show you how Apache Doris implements data updates and avoids mix-ups with its Unique Key Model.

## Row Update

For data writing to the Unique Key Model, Apache Doris adopts the **Upsert** semantics, which means **Update or Insert**. If the new data record includes a Unique Key that already exists in the table, the new record replaces the old record; if it includes a brand new Unique Key, the new record is inserted into the table as a whole. The Upsert operation provides high throughput and guarantees data reliability.

**Example**:

In the following table, the Unique Key is the combination of three columns: `user_id, date, group_id`.

```SQL
mysql> desc test_table;
+-------------+--------------+------+-------+---------+-------+
| Field       | Type         | Null | Key   | Default | Extra |
+-------------+--------------+------+-------+---------+-------+
| user_id     | BIGINT       | Yes  | true  | NULL    |       |
| date        | DATE         | Yes  | true  | NULL    |       |
| group_id    | BIGINT       | Yes  | true  | NULL    |       |
| modify_date | DATE         | Yes  | false | NULL    | NONE  |
| keyword     | VARCHAR(128) | Yes  | false | NULL    | NONE  |
+-------------+--------------+------+-------+---------+-------+
```

Execute `insert into` to write in a data record. Since the table is empty, by the Upsert semantics, this adds a new row to the table.

```SQL
mysql> insert into test_table values (1, "2023-04-28", 2, "2023-04-28", "foo");
Query OK, 1 row affected (0.05 sec)
{'label':'insert_2fb45d1833db4348_b612b8791c97b467', 'status':'VISIBLE', 'txnId':'343'}

mysql> select * from test_table;
+---------+------------+----------+-------------+---------+
| user_id | date       | group_id | modify_date | keyword |
+---------+------------+----------+-------------+---------+
|       1 | 2023-04-28 |        2 | 2023-04-28  | foo     |
+---------+------------+----------+-------------+---------+
```

Then insert two more data records, one of which has the same Unique Key as the previously inserted row. By the Upsert semantics, the new record replaces the old row of the same Unique Key, while the record with the brand new Unique Key is inserted.

```SQL
mysql> insert into test_table values (1, "2023-04-28", 2, "2023-04-29", "foo"), (2, "2023-04-29", 2, "2023-04-29", "bar");
Query OK, 2 rows affected (0.04 sec)
{'label':'insert_7dd3954468aa4ac1_a63a3852e3573b4c', 'status':'VISIBLE', 'txnId':'344'}

mysql> select * from test_table;
+---------+------------+----------+-------------+---------+
| user_id | date       | group_id | modify_date | keyword |
+---------+------------+----------+-------------+---------+
|       2 | 2023-04-29 |        2 | 2023-04-29  | bar     |
|       1 | 2023-04-28 |        2 | 2023-04-29  | foo     |
+---------+------------+----------+-------------+---------+
```

## Partial Column Update

Besides row update, under many circumstances, data analysts require the convenience of partial column update. For example, in user portraits, they would like to update certain dimensions of their users in real time. Or, if they need to maintain a flat table that is made of data from various source tables, they will prefer partial column update to complicated join operations as a way of data update.

Apache Doris supports partial column update with the UPDATE statement. It filters the rows that need to be modified, reads them, changes a few values, and writes the rows back to the table.
**Example**:

Suppose that there is an order table, in which the Order ID is the Unique Key.

```SQL
+----------+--------------+-----------------+
| order_id | order_amount | order_status    |
+----------+--------------+-----------------+
| 1        | 100          | Payment Pending |
+----------+--------------+-----------------+
1 row in set (0.01 sec)
```

When the buyer completes the payment, Apache Doris should change the order status of Order ID 1 from "Payment Pending" to "Delivery Pending". This is when the Update command comes into play.

```SQL
mysql> UPDATE test_order SET order_status = 'Delivery Pending' WHERE order_id = 1;
Query OK, 1 row affected (0.11 sec)
{'label':'update_20ae22daf0354fe0-b5aceeaaddc666c5', 'status':'VISIBLE', 'txnId':'33', 'queryId':'20ae22daf0354fe0-b5aceeaaddc666c5'}
```

This is the table after updating.

```SQL
+----------+--------------+------------------+
| order_id | order_amount | order_status     |
+----------+--------------+------------------+
| 1        | 100          | Delivery Pending |
+----------+--------------+------------------+
1 row in set (0.01 sec)
```

The execution of the Update command consists of three steps in the system:

- Step One: Read the row where Order ID = 1 (1, 100, 'Payment Pending')
- Step Two: Modify the order status from "Payment Pending" to "Delivery Pending" (1, 100, 'Delivery Pending')
- Step Three: Insert the new row into the table (see the sketch below)
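In other words, step three is just another Upsert: the modified row is written back as a whole new row. Conceptually, the Update above boils down to something like this (an illustrative sketch, not the literal statement Doris executes internally):

```SQL
-- Step Three in SQL terms: the modified row is re-inserted as a whole
INSERT INTO test_order VALUES (1, 100, 'Delivery Pending');
```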
The table is in the Unique Key Model, which means that for rows of the same Unique Key, only the last inserted one is retained, so this is what the table will finally look like:

## Order of Data Updates

So far this sounds simple, but in the actual world, data updates might fail due to reasons such as data format errors, and thus mess up the data writing order. The order of data updates matters more than you imagine. For example, in financial transactions, a messed-up data writing order might lead to transaction data losses, errors, or duplication, which further leads to bigger problems.

Apache Doris provides two options for users to guarantee that their data is updated in the correct order:

**1. Update by the order of transaction commit**

In Apache Doris, each data ingestion task is a transaction. Each successfully ingested task is given a data version, and data version numbers are strictly increasing. If the ingestion fails, the transaction is rolled back, and no new data version is generated.

By default, the Upsert semantics follows the order of the transaction commits. If there are two data ingestion tasks involving the same Unique Key, the first task generating data version 2 and the second, data version 3, then according to transaction commit order, data version 3 will replace data version 2.

**2. Update by the user-defined order**

In real-time data analytics, data updates often happen in high concurrency. It is possible that there are multiple data ingestion tasks updating the same row, but these tasks are committed in an unknown order, so the last saved update remains unknown, too.

For example, these are two data updates, with "2023-04-30" and "2023-05-01" as the `modify_date`, respectively. If they are written into the system concurrently, but the "2023-05-01" one is successfully committed first and the other later, then the "2023-04-30" record will be saved due to its higher data version number, but we know it is not the latest one.

```Plain
mysql> insert into test_table values (2, "2023-04-29", 2, "2023-05-01", "bbb");
Query OK, 1 row affected (0.04 sec)
{'label':'insert_e2daf8cea5524ee1_94e5c87e7bb74d67', 'status':'VISIBLE', 'txnId':'345'}

mysql> insert into test_table values (2, "2023-04-29", 2, "2023-04-30", "aaa");
Query OK, 1 row affected (0.03 sec)
{'label':'insert_ef906f685a7049d0_b135b6cfee49fb98', 'status':'VISIBLE', 'txnId':'346'}

mysql> select * from test_table;
+---------+------------+----------+-------------+---------+
| user_id | date       | group_id | modify_date | keyword |
+---------+------------+----------+-------------+---------+
|       2 | 2023-04-29 |        2 | 2023-04-30  | aaa     |
|       1 | 2023-04-28 |        2 | 2023-04-29  | foo     |
+---------+------------+----------+-------------+---------+
```

That's why in high-concurrency scenarios, Apache Doris allows data updates in a user-defined order. Users can designate a column as the Sequence Column. In this way, the system will identify and save the latest data version based on the value in the Sequence Column.

**Example:**

You can designate a Sequence Column by specifying the `function_column.sequence_col` property upon table creation.

```SQL
CREATE TABLE test.test_table
(
    user_id bigint,
    date date,
    group_id bigint,
    modify_date date,
    keyword VARCHAR(128)
)
UNIQUE KEY(user_id, date, group_id)
DISTRIBUTED BY HASH (user_id) BUCKETS 32
PROPERTIES(
    "function_column.sequence_col" = 'modify_date',
    "replication_num" = "1",
    "in_memory" = "false"
);
```

Then check and see: the data record with the highest value in the Sequence Column is saved:

```SQL
mysql> insert into test_table values (2, "2023-04-29", 2, "2023-05-01", "bbb");
Query OK, 1 row affected (0.03 sec)
{'label':'insert_3aac37ae95bc4b5d_b3839b49a4d1ad6f', 'status':'VISIBLE', 'txnId':'349'}

mysql> insert into test_table values (2, "2023-04-29", 2, "2023-04-30", "aaa");
Query OK, 1 row affected (0.03 sec)
{'label':'insert_419d4008768d45f3_a6912e584cf1b500', 'status':'VISIBLE', 'txnId':'350'}

mysql> select * from test_table;
+---------+------------+----------+-------------+---------+
| user_id | date       | group_id | modify_date | keyword |
+---------+------------+----------+-------------+---------+
|       2 | 2023-04-29 |        2 | 2023-05-01  | bbb     |
|       1 | 2023-04-28 |        2 | 2023-04-29  | foo     |
+---------+------------+----------+-------------+---------+
```

## Conclusion

Congratulations. Now you've gained an overview of how data updates are implemented in Apache Doris. With this knowledge, you can basically guarantee the efficiency and accuracy of data updates. But wait, there is so much more to it. As Apache Doris 2.0 is going to provide more powerful Partial Column Update capabilities, with improved execution of the Update statement and support for more complicated multi-table Join queries, I will show you how to take advantage of them in detail [...]

diff --git a/blog/Poly.md b/blog/Poly.md
new file mode 100644
index 00000000000..20916cbf13b
--- /dev/null
+++ b/blog/Poly.md
@@ -0,0 +1,105 @@

---
{
    'title': 'For Entry-Level Data Engineers: How to Build a Simple but Solid Data Architecture',
    'summary': "This article aims to provide a reference for non-tech companies seeking to empower their business with data analytics.",
    'date': '2023-07-31',
    'author': 'Zhenwei Liu',
    'tags': ['Apache Doris'],
}
---

<!--
Licensed to the Apache Software Foundation (ASF) under one
or more contributor license agreements.
See the NOTICE file
distributed with this work for additional information
regarding copyright ownership. The ASF licenses this file
to you under the Apache License, Version 2.0 (the
"License"); you may not use this file except in compliance
with the License. You may obtain a copy of the License at

  http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing,
software distributed under the License is distributed on an
"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
KIND, either express or implied. See the License for the
specific language governing permissions and limitations
under the License.
-->

For Entry-Level Data Engineers: How to Build a Simple but Solid Data Architecture

This article aims to provide a reference for non-tech companies seeking to empower their business with data analytics. You will learn the basics of how to build an efficient and easy-to-use data system, and I will walk you through every aspect of it with a use case of Apache Doris, an MPP-based analytic data warehouse.

## What You Need

This case is about a ticketing service provider who wants a data platform that boasts quick processing, low maintenance costs, and ease of use, and I think they speak for the majority of entry-level database users.

A prominent feature of ticketing services is the periodic spikes in ticket orders, you know, before the shows go on. So from time to time, the company has a huge amount of new data rushing in and requires real-time processing of it, so they can make timely adjustments during the short sales window. But at other times, they don't want to spend too much energy or money on maintaining the data system. Furthermore, for a beginner in digital operation who only requires basic analytic functions [...]

## Simple Architecture

The building blocks of this architecture are simple. You only need Apache Flink and Apache Kafka for data ingestion, and Apache Doris as an analytic data warehouse.

Connecting data sources to the data warehouse is simple, too. The key component, Apache Doris, supports various data loading methods to fit different data sources. You can perform column mapping, transforming, and filtering during data loading to avoid duplicate collection of data. To ingest a table, users only need to add the table name to the configurations, instead of writing a script themselves.

## Data Update

Flink CDC was found to be the optimal choice for higher stability in data ingestion. It also allows you to update the dynamically changing tables in real time. The process includes the following steps (a sketch of a CDC source definition follows the list):

- Configure Flink CDC for the source MySQL database, so that it allows dynamic updating of the table management configurations (which you can think of as the "metadata").
- Create two CDC jobs in Flink, one to capture the changed data (the Forward stream), the other to update the table management configurations (the Broadcast stream).
- Configure all tables of the source database at the Sink end (the output end of Flink CDC). When there is a newly added table in the source database, the Broadcast stream will be triggered to update the table management configurations. (You just need to configure the tables, instead of "creating" the tables.)
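For a flavor of what the ingestion side looks like, here is a minimal Flink SQL sketch of a MySQL CDC source table. All names and connection settings below are hypothetical, and the setup described above actually uses two DataStream jobs rather than plain SQL, so treat this only as an illustration of a CDC source definition:

```SQL
-- A MySQL table exposed to Flink as a CDC source (illustrative settings only)
CREATE TABLE orders_src (
    order_id     BIGINT,
    order_amount DECIMAL(10, 2),
    order_status STRING,
    PRIMARY KEY (order_id) NOT ENFORCED
) WITH (
    'connector'     = 'mysql-cdc',
    'hostname'      = 'mysql-host',
    'port'          = '3306',
    'username'      = 'flink_user',
    'password'      = '******',
    'database-name' = 'ticketing',
    'table-name'    = 'orders'
);
```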
## Layering of Data Warehouse

Data flows from various sources into the data warehouse, where it is cleaned and organized before it is ready for queries and analysis. The data processing here is divided into five typical layers. Such layering simplifies the data cleaning process because it provides a clear division of labor and makes things easier to locate and comprehend.

- **ODS**: This is the prep zone of the data warehouse. The unprocessed original data is put in the [Unique Key Model](https://doris.apache.org/docs/dev/data-table/data-model/#unique-model) of Apache Doris, which can avoid duplication of data.
- **DWD**: This layer cleans, formats, and de-identifies data to produce fact tables. Every detailed data record is preserved. Data in this layer is also put into the Unique Key Model.
- **DWS**: This layer produces flat tables of a certain theme (order, user, etc.) based on data from the DWD layer.
- **ADS**: This layer auto-aggregates data, which is implemented by the [Aggregate Key Model](https://doris.apache.org/docs/dev/data-table/data-model/#aggregate-model) of Apache Doris.
- **DIM**: The DIM layer accommodates dimension data (in this case, data about the theaters, projects, and show sessions, etc.), which is used in combination with the order details.

After the original data goes through these layers, it is available for queries via one data export interface.

## Reporting

Like many non-tech businesses, the ticketing service provider needs a data warehouse mainly for reporting. They derive trends and patterns from all kinds of data reports, and then figure out ways to improve management efficiency and increase sales. Specifically, this is the information they observe in their reports:

- **Statistical Reporting**: These are the most frequently used reports, including sales reports by theater, distribution channel, sales representative, and show.
- **Agile Reporting**: These are reports developed for specific purposes, such as daily and weekly project data reports, sales summary reports, GMV reports, and settlement reports.
- **Data Analysis**: This involves data such as membership orders, attendance rates, and user portraits.
- **Dashboarding**: This is to visually display sales data.

These are all entry-level tasks in data analytics. One of the biggest burdens for the data engineers was quickly developing new reports as the internal analysts required them. The [Aggregate Key Model](https://doris.apache.org/docs/dev/data-table/data-model#aggregate-model) of Apache Doris is designed for this.

### Quick aggregation to produce reports on demand

For example, suppose that analysts want a sales report by sales representative. Data engineers can produce that with simple configuration:

1. Put the original data in the Aggregate Key Model
2. Specify the sales representative ID column and the payment date column as the Key columns, and the order amount column as the Value column

Then, order amounts of the same sales representative within the specified period of time will be auto-aggregated. Bam! That's the report you need!

According to the user, this whole process only takes them 10~30 minutes, depending on the complexity of the report required. So the Aggregate Key Model largely relieves data engineers of the pressure of report development.
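As a rough illustration of steps 1 and 2 above, such an Aggregate Key table could be declared like this (a sketch with hypothetical table and column names, not the user's actual schema):

```SQL
-- Key columns define the aggregation granularity; the Value column is summed on load
CREATE TABLE sales_by_rep
(
    rep_id       BIGINT COMMENT "sales representative ID",
    pay_date     DATE COMMENT "payment date",
    order_amount BIGINT SUM DEFAULT "0" COMMENT "order amount, auto-summed on load"
)
AGGREGATE KEY(rep_id, pay_date)
DISTRIBUTED BY HASH(rep_id) BUCKETS 16
PROPERTIES ("replication_num" = "1");
```

Once detail rows are loaded into such a table, `order_amount` is aggregated automatically for each `(rep_id, pay_date)` pair, so the report query becomes a simple scan over pre-aggregated data.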
### Quick response to data queries

Most data analysts would just want their target data to be returned the second they need it. In this case, the user often leverages two capabilities of Apache Doris to realize quick query response.

Firstly, Apache Doris is famously fast in Join queries. So if you need to extract information across multiple tables, you are in good hands. Secondly, in data analysis, it often happens that analysts frequently input the same request. For example, they frequently want to check the sales data of different theaters. In this scenario, Apache Doris allows you to create a [Materialized View](https://doris.apache.org/docs/dev/query-acceleration/materialized-view/), which means you pre-aggregat [...]

## Conclusion

This is the overview of a simple data architecture and how it can provide the data services you need. It ensures data ingestion stability and quality with Flink CDC, and quick data analysis with Apache Doris. The deployment of this architecture is simple, too. If you plan a data analytics upgrade for your business, you might refer to this case. If you need advice and help, you may join our [community here](https://t.co/ZxJuNJHXb2).

diff --git a/blog/Zhihu.md b/blog/Zhihu.md
new file mode 100644
index 00000000000..9bef297d53a
--- /dev/null
+++ b/blog/Zhihu.md
@@ -0,0 +1,140 @@

---
{
    'title': 'Database Dissection: How Fast Data Queries Are Implemented',
    'summary': "What's more important than quick performance itself is the architectural design and mechanism that enable it.",
    'date': '2023-07-16',
    'author': 'Rong Hou',
    'tags': ['Best Practice'],
}
---

<!--
Licensed to the Apache Software Foundation (ASF) under one
or more contributor license agreements. See the NOTICE file
distributed with this work for additional information
regarding copyright ownership. The ASF licenses this file
to you under the Apache License, Version 2.0 (the
"License"); you may not use this file except in compliance
with the License. You may obtain a copy of the License at

  http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing,
software distributed under the License is distributed on an
"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
KIND, either express or implied. See the License for the
specific language governing permissions and limitations
under the License.
-->

In data analytics, fast query performance is more of a result than a guarantee. What's more important than the result itself is the architectural design and mechanism that enable quick performance. This is exactly what this post is about. I will put you into context with a typical use case of Apache Doris, an open-source MPP-based analytic database.

The user in this case is an all-category Q&A website. As a billion-dollar listed company, they have their own data management platform. What Doris does is support the data filtering, packaging, analyzing, and monitoring workloads of that platform. Given their huge data size, the user demands quick data loading and quick responses to queries.

## How to Enable Quick Queries on a Huge Dataset

- **Scenario**: user segmentation for the website
- **Data size**: 100 billion data objects, 2.4 million tags
- **Requirements**: query response time < 1 second; result packaging < 10 seconds

For these goals, the engineers have made three critical changes in their data processing pipeline.

### 1. Distribute the data

User segmentation is when analysts pick out a group of website users that share certain characteristics (tags). In the database system, this process is implemented by a bunch of set operations (union, intersection, and difference).

**Narration from the engineers:**

We realized that instead of executing set operations on one big dataset, we can divide our dataset into smaller ones, execute set operations on each of them, and then merge all the results. In this way, each small dataset is computed by one thread/queue. Then we have a queue to do the final merging. It's simple distributed computing thinking.

Example:

1. Every 1 million users are put into one group with a `group_id`.
2. All user tags in that same group will relate to the corresponding `group_id`.
3. Calculate the union/intersection/difference within each group, as sketched below. (Enable multi-thread mode to increase computation efficiency.)
4. Merge the results from the groups.
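In Doris, this kind of within-group set computation is typically expressed with bitmap functions (the same family of functions discussed in the next sections). A sketch of what step 3 could look like, with a hypothetical table layout and tag IDs:

```SQL
-- Within one group: users that have tag 100 AND tag 200 but NOT tag 300
-- (hypothetical layout: one row per (group_id, tag_id) with a BITMAP column of users)
SELECT t1.group_id,
       bitmap_count(bitmap_not(bitmap_and(t1.users, t2.users), t3.users)) AS segment_size
FROM tag_bitmap t1
JOIN tag_bitmap t2 ON t2.group_id = t1.group_id AND t2.tag_id = 200
JOIN tag_bitmap t3 ON t3.group_id = t1.group_id AND t3.tag_id = 300
WHERE t1.tag_id = 100;
```

Because each group holds a disjoint slice of users, the merging in step 4 is just a union of the per-group bitmaps, or simply a sum of the per-group counts.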
The problem here is, since user tags are randomly distributed across various machines, the computation entails multiple rounds of shuffling, which brings huge network overhead. That leads to the second change.

### 2. Pre-bind a data group to a machine

This is enabled by the Colocate mechanism of Apache Doris. The idea of Colocate is to place data chunks that are often accessed together onto the same node, so as to reduce cross-node data transfer and thus lower the latency.

The implementation is simple: Bind one group key to one machine. Then, naturally, data corresponding to that group key will be pre-bound to that machine.

The following is the query plan before we adopted Colocate. It is complicated, with a lot of data shuffling.

This is the query plan after. It is much simpler, which is why queries are much faster and less costly.

### 3. Merge the operators

In data queries, the engineers realized that they often use a couple of functions in combination, so they decided to develop compound functions to further improve execution efficiency. They came to the Doris [community](https://t.co/XD4uUSROft) and talked about their thoughts. The Doris developers provided support for them, and soon the compound functions were ready for use in Doris. These are a few examples:

```
bitmap_and_count == bitmap_count(bitmap_and(bitmap1, bitmap2))
bitmap_and_not_count == bitmap_count(bitmap_not(bitmap1, bitmap_and(bitmap1, bitmap2)))
orthogonal_bitmap_union_count == bitmap_count(bitmap_or(bitmap1, bitmap_or(bitmap2, bitmap3)))
```

Query execution with one compound function is much faster than that with a chain of simple functions, as you can tell from the lengths of the flow charts:

- **Multiple simple functions**: This involves three function executions and two intermediate storage steps. It's a long and slow process.
- **One compound function**: Simple in and out.

## How to Quickly Ingest Large Amounts of Data

This is about putting the right workload on the right component. Apache Doris supports a variety of data loading methods. After trial and error, the user settled on Spark Load and thus decreased their data loading time by 90%.

**Narration from the engineers:**

In offline data ingestion, we used to perform most computation in Apache Hive, write the data files to HDFS, and pull data regularly from HDFS to Apache Doris. However, after Doris obtains Parquet files from HDFS, it performs a series of operations on them before it can turn them into segment files: decompressing, bucketing, sorting, aggregating, and compressing. These workloads are borne by the Doris backends, which also have to undertake a few bitmap operations at the same time. So there is [...]

So we decided on the Spark Load method.
It allows us to split the ingestion process into two parts, computation and storage, so we can move all the bucketing, sorting, aggregating, and compressing to Spark clusters. Then Spark writes the output to HDFS, from which Doris pulls data and flushes it to the local disks.

When ingesting 1.2 TB of data (that's 110 billion rows), the Spark Load method only took 55 minutes.

## A Vectorized Execution Engine

In addition to the above changes, a large part of the performance of a database relies on its execution engine. In the case of Apache Doris, it has fully vectorized its storage and computation layers since version 1.1. The longtime user also witnessed this revolution, so we invited them to test how the vectorized engine worked.

They compared query response time before and after the vectorization in seven of their frequent scenarios:

- Scenario 1: Simple user segmentation (hundreds of filtering conditions), data packaging of a multi-million user group.
- Scenario 2: Complicated user segmentation (thousands of filtering conditions), data packaging of a tens-of-million user group.
- Scenario 3: Multi-dimensional filtering (6 dimensions), single-table query, **single-date flat table**, data aggregation, 180 million rows per day.
- Scenario 4: Multi-dimensional filtering (6 dimensions), single-table query, **multi-date flat table**, data aggregation, 180 million rows per day.
- Scenario 5: **Single-table query**, COUNT, 180 million rows per day.
- Scenario 6: **Multi-table query**, (Table A: 180 million rows, SUM, COUNT; Table B: 1.5 million rows, bitmap aggregation), aggregate Table A and Table B, join them with Table C, and then join the sub-tables, six joins in total.
- Scenario 7: Single-table query, 500 million rows of itemized data

The results are as below:

## Conclusion

In short, what contributed to the fast data loading and data queries in this case?

- The Colocate mechanism that's designed for distributed computing
- Collaboration between database users and [developers](https://t.co/ZxJuNJHXb2) that enables the operator merging
- Support for a wide range of data loading methods to choose from
- A vectorized engine that brings an overall performance increase

It takes effort from both the database developers and users to make fast performance possible. The user's experience and knowledge of their own status quo will allow them to figure out the quickest path, while a good database design will help pave the way and make users' lives easier.
\ No newline at end of file
diff --git a/static/images/Dataupdate_1.png b/static/images/Dataupdate_1.png
new file mode 100644
index 00000000000..065b36a48a1
Binary files /dev/null and b/static/images/Dataupdate_1.png differ
diff --git a/static/images/Dataupdate_2.png b/static/images/Dataupdate_2.png
new file mode 100644
index 00000000000..ec817a7327a
Binary files /dev/null and b/static/images/Dataupdate_2.png differ
diff --git a/static/images/Dataupdate_3.png b/static/images/Dataupdate_3.png
new file mode 100644
index 00000000000..62c0d37797d
Binary files /dev/null and b/static/images/Dataupdate_3.png differ
diff --git a/static/images/Poly_1.png b/static/images/Poly_1.png
new file mode 100644
index 00000000000..6906d77b767
Binary files /dev/null and b/static/images/Poly_1.png differ
diff --git a/static/images/Poly_2.png b/static/images/Poly_2.png
new file mode 100644
index 00000000000..7076b6122b9
Binary files /dev/null and b/static/images/Poly_2.png differ
diff --git a/static/images/Poly_3.png b/static/images/Poly_3.png
new file mode 100644
index 00000000000..d9c417b8a47
Binary files /dev/null and b/static/images/Poly_3.png differ
diff --git a/static/images/Zhihu_1.png b/static/images/Zhihu_1.png
new file mode 100644
index 00000000000..3f83fa7c2a3
Binary files /dev/null and b/static/images/Zhihu_1.png differ
diff --git a/static/images/Zhihu_2.png b/static/images/Zhihu_2.png
new file mode 100644
index 00000000000..50e9534e0f0
Binary files /dev/null and b/static/images/Zhihu_2.png differ
diff --git a/static/images/Zhihu_3.png b/static/images/Zhihu_3.png
new file mode 100644
index 00000000000..6a5f5e7c5f6
Binary files /dev/null and b/static/images/Zhihu_3.png differ
diff --git a/static/images/Zhihu_4.png b/static/images/Zhihu_4.png
new file mode 100644
index 00000000000..6f103e3cb76
Binary files /dev/null and b/static/images/Zhihu_4.png differ
diff --git a/static/images/Zhihu_5.png b/static/images/Zhihu_5.png
new file mode 100644
index 00000000000..c391ab19b8b
Binary files /dev/null and b/static/images/Zhihu_5.png differ
diff --git a/static/images/Zhihu_6.png b/static/images/Zhihu_6.png
new file mode 100644
index 00000000000..61e7e54dcd1
Binary files /dev/null and b/static/images/Zhihu_6.png differ
diff --git a/static/images/Zhihu_7.png b/static/images/Zhihu_7.png
new file mode 100644
index 00000000000..256d5ec177d
Binary files /dev/null and b/static/images/Zhihu_7.png differ
diff --git a/static/images/Zhihu_8.png b/static/images/Zhihu_8.png
new file mode 100644
index 00000000000..c5ad192c382
Binary files /dev/null and b/static/images/Zhihu_8.png differ

---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscr...@doris.apache.org
For additional commands, e-mail: commits-h...@doris.apache.org