This is an automated email from the ASF dual-hosted git repository.

luzhijing pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/doris-website.git
The following commit(s) were added to refs/heads/master by this push:
     new 087aec0dfe0 add blogs & update dead links (#319)
087aec0dfe0 is described below

commit 087aec0dfe00019e782ed3633fe1521f8256adba
Author: Hu Yanjun <100749531+httpshir...@users.noreply.github.com>
AuthorDate: Thu Oct 12 10:48:30 2023 +0800

    add blogs & update dead links (#319)
---
 ...-Why-We-Went-from-ClickHouse-to-Apache-Doris.md |   2 +-
 blog/Tencent-LLM.md                                |   2 +-
 ...ambda-Architecture-for-40-Faster-Performance.md |  97 ------
 ...s-a-next-generation-real-time-data-warehouse.md | 172 ++++++++++
 blog/log-analysis-elasticsearch-vs-apache-doris.md | 352 +++++++++++++++++++++
 ...rom-clickhouse-to-apache-doris-what-happened.md | 161 ++++++++++
 blog/release-note-2.0.0.md                         |   4 +-
 static/images/Introduction_3.png                   | Bin 0 -> 149120 bytes
 static/images/Introduction_4.png                   | Bin 0 -> 143564 bytes
 static/images/LAS_1.png                            | Bin 0 -> 96696 bytes
 static/images/LAS_2.png                            | Bin 0 -> 104517 bytes
 static/images/LAS_3.png                            | Bin 0 -> 73001 bytes
 static/images/LAS_4.png                            | Bin 0 -> 67765 bytes
 static/images/LAS_5.png                            | Bin 0 -> 49078 bytes
 static/images/introduction_1.png                   | Bin 0 -> 321089 bytes
 static/images/introduction_10.png                  | Bin 0 -> 141948 bytes
 static/images/introduction_11.png                  | Bin 0 -> 77773 bytes
 static/images/introduction_2.png                   | Bin 0 -> 326940 bytes
 static/images/introduction_5.png                   | Bin 0 -> 138772 bytes
 static/images/introduction_6.png                   | Bin 0 -> 58618 bytes
 static/images/introduction_7.png                   | Bin 0 -> 194930 bytes
 static/images/introduction_8.png                   | Bin 0 -> 277049 bytes
 static/images/introduction_9.png                   | Bin 0 -> 173567 bytes
 static/images/youzan-1.png                         | Bin 0 -> 96929 bytes
 static/images/youzan-2.png                         | Bin 0 -> 86410 bytes
 static/images/youzan-3.png                         | Bin 0 -> 385862 bytes
 static/images/youzan-4.png                         | Bin 0 -> 382306 bytes
 static/images/youzan-5.png                         | Bin 0 -> 143511 bytes
 static/images/youzan-6.png                         | Bin 0 -> 402306 bytes
 29 files changed, 689 insertions(+), 101 deletions(-)
diff --git 
a/blog/Tencent-Data-Engineers-Why-We-Went-from-ClickHouse-to-Apache-Doris.md b/blog/Tencent-Data-Engineers-Why-We-Went-from-ClickHouse-to-Apache-Doris.md
index 8f5aae89c27..e93ab50fd45 100644
--- a/blog/Tencent-Data-Engineers-Why-We-Went-from-ClickHouse-to-Apache-Doris.md
+++ b/blog/Tencent-Data-Engineers-Why-We-Went-from-ClickHouse-to-Apache-Doris.md
@@ -1,6 +1,6 @@
 ---
 {
-    'title': 'Tencent Data Engineer: Why We G from ClickHouse to Apache Doris?',
+    'title': 'Tencent Data Engineer: Why We Went from ClickHouse to Apache Doris?',
     'summary': "Evolution of the data processing architecture of Tencent Music Entertainment towards better performance and simpler maintenance.",
     'date': '2023-03-07',
     'author': 'Jun Zhang & Kai Dai',
diff --git a/blog/Tencent-LLM.md b/blog/Tencent-LLM.md
index 3775c5a82d0..d8c39cb847c 100644
--- a/blog/Tencent-LLM.md
+++ b/blog/Tencent-LLM.md
@@ -28,7 +28,7 @@ specific language governing permissions and limitations
 under the License.
 -->
 
-Six months ago, I wrote about [why we replaced ClickHouse with Apache Doris as an OLAP engine](https://doris.apache.org/blog/Tencent%20Music/) for our data management system. Back then, we were struggling with the auto-generation of SQL statements. As days pass, we have made progresses big enough to be references for you (I think), so here I am again.
+Six months ago, I wrote about [why we replaced ClickHouse with Apache Doris as an OLAP engine](https://doris.apache.org/blog/Tencent-Data-Engineers-Why-We-Went-from-ClickHouse-to-Apache-Doris) for our data management system. Back then, we were struggling with the auto-generation of SQL statements. As days pass, we have made progresses big enough to be references for you (I think), so here I am again.
 
 We have adopted Large Language Models (LLM) to empower our Doris-based OLAP services. 
diff --git a/blog/Zipping-up-the-Lambda-Architecture-for-40-Faster-Performance.md b/blog/Zipping-up-the-Lambda-Architecture-for-40-Faster-Performance.md
deleted file mode 100644
index d38a2f7902b..00000000000
--- a/blog/Zipping-up-the-Lambda-Architecture-for-40-Faster-Performance.md
+++ /dev/null
@@ -1,97 +0,0 @@
----
-{
-    'title': 'Zipping up the Lambda Architecture for 40% Faster Performance',
-    'summary': "Instead of pooling real-time and offline data after they are fully ready for queries, Douyu engineers use Apache Doris to share part of the pre-query computation burden.",
-    'date': '2023-05-05',
-    'author': 'Tongyang Han',
-    'tags': ['Best Practice'],
-}
-
----
-
-<!--
-Licensed to the Apache Software Foundation (ASF) under one
-or more contributor license agreements. See the NOTICE file
-distributed with this work for additional information
-regarding copyright ownership. The ASF licenses this file
-to you under the Apache License, Version 2.0 (the
-"License"); you may not use this file except in compliance
-with the License. You may obtain a copy of the License at
-
-  http://www.apache.org/licenses/LICENSE-2.0
-
-Unless required by applicable law or agreed to in writing,
-software distributed under the License is distributed on an
-"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
-KIND, either express or implied. See the License for the
-specific language governing permissions and limitations
-under the License.
--->
-
-Author: Tongyang Han, Senior Data Engineer at Douyu
-
-The Lambda architecture has been common practice in big data processing. The concept is to separate stream (real time data) and batch (offline data) processing, and that's exactly what we did. These two types of data of ours were processed in two isolated tubes before they were pooled together and ready for searches and queries.
-
-Then we run into a few problems:
-
-1. **Isolation of real-time and offline data warehouses**
-   1. I know this is kind of the essence of Lambda architecture, but that means we could not reuse real-time data since it was not layered as offline data, so further customized development was required.
-2. **Complex Pipeline from Data Sources to Data Application**
-   1. Data had to go through multi-step processing before it reached our data users. As our architecture involved too many components, navigating and maintaining these tech stacks was a lot of work.
-3. **Lack of management of real-time data sources**
-   1. In extreme cases, this worked like a data silo and we had no way to find out whether the ingested data was duplicated or reusable.
-
-So we decided to "zip up" the Lambda architecture a little bit. By "zipping up", I mean to introduce an OLAP engine that is capable of processing, storing, and analyzing data, so real-time data and offline data converge a little earlier than they used to. It is not a revolution of Lambda, but a minor change in the choice of components, which made our real-time data processing 40% faster.
-
-## **Zipping up Lambda Architecture**
-
-I am going to elaborate on how this is done using our data tagging process as an example.
-
-Previously, our offline tags were produced by the data warehouse, put into a flat table, and then written in **HBase**, while real-time tags were produced by **Flink**, and put into **HBase** directly. Then **Spark** would work as the computing engine.
-
-The problem with this stemmed from the low computation efficiency of **Flink** and **Spark**.
-
-- **Real-time tag production**: When computing real-time tags that involve data within a long time range, Flink did not deliver stable performance and consumed more resources than expected. And when a task failed, it would take a really long time for checkpoint recovery.
-- **Tag query**: As a tag query engine, Spark could be slow.
-
-As a solution, we replaced **HBase** and **Spark** with **Apache Doris**, a real-time analytic database, and moved part of the computational logic of the foregoing wide-time-range real-time tags from **Flink** to **Apache Doris**.
-
-Instead of putting our flat tables in HBase, we place them in Apache Doris. These tables are divided into partitions based on time sensitivity. Offline tags will be updated daily while real-time tags will be updated in real time. We organize these tables in the Aggregate Model of Apache Doris, which allows partial update of data.
-
-Instead of using Spark for queries, we parse the query rules into SQL for execution in Apache Doris. For pattern matching, we use Redis to cache the hot data from Apache Doris, so the system can respond to such queries much faster.
-
-## **Computational Pipeline of Wide-Time-Range Real-Time Tags**
-
-In some cases, the computation of wide-time-range real-time tags entails the aggregation of historical (offline) data with real-time data. The following figure shows our old computational pipeline for these tags.
-
-As you can see, it required multiple tasks to finish computing one real-time tag. Also, in complicated aggregations that involve a collection of aggregation operations, any improper resource allocation could lead to back pressure or waste of resources. This adds to the difficulty of task scheduling. The maintenance and stability guarantee of such a long pipeline could be an issue, too.
-
-To improve on that, we decided to move such aggregation workload to Apache Doris.
-
-We have around 400 million customer tags in our system, and each customer is attached with over 300 tags. We divide customers into more than 10,000 groups, and we have to update 5000 of them on a daily basis. The above improvement has sped up the computation of our wide-time-range real-time queries by **40%**.
-
-## Overwrite
-
-To atomically replace data tables and partitions in Apache Doris, we customized the [Doris-Spark-Connector](https://github.com/apache/doris-spark-connector), and added an "Overwrite" mode to the Connector.
-
-When a Spark job is submitted, Apache Doris will call an interface to fetch information of the data tables and partitions.
-
-- If it is a non-partitioned table, we create a temporary table for the target table, ingest data into it, and then perform atomic replacement. If the data ingestion fails, we clear the temporary table;
-- If it is a dynamic partitioned table, we create a temporary partition for the target partition, ingest data into it, and then perform atomic replacement. If the data ingestion fails, we clear the temporary partition;
-- If it is a non-dynamic partitioned table, we need to extend the Doris-Spark-Connector parameter configuration first. Then we create a temporary partition and take steps as above.
-
-## Conclusion
-
-One prominent advantage of Lambda architecture is the stability it provides. However, in our practice, the processing of real-time data and offline data sometimes intertwines. For example, the computation of certain real-time tags requires historical (offline) data. Such interaction becomes a root cause of instability. Thus, instead of pooling real-time and offline data after they are fully ready for queries, we use an OLAP engine to share part of the pre-query computation burden and mak [...] 
diff --git a/blog/introduction-to-apache-doris-a-next-generation-real-time-data-warehouse.md b/blog/introduction-to-apache-doris-a-next-generation-real-time-data-warehouse.md
new file mode 100644
index 00000000000..7018172cdee
--- /dev/null
+++ b/blog/introduction-to-apache-doris-a-next-generation-real-time-data-warehouse.md
@@ -0,0 +1,172 @@
+---
+{
+    'title': 'Introduction to Apache Doris: A Next-Generation Real-Time Data Warehouse',
+    'summary': "This is a technical overview of Apache Doris, introducing how it enables fast query performance with its architectural design, features, and mechanisms.",
+    'date': '2023-10-03',
+    'author': 'Apache Doris',
+    'tags': ['Tech Sharing'],
+}
+
+---
+
+<!--
+Licensed to the Apache Software Foundation (ASF) under one
+or more contributor license agreements. See the NOTICE file
+distributed with this work for additional information
+regarding copyright ownership. The ASF licenses this file
+to you under the Apache License, Version 2.0 (the
+"License"); you may not use this file except in compliance
+with the License. You may obtain a copy of the License at
+
+  http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing,
+software distributed under the License is distributed on an
+"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+KIND, either express or implied. See the License for the
+specific language governing permissions and limitations
+under the License.
+-->
+
+## What is Apache Doris?
+
+[Apache Doris](https://doris.apache.org/) is an open-source real-time data warehouse. It can collect data from various data sources, including relational databases (MySQL, PostgreSQL, SQL Server, Oracle, etc.), logs, and time series data from IoT devices. It is capable of reporting, ad-hoc analysis, federated queries, and log analysis, so it can be used to support dashboarding, self-service BI, A/B testing, user behavior analysis, and the like.
+
+Apache Doris supports both batch import and stream writing. It can be well integrated with Apache Spark, Apache Hive, Apache Flink, Airbyte, DBT, and Fivetran. It can also connect to data lakes such as Apache Hive, Apache Hudi, Apache Iceberg, Delta Lake, and Apache Paimon.
+
+## Performance
+
+As a real-time OLAP engine, Apache Doris has a competitive edge in query speed. According to the TPC-H and SSB-Flat benchmarking results, Doris can deliver much faster performance than Presto, Greenplum, and ClickHouse.
+
+As for its self-evolution, it has increased its query speed by over 10 times in the past two years, both in complex queries and flat table analysis.
+
+## Architectural Design
+
+Behind the fast speed of Apache Doris are the architectural design, features, and mechanisms that contribute to its performance.
+
+First of all, Apache Doris has a cost-based optimizer (CBO) that can figure out the most efficient execution plan for complicated big queries. It has a fully vectorized execution engine so it can reduce virtual function calls and cache misses. It is MPP-based (Massively Parallel Processing) so it can give full play to the user's machines and cores. In Doris, query execution is data-driven, which means whether a query gets executed is determined by whether its relevant data is ready, and [...]
+
+## Fast Point Queries for a Column-Oriented Database
+
+Apache Doris is a column-oriented database, so it can make data compression and data sharding easier and faster. But this might not be suitable for cases such as customer-facing services. In these cases, a data platform will have to handle requests from a large number of users concurrently (these are called "high-concurrency point queries"), and having a columnar storage engine will amplify I/O operations per second, especially when data is arranged in flat tables.
+
+To fix that, Apache Doris enables hybrid storage, which means it keeps row storage and columnar storage at the same time.
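As a hedged sketch of what enabling hybrid storage can look like: the table name and columns below are invented, and `store_row_column` is, as far as I can tell from the Doris docs, the table property that keeps an extra row-format copy for point queries.

```SQL
-- Hypothetical customer-facing table: columnar storage for analytics,
-- plus a row-format copy ("store_row_column") for point queries.
CREATE TABLE customer_profile (
    customer_id BIGINT,
    name        VARCHAR(64),
    city        VARCHAR(32)
)
UNIQUE KEY(customer_id)
DISTRIBUTED BY HASH(customer_id) BUCKETS 16
PROPERTIES (
    "enable_unique_key_merge_on_write" = "true",
    "store_row_column" = "true"
);

-- A key-based point query can then be served from the row-format copy.
SELECT * FROM customer_profile WHERE customer_id = 10001;
```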
+
+In addition, since point queries are all simple queries, it would be unnecessary and wasteful to invoke the full query planner for them, so Doris executes a short-circuit plan for them to reduce overhead.
+
+Another big source of overhead in high-concurrency point queries is SQL parsing. For that, Doris provides prepared statements. The idea is to pre-parse SQL statements and cache them, so they can be reused for similar queries.
+
+## Data Ingestion
+
+Apache Doris provides a range of methods for data ingestion.
+
+**Real-time stream writing**:
+
+- **[Stream Load](https://doris.apache.org/docs/dev/data-operate/import/import-way/stream-load-manual?_highlight=stream&_highlight=loa)**: You can apply this method to write local files or data streams via HTTP. It is linearly scalable and can reach a throughput of 10 million records per second in some use cases.
+- **[Flink-Doris-Connector](https://doris.apache.org/docs/1.2/ecosystem/flink-doris-connector/)**: With built-in Flink CDC, this Connector ingests data from OLTP databases into Doris. So far, we have realized auto-synchronization of data from MySQL and Oracle to Doris.
+- **[Routine Load](https://doris.apache.org/docs/dev/data-operate/import/import-way/routine-load-manual)**: This subscribes to data from Kafka message queues.
+- **[Insert Into](https://doris.apache.org/docs/dev/data-operate/import/import-way/insert-into-manual)**: This is especially useful when you try to do ETL in Doris internally, like writing data from one Doris table to another.
+
+**Batch writing**:
+
+- **[Spark Load](https://doris.apache.org/docs/dev/data-operate/import/import-way/spark-load-manual)**: With this method, you can leverage Spark resources to pre-process data from HDFS and object storage before writing to Doris.
+- **[Broker Load](https://doris.apache.org/docs/dev/data-operate/import/import-way/broker-load-manual)**: This supports the HDFS and S3 protocols.
+- `insert into <internal table> select from <external table>`: This simple statement allows you to connect Doris to various storage systems, data lakes, and databases.
+
+## Data Update
+
+For data updates, Apache Doris supports both Merge on Read and Merge on Write, the former for low-frequency batch updates and the latter for real-time writing. With Merge on Write, the latest data will be ready by the time you execute queries, and that's why it can improve your query speed by 5 to 10 times compared to Merge on Read.
+
+From an implementation perspective, these are a few common data update operations, and Doris supports them all:
+
+- **Upsert**: to replace or update a whole row
+- **Partial column update**: to update just a few columns in a row
+- **Conditional updating**: to filter out some data by combining a few conditions in order to replace or delete it
+- **Insert Overwrite**: to rewrite a table or partition
+
+In some cases, data updates happen concurrently, which means large amounts of new data arrive and try to modify existing data records, so the updating order matters a lot. That's why Doris allows you to decide the order, either by the order of transaction commit or by that of the sequence column (a column you specify in the table in advance). Doris also supports data deletion based on a specified predicate, which is how conditional updating is done.
+
+## Service Availability & Data Reliability
+
+Apart from fast performance in queries and data ingestion, Apache Doris also provides a service availability guarantee, and this is how:
+
+Architecturally, Doris has two processes: frontend and backend. Both of them are easily scalable. The frontend nodes manage the cluster and metadata, and handle user requests; the backend nodes execute the queries and are capable of auto data balancing and auto-restoration. It supports cluster upgrading and scaling to avoid interruption to services.
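The update behaviors described in the Data Update section above can be sketched in SQL. The table, columns, and values here are invented, and the property names (`enable_unique_key_merge_on_write`, `function_column.sequence_col`) are a best-effort reading of the Doris docs, so treat them as assumptions to verify:

```SQL
-- Hypothetical Unique Key table with Merge-on-Write; the sequence column
-- (here: updated_at) decides which of two concurrent writes wins.
CREATE TABLE user_tags (
    user_id    BIGINT,
    tag        VARCHAR(32),
    score      INT,
    updated_at DATETIME
)
UNIQUE KEY(user_id)
DISTRIBUTED BY HASH(user_id) BUCKETS 8
PROPERTIES (
    "enable_unique_key_merge_on_write" = "true",
    "function_column.sequence_col" = "updated_at"
);

-- Upsert: replaces the whole row for user 42 if updated_at is newer.
INSERT INTO user_tags VALUES (42, 'vip', 90, '2023-10-01 00:00:00');

-- Conditional updating via predicate-based deletion.
DELETE FROM user_tags WHERE score < 10;
```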
+
+## Cross Cluster Replication
+
+Enterprise users, especially those in finance or e-commerce, will need to back up their clusters or their entire data center, just in case of force majeure. So Doris 2.0 provides Cross Cluster Replication (CCR). With CCR, users can do a lot:
+
+- **Disaster recovery**: for quick restoration of data services
+- **Read-write separation**: master cluster + slave cluster; one for reading, one for writing
+- **Isolated upgrade of clusters**: For cluster scaling, CCR allows users to pre-create a backup cluster for a trial run so they can clear out possible incompatibility issues and bugs.
+
+Tests show that Doris CCR can reach a data latency of minutes. In the best case, it can reach the upper speed limit of the hardware environment.
+
+## Multi-Tenant Management
+
+Apache Doris has sophisticated Role-Based Access Control, and it allows fine-grained privilege control on the level of databases, tables, rows, and columns.
+
+For resource isolation, Doris used to implement a hard isolation plan, which divides the backend nodes into Resource Groups and assigns the Resource Groups to different workloads. It was simple and neat. But sometimes users cannot make the most of their computing resources because some Resource Groups sit idle.
+
+Thus, instead of Resource Groups, Doris 2.0 introduces Workload Groups. A soft limit is set for each Workload Group on how many resources it can use. When that soft limit is hit and there are idle resources available, the idle resources will be shared across the workload groups. Users can also prioritize the workload groups in terms of their access to idle resources.
+
+## Easy to Use
+
+As many capabilities as Apache Doris provides, it is also easy to use. It supports standard SQL and is compatible with the MySQL protocol and most BI tools on the market.
+
+Another effort that we've made to improve usability is a feature called Light Schema Change. It means that if users need to add or delete some columns in a table, they just need to update the metadata in the frontend and don't have to modify all the data files. Light Schema Change can be done within milliseconds. It also allows changes to indexes and to the data type of columns. The combination of Light Schema Change and the Flink-Doris-Connector enables synchronization of upstream tables within milliseconds.
+
+## Semi-Structured Data Analysis
+
+Common examples of semi-structured data include logs, observability data, and time series data. These cases require schema-free support, lower cost, and capabilities in multi-dimensional analysis and full-text search.
+
+In text analysis, people mostly use the LIKE operator, so we put a lot of effort into improving its performance, including pushing the LIKE operator down to the storage layer (to reduce data scanning), and introducing the NGram Bloomfilter, the Hyperscan regex matching library, and the Volnitsky algorithm (for sub-string matching).
+
+We have also introduced an inverted index for text tokenization. It is a powerful tool for fuzzy keyword search, full-text search, equivalence queries, and range queries.
+
+## Data Lakehouse
+
+For users to build a high-performing data lakehouse and a unified query gateway, Doris can map, cache, and auto-refresh the metadata from external sources. It supports Hive Metastore and almost all open data lakehouse formats. You can connect it to relational databases, Elasticsearch, and many other sources. And it allows you to reuse your own authentication systems, like Kerberos and Apache Ranger, on the external tables.
+
+Benchmark results show that Apache Doris is 3~5 times faster than Trino in queries on Hive tables. It is the joint result of a few features:
+
+1. Efficient query engine
+2. Hot data caching mechanism
+3. Compute nodes
+4. Views in Doris
+
+[Compute Nodes](https://doris.apache.org/docs/dev/advanced/compute-node) are a newly introduced solution in version 2.0 for data lakehousing. Unlike normal backend nodes, Compute Nodes are stateless and do not store any data. Neither are they involved in data balancing during cluster scaling. Thus, they can join the cluster flexibly and easily during computation peak times.
+
+Also, Doris allows you to write the computation results of external tables into Doris to form a view. The thinking is similar to that of materialized views: trading space for speed. After a query on external tables is executed, the results can be stored in Doris internally. When similar queries follow up, the system can directly read the results of previous queries from Doris, and that speeds things up.
+
+## Tiered Storage
+
+The main purpose of tiered storage is to save money. [Tiered storage](https://doris.apache.org/docs/dev/advanced/cold-hot-separation?_highlight=cold) means to separate hot data and cold data into different storage, with hot data being the data that is frequently accessed and cold data the data that isn't. It allows users to put hot data on quick but expensive disks (such as SSD and HDD), and cold data in object storage.
+
+Roughly speaking, for a data asset consisting of 80% cold data, tiered storage will reduce your storage cost by 70%.
+
+## The Apache Doris Community
+
+This is an overview of Apache Doris, an open-source real-time data warehouse. It is actively evolving with an agile release schedule, and the [community](https://join.slack.com/t/apachedoriscommunity/shared_invite/zt-1t3wfymur-0soNPATWQ~gbU8xutFOLog) embraces any questions, ideas, and feedback.
diff --git a/blog/log-analysis-elasticsearch-vs-apache-doris.md b/blog/log-analysis-elasticsearch-vs-apache-doris.md
new file mode 100644
index 00000000000..28c57378f85
--- /dev/null
+++ b/blog/log-analysis-elasticsearch-vs-apache-doris.md
@@ -0,0 +1,352 @@
+---
+{
+    'title': 'Log Analysis: Elasticsearch VS Apache Doris',
+    'summary': "As a major part of a company's data asset, logs bring value to businesses in three aspects: system observability, cyber security, and data analysis. They are your first resort for troubleshooting, your reference for improving system security, and your data mine where you can extract information that points to business growth.",
+    'date': '2023-09-28',
+    'author': 'Apache Doris',
+    'tags': ['Tech Sharing'],
+}
+
+---
+
+<!--
+Licensed to the Apache Software Foundation (ASF) under one
+or more contributor license agreements. See the NOTICE file
+distributed with this work for additional information
+regarding copyright ownership. The ASF licenses this file
+to you under the Apache License, Version 2.0 (the
+"License"); you may not use this file except in compliance
+with the License. You may obtain a copy of the License at
+
+  http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing,
+software distributed under the License is distributed on an
+"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+KIND, either express or implied. See the License for the
+specific language governing permissions and limitations
+under the License.
+-->
+
+As a major part of a company's data asset, logs bring value to businesses in three aspects: system observability, cyber security, and data analysis. They are your first resort for troubleshooting, your reference for improving system security, and your data mine where you can extract information that points to business growth.
+
+Logs are the sequential records of events in the computer system. If you think about how logs are generated and used, you will know what an ideal log analysis system should look like:
+
+- **It should have schema-free support.** Raw logs are unstructured free texts that are basically impossible to aggregate and calculate on, so you need to turn them into structured tables (the process is called "ETL") before putting them into a database or data warehouse for analysis. If there is a schema change, lots of complicated adjustments need to be made to the ETL pipeline and the structured tables. Therefore, semi-structured logs, mostly in JSON format, emerged. You can add or delete fields i [...]
+- **It should be low-cost.** Logs are huge and they are generated continuously. A fairly big company produces 10~100 TBs of log data. For business or compliance reasons, it should keep the logs around for half a year or longer. That means storing log data measured in petabytes, so the cost is considerable.
+- **It should be capable of real-time processing.** Logs should be written in real time, otherwise engineers won't be able to catch the latest events in troubleshooting and security tracking. Plus, a good log system should provide full-text search capabilities and respond to interactive queries quickly.
+
+## The Elasticsearch-Based Log Analysis Solution
+
+A popular log processing solution within the data industry is the **ELK stack: Elasticsearch, Logstash, and Kibana**. The pipeline can be split into five modules:
+
+- **Log collection**: Filebeat collects local log files and writes them to a Kafka message queue.
+- **Log transmission**: The Kafka message queue gathers and caches logs.
+- **Log transfer**: Logstash filters and transfers log data in Kafka.
+- **Log storage**: Logstash writes logs in JSON format into Elasticsearch for storage.
+- **Log query**: Users search for logs via Kibana visualization or send a query request via the Elasticsearch DSL API.
+
+The ELK stack has outstanding real-time processing capabilities, but frictions exist.
+
+### Inadequate Schema-Free Support
+
+The Index Mapping in Elasticsearch defines the table schema, which includes the field names, data types, and whether to enable index creation.
+
+Elasticsearch also boasts a Dynamic Mapping mechanism that automatically adds fields to the Mapping according to the input JSON data. This provides some sort of schema-free support, but it's not enough because:
+
+- Dynamic Mapping often creates too many fields when processing dirty data, which interrupts the whole system.
+- The data type of fields is immutable. To ensure compatibility, users often configure "text" as the data type, but that results in much slower query performance than binary data types such as integer.
+- The index of fields is immutable, too. Users cannot add or delete indexes for a certain field, so they often create indexes for all fields to facilitate data filtering in queries. But too many indexes require extra storage space and slow down data ingestion.
+
+### Inadequate Analytic Capability
+
+Elasticsearch has its unique Domain Specific Language (DSL), which is very different from the tech stack that most data engineers and analysts are familiar with, so there is a steep learning curve. Moreover, Elasticsearch has a relatively closed ecosystem, so there might be strong resistance in integration with BI tools. Most importantly, Elasticsearch only supports single-table analysis and lags behind the modern OLAP demands for multi-table joins, sub-queries, and views.
+
+### High Cost & Low Stability
+
+Elasticsearch users have been complaining about the computation and storage costs. The root reason lies in the way Elasticsearch works.
+
+- **Computation cost**: In data writing, Elasticsearch executes compute-intensive operations including inverted index creation, tokenization, and inverted index ranking. Under these circumstances, data is written into Elasticsearch at a speed of around 2MB/s per core. When CPU resources are tight, data writing requests often get rejected during peak times, which further leads to higher latency.
+- **Storage cost**: To speed up retrieval, Elasticsearch stores the forward indexes, inverted indexes, and docvalues of the original data, consuming a lot more storage space. The compression ratio of a single data copy is only 1.5:1, compared to the 5:1 of most log solutions.
+
+As data and cluster size grow, maintaining stability can be another issue:
+
+- **During data writing peaks**: Clusters are prone to overload during data writing peaks.
+
+- **During queries**: Since all queries are processed in memory, big queries can easily lead to JVM OOM.
+
+- **Slow recovery**: After a cluster failure, Elasticsearch has to reload indexes, which is resource-intensive, so it takes many minutes to recover. That challenges service availability guarantees.
+
+## A More Cost-Effective Option
+
+Reflecting on the strengths and limitations of the Elasticsearch-based solution, the Apache Doris developers have optimized Apache Doris for log processing.
+
+- **Increase writing throughput**: The performance of Elasticsearch is bottlenecked by data parsing and inverted index creation, so we improved Apache Doris in these respects: we quickened data parsing and index creation using SIMD and CPU vector instructions; then we removed those data structures unnecessary for log analysis scenarios, such as forward indexes, to simplify index creation.
+- **Reduce storage costs**: We removed forward indexes, which represented 30% of index data. We adopted columnar storage and the ZSTD compression algorithm, and thus achieved a compression ratio of 5:1 to 10:1. Given that a large part of the historical logs are rarely accessed, we introduced tiered storage to separate hot and cold data. Logs older than a specified time period will be moved to object storage, which is much less expensive. This can reduce storage costs by around 70%.
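To make the storage-side optimizations concrete, here is a hedged sketch of a log table combining ZSTD compression with an inverted index. The table, columns, and index name are invented, and the syntax follows my reading of the Doris 2.0 inverted index and table-property docs:

```SQL
-- Hypothetical log table: duplicate key model, ZSTD compression,
-- and an inverted index with an English tokenizer for full-text search.
CREATE TABLE nginx_logs (
    ts     DATETIME,
    status INT,
    msg    VARCHAR(1024),
    INDEX idx_msg (msg) USING INVERTED PROPERTIES ("parser" = "english")
)
DUPLICATE KEY(ts)
DISTRIBUTED BY RANDOM BUCKETS 10
PROPERTIES (
    "compression" = "ZSTD"
);

-- A full-text keyword search served by the inverted index.
SELECT count(*) FROM nginx_logs WHERE msg MATCH_ANY 'timeout error';
```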
+
+Benchmark tests with ES Rally, the official testing tool for Elasticsearch, showed that Apache Doris was around 5 times as fast as Elasticsearch in data writing, 2.3 times as fast in queries, and it consumed only 1/5 of the storage space that Elasticsearch used. On the test dataset of HTTP logs, it achieved a writing speed of 550 MB/s and a compression ratio of 10:1.
+
+The figure below shows what a typical Doris-based log processing system looks like. It is more inclusive and allows for more flexible usage across data ingestion, analysis, and application:
+
+- **Ingestion**: Apache Doris supports various ingestion methods for log data. You can push logs to Doris via HTTP Output using Logstash, you can use Flink to pre-process the logs before you write them into Doris, or you can load logs from Kafka or object storage to Doris via Routine Load and S3 Load.
+- **Analysis**: You can put log data in Doris and conduct join queries across logs and other data in the data warehouse.
+- **Application**: Apache Doris is compatible with the MySQL protocol, so you can integrate a wide variety of data analytic tools and clients with Doris, such as Grafana and Tableau. You can also connect applications to Doris via the JDBC and ODBC APIs. We are planning to build a Kibana-like system to visualize logs.
+
+Moreover, Apache Doris has better schema-free support and a more user-friendly analytic engine.
+
+### Native Support for Semi-Structured Data
+
+**Firstly, we worked on the data types.** We optimized string search and regular expression matching for "text" through vectorization and brought a performance increase of 2~10 times. For JSON strings, Apache Doris parses and stores them in a more compact and efficient binary format, which can speed up queries by 4 times. We also added a new data type for complex data: Array Map. It can structure concatenated strings to allow for a higher compression rate and faster queries.
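The "parse JSON once at write time, then query an already-typed form" idea can be illustrated with a small self-contained sketch. This is plain Python standing in for a storage engine, not Doris code; the speedup it demonstrates is the general principle, not the 4x figure quoted above.

```python
import json
import time

# Illustrative only: shows why parsing JSON once at ingestion and querying
# an already-parsed, typed form beats re-parsing raw strings on every query.
raw_rows = ['{"status": %d, "size": %d}' % (200 + i % 5, i)
            for i in range(100_000)]

# Schemaless-string approach: every query re-parses every row.
t0 = time.perf_counter()
hits_raw = sum(1 for r in raw_rows if json.loads(r)["status"] == 200)
reparse_time = time.perf_counter() - t0

# Parse-at-ingestion approach: pay the parsing cost once, at write time.
table = [json.loads(r) for r in raw_rows]
t0 = time.perf_counter()
hits_typed = sum(1 for row in table if row["status"] == 200)
query_time = time.perf_counter() - t0

print(f"re-parse per query: {reparse_time:.3f}s, pre-parsed: {query_time:.3f}s")
```

A real engine goes further (binary layout, columnar access, vectorized filters), but the write-time/query-time trade-off is the same.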
+
+**Secondly, Apache Doris supports schema evolution.** This means you can adjust the schema as your business changes. You can add or delete fields and indexes, and change the data types of fields.
+
+Apache Doris provides Light Schema Change capabilities, so you can add or delete fields within milliseconds:
+
+```SQL
+-- Add a column. Result will be returned in milliseconds.
+ALTER TABLE lineitem ADD COLUMN l_new_column INT;
+```
+
+You can also add indexes only for your target fields, so you can avoid the overhead of unnecessary index creation. After you add an index, the system will by default generate the index for all incremental data, and you can specify which historical data partitions need the index.
+
+```SQL
+-- Add inverted index. Doris will generate inverted index for all new data afterward.
+ALTER TABLE table_name ADD INDEX index_name(column_name) USING INVERTED;
+
+-- Build index for the specified historical data partitions.
+BUILD INDEX index_name ON table_name PARTITIONS(partition_name1, partition_name2);
+```
+
+### SQL-Based Analytic Engine
+
+The SQL-based analytic engine makes sure that data engineers and analysts can smoothly grasp Apache Doris in a short time and bring their experience with SQL to this OLAP engine. Building on the rich features of SQL, users can execute data retrieval, aggregation, multi-table join, sub-query, UDF, logical views, and materialized views to serve their own needs.
+
+With MySQL compatibility, Apache Doris can be integrated with most GUI and BI tools in the big data ecosystem, so users can realize more complex and diversified data analysis.
+
+### Performance in Use Case
+
+A gaming company has transitioned from the ELK stack to the Apache Doris solution. Their Doris-based log system used 1/6 of the storage space that they previously needed.
+
+A cybersecurity company built their log analysis system utilizing the inverted index in Apache Doris. They supported a data writing speed of 300,000 rows per second with 1/5 of the server resources that they formerly used.
+
+## Hands-On Guide
+
+Now let's go through the three steps of building a log analysis system with Apache Doris.
+
+Before you start, [download](https://doris.apache.org/download/) Apache Doris 2.0 or a newer version from the website and [deploy](https://doris.apache.org/docs/dev/install/standard-deployment/) clusters.
+
+### Step 1: Create Tables
+
+This is an example of table creation.
+
+Explanations for the configurations:
+
+- The DATETIMEV2 time field is specified as the Key in order to speed up queries for the latest N log records.
+- Indexes are created for the frequently accessed fields, and fields that require full-text search are specified with Parser parameters.
+- "PARTITION BY RANGE" means to partition the data by RANGE based on time fields. [Dynamic Partition](https://doris.apache.org/docs/dev/advanced/partition/dynamic-partition/) is enabled for auto-management.
+- "DISTRIBUTED BY RANDOM BUCKETS AUTO" means to distribute the data into buckets randomly, and the system will automatically decide the number of buckets based on the cluster size and data volume.
+- "log_policy_1day" and "log_s3" mean that logs older than 1 day are moved to S3 storage.
+
+```SQL
+CREATE DATABASE log_db;
+USE log_db;
+
+CREATE RESOURCE "log_s3"
+PROPERTIES
+(
+    "type" = "s3",
+    "s3.endpoint" = "your_endpoint_url",
+    "s3.region" = "your_region",
+    "s3.bucket" = "your_bucket",
+    "s3.root.path" = "your_path",
+    "s3.access_key" = "your_ak",
+    "s3.secret_key" = "your_sk"
+);
+
+CREATE STORAGE POLICY log_policy_1day
+PROPERTIES(
+    "storage_resource" = "log_s3",
+    "cooldown_ttl" = "86400"
+);
+
+CREATE TABLE log_table
+(
+    `ts` DATETIMEV2,
+    `clientip` VARCHAR(20),
+    `request` TEXT,
+    `status` INT,
+    `size` INT,
+    INDEX idx_size (`size`) USING INVERTED,
+    INDEX idx_status (`status`) USING INVERTED,
+    INDEX idx_clientip (`clientip`) USING INVERTED,
+    INDEX idx_request (`request`) USING INVERTED PROPERTIES("parser" = "english")
+)
+ENGINE = OLAP
+DUPLICATE KEY(`ts`)
+PARTITION BY RANGE(`ts`) ()
+DISTRIBUTED BY RANDOM BUCKETS AUTO
+PROPERTIES (
+"replication_num" = "1",
+"storage_policy" = "log_policy_1day",
+"deprecated_dynamic_schema" = "true",
+"dynamic_partition.enable" = "true",
+"dynamic_partition.time_unit" = "DAY",
+"dynamic_partition.start" = "-3",
+"dynamic_partition.end" = "7",
+"dynamic_partition.prefix" = "p",
+"dynamic_partition.buckets" = "AUTO",
+"dynamic_partition.replication_num" = "1"
+);
+```
+
+### Step 2: Ingest the Logs
+
+Apache Doris supports various ingestion methods. For real-time logs, we recommend the following three methods:
+
+- Pull logs from a Kafka message queue: Routine Load
+- Logstash: write logs into Doris via HTTP API
+- Self-defined writing program: write logs into Doris via HTTP API
+
+**Ingest from Kafka**
+
+For JSON logs that are written into Kafka message queues, create a [Routine Load](https://doris.apache.org/docs/dev/data-operate/import/import-way/routine-load-manual/) so Doris will pull data from Kafka. The following is an example. The `property.*` configurations are optional:
+
+```SQL
+-- Prepare the Kafka cluster and topic ("log_topic")
+
+-- Create Routine Load, load data from Kafka log_topic to "log_table"
+CREATE ROUTINE LOAD load_log_kafka ON log_db.log_table
+COLUMNS(ts, clientip, request, status, size)
+PROPERTIES (
+    "max_batch_interval" = "10",
+    "max_batch_rows" = "1000000",
+    "max_batch_size" = "109715200",
+    "strict_mode" = "false",
+    "format" = "json"
+)
+FROM KAFKA (
+    "kafka_broker_list" = "host:port",
+    "kafka_topic" = "log_topic",
+    "property.group.id" = "your_group_id",
+    "property.security.protocol"="SASL_PLAINTEXT",
+    "property.sasl.mechanism"="GSSAPI",
+    "property.sasl.kerberos.service.name"="kafka",
+    "property.sasl.kerberos.keytab"="/path/to/xxx.keytab",
+    "property.sasl.kerberos.principal"="x...@yyy.com"
+);
+```
+
+You can check how the Routine Load runs via the `SHOW ROUTINE LOAD` command.
+
+**Ingest via Logstash**
+
+Configure HTTP Output for Logstash, and data will be sent to Doris via HTTP Stream Load.
+
+1. Specify the batch size and batch delay in `logstash.yml` to improve data writing performance.
+
+```Plain
+pipeline.batch.size: 100000
+pipeline.batch.delay: 10000
+```
+
+2. Add HTTP Output to the log collection configuration file `testlog.conf`, with the URL set to the Stream Load address in Doris.
+
+- Since Logstash does not support HTTP redirection, you should use a backend (BE) address instead of a FE address.
+- Authorization in the headers is `http basic auth`. It is computed with `echo -n 'username:password' | base64`.
+- The `load_to_single_tablet` in the headers can reduce the number of small files in data ingestion.
+
+```Plain
+output {
+    http {
+        follow_redirects => true
+        keepalive => false
+        http_method => "put"
+        url => "http://172.21.0.5:8640/api/logdb/logtable/_stream_load"
+        headers => [
+            "format", "json",
+            "strip_outer_array", "true",
+            "load_to_single_tablet", "true",
+            "Authorization", "Basic cm9vdDo=",
+            "Expect", "100-continue"
+        ]
+        format => "json_batch"
+    }
+}
+```
+
+**Ingest via a self-defined program**
+
+This is an example of ingesting data into Doris via HTTP Stream Load.
+
+Notes:
+
+- Use `basic auth` for HTTP authorization, computed with `echo -n 'username:password' | base64`
+- `http header "format:json"`: the data type is specified as JSON
+- `http header "read_json_by_line:true"`: each line is a JSON record
+- `http header "load_to_single_tablet:true"`: write to one tablet each time
+- For the data writing clients, we recommend a batch size of 100MB~1GB. Future versions will enable Group Commit on the server end and reduce the batch size from clients.
+
+```Bash
+curl \
+--location-trusted \
+-u username:password \
+-H "format:json" \
+-H "read_json_by_line:true" \
+-H "load_to_single_tablet:true" \
+-T logfile.json \
+http://fe_host:fe_http_port/api/log_db/log_table/_stream_load
+```
+
+### Step 3: Execute Queries
+
+Apache Doris supports standard SQL, so you can connect to Doris via a MySQL client or JDBC and then execute SQL queries.
+
+```Bash
+mysql -h fe_host -P fe_mysql_port -u root -Dlog_db
+```
+
+**A few common queries in log analysis:**
+
+- Check the latest 10 records.
+
+```SQL
+SELECT * FROM log_table ORDER BY ts DESC LIMIT 10;
+```
+
+- Check the latest 10 records of Client IP "8.8.8.8".
+
+```SQL
+SELECT * FROM log_table WHERE clientip = '8.8.8.8' ORDER BY ts DESC LIMIT 10;
+```
+
+- Retrieve the latest 10 records with "error" or "404" in the "request" field. **MATCH_ANY** is a SQL syntax keyword for full-text search in Doris. It means to find the records that include any one of the specified keywords.
+
+```SQL
+SELECT * FROM log_table WHERE request MATCH_ANY 'error 404' ORDER BY ts DESC LIMIT 10;
+```
+
+- Retrieve the latest 10 records with both "image" and "faq" in the "request" field. **MATCH_ALL** is also a SQL syntax keyword for full-text search in Doris. It means to find the records that include all of the specified keywords.
+
+```SQL
+SELECT * FROM log_table WHERE request MATCH_ALL 'image faq' ORDER BY ts DESC LIMIT 10;
+```
+
+## Conclusion
+
+If you are looking for an efficient log analytic solution, Apache Doris is friendly to anyone equipped with SQL knowledge; if you find friction with the ELK stack, try Apache Doris, which provides better schema-free support, enables faster data writing and queries, and brings much less storage burden.
+
+But we won't stop here. We are going to provide more features to facilitate log analysis. We plan to add more complicated data types to the inverted index, and support the BKD index to make Apache Doris a fit for geo data analysis. We also plan to expand capabilities in semi-structured data analysis, such as working on complex data types (Array, Map, Struct, JSON) and high-performance string matching algorithms. And we welcome any [user feedback and development advice](https://t.co/ZxJuNJHXb2).
diff --git a/blog/migrating-from-clickhouse-to-apache-doris-what-happened.md b/blog/migrating-from-clickhouse-to-apache-doris-what-happened.md
new file mode 100644
index 00000000000..d5e63c95caf
--- /dev/null
+++ b/blog/migrating-from-clickhouse-to-apache-doris-what-happened.md
@@ -0,0 +1,161 @@
+---
+{
+    'title': 'Migrating from ClickHouse to Apache Doris: What Happened?',
+    'summary': "A user of Apache Doris has written down their migration process from ClickHouse to Doris, including why they need the change, what needs to be taken care of, and how they compare the performance of the two databases in their environment.",
+    'date': '2023-10-11',
+    'author': 'Chuang Li',
+    'tags': ['Best Practice'],
+}
+
+---
+
+<!--
+Licensed to the Apache Software Foundation (ASF) under one
+or more contributor license agreements. See the NOTICE file
+distributed with this work for additional information
+regarding copyright ownership. The ASF licenses this file
+to you under the Apache License, Version 2.0 (the
+"License"); you may not use this file except in compliance
+with the License. You may obtain a copy of the License at
+
+  http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing,
+software distributed under the License is distributed on an
+"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+KIND, either express or implied. See the License for the
+specific language governing permissions and limitations
+under the License.
+-->
+
+Migrating from one OLAP database to another is a huge undertaking. Even if you're unhappy with your current data tool and have found a promising candidate, you might still hesitate to perform major surgery on your data architecture, because you're uncertain about how things are going to work. So you need the experience of someone who has walked the path.
+
+Luckily, a user of Apache Doris has written down their migration process from ClickHouse to Doris, including why they need the change, what needs to be taken care of, and how they compare the performance of the two databases in their environment.
+
+To decide whether you want to continue reading, check if you tick one of the following boxes:
+
+- You need your join queries to be executed faster.
+- You need flexible data updates.
+- You need real-time data analysis.
+- You need to minimize your components.
+
+If you do, this post might be of some help to you.
+
+## Replacing Kylin, ClickHouse, and Druid with Apache Doris
+
+The user undergoing this change is an e-commerce SaaS provider.
+Its data system serves real-time and offline reporting, customer segmentation, and log analysis. Initially, they used different OLAP engines for these various purposes:
+
+- **Apache Kylin for offline reporting**: The system provides offline reporting services for over 5 million sellers. The big ones among them have more than 10 million registered members and 100,000 SKUs, and the detailed information is put into over 400 data cubes on the platform.
+- **ClickHouse for customer segmentation and Top-N log queries**: This entails high-frequency updates, high QPS, and complicated SQL.
+- **Apache Druid for real-time reporting**: Sellers extract the data they need by combining different dimensions, and such real-time reporting requires quick data updates, quick query response, and strong stability of the system.
+
+The three components have their own sore spots.
+
+- **Apache Kylin** runs well with a fixed table schema, but every time you want to add a dimension, you need to create a new data cube and refill the historical data in it.
+- **ClickHouse** is not designed for multi-table processing, so you might need an extra solution for federated queries and multi-table join queries. And in this case, it was below expectations in high-concurrency scenarios.
+- **Apache Druid** implements idempotent writing, so it does not support data updating or deletion itself. That means when there is something wrong at the upstream, you will need a full data replacement. And such data fixing is a multi-step process if you think it all the way through, because of all the data backups and movements. Plus, newly ingested data will not be accessible for queries until it is put in segments in Druid. That means a longer window such that data inconsistency betwe [...]
+
+As they work together, this architecture might be too demanding to navigate because it requires knowledge of all these components in terms of development, monitoring, and maintenance.
+Also, every time the user scales a cluster, they must stop the current cluster and migrate all databases and tables, which is not only a big undertaking but also a huge interruption to business.
+
+Apache Doris fills these gaps.
+
+- **Query performance**: Doris is good at high-concurrency queries and join queries, and it is now equipped with inverted index to speed up searches in logs.
+- **Data update**: The Unique Key model of Doris supports both large-volume updates and high-frequency real-time writing, and the Duplicate Key model and Unique Key model support partial column update. It also provides an exactly-once guarantee in data writing and ensures consistency between base tables, materialized views, and replicas.
+- **Maintenance**: Doris is MySQL-compatible. It supports easy scaling and light schema change. It comes with its own integration tools such as Flink-Doris-Connector and Spark-Doris-Connector.
+
+So they planned the migration.
+
+## The Replacement Surgery
+
+ClickHouse was the main performance bottleneck in the old data architecture, and the reason the user wanted the change in the first place, so they started with ClickHouse.
+
+### Changes in SQL statements
+
+**Table creation statements**
+
+The user built their own SQL rewriting tool that can convert a ClickHouse table creation statement into a Doris table creation statement. The tool can automate the following changes:
+
+- **Mapping the field types**: It converts ClickHouse field types into the corresponding ones in Doris. For example, it converts String as a Key into Varchar, and String as a partitioning field into Date V2.
+- **Setting the number of historical partitions in dynamic partitioning tables**: Some tables have historical partitions, and the number of partitions should be specified upon table creation in Doris, otherwise a "No Partition" error will be thrown.
+- **Determining the number of buckets**: It decides the number of buckets based on the data volume of historical partitions; for non-partitioned tables, it decides the bucketing configurations based on the historical data volume.
+- **Determining TTL**: It decides the time to live of partitions in dynamic partitioning tables.
+- **Setting the import sequence**: For the Unique Key model of Doris, it can specify the data import order based on the Sequence column to ensure orderliness in data ingestion.
+
+**Query statements**
+
+Similarly, they have their own tool to transform ClickHouse query statements into Doris query statements. This is to prepare for the comparison test between ClickHouse and Doris. The key considerations in the conversions include:
+
+- **Conversion of table names**: This is simple given the mapping rules in table creation statements.
+- **Conversion of functions**: For example, the `COUNTIF` function in ClickHouse is equivalent to `SUM(CASE WHEN_THEN 1 ELSE 0)`, `Array Join` is equivalent to `Explode` and `Lateral View`, and `ORDER BY` and `GROUP BY` should be converted to window functions.
+- **Difference in semantics**: ClickHouse goes by its own protocol while Doris is MySQL-compatible, so subqueries need aliases. In this use case, subqueries are common in customer segmentation, so they use `sqlparse`.
+
+### Changes in data ingestion methods
+
+Apache Doris provides a broad range of data writing methods. For the real-time link, the user adopts Stream Load to ingest data from NSQ and Kafka.
+
+For the sizable offline data, the user tested different methods, and here are the takeaways:
+
+1. **Insert Into**
+
+Using Multi-Catalog to read external data sources and ingesting with Insert Into can serve most needs in this use case.
+
+2. **Stream Load**
+
+The Spark-Doris-Connector is a more general method. It can handle large data volumes and ensure writing stability.
+The key is to find the right writing pace and parallelism.
+
+The Spark-Doris-Connector also supports Bitmap. It allows you to offload the computation workload of Bitmap data to Spark clusters.
+
+Both the Spark-Doris-Connector and the Flink-Doris-Connector rely on Stream Load. CSV is the recommended format choice. Tests on the user's billions of rows showed that CSV was 40% faster than JSON.
+
+3. **Spark Load**
+
+The Spark Load method utilizes Spark resources for data shuffling and sorting. The computation results are put in HDFS, and then Doris reads the files from HDFS directly (via Broker Load). This approach is ideal for huge data ingestion. The more data there is, the faster and more resource-efficient the ingestion is.
+
+## Pressure Test
+
+The user compared the performance of the two databases in their SQL and join query scenarios, and measured the CPU and memory consumption of Apache Doris.
+
+### SQL query performance
+
+Apache Doris outperformed ClickHouse in 10 of the 16 SQL queries, and the biggest performance gap was a factor of almost 30. Overall, Apache Doris was 2~3 times faster than ClickHouse.
+
+### Join query performance
+
+For join query tests, the user used different sizes of primary tables and dimension tables.
+
+- **Primary tables**: user activity table (4 billion rows), user attribute table (25 billion rows), and user attribute table (96 billion rows)
+- **Dimension tables**: 1 million rows, 10 million rows, 50 million rows, 100 million rows, 500 million rows, 1 billion rows, and 2.5 billion rows.
+
+The tests include **full join queries** and **filtering join queries**. Full join queries join all rows of the primary table and dimension tables, while filtering join queries retrieve the data of a certain seller ID with a `WHERE` filter. The results are summarized as follows:
+
+**Primary table (4 billion rows):**
+
+- Full join queries: Doris outperformed ClickHouse in full join queries with all dimension tables. The performance gap widened as the dimension tables got larger. The largest difference was a factor of 5.
+- Filtering join queries: Based on the seller ID, the filter screened out 41 million rows from the primary table. With small dimension tables, Doris was 2~3 times faster than ClickHouse; with large dimension tables, Doris was over 10 times faster; with dimension tables larger than 100 million rows, ClickHouse threw an OOM error while Doris functioned normally.
+
+**Primary table (25 billion rows):**
+
+- Full join queries: Doris outperformed ClickHouse in full join queries with all dimension tables. ClickHouse produced an OOM error with dimension tables larger than 50 million rows.
+- Filtering join queries: The filter screened out 570 million rows from the primary table. Doris responded within seconds, while ClickHouse finished within minutes and broke down when joining large dimension tables.
+
+**Primary table (96 billion rows):**
+
+Doris delivered relatively quick performance in all queries, while ClickHouse was unable to execute all of them.
+
+In terms of CPU and memory consumption, Apache Doris maintained stable cluster loads in all sizes of join queries.
+
+## Future Directions
+
+As the migration goes on, the user works closely with the [Doris community](https://join.slack.com/t/apachedoriscommunity/shared_invite/zt-1t3wfymur-0soNPATWQ~gbU8xutFOLog), and their feedback has contributed to the making of [Apache Doris 2.0.0](https://doris.apache.org/docs/dev/releasenotes/release-2.0.0/). We will continue assisting them in their migration from Kylin and Druid to Doris, and we look forward to seeing their Doris-based unified data platform come into being.
diff --git a/blog/release-note-2.0.0.md b/blog/release-note-2.0.0.md index 60fed524577..626f2ba7f50 100644 --- a/blog/release-note-2.0.0.md +++ b/blog/release-note-2.0.0.md @@ -188,9 +188,9 @@ To separate storage and computation is a way to realize elastic scaling of resou Apache Doris 2.0 provides two solutions to address the needs of the first two types of users. 1. **Compute nodes**. We introduced stateless compute nodes in version 2.0. Unlike the mix nodes, the compute nodes do not save any data and are not involved in workload balancing of data tablets during cluster scaling. Thus, they are able to quickly join the cluster and share the computing pressure during peak times. In addition, in data lakehouse analysis, these nodes will be the first ones to execute queries on remote storage (HDFS/S3) so there will be no resource competition between [...] - 1. Doc: https://doris.apache.org/docs/dev/advanced/compute_node/ + 1. Doc: https://doris.apache.org/docs/dev/advanced/compute-node 2. **Hot-cold data separation**. Hot/cold data refers to data that is frequently/seldom accessed, respectively. Generally, it makes more sense to store cold data in low-cost storage. Older versions of Apache Doris support lifecycle management of table partitions: As hot data cooled down, it would be moved from SSD to HDD. However, data was stored with multiple replicas on HDD, which was still a waste. Now, in Apache Doris 2.0, cold data can be stored in object storage, which is even chea [...] - 1. Read more: https://doris.apache.org/blog/Tiered-Storage-for-Hot-and-Cold-Data:-What,-Why,-and-How?/ + 1. Read more: https://doris.apache.org/blog/Tiered-Storage-for-Hot-and-Cold-Data-What-Why-and-How For neater separate of computation and storage, the VeloDB team is going to contribute the Cloud Compute-Storage-Separation solution to the Apache Doris project. The performance and stability of it has stood the test of hundreds of companies in their production environment. 
The merging of code will be finished by October this year, and all Apache Doris users will be able to get an early taste of it in September.