This is an automated email from the ASF dual-hosted git repository. luzhijing pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/doris-website.git
The following commit(s) were added to refs/heads/master by this push: new cf0e7865b2a (new) update url and pics (#312) cf0e7865b2a is described below commit cf0e7865b2aefce189e81f3727e947d8ccf0ac89 Author: Hu Yanjun <100749531+httpshir...@users.noreply.github.com> AuthorDate: Sat Oct 7 11:22:06 2023 +0800 (new) update url and pics (#312) --- ...chitecture-for-40-Percent-Faster-Performance.md | 97 ++++++++++++++++++++++ 1 file changed, 97 insertions(+) diff --git a/blog/Zipping-up-the-Lambda-Architecture-for-40-Percent-Faster-Performance.md b/blog/Zipping-up-the-Lambda-Architecture-for-40-Percent-Faster-Performance.md new file mode 100644 index 00000000000..d38a2f7902b --- /dev/null +++ b/blog/Zipping-up-the-Lambda-Architecture-for-40-Percent-Faster-Performance.md @@ -0,0 +1,97 @@ +--- +{ + 'title': 'Zipping up the Lambda Architecture for 40% Faster Performance', + 'summary': "Instead of pooling real-time and offline data after they are fully ready for queries, Douyu engineers use Apache Doris to share part of the pre-query computation burden.", + 'date': '2023-05-05', + 'author': 'Tongyang Han', + 'tags': ['Best Practice'], +} + +--- + +<!-- +Licensed to the Apache Software Foundation (ASF) under one +or more contributor license agreements. See the NOTICE file +distributed with this work for additional information +regarding copyright ownership. The ASF licenses this file +to you under the Apache License, Version 2.0 (the +"License"); you may not use this file except in compliance +with the License. You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, +software distributed under the License is distributed on an +"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +KIND, either express or implied. See the License for the +specific language governing permissions and limitations +under the License. +--> + +Author: Tongyang Han, Senior Data Engineer at Douyu + +The Lambda architecture has been common practice in big data processing. The concept is to separate stream (real time data) and batch (offline data) processing, and that's exactly what we did. These two types of data of ours were processed in two isolated tubes before they were pooled together and ready for searches and queries. + + + +Then we run into a few problems: + +1. **Isolation of real-time and offline data warehouses** + 1. I know this is kind of the essence of Lambda architecture, but that means we could not reuse real-time data since it was not layered as offline data, so further customized development was required. +2. **Complex Pipeline from Data Sources to Data Application** + 1. Data had to go through multi-step processing before it reached our data users. As our architecture involved too many components, navigating and maintaining these tech stacks was a lot of work. +3. **Lack of management of real-time data sources** + 1. In extreme cases, this worked like a data silo and we had no way to find out whether the ingested data was duplicated or reusable. + +So we decided to "zip up" the Lambda architecture a little bit. By "zipping up", I mean to introduce an OLAP engine that is capable of processing, storing, and analyzing data, so real-time data and offline data converge a little earlier than they used to. It is not a revolution of Lambda, but a minor change in the choice of components, which made our real-time data processing 40% faster. + +## **Zipping up Lambda Architecture** + +I am going to elaborate on how this is done using our data tagging process as an example. + +Previously, our offline tags were produced by the data warehouse, put into a flat table, and then written in **HBase**, while real-time tags were produced by **Flink**, and put into **HBase** directly. Then **Spark** would work as the computing engine. + + + +The problem with this stemmed from the low computation efficiency of **Flink** and **Spark**. + +- **Real-time tag production**: When computing real-time tags that involve data within a long time range, Flink did not deliver stable performance and consumed more resources than expected. And when a task failed, it would take a really long time for checkpoint recovery. +- **Tag query**: As a tag query engine, Spark could be slow. + +As a solution, we replaced **HBase** and **Spark** with **Apache Doris**, a real-time analytic database, and moved part of the computational logic of the foregoing wide-time-range real-time tags from **Flink** to **Apache Doris**. + + + +Instead of putting our flat tables in HBase, we place them in Apache Doris. These tables are divided into partitions based on time sensitivity. Offline tags will be updated daily while real-time tags will be updated in real time. We organize these tables in the Aggregate Model of Apache Doris, which allows partial update of data. + +Instead of using Spark for queries, we parse the query rules into SQL for execution in Apache Doris. For pattern matching, we use Redis to cache the hot data from Apache Doris, so the system can respond to such queries much faster. + + + +## **Computational Pipeline of Wide-Time-Range Real-Time Tags** + +In some cases, the computation of wide-time-range real-time tags entails the aggregation of historical (offline) data with real-time data. The following figure shows our old computational pipeline for these tags. + + + +As you can see, it required multiple tasks to finish computing one real-time tag. Also, in complicated aggregations that involve a collection of aggregation operations, any improper resource allocation could lead to back pressure or waste of resources. This adds to the difficulty of task scheduling. The maintenance and stability guarantee of such a long pipeline could be an issue, too. + +To improve on that, we decided to move such aggregation workload to Apache Doris. + + + +We have around 400 million customer tags in our system, and each customer is attached with over 300 tags. We divide customers into more than 10,000 groups, and we have to update 5000 of them on a daily basis. The above improvement has sped up the computation of our wide-time-range real-time queries by **40%**. + +## Overwrite + +To atomically replace data tables and partitions in Apache Doris, we customized the [Doris-Spark-Connector](https://github.com/apache/doris-spark-connector), and added an "Overwrite" mode to the Connector. + +When a Spark job is submitted, Apache Doris will call an interface to fetch information of the data tables and partitions. + +- If it is a non-partitioned table, we create a temporary table for the target table, ingest data into it, and then perform atomic replacement. If the data ingestion fails, we clear the temporary table; +- If it is a dynamic partitioned table, we create a temporary partition for the target partition, ingest data into it, and then perform atomic replacement. If the data ingestion fails, we clear the temporary partition; +- If it is a non-dynamic partitioned table, we need to extend the Doris-Spark-Connector parameter configuration first. Then we create a temporary partition and take steps as above. + +## Conclusion + +One prominent advantage of Lambda architecture is the stability it provides. However, in our practice, the processing of real-time data and offline data sometimes intertwines. For example, the computation of certain real-time tags requires historical (offline) data. Such interaction becomes a root cause of instability. Thus, instead of pooling real-time and offline data after they are fully ready for queries, we use an OLAP engine to share part of the pre-query computation burden and mak [...] --------------------------------------------------------------------- To unsubscribe, e-mail: commits-unsubscr...@doris.apache.org For additional commands, e-mail: commits-h...@doris.apache.org