This is an automated email from the ASF dual-hosted git repository.

luzhijing pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/doris-website.git
The following commit(s) were added to refs/heads/master by this push: new 3ab5803fd86 Revert "update url and pics (#310)" (#311) 3ab5803fd86 is described below commit 3ab5803fd8619c5a7278f4085e7aeca07243f2c3 Author: Hu Yanjun <100749531+httpshir...@users.noreply.github.com> AuthorDate: Sat Oct 7 11:21:51 2023 +0800 Revert "update url and pics (#310)" (#311) This reverts commit f918c6a355606a32157e537139653fc7ce18a051. --- ...e-Found-the-Replacement-for-Druid.md => 360.md} | 16 ++-- ...ta-Compaction-in-3-Minutes.md => Compaction.md} | 10 +-- ...ehouse-10X-Performance.md => Data Lakehouse.md} | 24 +++--- ...echanism-of-Your-Database.md => Data_Update.md} | 8 +- ...ture-for-40%-Faster-Performance.md => Douyu.md} | 12 +-- ...e-the-Most-out-of-Your-Data.md => Duyansoft.md} | 14 +-- ...-MySQL-Database-for-Data-Analysis.md => FDC.md} | 8 +- ...t-and-Cold-Data-What-Why-and-How.md => HCDS.md} | 28 +++--- ...a-High-Performing-Risk-Data-Mart.md => HYXJ.md} | 14 +-- ...currency-by-20-Times.md => High_concurrency.md} | 12 +-- ...ive-Than-Elasticsearch.md => Inverted Index.md} | 16 ++-- ...Per-Day-and-Keep-Big-Queries-Within-1-Second.md | 94 --------------------- ...dbye-to-OOM-Crashes.md => Memory_Management.md} | 20 ++--- ...r-Traditional-Industry.md => Midland Realty.md} | 16 ++-- ...r-BI-Engineer-We-Need-Fast-Joins.md => Moka.md} | 12 +-- ...hboards-Without-Causing-a-Mess.md => Pingan.md} | 8 +- ...mple-but-Solid-Data-Architecture.md => Poly.md} | 6 +- ...ckHouse-to-Apache-Doris.md => Tencent Music.md} | 30 +++---- blog/Tencent-LLM.md | 18 ++-- ...stgreSQL-with-Apache-Doris.md => Tianyancha.md} | 24 +++--- ...k-Management-What-to-Consider.md => Xingyun.md} | 6 +- ...st-Data-Queries-Are-Implemented.md => Zhihu.md} | 18 ++-- blog/release-note-2.0.0.md | 26 +++--- ...pache-Doris-with-AI-chatbots.md => scenario.md} | 4 +- static/images/Unicom-1.png | Bin 239216 -> 0 bytes static/images/Unicom-2.png | Bin 200516 -> 0 bytes 26 files changed, 172 insertions(+), 272 deletions(-) diff --git a/blog/AB-Testing-was-a-Handful-Until-we-Found-the-Replacement-for-Druid.md b/blog/360.md similarity index 94% rename from blog/AB-Testing-was-a-Handful-Until-we-Found-the-Replacement-for-Druid.md rename to blog/360.md index 5b00fb1e943..2a33dbdafd0 100644 --- a/blog/AB-Testing-was-a-Handful-Until-we-Found-the-Replacement-for-Druid.md +++ b/blog/360.md @@ -39,7 +39,7 @@ Let me show you our long-term struggle with our previous Druid-based data platfo This was our real-time datawarehouse, where Apache Storm was the real-time data processing engine and Apache Druid pre-aggregated the data. However, Druid did not support certain paging and join queries, so we wrote data from Druid to MySQL regularly, making MySQL the "materialized view" of Druid. But that was only a duct tape solution as it couldn't support our ever enlarging real-time data size. So data timeliness was unattainable. - + ## Platform Architecture 2.0 @@ -47,7 +47,7 @@ This was our real-time datawarehouse, where Apache Storm was the real-time data This time, we replaced Storm with Flink, and MySQL with TiDB. Flink was more powerful in terms of semantics and features, while TiDB, with its distributed capability, was more maintainable than MySQL. But architecture 2.0 was nowhere near our goal of end-to-end data consistency, either, because when processing huge data, enabling TiDB transactions largely slowed down data writing. Plus, Druid itself did not support standard SQL, so there were some learning costs and frictions in usage. 
- + ## Platform Architecture 3.0 @@ -55,7 +55,7 @@ This time, we replaced Storm with Flink, and MySQL with TiDB. Flink was more pow We replaced Apache Druid with Apache Doris as the OLAP engine, which could also serve as a unified data serving gateway. So in Architecture 3.0, we only need to maintain one set of query logic. And we layered our real-time datawarehouse to increase reusability of real-time data. - + Turns out the combination of Flink and Doris was the answer. We can exploit their features to realize quick computation and data consistency. Keep reading and see how we make it happen. @@ -67,7 +67,7 @@ Then we tried moving part of such workload to the computation engine. So we trie Our third shot was to aggregate data locally in Flink right after we split it. As is shown below, we create a window in the memory of one operator for local aggregation; then we further aggregate it using the global hash windows. Since two operators chained together are in one thread, transferring data between operators consumes much less network resources. **The two-step aggregation method, combined with the** **[Aggregate model](https://doris.apache.org/docs/dev/data-table/data-model)* [...] - + For convenience in A/B testing, we make the test tag ID the first sorted field in Apache Doris, so we can quickly locate the target data using sorted indexes. To further minimize data processing in queries, we create materialized views with the frequently used dimensions. With constant modification and updates, the materialized views are applicable in 80% of our queries. @@ -83,7 +83,7 @@ To ensure end-to-end data integrity, we developed a Sink-to-Doris component. It It is the result of our long-term evolution. We used to ensure data consistency by implementing "one writing for one tag ID". Then we realized we could make good use of the transactions in Apache Doris and the two-stage commit of Apache Flink. - + As is shown above, this is how two-stage commit works to guarantee data consistency: @@ -98,17 +98,17 @@ We make it possible to split a single checkpoint into multiple transactions, so This is how we implement Sink-to-Doris. The component has blocked API calls and topology assembly. With simple configuration, we can write data into Apache Doris via Stream Load. - + ### Cluster Monitoring For cluster and host monitoring, we adopted the metrics templates provided by the Apache Doris community. For data monitoring, in addition to the template metrics, we added Stream Load request numbers and loading rates. - + Other metrics of our concerns include data writing speed and task processing time. In the case of anomalies, we will receive notifications in the form of phone calls, messages, and emails. - + ## Key Takeaways diff --git a/blog/Understanding-Data-Compaction-in-3-Minutes.md b/blog/Compaction.md similarity index 97% rename from blog/Understanding-Data-Compaction-in-3-Minutes.md rename to blog/Compaction.md index 0bfaccedc30..ac6055970b6 100644 --- a/blog/Understanding-Data-Compaction-in-3-Minutes.md +++ b/blog/Compaction.md @@ -36,7 +36,7 @@ In particular, the data (which is the inflowing cargo in this metaphor) comes in - If an item needs to be discarded or replaced, since no line-jump is allowed on the conveyor belt (append-only), you can only put a "note" (together with the substitution item) at the end of the queue on the belt to remind the "storekeepers", who will later perform replacing or discarding for you. 
- If needed, the "storekeepers" are even kind enough to pre-process the cargo for you (pre-aggregating data to reduce computation burden during data reading). - + As helpful as the "storekeepers" are, they can be troublemakers at times — that's why "team management" matters. For the compaction mechanism to work efficiently, you need wise planning and scheduling, or else you might need to deal with high memory and CPU usage, if not OOM in the backend or write error. @@ -66,7 +66,7 @@ The combination of these three strategies is an example of cost-effective planni As columnar storage is the future for analytic databases, the execution of compaction should adapt to that. We call it vertical compaction. I illustrate this mechanism with the figure below: - + Hope all these tiny blocks and numbers don't make you dizzy. Actually, vertical compaction can be broken down into four simple steps: @@ -85,7 +85,7 @@ Segment compaction is the way to avoid that. It allows you to compact data at th This is a flow chart that explains how segment compaction works: - + Segment compaction will be triggered once the number of newly generated files exceeds a certain limit (let's say, 10). It is executed asynchronously by a specialized merging thread. Every 10 files will be merged into one, and the original 10 files will be deleted. Segment compaction does not prolong the data ingestion process by much, but it can largely accelerate data queries. @@ -95,7 +95,7 @@ Time series data analysis is an increasingly common analytic scenario. Time series data is "born orderly". It is already arranged chronologically, it is written at a regular pace, and every batch of it is of similar size. It is like the least-worried-about child in the family. Correspondingly, we have a tailored compaction method for it: ordered data compaction. - + Ordered data compaction is even simpler: @@ -127,4 +127,4 @@ Every data engineer has somehow been harassed by complicated parameters and conf ## Conclusion -This is how we keep our "storekeepers" working efficiently and cost-effectively. If you wonder how these strategies and optimization work in real practice, we tested Apache Doris with ClickBench. It reaches a **compaction speed of 300,000 row/s**; in high-concurrency scenarios, it maintains **a stable compaction score of around 50**. Also, we are planning to implement auto-tuning and increase observability for the compaction mechanism. If you are interested in the [Apache Doris](https:// [...] +This is how we keep our "storekeepers" working efficiently and cost-effectively. If you wonder how these strategies and optimization work in real practice, we tested Apache Doris with ClickBench. It reaches a **compaction speed of 300,000 row/s**; in high-concurrency scenarios, it maintains **a stable compaction score of around 50**. Also, we are planning to implement auto-tuning and increase observability for the compaction mechanism. If you are interested in the [Apache Doris](https:// [...] 
\ No newline at end of file diff --git a/blog/Building-the-Next-Generation-Data-Lakehouse-10X-Performance.md b/blog/Data Lakehouse.md similarity index 94% rename from blog/Building-the-Next-Generation-Data-Lakehouse-10X-Performance.md rename to blog/Data Lakehouse.md index 7f26a7eba44..2be2b4be159 100644 --- a/blog/Building-the-Next-Generation-Data-Lakehouse-10X-Performance.md +++ b/blog/Data Lakehouse.md @@ -51,7 +51,7 @@ To turn these visions into reality, a data query engine needs to figure out the Apache Doris 1.2.2 supports a wide variety of data lake formats and data access from various external data sources. Besides, via the Table Value Function, users can analyze files in object storage or HDFS directly. - + @@ -73,7 +73,7 @@ Older versions of Doris support a two-tiered metadata structure: database and ta 1. You can map to the whole external data source and ingest all metadata from it. 2. You can manage the properties of the specified data source at the catalog level, such as connection, privileges, and data ingestion details, and easily handle multiple data sources. - + @@ -114,7 +114,7 @@ This also paves the way for developers who want to connect to more data sources Access to external data sources is often hindered by network conditions and data resources. This requires extra efforts of a data query engine to guarantee reliability, stability, and real-timeliness in metadata access. - + Doris enables high efficiency in metadata access by **Meta Cache**, which includes Schema Cache, Partition Cache, and File Cache. This means that Doris can respond to metadata queries on thousands of tables in milliseconds. In addition, Doris supports manual refresh of metadata at the Catalog/Database/Table level. Meanwhile, it enables auto synchronization of metadata in Hive Metastore by monitoring Hive Metastore Event, so any changes can be updated within seconds. @@ -122,13 +122,13 @@ Doris enables high efficiency in metadata access by **Meta Cache**, which includ External data sources usually come with their own privilege management services. Many companies use one single tool (such as Apache Ranger) to provide authorization for their multiple data systems. Doris supports a custom authorization plugin, which can be connected to the user's own privilege management system via the Doris Access Controller interface. As a user, you only need to specify the authorization plugin for a newly created catalog, and then you can readily perform authorization [...] - + ### Data Access Doris supports data access to external storage systems, including HDFS and S3-compatible object storage: - + @@ -156,7 +156,7 @@ Doris caches files from remote storage in local high-performance disks as a way 1. **Block cache**: Doris supports the block cache of remote files and can automatically adjust the block size from 4KB to 4MB based on the read request. The block cache method reduces read/write amplification and read latency in cold caches. 2. **Consistent hashing for caching**: Doris applies consistent hashing to manage cache locations and schedule data scanning. By doing so, it prevents cache failures brought about by the online and offlining of nodes. It can also increase cache hit rate and query service stability. - + #### Execution Engine @@ -165,7 +165,7 @@ Developers surely don't want to rebuild all the general features for every new d - **Layer the logic**: All data queries in Doris, including those on internal tables, use the same operators, such as Join, Sort, and Agg. 
The only difference between queries on internal and external data lies in data access. In Doris, anything above the scan nodes follows the same query logic, while below the scan nodes, the implementation classes will take care of access to different data sources. - **Use a general framework for scan operators**: Even for the scan nodes, different data sources have a lot in common, such as task splitting logic, scheduling of sub-tasks and I/O, predicate pushdown, and Runtime Filter. Therefore, Doris uses interfaces to handle them. Then, it implements a unified scheduling logic for all sub-tasks. The scheduler is in charge of all scanning tasks in the node. With global information of the node in hand, the schedular is able to do fine-grained manage [...] - + #### Query Optimizer @@ -175,9 +175,9 @@ Doris supports a range of statistical information from various data sources, inc We tested Doris and Presto/Trino on HDFS in flat table scenarios (ClickBench) and multi-table scenarios (TPC-H). Here are the results: - + - + @@ -187,7 +187,7 @@ As is shown, with the same computing resources and on the same dataset, Apache D Querying external data sources requires no internal storage of Doris. This makes elastic stateless computing nodes possible. Apache Doris 2.0 is going to implement Elastic Compute Node, which is dedicated to supporting query workloads of external data sources. - + Stateless computing nodes are open for quick scaling so users can easily cope with query workloads during peaks and valleys and strike a balance between performance and cost. In addition, Doris has optimized itself for Kubernetes cluster management and node scheduling. Now Master nodes can automatically manage the onlining and offlining of Elastic Compute Nodes, so users can govern their cluster workloads in cloud-native and hybrid cloud scenarios without difficulty. @@ -199,7 +199,7 @@ Apache Doris has been adopted by a financial institution for risk management. Th - Doris makes it possible to perform real-time federated queries using Elasticsearch Catalog and achieve a response time of mere milliseconds. - Doris enables the decoupling of daily batch processing and statistical analysis, bringing less resource consumption and higher system stability. - + @@ -252,5 +252,5 @@ http://doris.apache.org https://github.com/apache/doris -Find Apache Doris developers on [Slack](https://join.slack.com/t/apachedoriscommunity/shared_invite/zt-1t3wfymur-0soNPATWQ~gbU8xutFOLog). + diff --git a/blog/Is-Your-Latest-Data-Really-the-Latest-Check-the-Data-Update-Mechanism-of-Your-Database.md b/blog/Data_Update.md similarity index 94% rename from blog/Is-Your-Latest-Data-Really-the-Latest-Check-the-Data-Update-Mechanism-of-Your-Database.md rename to blog/Data_Update.md index cb6f406e867..d8bbd04511f 100644 --- a/blog/Is-Your-Latest-Data-Really-the-Latest-Check-the-Data-Update-Mechanism-of-Your-Database.md +++ b/blog/Data_Update.md @@ -32,11 +32,11 @@ In databases, data update is to add, delete, or modify data. Timely data update Technically speaking, there are two types of data updates: you either update a whole row (**Row Update**) or just update part of the columns (**Partial Column Update**). Many databases supports both of them, but in different ways. This post is about one of them, which is simple in execution and efficient in data quality guarantee. 
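Before looking at how Doris handles this, here is a minimal sketch of the two update types against a hypothetical `orders` table — plain generic SQL purely for illustration, not the Doris-specific mechanism described next:

```sql
-- Hypothetical `orders` table, used only to contrast the two update types.

-- Row Update: the entire record for order 1 is rewritten.
INSERT INTO orders (order_id, amount, status)
VALUES (1, 100, 'Delivery Pending');

-- Partial Column Update: only the `status` column changes;
-- every other column of order 1 keeps its existing value.
UPDATE orders
SET status = 'Delivered'
WHERE order_id = 1;
```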
-As an open source analytic database, Apache Doris supports both Row Update and Partial Column Update with one data model: the [**Unique Key Model**](https://doris.apache.org/docs/dev/data-table/data-model#unique-model). It is where you put data that doesn't need to be aggregated. In the Unique Key Model, you can specify one column or the combination of several columns as the Unique Key (a.k.a. Primary Key). For one Unique Key, there will always be one row of data: the newly ingested data [...] +As an open source analytic database, Apache Doris supports both Row Update and Partial Column Update with one data model: the **Unique Key Model**. It is where you put data that doesn't need to be aggregated. In the Unique Key Model, you can specify one column or the combination of several columns as the Unique Key (a.k.a. Primary Key). For one Unique Key, there will always be one row of data: the newly ingested data record replaces the old. That's how data updates work. The idea is straightforward, but in real-life implementation, it happens that the latest data does not arrive the last or doesn't even get written at all, so I'm going to show you how Apache Doris implements data update and avoids messups with its Unique Key Model. - + ## Row Update @@ -134,11 +134,11 @@ The execution of the Update command consists of three steps in the system: - Step Two: Modify the order status from "Payment Pending" to "Delivery Pending" (1, 100, 'Delivery Pending') - Step Three: Insert the new row into the table - + The table is in the Unique Key Model, which means for rows of the same Unique Key, only the last inserted one will be reserved, so this is what the table will finally look like: - + ## Order of Data Updates diff --git a/blog/Zipping-up-the-Lambda-Architecture-for-40%-Faster-Performance.md b/blog/Douyu.md similarity index 94% rename from blog/Zipping-up-the-Lambda-Architecture-for-40%-Faster-Performance.md rename to blog/Douyu.md index d38a2f7902b..52ca06552f4 100644 --- a/blog/Zipping-up-the-Lambda-Architecture-for-40%-Faster-Performance.md +++ b/blog/Douyu.md @@ -32,7 +32,7 @@ Author: Tongyang Han, Senior Data Engineer at Douyu The Lambda architecture has been common practice in big data processing. The concept is to separate stream (real time data) and batch (offline data) processing, and that's exactly what we did. These two types of data of ours were processed in two isolated tubes before they were pooled together and ready for searches and queries. - + Then we run into a few problems: @@ -51,7 +51,7 @@ I am going to elaborate on how this is done using our data tagging process as an Previously, our offline tags were produced by the data warehouse, put into a flat table, and then written in **HBase**, while real-time tags were produced by **Flink**, and put into **HBase** directly. Then **Spark** would work as the computing engine. - + The problem with this stemmed from the low computation efficiency of **Flink** and **Spark**. @@ -60,25 +60,25 @@ The problem with this stemmed from the low computation efficiency of **Flink** a As a solution, we replaced **HBase** and **Spark** with **Apache Doris**, a real-time analytic database, and moved part of the computational logic of the foregoing wide-time-range real-time tags from **Flink** to **Apache Doris**. - + Instead of putting our flat tables in HBase, we place them in Apache Doris. These tables are divided into partitions based on time sensitivity. Offline tags will be updated daily while real-time tags will be updated in real time. 
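As a rough illustration of such a time-partitioned tag table (all names are hypothetical and the exact DDL varies by Doris version), the sketch below partitions by date and uses `REPLACE_IF_NOT_NULL` value columns so that offline and real-time tags can be written independently:

```sql
-- Hypothetical tag table: key columns first, then value columns whose latest
-- non-NULL write wins, which is what allows partial updates per tag type.
CREATE TABLE user_tags (
    user_id      BIGINT,
    dt           DATE,
    offline_tag  VARCHAR(64) REPLACE_IF_NOT_NULL NULL,
    realtime_tag VARCHAR(64) REPLACE_IF_NOT_NULL NULL
)
AGGREGATE KEY (user_id, dt)
PARTITION BY RANGE (dt) (
    PARTITION p20230601 VALUES LESS THAN ('2023-06-02')
)
DISTRIBUTED BY HASH (user_id) BUCKETS 16
PROPERTIES ("replication_num" = "3");
```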
We organize these tables in the Aggregate Model of Apache Doris, which allows partial update of data. Instead of using Spark for queries, we parse the query rules into SQL for execution in Apache Doris. For pattern matching, we use Redis to cache the hot data from Apache Doris, so the system can respond to such queries much faster. - + ## **Computational Pipeline of Wide-Time-Range Real-Time Tags** In some cases, the computation of wide-time-range real-time tags entails the aggregation of historical (offline) data with real-time data. The following figure shows our old computational pipeline for these tags. - + As you can see, it required multiple tasks to finish computing one real-time tag. Also, in complicated aggregations that involve a collection of aggregation operations, any improper resource allocation could lead to back pressure or waste of resources. This adds to the difficulty of task scheduling. The maintenance and stability guarantee of such a long pipeline could be an issue, too. To improve on that, we decided to move such aggregation workload to Apache Doris. - + We have around 400 million customer tags in our system, and each customer is attached with over 300 tags. We divide customers into more than 10,000 groups, and we have to update 5000 of them on a daily basis. The above improvement has sped up the computation of our wide-time-range real-time queries by **40%**. diff --git a/blog/Improving-Query-Speed-to-Make-the-Most-out-of-Your-Data.md b/blog/Duyansoft.md similarity index 94% rename from blog/Improving-Query-Speed-to-Make-the-Most-out-of-Your-Data.md rename to blog/Duyansoft.md index 0d7a64081fa..53d825001eb 100644 --- a/blog/Improving-Query-Speed-to-Make-the-Most-out-of-Your-Data.md +++ b/blog/Duyansoft.md @@ -29,7 +29,7 @@ under the License. > Author: Junfei Liu, Senior Architect of Duyansoft - + The world is getting more and more value out of data, as exemplified by the currently much-talked-about ChatGPT, which I believe is a robotic data analyst. However, in today’s era, what’s more important than the data itself is the ability to locate your wanted information among all the overflowing data quickly. So in this article, I will talk about how I improved overall data processing efficiency by optimizing the choice and usage of data warehouses. @@ -37,13 +37,13 @@ The world is getting more and more value out of data, as exemplified by the curr The choice of data warehouses was never high on my worry list until 2021. I have been working as a data engineer for a Fintech SaaS provider since its incorporation in 2014. In the company’s infancy, we didn’t have too much data to juggle. We only needed a simple tool for OLTP and business reporting, and the traditional databases would cut the mustard. - + But as the company grew, the data we received became overwhelmingly large in volume and increasingly diversified in sources. Every day, we had tons of user accounts logging in and sending myriads of requests. It was like collecting water from a thousand taps to put out a million scattered pieces of fire in a building, except that you must bring the exact amount of water needed for each fire spot. Also, we got more and more emails from our colleagues asking if we could make data analysis [...] The first thing we did was to revolutionize our data processing architecture. We used DataHub to collect all our transactional or log data and ingest it into an offline data warehouse for data processing (analyzing, computing. etc.). 
Then the results would be exported to MySQL and then forwarded to QuickBI to display the reports visually. We also replaced MongoDB with a real-time data warehouse for business queries. - + This new architecture worked, but there remained a few pebbles in our shoes: @@ -61,7 +61,7 @@ To begin with, we tried to move the largest tables from MySQL to [Apache Doris]( As for now, we are using two Doris clusters: one to handle point queries (high QPS) from our users and the other for internal ad-hoc queries and reporting. As a result, users have reported smoother experience and we can provide more features that are used to be bottlenecked by slow query execution. Moving our dimension tables to Doris also brought less data errors and higher development efficiency. - + Both the FE and BE processes of Doris can be scaled out, so tens of PBs of data stored in hundreds of devices can be put into one single cluster. In addition, the two types of processes implement a consistency protocol to ensure service availability and data reliability. This removes dependency on Hadoop and thus saves us the cost of deploying Hadoop clusters. @@ -89,11 +89,11 @@ In addition to the various monitoring metrics of Doris, we deployed an audit log Slow SQL queries: - + Some of our often-used monitoring metrics: - + **Tradeoff Between Resource Usage and Real-Time Availability:** @@ -116,4 +116,4 @@ As we set out to find a single data warehouse that could serve all our needs, we **Try** [**Apache Doris**](https://github.com/apache/doris) **out!** -It is an open source real-time analytical database based on MPP architecture. It supports both high-concurrency point queries and high-throughput complex analysis. Or you can start your free trial of [**SelectDB**](https://en.selectdb.com/), a cloud-native real-time data warehouse developed based on the Apache Doris open source project by the same key developers. +It is an open source real-time analytical database based on MPP architecture. It supports both high-concurrency point queries and high-throughput complex analysis. Or you can start your free trial of [**SelectDB**](https://en.selectdb.com/), a cloud-native real-time data warehouse developed based on the Apache Doris open source project by the same key developers. \ No newline at end of file diff --git a/blog/Auto-Synchronization-of-an-Entire-MySQL-Database-for-Data-Analysis.md b/blog/FDC.md similarity index 97% rename from blog/Auto-Synchronization-of-an-Entire-MySQL-Database-for-Data-Analysis.md rename to blog/FDC.md index 46ad6d5f3de..531489100d3 100644 --- a/blog/Auto-Synchronization-of-an-Entire-MySQL-Database-for-Data-Analysis.md +++ b/blog/FDC.md @@ -1,6 +1,6 @@ --- { - 'title': 'Auto-Synchronization-of-an-Entire-MySQL-Database-for-Data-Analysis', + 'title': 'Auto-Synchronization of an Entire MySQL Database for Data Analysis', 'summary': "Flink-Doris-Connector 1.4.0 allows users to ingest a whole database (MySQL or Oracle) that contains thousands of tables into Apache Doris, in one step.", 'date': '2023-08-16', 'author': 'Apache Doris', @@ -94,11 +94,11 @@ When it comes to synchronizing a whole database (containing hundreds or even tho Under pressure test, the system showed high stability, with key metrics as follows: - + - + - + According to feedback from early adopters, the Connector has also delivered high performance and system stability in 10,000-table database synchronization in their production environment. 
This proves that the combination of Apache Doris and Flink CDC is capable of large-scale data synchronization with high efficiency and reliability. diff --git a/blog/Tiered-Storage-for-Hot-and-Cold-Data-What-Why-and-How.md b/blog/HCDS.md similarity index 90% rename from blog/Tiered-Storage-for-Hot-and-Cold-Data-What-Why-and-How.md rename to blog/HCDS.md index 984110f0d45..5975a505fac 100644 --- a/blog/Tiered-Storage-for-Hot-and-Cold-Data-What-Why-and-How.md +++ b/blog/HCDS.md @@ -1,6 +1,6 @@ --- { - 'title': 'Tiered Storage for Hot and Cold Data: What, Why, and How?', + 'title': 'Hot-Cold Data Separation: What, Why, and How?', 'summary': "Hot data is the frequently accessed data, while cold data is the one you seldom visit but still need. Separating them is for higher efficiency in computation and storage.", 'date': '2023-06-23', 'author': 'Apache Doris', @@ -28,7 +28,7 @@ specific language governing permissions and limitations under the License. --> -Apparently tiered storage is hot now. But first of all: +Apparently hot-cold data separation is hot now. But first of all: ## What is Hot/Cold Data? @@ -38,21 +38,21 @@ For example, orders of the past six months are "hot" and logs from years ago are ## Why Separate Hot and Cold Data? -Tiered storage is an idea often seen in real life: You put your favorite book on your bedside table, your Christmas ornament in the attic, and your childhood art project in the garage or a cheap self-storage space on the other side of town. The purpose is a tidy and efficient life. +Hot-Cold Data Separation is an idea often seen in real life: You put your favorite book on your bedside table, your Christmas ornament in the attic, and your childhood art project in the garage or a cheap self-storage space on the other side of town. The purpose is a tidy and efficient life. Similarly, companies separate hot and cold data for more efficient computation and more cost-effective storage, because storage that allows quick read/write is always expensive, like SSD. On the other hand, HDD is cheaper but slower. So it is more sensible to put hot data on SSD and cold data on HDD. If you are looking for an even lower-cost option, you can go for object storage. -In data analytics, tiered storage is implemented by a tiered storage mechanism in the database. For example, Apache Doris supports three-tiered storage: SSD, HDD, and object storage. For newly ingested data, after a specified cooldown period, it will turn from hot data into cold data and be moved to object storage. In addition, object storage only preserves one copy of data, which further cuts down storage costs and the relevant computation/network overheads. +In data analytics, hot-cold data separation is implemented by a tiered storage mechanism in the database. For example, Apache Doris supports three-tiered storage: SSD, HDD, and object storage. For newly ingested data, after a specified cooldown period, it will turn from hot data into cold data and be moved to object storage. In addition, object storage only preserves one copy of data, which further cuts down storage costs and the relevant computation/network overheads. - + -How much can you save by tiered storage? Here is some math. +How much can you save by hot-cold data separation? Here is some math. In public cloud services, cloud disks generally cost 5~10 times as much as object storage. If 80% of your data asset is cold data and you put it in object storage instead of cloud disks, you can expect a cost reduction of around 70%. 
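As a quick sanity check on that figure — assuming, purely for illustration, that object storage costs about one-seventh as much per TB as cloud disks (within the 5~10x range above) — moving 80% of the data to object storage gives:

```latex
\text{relative cost} = 0.2 \times 1 + 0.8 \times \tfrac{1}{7} \approx 0.31
\quad\Rightarrow\quad
\text{saving} \approx 1 - 0.31 \approx 69\%
```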
-Let the percentage of cold data be "rate", the price of object storage be "OS", and the price of cloud disk be "CloudDisk", this is how much you can save by tiered storage instead of putting all your data on cloud disks: +Let the percentage of cold data be "rate", the price of object storage be "OS", and the price of cloud disk be "CloudDisk", this is how much you can save by hot-cold data separation instead of putting all your data on cloud disks: - + Now let's put real-world numbers in this formula: @@ -62,9 +62,9 @@ AWS pricing, US East (Ohio): - **Throughput Optimized HDD (st 1)**: 102 USD per TB per month - **General Purpose SSD (gp2)**: 158 USD per TB per month - + -## How Is Tiered Storage Implemented? +## How Is Hot-Cold Separation Implemented? Till now, hot-cold separation sounds nice, but the biggest concern is: how can we implement it without compromising query performance? This can be broken down to three questions: @@ -78,9 +78,9 @@ In what follows, I will show you how Apache Doris addresses them one by one. Accessing cold data from object storage will indeed be slow. One solution is to cache cold data in local disks for use in queries. In Apache Doris 2.0, when a query requests cold data, only the first-time access will entail a full network I/O operation from object storage. Subsequent queries will be able to read data directly from local cache. -The granularity of caching matters, too. A coarse granularity might lead to a waste of cache space, but a fine granularity could be the reason for low I/O efficiency. Apache Doris bases its caching on data blocks. It downloads cold data blocks from object storage onto local Block Cache. This is the "pre-heating" process. With cold data fully pre-heated, queries on tables with tiered storage will be basically as fast as those on tablets without. We drew this conclusion from test results o [...] +The granularity of caching matters, too. A coarse granularity might lead to a waste of cache space, but a fine granularity could be the reason for low I/O efficiency. Apache Doris bases its caching on data blocks. It downloads cold data blocks from object storage onto local Block Cache. This is the "pre-heating" process. With cold data fully pre-heated, queries on tables with hot-cold data separation will be basically as fast as those on tablets without. We drew this conclusion from test [...] - + - ***Test Data****: SSB SF100 dataset* - ***Configuration****: 3 × 16C 64G, a cluster of 1 frontend and 3 backends* @@ -93,7 +93,7 @@ In object storage, only one copy of cold data is preserved. Within Apache Doris, Implementation-wise, the Doris frontend picks a local replica as the Leader. Updates to the Leader will be synchronized to all other local replicas via a regular report mechanism. Also, as the Leader uploads data to object storage, the relevant metadata will be updated to other local replicas, too. - + ### Reduced I/O and CPU Overhead @@ -103,7 +103,7 @@ A thread in Doris backend will regularly pick N tablets from the cold data and s ## Tutorial -Separating tiered storage in storage is a huge cost saver and there have been ways to ensure the same fast query performance. Executing hot-cold data separation is a simple 6-step process, so you can find out how it works yourself: +Separating hot and cold data in storage is a huge cost saver and there have been ways to ensure the same fast query performance. 
Executing hot-cold data separation is a simple 6-step process, so you can find out how it works yourself: diff --git a/blog/Step-by-step-Guide-to-Building-a-High-Performing-Risk-Data-Mart.md b/blog/HYXJ.md similarity index 96% rename from blog/Step-by-step-Guide-to-Building-a-High-Performing-Risk-Data-Mart.md rename to blog/HYXJ.md index 8aa15b79981..97aad2303ca 100644 --- a/blog/Step-by-step-Guide-to-Building-a-High-Performing-Risk-Data-Mart.md +++ b/blog/HYXJ.md @@ -39,7 +39,7 @@ I will walk you through how the risk data mart works following the data flow: So these are the three data sources of our risk data mart. - + This whole architecture is built with CDH 6.0. The workflows in it can be divided into real-time data streaming and offline risk analysis. @@ -48,7 +48,7 @@ This whole architecture is built with CDH 6.0. The workflows in it can be divide To give a brief overview, these are the components that support the four features of our data processing platform: - + As you see, Apache Hive is central to this architecture. But in practice, it takes minutes for Apache Hive to execute analysis, so our next step is to increase query speed. @@ -72,7 +72,7 @@ In addition, since our risk control analysts and modeling engineers are using Hi We wanted a unified gateway to manage our heterogenous data sources. That's why we introduced Apache Doris. - + But doesn't that make things even more complicated? Actually, no. @@ -80,7 +80,7 @@ We can connect various data sources to Apache Doris and simply conduct queries o We create Elasticsearch Catalog and Hive Catalog in Apache Doris. These catalogs map to the external data in Elasticsearch and Hive, so we can conduct federated queries across these data sources using Apache Doris as a unified gateway. Also, we use the [Spark-Doris-Connector](https://github.com/apache/doris-spark-connector) to allow data communication between Spark and Doris. So basically, we replace Apache Hive with Apache Doris as the central hub of our data architecture. - + How does that affect our data processing efficiency? @@ -101,7 +101,7 @@ Backup cluster: 4 frontends + 4 backends, m5d.16xlarge This is the monitoring board: - + As is shown, the queries are fast. We expected that it would take at least 10 nodes but in real cases, we mainly conduct queries via Catalogs, so we can handle this with a relatively small cluster size. The compatibility is good, too. It doesn't rock the rest of our existing system. @@ -109,7 +109,7 @@ As is shown, the queries are fast. We expected that it would take at least 10 no To accelerate the regular data ingestion from Hive to Apache Doris 1.2.2, we have a solution that goes as follows: - + **Main components:** @@ -197,7 +197,7 @@ group by product_no; **After**: For data synchronization, we call Spark on YARN using SeaTunnel. It can be finished within 11 minutes (100 million pieces per minute ), and the ingested data only takes up **378G** of storage space. - + ## Summary diff --git a/blog/How-We-Increased-Database-Query-Concurrency-by-20-Times.md b/blog/High_concurrency.md similarity index 98% rename from blog/How-We-Increased-Database-Query-Concurrency-by-20-Times.md rename to blog/High_concurrency.md index 348a1dd1e82..85d3240482c 100644 --- a/blog/How-We-Increased-Database-Query-Concurrency-by-20-Times.md +++ b/blog/High_concurrency.md @@ -152,7 +152,7 @@ Normally, an SQL statement is executed in three steps: For complex queries on massive data, it is better to follow the plan created by the Query Optimizer. 
However, for high-concurrency point queries requiring low latency, that plan is not only unnecessary but also brings extra overheads. That's why we implement a short-circuit plan for point queries. - + Once the FE receives a point query request, a short-circuit plan will be produced. It is a lightweight plan that involves no equivalent transformation, logic optimization or physical optimization. Instead, it conducts some basic analysis on the AST, creates a fixed plan accordingly, and finds ways to reduce overhead of the optimizer. @@ -182,13 +182,13 @@ rpc tablet_fetch_data(PTabletKeyLookupRequest) returns (PTabletKeyLookupResponse In high-concurrency queries, part of the CPU overhead comes from SQL analysis and parsing in FE. To reduce such overhead, in FE, we provide prepared statements that are fully compatible with MySQL protocol. With prepared statements, we can achieve a four-time performance increase for primary key point queries. - + The idea of prepared statements is to cache precomputed SQL and expressions in HashMap in memory, so they can be directly used in queries when applicable. Prepared statements adopt MySQL binary protocol for transmission. The protocol is implemented in the mysql_row_buffer.[h|cpp] file, and uses MySQL binary encoding. Under this protocol, the client (for example, JDBC Client) sends a pre-compiled statement to FE via `PREPARE` MySQL Command. Next, FE will parse and analyze the statement and cache it in the HashMap as shown in the figure above. Next, the client, using `EXECUTE` MySQL Command, will replace the placeholder, encode it into binar [...] - + Apart from caching prepared statements in FE, we also cache reusable structures in BE. These structures include pre-allocated computation blocks, query descriptors, and output expressions. Serializing and deserializing these structures often cause a CPU hotspot, so it makes more sense to cache them. The prepared statement for each query comes with a UUID named CacheID. So when BE executes the point query, it will find the corresponding class based on the CacheID, and then reuse the struc [...] @@ -220,13 +220,13 @@ resultSet = readStatement.executeQuery(); Apache Doris has a Page Cache feature, where each page caches the data of one column. - + As mentioned above, we have introduced row storage in Doris. The problem with this is, one row of data consists of multiple columns, so in the case of big queries, the cached data might be erased. Thus, we also introduced row cache to increase row cache hit rate. Row cache reuses the LRU Cache mechanism in Apache Doris. When the caching starts, the system will initialize a threshold value. If that threshold is hit, the old cached rows will be phased out. For a primary key query statement, the performance gap between cache hit and cache miss can be huge (we are talking about dozens of times less disk I/O and memory access here). So the introduction of row cache can remarkably enhance point query performance. - + To enable row cache, you can specify the following configuration in BE: @@ -283,7 +283,7 @@ SELECT * from usertable WHERE YCSB_KEY = ? We run the test with the optimizations (row storage, short-circuit, and prepared statement) enabled, and then did it again with all of them disabled. 
Here are the results: - + With optimizations enabled, **the average query latency decreased by a whopping 96%, the 99th percentile latency was only 1/28 of that without optimizations, and it has achieved a query concurrency of over 30,000 QPS.** This is a huge leap in performance and an over 20-time increase in concurrency. diff --git a/blog/Building-A-Log-Analytics-Solution-10-Times-More-Cost-Effective-Than-Elasticsearch.md b/blog/Inverted Index.md similarity index 96% rename from blog/Building-A-Log-Analytics-Solution-10-Times-More-Cost-Effective-Than-Elasticsearch.md rename to blog/Inverted Index.md index d26f743c127..a3286745161 100644 --- a/blog/Building-A-Log-Analytics-Solution-10-Times-More-Cost-Effective-Than-Elasticsearch.md +++ b/blog/Inverted Index.md @@ -53,7 +53,7 @@ There exist two common log processing solutions within the industry, exemplified - **Inverted index (Elasticsearch)**: It is well-embraced due to its support for full-text search and high performance. The downside is the low throughput in real-time writing and the huge resource consumption in index creation. - **Lightweight index / no index (Grafana Loki)**: It is the opposite of inverted index because it boasts high real-time write throughput and low storage cost but delivers slow queries. - + ## Introduction to Inverted Index @@ -63,7 +63,7 @@ Inverted indexing was originally used to retrieve words or phrases in texts. The Upon data writing, the system tokenizes texts into **terms**, and stores these terms in a **posting list** which maps terms to the ID of the row where they exist. In text queries, the database finds the corresponding **row ID** of the keyword (term) in the posting list, and fetches the target row based on the row ID. By doing so, the system won't have to traverse the whole dataset and thus improves query speeds by orders of magnitudes. - + In inverted indexing of Elasticsearch, quick retrieval comes at the cost of writing speed, writing throughput, and storage space. Why? Firstly, tokenization, dictionary sorting, and inverted index creation are all CPU- and memory-intensive operations. Secondly, Elasticssearch has to store the original data, the inverted index, and an extra copy of data stored in columns for query acceleration. That's triple redundancy. @@ -87,14 +87,14 @@ In [Apache Doris](https://github.com/apache/doris), we opt for the other way. Bu In Apache Doris, data is arranged in the following format. Indexes are stored in the Index Region: - + We implement inverted indexes in a non-intrusive manner: 1. **Data ingestion & compaction**: As a segment file is written into Doris, an inverted index file will be written, too. The index file path is determined by the segment ID and the index ID. Rows in segments correspond to the docs in indexes, so are the RowID and the DocID. 2. **Query**: If the `where` clause includes a column with inverted index, the system will look up in the index file, return a DocID list, and convert the DocID list into a RowID Bitmap. Under the RowID filtering mechanism of Apache Doris, only the target rows will be read. This is how queries are accelerated. - + Such non-intrusive method separates the index file from the data files, so you can make any changes to the inverted indexes without worrying about affecting the data files themselves or other indexes. 
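For reference, the sketch below shows roughly what this looks like in SQL — the table, column, and index names are invented for illustration, and the exact syntax should be checked against the inverted index documentation for your Doris version:

```sql
-- Hypothetical log table with an inverted index on the message column.
CREATE TABLE app_logs (
    ts    DATETIME,
    level VARCHAR(16),
    msg   STRING,
    INDEX idx_msg (msg) USING INVERTED PROPERTIES ("parser" = "english")
)
DUPLICATE KEY (ts)
DISTRIBUTED BY HASH (ts) BUCKETS 10
PROPERTIES ("replication_num" = "1");

-- Full-text search: the index returns matching DocIDs, which are converted
-- to a RowID bitmap so only the target rows are read.
SELECT ts, level, msg
FROM app_logs
WHERE msg MATCH_ANY 'timeout error';
```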
@@ -160,12 +160,12 @@ For a fair comparison, we ensure uniformity of testing conditions, including ben - **Results of Apache Doris**: - - Writing Speed: 550 MB/s, **4.2 times that of Elasticsearch** +- - Writing Speed: 550 MB/s, **4.2 times that of Elasticsearch** - Compression Ratio: 10:1 - Storage Usage: **20% that of Elasticsearch** - Response Time: **43% that of Elasticsearch** - + ### Apache Doris VS ClickHouse @@ -179,7 +179,7 @@ As ClickHouse launched inverted index as an experimental feature in v23.1, we te **Result**: Apache Doris was **4.7 times, 12 times, 18.5 times** faster than ClickHouse in the three queries, respectively. - + ## Usage & Example @@ -246,4 +246,4 @@ For more feature introduction and usage guide, see documentation: [Inverted Inde In a word, what contributes to Apache Doris' 10-time higher cost-effectiveness than Elasticsearch is its OLAP-tailored optimizations for inverted indexing, supported by the columnar storage engine, massively parallel processing framework, vectorized query engine, and cost-based optimizer of Apache Doris. -As proud as we are about our own inverted indexing solution, we understand that self-published benchmarks can be controversial, so we are open to [feedback](https://t.co/KcxAtAJZjZ) from any third-party users and see how [Apache Doris](https://github.com/apache/doris) works in real-world cases. +As proud as we are about our own inverted indexing solution, we understand that self-published benchmarks can be controversial, so we are open to [feedback](https://t.co/KcxAtAJZjZ) from any third-party users and see how [Apache Doris](https://github.com/apache/doris) works in real-world cases. \ No newline at end of file diff --git a/blog/Log-Analysis-How-to-Digest-15-Billion-Logs-Per-Day-and-Keep-Big-Queries-Within-1-Second.md b/blog/Log-Analysis-How-to-Digest-15-Billion-Logs-Per-Day-and-Keep-Big-Queries-Within-1-Second.md deleted file mode 100644 index cd47c6a3f91..00000000000 --- a/blog/Log-Analysis-How-to-Digest-15-Billion-Logs-Per-Day-and-Keep-Big-Queries-Within-1-Second.md +++ /dev/null @@ -1,94 +0,0 @@ ---- -{ - 'title': 'Log Analysis: How to Digest 15 Billion Logs Per Day and Keep Big Queries Within 1 Second', - 'summary': "This article describes a large-scale data warehousing use case to provide reference for data engineers who are looking for log analytic solutions. It introduces the log processing architecture and real case practice in data ingestion, storage, and queries.", - 'date': '2023-09-16', - 'author': 'Yuqi Liu', - 'tags': ['Best Practice'], -} ---- - -<!-- -Licensed to the Apache Software Foundation (ASF) under one -or more contributor license agreements. See the NOTICE file -distributed with this work for additional information -regarding copyright ownership. The ASF licenses this file -to you under the Apache License, Version 2.0 (the -"License"); you may not use this file except in compliance -with the License. You may obtain a copy of the License at - - http://www.apache.org/licenses/LICENSE-2.0 - -Unless required by applicable law or agreed to in writing, -software distributed under the License is distributed on an -"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY -KIND, either express or implied. See the License for the -specific language governing permissions and limitations -under the License. ---> - - - -This data warehousing use case is about **scale**. The user is [China Unicom](https://en.wikipedia.org/wiki/China_Unicom), one of the world's biggest telecommunication service providers. 
Using Apache Doris, they deploy multiple petabyte-scale clusters on dozens of machines to support their 15 billion daily log additions from their over 30 business lines. Such a gigantic log analysis system is part of their cybersecurity management. For the need of real-time monitoring, threat tracing, an [...] - -From an architectural perspective, the system should be able to undertake real-time analysis of various formats of logs, and of course, be scalable to support the huge and ever-enlarging data size. The rest of this post is about what their log processing architecture looks like, and how they realize stable data ingestion, low-cost storage, and quick queries with it. - -## System Architecture - -This is an overview of their data pipeline. The logs are collected into the data warehouse, and go through several layers of processing. - - - -- **ODS**: Original logs and alerts from all sources are gathered into Apache Kafka. Meanwhile, a copy of them will be stored in HDFS for data verification or replay. -- **DWD**: This is where the fact tables are. Apache Flink cleans, standardizes, backfills, and de-identifies the data, and write it back to Kafka. These fact tables will also be put into Apache Doris, so that Doris can trace a certain item or use them for dashboarding and reporting. As logs are not averse to duplication, the fact tables will be arranged in the [Duplicate Key model](https://doris.apache.org/docs/dev/data-table/data-model#duplicate-model) of Apache Doris. -- **DWS**: This layer aggregates data from DWD and lays the foundation for queries and analysis. -- **ADS**: In this layer, Apache Doris auto-aggregates data with its Aggregate Key model, and auto-updates data with its Unique Key model. - -Architecture 2.0 evolves from Architecture 1.0, which is supported by ClickHouse and Apache Hive. The transition arised from the user's needs for real-time data processing and multi-table join queries. In their experience with ClickHouse, they found inadequate support for concurrency and multi-table joins, manifested by frequent timeouts in dashboarding and OOM errors in distributed joins. - - - -Now let's take a look at their practice in data ingestion, storage, and queries with Architecture 2.0. - -## Real-Case Practice - -### Stable ingestion of 15 billion logs per day - -In the user's case, their business churns out 15 billion logs every day. Ingesting such data volume quickly and stably is a real problem. With Apache Doris, the recommended way is to use the Flink-Doris-Connector. It is developed by the Apache Doris community for large-scale data writing. The component requires simple configuration. It implements Stream Load and can reach a writing speed of 200,000~300,000 logs per second, without interrupting the data analytic workloads. - -A lesson learned is that when using Flink for high-frequency writing, you need to find the right parameter configuration for your case to avoid data version accumulation. In this case, the user made the following optimizations: - -- **Flink Checkpoint**: They increase the checkpoint interval from 15s to 60s to reduce writing frequency and the number of transactions processed by Doris per unit of time. This can relieve data writing pressure and avoid generating too many data versions. -- **Data Pre-Aggregation**: For data of the same ID but comes from various tables, Flink will pre-aggregate it based on the primary key ID and create a flat table, in order to avoid excessive resource consumption caused by multi-source data writing. 
-- **Doris Compaction**: The trick here includes finding the right Doris backend (BE) parameters to allocate the right amount of CPU resources for data compaction, setting the appropriate number of data partitions, buckets, and replicas (too much data tablets will bring huge overheads), and dialing up `max_tablet_version_num` to avoid version accumulation. - -These measures together ensure daily ingestion stability. The user has witnessed stable performance and low compaction score in Doris backend. In addition, the combination of data pre-processing in Flink and the [Unique Key model](https://doris.apache.org/docs/dev/data-table/data-model#unique-model) in Doris can ensure quicker data updates. - -### Storage strategies to reduce costs by 50% - -The size and generation rate of logs also impose pressure on storage. Among the immense log data, only a part of it is of high informational value, so storage should be differentiated. The user has three storage strategies to reduce costs. - -- **ZSTD (ZStandard) compression algorithm**: For tables larger than 1TB, specify the compression method as "ZSTD" upon table creation, it will realize a compression ratio of 10:1. -- **Tiered storage of hot and cold data**: This is supported by the [new feature](https://blog.devgenius.io/hot-cold-data-separation-what-why-and-how-5f7c73e7a3cf) of Doris. The user sets a data "cooldown" period of 7 days. That means data from the past 7 days (namely, hot data) will be stored in SSD. As time goes by, hot data "cools down" (getting older than 7 days), it will be automatically moved to HDD, which is less expensive. As data gets even "colder", it will be moved to object st [...] -- **Differentiated replica numbers for different data partitions**: The user has partitioned their data by time range. The principle is to have more replicas for newer data partitions and less for the older ones. In their case, data from the past 3 months is frequently accessed, so they have 2 replicas for this partition. Data that is 3~6 months old has two replicas, and data from 6 months ago has one single copy. - -With these three strategies, the user has reduced their storage costs by 50%. - -### Differentiated query strategies based on data size - -Some logs must be immediately traced and located, such as those of abnormal events or failures. To ensure real-time response to these queries, the user has different query strategies for different data sizes: - -- **Less than 100G**: The user utilizes the dynamic partitioning feature of Doris. Small tables will be partitioned by date and large tables will be partitioned by hour. This can avoid data skew. To further ensure balance of data within a partition, they use the snowflake ID as the bucketing field. They also set a starting offset of 20 days, which means data of the recent 20 days will be kept. In this way, they find the balance point between data backlog and analytic needs. -- **100G~1T**: These tables have their materialized views, which are the pre-computed result sets stored in Doris. Thus, queries on these tables will be much faster and less resource-consuming. The DDL syntax of materialized views in Doris is the same as those in PostgreSQL and Oracle. -- **More than 100T**: These tables are put into the Aggregate Key model of Apache Doris and pre-aggregate them. **In this way, we enable queries of 2 billion log records to be done in 1~2s.** - -These strategies have shortened the response time of queries. 
For example, a query of a specific data item used to take minutes, but now it can be finished in milliseconds. In addition, for big tables that contain 10 billion data records, queries on different dimensions can all be done in a few seconds. - -## Ongoing Plans - -The user is now testing with the newly added [inverted index](https://doris.apache.org/docs/dev/data-table/index/inverted-index?_highlight=inverted) in Apache Doris. It is designed to speed up full-text search of strings as well as equivalence and range queries of numerics and datetime. They have also provided their valuable feedback about the auto-bucketing logic in Doris: Currently, Doris decides the number of buckets for a partition based on the data size of the previous partition. T [...] - - - - - diff --git a/blog/Say-Goodbye-to-OOM-Crashes.md b/blog/Memory_Management.md similarity index 97% rename from blog/Say-Goodbye-to-OOM-Crashes.md rename to blog/Memory_Management.md index d7ec381b453..b658ea278b6 100644 --- a/blog/Say-Goodbye-to-OOM-Crashes.md +++ b/blog/Memory_Management.md @@ -31,7 +31,7 @@ under the License. What guarantees system stability in large data query tasks? It is an effective memory allocation and monitoring mechanism. It is how you speed up computation, avoid memory hotspots, promptly respond to insufficient memory, and minimize OOM errors. - + From a database user's perspective, how do they suffer from bad memory management? This is a list of things that used to bother our users: @@ -47,13 +47,13 @@ Luckily, those dark days are behind us, because we have improved our memory mana In Apache Doris, we have a one-and-only interface for memory allocation: **Allocator**. It will make adjustments as it sees appropriate to keep memory usage efficient and under control. Also, MemTrackers are in place to track the allocated or released memory size, and three different data structures are responsible for large memory allocation in operator execution (we will get to them immediately). - + ### Data Structures in Memory As different queries have different memory hotspot patterns in execution, Apache Doris provides three different in-memory data structures: **Arena**, **HashTable**, and **PODArray**. They are all under the reign of the Allocator. - + **1. Arena** @@ -99,7 +99,7 @@ Memory reuse is executed in data scanning, too. Before the scanning starts, a nu The MemTracker system before Apache Doris 1.2.0 was in a hierarchical tree structure, consisting of process_mem_tracker, query_pool_mem_tracker, query_mem_tracker, instance_mem_tracker, ExecNode_mem_tracker and so on. MemTrackers of two neighbouring layers are of parent-child relationship. Hence, any calculation mistakes in a child MemTracker will be accumulated all the way up and result in a larger scale of incredibility. - + In Apache Doris 1.2.0 and newer, we made the structure of MemTrackers much simpler. MemTrackers are only divided into two types based on their roles: **MemTracker Limiter** and the others. MemTracker Limiter, monitoring memory usage, is unique in every query/ingestion/compaction task and global object; while the other MemTrackers traces the memory hotspots in query execution, such as HashTables in Join/Aggregation/Sort/Window functions and intermediate data in serialization, to give a pi [...] @@ -107,7 +107,7 @@ The parent-child relationship between MemTracker Limiter and other MemTrackers i MemTrackers (including MemTracker Limiter and the others) are put into a group of Maps. 
They allow users to print overall MemTracker type snapshot, Query/Load/Compaction task snapshot, and find out the Query/Load with the most memory usage or the most memory overusage. - + ### How MemTracker Works @@ -122,21 +122,21 @@ Now let me explain with a simplified query execution process. - When the scanning is done, all MemTrackers in the Scanner Thread TLS Stack will be removed. When the ScanNode scheduling is done, the ScanNode MemTracker will be removed from the fragment execution thread. Then, similarly, when an aggregation node is scheduled, an **AggregationNode MemTracker** will be added to the fragment execution thread TLS Stack, and get removed after the scheduling is done. - If the query is completed, the Query MemTracker will be removed from the fragment execution thread TLS Stack. At this point, this stack should be empty. Then, from the QueryProfile, you can view the peak memory usage during the whole query execution as well as each phase (scanning, aggregation, etc.). - + ### How to Use MemTracker The Doris backend Web page demonstrates real-time memory usage, which is divided into types: Query/Load/Compaction/Global. Current memory consumption and peak consumption are shown. - + The Global types include MemTrackers of Cache and TabletMeta. - + From the Query types, you can see the current memory consumption and peak consumption of the current query and the operators it involves (you can tell how they are related from the labels). For memory statistics of historical queries, you can check the Doris FE audit logs or BE INFO logs. - + ## Memory Limit @@ -152,7 +152,7 @@ While in Apache Doris 2.0, we have realized exception safety for queries. That m On a regular basis, Doris backend retrieves the physical memory of processes and the currently available memory size from the system. Meanwhile, it collects MemTracker snapshots of all Query/Load/Compaction tasks. If a backend process exceeds its memory limit or there is insufficient memory, Doris will free up some memory space by clearing Cache and cancelling a number of queries or data ingestion tasks. These will be executed by an individual GC thread regularly. - + If the process memory consumed is over the SoftMemLimit (81% of total system memory, by default), or the available system memory drops below the Warning Water Mark (less than 3.2GB), **Minor GC** will be triggered. At this moment, query execution will be paused at the memory allocation step, the cached data in data ingestion tasks will be force flushed, and part of the Data Page Cache and the outdated Segment Cache will be released. If the newly released memory does not cover 10% of the [...] diff --git a/blog/Building-a-Data-Warehouse-for-Traditional-Industry.md b/blog/Midland Realty.md similarity index 95% rename from blog/Building-a-Data-Warehouse-for-Traditional-Industry.md rename to blog/Midland Realty.md index 8b84a27d4a4..f963f0b0750 100644 --- a/blog/Building-a-Data-Warehouse-for-Traditional-Industry.md +++ b/blog/Midland Realty.md @@ -38,7 +38,7 @@ Now let's get started. Logically, our data architecture can be divided into four parts. - + - **Data integration**: This is supported by Flink CDC, DataX, and the Multi-Catalog feature of Apache Doris. - **Data management**: We use Apache Dolphinscheduler for script lifecycle management, privileges in multi-tenancy management, and data quality monitoring. @@ -53,7 +53,7 @@ We create our dimension tables and fact tables centering each operating entity i Our data warehouse is divided into five conceptual layers. 
We use Apache Doris and Apache DolphinScheduler to schedule the DAG scripts between these layers. - + Every day, the layers go through an overall update besides incremental updates in case of changes in historical status fields or incomplete data synchronization of ODS tables. @@ -91,7 +91,7 @@ This is our configuration for TBs of legacy data and GBs of incremental data. Yo 1. To integrate offline data and log data, we use DataX, which supports CSV format and readers of many relational databases, and Apache Doris provides a DataX-Doris-Writer. - + 2. We use Flink CDC to synchronize data from source tables. Then we aggregate the real-time metrics utilizing the Materialized View or the Aggregate Model of Apache Doris. Since we only have to process part of the metrics in a real-time manner and we don't want to generate too many database connections, we use one Flink job to maintain multiple CDC source tables. This is realized by the multi-source merging and full database sync features of Dinky, or you can implement a Flink DataStream [...] @@ -123,11 +123,11 @@ EXECUTE CDCSOURCE demo_doris WITH ( 3. We use SQL scripts or "Shell + SQL" scripts, and we perform script lifecycle management. At the ODS layer, we write a general DataX job file and pass parameters for each source table ingestion, instead of writing a DataX job for each source table. In this way, we make things much easier to maintain. We manage the ETL scripts of Apache Doris on DolphinScheduler, where we also conduct version control. In case of any errors in the production environment, we can always rollback. - + 4. After ingesting data with ETL scripts, we create a page in our reporting tool. We assign different privileges to different accounts using SQL, including the privilege of modifying rows, fields, and global dictionary. Apache Doris supports privilege control over accounts, which works the same as that in MySQL. - + We also use Apache Doris data backup for disaster recovery, Apache Doris audit logs to monitor SQL execution efficiency, Grafana+Loki for cluster metric alerts, and Supervisor to monitor the daemon processes of node components. @@ -165,10 +165,8 @@ For us, it is important to create a data dictionary because it largely reduces p Actually, before we evolved into our current data architecture, we tried Hive, Spark and Hadoop to build an offline data warehouse. It turned out that Hadoop was overkill for a traditional company like us since we didn't have too much data to process. It is important to find the component that suits you most. - + (Our old off-line data warehouse) -On the other hand, to smoothen our big data transition, we need to make our data platform as simple as possible in terms of usage and maintenance. That's why we landed on Apache Doris. It is compatible with MySQL protocol and provides a rich collection of functions so we don't have to develop our own UDFs. Also, it is composed of only two types of processes: frontends and backends, so it is easy to scale and track. - -Find Apache Doris developers on [Slack](https://join.slack.com/t/apachedoriscommunity/shared_invite/zt-1t3wfymur-0soNPATWQ~gbU8xutFOLog). +On the other hand, to smoothen our big data transition, we need to make our data platform as simple as possible in terms of usage and maintenance. That's why we landed on Apache Doris. It is compatible with MySQL protocol and provides a rich collection of functions so we don't have to develop our own UDFs. 
Also, it is composed of only two types of processes: frontends and backends, so it is easy to scale and track \ No newline at end of file diff --git a/blog/Listen-to-That-Poor-BI-Engineer-We-Need-Fast-Joins.md b/blog/Moka.md similarity index 95% rename from blog/Listen-to-That-Poor-BI-Engineer-We-Need-Fast-Joins.md rename to blog/Moka.md index 7d43fb4eab4..08d82ceeea2 100644 --- a/blog/Listen-to-That-Poor-BI-Engineer-We-Need-Fast-Joins.md +++ b/blog/Moka.md @@ -35,13 +35,13 @@ Business intelligence (BI) tool is often the last stop of a data processing pipe I work as an engineer that supports a human resource management system. One prominent selling point of our services is **self-service** **BI**. That means we allow users to customize their own dashboards: they can choose the fields they need and relate them to form the dataset as they want. - + Join query is a more efficient way to realize self-service BI. It allows people to break down their data assets into many smaller tables instead of putting it all in a flat table. This would make data updates much faster and more cost-effective, because updating the whole flat table is not always the optimal choice when you have plenty of new data flowing in and old data being updated or deleted frequently, as is the case for most data input. In order to maximize the time value of data, we need data updates to be executed really quickly. For this purpose, we looked into three OLAP databases on the market. They are all fast in some way but there are some differences. - + Greenplum is really quick in data loading and batch DML processing, but it is not good at handling high concurrency. There is a steep decline in performance as query concurrency rises. This can be risky for a BI platform that tries to ensure stable user experience. ClickHouse is mind-blowing in single-table queries, but it only allows batch update and batch delete, so that's less timely. @@ -51,7 +51,7 @@ JOIN, my old friend JOIN, is always a hassle. Join queries are demanding for bot We tested our candidate OLAP engines with our common join queries and our most notorious slow queries. - + As the number of tables joined grows, we witness a widening performance gap between Apache Doris and ClickHouse. In most join queries, Apache Doris was about 5 times faster than ClickHouse. In terms of slow queries, Apache Doris responded to most of them within less than 1 second, while the performance of ClickHouse fluctuated within a relatively large range. @@ -79,15 +79,15 @@ Human resource data is subject to very strict and fine-grained access control po How does all this add to complexity in engineering? Any user who inputs a query on our BI platform must go through multi-factor authentication, and the authenticated information will all be inserted into the SQL via `in` and then passed on to the OLAP engine. Therefore, the more fine-grained the privilege controls are, the longer the SQL will be, and the more time the OLAP system will spend on ID filtering. That's why our users are often tortured by high latency. - + So how did we fix that? We use the [Bloom Filter index](https://doris.apache.org/docs/dev/data-table/index/bloomfilter/) in Apache Doris. - + By adding Bloom Filter indexes to the relevant ID fields, we improve the speed of privileged queries by 30% and basically eliminate timeout errors. 
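As a rough illustration (the table and column names here are invented), a Bloom Filter index in Doris is declared as a table property, so it can be added to an existing table without rebuilding it:

```sql
-- Hypothetical example: index the high-cardinality ID columns that appear
-- in the privilege filters injected via "in".
ALTER TABLE hr_report_detail
SET ("bloom_filter_columns" = "employee_id,department_id,org_id");
```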
- + Tips on when you should use the Bloom Filter index: diff --git a/blog/Database-in-Fintech-How-to-Support-ten-thousand-Dashboards-Without-Causing-a-Mess.md b/blog/Pingan.md similarity index 94% rename from blog/Database-in-Fintech-How-to-Support-ten-thousand-Dashboards-Without-Causing-a-Mess.md rename to blog/Pingan.md index 07cadf15983..f8d49ab4808 100644 --- a/blog/Database-in-Fintech-How-to-Support-ten-thousand-Dashboards-Without-Causing-a-Mess.md +++ b/blog/Pingan.md @@ -50,11 +50,11 @@ When the metrics are soundly put in place, you can ingest new data into your dat As is mentioned, some metrics are produced by combining multiple fields in the source table. In data engineering, that is a multi-table join query. Based on the optimization experience of an Apache Doris user, we recommend flat tables instead of Star/Snowflake Schema. The user reduced the query response time on tables of 100 million rows **from 5s to 63ms** after such a change. - + The flat table solution also eliminates jitter. - + ## Enable SQL Caching to Reduce Resource Consumption @@ -65,13 +65,13 @@ Analysts often check data reports of the same metrics on a regular basis. These - A TPS (Transactions Per Second) of 300 is reached, with CPU, memory, disk, and I/O usage under 80%; - Under the recommended cluster size, over 10,000 metrics can be cached, which means you can save a lot of computation resources. - + ## Conclusion The complexity of data analysis in the financial industry lies in the data itself other than the engineering side. Thus, the underlying data architecture should focus on facilitating the unified and efficient management of data. Apache Doris provides the flexibility of simple metric registration and the ability of fast and resource-efficient metric computation. In this case, the user is able to handle 10,000 active financial metrics in 10,000 dashboards with 30% less ETL efforts. -Find Apache Doris developers on [Slack](https://join.slack.com/t/apachedoriscommunity/shared_invite/zt-1t3wfymur-0soNPATWQ~gbU8xutFOLog). + diff --git a/blog/For-Entry-Level-Data-Engineers-How-to-Build-a-Simple-but-Solid-Data-Architecture.md b/blog/Poly.md similarity index 97% rename from blog/For-Entry-Level-Data-Engineers-How-to-Build-a-Simple-but-Solid-Data-Architecture.md rename to blog/Poly.md index d682110bc6f..a0bfb3f9836 100644 --- a/blog/For-Entry-Level-Data-Engineers-How-to-Build-a-Simple-but-Solid-Data-Architecture.md +++ b/blog/Poly.md @@ -41,7 +41,7 @@ A prominent feature of ticketing services is the periodic spikes in ticket order The building blocks of this architecture are simple. You only need Apache Flink and Apache Kafka for data ingestion, and Apache Doris as an analytic data warehouse. - + Connecting data sources to the data warehouse is simple, too. The key component, Apache Doris, supports various data loading methods to fit with different data sources. You can perform column mapping, transforming, and filtering during data loading to avoid duplicate collection of data. To ingest a table, users only need to add the table name to the configurations, instead of writing a script themselves. @@ -53,7 +53,7 @@ Flink CDC was found to be the optimal choice if you are looking for higher stabi - Create two CDC jobs in Flink, one to capture the changed data (the Forward stream), the other to update the table management configurations (the Broadcast stream). - Configure all tables of the source database at the Sink end (the output end of Flink CDC). 
When there is newly added table in the source database, the Broadcast stream will be triggered to update the table management configurations. (You just need to configure the tables, instead of "creating" the tables.) - + ## Layering of Data Warehouse @@ -76,7 +76,7 @@ Like many non-tech business, the ticketing service provider needs a data warehou - **Data Analysis**: This involves data such as membership orders, attendance rates, and user portraits. - **Dashboarding**: This is to visually display sales data. - + These are all entry-level tasks in data analytics. One of the biggest burdens for the data engineers was to quickly develop new reports as the internal analysts required. The [Aggregate Key Model](https://doris.apache.org/docs/dev/data-table/data-model#aggregate-model) of Apache Doris is designed for this. diff --git a/blog/Tencent-Data-Engineers-Why-We-Went-from-ClickHouse-to-Apache-Doris.md b/blog/Tencent Music.md similarity index 93% rename from blog/Tencent-Data-Engineers-Why-We-Went-from-ClickHouse-to-Apache-Doris.md rename to blog/Tencent Music.md index 8f5aae89c27..bf649e35a17 100644 --- a/blog/Tencent-Data-Engineers-Why-We-Went-from-ClickHouse-to-Apache-Doris.md +++ b/blog/Tencent Music.md @@ -1,6 +1,6 @@ --- { - 'title': 'Tencent Data Engineer: Why We G from ClickHouse to Apache Doris?', + 'title': 'Tencent Data Engineer: Why We Go from ClickHouse to Apache Doris?', 'summary': "Evolution of the data processing architecture of Tencent Music Entertainment towards better performance and simpler maintenance.", 'date': '2023-03-07', 'author': 'Jun Zhang & Kai Dai', @@ -27,9 +27,9 @@ specific language governing permissions and limitations under the License. --> - + -This article is co-written by me and my colleague Kai Dai. We are both data platform engineers at [Tencent Music](https://www.tencentmusic.com/en-us/) (NYSE: TME), a music streaming service provider with a whopping 800 million monthly active users. To drop the number here is not to brag but to give a hint of the sea of data that my poor coworkers and I have to deal with everyday. +This article is co-written by me and my colleague Kai Dai. We are both data platform engineers at Tencent Music (NYSE: TME), a music streaming service provider with a whopping 800 million monthly active users. To drop the number here is not to brag but to give a hint of the sea of data that my poor coworkers and I have to deal with everyday. # What We Use ClickHouse For? @@ -37,7 +37,7 @@ The music library of Tencent Music contains data of all forms and types: recorde Specifically, we do all-round analysis of the songs, lyrics, melodies, albums, and artists, turn all this information into data assets, and pass them to our internal data users for inventory counting, user profiling, metrics analysis, and group targeting. - + We stored and processed most of our data in Tencent Data Warehouse (TDW), an offline data platform where we put the data into various tag and metric systems and then created flat tables centering each object (songs, artists, etc.). @@ -47,7 +47,7 @@ After that, our data analysts used the data under the tags and metrics they need The data processing pipeline looked like this: - + # The Problems with ClickHouse @@ -71,7 +71,7 @@ Statistically speaking, these features have cut our storage cost by 42% and deve During our usage of Doris, we have received lots of support from the open source Apache Doris community and timely help from the SelectDB team, which is now running a commercial version of Apache Doris. 
- + # Further Improvement to Serve Our Needs @@ -81,7 +81,7 @@ Speaking of the datasets, on the bright side, our data analysts are given the li Our solution is to introduce a semantic layer in our data processing pipeline. The semantic layer is where all the technical terms are translated into more comprehensible concepts for our internal data users. In other words, we are turning the tags and metrics into first-class citizens for data definement and management. - + **Why would this help?** @@ -95,7 +95,7 @@ Explicitly defining the tags and metrics at the semantic layer was not enough. I For this sake, we made the semantic layer the heart of our data management system: - + **How does it work?** @@ -111,7 +111,7 @@ As you can see, Apache Doris has played a pivotal role in our solution. Optimizi ## What We Want? - + Currently, we have 800+ tags and 1300+ metrics derived from the 80+ source tables in TDW. @@ -126,7 +126,7 @@ When importing data from TDW to Doris, we hope to achieve: 1. **Generate Flat Tables in Flink Instead of TDW** - + Generating flat tables in TDW has a few downsides: @@ -143,7 +143,7 @@ On the contrary, generating flat tables in Doris is much easier and less expensi As is shown below, Flink has aggregated the five lines of data, of which “ID”=1, into one line in Doris, reducing the data writing pressure on Doris. - + This can largely reduce storage costs since TDW no long has to maintain two copies of data and KafKa only needs to store the new data pending for ingestion. What’s more, we can add whatever ETL logic we want into Flink and reuse lots of development logic for offline and real-time data ingestion. @@ -184,7 +184,7 @@ max_cumulative_compaction_num_singleton_deltas - Optimization of the BE commit logic: conduct regular caching of BE lists, commit them to the BE nodes batch by batch, and use finer load balancing granularity. - + **4. Use Dori-on-ES in Queries** @@ -214,7 +214,7 @@ I. When Doris BE pulls data from Elasticsearch (1024 lines at a time by default) II. After the data pulling, Doris BE needs to conduct Join operations with local metric tables via SHUFFLE/BROADCAST, which can cost a lot. - + Thus, we make the following optimizations: @@ -224,7 +224,7 @@ Thus, we make the following optimizations: - Use ES to compress the queried data; turn multiple data fetch into one and reduce network I/O overhead. - Make sure that Doris BE only pulls the data of buckets related to the local metric tables and conducts local Join operations directly to avoid data shuffling between Doris BEs. - + As a result, we reduce the query response time for large group targeting from 60 seconds to a surprising 3.7 seconds. @@ -257,5 +257,3 @@ http://doris.apache.org **Apache Doris Github**: https://github.com/apache/doris - -Find Apache Doris developers on [Slack](https://join.slack.com/t/apachedoriscommunity/shared_invite/zt-1t3wfymur-0soNPATWQ~gbU8xutFOLog) diff --git a/blog/Tencent-LLM.md b/blog/Tencent-LLM.md index 3775c5a82d0..9e7c0fba8e6 100644 --- a/blog/Tencent-LLM.md +++ b/blog/Tencent-LLM.md @@ -36,7 +36,7 @@ We have adopted Large Language Models (LLM) to empower our Doris-based OLAP serv Our incentive was to save our internal staff from the steep learning curve of SQL writing. Thus, we used LLM as an intermediate. It transforms natural language questions into SQL statements and sends the SQLs to the OLAP engine for execution. 
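To give a feel for that intermediate step, suppose an analyst asks "how many new users registered per day over the last week"; the LLM is expected to emit a Doris-compatible query along these lines (the table and column names are hypothetical):

```sql
-- Hypothetical SQL generated by the LLM from a natural-language question.
SELECT DATE(register_time) AS reg_date,
       COUNT(*)            AS new_users
FROM   dwd_user_register
WHERE  register_time >= DATE_SUB(CURDATE(), INTERVAL 7 DAY)
GROUP  BY DATE(register_time)
ORDER  BY reg_date;
```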
- + Like every AI-related experience, we came across some friction: @@ -53,7 +53,7 @@ For problem No.1, we introduce a semantic layer between the LLM and the OLAP eng Besides that, the semantic layer can optimize the computation logic. When analysts input a question that involves a complicated query, let's say, a multi-table join, the semantic layer can split that into multiple single-table queries to reduce semantic distortion. - + ### 2. LLM parsing rules @@ -61,7 +61,7 @@ To increase cost-effectiveness in using LLM, we evaluate the computation complex For example, when an analyst inputs "tell me the earnings of the major musical platforms", the LLM identifies that this question only entails several metrics or dimensions, so it will not further parse it but send it straight for SQL generation and execution. This can largely shorten query response time and reduce API expenses. - + ### 3. Schema Mapper and external knowledge base @@ -69,7 +69,7 @@ To empower the LLM with niche knowledge, we added a Schema Mapper upstream from We are constantly testing and optimizing the Schema Mapper. We categorize and rate content in the external knowledge base, and do various levels of mapping (full-text mapping and fuzzy mapping) to enable better semantic parsing. - + ### 4. Plugins @@ -78,7 +78,7 @@ We used plugins to connect the LLM to more fields of information, and we have di - **Embedding local files**: This is especially useful when we need to "teach" the LLM the latest regulatory policies, which are often text files. Firstly, the system vectorizes the local text file, executes semantic searches to find matching or similar terms in the local file, extracts the relevant contents and puts them into the LLM parsing window to generate output. - **Third-party plugins**: The marketplace is full of third-party plugins that are designed for all kinds of sectors. With them, the LLM is able to deal with wide-ranging topics. Each plugin has its own prompts and calling function. Once the input question hits a prompt, the relevant plugin will be called. - + After we are done with above four optimizations, the SuperSonic framework comes into being. @@ -86,7 +86,7 @@ After we are done with above four optimizations, the SuperSonic framework comes Now let me walk you through this [framework](https://github.com/tencentmusic/supersonic): - + - An analyst inputs a question. - The Schema Mapper maps the question to an external knowledge base. @@ -97,7 +97,7 @@ Now let me walk you through this [framework](https://github.com/tencentmusic/sup **Example** - + To answer whether a certain song can be performed on variety shows, the system retrieves the OLAP data warehouse for details about the song, and presents it with results from the Commercial Use Query third-party plugin. @@ -107,7 +107,7 @@ As for the OLAP part of this framework, after several rounds of architectural ev Raw data is sorted into tags and metrics, which are custom-defined by the analysts. The tags and metrics are under unified management in order to avoid inconsistent definitions. Then, they are combined into various tagsets and metricsets for various queries. - + We have drawn two main takeaways for you from our architectural optimization experience. 
@@ -146,4 +146,4 @@ When the number of aggregation tasks or data volume becomes overwhelming for Fli ## What's Next -With an aim to reduce costs and increase service availability, we plan to test the newly released Storage-Compute Separation and Cross-Cluster Replication of Doris, and we embrace any ideas and inputs about the SuperSonic framework and the Apache Doris project. +With an aim to reduce costs and increase service availability, we plan to test the newly released Storage-Compute Separation and Cross-Cluster Replication of Doris, and we embrace any ideas and inputs about the SuperSonic framework and the Apache Doris project. \ No newline at end of file diff --git a/blog/Replacing-Apache-Hive-Elasticsearch-and-PostgreSQL-with-Apache-Doris.md b/blog/Tianyancha.md similarity index 90% rename from blog/Replacing-Apache-Hive-Elasticsearch-and-PostgreSQL-with-Apache-Doris.md rename to blog/Tianyancha.md index 2d07e649cfa..e2c1678725b 100644 --- a/blog/Replacing-Apache-Hive-Elasticsearch-and-PostgreSQL-with-Apache-Doris.md +++ b/blog/Tianyancha.md @@ -37,15 +37,15 @@ Our old data warehouse consisted of the most popular components of the time, inc As you can imagine, a long and complicated data pipeline is high-maintenance and detrimental to development efficiency. Moreover, they are not capable of ad-hoc queries. So as an upgrade to our data warehouse, we replaced most of these components with [Apache Doris](https://github.com/apache/doris), a unified analytic database. - + - + ## Data Flow This is a lateral view of our data warehouse, from which you can see how the data flows. - + For starters, binlogs from MySQL will be ingested into Kafka via Canal, while user activity logs will be transferred to Kafka via Apache Flume. In Kafka, data will be cleaned and organized into flat tables, which will be later turned into aggregated tables. Then, data will be passed from Kafka to Apache Doris, which serves as the storage and computing engine. @@ -59,7 +59,7 @@ This is how Apache Doris replaces the roles of Hive, Elasticsearch, and PostgreS **After**: Since Apache Doris has all the itemized data, whenever it is faced with a new request, it can simply pull the metadata and configure the query conditions. Then it is ready for ad-hoc queries. In short, it only requires low-code configuration to respond to new requests. - + ## User Segmentation @@ -71,7 +71,7 @@ Tables in Elasticsearch and PostgreSQL were unreusable, making this architecture In this Doris-centered user segmentation process, we don't have to pre-define new tags. Instead, tags can be auto-generated based on the task conditions. The processing pipeline has the flexibility that can make our user-group-based A/B testing easier. Also, as both the itemized data and user group packets are in Apache Doris, we don't have to attend to the read and write complexity between multiple components. - + ## Trick to Speed up User Segmentation by 70% @@ -79,9 +79,9 @@ Due to risk aversion reasons, random generation of `user_id` is the choice for m To solve that, we created consecutive and dense mappings for these user IDs. **In this way, we decreased our user segmentation latency by 70%.** - + - + ### Example @@ -89,13 +89,13 @@ To solve that, we created consecutive and dense mappings for these user IDs. **I We adopt the Unique model for user ID mapping tables, where the user ID is the unique key. The mapped consecutive IDs usually start from 1 and are strictly increasing. 
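A minimal sketch of such a mapping table, with hypothetical names and bucket count, could look like this:

```sql
-- Hypothetical Step 1: Unique Key model, original user ID -> consecutive ID.
CREATE TABLE IF NOT EXISTS user_id_mapping (
    user_id        BIGINT NOT NULL COMMENT "original, randomly generated user ID",
    consecutive_id BIGINT NOT NULL COMMENT "mapped ID, starting from 1 and strictly increasing"
)
UNIQUE KEY(user_id)
DISTRIBUTED BY HASH(user_id) BUCKETS 32;
```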
- + **Step 2: Create a user group table:** We adopt the Aggregate model for user group tables, where user tags serve as the aggregation keys. - + Supposing that we need to pick out the users whose IDs are between 0 and 2000000. @@ -104,13 +104,13 @@ The following snippets use non-consecutive (`tyc_user_id`) and consecutive (`tyc - Non-Consecutive User IDs: **1843ms** - Consecutive User IDs: **543ms** - + ## Conclusion We have 2 clusters in Apache Doris accommodating tens of TBs of data, with almost a billion new rows flowing in every day. We used to witness a steep decline in data ingestion speed as data volume expanded. But after upgrading our data warehouse with Apache Doris, we increased our data writing efficiency by 75%. Also, in user segmentation with a result set of less than 5 million, it is able to respond within milliseconds. Most importantly, our data warehouse has been simpler and friendli [...] - + Lastly, I would like to share with you something that interested us most when we first talked to the [Apache Doris community](https://t.co/KcxAtAJZjZ): @@ -118,5 +118,3 @@ Lastly, I would like to share with you something that interested us most when we - It is well-integrated with the data ecosystem and can smoothly interface with most data sources and data formats. - It allows us to implement elastic scaling of clusters using the command line interface. - It outperforms ClickHouse in **join queries**. - -Find Apache Doris developers on [Slack](https://join.slack.com/t/apachedoriscommunity/shared_invite/zt-1t3wfymur-0soNPATWQ~gbU8xutFOLog) diff --git a/blog/Choosing-an-OLAP-Engine-for-Financial-Risk-Management-What-to-Consider.md b/blog/Xingyun.md similarity index 97% rename from blog/Choosing-an-OLAP-Engine-for-Financial-Risk-Management-What-to-Consider.md rename to blog/Xingyun.md index 3603ce0f723..8a9a99782d4 100644 --- a/blog/Choosing-an-OLAP-Engine-for-Financial-Risk-Management-What-to-Consider.md +++ b/blog/Xingyun.md @@ -46,7 +46,7 @@ To speed up the highly concurrent point queries, you can create [Materialized Vi To facilitate queries on large tables, you can leverage the [Colocation Join](https://doris.apache.org/docs/dev/query-acceleration/join-optimization/colocation-join/) mechanism. Colocation Join minimizes data transfer between computation nodes to reduce overheads brought by data movement. Thus, it can largely improve query speed when joining large tables. - + ## Log Analysis @@ -54,7 +54,7 @@ Log analysis is important in financial data processing. Real-time processing and Retrieval is a major part of log analysis, so [Apache Doris 2.0](https://doris.apache.org/docs/dev/releasenotes/release-2.0.0) supports inverted index, which is a way to accelerate text searching and equivalence/range queries on numerics and datetime. It allows users to quickly locate the log record that they need among the massive data. The JSON storage feature in Apache Doris is reported to reduce storage costs of user activity logs by 70%, and the variety of parse functions provided c [...] - + ## Easy Maintenance @@ -64,4 +64,4 @@ In addition to the easy deployment, Apache Doris has a few mechanisms that are d This is overall data architecture in the case. The user utilizes Apache Flume for log data collection, and DataX for data update. Data from multiple sources will be collected into Apache Doris to form a data mart, from which analysts extract information to generate reports and dashboards for reference in risk control and business decisions. 
As for stability of the data mart itself, Grafana and Prometheus are used to monitor memory usage, compaction score and query response time of Apache [...] - + \ No newline at end of file diff --git a/blog/Database-Dissection-How-Fast-Data-Queries-Are-Implemented.md b/blog/Zhihu.md similarity index 94% rename from blog/Database-Dissection-How-Fast-Data-Queries-Are-Implemented.md rename to blog/Zhihu.md index 54b2d4e0bb4..9bef297d53a 100644 --- a/blog/Database-Dissection-How-Fast-Data-Queries-Are-Implemented.md +++ b/blog/Zhihu.md @@ -50,7 +50,7 @@ User segmentation is when analysts pick out a group of website users that share We realize that instead of executing set operations on one big dataset, we can divide our dataset into smaller ones, execute set operations on each of them, and then merge all the results. In this way, each small dataset is computed by one thread/queue. Then we have a queue to do the final merging. It's simple distributed computing thinking. - + Example: @@ -65,17 +65,17 @@ The problem here is, since user tags are randomly distributed across various mac This is enabled by the Colocate mechanism of Apache Doris. The idea of Colocate is to place data chunks that are often accessed together onto the same node, so as to reduce cross-node data transfer and thus, get lower latency. - + The implementation is simple: Bind one group key to one machine. Then naturally, data corresponding to that group key will be pre-bound to that machine. The following is the query plan before we adopted Colocate: It is complicated, with a lot of data shuffling. - + This is the query plan after. It is much simpler, which is why queries are much faster and less costly. - + ### 3. Merge the operators @@ -89,7 +89,7 @@ orthogonal_bitmap_union_count==bitmap_and(bitmap1,bitmap_and(bitmap2,bitmap3) Query execution with one compound function is much faster than that with a chain of simple functions, as you can tell from the lengths of the flow charts: - + - **Multiple Simple functions**: This involves three function executions and two intermediate storage steps. It's a long and slow process. - **One compound function**: Simple in and out. @@ -102,11 +102,11 @@ This is about putting the right workload on the right component. Apache Doris su In offline data ingestion, we used to perform most computation in Apache Hive, write the data files to HDFS, and pull data regularly from HDFS to Apache Doris. However, after Doris obtains parquet files from HDFS, it performs a series of operations on them before it can turn them into segment files: decompressing, bucketing, sorting, aggregating, and compressing. These workloads will be borne by Doris backends, which have to undertake a few bitmap operations at the same time. So there is [...] - + So we decided on the Spark Load method. It allows us to split the ingestion process into two parts: computation and storage, so we can move all the bucketing, sorting, aggregating, and compressing to Spark clusters. Then Spark writes the output to HDFS, from which Doris pulls data and flushes it to the local disks. - + When ingesting 1.2 TB data (that's 110 billion rows), the Spark Load method only took 55 minutes.
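For reference, a Spark Load job is declared in Doris itself; the following is only a sketch, with the label, HDFS path, table, and resource names as placeholders:

```sql
-- Hypothetical Spark Load job: Spark does the bucketing, sorting and aggregating,
-- and Doris only pulls the prepared output from HDFS.
LOAD LABEL example_db.user_tag_20230101
(
    DATA INFILE("hdfs://namenode:8020/warehouse/user_tags/dt=2023-01-01/*")
    INTO TABLE user_tag_bitmap
    FORMAT AS "parquet"
)
WITH RESOURCE 'spark_cluster_0'
(
    "spark.executor.memory" = "4g"
)
PROPERTIES ("timeout" = "3600");
```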
@@ -126,7 +126,7 @@ They compared query response time before and after the vectorization in seven of The results are as below: - + ## Conclusion @@ -137,4 +137,4 @@ In short, what contributed to the fast data loading and data queries in this cas - Support for a wide range of data loading methods to choose from - A vectorized engine that brings overall performance increase -It takes efforts from both the database developers and users to make fast performance possible. The user's experience and knowledge of their own status quo will allow them to figure out the quickest path, while a good database design will help pave the way and make users' life easier. +It takes efforts from both the database developers and users to make fast performance possible. The user's experience and knowledge of their own status quo will allow them to figure out the quickest path, while a good database design will help pave the way and make users' life easier. \ No newline at end of file diff --git a/blog/release-note-2.0.0.md b/blog/release-note-2.0.0.md index 60fed524577..93a60c8ee92 100644 --- a/blog/release-note-2.0.0.md +++ b/blog/release-note-2.0.0.md @@ -47,7 +47,7 @@ This new version highlights: In SSB-Flat and TPC-H benchmarking, Apache Doris 2.0.0 delivered **over 10-time faster query performance** compared to an early version of Apache Doris. - + This is realized by the introduction of a smarter query optimizer, inverted index, a parallel execution model, and a series of new functionalities to support high-concurrency point queries. @@ -57,7 +57,7 @@ The brand new query optimizer, Nereids, has a richer statistical base and adopts TPC-H tests showed that Nereids, with no human intervention, outperformed the old query optimizer by a wide margin. Over 100 users have tried Apache Doris 2.0.0 in their production environment and the vast majority of them reported huge speedups in query execution. - + **Doc**: https://doris.apache.org/docs/dev/query-acceleration/nereids/ @@ -65,11 +65,11 @@ Nereids is enabled by default in Apache Doris 2.0.0: `SET enable_nereids_planner ### Inverted Index -In Apache Doris 2.0.0, we introduced [inverted index](https://doris.apache.org/docs/dev/data-table/index/inverted-index?_highlight=inverted) to better support fuzzy keyword search, equivalence queries, and range queries. +In Apache Doris 2.0.0, we introduced inverted index to better support fuzzy keyword search, equivalence queries, and range queries. A smartphone manufacturer tested Apache Doris 2.0.0 in their user behavior analysis scenarios. With inverted index enabled, v2.0.0 was able to finish the queries within milliseconds and maintain stable performance as the query concurrency level went up. In this case, it is 5 to 90 times faster than its old version. - + ### 20 times higher concurrency capability @@ -79,7 +79,7 @@ For a column-oriented DBMS like Apache Doris, the I/O usage of point queries wil After these optimizations, Apache Doris 2.0 reached a concurrency level of **30,000 QPS per node** on YCSB on a 16 Core 64G cloud server with 4×1T hard drives, representing an improvement of **20 times** compared to its older version. This makes Apache Doris a good alternative to HBase in high-concurrency scenarios, so that users don't need to endure extra maintenance costs and redundant storage brought by complicated tech stacks. 
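As a side note, the point-query optimizations in 2.0 are typically combined with a merge-on-write Unique Key table that keeps an extra row-format copy of each record; a hypothetical sketch (property availability depends on the exact build):

```sql
-- Hypothetical table tuned for high-concurrency primary-key point queries.
CREATE TABLE IF NOT EXISTS user_profile_point (
    user_id   BIGINT      NOT NULL,
    user_name VARCHAR(64),
    city      VARCHAR(32)
)
UNIQUE KEY(user_id)
DISTRIBUTED BY HASH(user_id) BUCKETS 32
PROPERTIES (
    "enable_unique_key_merge_on_write" = "true",
    "store_row_column" = "true"   -- row store serves SELECT ... WHERE user_id = ? quickly
);
```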
-Read more: https://doris.apache.org/blog/How-We-Increased-Database-Query-Concurrency-by-20-Times +Read more: https://doris.apache.org/blog/High_concurrency ### A self-adaptive parallel execution model @@ -101,19 +101,19 @@ Apache Doris has been pushing its boundaries. Starting as an OLAP engine for rep Apache Doris 2.0.0 provides native support for semi-structured data. In addition to JSON and Array, it now supports a complex data type: Map. Based on Light Schema Change, it also supports Schema Evolution, which means you can adjust the schema as your business changes. You can add or delete fields and indexes, and change the data types for fields. As we introduced inverted index and a high-performance text analysis algorithm into it, it can execute full-text search and dimensional analy [...] - + ### Enhanced data lakehousing capabilities In Apache Doris 1.2, we introduced Multi-Catalog to allow for auto-mapping and auto-synchronization of data from heterogeneous sources. In version 2.0.0, we extended the list of data sources supported and optimized Doris based on users' needs in production environments. - + Apache Doris 2.0.0 supports dozens of data sources including Hive, Hudi, Iceberg, Paimon, MaxCompute, Elasticsearch, Trino, ClickHouse, and almost all open lakehouse formats. It also supports snapshot queries on Hudi Copy-on-Write tables and read optimized queries on Hudi Merge-on-Read tables. It allows for authorization of Hive Catalog using Apache Ranger, so users can reuse their existing privilege control system. Besides, it supports extensible authorization plug-ins to enable user-de [...] TPC-H benchmark tests showed that Apache Doris 2.0.0 is 3~5 times faster than Presto/Trino in queries on Hive tables. This is realized by all-around optimizations (in small file reading, flat table reading, local file cache, ORC/Parquet file reading, Compute Nodes, and information collection of external tables) finished in this development cycle and the distributed execution framework, vectorized execution engine, and query optimizer of Apache Doris. - + All this gives Apache Doris 2.0.0 an edge in data lakehousing scenarios. With Doris, you can do incremental or overall synchronization of multiple upstream data sources in one place, and expect much higher data query performance than other query engines. The processed data can be written back to the sources or provided for downstream systems. In this way, you can make Apache Doris your unified data analytic gateway. @@ -144,23 +144,23 @@ As part of our continuing effort to strengthen the real-time analytic capability The sources of system instability often include small file merging, write amplification, and the consequential disk I/O and CPU overheads. Hence, we introduced Vertical Compaction and Segment Compaction in version 2.0.0 to eliminate OOM errors in compaction and avoid the generation of too many segment files during data writing. After such improvements, Apache Doris can write data 50% faster while **using only 10% of the memory that it previously used**. -Read more: https://doris.apache.org/blog/Understanding-Data-Compaction-in-3-Minutes/ +Read more: https://doris.apache.org/blog/Compaction ### Auto-synchronization of table schema The latest Flink-Doris-Connector allows users to synchronize an entire database (such as MySQL and Oracle) to Apache Doris by one simple step. According to our test results, one single synchronization task can support the real-time concurrent writing of thousands of tables.
Users no longer need to go through a complicated synchronization procedure because Apache Doris has automated the process. Changes in the upstream data schema will be automatically captured and dynamically updated to [...] -Read more: https://doris.apache.org/blog/Auto-Synchronization-of-an-Entire-MySQL-Database-for-Data-Analysis +Read more: https://doris.apache.org/blog/FDC ## A New Multi-Tenant Resource Isolation Solution The purpose of multi-tenant resource isolation is to avoid resource preemption in the case of heavy loads. For that sake, older versions of Apache Doris adopted a hard isolation plan featured by Resource Group: Backend nodes of the same Doris cluster would be tagged, and those of the same tag formed a Resource Group. As data was ingested into the database, different data replicas would be written into different Resource Groups, which would be responsible for different workloads. For examp [...] - + This is an effective solution, but in practice, it happens that some Resource Groups are heavily occupied while others are idle. We want a more flexible way to reduce the vacancy rate of resources. Thus, in 2.0.0, we introduce the Workload Group resource soft limit. - + The idea is to divide workloads into groups to allow for flexible management of CPU and memory resources. Apache Doris associates a query with a Workload Group, and limits the percentage of CPU and memory that a single query can use on a backend node. The memory soft limit can be configured and enabled by the user. @@ -190,7 +190,7 @@ Apache Doris 2.0 provides two solutions to address the needs of the first two ty 1. **Compute nodes**. We introduced stateless compute nodes in version 2.0. Unlike the mix nodes, the compute nodes do not save any data and are not involved in workload balancing of data tablets during cluster scaling. Thus, they are able to quickly join the cluster and share the computing pressure during peak times. In addition, in data lakehouse analysis, these nodes will be the first ones to execute queries on remote storage (HDFS/S3) so there will be no resource competition between [...] 1. Doc: https://doris.apache.org/docs/dev/advanced/compute_node/ 2. **Hot-cold data separation**. Hot/cold data refers to data that is frequently/seldom accessed, respectively. Generally, it makes more sense to store cold data in low-cost storage. Older versions of Apache Doris support lifecycle management of table partitions: As hot data cooled down, it would be moved from SSD to HDD. However, data was stored with multiple replicas on HDD, which was still a waste. Now, in Apache Doris 2.0, cold data can be stored in object storage, which is even chea [...] - 1. Read more: https://doris.apache.org/blog/Tiered-Storage-for-Hot-and-Cold-Data:-What,-Why,-and-How?/ + 1. Read more: https://doris.apache.org/blog/HCDS/ For a neater separation of computation and storage, the VeloDB team is going to contribute the Cloud Compute-Storage-Separation solution to the Apache Doris project. Its performance and stability have stood the test of hundreds of companies in their production environment. The merging of code will be finished by October this year, and all Apache Doris users will be able to get an early taste of it in September.
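To illustrate the hot-cold separation described in point 2, cold data is offloaded through a storage policy bound to an object storage resource; the snippet below is only a sketch with placeholder names, endpoints, and credentials, and the exact property keys may differ between Doris versions:

```sql
-- Hypothetical setup: an S3 resource plus a policy that "cools down" data after 7 days.
CREATE RESOURCE "remote_s3"
PROPERTIES (
    "type" = "s3",
    "s3.endpoint" = "s3.example.com",
    "s3.region" = "us-east-1",
    "s3.bucket" = "doris-cold-data",
    "s3.root.path" = "doris_cold",
    "s3.access_key" = "<access_key>",
    "s3.secret_key" = "<secret_key>"
);

CREATE STORAGE POLICY cool_down_after_7d
PROPERTIES (
    "storage_resource" = "remote_s3",
    "cooldown_ttl" = "604800"   -- 7 days, in seconds
);

-- A table or partition then opts in with: PROPERTIES ("storage_policy" = "cool_down_after_7d")
```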
diff --git a/blog/Use-Apache-Doris-with-AI-chatbots.md b/blog/scenario.md similarity index 99% rename from blog/Use-Apache-Doris-with-AI-chatbots.md rename to blog/scenario.md index af53293b174..0f537f84bff 100644 --- a/blog/Use-Apache-Doris-with-AI-chatbots.md +++ b/blog/scenario.md @@ -1,6 +1,6 @@ --- { - 'title': 'How Does Apache Doris Help AISPEECH Build a Data Warehouse in AI Chatbots Scenario', + 'title': 'How Does Apache Doris Help AISPEACH Build a Datawherehouse in AI Chatbots Scenario', 'summary': "Guide: In 2019, AISPEACH built a real-time and offline datawarehouse based on Apache Doris. Reling on its flexible query model, extremely low maintenance costs, high development efficiency, and excellent query performance, Apache Doris has been used in many business scenarios such as real-time business operations, AI chatbots analysis. It meets various data analysis needs such as device portrait/user label, real-time operation, data dashboard, self-service BI and financia [...] 'date': '2022-11-24', 'author': 'Zhao Wei', @@ -27,7 +27,7 @@ specific language governing permissions and limitations under the License. --> -# How Does Apache Doris Help AISPEECH Build a Data warehouse in AI Chatbots Scenario +# How Does Apache Doris Help AISPEACH Build a Datawherehouse in AI Chatbots Scenario  diff --git a/static/images/Unicom-1.png b/static/images/Unicom-1.png deleted file mode 100644 index 2773ab0a1e8..00000000000 Binary files a/static/images/Unicom-1.png and /dev/null differ diff --git a/static/images/Unicom-2.png b/static/images/Unicom-2.png deleted file mode 100644 index 99b21ee4ccb..00000000000 Binary files a/static/images/Unicom-2.png and /dev/null differ --------------------------------------------------------------------- To unsubscribe, e-mail: commits-unsubscr...@doris.apache.org For additional commands, e-mail: commits-h...@doris.apache.org