This is an automated email from the ASF dual-hosted git repository. luzhijing pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/doris-website.git
The following commit(s) were added to refs/heads/master by this push: new 1b31cf2b522 add blogs (#299) 1b31cf2b522 is described below commit 1b31cf2b522a55b4d5e9c331e43085f2e40fc9ba Author: Hu Yanjun <100749531+httpshir...@users.noreply.github.com> AuthorDate: Fri Sep 1 21:16:57 2023 +0800 add blogs (#299) --- blog/Pingan.md | 79 +++++++++++++++++++++ blog/Tencent-LLM.md | 149 ++++++++++++++++++++++++++++++++++++++++ blog/Xingyun.md | 67 ++++++++++++++++++ static/images/Pingan_1.png | Bin 0 -> 655030 bytes static/images/Pingan_2.png | Bin 0 -> 121650 bytes static/images/Pingan_3.png | Bin 0 -> 232177 bytes static/images/Tencent_LLM_1.png | Bin 0 -> 144899 bytes static/images/Tencent_LLM_2.png | Bin 0 -> 183848 bytes static/images/Tencent_LLM_3.png | Bin 0 -> 261700 bytes static/images/Tencent_LLM_4.png | Bin 0 -> 570073 bytes static/images/Tencent_LLM_5.png | Bin 0 -> 607999 bytes static/images/Tencent_LLM_6.png | Bin 0 -> 233280 bytes static/images/Tencent_LLM_7.png | Bin 0 -> 395985 bytes static/images/Tencent_LLM_8.png | Bin 0 -> 722036 bytes static/images/Xingyun_1.png | Bin 0 -> 488022 bytes static/images/Xingyun_2.png | Bin 0 -> 249250 bytes static/images/Xingyun_3.png | Bin 0 -> 548768 bytes 17 files changed, 295 insertions(+) diff --git a/blog/Pingan.md b/blog/Pingan.md new file mode 100644 index 00000000000..f8d49ab4808 --- /dev/null +++ b/blog/Pingan.md @@ -0,0 +1,79 @@ +--- +{ + 'title': 'Database in Fintech: How to Support 10,000 Dashboards Without Causing a Mess', + 'summary': "This article introduces the lifecycle of financial metrics in a database, from how they're produced to how they're efficiently presented in data reports.", + 'date': '2023-08-05', + 'author': 'Hou Lan', + 'tags': ['Best Practice'], +} + +--- + +<!-- +Licensed to the Apache Software Foundation (ASF) under one +or more contributor license agreements. See the NOTICE file +distributed with this work for additional information +regarding copyright ownership. 
The ASF licenses this file +to you under the Apache License, Version 2.0 (the +"License"); you may not use this file except in compliance +with the License. You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, +software distributed under the License is distributed on an +"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +KIND, either express or implied. See the License for the +specific language governing permissions and limitations +under the License. +--> + +In a data-intensive industry like finance, data comes from numerous entries and goes to numerous exits. Such a status quo can easily, and almost inevitably, lead to chaos in data analysis and management. For example, analysts from different business lines define their own financial metrics in data reports. When you pool these countless reports together in your data architecture, you will find that many metrics overlap or even contradict each other in definition. The consequence is, develop [...] + +As your business grows, your data management will arrive at a point where "standardization" is needed. In terms of data engineering, that means you need a data platform where you can produce and manage all metrics. That's your architectural prerequisite for providing efficient financial services. + +This article introduces the lifecycle of financial metrics in a database (in this case, [Apache Doris](https://doris.apache.org/)), from how they're produced to how they're efficiently presented in data reports. You will get an inside view of what's behind those fancy financial dashboards. + +## Define New Metrics & Add Them to Your Database + +Fundamentally, metrics are fields in a table. To provide a more concrete idea of them, I will explain with an example from the banking industry. + +Banks measure the assets of customers by AUM (Assets Under Management).
In this scenario, AUM is an **atomic metric**, which is often a field in the source data table. On the basis of AUM, analysts derive a series of **derivative metrics**, such as "year-on-year AUM growth", "month-on-month AUM growth", and "AUM per customer". + +Once you define the new metrics, you add them to your data reports, which involves a few simple configurations in Apache Doris: + +Developers update the metadata accordingly, register the base table from which the metrics are derived, configure the data granularity and update frequency of intermediate tables, and input the metric name and definition. Some engineers will also monitor the metrics to identify abnormalities and remove redundant metrics based on a metric evaluation system. + +When the metrics are soundly put in place, you can ingest new data into your database to get your data reports. For example, if you ingest CSV files, we recommend the Stream Load method of Apache Doris and a file size of 1~10 GB per batch. Eventually, these metrics will be visualized in data charts. + +## Calculate Your Metrics + +As mentioned, some metrics are produced by combining multiple fields in the source table. In data engineering, that is a multi-table join query. Based on the optimization experience of an Apache Doris user, we recommend flat tables instead of the Star/Snowflake Schema. The user reduced the query response time on tables of 100 million rows **from 5s to 63ms** after such a change. + + + +The flat table solution also eliminates jitter in query performance. + + + +## Enable SQL Caching to Reduce Resource Consumption + +Analysts often check data reports of the same metrics on a regular basis. These reports are produced by the same SQL, so one way to further improve query speed is SQL caching. Here is how it turns out in a use case with SQL caching enabled.
+ +- All queries are responded to within 10ms; +- When computing 30 metrics simultaneously (over 120 SQL commands), results can be returned within 600ms; +- A TPS (Transactions Per Second) of 300 is reached, with CPU, memory, disk, and I/O usage under 80%; +- Under the recommended cluster size, over 10,000 metrics can be cached, which means you can save a lot of computation resources. + + + +## Conclusion + +The complexity of data analysis in the financial industry lies in the data itself rather than the engineering side. Thus, the underlying data architecture should focus on facilitating the unified and efficient management of data. Apache Doris provides the flexibility of simple metric registration and the ability to compute metrics quickly and resource-efficiently. In this case, the user is able to handle 10,000 active financial metrics in 10,000 dashboards with 30% less ETL effort. + + + + + + + diff --git a/blog/Tencent-LLM.md b/blog/Tencent-LLM.md new file mode 100644 index 00000000000..9e7c0fba8e6 --- /dev/null +++ b/blog/Tencent-LLM.md @@ -0,0 +1,149 @@ +--- +{ + 'title': 'LLM-Powered OLAP: the Tencent Experience with Apache Doris', + 'summary': "The exploration of an LLM+OLAP solution is a bumpy journey, but phew, it now works well for the Tencent case, and they're writing down every lesson learned to share with you.", + 'date': '2023-08-29', + 'author': 'Jun Zhang & Lei Luo', + 'tags': ['Best Practice'], +} + +--- + +<!-- +Licensed to the Apache Software Foundation (ASF) under one +or more contributor license agreements. See the NOTICE file +distributed with this work for additional information +regarding copyright ownership. The ASF licenses this file +to you under the Apache License, Version 2.0 (the +"License"); you may not use this file except in compliance +with the License.
You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, +software distributed under the License is distributed on an +"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +KIND, either express or implied. See the License for the +specific language governing permissions and limitations +under the License. +--> + +Six months ago, I wrote about [why we replaced ClickHouse with Apache Doris as an OLAP engine](https://doris.apache.org/blog/Tencent%20Music/) for our data management system. Back then, we were struggling with the auto-generation of SQL statements. Since then, we have made progress big enough to serve as a reference for you (I think), so here I am again. + +We have adopted Large Language Models (LLMs) to empower our Doris-based OLAP services. + +## LLM + OLAP + +Our incentive was to save our internal staff from the steep learning curve of SQL writing. Thus, we used an LLM as an intermediary. It transforms natural language questions into SQL statements and sends them to the OLAP engine for execution. + + + +Like every AI-related experience, we came across some friction: + +1. The LLM does not understand data jargon, like "fields", "rows", "columns" and "tables". Instead, it can perfectly translate business terms like "corporate income" and "DAU", which are basically what the fields/rows/columns are about. That means it can work well only if the analysts use exactly the right word to refer to the metric they need when typing their questions. +2. The LLM we are using is slow in inference. It takes over 10 seconds to respond. As it charges fees by token, cost-effectiveness becomes a problem. +3. Although the LLM is trained on a large collection of public datasets, it is under-informed about niche knowledge. In our case, the LLM is super unfamiliar with indie songs, so even if the songs are included in our database, the LLM will not be able to identify them properly. +4.
Sometimes our input questions require adequate and up-to-date legal, political, financial, and regulatory information, which is hard to include in a training dataset or knowledge base. We need to connect the LLM to wider information bases in order to perform more diversified tasks. + +We knock these problems down one by one. + +### 1. A semantic layer + +For problem No.1, we introduce a semantic layer between the LLM and the OLAP engine. This layer translates business terms into the corresponding data fields. It can identify data filtering conditions from the various natural language wordings, relate them to the metrics involved, and then generate SQL statements. + +Besides that, the semantic layer can optimize the computation logic. When analysts input a question that involves a complicated query, let's say, a multi-table join, the semantic layer can split that into multiple single-table queries to reduce semantic distortion. + + + +### 2. LLM parsing rules + +To increase cost-effectiveness in using the LLM, we evaluate the computation complexity of all scenarios, such as metric computation, detailed record retrieval, and user segmentation. Then, we create rules and dedicate the LLM-parsing step to only the complicated tasks. That means for simple computation tasks, it will skip the parsing. + +For example, when an analyst inputs "tell me the earnings of the major musical platforms", the LLM identifies that this question only entails several metrics or dimensions, so it will not parse it further but send it straight to SQL generation and execution. This can greatly shorten query response time and reduce API expenses. + + + +### 3. Schema Mapper and external knowledge base + +To empower the LLM with niche knowledge, we added a Schema Mapper upstream of the LLM. The Schema Mapper maps the input question to an external knowledge base, and then the LLM does the parsing. + +We are constantly testing and optimizing the Schema Mapper.
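To make the fast path in the parsing rules above concrete, a "simple" question like "tell me the earnings of the major musical platforms" could skip the LLM and map straight to a templated query along these lines (a sketch only; the table and column names are hypothetical, not from the actual Tencent system):

```sql
-- Hypothetical fast-path query: one metric (earnings), one dimension
-- (platform), no multi-table join, so no LLM parsing round-trip is needed.
SELECT platform_name,
       SUM(revenue) AS earnings
FROM dws_platform_revenue
WHERE dt BETWEEN '2023-08-01' AND '2023-08-31'
GROUP BY platform_name
ORDER BY earnings DESC;
```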
We categorize and rate content in the external knowledge base, and do various levels of mapping (full-text mapping and fuzzy mapping) to enable better semantic parsing. + + + +### 4. Plugins + +We used plugins to connect the LLM to more fields of information, and we have different integration methods for different types of plugins: + +- **Embedding local files**: This is especially useful when we need to "teach" the LLM the latest regulatory policies, which are often text files. First, the system vectorizes the local text file, executes semantic searches to find matching or similar terms in the local file, extracts the relevant contents and puts them into the LLM parsing window to generate output. +- **Third-party plugins**: The marketplace is full of third-party plugins that are designed for all kinds of sectors. With them, the LLM is able to deal with wide-ranging topics. Each plugin has its own prompts and calling function. Once the input question hits a prompt, the relevant plugin will be called. + + + +After we were done with the above four optimizations, the SuperSonic framework came into being. + +## The SuperSonic framework + +Now let me walk you through this [framework](https://github.com/tencentmusic/supersonic): + + + +- An analyst inputs a question. +- The Schema Mapper maps the question to an external knowledge base. +- If there are matching fields in the external knowledge base, the question will not be parsed by the LLM. Instead, a metric computation formula will trigger the OLAP engine to start querying. If there is no matching field, the question will enter the LLM. +- Based on the pre-defined rules, the LLM rates the complexity level of the question. If it is a simple query, it will go directly to the OLAP engine; if it is a complicated query, it will be semantically parsed and converted to a DSL statement. +- At the Semantic Layer, the DSL statement will be split based on its query scenario.
For example, if it is a multi-table join query, this layer will generate multiple single-table query SQL statements. +- If the question involves external knowledge, the LLM will call a third-party plugin. + +**Example** + + + +To answer whether a certain song can be performed on variety shows, the system searches the OLAP data warehouse for details about the song, and presents them together with results from the Commercial Use Query third-party plugin. + +## OLAP Architecture + +As for the OLAP part of this framework, after several rounds of architectural evolution, this is what our current OLAP pipeline looks like. + +Raw data is sorted into tags and metrics, which are custom-defined by the analysts. The tags and metrics are under unified management in order to avoid inconsistent definitions. Then, they are combined into various tagsets and metricsets for various queries. + + + +We have drawn two main takeaways for you from our architectural optimization experience. + +**1. Streamline the links** + +Before we adopted Apache Doris, we used to have ClickHouse to accelerate the computation of tags and metrics, and Elasticsearch to process dimensional data. That meant two analytic engines and required us to adapt the query statements to both of them. It was high-maintenance. + +Thus, we replaced ClickHouse with Apache Doris, and utilized the [Elasticsearch Catalog](https://doris.apache.org/docs/dev/lakehouse/multi-catalog/es) functionality to connect Elasticsearch data to Doris. In this way, we made Doris our unified query gateway. + +**2. Split the flat tables** + +In early versions of our OLAP architecture, we used to put data into flat tables, which made things tricky. For one thing, flat tables absorbed all the writing latency from upstreams, and that added up to a considerable loss in data freshness. For another, 50% of data in a flat table was dimensional data, which was rarely updated.
With every new flat table came some bulky dimensional data that consumed lots of storage space. + +Therefore, we split the flat tables into metric tables and dimension tables. As they are updated at different paces, we put them into different data models. + +- **Metric tables**: We arrange metric data in the Aggregate Key model of Apache Doris, which means new data will be merged with the old data by way of SUM, MAX, MIN, etc. +- **Dimension tables**: These tables are in the Unique Key model of Apache Doris, which means new data records will replace the old. This can greatly increase performance in our query scenarios. + +You might ask, does this cause trouble in queries, since most queries require data from both types of tables? Don't worry, we address that with the Rollup feature of Doris. On the basis of the base tables, we can select the dimensions we need to create Rollup views, which will automatically execute `GROUP BY`. This relieves us of the need to define tags for each Rollup view and largely speeds up queries. + +## Other Tricks + +In our experience with Apache Doris, we also found some other functionalities handy, so I list them here for you, too: + +**1. Materialized View** + +A Materialized View is a pre-computed dataset. It is a way to accelerate queries when you frequently need to access data of certain dimensions. In these scenarios, we define derived tags and metrics based on the original ones. For example, we create a derived metric by combining Metric 1, Metric 2, and Metric 3: `sum(m1+m2+m3)`. Then, we can create a Materialized View for it. According to the Doris release schedule, version 2.1 will support multi-table Materialized Views, and we look for [...] + +**2. Flink-Doris-Connector** + +This is for the Exactly-Once guarantee in data ingestion. The Flink-Doris-Connector implements a checkpoint mechanism and two-phase commit, and allows for auto data synchronization from relational databases to Doris. + +**3.
Compaction** + +When the number of aggregation tasks or data volume becomes overwhelming for Flink, there might be huge latency in data compaction. We solve that with Vertical Compaction and Segment Compaction. Vertical Compaction supports loading only part of the columns, so it can reduce storage consumption when compacting flat tables. Segment Compaction can avoid generating too many segments during data writing, and allows compaction to happen while data is being written. + +## What's Next + +With an aim to reduce costs and increase service availability, we plan to test the newly released Storage-Compute Separation and Cross-Cluster Replication of Doris, and we welcome any ideas and input about the SuperSonic framework and the Apache Doris project. \ No newline at end of file diff --git a/blog/Xingyun.md b/blog/Xingyun.md new file mode 100644 index 00000000000..8a9a99782d4 --- /dev/null +++ b/blog/Xingyun.md @@ -0,0 +1,67 @@ +--- +{ + 'title': 'Choosing an OLAP Engine for Financial Risk Management: What to Consider?', + 'summary': "This post provides reference for what you should take into account when choosing an OLAP engine in a financial scenario.", + 'date': '2023-08-17', + 'author': 'Jianbo Liu', + 'tags': ['Best Practice'], +} + +--- + +<!-- +Licensed to the Apache Software Foundation (ASF) under one +or more contributor license agreements. See the NOTICE file +distributed with this work for additional information +regarding copyright ownership. The ASF licenses this file +to you under the Apache License, Version 2.0 (the +"License"); you may not use this file except in compliance +with the License. You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, +software distributed under the License is distributed on an +"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +KIND, either express or implied.
See the License for the +specific language governing permissions and limitations +under the License. +--> + + + +From a data engineer's point of view, financial risk management is a series of data analysis activities on financial data. The financial sector imposes its unique requirements on data engineering. This post explains them with a use case of Apache Doris, and provides reference for what you should take into account when choosing an OLAP engine in a financial scenario. + +## Data Must Be Combined + +The financial data landscape is evolving from standalone to distributed, heterogeneous systems. For example, in this use case scenario, the fintech service provider needs to connect the various transaction processing (TP) systems (MySQL, Oracle, and PostgreSQL) of its partnering banks. Before they adopted an OLAP engine, they were using Kettle to collect data. The ETL tool did not support join queries across different data sources and it could not store data. The ever-enlarging data size [...] + +The financial user's main pursuit is quick queries on large data volumes with as little engineering and maintenance effort as possible, so when it comes to the choice of OLAP engines, SQL on Hadoop should be crossed off the list due to its huge ecosystem and complicated components. One reason that they landed on Apache Doris was the metadata management capability. Apache Doris collects metadata of various data sources via API so it is a fit for the case which requires combination of diffe [...] + +## High Concurrency & High Throughput + +Financial risk control is based on analysis of large amounts of transaction data. Sometimes analysts identify abnormalities by combining data from different large tables, and oftentimes they need to check a certain data record, which comes in the form of concurrent point queries in the data system. Thus, the OLAP engine should be able to handle both high-throughput queries and high-concurrency queries.
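To make the two workload types concrete, here is a sketch of what each might look like in SQL (table and column names are hypothetical, not taken from this case):

```sql
-- High-throughput analytical query: joining large tables to surface
-- abnormal outflows across many accounts.
SELECT t.account_id,
       SUM(t.amount) AS total_outflow
FROM transactions t
JOIN accounts a ON t.account_id = a.account_id
WHERE a.risk_level = 'high'
  AND t.txn_date BETWEEN '2023-08-01' AND '2023-08-17'
GROUP BY t.account_id;

-- High-concurrency point query: checking one specific record by its key,
-- fired many times in parallel by different analysts.
SELECT *
FROM transactions
WHERE txn_id = 'T20230817000123';
```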
+ +To speed up the highly concurrent point queries, you can create [Materialized Views](https://doris.apache.org/docs/dev/query-acceleration/materialized-view/) in Apache Doris. A Materialized View is a pre-computed data set stored in Apache Doris so that the system can respond much faster to queries that are frequently conducted. + +To facilitate queries on large tables, you can leverage the [Colocation Join](https://doris.apache.org/docs/dev/query-acceleration/join-optimization/colocation-join/) mechanism. Colocation Join minimizes data transfer between computation nodes to reduce the overhead brought by data movement. Thus, it can greatly improve query speed when joining large tables. + + + +## Log Analysis + +Log analysis is important in financial data processing. Real-time processing and monitoring of logs can expose risks in time. Apache Doris provides data storage and analytics capabilities to make log analysis easier and more efficient. As logs are bulky, Apache Doris can deliver a high data compression rate to lower storage costs. + +Retrieval is a major part of log analysis, so [Apache Doris 2.0](https://doris.apache.org/docs/dev/releasenotes/release-2.0.0) supports inverted indexes, which accelerate text searching and equivalence/range queries on numeric and datetime values. They allow users to quickly locate the log records they need among the massive data. The JSON storage feature in Apache Doris is reported to reduce storage costs of user activity logs by 70%, and the variety of parse functions provided c [...] + + + +## Easy Maintenance + +In addition to easy deployment, Apache Doris has a few mechanisms that are designed to save maintenance efforts. For example, it ensures high availability of cluster nodes with Systemd, and high availability of data with multi-replica and auto-balancing of replicas, so all the maintenance required is to back up metadata on a regular basis.
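As a side note on the Colocation Join mechanism mentioned earlier: the setup boils down to bucketing both tables on the join key and placing them in the same colocation group, roughly like this (hypothetical schema; a sketch, not the actual tables from this case):

```sql
-- Both tables share the bucket count, distribution column type, and
-- colocation group, so joins on account_id avoid shuffling data between nodes.
CREATE TABLE transactions (
    account_id BIGINT,
    txn_date DATE,
    amount DECIMAL(18, 2)
)
DUPLICATE KEY (account_id, txn_date)
DISTRIBUTED BY HASH(account_id) BUCKETS 32
PROPERTIES ("colocate_with" = "risk_group");

CREATE TABLE accounts (
    account_id BIGINT,
    risk_level VARCHAR(16)
)
UNIQUE KEY (account_id)
DISTRIBUTED BY HASH(account_id) BUCKETS 32
PROPERTIES ("colocate_with" = "risk_group");
```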
Apache Doris also supports [dynamic partitioning of data](https://doris.apache.org/docs/dev/advanced/partition/dynamic-partition/), which means it will [...] + +## Architecture Overview + +This is the overall data architecture in this case. The user utilizes Apache Flume for log data collection, and DataX for data updates. Data from multiple sources is collected into Apache Doris to form a data mart, from which analysts extract information to generate reports and dashboards for reference in risk control and business decisions. As for the stability of the data mart itself, Grafana and Prometheus are used to monitor memory usage, compaction score and query response time of Apache [...] + + \ No newline at end of file diff --git a/static/images/Pingan_1.png b/static/images/Pingan_1.png new file mode 100644 index 00000000000..cf90c589fa3 Binary files /dev/null and b/static/images/Pingan_1.png differ diff --git a/static/images/Pingan_2.png b/static/images/Pingan_2.png new file mode 100644 index 00000000000..0d698da52a6 Binary files /dev/null and b/static/images/Pingan_2.png differ diff --git a/static/images/Pingan_3.png b/static/images/Pingan_3.png new file mode 100644 index 00000000000..34ae4bdbc59 Binary files /dev/null and b/static/images/Pingan_3.png differ diff --git a/static/images/Tencent_LLM_1.png b/static/images/Tencent_LLM_1.png new file mode 100644 index 00000000000..1d1a2ca9cb3 Binary files /dev/null and b/static/images/Tencent_LLM_1.png differ diff --git a/static/images/Tencent_LLM_2.png b/static/images/Tencent_LLM_2.png new file mode 100644 index 00000000000..a6bacf64ad3 Binary files /dev/null and b/static/images/Tencent_LLM_2.png differ diff --git a/static/images/Tencent_LLM_3.png b/static/images/Tencent_LLM_3.png new file mode 100644 index 00000000000..72cd099022b Binary files /dev/null and b/static/images/Tencent_LLM_3.png differ diff --git a/static/images/Tencent_LLM_4.png b/static/images/Tencent_LLM_4.png new file mode 100644 index 00000000000..c8b94c6f3c6
Binary files /dev/null and b/static/images/Tencent_LLM_4.png differ diff --git a/static/images/Tencent_LLM_5.png b/static/images/Tencent_LLM_5.png new file mode 100644 index 00000000000..5620cc99f59 Binary files /dev/null and b/static/images/Tencent_LLM_5.png differ diff --git a/static/images/Tencent_LLM_6.png b/static/images/Tencent_LLM_6.png new file mode 100644 index 00000000000..a57fa8786c1 Binary files /dev/null and b/static/images/Tencent_LLM_6.png differ diff --git a/static/images/Tencent_LLM_7.png b/static/images/Tencent_LLM_7.png new file mode 100644 index 00000000000..4d5c5526eda Binary files /dev/null and b/static/images/Tencent_LLM_7.png differ diff --git a/static/images/Tencent_LLM_8.png b/static/images/Tencent_LLM_8.png new file mode 100644 index 00000000000..209f42f52a9 Binary files /dev/null and b/static/images/Tencent_LLM_8.png differ diff --git a/static/images/Xingyun_1.png b/static/images/Xingyun_1.png new file mode 100644 index 00000000000..d19536a2d78 Binary files /dev/null and b/static/images/Xingyun_1.png differ diff --git a/static/images/Xingyun_2.png b/static/images/Xingyun_2.png new file mode 100644 index 00000000000..8375a36c9b8 Binary files /dev/null and b/static/images/Xingyun_2.png differ diff --git a/static/images/Xingyun_3.png b/static/images/Xingyun_3.png new file mode 100644 index 00000000000..f2a13a7698d Binary files /dev/null and b/static/images/Xingyun_3.png differ --------------------------------------------------------------------- To unsubscribe, e-mail: commits-unsubscr...@doris.apache.org For additional commands, e-mail: commits-h...@doris.apache.org