[kylin] branch document updated: Update kylin diagram and add new blog

yaqian Thu, 21 Oct 2021 21:58:10 -0700

This is an automated email from the ASF dual-hosted git repository.

yaqian pushed a commit to branch document
in repository https://gitbox.apache.org/repos/asf/kylin.git



The following commit(s) were added to refs/heads/document by this push:
     new 9a814e7  Update kylin diagram and add new blog
9a814e7 is described below

commit 9a814e736165f1d4f839d3bf8f92399cb007ad39
Author: yaqian.zhang <[email protected]>
AuthorDate: Thu Oct 21 10:45:33 2021 +0800

    Update kylin diagram and add new blog
---
 ...-Local-Cache-and-Soft-Affinity-Scheduling.cn.md |  63 ++++++++++++++++++
 ...-21-Local-Cache-and-Soft-Affinity-Scheduling.md |  72 +++++++++++++++++++++
 website/assets/images/Kylin_diagram.pptx           | Bin 0 -> 172262 bytes
 website/assets/images/kylin_diagram.png            | Bin 61344 -> 195017 bytes
 .../images/blog/local-cache/Local_cache_stage.png  | Bin 0 -> 103528 bytes
 .../images/blog/local-cache/kylin4_local_cache.png | Bin 0 -> 145137 bytes
 .../local_cache_benchmark_result_ssb.png           | Bin 0 -> 14438 bytes
 .../local_cache_benchmark_result_tpch1.png         | Bin 0 -> 17214 bytes
 .../local_cache_benchmark_result_tpch4.png         | Bin 0 -> 16531 bytes
 9 files changed, 135 insertions(+)

diff --git 
a/website/_posts/blog/2021-10-21-Local-Cache-and-Soft-Affinity-Scheduling.cn.md 
b/website/_posts/blog/2021-10-21-Local-Cache-and-Soft-Affinity-Scheduling.cn.md
new file mode 100644
index 0000000..cae83bb
--- /dev/null
+++ 
b/website/_posts/blog/2021-10-21-Local-Cache-and-Soft-Affinity-Scheduling.cn.md
@@ -0,0 +1,63 @@
+---
+layout: post-blog
+title:  Kylin4 performance optimization in the cloud -- local cache and soft affinity scheduling
+date:   2021-10-21 11:00:00
+author: Yaqian Zhang
+categories: cn_blog
+---
+
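+## 01 Background
+Recently, the Apache Kylin community released Kylin 4.0 with a brand-new 
architecture. The architecture of Kylin 4.0 supports the separation of storage 
and computing, which allows Kylin users to run Kylin 4.0 in a more flexible 
cloud deployment with elastically scalable computing resources. With cloud 
infrastructure, users can store cube data in cheap and reliable object storage 
such as S3. However, under an architecture that separates storage and 
computing, we need to consider that reading data from remote storage over the 
network is still an expensive operation for compute nodes and often degrades 
performance.
+To improve the query performance of Kylin 4.0 when cloud object storage is 
used as storage, we tried introducing a local cache mechanism into the Kylin 
4.0 query engine: when executing queries, frequently used data is cached on 
local disk to reduce the latency of pulling data from remote object storage and 
achieve faster query response. In addition, to avoid wasting disk space by 
caching the same data on a large number of Spark executors at the same time, 
and to let compute nodes read more of the data they need from the local cache, 
we introduced a soft affinity scheduling strategy. Soft affinity means 
establishing a mapping between Spark executors and data files in some way, so 
that in most cases the same data is always read on the same executor, which 
improves the cache hit rate.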
+
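+## 02 Implementation Principle
+
+#### 1. Local Cache
+When Kylin 4.0 executes a query, it mainly goes through the following stages; 
the stages where local cache can be used to improve performance are marked with 
dotted lines:
+
+![](/images/blog/local-cache/Local_cache_stage.png)
+
+- File list cache: caches file status on the Spark driver side. When executing 
a query, the Spark driver needs to read the file list and obtain some file 
information for subsequent scheduling and execution. The file status 
information is cached locally to avoid frequently reading remote file 
directories.
+- Data cache: caches data on the Spark executor side. Users can choose to 
cache data in memory or on disk. If caching in memory, executor memory needs to 
be increased appropriately so that the executor has enough memory for data 
caching; if caching on disk, users need to set a data cache directory, 
preferably on an SSD. In addition, the maximum cache capacity, the number of 
replicas, and so on can all be configured by the user.
+
+Based on the above design, different types of caching are performed on the 
driver side and the executor side of sparder, the query engine of Kylin 4.0. 
The basic architecture is as follows:
+
+![](/images/blog/local-cache/kylin4_local_cache.png)
+
+#### 2. Soft Affinity Scheduling
+When caching data on the executor side, if all data were cached on every 
executor, the cached data would be very large, wasting a great deal of disk 
space and making it likely that cached data is evicted frequently. To maximize 
the cache hit rate of Spark executors, the Spark driver needs to schedule tasks 
for the same file to the same executors as far as possible when resource 
conditions allow. This ensures that the data of a given file is cached on one 
or a few specific executors and can be read from the cache the next time it is 
needed.
+To this end, we compute the target executor list by hashing the file name and 
taking the result modulo the number of executors. The number of executors that 
cache a file is determined by the number of cache replicas configured by the 
user; in general, the larger the number of cache replicas, the higher the 
probability of hitting the cache. When none of the target executors is 
reachable or has resources available for scheduling, the scheduler falls back 
to Spark's random scheduling mechanism. This approach is called the soft 
affinity scheduling strategy. Although it cannot guarantee a 100% cache hit 
rate, it effectively improves the hit rate and avoids the large waste of disk 
space caused by a full cache, with as little performance loss as possible.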
+
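+## 03 Related Configuration
+Based on the above principles, we implemented the basic functionality of local 
cache + soft affinity scheduling in Kylin 4.0 and ran query performance tests 
on the SSB and TPCH datasets.
+Several of the more important configuration items are listed here for 
reference; the configuration actually used is given in the link at the end:
+- Whether to enable soft affinity scheduling: 
kylin.query.spark-conf.spark.kylin.soft-affinity.enabled
+- Whether to enable local cache: 
kylin.query.spark-conf.spark.hadoop.spark.kylin.local-cache.enabled
+- The number of data cache replicas, that is, on how many executors the same 
data file is cached: 
kylin.query.spark-conf.spark.kylin.soft-affinity.replications.num
+- Whether to cache in memory or in a local directory; set to BUFF to cache in 
memory or LOCAL to cache locally: 
kylin.query.spark-conf.spark.hadoop.alluxio.user.client.cache.store.type
+- Maximum cache capacity: 
kylin.query.spark-conf.spark.hadoop.alluxio.user.client.cache.size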
+
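+## 04 Performance Comparison
+We ran performance tests for three scenarios in an AWS EMR environment. With 
scale factor = 10, we ran a single-concurrency query test on the SSB dataset, 
and a single-concurrency query test and a 4-concurrency query test on the TPCH 
dataset. Both the experimental group and the control group used S3 as storage; 
local cache and soft affinity scheduling were enabled in the experimental group 
and disabled in the control group. In addition, we compared the experimental 
group's results with the results obtained with HDFS as storage in the same 
environment, so that users can directly see how much local cache + soft 
affinity scheduling improves Kylin 4.0 deployed in the cloud with object 
storage.
+
+![](/images/blog/local-cache/local_cache_benchmark_result_ssb.png)
+
+![](/images/blog/local-cache/local_cache_benchmark_result_tpch1.png)
+
+![](/images/blog/local-cache/local_cache_benchmark_result_tpch4.png)
+
+The results above show that:
+1. In the single-concurrency scenario on the SSB 10 dataset, with S3 as 
storage, enabling local cache and soft affinity scheduling yields roughly a 3x 
performance improvement, matching the performance with HDFS as storage and even 
exceeding it by about 5%.
+2. On the TPCH 10 dataset, with S3 as storage, enabling local cache and soft 
affinity scheduling yields a large performance improvement in almost all 
queries, for both single-concurrency and multi-concurrency queries.
+
+However, for Q21 in the 4-concurrency test on the TPCH 10 dataset, we observed 
that results with local cache and soft affinity scheduling enabled were 
actually worse than with S3 alone as storage. A possible explanation is that, 
for some reason, the data was not read through the cache; the underlying cause 
was not analyzed further in this test, and we will improve this step by step in 
subsequent optimization work. Because TPCH queries are complex and the SQL 
types vary, some SQL statements still perform slightly worse than with HDFS as 
storage, but overall the results are already close to those of HDFS.
+The results of this performance test are a preliminary verification of the 
improvement brought by local cache + soft affinity scheduling. Overall, local 
cache + soft affinity scheduling delivers a clear performance improvement for 
both simple and complex queries, but there is some performance loss in 
high-concurrency query scenarios.
+Users who use cloud object storage as the storage for Kylin 4.0 can get a very 
good performance experience with local cache + soft affinity scheduling 
enabled, which provides a performance guarantee for running Kylin 4.0 in the 
cloud with a compute-storage separation architecture.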
+
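+## 05 Code Implementation
+Because the current code implementation is still at a fairly basic stage and 
many details remain to be improved, such as implementing consistent hashing and 
handling the existing cache when the number of executors changes, the author 
has not yet submitted a PR to the community code base. Developers who want a 
preview can view the source code through the link below:
+[Kylin4.0 local cache + soft affinity scheduling code 
implementation](https://github.com/zzcclp/kylin/commit/4e75b7fa4059dd2eaed24061fda7797fecaf2e35)
+
+## 06 Related Links
+The performance test result data and the specific configuration can be found 
through the link:
+[Kylin4.0 local cache + soft affinity scheduling 
test](https://github.com/Kyligence/kylin-tpch/issues/9)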
\ No newline at end of file
diff --git 
a/website/_posts/blog/2021-10-21-Local-Cache-and-Soft-Affinity-Scheduling.md 
b/website/_posts/blog/2021-10-21-Local-Cache-and-Soft-Affinity-Scheduling.md
new file mode 100644
index 0000000..eb5e62e
--- /dev/null
+++ b/website/_posts/blog/2021-10-21-Local-Cache-and-Soft-Affinity-Scheduling.md
@@ -0,0 +1,72 @@
+---
+layout: post-blog
+title:  Performance optimization of Kylin 4.0 in cloud -- local cache and soft 
affinity scheduling
+date:   2021-10-21 11:00:00
+author: Yaqian Zhang
+categories: blog
+---
+
+## 01 Background Introduction
+Recently, the Apache Kylin community released Kylin 4.0.0 with a new 
architecture. The architecture of Kylin 4.0 supports the separation of storage 
and computing, which enables Kylin users to run Kylin 4.0 in a more flexible 
cloud deployment mode with elastically scalable computing resources. With the 
cloud infrastructure, users can choose cheap and reliable object storage, such 
as S3, to store cube data. However, in the architecture of separation of 
storage and computing, we need to consider  [...]
+In order to improve the query performance of Kylin 4.0 when using cloud object 
storage as the storage, we tried introducing a local cache mechanism into the 
Kylin 4.0 query engine. When executing a query, frequently used data is 
cached on the local disk to reduce the latency of pulling data from the 
remote object storage and achieve faster query response. In addition, in order 
to avoid wasting disk space when the same data is cached on a large number of 
spark executors at the [...]
+
+## 02 Implementation Principle
+
+#### 1. Local Cache
+
+When Kylin 4.0 executes a query, it mainly goes through the following stages, 
in which the stages where local cache can be used to improve performance are 
marked with dotted lines:
+
+![](/images/blog/local-cache/Local_cache_stage.png)
+
+- File list cache: caches the file status on the Spark driver side. When 
executing a query, the Spark driver needs to read the file list and obtain 
some file information for subsequent scheduling and execution. The file 
status information is cached locally to avoid frequently reading remote 
file directories.
+- Data cache: caches the data on the Spark executor side. You can cache the 
data in memory or on disk. If you cache in memory, you need to 
appropriately increase the executor memory to ensure that the executor has 
enough memory for the data cache; if you cache on disk, you need to set the 
data cache directory, preferably a directory on an SSD.
+
+Based on the above design, different types of caching are performed on the 
driver side and the executor side of the query engine of Kylin 4.0. The basic 
architecture is as follows:
+
+![](/images/blog/local-cache/kylin4_local_cache.png)
+
+#### 2. Soft Affinity Scheduling
+
+When caching data on the executor side, if all data is cached on all 
executors, the cached data becomes very large and wastes a great deal of disk 
space, and cached data is likely to be evicted frequently. In order to 
maximize the cache hit rate of the Spark executors, the Spark driver needs to 
schedule the tasks of the same file to the same executor as far as possible 
when the resource conditions are met, so as to ensure that the data of the same 
file can be cached on a specif [...]
+To this end, we compute the target executor list by hashing the file name and 
taking the result modulo the number of executors. The number of 
executors that cache a file is determined by the number of data cache 
replications configured by the user. Generally, the larger the number of cache 
replications, the higher the probability of hitting the cache. When the target 
executors are unreachable or have no resources for scheduling, the scheduler 
will fall back 
to the random scheduling m [...]
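
The placement rule described above can be sketched roughly as follows. This is a minimal illustration, not the code from the linked commit: the function names, the MD5-based hash, the choice of consecutive executors for replicas, and the `has_free_slot` callback are all assumptions made for the example.

```python
import hashlib
import random


def target_executors(file_name, executors, replications):
    """Return the candidate executors for a file (hash of file name modulo executor count)."""
    if not executors:
        return []
    # Use a stable hash; Python's built-in hash() is salted per process.
    digest = hashlib.md5(file_name.encode("utf-8")).digest()
    start = int.from_bytes(digest[:8], "big") % len(executors)
    count = min(replications, len(executors))
    # Assumption: replicas are placed on consecutive executors after the hashed slot.
    return [executors[(start + i) % len(executors)] for i in range(count)]


def schedule(file_name, executors, replications, has_free_slot, rng):
    """Prefer a target executor with free resources; otherwise pick at random."""
    for executor in target_executors(file_name, executors, replications):
        if has_free_slot(executor):
            return executor
    # "Soft" affinity: no target is usable, so fall back to a random choice.
    return rng.choice(executors)


# Hypothetical executor IDs and file name, for illustration only.
executors = ["exec-1", "exec-2", "exec-3", "exec-4"]
file_name = "cuboid_1/part-00000.parquet"
targets = target_executors(file_name, executors, replications=2)
chosen = schedule(file_name, executors, 2, lambda e: True, random.Random(0))
print(targets, chosen)
```

Because the same file name always hashes to the same slot, repeated reads of a file tend to land on executors that already hold it in their cache, as long as those executors have free resources.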
+
+## 03 Related Configuration
+
+According to the above principles, we implemented the basic function of local 
cache + soft affinity scheduling in Kylin 4.0, and tested the query performance 
on the SSB dataset and the TPCH dataset.
+Several important configuration items are listed here for reference. 
The configuration actually used is given in the link at the end:
+
+- Enable soft affinity scheduling: 
kylin.query.spark-conf.spark.kylin.soft-affinity.enabled
+- Enable local cache: 
kylin.query.spark-conf.spark.hadoop.spark.kylin.local-cache.enabled
+- The number of data cache replications, that is, how many executors cache the 
same data file: kylin.query.spark-conf.spark.kylin.soft-affinity.replications.num
+- Cache to memory or to a local directory (set to BUFF to cache to memory, or 
to LOCAL to cache to a local directory): 
kylin.query.spark-conf.spark.hadoop.alluxio.user.client.cache.store.type
+- Maximum cache capacity: 
kylin.query.spark-conf.spark.hadoop.alluxio.user.client.cache.size
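
As a rough illustration, the items above might be set in `kylin.properties` as shown below. The keys are the ones listed in this post; every value is an assumption chosen for the example (in particular the replication count and the size format), and the values actually used in the benchmark are in the linked issue.

```properties
# Illustrative values only; adjust for your own cluster.
kylin.query.spark-conf.spark.kylin.soft-affinity.enabled=true
kylin.query.spark-conf.spark.hadoop.spark.kylin.local-cache.enabled=true
# Assumed value: cache each data file on 2 executors.
kylin.query.spark-conf.spark.kylin.soft-affinity.replications.num=2
# BUFF caches in memory, LOCAL caches in a local directory.
kylin.query.spark-conf.spark.hadoop.alluxio.user.client.cache.store.type=LOCAL
# Assumed size and format.
kylin.query.spark-conf.spark.hadoop.alluxio.user.client.cache.size=10GB
```

With the store type set to LOCAL, a data cache directory also has to be configured, as noted earlier; that key is not listed in this post, so it is omitted here.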
+
+## 04 Performance Benchmark
+
+We conducted performance tests in three scenarios in an AWS EMR environment. 
With scale factor = 10, we ran a single-concurrency query test on the SSB 
dataset, and a single-concurrency query test and a 4-concurrency query test on 
the TPCH dataset. S3 was configured as storage in both the experimental group 
and the control group. Local cache and soft affinity scheduling were enabled in 
the experimental group, but not in the control group. In addition, we also compared 
the results of the experimental group wit [...]
+
+![](/images/blog/local-cache/local_cache_benchmark_result_ssb.png)
+
+![](/images/blog/local-cache/local_cache_benchmark_result_tpch1.png)
+
+![](/images/blog/local-cache/local_cache_benchmark_result_tpch4.png)
+
+As can be seen from the above results:
+
+1. In the single-concurrency scenario on the SSB dataset, when S3 is used as 
storage, turning on local cache and soft affinity scheduling yields roughly a 
threefold performance improvement, which matches the performance with HDFS as 
storage or even slightly exceeds it.
+2. On the TPCH dataset, when S3 is used as storage, enabling local cache and 
soft affinity scheduling greatly improves the performance of almost all 
queries, for both single-concurrency and multi-concurrency queries.
+
+However, for Q21 in the 4-concurrency test on the TPCH dataset, we observed 
that performance with local cache and soft affinity scheduling enabled was 
worse than with S3 alone as storage. A possible explanation is that, for some 
reason, the data was not read through the cache. The underlying cause was not 
analyzed further in this test, and we will improve this step by step in 
subsequent optimization work. Moreover, because the queries of TPCH are 
complex and the SQL types  [...]
+The results of this performance test are a preliminary verification of the 
performance improvement brought by local cache + soft affinity scheduling. On 
the whole, local cache + soft affinity scheduling achieves a significant 
performance improvement for both simple and complex queries, but there 
is some performance loss in high-concurrency query scenarios.
+Users who use cloud object storage as Kylin 4.0 storage can get a good 
performance experience when local cache + soft affinity scheduling is enabled, 
which provides a performance guarantee for running Kylin 4.0 in the cloud with 
an architecture that separates computing and storage.
+
+## 05 Code Implementation
+
+Since the current code implementation is still at a basic stage, there are 
still many details to be improved, such as implementing consistent hashing and 
handling the existing cache when the number of executors changes, so the 
author has not yet submitted a PR to the community code base. Developers who 
want a preview can view the source code through the following link:
+
+[The code implementation of local cache and soft affinity 
scheduling](https://github.com/zzcclp/kylin/commit/4e75b7fa4059dd2eaed24061fda7797fecaf2e35)
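
To see why the executor count matters for the hash-modulo placement, the following sketch counts how many files map to a different slot when one executor is added. It is an illustration only: the MD5-based hash and the file names are assumptions made for the example and are not taken from the linked commit.

```python
import hashlib


def slot(file_name, num_executors):
    """Map a file name to an executor slot with a stable hash modulo the executor count."""
    digest = hashlib.md5(file_name.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_executors


def moved_fraction(file_names, old_num, new_num):
    """Fraction of files whose slot changes when the executor count changes."""
    moved = sum(1 for name in file_names if slot(name, old_num) != slot(name, new_num))
    return moved / len(file_names)


# Hypothetical file names, for illustration only.
files = [f"part-{i:05d}.parquet" for i in range(1000)]
fraction = moved_fraction(files, 10, 11)
print(f"{fraction:.0%} of files map to a different slot after adding one executor")
```

With plain modulo placement most files change slot when the executor count changes, so their cached copies are no longer on the preferred executor. Consistent hashing is designed to keep that fraction small.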
+
+## 06 Related Link
+
+You can view the performance test result data and specific configuration 
through the link:
+[The benchmark of Kylin4.0 with local cache and soft affinity 
scheduling](https://github.com/Kyligence/kylin-tpch/issues/9)
diff --git a/website/assets/images/Kylin_diagram.pptx 
b/website/assets/images/Kylin_diagram.pptx
new file mode 100644
index 0000000..b9aad8d
Binary files /dev/null and b/website/assets/images/Kylin_diagram.pptx differ
diff --git a/website/assets/images/kylin_diagram.png 
b/website/assets/images/kylin_diagram.png
index f484778..07e4f05 100644
Binary files a/website/assets/images/kylin_diagram.png and 
b/website/assets/images/kylin_diagram.png differ
diff --git a/website/images/blog/local-cache/Local_cache_stage.png 
b/website/images/blog/local-cache/Local_cache_stage.png
new file mode 100644
index 0000000..b895540
Binary files /dev/null and 
b/website/images/blog/local-cache/Local_cache_stage.png differ
diff --git a/website/images/blog/local-cache/kylin4_local_cache.png 
b/website/images/blog/local-cache/kylin4_local_cache.png
new file mode 100644
index 0000000..3dc7fe2
Binary files /dev/null and 
b/website/images/blog/local-cache/kylin4_local_cache.png differ
diff --git 
a/website/images/blog/local-cache/local_cache_benchmark_result_ssb.png 
b/website/images/blog/local-cache/local_cache_benchmark_result_ssb.png
new file mode 100644
index 0000000..4bb861b
Binary files /dev/null and 
b/website/images/blog/local-cache/local_cache_benchmark_result_ssb.png differ
diff --git 
a/website/images/blog/local-cache/local_cache_benchmark_result_tpch1.png 
b/website/images/blog/local-cache/local_cache_benchmark_result_tpch1.png
new file mode 100644
index 0000000..2c71d5c
Binary files /dev/null and 
b/website/images/blog/local-cache/local_cache_benchmark_result_tpch1.png differ
diff --git 
a/website/images/blog/local-cache/local_cache_benchmark_result_tpch4.png 
b/website/images/blog/local-cache/local_cache_benchmark_result_tpch4.png
new file mode 100644
index 0000000..715a287
Binary files /dev/null and 
b/website/images/blog/local-cache/local_cache_benchmark_result_tpch4.png differ
