Modified: kylin/site/feed.xml
URL: 
http://svn.apache.org/viewvc/kylin/site/feed.xml?rev=1891980&r1=1891979&r2=1891980&view=diff
==============================================================================
--- kylin/site/feed.xml (original)
+++ kylin/site/feed.xml Tue Aug  3 10:55:02 2021
@@ -19,11 +19,371 @@
     <description>Apache Kylin Home</description>
     <link>http://kylin.apache.org/</link>
     <atom:link href="http://kylin.apache.org/feed.xml"; rel="self" 
type="application/rss+xml"/>
-    <pubDate>Tue, 03 Aug 2021 00:39:20 -0700</pubDate>
-    <lastBuildDate>Tue, 03 Aug 2021 00:39:20 -0700</lastBuildDate>
+    <pubDate>Tue, 03 Aug 2021 03:26:40 -0700</pubDate>
+    <lastBuildDate>Tue, 03 Aug 2021 03:26:40 -0700</lastBuildDate>
     <generator>Jekyll v2.5.3</generator>
     
       <item>
+        <title>Practice and Optimization of Kylin at Meituan's In-Store Dining Business</title>
+        <description>&lt;p&gt;Since 2016, Meituan's in-store dining technical team has used Apache Kylin as its OLAP engine. As the business grew rapidly, however, efficiency problems emerged in both cube building and querying. The technical team started from the underlying principles, broke the build process down step by step, and laid out an implementation path from point to surface. This article summarizes that experience in the hope of helping more technical teams improve their data output efficiency.&lt;/p&gt;
+
+&lt;h2 id=&quot;section&quot;&gt;Background&lt;/h2&gt;
+
+&lt;p&gt;The sales business is characterized by large scale, many domains, and dense requirements. As the main carrier of sales data support, Meituan's in-store dining Qingtian sales system (&lt;strong&gt;"Qingtian" for short below&lt;/strong&gt;) not only covers a wide scope but also faces very complex technical scenarios (&lt;strong&gt;multi-level organizational data display and authorization; more than 1/3 of the metrics require precise deduplication; peak queries have reached the tens of thousands&lt;/strong&gt;). Against this business background, building a stable and efficient OLAP engine to help analysts make fast decisions has become Qingtian's core goal.&lt;/p&gt;
+
+&lt;p&gt;Apache Kylin is an open-source OLAP engine built on the Hadoop big data platform. It adopts multidimensional cube precomputation, trading space for time to bring query latency down to the sub-second level, which greatly improves the efficiency of data analysis and provides convenient, flexible query capabilities. Based on the fit between the technology and the business, Qingtian adopted Kylin as its OLAP engine in 2016, and over the following years this system efficiently supported our data analysis work.&lt;/p&gt;
+
+&lt;p&gt;In 2020, Meituan's in-store dining business grew quickly and data metrics multiplied. The Kylin-based system ran into serious efficiency problems in both building and querying, which affected data analysis and decision-making and became a major obstacle to improving the user experience. Over roughly half a year, the technical team carried out a series of optimization iterations on Kylin, including dimension pruning, model design, and resource adaptation, raising the SLA of sales performance data from 90% to 99.99%. From this practice we distilled a technical approach covering "principle interpretation", "process decomposition", and "implementation path". We hope this experience and these lessons can help more technical teams improve the efficiency of data output and business decisions.&lt;/p&gt;
+
+&lt;h2 id=&quot;section-1&quot;&gt;Problems and Goals&lt;/h2&gt;
+
+&lt;p&gt;As the bridge between the platform and merchants, sales covers two business models, in-store sales and phone visits, managed level by level along territory and HR organizational structures; every analysis must be viewable along both organizational hierarchies. Under requirements such as consistent metric definitions and timely data output, we designed the data architecture around Kylin's precomputation idea. As shown below:&lt;/p&gt;
+
+&lt;p&gt;&lt;img src=&quot;/images/blog/meituan_cn/chart-01.png&quot; 
alt=&quot;&quot; /&gt;&lt;/p&gt;
+
+&lt;p&gt;Kylin computes the number of dimension combinations as 2^N (&lt;strong&gt;N is the number of dimensions&lt;/strong&gt;), and dimension pruning is officially provided to reduce the number of combinations. Due to the particularities of the dining business, however, the number of un-prunable combinations in a single task was still over 1,000. With requirement iterations and HR/territory reorganizations, all historical data has to be backfilled, which consumes huge resources and extremely long build times. The business-partitioned architecture, while decoupling data output well and keeping metric definitions consistent, put great pressure on Kylin builds, leading to heavy resource usage and long elapsed times. Based on this situation, we summarized the problems of Kylin's MOLAP mode as follows:&lt;/p&gt;
+
+&lt;ul&gt;
+  &lt;li&gt;&lt;strong&gt;Efficiency problems are hard to pinpoint (principles)&lt;/strong&gt;: the build process has many strongly correlated steps; the root cause is hard to find from surface symptoms alone, so problems could not be solved effectively.&lt;/li&gt;
+  &lt;li&gt;&lt;strong&gt;Build engine not upgraded (build process)&lt;/strong&gt;: historical tasks still used MapReduce as the build engine instead of switching to the more efficient Spark.&lt;/li&gt;
+  &lt;li&gt;&lt;strong&gt;Unreasonable resource usage (build process)&lt;/strong&gt;: resource waste and resource waiting; the platform's default dynamic resource adaptation let small tasks request large amounts of resources, and unreasonable data splitting produced many small files, causing resource waste and long task queues.&lt;/li&gt;
+  &lt;li&gt;&lt;strong&gt;Core tasks take too long (implementation path)&lt;/strong&gt;: the source tables for Qingtian's sales transaction performance metrics are large, with many dimension combinations and a high expansion rate, so the daily build took more than two hours.&lt;/li&gt;
+  &lt;li&gt;&lt;strong&gt;SLA quality below target (implementation path)&lt;/strong&gt;: the overall SLA attainment rate did not reach the expected target.&lt;/li&gt;
+&lt;/ul&gt;
+
+&lt;p&gt;After analyzing the problems carefully and setting the overall efficiency goal, we classified Kylin's build process and extracted the core stages where efficiency could be improved. Through "principle interpretation", "layer-by-layer decomposition", and "from point to surface" we achieved the goal of reducing both resources and time. The quantified targets are shown below:&lt;/p&gt;
+
+&lt;p&gt;&lt;img src=&quot;/images/blog/meituan_cn/chart-02.png&quot; 
alt=&quot;&quot; /&gt;&lt;/p&gt;
+
+&lt;h2 id=&quot;section-2&quot;&gt;Prerequisite for Optimization: Interpreting the Principles&lt;/h2&gt;
+
+&lt;p&gt;To address the difficulty of locating and attributing efficiency problems, we studied Kylin's build principles, covering the precomputation idea and the by-layer cubing algorithm.&lt;/p&gt;
+
+&lt;h3 id=&quot;section-3&quot;&gt;Precomputation&lt;/h3&gt;
+
+&lt;p&gt;All possible dimension combinations are derived from the dimensions, the metrics that multidimensional analysis may use are precomputed, and the results are saved as a Cube. Suppose we have 4 dimensions: each node of this Cube (&lt;strong&gt;called a Cuboid&lt;/strong&gt;) is a different combination of the 4 dimensions, and each combination defines a set of analysis dimensions (&lt;strong&gt;like group by&lt;/strong&gt;). The aggregated metric results are stored on each Cuboid. At query time, we locate the matching Cuboid from the SQL, read the metric values, and return them. As shown below:&lt;/p&gt;
+
+&lt;p&gt;&lt;img src=&quot;/images/blog/meituan_cn/chart-03.png&quot; 
alt=&quot;&quot; /&gt;&lt;/p&gt;
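As a minimal illustration of the 2^N precomputation space (the dimension names here are hypothetical, not from the article), enumerating every subset of a 4-dimension cube yields 16 cuboids:

```python
from itertools import combinations

# Hypothetical dimensions of a sales cube.
dims = ["city", "channel", "category", "date"]

# Every subset of the dimensions is one cuboid (one "group by" combination);
# the empty subset is the 0-dimensional cuboid (the grand total).
cuboids = [combo for r in range(len(dims) + 1)
           for combo in combinations(dims, r)]

print(len(cuboids))  # 2^4 = 16 cuboids
print(cuboids[0])    # () -> the 0-dimensional cuboid
```

Each tuple corresponds to one precomputed aggregation a query can be routed to.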
+
+&lt;h3 id=&quot;by-layer&quot;&gt;The By-Layer Cubing Algorithm&lt;/h3&gt;
+
+&lt;p&gt;An N-dimensional Cube is composed of 1 N-dimensional sub-cube, N (N-1)-dimensional sub-cubes, N*(N-1)/2 (N-2)-dimensional sub-cubes, ..., N 1-dimensional sub-cubes, and 1 0-dimensional sub-cube, for a total of 2^N sub-cubes. In the by-layer algorithm, computation proceeds layer by layer with the number of dimensions decreasing; each layer's computation (except the first layer, which aggregates from the raw data) is based on the results of the layer above it.&lt;/p&gt;
+
+&lt;p&gt;For example, the result of group by [A,B] can be derived from the result of group by [A,B,C] by dropping C and re-aggregating, which avoids repeated computation. When the 0-dimensional Cuboid has been computed, the whole Cube is complete. As shown below:&lt;/p&gt;
+
+&lt;p&gt;&lt;img src=&quot;/images/blog/meituan_cn/chart-04.png&quot; 
alt=&quot;&quot; /&gt;&lt;/p&gt;
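The reuse step above can be sketched as follows (toy data with a hypothetical additive SUM measure; the trick works because the child layer's sums can be rebuilt from the parent layer's sums instead of the raw rows):

```python
from collections import defaultdict

# Result of "group by A, B, C" (the parent layer): (a, b, c) -> SUM of the measure.
layer_abc = {
    ("a1", "b1", "c1"): 10,
    ("a1", "b1", "c2"): 5,
    ("a1", "b2", "c1"): 7,
}

# Child layer "group by A, B": drop C and re-aggregate the parent's partial sums.
layer_ab = defaultdict(int)
for (a, b, _c), value in layer_abc.items():
    layer_ab[(a, b)] += value

print(dict(layer_ab))  # {('a1', 'b1'): 15, ('a1', 'b2'): 7}
```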
+
+&lt;h2 id=&quot;section-4&quot;&gt;Process Analysis: Breaking It Down Layer by Layer&lt;/h2&gt;
+
+&lt;p&gt;After understanding Kylin's underlying principles, we focused optimization on the five stages of "engine selection", "data reading", "dictionary building", "layered building", and "file conversion". After refining each stage's problems, ideas, and targets, we finally managed to reduce elapsed time while also reducing computing resources. Details are shown in the table below:&lt;/p&gt;
+
+&lt;p&gt;&lt;img src=&quot;/images/blog/meituan_cn/chart-05.png&quot; 
alt=&quot;&quot; /&gt;&lt;/p&gt;
+
+&lt;h3 id=&quot;section-5&quot;&gt;Build Engine Selection&lt;/h3&gt;
+
+&lt;p&gt;At present, we have gradually switched the build engine to Spark. Qingtian adopted Kylin as its OLAP engine as early as 2016; its historical tasks were never migrated, and only MapReduce parameters had been tuned. In fact, the Kylin community enabled Spark as a build engine back in 2017 (announced on the official site), with build efficiency 1 to 3 times that of MapReduce, and the engine can be switched in the Cube design, as shown below:&lt;/p&gt;
+
+&lt;p&gt;&lt;img src=&quot;/images/blog/meituan_cn/chart-06.png&quot; 
alt=&quot;&quot; /&gt;&lt;/p&gt;
+
+&lt;h3 id=&quot;section-6&quot;&gt;Reading the Source Data&lt;/h3&gt;
+
+&lt;p&gt;Kylin reads the source data in Hive as external tables, and the table's data files (&lt;strong&gt;stored on HDFS&lt;/strong&gt;) serve as the input of the next subtask; small-file problems can occur in this step. At present, the file counts of Kylin's upstream wide tables are distributed reasonably, so there is no need to merge files upstream; forcing a merge would instead increase the processing time of the upstream source tables.&lt;/p&gt;
+
+&lt;p&gt;When a project needs to backfill historical data or add dimension combinations, all the data has to be rebuilt, usually month by month. Loading too many partitions triggers the small-file problem and makes this step slow. Overriding the configuration at the Kylin level to merge small files reduces the number of Maps and effectively improves read efficiency.&lt;/p&gt;
+
+&lt;p&gt;&lt;strong&gt;Merging small files in the source table&lt;/strong&gt;: merge the small files in the Hive source table to control the number of Tasks running in parallel per Job. The adjusted parameters are shown in the table below:&lt;/p&gt;
+
+&lt;p&gt;&lt;img src=&quot;/images/blog/meituan_cn/chart-07.png&quot; 
alt=&quot;&quot; /&gt;&lt;/p&gt;
+
+&lt;p&gt;&lt;strong&gt;Kylin-level parameter override&lt;/strong&gt;: set the file size read by each Map. The adjusted parameters are shown in the table below:&lt;/p&gt;
+
+&lt;p&gt;&lt;img src=&quot;/images/blog/meituan_cn/chart-08.png&quot; 
alt=&quot;&quot; /&gt;&lt;/p&gt;
+
+&lt;h3 id=&quot;section-7&quot;&gt;Building Dictionaries&lt;/h3&gt;
+
+&lt;p&gt;Kylin builds dimension dictionaries from the dimension values that appear in the Hive table, mapping each value to a code and saving the statistics, which saves HBase storage. Each dimension combination is called a Cuboid; in theory, an N-dimensional Cube has 2^N dimension combinations.&lt;/p&gt;
+
+&lt;h4 id=&quot;section-8&quot;&gt;Checking the Number of Combinations&lt;/h4&gt;
+
+&lt;p&gt;After pruning, the number of dimension combinations actually computed is hard to calculate directly. It can be checked in the execution log (&lt;strong&gt;the screenshot shows the log of the last Reduce in the "extract fact table distinct columns" step&lt;/strong&gt;). As shown below:&lt;/p&gt;
+
+&lt;p&gt;&lt;img src=&quot;/images/blog/meituan_cn/chart-09.png&quot; 
alt=&quot;&quot; /&gt;&lt;/p&gt;
+
+&lt;h4 id=&quot;section-9&quot;&gt;Global Dictionary Dependencies&lt;/h4&gt;
+
+&lt;p&gt;Qingtian has many business scenarios that need precise deduplication. When there are multiple global dictionary columns, column dependencies can be configured; for example, when the metrics "number of stores" and "number of online stores" coexist, a column dependency can be set to reduce the computation over ultra-high-cardinality dimensions. As shown below:&lt;/p&gt;
+
+&lt;p&gt;&lt;img src=&quot;/images/blog/meituan_cn/chart-10.png&quot; 
alt=&quot;&quot; /&gt;&lt;/p&gt;
+
+&lt;h4 id=&quot;section-10&quot;&gt;Computing Resource Configuration&lt;/h4&gt;
+
+&lt;p&gt;When a cube contains multiple precise count-distinct metrics, computing resources can be increased appropriately to improve build efficiency for high-cardinality dimensions. The parameter settings are shown in the table below:&lt;/p&gt;
+
+&lt;p&gt;&lt;img src=&quot;/images/blog/meituan_cn/chart-11.png&quot; 
alt=&quot;&quot; /&gt;&lt;/p&gt;
+
+&lt;h3 id=&quot;section-11&quot;&gt;Layered Building&lt;/h3&gt;
+
+&lt;p&gt;This step is the core of a Kylin build. After switching to the Spark engine, only the by-layer algorithm is used by default; the engine no longer chooses automatically between the by-layer and fast algorithms. When implementing the by-layer algorithm, Spark computes from the bottom Cuboid layer upward, layer by layer, until the topmost Cuboid is computed (equivalent to executing a query without group by). Each layer's result data is cached in memory, so re-reading the data is skipped and the upper layer's cached data is used directly, which greatly improves execution efficiency. The details of the Spark execution process are as follows.&lt;/p&gt;
+
+&lt;h4 id=&quot;job&quot;&gt;Jobs&lt;/h4&gt;
+
+&lt;p&gt;The number of Jobs equals the number of layers in the by-layer algorithm tree: Spark treats the output of each layer's result data as one Job. As shown below:&lt;/p&gt;
+
+&lt;p&gt;&lt;img src=&quot;/images/blog/meituan_cn/chart-12.png&quot; 
alt=&quot;&quot; /&gt;&lt;/p&gt;
+
+&lt;h4 id=&quot;stage&quot;&gt;Stages&lt;/h4&gt;
+
+&lt;p&gt;Each Job has two Stages: reading the upper layer's cached data, and caching the results computed at the current layer. As shown below:&lt;/p&gt;
+
+&lt;p&gt;&lt;img src=&quot;/images/blog/meituan_cn/chart-13.png&quot; 
alt=&quot;&quot; /&gt;&lt;/p&gt;
+
+&lt;h4 id=&quot;task&quot;&gt;Task Parallelism Settings&lt;/h4&gt;
+
+&lt;p&gt;Kylin computes task parallelism from the estimated size of the Cuboid data built at each layer (&lt;strong&gt;dimension pruning can reduce the number of combinations, shrink the Cuboid data, and speed up the build; this is not covered in detail here&lt;/strong&gt;) and the value of the data-split parameter. The formula is as follows:&lt;/p&gt;
+
+&lt;ul&gt;
+  &lt;li&gt;
+    &lt;p&gt;&lt;strong&gt;Task count formula&lt;/strong&gt;: Min(MapSize / cut-mb, MaxPartition); Max(MapSize / cut-mb, MinPartition)&lt;/p&gt;
+  &lt;/li&gt;
+  &lt;li&gt;
+    &lt;ul&gt;
+      &lt;li&gt;&lt;strong&gt;MapSize&lt;/strong&gt;: the size of the Cuboid combinations built at each layer, i.e. Kylin's estimate of each layer's dimension-combination size.&lt;/li&gt;
+      &lt;li&gt;&lt;strong&gt;cut-mb&lt;/strong&gt;: the data split size, which controls the number of parallel Tasks; set via the kylin.engine.spark.rdd-partition-cut-mb parameter.&lt;/li&gt;
+      &lt;li&gt;&lt;strong&gt;MaxPartition&lt;/strong&gt;: the maximum number of partitions; set via the kylin.engine.spark.max-partition parameter.&lt;/li&gt;
+      &lt;li&gt;&lt;strong&gt;MinPartition&lt;/strong&gt;: the minimum number of partitions; set via the kylin.engine.spark.min-partition parameter.&lt;/li&gt;
+    &lt;/ul&gt;
+  &lt;/li&gt;
+  &lt;li&gt;&lt;strong&gt;Output file count&lt;/strong&gt;: each Task compresses its result data and writes it to HDFS as input to the file-conversion step; the output file count is the total across all Tasks.&lt;/li&gt;
+&lt;/ul&gt;
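The Task-count formula above amounts to clamping MapSize / cut-mb between the min- and max-partition settings. A small sketch (the parameter values are illustrative examples, not recommendations):

```python
def spark_partitions(map_size_mb: float, cut_mb: float,
                     min_partition: int, max_partition: int) -> int:
    # Min(MapSize/cut-mb, MaxPartition) combined with
    # Max(MapSize/cut-mb, MinPartition) is a clamp to [min, max].
    raw = int(map_size_mb / cut_mb)
    return max(min_partition, min(raw, max_partition))

# A layer estimated at 2048 MB with a 10 MB split
# (kylin.engine.spark.rdd-partition-cut-mb=10) gives 204 tasks:
print(spark_partitions(2048, 10, min_partition=1, max_partition=5000))  # 204
```

Raising cut-mb directly lowers the parallelism (and hence the number of output files) for a layer of a given estimated size.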
+
+&lt;h4 id=&quot;section-12&quot;&gt;Resource Request Calculation&lt;/h4&gt;
+
+&lt;p&gt;The platform requests computing resources dynamically by default. A single Executor's capacity includes 1 logical CPU ("CPU" below), 6 GB of on-heap memory, and 1 GB of off-heap memory. The formulas are as follows:&lt;/p&gt;
+
+&lt;ul&gt;
+  &lt;li&gt;&lt;strong&gt;CPU&lt;/strong&gt; = kylin.engine.spark-conf.spark.executor.cores * the number of Executors actually requested.&lt;/li&gt;
+  &lt;li&gt;&lt;strong&gt;Memory&lt;/strong&gt; = (kylin.engine.spark-conf.spark.executor.memory + spark.yarn.executor.memoryOverhead) * the number of Executors actually requested.&lt;/li&gt;
+  &lt;li&gt;&lt;strong&gt;Capacity of a single Executor&lt;/strong&gt; = kylin.engine.spark-conf.spark.executor.memory / kylin.engine.spark-conf.spark.executor.cores, i.e. the memory requested per CPU during execution.&lt;/li&gt;
+  &lt;li&gt;&lt;strong&gt;Maximum number of Executors&lt;/strong&gt; = kylin.engine.spark-conf.spark.dynamicAllocation.maxExecutors; the platform requests resources dynamically by default, and this parameter caps the request.&lt;/li&gt;
+&lt;/ul&gt;
+
+&lt;p&gt;With sufficient resources, if a single Stage requests 1,000 parallel tasks, the total request reaches 7,000 GB of memory and 1,000 CPUs, i.e.: &lt;code class=&quot;highlighter-rouge&quot;&gt;CPU: 1*1000=1000; Memory: (6+1)*1000=7000GB&lt;/code&gt;.&lt;/p&gt;
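The worked example above (1,000 parallel tasks, 6 GB heap plus 1 GB off-heap per single-core Executor) can be reproduced from the formulas:

```python
def build_resources(executors: int, cores_per_executor: int = 1,
                    executor_memory_gb: int = 6, overhead_gb: int = 1):
    # CPU    = spark.executor.cores * executors
    # Memory = (spark.executor.memory + spark.yarn.executor.memoryOverhead) * executors
    cpus = cores_per_executor * executors
    memory_gb = (executor_memory_gb + overhead_gb) * executors
    return cpus, memory_gb

cpus, mem = build_resources(1000)
print(cpus, mem)  # 1000 CPUs, 7000 GB -> matches the example in the text
```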
+
+&lt;h4 id=&quot;section-13&quot;&gt;Rationalizing Resource Allocation&lt;/h4&gt;
+
+&lt;p&gt;Because of the nature of the by-layer algorithm and Spark's compression mechanism during actual execution, the partition data actually loaded by each Task is far smaller than the configured value. This causes extremely high task parallelism, occupies a large amount of resources, and produces many small files that affect the downstream file-conversion step. Splitting the data reasonably is therefore the key point of this optimization. The Kylin build log shows the estimated size of each layer's Cuboid data and the number of split partitions (equal to the number of Tasks actually created in the Stage). As shown below:&lt;/p&gt;
+
+&lt;p&gt;&lt;img src=&quot;/images/blog/meituan_cn/chart-14.png&quot; 
alt=&quot;&quot; /&gt;&lt;/p&gt;
+
+&lt;p&gt;Combined with the Spark UI, the actual execution can be inspected and the memory request adjusted to just satisfy what execution needs, reducing resource waste.&lt;/p&gt;
+
+&lt;ol&gt;
+  &lt;li&gt;The minimum total resource request should exceed the sum of the cached data of the two largest (Top1 and Top2) layers in the Stage, so that all cached data stays in memory. As shown below:&lt;/li&gt;
+&lt;/ol&gt;
+
+&lt;p&gt;&lt;img src=&quot;/images/blog/meituan_cn/chart-15.png&quot; 
alt=&quot;&quot; /&gt;&lt;/p&gt;
+
+&lt;p&gt;&lt;strong&gt;Formula&lt;/strong&gt;: sum of the cached data of the Top1 and Top2 layers in the Stage &amp;lt; kylin.engine.spark-conf.spark.executor.memory * kylin.engine.spark-conf.spark.memory.fraction * spark.memory.storageFraction * maximum number of Executors&lt;/p&gt;
+
+&lt;ol&gt;
+  &lt;li&gt;The memory and CPU actually needed by a single Task (&lt;strong&gt;one Task uses one CPU during execution&lt;/strong&gt;) should be smaller than a single Executor's capacity. As shown below:&lt;/li&gt;
+&lt;/ol&gt;
+
+&lt;p&gt;&lt;img src=&quot;/images/blog/meituan_cn/chart-16.png&quot; 
alt=&quot;&quot; /&gt;&lt;/p&gt;
+
+&lt;p&gt;&lt;strong&gt;Formula&lt;/strong&gt;: memory actually needed by a single Task &amp;lt; kylin.engine.spark-conf.spark.executor.memory * kylin.engine.spark-conf.spark.memory.fraction * spark.memory.storageFraction / kylin.engine.spark-conf.spark.executor.cores. The parameters are described in the table below:&lt;/p&gt;
+
+&lt;p&gt;&lt;img src=&quot;/images/blog/meituan_cn/chart-17.png&quot; 
alt=&quot;&quot; /&gt;&lt;/p&gt;
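Both caching constraints can be checked mechanically. The sketch below assumes Spark's default memory.fraction=0.6 and storageFraction=0.5 as example values (your cluster's settings may differ), and the layer sizes are made up for illustration:

```python
def storage_memory_gb(executor_memory_gb, memory_fraction, storage_fraction):
    # Memory usable for cached (storage) data in one executor.
    return executor_memory_gb * memory_fraction * storage_fraction

def check_cluster(top1_gb, top2_gb, executor_memory_gb, cores,
                  max_executors, memory_fraction=0.6, storage_fraction=0.5):
    per_executor = storage_memory_gb(executor_memory_gb,
                                     memory_fraction, storage_fraction)
    # Constraint 1: the Top1 + Top2 cached layers fit across all executors.
    layers_fit = (top1_gb + top2_gb) < per_executor * max_executors
    # Constraint 2: the storage-memory budget available to one task (one core).
    task_budget_gb = per_executor / cores
    return layers_fit, task_budget_gb

layers_fit, per_task = check_cluster(top1_gb=800, top2_gb=700,
                                     executor_memory_gb=6, cores=1,
                                     max_executors=1000)
print(layers_fit, per_task)  # True, 1.8 GB of storage memory per task
```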
+
+&lt;h3 id=&quot;section-14&quot;&gt;File Conversion&lt;/h3&gt;
+
+&lt;p&gt;Kylin converts the built Cuboid files into HFiles in HTable format and associates the files with the HTable via BulkLoad, which greatly reduces the load on HBase. This step is completed by a single MapReduce job whose number of Maps equals the number of files output by the layered build stage. The log is as follows:&lt;/p&gt;
+
+&lt;p&gt;&lt;img src=&quot;/images/blog/meituan_cn/chart-18.png&quot; 
alt=&quot;&quot; /&gt;&lt;/p&gt;
+
+&lt;p&gt;In this stage, resources can be requested reasonably based on the actual input file sizes (&lt;strong&gt;visible in the MapReduce logs&lt;/strong&gt;), avoiding resource waste.&lt;/p&gt;
+
+&lt;p&gt;Formula: Map-stage resource request = kylin.job.mr.config.override.mapreduce.map.memory.mb * number of files output by the layered build stage. The parameters are shown in the table below:&lt;/p&gt;
+
+&lt;p&gt;&lt;img src=&quot;/images/blog/meituan_cn/chart-19.png&quot; 
alt=&quot;&quot; /&gt;&lt;/p&gt;
+
+&lt;h2 id=&quot;section-15&quot;&gt;Implementation Path: From Point to Surface&lt;/h2&gt;
+
+&lt;h3 id=&quot;section-16&quot;&gt;Transaction Pilot&lt;/h3&gt;
+
+&lt;p&gt;Through our interpretation of Kylin's principles and the layer-by-layer decomposition of the build process, we selected the core sales-transaction tasks for a pilot. As shown below:&lt;/p&gt;
+
+&lt;p&gt;&lt;img src=&quot;/images/blog/meituan_cn/chart-20.png&quot; 
alt=&quot;&quot; /&gt;&lt;/p&gt;
+
+&lt;h3 id=&quot;section-17&quot;&gt;Comparison of Pilot Results&lt;/h3&gt;
+
+&lt;p&gt;We optimized the core sales-transaction tasks and compared actual resource usage and execution time before and after the adjustments, finally achieving a reduction on both fronts. As shown below:&lt;/p&gt;
+
+&lt;p&gt;&lt;img src=&quot;/images/blog/meituan_cn/chart-21.png&quot; 
alt=&quot;&quot; /&gt;&lt;/p&gt;
+
+&lt;h2 id=&quot;section-18&quot;&gt;Results&lt;/h2&gt;
+
+&lt;h3 id=&quot;section-19&quot;&gt;Overall Resource Usage&lt;/h3&gt;
+
+&lt;p&gt;Qingtian currently runs 20+ Kylin tasks. After half a year of continuous optimization, comparing the monthly average CU usage of the Kylin resource queue and the CU usage of pending tasks shows that resource consumption for the same tasks has dropped significantly. As shown below:&lt;/p&gt;
+
+&lt;p&gt;&lt;img src=&quot;/images/blog/meituan_cn/chart-23.png&quot; 
alt=&quot;&quot; /&gt;&lt;/p&gt;
+
+&lt;h3 id=&quot;sla&quot;&gt;Overall SLA Attainment&lt;/h3&gt;
+
+&lt;p&gt;After this point-to-surface overall optimization, Qingtian's SLA attainment reached 100% in June 2020. As shown below:&lt;/p&gt;
+
+&lt;p&gt;&lt;img src=&quot;/images/blog/meituan_cn/chart-24.png&quot; 
alt=&quot;&quot; /&gt;&lt;/p&gt;
+
+&lt;h2 id=&quot;section-20&quot;&gt;Outlook&lt;/h2&gt;
+
+&lt;p&gt;Apache Kylin officially became a top-level project of the Apache Software Foundation in November 2015. It took only 13 months from open-sourcing to top-level status, and it was also the first top-level project contributed to Apache entirely by a Chinese team.&lt;/p&gt;
+
+&lt;p&gt;At present, Meituan uses the fairly stable V2.0 release. After nearly four years of use, the in-store dining technical team has accumulated a great deal of experience in optimizing both query performance and build efficiency; this article mainly described the resource adaptation method for the Spark build process. Notably, the Kylin community released V3.1 in July 2020, introducing Flink as a build engine and using Flink uniformly for the core build stages: data reading, dictionary building, layered building, and file conversion. These four parts account for more than 95% of total build time, so this upgrade also greatly improved Kylin's build efficiency. For details, see: Flink Cube Build Engine.&lt;/p&gt;
+
+&lt;p&gt;Looking back at the evolution of Kylin's build engine, from MapReduce to Spark and now to Flink, the build tooling has kept moving toward better mainstream engines. The Kylin community also has many active and outstanding code contributors who are helping expand the Kylin ecosystem and add new features, which is well worth learning from. Finally, Meituan's in-store dining technical team would like to thank the Apache Kylin project team once again.&lt;/p&gt;
+</description>
+        <pubDate>Tue, 03 Aug 2021 08:00:00 -0700</pubDate>
+        
<link>http://kylin.apache.org/cn_blog/2021/08/03/How-Meituan-Dominates-Online-Shopping-with-Apache-Kylin/</link>
+        <guid 
isPermaLink="true">http://kylin.apache.org/cn_blog/2021/08/03/How-Meituan-Dominates-Online-Shopping-with-Apache-Kylin/</guid>
+        
+        
+        <category>cn_blog</category>
+        
+      </item>
+    
+      <item>
+        <title>How Meituan Dominates Online Shopping with Apache Kylin</title>
+        <description>&lt;p&gt;Let’s face it, online shopping now affects 
nearly every part of our shopping lives. From ordering groceries to &lt;a 
href=&quot;https://www.carvana.com/&quot;&gt;purchasing a car&lt;/a&gt;, 
we’re living in an age of limitless choices when it comes to online commerce. 
Nowhere is this more the case than with the world’s 2nd largest consumer 
market: China.&lt;/p&gt;
+
+&lt;p&gt;Leading the online shopping revolution in China is Meituan, who since 
2016 has grown to support nearly 460 million consumers from over 2,000 
industries, regularly processing hundreds of $billions in transactions. To 
support these staggering operations, Meituan has invested heavily in its data 
analytics system and employs more than 10,000 engineers to ensure a stable and 
reliable experience for their customers.&lt;/p&gt;
+
+&lt;p&gt;But the driving force behind Meituan’s success is not simply a 
robust analytics system. While the organization’s executives might think so, 
its engineers understand that it is the OLAP engine that system is built upon 
that has empowered the company to move quickly and win in the market.&lt;/p&gt;
+
+&lt;h2 
id=&quot;meituans-secret-weapon-apache-kylin&quot;&gt;&lt;strong&gt;Meituan’s 
Secret Weapon: Apache Kylin&lt;/strong&gt;&lt;/h2&gt;
+
+&lt;p&gt;Since 2016, Meituan’s technical team has relied on&lt;a 
href=&quot;https://kyligence.io/apache-kylin-overview/&quot;&gt; Apache 
Kylin&lt;/a&gt; to power their&lt;a 
href=&quot;https://kyligence.io/resources/extreme-olap-with-apache-kylin/&quot;&gt;
 OLAP engine&lt;/a&gt;. Apache Kylin, an open source OLAP engine built on the 
Hadoop platform, resolves complex queries at sub-second speeds through 
multidimensional precomputation, allowing for blazing-fast analysis on even the 
largest datasets.&lt;/p&gt;
+
+&lt;p&gt;However, the limitations of this open source solution became apparent 
as the company’s business grew, becoming less and less efficient as cubes and 
queries became larger and more complex. To solve this problem, the engineering 
team leveraged Kylin’s open source foundations to dig into the engine, 
understand its underlying principles, and develop an implementation strategy 
that other organizations using Kylin can adopt to greatly improve their data 
output efficiency.&lt;/p&gt;
+
+&lt;p&gt;Meituan’s technical team has graciously shared their story of this 
process below so that you can apply it toward solving your own big data 
challenges.&lt;/p&gt;
+
+&lt;h2 
id=&quot;a-global-pandemic-and-a-new-normal-for-business&quot;&gt;&lt;strong&gt;A
 Global Pandemic and a New Normal for Business&lt;/strong&gt;&lt;/h2&gt;
+
+&lt;p&gt;For the last four years, Meituan’s Qingtian sales system has served 
as the company’s data processing workhorse, handling massive amounts of daily 
sales data involving a wide range of highly complex technical scenarios. The 
stability and efficiency of this system is paramount, and it’s why 
Meituan’s engineers have made significant investments in optimizing the OLAP 
engine Qingtian is built upon.&lt;/p&gt;
+
+&lt;p&gt;After a thorough investigation, the team identified Apache Kylin as 
the only OLAP engine that could meet their needs and scale with anticipated 
growth. The engine was rolled out in 2016 and, over the next few years, Kylin 
played an important role in the company’s evolving data analytics 
system.&lt;/p&gt;
+
+&lt;p&gt;Growth expectations, however, turned out to be severely 
underestimated, as a global pandemic quickly drove major changes in how 
consumers shopped and how businesses sold their goods. Such a massive shift in 
online shopping led to even faster growth for Meituan as well as a nearly 
untenable amount of new business data.&lt;/p&gt;
+
+&lt;p&gt;This caused efficiency bottlenecks that even their Kylin-based system 
started to struggle with. Cube building and query performance was unable to 
keep up with these changes in consumer behaviors, slowing down data analysis 
and decision-making and creating a major obstacle towards addressing user 
experiences.&lt;/p&gt;
+
+&lt;p&gt;Meituan’s technical team would spend the next six months carrying 
out optimizations and iterations for Kylin, including dimension pruning, model 
design, resource adaptation, and improving SLA compliance.&lt;/p&gt;
+
+&lt;h2 
id=&quot;responding-to-new-consumer-behaviors-with-apache-kylin&quot;&gt;&lt;strong&gt;Responding
 to New Consumer Behaviors with Apache Kylin&lt;/strong&gt;&lt;/h2&gt;
+
+&lt;p&gt;In order to understand the approach taken when optimizing Meituan’s 
data architecture, it’s important to understand how the business is managed. 
The company’s sales force operates with two business models – in-store 
sales and phone sales – and is then further broken down by various 
territories and corporate departments. All analytics data must be communicated 
across both business models.&lt;/p&gt;
+
+&lt;p&gt;With this in mind, Meituan engineers incorporated Kylin into their 
design of the data architecture as follows:&lt;/p&gt;
+
+&lt;p&gt;&lt;img src=&quot;/images/blog/meituan/chart-01.jpeg&quot; 
alt=&quot;&quot; /&gt;&lt;/p&gt;
+
+&lt;p&gt;Figure 3. Apache Kylin’s layer-by-layer building data flow&lt;/p&gt;
+
+&lt;p&gt;While this design addressed many of Meituan’s initial concerns 
around scalability and efficiency, continued shifts in consumer behaviors and 
the organization’s response to dramatic changes in the market put enormous 
pressure on Kylin when it came to building cubes. This led to an unsustainable 
level of consumption of both resources and time.&lt;/p&gt;
+
+&lt;p&gt;It became clear that Kylin’s MOLAP model was presenting the 
following challenges:&lt;/p&gt;
+
+&lt;ul&gt;
+  &lt;li&gt;The build process involved many steps that were highly correlated, 
making it difficult to root cause problems.&lt;/li&gt;
+  &lt;li&gt;MapReduce - instead of the more efficient Spark - was still being 
used as the build engine for historical tasks.&lt;/li&gt;
+  &lt;li&gt;The platform’s default dynamic resource adaption method demanded 
considerable resources for small tasks. Data was sharded unnecessarily and a 
large number of small files were generated, resulting in a waste of 
resources.&lt;/li&gt;
+  &lt;li&gt;Data volumes Meituan was now having to work with were well beyond 
the original architectural plan, resulting in two hours of cube building every 
day.&lt;/li&gt;
+  &lt;li&gt;The overall SLA fulfillment rate remained lower than 
expected.&lt;/li&gt;
+&lt;/ul&gt;
+
+&lt;p&gt;Recognizing these problems, the team set a goal of improving the 
platform’s efficiency (you can see the quantitative targets below). Finding a 
solution would involve classifying Kylin’s build process, digging into how 
Kylin worked under the hood, breaking down that process, and finally 
implementing a solution.&lt;/p&gt;
+
+&lt;p&gt;&lt;img src=&quot;/images/blog/meituan/chart-02.png&quot; 
alt=&quot;&quot; /&gt;&lt;/p&gt;
+
+&lt;p&gt;Figure 4. Implementation path diagram&lt;/p&gt;
+
+&lt;h2 
id=&quot;optimization-understanding-how-apache-kylin-builds-cubes&quot;&gt;&lt;strong&gt;Optimization:
 Understanding How Apache Kylin Builds Cubes&lt;/strong&gt;&lt;/h2&gt;
+
+&lt;p&gt;Understanding the cube building process is critical for pinpointing 
efficiency and performance issues. In the case of Kylin, a solid grasp of its 
precomputation approach and its “by layer” cubing algorithm are necessary 
when formulating a solution.&lt;/p&gt;
+
+&lt;p&gt;&lt;strong&gt;Precomputation with Apache 
Kylin&lt;/strong&gt;&lt;/p&gt;
+
+&lt;p&gt;Apache Kylin generates all possible dimensional combinations and 
pre-calculates the metrics that may be used in future multidimensional 
analysis, saving the results as a cube. Metric aggregation results are saved on 
&lt;em&gt;cuboids&lt;/em&gt; (a logical branch of the cube), and during queries 
relevant cuboids are found through SQL statements, and then read and quickly 
returned as metric values.&lt;/p&gt;
+
+&lt;p&gt;&lt;img src=&quot;/images/blog/meituan/chart-03.jpeg&quot; 
alt=&quot;&quot; /&gt;&lt;/p&gt;
+
+&lt;p&gt;Figure 5. Precomputation across four dimensions example&lt;/p&gt;
+
+&lt;p&gt;&lt;strong&gt;Apache Kylin’s By-Layer Cubing 
Algorithm&lt;/strong&gt;&lt;/p&gt;
+
+&lt;p&gt;An N-dimensional cube is composed of 1 N-dimensional sub-cube, N 
(N-1)-dimensional sub-cubes, N*(N-1)/2 (N-2)-dimensional sub-cubes, …, N 
1-dimensional sub-cubes, and one 0-dimensional sub-cube, consisting of a total 
of 2^N sub-cubes. In Kylin’s by-layer cubing algorithm, the number of 
dimensions decreases with the calculation of each layer, and each layer’s 
calculation is based on the calculation result of its parent layer (except the 
first layer, which bases it on the source data).&lt;/p&gt;
+
+&lt;p&gt;&lt;img src=&quot;/images/blog/meituan/chart-04.png&quot; 
alt=&quot;&quot; /&gt;&lt;/p&gt;
+
+&lt;p&gt;Figure 6. Cuboid example&lt;/p&gt;
+
+&lt;h2 id=&quot;the-proof-is-in-the-process&quot;&gt;&lt;strong&gt;The Proof 
Is in the Process&lt;/strong&gt;&lt;/h2&gt;
+
+&lt;p&gt;Understanding the principles outlined above, the Meituan team 
identified five key areas to focus on for optimization: engine selection, data 
reading, dictionary building, layer-by-layer build, and file conversion. 
Addressing these areas would lead to the greatest gains in reducing the 
required resources for calculation and shortening processing time.&lt;/p&gt;
+
+&lt;p&gt;The team outlined the challenges, their solutions, and key objectives 
in the following table:&lt;/p&gt;
+
+&lt;p&gt;&lt;img src=&quot;/images/blog/meituan/chart-05.jpeg&quot; 
alt=&quot;&quot; /&gt;&lt;/p&gt;
+
+&lt;p&gt;Figure 7. Breakdown of Apache Kylin’s process&lt;/p&gt;
+
+&lt;h2 
id=&quot;putting-apache-kylin-to-the-test&quot;&gt;&lt;strong&gt;Putting Apache 
Kylin to the Test&lt;/strong&gt;&lt;/h2&gt;
+
+&lt;p&gt;With their solutions in place, the next step was to test if Kylin’s 
build process had actually improved. To do this, the team selected a set of 
critical sales tasks and ran a pilot (outlined below):&lt;/p&gt;
+
+&lt;p&gt;&lt;img src=&quot;/images/blog/meituan/chart-06.jpeg&quot; 
alt=&quot;&quot; /&gt;&lt;/p&gt;
+
+&lt;p&gt;Figure 8. Meituan’s pilot program for their Apache Kylin 
optimizations&lt;/p&gt;
+
+&lt;p&gt;The results of the pilot were astonishing. Ultimately, the team was 
able to realize a significant reduction in resource consumption as seen in the 
following chart:&lt;/p&gt;
+
+&lt;p&gt;&lt;img src=&quot;/images/blog/meituan/chart-07.jpeg&quot; 
alt=&quot;&quot; /&gt;&lt;/p&gt;
+
+&lt;p&gt;Figure 9. Resource usage and performance of Apache Kylin before and 
after pilot&lt;/p&gt;
+
+&lt;h2 id=&quot;analytics-optimized&quot;&gt;&lt;strong&gt;Analytics Optimized&lt;/strong&gt;&lt;/h2&gt;
+
+&lt;p&gt;Today, Meituan’s Qingtian system is processing over 20 different 
Kylin tasks, and after six months of constant optimization, the monthly CU 
usage for Kylin’s resource queue and the CU usage for pending tasks have seen 
significant reductions.&lt;/p&gt;
+
+&lt;p&gt;&lt;img src=&quot;/images/blog/meituan/chart-08.jpeg&quot; 
alt=&quot;&quot; /&gt;&lt;/p&gt;
+
+&lt;p&gt;Figure 10. Current performance of Apache Kylin after solution 
implementation&lt;/p&gt;
+
+&lt;p&gt;Resource usage isn’t the only area of impressive improvement. The 
Qingtian system’s SLA compliance also was able to reach 100% as of June 
2020.&lt;/p&gt;
+
+&lt;p&gt;&lt;img src=&quot;/images/blog/meituan/chart-09.jpeg&quot; 
alt=&quot;&quot; /&gt;&lt;/p&gt;
+
+&lt;p&gt;Figure 11. Meituan SLA compliance after Apache Kylin 
optimization&lt;/p&gt;
+
+&lt;h2 
id=&quot;taking-on-the-future-with-apache-kylin&quot;&gt;&lt;strong&gt;Taking 
on the Future with Apache Kylin&lt;/strong&gt;&lt;/h2&gt;
+
+&lt;p&gt;Over the past four years, Meituan’s technical team has accumulated 
a great deal of experience in optimizing query performance and build efficiency 
with Apache Kylin. But Meituan’s success is also the story of open source’s 
success.&lt;/p&gt;
+
+&lt;p&gt;The&lt;a href=&quot;http://kylin.apache.org/community/&quot;&gt; 
Apache Kylin community&lt;/a&gt; has many active and outstanding code 
contributors (&lt;a 
href=&quot;https://kyligence.io/comparing-kylin-vs-kyligence/&quot;&gt;including
 Kyligence&lt;/a&gt;), who are relentlessly working to expand the Kylin 
ecosystem and add more new features. It’s in sharing success stories like 
this that Apache Kylin is able to remain the leading open source solution for 
analytics on massive datasets.&lt;/p&gt;
+
+&lt;p&gt;Together, with the entire Apache Kylin community, Meituan is making 
sure critical analytics work can remain unburdened by growing datasets, and 
that when the next major shift in business takes place, industry leaders like 
Meituan will be able to analyze what’s happening and quickly take 
action.&lt;/p&gt;
+</description>
+        <pubDate>Tue, 03 Aug 2021 08:00:00 -0700</pubDate>
+        
<link>http://kylin.apache.org/blog/2021/08/03/How-Meituan-Dominates-Online-Shopping-with-Apache-Kylin/</link>
+        <guid 
isPermaLink="true">http://kylin.apache.org/blog/2021/08/03/How-Meituan-Dominates-Online-Shopping-with-Apache-Kylin/</guid>
+        
+        
+        <category>blog</category>
+        
+      </item>
+    
+      <item>
        <title>Apache Kylin 4 New Architecture Sharing</title>
        <description>&lt;p&gt;This article is mainly divided into the following parts:&lt;br /&gt;
 - Apache Kylin usage scenarios&lt;br /&gt;
@@ -217,6 +577,155 @@ For example, a query joins two subquerie
       </item>
     
       <item>
+        <title>Why did Youzan choose Kylin4</title>
+        <description>&lt;p&gt;At the QCon Global Software Developers 
Conference held on May 29, 2021, Zheng Shengjun, head of Youzan’s data 
infrastructure platform, shared Youzan’s internal use experience and 
optimization practice of Kylin 4.0 on the meeting room of open source big data 
frameworks and applications. &lt;br /&gt;
+For many users of Kylin 2/3 (Kylin on HBase), this is also a chance to learn how and why to upgrade to Kylin 4.&lt;/p&gt;
+
+&lt;p&gt;This sharing is mainly divided into the following parts:&lt;/p&gt;
+
+&lt;ul&gt;
+  &lt;li&gt;The reason for choosing Kylin 4&lt;/li&gt;
+  &lt;li&gt;Introduction to Kylin 4&lt;/li&gt;
+  &lt;li&gt;How to optimize performance of Kylin 4&lt;/li&gt;
+  &lt;li&gt;Practice of Kylin 4 in Youzan&lt;/li&gt;
+&lt;/ul&gt;
+
+&lt;h2 id=&quot;the-reason-for-choosing-kylin-4&quot;&gt;01 The reason for 
choosing Kylin 4&lt;/h2&gt;
+
+&lt;h3 id=&quot;introduction-to-youzan&quot;&gt;Introduction to 
Youzan&lt;/h3&gt;
+&lt;p&gt;China Youzan Co., Ltd. (stock code 08083.HK) is an enterprise mainly engaged in retail technology services.&lt;br /&gt;
+At present, it owns several tools and solutions to provide SaaS software 
products and talent services to help merchants operate mobile social e-commerce 
and new retail channels in an all-round way. &lt;br /&gt;
+Currently Youzan has hundreds of millions of consumers and 6 million existing 
merchants.&lt;/p&gt;
+
+&lt;h3 id=&quot;history-of-kylin-in-youzan&quot;&gt;History of Kylin in 
Youzan&lt;/h3&gt;
+&lt;p&gt;&lt;img src=&quot;/images/blog/youzan/1 
history_of_youzan_OLAP.png&quot; alt=&quot;&quot; /&gt;&lt;/p&gt;
+
+&lt;p&gt;First of all, I would like to share why Youzan chose to upgrade to Kylin 4. Here, let me briefly review the history of Youzan's OLAP infrastructure.&lt;/p&gt;
+
+&lt;p&gt;In the early days of Youzan, in order to iterate quickly on development, we chose the approach of pre-computation + MySQL. In 2018, Druid was introduced for its query flexibility and development efficiency, but it had problems such as a low degree of pre-aggregation and no support for precise count-distinct measures. In this situation, Youzan introduced Apache Kylin and ClickHouse: Kylin supports a high degree of aggregation, precise count-distinct measures, and the lowest RT, while ClickHouse is quite flexible in usage (ad hoc queries).&lt;/p&gt;
+
+&lt;p&gt;From the introduction of Kylin in 2018 to now, Youzan has used Kylin 
for more than three years. With the continuous enrichment of business scenarios 
and the continuous accumulation of data volume, Youzan currently has 6 million 
existing merchants, GMV in 2020 is 107.3 billion, and the daily build data 
volume is 10 billion +. At present, Kylin has basically covered all the 
business scenarios of Youzan.&lt;/p&gt;
+
+&lt;h3 id=&quot;the-challenges-of-kylin-3&quot;&gt;The challenges of Kylin 
3&lt;/h3&gt;
+&lt;p&gt;With Youzan’s rapid development and in-depth use of Kylin, we also 
encountered some challenges:&lt;/p&gt;
+
+&lt;ul&gt;
+  &lt;li&gt;First of all, the build performance of Kylin on HBase cannot meet 
the favorable expectations, and the build performance will affect the user’s 
failure recovery time and stability experience;&lt;/li&gt;
+  &lt;li&gt;Secondly, the onboarding of more large merchants (tens of millions of members in a single store, hundreds of thousands of goods per store) brings great challenges to our OLAP system. Kylin on HBase is limited by the single-point query of the Query Server and cannot support these complex scenarios well;&lt;/li&gt;
+  &lt;li&gt;Finally, because HBase is not a cloud-native system, it is difficult to scale up and down flexibly. As data volume keeps growing and business load has peaks and valleys, average resource utilization remains low.&lt;/li&gt;
+&lt;/ul&gt;
+
+&lt;p&gt;Faced with these challenges, Youzan chose to upgrade to the more cloud-native Apache Kylin 4.&lt;/p&gt;
+
+&lt;h2 id=&quot;introduction-to-kylin-4&quot;&gt;02 Introduction to Kylin 
4&lt;/h2&gt;
+&lt;p&gt;First of all, let’s introduce the main advantages of Kylin 4. Apache Kylin 4 depends entirely on Spark for cubing jobs and queries. It can make full use of Spark’s parallelization, vectorization, and whole-stage code generation technologies to improve the efficiency of large queries.&lt;br /&gt;
+Here is a brief introduction to the principles of Kylin 4, namely its storage engine, build engine and query engine.&lt;/p&gt;
+
+&lt;h3 id=&quot;storage-engine&quot;&gt;Storage engine&lt;/h3&gt;
+&lt;p&gt;&lt;img src=&quot;/images/blog/youzan/2 kylin4_storage.png&quot; 
alt=&quot;&quot; /&gt;&lt;/p&gt;
+
+&lt;p&gt;First of all, let’s take a look at the new storage engine, comparing Kylin on HBase with Kylin on Parquet. In Kylin on HBase, cuboid data is stored in HBase tables; a single segment corresponds to one HBase table, and aggregation is pushed down to the HBase coprocessor.&lt;/p&gt;
+
+&lt;p&gt;But as we know, HBase is not true columnar storage and its throughput is not enough for an OLAP system. Kylin 4 replaces HBase with Parquet: all the data is stored in files, and each segment has a corresponding HDFS directory. All queries and cubing jobs read and write files without HBase. Although there is a certain performance loss for simple queries, the improvement for complex queries is considerable and worthwhile.&lt;/p&gt;
+
+&lt;h3 id=&quot;build-engine&quot;&gt;Build engine&lt;/h3&gt;
+&lt;p&gt;&lt;img src=&quot;/images/blog/youzan/3 kylin4_build_engine.png&quot; 
alt=&quot;&quot; /&gt;&lt;/p&gt;
+
+&lt;p&gt;The second is the new build engine. Based on our test, the build 
speed of Kylin on Parquet has been optimized from 82 minutes to 15 minutes. 
There are several reasons:&lt;/p&gt;
+
+&lt;ul&gt;
+  &lt;li&gt;Kylin 4 removes dimension encoding, eliminating one build step;&lt;/li&gt;
+  &lt;li&gt;Removed the HBase File generation step;&lt;/li&gt;
+  &lt;li&gt;Kylin on Parquet changes the granularity of cubing to cuboid 
level, which is conducive to further improving parallelism of cubing 
job.&lt;/li&gt;
+  &lt;li&gt;Enhanced implementation of the global dictionary. In the new algorithm, dictionary and source data are hashed into the same buckets, making it possible to load only one dictionary bucket to encode a piece of source data.&lt;/li&gt;
+&lt;/ul&gt;
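The bucketed global dictionary in the last bullet can be sketched as follows. This is a hypothetical simplification for illustration only (the bucket count, per-bucket capacity and hash function are assumptions, not Kylin’s actual code): each value hashes to one bucket, each bucket assigns ids within its own range, so an executor encoding a value only needs to load that one bucket’s dictionary.

```python
# Simplified sketch of a bucketed global dictionary (illustrative only;
# Kylin's real implementation differs in detail). Values are hashed into
# a fixed number of buckets; each bucket assigns ids within its own id
# range, so encoding a value touches exactly one bucket.

NUM_BUCKETS = 4
IDS_PER_BUCKET = 1_000_000  # assumed capacity per bucket

def bucket_of(value):
    return hash(value) % NUM_BUCKETS

class BucketDictionary:
    def __init__(self, bucket_id):
        self.base = bucket_id * IDS_PER_BUCKET  # start of this bucket's id range
        self.mapping = {}

    def encode(self, value):
        # Assign the next id in this bucket's range on first sight.
        if value not in self.mapping:
            self.mapping[value] = self.base + len(self.mapping)
        return self.mapping[value]

# Build phase: each bucket's dictionary is built independently (and could
# therefore be built in parallel).
buckets = {b: BucketDictionary(b) for b in range(NUM_BUCKETS)}

def encode_value(value):
    # Only the one bucket the value hashes to must be loaded.
    return buckets[bucket_of(value)].encode(value)

ids = [encode_value(v) for v in ["a", "b", "a", "c"]]
```

Because id ranges never overlap across buckets, the assigned ids stay globally unique without any cross-bucket coordination.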
+
+&lt;p&gt;As you can see on the right, after upgrading to Kylin 4, the cubing job goes from ten steps down to two, so the build performance improvement is very noticeable.&lt;/p&gt;
+
+&lt;h3 id=&quot;query-engine&quot;&gt;Query engine&lt;/h3&gt;
+&lt;p&gt;&lt;img src=&quot;/images/blog/youzan/4 kylin4_query.png&quot; 
alt=&quot;&quot; /&gt;&lt;/p&gt;
+
+&lt;p&gt;Next is the new query engine of Kylin 4. As you can see, the calculation of Kylin on HBase depends entirely on the HBase coprocessor and the query server process. When data is read from HBase into the query server to do aggregation, sorting, etc., the single query server becomes the bottleneck. Kylin 4 switches to a fully distributed query mechanism based on Spark; what’s more, it is able to tune Spark configuration automatically in the query step!&lt;/p&gt;
+
+&lt;h2 id=&quot;how-to-optimize-performance-of-kylin-4&quot;&gt;03 How to 
optimize performance of Kylin 4&lt;/h2&gt;
+&lt;p&gt;Next, I’d like to share some performance optimizations made by 
Youzan in Kylin 4.&lt;/p&gt;
+
+&lt;h3 id=&quot;optimization-of-query-engine&quot;&gt;Optimization of query 
engine&lt;/h3&gt;
+&lt;h4 id=&quot;cache-calcite-physical-plan&quot;&gt;1.Cache Calcite physical plan&lt;/h4&gt;
+&lt;p&gt;&lt;img src=&quot;/images/blog/youzan/5 cache_calcite_plan.png&quot; alt=&quot;&quot; /&gt;&lt;/p&gt;
+
+&lt;p&gt;In Kylin 4, SQL is analyzed, optimized and code-generated in Calcite. This step takes about 150ms for some queries. We have added PreparedStatementCache support in Kylin 4 to cache the Calcite plan, so that structurally identical SQL doesn’t have to repeat this step. This optimization saves about 150ms per query.&lt;/p&gt;
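Generically, the idea is a small plan cache keyed by the statement text, as in this hypothetical sketch (the `plan` function is a stand-in for the analyze/optimize/codegen step, not Kylin’s or Calcite’s actual API):

```python
from functools import lru_cache

# Hypothetical sketch: cache the optimized plan by SQL text so that a
# repeated, structurally identical statement skips the ~150 ms
# analyze/optimize/codegen step. Illustrative only.

PLANNING_CALLS = 0

@lru_cache(maxsize=1024)
def plan(sql):
    global PLANNING_CALLS
    PLANNING_CALLS += 1
    # Stand-in for Calcite's analysis, optimization and code generation.
    return ("physical-plan-for", sql.strip().lower())

p1 = plan("SELECT COUNT(*) FROM kylin_sales")
p2 = plan("SELECT COUNT(*) FROM kylin_sales")  # served from the cache
```

The second call returns the cached plan object, so only the first invocation pays the planning cost.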
+
+&lt;h4 id=&quot;tunning-spark-configuration&quot;&gt;2.Tuning spark configuration&lt;/h4&gt;
+&lt;p&gt;&lt;img src=&quot;/images/blog/youzan/6 
tuning_spark_configuration.png&quot; alt=&quot;&quot; /&gt;&lt;/p&gt;
+
+&lt;p&gt;Kylin 4 uses Spark as its query engine. As Spark is a distributed engine designed for massive data processing, it inevitably loses some performance on small queries. We have done some tuning to catch up with the latency of Kylin on HBase for small queries.&lt;/p&gt;
+
+&lt;p&gt;Our first optimization is to make more calculations finish in memory. 
The key is to avoid data spill during aggregation, shuffle and sort. Tuning the 
following configuration is helpful.&lt;/p&gt;
+
+&lt;ul&gt;
+  &lt;li&gt;Set &lt;code class=&quot;highlighter-rouge&quot;&gt;spark.sql.objectHashAggregate.sortBased.fallbackThreshold&lt;/code&gt; to a larger value to keep HashAggregate from falling back to sort-based aggregation, which really kills performance when it happens.&lt;/li&gt;
+  &lt;li&gt;Set &lt;code class=&quot;highlighter-rouge&quot;&gt;spark.shuffle.spill.initialMemoryThreshold&lt;/code&gt; to a large value to avoid too many spills during shuffle.&lt;/li&gt;
+&lt;/ul&gt;
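As a sketch, such Spark settings can be overridden for the query engine via Kylin 4’s `kylin.query.spark-conf.` prefix in kylin.properties; the values below are purely illustrative, not recommendations.

```shell
# Illustrative values only -- tune to your own workload.
cat >> $KYLIN_HOME/conf/kylin.properties <<'EOF'
# Keep hash aggregation from falling back to sort-based aggregation
kylin.query.spark-conf.spark.sql.objectHashAggregate.sortBased.fallbackThreshold=4096
# Raise the initial in-memory threshold before shuffle data spills
kylin.query.spark-conf.spark.shuffle.spill.initialMemoryThreshold=268435456
EOF
```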
+
+&lt;p&gt;Secondly, we route small queries to a Query Server that runs Spark in local mode, because the overhead of task scheduling, shuffle reads and variable broadcast is amplified for small queries in YARN/Standalone mode.&lt;/p&gt;
+
+&lt;p&gt;Thirdly, we use a RAM disk to enhance shuffle performance: mount the RAM disk as tmpfs and point spark.local.dir to a directory on it.&lt;/p&gt;
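A minimal sketch of that setup (the mount point and size are assumptions, and mounting requires root):

```shell
# Mount a RAM-backed tmpfs and point Spark's scratch space at it.
# Mount point and size are illustrative.
mkdir -p /mnt/ramdisk
mount -t tmpfs -o size=32g tmpfs /mnt/ramdisk
mkdir -p /mnt/ramdisk/spark

# Then in the Spark configuration:
#   spark.local.dir=/mnt/ramdisk/spark
```

Shuffle scratch files then live in memory rather than on disk, at the cost of reserving that RAM.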
+
+&lt;p&gt;Lastly, we disabled Spark’s whole-stage code generation for small queries, since it costs about 100ms~200ms and brings no benefit to small queries that are simple projections.&lt;/p&gt;
+
+&lt;h4 id=&quot;parquet-optimization&quot;&gt;3.Parquet optimization&lt;/h4&gt;
+&lt;p&gt;&lt;img src=&quot;/images/blog/youzan/7 
parquet_optimization.png&quot; alt=&quot;&quot; /&gt;&lt;/p&gt;
+
+&lt;p&gt;Optimizing parquet is also important for queries.&lt;/p&gt;
+
+&lt;p&gt;The first principle is that we’d better always include the shard-by column in our filter condition: Parquet files are sharded by the shard-by column, so filtering on it reduces the number of data files to read.&lt;/p&gt;
+
+&lt;p&gt;Then, looking inside the Parquet files, data within a file is sorted by the rowkey columns; that is to say, prefix matching in a query is as important as in Kylin on HBase. When a query condition satisfies a prefix match, row groups can be filtered with the column’s max/min index. Furthermore, we can reduce the row group size for finer index granularity, but be aware that the compression rate will be lower with smaller row groups.&lt;/p&gt;
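The min/max pruning and the granularity trade-off can be sketched as follows (a toy model of what a Parquet reader does with column statistics, not Kylin’s code):

```python
# Toy sketch of min/max row-group pruning on a sorted key, mimicking
# what a Parquet reader does with column statistics. Smaller row groups
# give a finer min/max index, so fewer rows are scanned.

def split_row_groups(sorted_keys, rows_per_group):
    groups = []
    for i in range(0, len(sorted_keys), rows_per_group):
        chunk = sorted_keys[i:i + rows_per_group]
        groups.append({"min": chunk[0], "max": chunk[-1], "rows": chunk})
    return groups

def groups_to_scan(groups, key):
    # A point filter on the leading (prefix) key skips any row group
    # whose [min, max] range cannot contain the key.
    return [g for g in groups if g["min"] <= key <= g["max"]]

keys = list(range(100))              # data already sorted by rowkey
coarse = split_row_groups(keys, 50)  # 2 large row groups
fine = split_row_groups(keys, 10)    # 10 small row groups

scanned_coarse = groups_to_scan(coarse, 42)  # must scan 50 rows
scanned_fine = groups_to_scan(fine, 42)      # must scan only 10 rows
```

Both layouts prune down to a single row group, but the finer layout scans 10 rows instead of 50, illustrating why smaller row groups help point lookups (while hurting compression).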
+
+&lt;h4 
id=&quot;dynamic-elimination-of-partitioning-dimensions&quot;&gt;4.Dynamic 
elimination of partitioning dimensions&lt;/h4&gt;
+&lt;p&gt;Kylin 4 has a new capability that older versions lack, which can reduce data reading and computing dozens of times over for some big queries. It’s often the case that the partition column is used to filter data but not as a group-by dimension. For those cases Kylin would always choose a cuboid containing the partition column, but now it is able to use a different cuboid in that query to reduce IO and computation.&lt;/p&gt;
+
+&lt;p&gt;The key to this optimization is to split a query into two parts: one part covers the segments whose data is used in full, so the partition column doesn’t need to be in the cuboid; the other part covers the partially used segments and chooses a cuboid with the partition dimension to filter the data.&lt;/p&gt;
+
+&lt;p&gt;In our tests, the response time of some queries dropped from 20s to 6s, or from 10s to 3s.&lt;/p&gt;
+
+&lt;p&gt;&lt;img src=&quot;/images/blog/youzan/8 
Dynamic_elimination_of_partitioning_dimensions.png&quot; alt=&quot;&quot; 
/&gt;&lt;/p&gt;
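The segment split described above can be sketched like this (hypothetical weekly segments and date ranges; not Kylin’s actual planner code):

```python
from datetime import date

# Hypothetical sketch of splitting a time-range filter across segments:
# segments fully covered by the filter can be answered from a cuboid
# WITHOUT the partition column; only partially covered segments need a
# cuboid WITH it, to filter rows inside the segment.

def split_by_segments(segments, start, end):
    full, partial = [], []
    for seg_start, seg_end in segments:  # segment ranges, end-exclusive
        if start <= seg_start and seg_end <= end:
            full.append((seg_start, seg_end))          # fully covered
        elif seg_end > start and seg_start < end:
            partial.append((seg_start, seg_end))       # partially covered
    return full, partial

# Four weekly segments and a query range that crosses segment boundaries.
segments = [
    (date(2021, 6, 1), date(2021, 6, 8)),
    (date(2021, 6, 8), date(2021, 6, 15)),
    (date(2021, 6, 15), date(2021, 6, 22)),
    (date(2021, 6, 22), date(2021, 6, 29)),
]
full, partial = split_by_segments(segments, date(2021, 6, 5), date(2021, 6, 20))
```

Only the two boundary segments need the partition column for row-level filtering; the fully covered middle segment can use a smaller cuboid, which is where the IO and compute savings come from.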
+
+&lt;h3 id=&quot;optimization-of-build-engine&quot;&gt;Optimization of build 
engine&lt;/h3&gt;
+&lt;h4 id=&quot;cache-parent-dataset&quot;&gt;1.Cache parent dataset&lt;/h4&gt;
+&lt;p&gt;&lt;img src=&quot;/images/blog/youzan/9 cache_parent_dataset.png&quot; alt=&quot;&quot; /&gt;&lt;/p&gt;
+
+&lt;p&gt;Kylin builds the cube layer by layer. For a parent layer with multiple cuboids to build, we can cache the parent dataset by setting &lt;code class=&quot;highlighter-rouge&quot;&gt;kylin.engine.spark.parent-dataset.max.persist.count&lt;/code&gt; to a number greater than 0. But notice that if you set this value too small, it will limit the parallelism of the build job, as the build granularity is at the cuboid level.&lt;/p&gt;
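As a sketch, the setting is a single line in kylin.properties (the value 1 below is illustrative, not a recommendation):

```shell
# Persist up to N parent datasets during cubing (illustrative value).
# 0 disables caching; too small a value can limit build parallelism,
# because build granularity is per cuboid.
echo 'kylin.engine.spark.parent-dataset.max.persist.count=1' \
  >> $KYLIN_HOME/conf/kylin.properties
```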
+
+&lt;h2 id=&quot;practice-of-kylin-4-in-youzan&quot;&gt;04 Practice of Kylin 4 
in Youzan&lt;/h2&gt;
+&lt;p&gt;After introducing Youzan’s performance optimizations, let’s share their effect: Kylin 4’s practice in Youzan, covering the upgrade process and the performance of the online system.&lt;/p&gt;
+
+&lt;h3 id=&quot;upgrade-metadata-to-adapt-to-kylin-4&quot;&gt;Upgrade metadata 
to adapt to Kylin 4&lt;/h3&gt;
+&lt;p&gt;First of all, for Kylin 3 metadata, which is stored in HBase, we have developed a tool for seamless upgrading. We export the metadata from HBase into local files, then use the tool to transform it and write the new metadata into MySQL. We also updated the operation documents and general principles in the official Apache Kylin wiki. For more details, you can refer to: &lt;a href=&quot;https://wiki.apache.org/confluence/display/KYLIN/How+to+migrate+metadata+to+Kylin+4&quot;&gt;How to migrate metadata to Kylin 4&lt;/a&gt;.&lt;/p&gt;
+
+&lt;p&gt;Let’s give a general introduction to compatibility in the whole process. Project metadata, table metadata, permission-related metadata, and model metadata do not need to be modified. What needs to be modified is the cube metadata, namely the storage and query engine types used by the Cube. After updating these two fields, you need to recalculate the Cube signature; Kylin uses this signature internally to guard against inconsistencies after a Cube’s definition is finalized.&lt;/p&gt;
+
+&lt;h3 
id=&quot;performance-of-kylin-4-on-youzan-online-system&quot;&gt;Performance of 
Kylin 4 on Youzan online system&lt;/h3&gt;
+&lt;p&gt;&lt;img src=&quot;/images/blog/youzan/10 commodity_insight.png&quot; 
alt=&quot;&quot; /&gt;&lt;/p&gt;
+
+&lt;p&gt;After migrating the metadata to Kylin 4, let’s share the qualitative changes and substantial performance improvements in some promising scenarios. First of all, in a scenario like Commodity Insight, there is a large store with several hundred thousand commodities whose transactions, traffic, etc. we have to analyze. There are more than a dozen precise count-distinct measures in a single cube. A precise count-distinct measure is actually very inefficient if it is not optimized through pre-calculation and Bitmap; Kylin currently uses Bitmap to support it. In a scenario that requires complex queries to sort hundreds of thousands of commodities by various UVs (precise count-distinct measures), RT was reduced from 27 seconds on Kylin 2 to less than 2 seconds on Kylin 4.&lt;/p&gt;
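The bitmap idea behind the precise count-distinct measure can be sketched with plain integers as bitmaps (production systems such as Kylin use RoaringBitmap rather than raw integers): each user id sets one bit, pre-aggregated cells merge with a bitwise OR, and the distinct count is the population count, so raw rows are never rescanned at query time.

```python
# Sketch of precise count-distinct via bitmaps -- the idea behind
# Kylin's Bitmap measure, using a plain Python int as the bitmap.
# (Real systems use compressed structures like RoaringBitmap.)

def bitmap(ids):
    bm = 0
    for i in ids:
        bm |= 1 << i  # one bit per user id
    return bm

# Pre-computed per-day bitmaps for one commodity:
day1 = bitmap([1, 5, 9])
day2 = bitmap([5, 9, 42])

merged = day1 | day2         # merging pre-aggregates is a bitwise OR
uv = bin(merged).count("1")  # population count = exact distinct users
```

Unlike approximate sketches (e.g. HyperLogLog), the OR-then-popcount result is exact, which is why this supports precise count distinct.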
+
+&lt;p&gt;What I find most appealing about Kylin 4 is that it’s like a manual-transmission car: you can control its query concurrency at will, whereas in Kylin on HBase you can’t change query concurrency freely, because concurrency is completely tied to the number of regions.&lt;/p&gt;
+
+&lt;h3 id=&quot;plan-for-kylin-4-in-youzan&quot;&gt;Plan for Kylin 4 in 
Youzan&lt;/h3&gt;
+&lt;p&gt;We have tested thoroughly, fixed several bugs and improved Apache Kylin 4 over several months. Now we are migrating cubes from the older version to the newer one. For the cubes already migrated to Kylin 4, small-query performance meets our expectations, and complex-query and build performance have been a pleasant surprise. We plan to migrate all cubes from the older version to Kylin 4.&lt;/p&gt;
+</description>
+        <pubDate>Thu, 17 Jun 2021 08:00:00 -0700</pubDate>
+        
<link>http://kylin.apache.org/blog/2021/06/17/Why-did-Youzan-choose-Kylin4/</link>
+        <guid 
isPermaLink="true">http://kylin.apache.org/blog/2021/06/17/Why-did-Youzan-choose-Kylin4/</guid>
+        
+        
+        <category>blog</category>
+        
+      </item>
+    
+      <item>
         <title>有赞为什么选择 Kylin4</title>
         <description>&lt;p&gt;在 2021å¹´5月29日举办的 QCon å…
¨çƒè½¯ä»¶å¼€å‘者大会上,来自有赞的数据基础平台负责人 
郑生俊 在大数据开源框架与应用专题上分享了有赞内部对 
Kylin 4.0 的使用经历和优化实践,对于众多 Kylin 
老用户来说,这也是升级 Kylin 4 的实用攻略。&lt;/p&gt;
 
@@ -376,155 +885,6 @@ For example, a query joins two subquerie
       </item>
     
       <item>
-        <title>Why did Youzan choose Kylin4</title>
-        <description>&lt;p&gt;At the QCon Global Software Developers 
Conference held on May 29, 2021, Zheng Shengjun, head of Youzan’s data 
infrastructure platform, shared Youzan’s internal use experience and 
optimization practice of Kylin 4.0 on the meeting room of open source big data 
frameworks and applications. &lt;br /&gt;
-For many users of Kylin2/3(Kylin on HBase), this is also a chance to learn how 
and why to upgrade to Kylin 4.&lt;/p&gt;
-
-&lt;p&gt;This sharing is mainly divided into the following parts:&lt;/p&gt;
-
-&lt;ul&gt;
-  &lt;li&gt;The reason for choosing Kylin 4&lt;/li&gt;
-  &lt;li&gt;Introduction to Kylin 4&lt;/li&gt;
-  &lt;li&gt;How to optimize performance of Kylin 4&lt;/li&gt;
-  &lt;li&gt;Practice of Kylin 4 in Youzan&lt;/li&gt;
-&lt;/ul&gt;
-
-&lt;h2 id=&quot;the-reason-for-choosing-kylin-4&quot;&gt;01 The reason for 
choosing Kylin 4&lt;/h2&gt;
-
-&lt;h3 id=&quot;introduction-to-youzan&quot;&gt;Introduction to 
Youzan&lt;/h3&gt;
-&lt;p&gt;China Youzan Co., Ltd (stock code 08083.HK). is an enterprise mainly 
engaged in retail technology services.&lt;br /&gt;
-At present, it owns several tools and solutions to provide SaaS software 
products and talent services to help merchants operate mobile social e-commerce 
and new retail channels in an all-round way. &lt;br /&gt;
-Currently Youzan has hundreds of millions of consumers and 6 million existing 
merchants.&lt;/p&gt;
-
-&lt;h3 id=&quot;history-of-kylin-in-youzan&quot;&gt;History of Kylin in 
Youzan&lt;/h3&gt;
-&lt;p&gt;&lt;img src=&quot;/images/blog/youzan/1 
history_of_youzan_OLAP.png&quot; alt=&quot;&quot; /&gt;&lt;/p&gt;
-
-&lt;p&gt;First of all, I would like to share why Youzan chose to upgrade to 
Kylin 4. Here, let me briefly reviewed the history of Youzan OLAP 
infra.&lt;/p&gt;
-
-&lt;p&gt;In the early days of Youzan, in order to iterate develop process 
quickly, we chose the method of pre-computation + MySQL; in 2018, Druid was 
introduced because of query flexibility and development efficiency, but there 
were problems such as low pre-aggregation, not supporting precisely count 
distinct measure. In this situation, Youzan introduced Apache Kylin and 
ClickHouse. Kylin supports high aggregation, precisely count distinct measure 
and the lowest RT, while ClickHouse is quite flexible in usage(ad hoc 
query).&lt;/p&gt;
-
-&lt;p&gt;From the introduction of Kylin in 2018 to now, Youzan has used Kylin 
for more than three years. With the continuous enrichment of business scenarios 
and the continuous accumulation of data volume, Youzan currently has 6 million 
existing merchants, GMV in 2020 is 107.3 billion, and the daily build data 
volume is 10 billion +. At present, Kylin has basically covered all the 
business scenarios of Youzan.&lt;/p&gt;
-
-&lt;h3 id=&quot;the-challenges-of-kylin-3&quot;&gt;The challenges of Kylin 
3&lt;/h3&gt;
-&lt;p&gt;With Youzan’s rapid development and in-depth use of Kylin, we also 
encountered some challenges:&lt;/p&gt;
-
-&lt;ul&gt;
-  &lt;li&gt;First of all, the build performance of Kylin on HBase cannot meet 
the favorable expectations, and the build performance will affect the user’s 
failure recovery time and stability experience;&lt;/li&gt;
-  &lt;li&gt;Secondly, with the access of more large merchants (tens of 
millions of members in a single store, with hundreds of thousands of goods for 
each store), it also brings great challenges to our OLAP system. Kylin on HBase 
is limited by the single-point query of Query Server, and cannot support these 
complex scenarios well;&lt;/li&gt;
-  &lt;li&gt;Finally, because HBase is not a cloud-native system, it is 
difficult to achieve flexible scale up and scale down. With the continuous 
growth of data volume, this system has peaks and valleys for businesses, which 
results in the average resource utilization rate is not high enough.&lt;/li&gt;
-&lt;/ul&gt;
-
-&lt;p&gt;Faced with these challenges, Youzan chose to move closer and upgrade 
to the more cloud-native Apache Kylin 4.&lt;/p&gt;
-
-&lt;h2 id=&quot;introduction-to-kylin-4&quot;&gt;02 Introduction to Kylin 
4&lt;/h2&gt;
-&lt;p&gt;First of all, let’s introduce the main advantages of Kylin 4. 
Apache Kylin 4 completely depends on Spark for cubing job and query. It can 
make full use of Spark’s parallelization, quantization(向量化), and global 
dynamic code generation technologies to improve the efficiency of large 
queries.&lt;br /&gt;
-Here is a brief introduction to the principle of Kylin 4, that is storage 
engine, build engine and query engine.&lt;/p&gt;
-
-&lt;h3 id=&quot;storage-engine&quot;&gt;Storage engine&lt;/h3&gt;
-&lt;p&gt;&lt;img src=&quot;/images/blog/youzan/2 kylin4_storage.png&quot; 
alt=&quot;&quot; /&gt;&lt;/p&gt;
-
-&lt;p&gt;First of all, let’s take a look at the new storage engine, 
comparison between Kylin on HBase and Kylin on Parquet. The cuboid data of 
Kylin on HBase is stored in the table of HBase. Single Segment corresponds to 
one HBase table. Aggregation is pushed down to HBase coprocessor.&lt;/p&gt;
-
-&lt;p&gt;But as we know,  HBase is not a real Columnar Storage and its 
throughput is not enough for OLAP System. Kylin 4 replaces HBase with Parquet, 
all the data is stored in files. Each segment will have a corresponding HDFS 
directory. All queries and cubing jobs read and write files without HBase . 
Although there will be a certain loss of performance for simple queries, the 
improvement brought about by complex queries is more considerable and 
worthwhile.&lt;/p&gt;
-
-&lt;h3 id=&quot;build-engine&quot;&gt;Build engine&lt;/h3&gt;
-&lt;p&gt;&lt;img src=&quot;/images/blog/youzan/3 kylin4_build_engine.png&quot; 
alt=&quot;&quot; /&gt;&lt;/p&gt;
-
-&lt;p&gt;The second is the new build engine. Based on our test, the build 
speed of Kylin on Parquet has been optimized from 82 minutes to 15 minutes. 
There are several reasons:&lt;/p&gt;
-
-&lt;ul&gt;
-  &lt;li&gt;Kylin 4 removes the encoding of the dimension, eliminating a 
building step of encoding;&lt;/li&gt;
-  &lt;li&gt;Removed the HBase File generation step;&lt;/li&gt;
-  &lt;li&gt;Kylin on Parquet changes the granularity of cubing to cuboid 
level, which is conducive to further improving parallelism of cubing 
job.&lt;/li&gt;
-  &lt;li&gt;Enhanced implementation for global dictionary. In the new 
algorithm, dictionary and source data are hashed into the same buckets, making 
it possible for loading only piece of dictionary bucket to encode source 
data.&lt;/li&gt;
-&lt;/ul&gt;
-
-&lt;p&gt;As you can see on the right, after upgradation to Kylin 4, cubing job 
changes from ten steps to two steps, the performance improvement of the 
construction is very obvious.&lt;/p&gt;
-
-&lt;h3 id=&quot;query-engine&quot;&gt;Query engine&lt;/h3&gt;
-&lt;p&gt;&lt;img src=&quot;/images/blog/youzan/4 kylin4_query.png&quot; 
alt=&quot;&quot; /&gt;&lt;/p&gt;
-
-&lt;p&gt;Next is the new query engine of Kylin 4. As you can see, the 
calculation of Kylin on HBase is completely dependent on the coprocessor of 
HBase and query server process. When the data is read from HBase into query 
server to do aggregation, sorting, etc, the bottleneck will be restricted by 
the single point of query server. But Kylin 4 is converted to a fully 
distributed query mechanism based on Spark, what’s more, it ‘s able to do 
configuration tuning automatically in spark query step !&lt;/p&gt;
-
-&lt;h2 id=&quot;how-to-optimize-performance-of-kylin-4&quot;&gt;03 How to 
optimize performance of Kylin 4&lt;/h2&gt;
-&lt;p&gt;Next, I’d like to share some performance optimizations made by 
Youzan in Kylin 4.&lt;/p&gt;
-
-&lt;h3 id=&quot;optimization-of-query-engine&quot;&gt;Optimization of query 
engine&lt;/h3&gt;
-&lt;p&gt;#### 1.Cache Calcite physical plan&lt;br /&gt;
-&lt;img src=&quot;/images/blog/youzan/5 cache_calcite_plan.png&quot; 
alt=&quot;&quot; /&gt;&lt;/p&gt;
-
-&lt;p&gt;In Kylin4, SQL will be analyzed, optimized and do code generation in 
calcite. This step takes up about 150ms for some queries. We have supported 
PreparedStatementCache in Kylin4 to cache calcite plan, so that the structured 
SQL don’t have to do the same step again. With this optimization it saved 
about 150ms of time cost.&lt;/p&gt;
-
-&lt;h4 id=&quot;tunning-spark-configuration&quot;&gt;2.Tunning spark 
configuration&lt;/h4&gt;
-&lt;p&gt;&lt;img src=&quot;/images/blog/youzan/6 
tuning_spark_configuration.png&quot; alt=&quot;&quot; /&gt;&lt;/p&gt;
-
-&lt;p&gt;Kylin4 uses spark as query engine. As spark is a distributed engine 
designed for massive data processing, it’s inevitable to loose some 
performance for small queries. We have tried to do some tuning to catch up with 
the latency in Kylin on HBase for small queries.&lt;/p&gt;
-
-&lt;p&gt;Our first optimization is to make more calculations finish in memory. 
The key is to avoid data spill during aggregation, shuffle and sort. Tuning the 
following configuration is helpful.&lt;/p&gt;
-
-&lt;ul&gt;
-  &lt;li&gt;1.set &lt;code 
class=&quot;highlighter-rouge&quot;&gt;spark.sql.objectHashAggregate.sortBased.fallbackThreshold&lt;/code&gt;
 to larger value to avoid HashAggregate fall back to Sort Based Aggregate, 
which really kills performance when happens.&lt;/li&gt;
-  &lt;li&gt;2.set &lt;code 
class=&quot;highlighter-rouge&quot;&gt;spark.shuffle.spill.initialMemoryThreshold&lt;/code&gt;
 to a large value to avoid to many spills during shuffle.&lt;/li&gt;
-&lt;/ul&gt;
-
-&lt;p&gt;Secondly, we route small queries to Query Server which run spark in 
local mode. Because the overhead of task schedule, shuffle read and variable 
broadcast is enlarged for small queries on YARN/Standalone mode.&lt;/p&gt;
-
-&lt;p&gt;Thirdly, we use RAM disk to enhance shuffle performance. Mount RAM 
disk as TMPFS and set spark.local.dir to directory using RAM disk.&lt;/p&gt;
-
-&lt;p&gt;Lastly, we disabled spark’s whole stage code generation for small 
queries, for spark’s whole stage code generation will cost about 100ms~200ms, 
whereas it’s not beneficial to small queries which is a simple 
project.&lt;/p&gt;
-
-&lt;h4 id=&quot;parquet-optimization&quot;&gt;3.Parquet optimization&lt;/h4&gt;
-&lt;p&gt;&lt;img src=&quot;/images/blog/youzan/7 
parquet_optimization.png&quot; alt=&quot;&quot; /&gt;&lt;/p&gt;
-
-&lt;p&gt;Optimizing parquet is also important for queries.&lt;/p&gt;
-
-&lt;p&gt;The first principal is that we’d better always include shard by 
column in our filter condition, for parquet files are shard by shard-by-column, 
filter using shard by column reduces the data files to read.&lt;/p&gt;
-
-&lt;p&gt;Then look into parquet files, data within files are sorted by rowkey 
columns, that is to say, prefix match in query is as important as Kylin on 
HBase. When a query condition satisfies prefix match, it can filter row groups 
with column’s max/min index. Furthermore, we can reduce row group size to 
make finer index granularity, but be aware that the compression rate will be 
lower if we set row group size smaller.&lt;/p&gt;
-
-&lt;h4 
id=&quot;dynamic-elimination-of-partitioning-dimensions&quot;&gt;4.Dynamic 
elimination of partitioning dimensions&lt;/h4&gt;
-&lt;p&gt;Kylin4 have a new ability that the older version is not capable of, 
which is able to reduce dozens of times of data reading and computing for some 
big queries. It’s offen the case that partition column is used to filter data 
but not used as group dimension. For those cases Kylin would always choose 
cuboid with partition column, but now it is able to use different cuboid in 
that query to reduce IO read and computing.&lt;/p&gt;
-
-&lt;p&gt;The key of this optimization is to split a query into two parts, one 
of the part uses all segment’s data so that partition column doesn’t have 
to be included in cuboid, the other part that uses part of segments data will 
choose cuboid with partition dimension to do the data filter.&lt;/p&gt;
-
-&lt;p&gt;We have tested that in some situations the response time reduced from 
20s to 6s, 10s to 3s.&lt;/p&gt;
-
-&lt;p&gt;&lt;img src=&quot;/images/blog/youzan/8 
Dynamic_elimination_of_partitioning_dimensions.png&quot; alt=&quot;&quot; 
/&gt;&lt;/p&gt;
-
-&lt;h3 id=&quot;optimization-of-build-engine&quot;&gt;Optimization of build 
engine&lt;/h3&gt;
-&lt;p&gt;#### 1.cache parent dataset&lt;br /&gt;
-&lt;img src=&quot;/images/blog/youzan/9 cache_parent_dataset.png&quot; 
alt=&quot;&quot; /&gt;&lt;/p&gt;
-
-&lt;p&gt;Kylin build cube layer by layer. For a parent layer with multi 
cuboids to build, we can choose to cache parent dataset by setting 
kylin.engine.spark.parent-dataset.max.persist.count to a number greater than 0. 
But notice that if you set this value too small, it will affect the parallelism 
of build job, as the build granularity is at cuboid level.&lt;/p&gt;
-
-&lt;h2 id=&quot;practice-of-kylin-4-in-youzan&quot;&gt;04 Practice of Kylin 4 
in Youzan&lt;/h2&gt;
-&lt;p&gt;After introducing Youzan’s experience of performance optimization, 
let’s share the optimization effect. That is, Kylin 4’s practice in Youzan 
includes the upgrade process and the performance of online system.&lt;/p&gt;
-
-&lt;h3 id=&quot;upgrade-metadata-to-adapt-to-kylin-4&quot;&gt;Upgrade metadata 
to adapt to Kylin 4&lt;/h3&gt;
-&lt;p&gt;First of all, for metadata for Kylin 3 which stored on HBase, we have 
developed a tool for seamless upgrading of metadata. First of all, our metadata 
in Kylin on HBase is stored in HBase. We export the metadata in HBase into 
local files, and then use tools to transform and write back the new metadata 
into MySQL. We also updated the operation documents and general principles in 
the official wiki of Apache Kylin. For more details, you can refer to: &lt;a 
href=&quot;https://wiki.apache.org/confluence/display/KYLIN/How+to+migrate+metadata+to+Kylin+4&quot;&gt;How
 to migrate metadata to Kylin 4&lt;/a&gt;.&lt;/p&gt;
-
-&lt;p&gt;Let’s give a general introduction to some compatibility in the 
whole process. The project metadata, tables metadata, permission-related 
metadata, and model metadata do not need be modified. What needs to be modified 
is the cube metadata, including the type of storage and query used by Cube. 
After updating these two fields, you need to recalculate the Cube signature. 
The function of this signature is designed internally by Kylin to avoid some 
problems caused by Cube after Cube is determined.&lt;/p&gt;
-
-&lt;h3 
id=&quot;performance-of-kylin-4-on-youzan-online-system&quot;&gt;Performance of 
Kylin 4 on Youzan online system&lt;/h3&gt;
-&lt;p&gt;&lt;img src=&quot;/images/blog/youzan/10 commodity_insight.png&quot; 
alt=&quot;&quot; /&gt;&lt;/p&gt;
-
-&lt;p&gt;After the migration of metadata to Kylin4, let’s share the 
qualitative changes and substantial performance improvements brought about by 
some of the promising scenarios. First of all, in a scenario like Commodity 
Insight, there is a large store with several hundred thousand of commodities. 
We have to analyze its transactions and traffic, etc. There are more than a 
dozen precise precisely count distinct measures in single cube. Precisely count 
distinct measure is actually very inefficient if it is not optimized through 
pre-calculation and Bitmap. Kylin currently uses Bitmap to support precisely 
count distinct measure. In a scene that requires complex queries to sort 
hundreds of thousands of commodities in various UV(precisely count distinct 
measure), the RT of Kylin 2 is 27 seconds, while the RT of Kylin 4 is reduced 
from 27 seconds to less than 2 seconds.&lt;/p&gt;
-
-&lt;p&gt;What I find most appealing to me about Kylin 4 is that it’s like a 
manual transmission car, you can control its query concurrency at your will, 
whereas you can’t change query concurrency in Kylin on HBase freely, because 
its concurrency is completely tied to the number of regions.&lt;/p&gt;
-
-&lt;h3 id=&quot;plan-for-kylin-4-in-youzan&quot;&gt;Plan for Kylin 4 in 
Youzan&lt;/h3&gt;
-&lt;p&gt;We have made full test, fixed several bugs and improved apache KYLIN4 
for several months. Now we are migrating cubes from older version to newer 
version. For the cubes already migrated to KYLIN4, its small queries’ 
performance meet our expectations, its complex query and build performance did 
bring us a big surprise. We are planning to migrate all cubes from older 
version to Kylin4.&lt;/p&gt;
-</description>
-        <pubDate>Thu, 17 Jun 2021 08:00:00 -0700</pubDate>
-        
<link>http://kylin.apache.org/blog/2021/06/17/Why-did-Youzan-choose-Kylin4/</link>
-        <guid 
isPermaLink="true">http://kylin.apache.org/blog/2021/06/17/Why-did-Youzan-choose-Kylin4/</guid>
-        
-        
-        <category>blog</category>
-        
-      </item>
-    
-      <item>
         <title>你离可视化酷炫大屏只差一套 Kylin + Davinci</title>
         <description>&lt;p&gt;Kylin 提供与 BI 工具的整合能力,如 
Tableau,PowerBI/Excel,MSTR,QlikSense,Hue 和 
SuperSet。但就可视化工具而言,Davinci 
良好的交互性和个性化的可视化大屏展现效果,使其与 Kylin 
的结合能让大部分用户有更好的可视化分析体验。&lt;/p&gt;
 
@@ -1394,304 +1754,6 @@ if (assignments.getPartitionsByReplicaSe
         
         
         <category>blog</category>
-        
-      </item>
-    
-      <item>
-        <title>Use Python for Data Science with Apache Kylin</title>
-        <description>&lt;p&gt;Original from &lt;a 
href=&quot;https://kyligence.io/blog/use-python-for-data-science-with-apache-kylin/&quot;&gt;Kyligence
 tech blog&lt;/a&gt;&lt;/p&gt;
-
-&lt;p&gt;In today’s world, Big Data, data science, and machine learning 
analytics and are not only hot topics, they’re also an essential part of our 
society. Data is everywhere, and the amount of digital data that exists is 
growing at a rapid rate. According to &lt;a 
href=&quot;https://www.forbes.com/sites/tomcoughlin/2018/11/27/175-zettabytes-by-2025/#622d803d5459&quot;&gt;Forbes&lt;/a&gt;,
 around 175 Zettabytes of data will be generated annually by 2025.&lt;/p&gt;
-
-&lt;p&gt;The economy, healthcare, agriculture, energy, media, education and 
all other critical human activities rely more and more on the advanced 
processing and analysis of large quantities of collected data. However, these 
massive datasets pose a real challenge to data analytics, data mining, machine 
learning and data science.&lt;/p&gt;
-
-&lt;p&gt;Data scientists and analysts have often expressed frustration while 
trying to work with Big Data. The good news is that there is a solution: Apache 
Kylin. Kylin solves this Big Data dilemma by integrating with Python to help 
analysts &amp;amp; data scientists finally gain unfettered access to their 
large-scale (terabyte and petabyte) datasets.&lt;/p&gt;
-
-&lt;h2 id=&quot;machine-learning-challenges&quot;&gt;Machine Learning 
Challenges&lt;/h2&gt;
-
-&lt;p&gt;One of the main challenges machine learning (ML) engineers and data 
scientists encounter when running computations with Big Data comes from the 
principle that higher volume or scale equates to greater computational 
complexity.&lt;/p&gt;
-
-&lt;p&gt;Consequently, as datasets scale up, even trivial operations can 
become costly. Moreover, as data volume rises, algorithm performance becomes 
increasingly dependent on the architecture used to store and move data. 
Parallel data structures, data partitioning and placement, and data reuse 
become more important as the amount of data one is working with grows.&lt;/p&gt;
-
-&lt;h2 id=&quot;what-apache-kylin-is-and-how-it-helps&quot;&gt;What Apache 
Kylin Is and How It Helps&lt;/h2&gt;
-
-&lt;p&gt;Apache Kylin is an open source distributed Big Data analytics engine 
designed to provide a SQL interface for multi-dimensional analysis (MOLAP) on 
Hadoop. It allows enterprises to rapidly analyze their massive datasets in a 
fraction of the time it would take using other approaches or Big Data analytics 
tools.&lt;/p&gt;
-
-&lt;p&gt;With Apache Kylin, data teams are able to dramatically cut down on 
analytics processing time and associated IT and ops costs. It’s able to do 
this by pre-computing large datasets into one (or a small number of) 
OLAP cubes and storing them in a columnar database. This allows ML engineers, 
data scientists, and analysts to quickly access the data and perform data 
mining activities to uncover hidden trends easily.&lt;/p&gt;
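To make the pre-computation idea concrete, here is a toy sketch in plain Python. It is not Kylin's implementation, and the table, dimension, and measure names are invented for illustration; it only shows why a pre-aggregated "cuboid" turns a group-by query into a constant-time lookup.

```python
from collections import defaultdict

# Hypothetical raw fact rows (made-up data, not a real Kylin table).
rows = [
    {"region": "NA", "year": 2020, "sales": 10},
    {"region": "NA", "year": 2021, "sales": 15},
    {"region": "EU", "year": 2020, "sales": 7},
]

# Pre-compute the (region, year) cuboid once, up front.
cuboid = defaultdict(int)
for r in rows:
    cuboid[(r["region"], r["year"])] += r["sales"]

# A later "query" is now a dictionary lookup, not a scan of raw rows.
print(cuboid[("NA", 2020)])  # -> 10
```

Kylin does this at cluster scale for many dimension combinations at once, but the trade-off is the same: extra storage and build time in exchange for sub-second query latency.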
-
-&lt;p&gt;The following diagram illustrates how machine learning and data 
science activities on big data become much easier when Apache Kylin is 
introduced.&lt;/p&gt;
-
-&lt;p&gt;&lt;img src=&quot;/images/blog/python-data-science/diagram1.png&quot; 
alt=&quot;diagram1&quot; /&gt;&lt;/p&gt;
-
-&lt;h2 id=&quot;how-to-integrate-python-with-apache-kylin&quot;&gt;How to 
Integrate Python with Apache Kylin&lt;/h2&gt;
-
-&lt;p&gt;Python has quickly risen in prominence to take its spot as one of the 
leading programming languages in the data analytics field (as well as outside 
the field). With its ease of use and extensive collection of libraries, Python 
has become well-positioned to take on Big Data.&lt;/p&gt;
-
-&lt;p&gt;Python also provides plenty of data mining tools to assist in the 
handling of data, offering up a variety of applications already adopted by the 
machine learning and data science communities. Simply put, if you’re working 
with Big Data, there’s probably a way Python can make your job 
easier.&lt;/p&gt;
-
-&lt;p&gt;Apache Kylin can be easily integrated with Python with support from 
&lt;a 
href=&quot;https://github.com/Kyligence/kylinpy&quot;&gt;Kylinpy&lt;/a&gt;. 
Kylinpy is a python library that provides a SQLAlchemy Dialect implementation. 
Thus, any application that uses SQLAlchemy can now query Kylin OLAP cubes. 
Additionally, it also allows users to access data via Pandas data 
frames.&lt;/p&gt;
-
-&lt;p&gt;&lt;strong&gt;Sample code to access data via 
Pandas:&lt;/strong&gt;&lt;/p&gt;
-
-&lt;div class=&quot;highlighter-rouge&quot;&gt;&lt;pre 
class=&quot;highlight&quot;&gt;&lt;code&gt;$ python
-
- &amp;gt;&amp;gt;&amp;gt; import sqlalchemy as sa
- &amp;gt;&amp;gt;&amp;gt; import pandas as pd
- &amp;gt;&amp;gt;&amp;gt; kylin_engine = 
sa.create_engine(&#39;kylin://&amp;lt;username&amp;gt;:&amp;lt;password&amp;gt;@&amp;lt;IP&amp;gt;:&amp;lt;PORT&amp;gt;/&amp;lt;project_name&amp;gt;&#39;,
- ...     connect_args={&#39;is_ssl&#39;: True, &#39;timeout&#39;: 60})
- &amp;gt;&amp;gt;&amp;gt; sql = &#39;select * from kylin_sales limit 10&#39;
- &amp;gt;&amp;gt;&amp;gt; dataframe = pd.read_sql(sql, kylin_engine)
- &amp;gt;&amp;gt;&amp;gt; print(dataframe)
-&lt;/code&gt;&lt;/pre&gt;
-&lt;/div&gt;
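The `kylin://` connection string in the sample above follows the standard SQLAlchemy URL shape. The sketch below only assembles and parses such a URL with the standard library; the credentials, host, and project name are placeholders, and nothing here contacts a server.

```python
from urllib.parse import urlsplit

# Placeholder values for illustration; substitute your own deployment's
# username, password, host, port, and Kylin project name.
url = "kylin://{user}:{pwd}@{host}:{port}/{project}".format(
    user="ADMIN", pwd="KYLIN", host="localhost", port=7070,
    project="learn_kylin")

parts = urlsplit(url)
print(parts.scheme)    # -> kylin
print(parts.hostname)  # -> localhost
print(parts.path)      # -> /learn_kylin
```

Because Kylinpy registers `kylin` as a SQLAlchemy dialect, this one URL is all that distinguishes a Kylin engine from any other SQLAlchemy engine passed to `pd.read_sql`.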
-
-&lt;p&gt;&lt;strong&gt;Benefits of using Apache Kylin as Data 
Source:&lt;/strong&gt;&lt;/p&gt;
-

[... 249 lines stripped ...]
