Modified: kylin/site/feed.xml
URL: 
http://svn.apache.org/viewvc/kylin/site/feed.xml?rev=1899035&r1=1899034&r2=1899035&view=diff
==============================================================================
--- kylin/site/feed.xml (original)
+++ kylin/site/feed.xml Fri Mar 18 14:13:30 2022
@@ -19,11 +19,739 @@
     <description>Apache Kylin Home</description>
     <link>http://kylin.apache.org/</link>
     <atom:link href="http://kylin.apache.org/feed.xml" rel="self" 
type="application/rss+xml"/>
-    <pubDate>Thu, 10 Mar 2022 20:07:16 -0800</pubDate>
-    <lastBuildDate>Thu, 10 Mar 2022 20:07:16 -0800</lastBuildDate>
+    <pubDate>Fri, 18 Mar 2022 06:59:44 -0700</pubDate>
+    <lastBuildDate>Fri, 18 Mar 2022 06:59:44 -0700</lastBuildDate>
     <generator>Jekyll v2.5.3</generator>
     
       <item>
+        <title>Done! Kylin 4 now supports AWS Glue Catalog</title>
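+        <description>&lt;h2 id=&quot;emr--kylin--glue-&quot;&gt;Why does deploying Kylin on EMR require Glue support?&lt;/h2&gt;
+
+&lt;h3 id=&quot;aws-glue&quot;&gt;What is AWS Glue?&lt;/h3&gt;
+
+&lt;p&gt;AWS Glue is a fully managed ETL (extract, transform, and load) service 
that lets AWS users classify, cleanse, and enrich data easily and 
cost-effectively, and move it reliably between data stores. AWS Glue consists 
of a central metadata repository called the AWS Glue Data Catalog, an ETL 
engine that generates code automatically, and a flexible scheduler that handles 
dependency resolution, job monitoring, and retries. AWS Glue is serverless, so 
there is no infrastructure to set up or manage.&lt;/p&gt;
+
+&lt;h3 id=&quot;kylin--aws-glue-catalog&quot;&gt;Why does Kylin need to support 
AWS Glue Catalog?&lt;/h3&gt;
+
+&lt;p&gt;Many Kylin users in the community run AWS EMR, typically with Hadoop, 
Spark, Hive, and Presto. Without the AWS Glue Data Catalog, a table created in 
one data warehouse component (Hive, Spark, or Presto) is not visible to the 
others and so cannot be used by them. Because a company's underlying data 
warehouse serves many business departments, the EMR cluster can be created 
with the AWS Glue Data Catalog as its metadata store. The data sources are then 
shared across components and departments, and the data from every department 
can be built into one large data cube that responds quickly to fast-growing 
business needs.&lt;br /&gt;
+Modern companies build their data platforms on the cloud, and big data teams 
use AWS EMR for data processing, data analysis, and model training. As data 
volumes surge, extracting data becomes slow and difficult, and EMR with Spark 
or Hive struggles to give data analysts, operations staff, and sales teams the 
fast queries they need. Some users therefore chose Apache Kylin as their 
open-source OLAP solution.&lt;br /&gt;
+Recently, community users told us that Kylin 4 could not yet read table 
metadata from Glue. We worked with them to investigate and fix the problem, so 
Kylin 4 now supports AWS Glue Catalog. Tables and data can be shared among 
Hive, Presto, Spark, and Kylin, which links the subject areas into one data 
analysis platform and breaks down the metadata barrier.&lt;/p&gt;
+
+&lt;h3 id=&quot;apache-kylin--aws-glue-&quot;&gt;Does Apache Kylin support AWS 
Glue?&lt;/h3&gt;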
+
+&lt;table&gt;
+  &lt;thead&gt;
+    &lt;tr&gt;
+      &lt;th&gt; &lt;/th&gt;
+      &lt;th&gt;Kylin versions that support Glue&lt;/th&gt;
+      &lt;th&gt;Issue Link&lt;/th&gt;
+    &lt;/tr&gt;
+  &lt;/thead&gt;
+  &lt;tbody&gt;
+    &lt;tr&gt;
+      &lt;td&gt;Kylin on HBase (Before Kylin 4)&lt;/td&gt;
+      &lt;td&gt;2.6.6 or higher&lt;br /&gt; 3.1.0 or higher&lt;/td&gt;
+      &lt;td&gt;https://issues.apache.org/jira/browse/KYLIN-4206&lt;br 
/&gt;https://zhuanlan.zhihu.com/p/99481373&lt;/td&gt;
+    &lt;/tr&gt;
+    &lt;tr&gt;
+      &lt;td&gt;Kylin on Parquet&lt;/td&gt;
+      &lt;td&gt;4.0.1 or higher&lt;/td&gt;
+      &lt;td&gt;This article.&lt;/td&gt;
+    &lt;/tr&gt;
+  &lt;/tbody&gt;
+&lt;/table&gt;
+
+&lt;h2 id=&quot;section&quot;&gt;Prerequisites for deployment&lt;/h2&gt;
+
+&lt;h3 id=&quot;section-1&quot;&gt;Software versions&lt;/h3&gt;
+
+&lt;table&gt;
+  &lt;thead&gt;
+    &lt;tr&gt;
+      &lt;th&gt;&lt;strong&gt;Software&lt;/strong&gt;&lt;/th&gt;
+      &lt;th&gt;&lt;strong&gt;Version&lt;/strong&gt;&lt;/th&gt;
+      &lt;th&gt;Reference&lt;/th&gt;
+    &lt;/tr&gt;
+  &lt;/thead&gt;
+  &lt;tbody&gt;
+    &lt;tr&gt;
+      &lt;td&gt;Apache Kylin&lt;/td&gt;
+      &lt;td&gt;4.0.1 or higher&lt;/td&gt;
+      &lt;td&gt;Must be 4.0.1 or higher. For details, see &lt;a 
href=&quot;https://cwiki.apache.org/confluence/display/KYLIN/KIP+10+refactor+hive+and+hadoop+dependency&quot;&gt;KIP
 10 refactor hive and hadoop dependency&lt;/a&gt;.&lt;/td&gt;
+    &lt;/tr&gt;
+    &lt;tr&gt;
+      &lt;td&gt;AWS EMR&lt;/td&gt;
+      &lt;td&gt;6.5.0 or higher&lt;br /&gt;5.33.1 or higher&lt;/td&gt;
+      &lt;td&gt;Covers the recent releases of EMR 6 and EMR 5. See &lt;a 
href=&quot;https://docs.amazonaws.cn/en_us/emr/latest/ReleaseGuide/emr-650-release.html&quot;&gt;Amazon
 EMR release 6.5.0 - Amazon EMR&lt;/a&gt;.&lt;/td&gt;
+    &lt;/tr&gt;
+  &lt;/tbody&gt;
+&lt;/table&gt;
+
+&lt;h3 id=&quot;glue-&quot;&gt;Prepare the Glue database and tables&lt;/h3&gt;
+
+&lt;p&gt;&lt;img 
src=&quot;/images/blog/kylin4_support_aws_glue/1_prepare_aws_glue_table_en.png&quot;
 alt=&quot;&quot; /&gt;&lt;/p&gt;
+
+&lt;p&gt;&lt;img 
src=&quot;/images/blog/kylin4_support_aws_glue/2_prepare_aws_glue_table_en.png&quot;
 alt=&quot;&quot; /&gt;&lt;/p&gt;
+
+&lt;ul&gt;
+  &lt;li&gt;Create an AWS EMR cluster.&lt;/li&gt;
+&lt;/ul&gt;
+
+&lt;p&gt;Start an EMR cluster. Note that setting &lt;code 
class=&quot;highlighter-rouge&quot;&gt;hive.metastore.client.factory.class&lt;/code&gt;
 enables Glue as the external metastore. The following command can serve as a 
reference.&lt;/p&gt;
+
+&lt;div class=&quot;highlighter-rouge&quot;&gt;&lt;pre 
class=&quot;highlight&quot;&gt;&lt;code&gt;aws emr create-cluster 
--applications &lt;span class=&quot;nv&quot;&gt;Name&lt;/span&gt;&lt;span 
class=&quot;o&quot;&gt;=&lt;/span&gt;Hadoop &lt;span 
class=&quot;nv&quot;&gt;Name&lt;/span&gt;&lt;span 
class=&quot;o&quot;&gt;=&lt;/span&gt;Hive &lt;span 
class=&quot;nv&quot;&gt;Name&lt;/span&gt;&lt;span 
class=&quot;o&quot;&gt;=&lt;/span&gt;Spark &lt;span 
class=&quot;nv&quot;&gt;Name&lt;/span&gt;&lt;span 
class=&quot;o&quot;&gt;=&lt;/span&gt;ZooKeeper &lt;span 
class=&quot;nv&quot;&gt;Name&lt;/span&gt;&lt;span 
class=&quot;o&quot;&gt;=&lt;/span&gt;Tez &lt;span 
class=&quot;nv&quot;&gt;Name&lt;/span&gt;&lt;span 
class=&quot;o&quot;&gt;=&lt;/span&gt;Ganglia &lt;span 
class=&quot;se&quot;&gt;\&lt;/span&gt;
+  --ec2-attributes &lt;span class=&quot;k&quot;&gt;${}&lt;/span&gt; &lt;span 
class=&quot;se&quot;&gt;\&lt;/span&gt;
+  --release-label emr-6.5.0 &lt;span class=&quot;se&quot;&gt;\&lt;/span&gt;
+  --log-uri &lt;span class=&quot;k&quot;&gt;${}&lt;/span&gt; &lt;span 
class=&quot;se&quot;&gt;\&lt;/span&gt;
+  --instance-groups &lt;span class=&quot;k&quot;&gt;${}&lt;/span&gt; &lt;span 
class=&quot;se&quot;&gt;\&lt;/span&gt;
+  --configurations &lt;span 
class=&quot;s1&quot;&gt;&#39;[{&quot;Classification&quot;:&quot;hive-site&quot;,&quot;Properties&quot;:{&quot;hive.metastore.client.factory.class&quot;:&quot;com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory&quot;}}]&#39;&lt;/span&gt;
 &lt;span class=&quot;se&quot;&gt;\&lt;/span&gt;
+  --auto-scaling-role EMR_AutoScaling_DefaultRole &lt;span 
class=&quot;se&quot;&gt;\&lt;/span&gt;
+  --ebs-root-volume-size 100 &lt;span class=&quot;se&quot;&gt;\&lt;/span&gt;
+  --service-role EMR_DefaultRole &lt;span 
class=&quot;se&quot;&gt;\&lt;/span&gt;
+  --enable-debugging &lt;span class=&quot;se&quot;&gt;\&lt;/span&gt;
+  --name &lt;span 
class=&quot;s1&quot;&gt;&#39;Kylin4_on_EMR65_with_Glue&#39;&lt;/span&gt; 
&lt;span class=&quot;se&quot;&gt;\&lt;/span&gt;
+  --region cn-northwest-1
+&lt;/code&gt;&lt;/pre&gt;
+&lt;/div&gt;
+
+&lt;ul&gt;
+  &lt;li&gt;Log in to the Master node, then check the Hadoop version and 
whether the Hadoop cluster started successfully.&lt;/li&gt;
+&lt;/ul&gt;
+
+&lt;p&gt;&lt;img 
src=&quot;/images/blog/kylin4_support_aws_glue/3_prepare_hadoop_cluster_en.png&quot;
 alt=&quot;&quot; /&gt;&lt;/p&gt;
+
+&lt;p&gt;&lt;img 
src=&quot;/images/blog/kylin4_support_aws_glue/4_prepare_hadoop_cluster_en.png&quot;
 alt=&quot;&quot; /&gt;&lt;/p&gt;
+
+&lt;h3 id=&quot;optional&quot;&gt;Get environment information (Optional)&lt;/h3&gt;
+
+&lt;blockquote&gt;
+  &lt;p&gt;If you use RDS or another metadata store, skip this step as 
appropriate.&lt;/p&gt;
+&lt;/blockquote&gt;
+
+&lt;p&gt;Kylin 4.x recommends an RDBMS as its metadata store. For testing 
purposes, this article uses the MariaDB instance that ships with the Master 
node. The MariaDB hostname, account, and password can be found in &lt;code 
class=&quot;highlighter-rouge&quot;&gt;/etc/hive/conf/hive-site.xml&lt;/code&gt;.&lt;/p&gt;
+
+&lt;div class=&quot;highlighter-rouge&quot;&gt;&lt;pre 
class=&quot;highlight&quot;&gt;&lt;code&gt;kylin.metadata.url&lt;span 
class=&quot;o&quot;&gt;=&lt;/span&gt;kylin4_on_cloud@jdbc,url&lt;span 
class=&quot;o&quot;&gt;=&lt;/span&gt;jdbc:mysql://&lt;span 
class=&quot;k&quot;&gt;${&lt;/span&gt;&lt;span 
class=&quot;nv&quot;&gt;HOSTNAME&lt;/span&gt;&lt;span 
class=&quot;k&quot;&gt;}&lt;/span&gt;:3306/hue,username&lt;span 
class=&quot;o&quot;&gt;=&lt;/span&gt;hive,password&lt;span 
class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span 
class=&quot;k&quot;&gt;${&lt;/span&gt;&lt;span 
class=&quot;nv&quot;&gt;PASSWORD&lt;/span&gt;&lt;span 
class=&quot;k&quot;&gt;}&lt;/span&gt;,maxActive&lt;span 
class=&quot;o&quot;&gt;=&lt;/span&gt;10,maxIdle&lt;span 
class=&quot;o&quot;&gt;=&lt;/span&gt;10,driverClassName&lt;span 
class=&quot;o&quot;&gt;=&lt;/span&gt;org.mariadb.jdbc.Driver  
+kylin.env.zookeeper-connect-string&lt;span 
class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span 
class=&quot;k&quot;&gt;${&lt;/span&gt;&lt;span 
class=&quot;nv&quot;&gt;HOSTNAME&lt;/span&gt;&lt;span 
class=&quot;k&quot;&gt;}&lt;/span&gt;
+&lt;/code&gt;&lt;/pre&gt;
+&lt;/div&gt;
+
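+&lt;p&gt;With this information, replace the variables in the Kylin properties 
above, such as &lt;code 
class=&quot;highlighter-rouge&quot;&gt;${PASSWORD}&lt;/code&gt;, and save the result 
locally. It is used when starting the Kylin process in a later step.&lt;/p&gt;
+
+&lt;h3 id=&quot;spark-sql--aws-glue-&quot;&gt;Test the connectivity between Spark 
SQL and AWS Glue&lt;/h3&gt;
+
+&lt;p&gt;Use spark-sql to test whether Spark SQL on EMR can read database and 
table metadata through Glue. The first attempt fails at startup with an 
error.&lt;/p&gt;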
+
+&lt;p&gt;&lt;img 
src=&quot;/images/blog/kylin4_support_aws_glue/5_test_sparksql_glue_en.png&quot;
 alt=&quot;&quot; /&gt;&lt;/p&gt;
+
+&lt;p&gt;Replace the &lt;code 
class=&quot;highlighter-rouge&quot;&gt;hive-site.xml&lt;/code&gt; used by Spark with 
the following commands.&lt;/p&gt;
+
+&lt;div class=&quot;highlighter-rouge&quot;&gt;&lt;pre 
class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span 
class=&quot;nb&quot;&gt;cd&lt;/span&gt; /etc/spark/conf
+sudo mv hive-site.xml hive-site.xml.bak
+sudo cp /etc/hive/conf/hive-site.xml .
+&lt;/code&gt;&lt;/pre&gt;
+&lt;/div&gt;
+
+&lt;p&gt;Then change the value of &lt;code 
class=&quot;highlighter-rouge&quot;&gt;hive.execution.engine&lt;/code&gt; in 
&lt;code 
class=&quot;highlighter-rouge&quot;&gt;/etc/spark/conf/hive-site.xml&lt;/code&gt;
 to &lt;code class=&quot;highlighter-rouge&quot;&gt;mr&lt;/code&gt;, start the 
Spark-SQL CLI again, and verify that a query against the Glue table data 
succeeds.&lt;/p&gt;
+
+&lt;p&gt;&lt;img 
src=&quot;/images/blog/kylin4_support_aws_glue/6_test_sparksql_glue_en.png&quot;
 alt=&quot;&quot; /&gt;&lt;/p&gt;
+
+&lt;p&gt;&lt;img 
src=&quot;/images/blog/kylin4_support_aws_glue/7_test_sparksql_glue_en.png&quot;
 alt=&quot;&quot; /&gt;&lt;/p&gt;
+
+&lt;h3 id=&quot;kylin-spark-enginejaroptional&quot;&gt;Prepare 
kylin-spark-engine.jar (Optional)&lt;/h3&gt;
+
+&lt;blockquote&gt;
+  &lt;p&gt;If Apache Kylin 4.0.2 has been released, it should already contain 
the fix and you can skip this step. Otherwise, follow the steps below to 
replace &lt;code 
class=&quot;highlighter-rouge&quot;&gt;kylin-spark-engine.jar&lt;/code&gt;:&lt;/p&gt;
+&lt;/blockquote&gt;
+
+&lt;p&gt;Using the commands below, clone the kylin repository and run &lt;code 
class=&quot;highlighter-rouge&quot;&gt;mvn clean package 
-DskipTests&lt;/code&gt; to produce &lt;code 
class=&quot;highlighter-rouge&quot;&gt;kylin-spark-project/kylin-spark-engine/target/kylin-spark-engine-4.0.0-SNAPSHOT.jar&lt;/code&gt;.&lt;/p&gt;
+
+&lt;div class=&quot;highlighter-rouge&quot;&gt;&lt;pre 
class=&quot;highlight&quot;&gt;&lt;code&gt;git clone 
https://github.com/hit-lacus/kylin.git
+&lt;span class=&quot;nb&quot;&gt;cd &lt;/span&gt;kylin
+git checkout KYLIN-5160
+mvn clean package -DskipTests
+
+&lt;span class=&quot;c&quot;&gt;# find kylin-spark-project/kylin-spark-engine/target -name kylin-spark-engine-4.0.0-SNAPSHOT.jar&lt;/span&gt;
+&lt;/code&gt;&lt;/pre&gt;
+&lt;/div&gt;
+
+&lt;p&gt;Patch link: &lt;a 
href=&quot;https://github.com/apache/kylin/pull/1819&quot;&gt;https://github.com/apache/kylin/pull/1819&lt;/a&gt;&lt;/p&gt;
+
+&lt;h2 id=&quot;kylin--glue&quot;&gt;Deploy Kylin and connect to Glue&lt;/h2&gt;
+
+&lt;h3 id=&quot;kylin&quot;&gt;Download Kylin&lt;/h3&gt;
+
+&lt;ol&gt;
+  &lt;li&gt;
+    &lt;p&gt;Download and unpack Kylin. Choose the Kylin package that matches 
your EMR version: EMR 5.x uses the spark2 package and EMR 6.x uses the spark3 
package.&lt;br /&gt;
+ &lt;code class=&quot;highlighter-rouge&quot;&gt;shell
+ # aws s3 cp s3://${BUCKET}/apache-kylin-4.0.1-bin-spark3.tar.gz .
+ # wget apache-kylin-4.0.1-bin-spark3.tar.gz
+ tar zxvf apache-kylin-4.0.1-bin-spark3.tar.gz
+ cd apache-kylin-4.0.1-bin-spark3
+ export KYLIN_HOME=/home/hadoop/apache-kylin-4.0.1-bin-spark3
+&lt;/code&gt;&lt;/p&gt;
+  &lt;/li&gt;
+  &lt;li&gt;
+    &lt;p&gt;Get the RDBMS driver jar (Optional)&lt;/p&gt;
+
+    &lt;blockquote&gt;
+      &lt;p&gt;If you use a different RDBMS as the metadata store, skip this 
step.&lt;/p&gt;
+    &lt;/blockquote&gt;
+
+    &lt;p&gt;&lt;code class=&quot;highlighter-rouge&quot;&gt;shell
+ cd $KYLIN_HOME
+ mkdir ext
+ cp /usr/lib/hive/lib/mariadb-connector-java.jar $KYLIN_HOME/ext
+&lt;/code&gt;&lt;/p&gt;
+  &lt;/li&gt;
+&lt;/ol&gt;
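+
+&lt;p&gt;The package choice in step 1 can be sketched as a small shell 
function. This is only an illustration: the function name 
kylin_package_for_emr is hypothetical, and the spark2 file name is assumed by 
analogy with the spark3 file name used above.&lt;/p&gt;

```shell
# Map an EMR release label to the Kylin 4.0.1 package described in this post.
# Assumption: the spark2 tarball follows the same naming as the spark3 one.
kylin_package_for_emr() {
  case "$1" in
    emr-5.*) echo "apache-kylin-4.0.1-bin-spark2.tar.gz" ;;
    emr-6.*) echo "apache-kylin-4.0.1-bin-spark3.tar.gz" ;;
    *) return 1 ;;  # release line not covered by this post
  esac
}
```

+&lt;p&gt;For example, calling kylin_package_for_emr emr-6.5.0 prints the spark3 
tarball name, which can then be passed to tar.&lt;/p&gt;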
+
+&lt;h3 id=&quot;spark&quot;&gt;Prepare Spark&lt;/h3&gt;
+
+&lt;p&gt;AWS Spark has built-in support for AWS Glue, so 
&lt;strong&gt;loading table metadata and running builds must use AWS 
Spark&lt;/strong&gt;. However, Kylin 4.0.1 targets Apache Spark, and AWS Spark 
carries substantial code changes relative to Apache Spark, so the two are 
poorly compatible. For that reason &lt;strong&gt;querying a cube must use 
Apache Spark&lt;/strong&gt;. In short, switch the Spark that Kylin uses 
depending on whether it is running a build job or a query.&lt;/p&gt;
+
+&lt;ul&gt;
+  &lt;li&gt;Prepare AWS Spark&lt;/li&gt;
+&lt;/ul&gt;
+
+&lt;div class=&quot;highlighter-rouge&quot;&gt;&lt;pre 
class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span 
class=&quot;nb&quot;&gt;cd&lt;/span&gt; &lt;span 
class=&quot;nv&quot;&gt;$KYLIN_HOME&lt;/span&gt;
+mkdir ext
+cp /usr/lib/hive/lib/mariadb-connector-java.jar &lt;span 
class=&quot;nv&quot;&gt;$KYLIN_HOME&lt;/span&gt;/ext
+&lt;/code&gt;&lt;/pre&gt;
+&lt;/div&gt;
+
+&lt;ul&gt;
+  &lt;li&gt;Prepare Apache Spark
+    &lt;ul&gt;
+      &lt;li&gt;Choose the Spark distribution that matches your EMR version: 
EMR 5.x uses the &lt;code 
class=&quot;highlighter-rouge&quot;&gt;Spark 2.4.7&lt;/code&gt; package and EMR 
6.x uses the &lt;code class=&quot;highlighter-rouge&quot;&gt;Spark 
3.1.2&lt;/code&gt; package.&lt;br /&gt;
+&lt;code class=&quot;highlighter-rouge&quot;&gt;shell
+cd $KYLIN_HOME
+aws s3 cp s3://${BUCKET}/spark-2.4.7-bin-hadoop2.7.tgz $KYLIN_HOME # Or 
download spark-2.4.7-bin-hadoop2.7.tgz from the official website
+tar zxvf spark-2.4.7-bin-hadoop2.7.tgz
+mv spark-2.4.7-bin-hadoop2.7 spark-apache
+&lt;/code&gt;&lt;/li&gt;
+    &lt;/ul&gt;
+  &lt;/li&gt;
+  &lt;li&gt;Because the Glue tables must be loaded first, point &lt;code 
class=&quot;highlighter-rouge&quot;&gt;$KYLIN_HOME/spark&lt;/code&gt; at AWS 
Spark with a symbolic link. Note that &lt;code 
class=&quot;highlighter-rouge&quot;&gt;SPARK_HOME&lt;/code&gt; does not need to 
be set: when &lt;code 
class=&quot;highlighter-rouge&quot;&gt;$KYLIN_HOME/spark&lt;/code&gt; exists and 
&lt;code class=&quot;highlighter-rouge&quot;&gt;SPARK_HOME&lt;/code&gt; is 
unset, Kylin uses &lt;code 
class=&quot;highlighter-rouge&quot;&gt;$KYLIN_HOME/spark&lt;/code&gt; by 
default.&lt;/li&gt;
+&lt;/ul&gt;
+
+&lt;div class=&quot;highlighter-rouge&quot;&gt;&lt;pre 
class=&quot;highlight&quot;&gt;&lt;code&gt;ln -s spark-aws spark
+&lt;/code&gt;&lt;/pre&gt;
+&lt;/div&gt;
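+
+&lt;p&gt;Since the link is switched again later for queries, the step can be 
wrapped in a helper. The sketch below is an assumption-laden illustration, not 
part of Kylin: the function name use_spark is hypothetical, and it expects the 
directories spark-aws and spark-apache to exist under the Kylin home 
directory, as the commands in this post imply.&lt;/p&gt;

```shell
# use_spark KYLIN_HOME TARGET
# Re-point KYLIN_HOME/spark at TARGET (for example spark-aws or spark-apache).
use_spark() {
  kylin_home="$1"
  target="$2"
  # Refuse to link to a directory that does not exist.
  [ -d "$kylin_home/$target" ] || return 1
  if [ -L "$kylin_home/spark" ]; then
    rm "$kylin_home/spark"        # drop the old symbolic link
  elif [ -e "$kylin_home/spark" ]; then
    return 1                      # a real file or directory: do not touch it
  fi
  # Create a relative link, matching the ln -s command shown above.
  ( cd "$kylin_home" || exit 1; ln -s "$target" spark )
}
```

+&lt;p&gt;A relative link keeps the layout valid if the Kylin directory is 
moved. Restart Kylin after each switch so the new Spark is picked up.&lt;/p&gt;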
+
+&lt;h3 id=&quot;kylin-&quot;&gt;Modify the Kylin startup script&lt;/h3&gt;
+
+&lt;ol&gt;
+  &lt;li&gt;Start the Spark SQL CLI and leave it running.&lt;/li&gt;
+  &lt;li&gt;
+    &lt;p&gt;Use &lt;code class=&quot;highlighter-rouge&quot;&gt;jps 
-ml&lt;/code&gt; to find the PID of &lt;code 
class=&quot;highlighter-rouge&quot;&gt;SparkSQLCLIDriver&lt;/code&gt;, then read 
the driver's &lt;code 
class=&quot;highlighter-rouge&quot;&gt;spark.driver.extraClassPath&lt;/code&gt; 
with &lt;code class=&quot;highlighter-rouge&quot;&gt;jinfo&lt;/code&gt;. 
Alternatively, take the value from &lt;code 
class=&quot;highlighter-rouge&quot;&gt;/etc/spark/conf/spark-defaults.conf&lt;/code&gt;.&lt;br /&gt;
+ &lt;code class=&quot;highlighter-rouge&quot;&gt;shell
+ jps -ml | grep SparkSubmit
+ jinfo ${PID} | grep &quot;spark.driver.extraClassPath&quot;
+&lt;/code&gt;&lt;br /&gt;
+ &lt;img 
src=&quot;/images/blog/kylin4_support_aws_glue/8_kylin_start_up_script_en.png&quot;
 alt=&quot;&quot; /&gt;&lt;/p&gt;
+  &lt;/li&gt;
+  &lt;li&gt;Edit &lt;code 
class=&quot;highlighter-rouge&quot;&gt;bin/kylin.sh&lt;/code&gt; and append 
&lt;code 
class=&quot;highlighter-rouge&quot;&gt;kylin_driver_classpath&lt;/code&gt; to the 
&lt;code 
class=&quot;highlighter-rouge&quot;&gt;KYLIN_TOMCAT_CLASSPATH&lt;/code&gt; 
variable. After saving &lt;code 
class=&quot;highlighter-rouge&quot;&gt;bin/kylin.sh&lt;/code&gt;, exit the 
Spark SQL CLI.&lt;/li&gt;
+&lt;/ol&gt;
+
+&lt;ul&gt;
+  &lt;li&gt;kylin.sh before the change&lt;/li&gt;
+&lt;/ul&gt;
+
+&lt;p&gt;&lt;img 
src=&quot;/images/blog/kylin4_support_aws_glue/9_kylin_start_up_script_en.png&quot;
 alt=&quot;&quot; /&gt;&lt;/p&gt;
+
+&lt;ul&gt;
+  &lt;li&gt;kylin.sh after the change, for EMR 6.5.0: &lt;code 
class=&quot;highlighter-rouge&quot;&gt;kylin_driver_classpath&lt;/code&gt; 
is placed last.&lt;/li&gt;
+&lt;/ul&gt;
+
+&lt;p&gt;&lt;img 
src=&quot;/images/blog/kylin4_support_aws_glue/10_kylin_start_up_script_en.png&quot;
 alt=&quot;&quot; /&gt;&lt;/p&gt;
+
+&lt;ul&gt;
+  &lt;li&gt;kylin.sh after the change, for EMR 5.33.1: &lt;code 
class=&quot;highlighter-rouge&quot;&gt;kylin_driver_classpath&lt;/code&gt; 
is placed before &lt;code 
class=&quot;highlighter-rouge&quot;&gt;$SPARK_HOME/jars&lt;/code&gt;.&lt;/li&gt;
+&lt;/ul&gt;
+
+&lt;p&gt;&lt;img 
src=&quot;/images/blog/kylin4_support_aws_glue/11_kylin_start_up_script_en.png&quot;
 alt=&quot;&quot; /&gt;&lt;/p&gt;
+
+&lt;h3 id=&quot;kylin-1&quot;&gt;Configure Kylin&lt;/h3&gt;
+
+&lt;div class=&quot;highlighter-rouge&quot;&gt;&lt;pre 
class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span 
class=&quot;nb&quot;&gt;cd&lt;/span&gt; &lt;span 
class=&quot;nv&quot;&gt;$KYLIN_HOME&lt;/span&gt;
+vim conf/kylin.properties 
+&lt;/code&gt;&lt;/pre&gt;
+&lt;/div&gt;
+
+&lt;h4 id=&quot;minimal-kylin-configuration&quot;&gt;Minimal Kylin 
Configuration&lt;/h4&gt;
+
+&lt;table&gt;
+  &lt;thead&gt;
+    &lt;tr&gt;
+      &lt;th&gt;Property Key&lt;/th&gt;
+      &lt;th&gt;Property Value(Example)&lt;/th&gt;
+      &lt;th&gt;Notes&lt;/th&gt;
+    &lt;/tr&gt;
+  &lt;/thead&gt;
+  &lt;tbody&gt;
+    &lt;tr&gt;
+      &lt;td&gt;kylin.metadata.url&lt;/td&gt;
+      
&lt;td&gt;kylin4_on_cloud@jdbc,url=jdbc:mysql://${HOSTNAME}:3306/hue,username=hive,password=${PASSWORD},maxActive=10,maxIdle=10,driverClassName=org.mariadb.jdbc.Driver&lt;/td&gt;
+      &lt;td&gt;N/A&lt;/td&gt;
+    &lt;/tr&gt;
+    &lt;tr&gt;
+      &lt;td&gt;kylin.env.zookeeper-connect-string&lt;/td&gt;
+      &lt;td&gt;${HOSTNAME}&lt;/td&gt;
+      &lt;td&gt;N/A&lt;/td&gt;
+    &lt;/tr&gt;
+    &lt;tr&gt;
+      &lt;td&gt;kylin.engine.spark-conf.spark.driver.extraClassPath&lt;/td&gt;
+      
&lt;td&gt;/usr/lib/hadoop-lzo/lib/*:/usr/lib/hadoop/hadoop-aws.jar:/usr/share/aws/aws-java-sdk/*:/usr/share/aws/emr/goodies/lib/emr-spark-goodies.jar:/usr/share/aws/emr/security/conf:/usr/share/aws/emr/security/lib/*:/usr/share/aws/hmclient/lib/aws-glue-datacatalog-spark-client.jar:/usr/share/java/Hive-JSON-Serde/hive-openx-serde.jar:/usr/share/aws/sagemaker-spark-sdk/lib/sagemaker-spark-sdk.jar:/usr/share/aws/emr/s3select/lib/emr-s3-select-spark-connector.jar&lt;/td&gt;
+      &lt;td&gt;Copied from spark.driver.extraClassPath in 
/etc/spark/conf/spark-defaults.conf&lt;/td&gt;
+    &lt;/tr&gt;
+  &lt;/tbody&gt;
+&lt;/table&gt;
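+
+&lt;p&gt;The last row of the table can be derived rather than copied by hand. 
The following is a minimal sketch under stated assumptions: the function name 
kylin_classpath_property is hypothetical, and it assumes the Spark defaults 
file separates key and value with whitespace and that the value itself 
contains no spaces.&lt;/p&gt;

```shell
# kylin_classpath_property FILE
# Read spark.driver.extraClassPath from a spark-defaults.conf style file and
# print it as a kylin.properties line.
kylin_classpath_property() {
  value="$(awk '$1 == "spark.driver.extraClassPath" { print $2 }' "$1")"
  # Fail instead of printing an empty property when the key is absent.
  [ -n "$value" ] || return 1
  printf 'kylin.engine.spark-conf.spark.driver.extraClassPath=%s\n' "$value"
}
```

+&lt;p&gt;On the Master node this would be called with 
/etc/spark/conf/spark-defaults.conf, and the printed line appended to 
conf/kylin.properties after review.&lt;/p&gt;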
+
+&lt;h3 id=&quot;kylin--1&quot;&gt;Start Kylin and verify the build&lt;/h3&gt;
+
+&lt;h4 id=&quot;kylin-2&quot;&gt;Start Kylin&lt;/h4&gt;
+
+&lt;div class=&quot;highlighter-rouge&quot;&gt;&lt;pre 
class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span 
class=&quot;nb&quot;&gt;cd&lt;/span&gt; &lt;span 
class=&quot;nv&quot;&gt;$KYLIN_HOME&lt;/span&gt;
+ln -s spark-aws spark &lt;span class=&quot;c&quot;&gt;# skip this step if the soft link &#39;spark&#39; already exists&lt;/span&gt;
+bin/kylin.sh restart
+&lt;/code&gt;&lt;/pre&gt;
+&lt;/div&gt;
+
+&lt;p&gt;&lt;img 
src=&quot;/images/blog/kylin4_support_aws_glue/12_start_kylin_en.png&quot; 
alt=&quot;&quot; /&gt;&lt;/p&gt;
+
+&lt;p&gt;&lt;img 
src=&quot;/images/blog/kylin4_support_aws_glue/13_start_kylin_en.png&quot; 
alt=&quot;&quot; /&gt;&lt;/p&gt;
+
+&lt;h4 id=&quot;kylin-spark-enginejar-optional&quot;&gt;Replace 
kylin-spark-engine.jar (Optional)&lt;/h4&gt;
+
+&lt;blockquote&gt;
+  &lt;p&gt;This step is required only for 4.0.1.&lt;/p&gt;
+&lt;/blockquote&gt;
+
+&lt;div class=&quot;highlighter-rouge&quot;&gt;&lt;pre 
class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span 
class=&quot;nb&quot;&gt;cd&lt;/span&gt; &lt;span 
class=&quot;nv&quot;&gt;$KYLIN_HOME&lt;/span&gt;/tomcat/webapps/kylin/WEB-INF/lib/
+mv kylin-spark-engine-4.0.1.jar kylin-spark-engine-4.0.1.jar.bak &lt;span class=&quot;c&quot;&gt;# back up the old jar&lt;/span&gt;
+cp kylin-spark-engine-4.0.0-SNAPSHOT.jar  . &lt;span class=&quot;c&quot;&gt;# copy the patched jar built earlier; prefix the directory where you saved it&lt;/span&gt;
+
+&lt;span class=&quot;nv&quot;&gt;$KYLIN_HOME&lt;/span&gt;/bin/kylin.sh restart &lt;span class=&quot;c&quot;&gt;# restart Kylin so the new jar is loaded&lt;/span&gt;
+&lt;/code&gt;&lt;/pre&gt;
+&lt;/div&gt;
+
+&lt;h4 id=&quot;glue--1&quot;&gt;Load Glue tables and build&lt;/h4&gt;
+
+&lt;ul&gt;
+  &lt;li&gt;Load the Glue table metadata.&lt;/li&gt;
+&lt;/ul&gt;
+
+&lt;p&gt;&lt;img 
src=&quot;/images/blog/kylin4_support_aws_glue/14_load_glue_meta_en.png&quot; 
alt=&quot;&quot; /&gt;&lt;/p&gt;
+
+&lt;p&gt;&lt;img 
src=&quot;/images/blog/kylin4_support_aws_glue/15_load_glue_meta_en.png&quot; 
alt=&quot;&quot; /&gt;&lt;/p&gt;
+
+&lt;ul&gt;
+  &lt;li&gt;Create a Model and a Cube, then trigger a build.&lt;/li&gt;
+&lt;/ul&gt;
+
+&lt;p&gt;&lt;img 
src=&quot;/images/blog/kylin4_support_aws_glue/16_load_glue_meta_en.png&quot; 
alt=&quot;&quot; /&gt;&lt;/p&gt;
+
+&lt;h3 id=&quot;section-2&quot;&gt;Verify queries&lt;/h3&gt;
+
+&lt;p&gt;Switch the Spark used by Kylin, then restart Kylin.&lt;/p&gt;
+
+&lt;div class=&quot;highlighter-rouge&quot;&gt;&lt;pre 
class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span 
class=&quot;nb&quot;&gt;cd&lt;/span&gt; &lt;span 
class=&quot;nv&quot;&gt;$KYLIN_HOME&lt;/span&gt;
+rm spark &lt;span class=&quot;c&quot;&gt;# &#39;spark&#39; is a soft link that points to AWS Spark&lt;/span&gt;
+ln -s spark-apache spark &lt;span class=&quot;c&quot;&gt;# switch from AWS Spark to Apache Spark&lt;/span&gt;
+bin/kylin.sh restart
+&lt;/code&gt;&lt;/pre&gt;
+&lt;/div&gt;
+
+&lt;p&gt;Run a test query and confirm that it succeeds.&lt;/p&gt;
+
+&lt;p&gt;&lt;img 
src=&quot;/images/blog/kylin4_support_aws_glue/17_verify_query_en.png&quot; 
alt=&quot;&quot; /&gt;&lt;/p&gt;
+
+&lt;h2 id=&quot;section-3&quot;&gt;Discussion and Q&amp;amp;A&lt;/h2&gt;
+
+&lt;h3 id=&quot;sparkaws-spark--apache-spark&quot;&gt;Why are two Sparks (AWS 
Spark &amp;amp; Apache Spark) required?&lt;/h3&gt;
+
+&lt;p&gt;AWS Spark has built-in support for AWS Glue Catalog, and both table 
loading and the build engine need to read the tables, so 
&lt;strong&gt;loading table metadata and running builds must use AWS 
Spark&lt;/strong&gt;. However, Kylin 4.0.1 targets Apache Spark, and AWS Spark 
carries substantial code changes relative to Apache Spark, which makes the two 
poorly compatible. For that reason &lt;strong&gt;querying a cube must use 
Apache Spark&lt;/strong&gt;. In short, switch the Spark that Kylin uses 
depending on whether it is running a build job or a query.&lt;br /&gt;
+In practice, consider running Job Nodes (build jobs) on AWS Spark and Query 
Nodes (queries) on Apache Spark.&lt;/p&gt;
+
+&lt;h3 id=&quot;kylinsh&quot;&gt;Why does kylin.sh need to be modified?&lt;/h3&gt;
+
+&lt;p&gt;The Kylin process acts as the Spark driver and loads table metadata 
through &lt;code 
class=&quot;highlighter-rouge&quot;&gt;aws-glue-datacatalog-spark-client.jar&lt;/code&gt;, 
so kylin.sh must be modified to add the relevant jars to the classpath of the 
Kylin process.&lt;/p&gt;
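+
+&lt;p&gt;A quick sanity check for the edited classpath might look like the 
sketch below. The function name has_glue_client is hypothetical. It only 
looks for the jar name as a substring, so a classpath that includes the jar 
through a wildcard directory entry would not be detected.&lt;/p&gt;

```shell
# has_glue_client CLASSPATH
# Succeeds when the colon-separated classpath names the Glue client jar.
has_glue_client() {
  case "$1" in
    *aws-glue-datacatalog-spark-client.jar*) return 0 ;;
    *) return 1 ;;
  esac
}
```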
+</description>
+        <pubDate>Thu, 17 Mar 2022 04:00:00 -0700</pubDate>
+        
<link>http://kylin.apache.org/cn_blog/2022/03/17/kylin4-now-supporting-aws-glue-catalog/</link>
+        <guid 
isPermaLink="true">http://kylin.apache.org/cn_blog/2022/03/17/kylin4-now-supporting-aws-glue-catalog/</guid>
+        
+        
+        <category>cn_blog</category>
+        
+      </item>
+    
+      <item>
+        <title>Kylin 4 now supports AWS Glue Catalog</title>
+        <description>&lt;h2 
id=&quot;why-does-installing-kylin-on-emr-need-to-support-aws-glue&quot;&gt;Why 
does installing Kylin on EMR need to support AWS Glue?&lt;/h2&gt;
+
+&lt;h3 id=&quot;what-is-aws-glue&quot;&gt;What is AWS Glue?&lt;/h3&gt;
+
+&lt;p&gt;AWS Glue is a fully managed ETL (extract, transform, and load) service 
that lets AWS users classify, cleanse, and enrich data easily and 
cost-effectively, and move it reliably between data stores. AWS Glue consists 
of a central metadata repository called the AWS Glue Data Catalog, an ETL 
engine that generates code automatically, and a flexible scheduler that handles 
dependency resolution, job monitoring, and retries. AWS Glue is serverless, so 
there is no infrastructure to set up or manage.&lt;/p&gt;
+
+&lt;h3 id=&quot;why-does-kylin-need-aws-glue-catalog&quot;&gt;Why does Kylin 
need AWS Glue Catalog?&lt;/h3&gt;
+
+&lt;p&gt;Many users in the Kylin community run large-scale distributed data 
processing on AWS EMR with Hadoop, Spark, Hive, Presto, and similar 
components. Without the AWS Glue Data Catalog, a table created in one of these 
data warehouse components (such as Hive, Spark, or Presto) cannot be used by 
the others. Because the data warehouse has to serve many business departments, 
these users create their AWS EMR clusters with the AWS Glue Data Catalog as 
the metadata store, which shares the data sources across components and 
departments. The data from every department can then be built into one data 
cube that responds quickly to different business requirements.&lt;br /&gt;
+In modern companies, data lives in cloud object storage, and big data teams 
use AWS EMR for data processing, data analysis, and model training. As data 
volumes explode, extracting data becomes slow and difficult. EMR with Spark or 
Hive cannot meet the fast-query needs of data analysts, operations staff, and 
sales teams, so some users turn to Apache Kylin as their open-source OLAP 
solution.&lt;br /&gt;
+Recently, community users asked for Kylin 4 to read table metadata directly 
from AWS Glue. After working on the problem together, Kylin 4 now supports AWS 
Glue Catalog, so tables and data can be shared among Hive, Presto, Spark, and 
Kylin. This breaks down the metadata barrier and lets different subject areas 
be combined into one data analysis platform.&lt;/p&gt;
+
+&lt;h3 id=&quot;does-kylin-support-aws-glue&quot;&gt;Does Kylin support AWS 
Glue?&lt;/h3&gt;
+
+&lt;table&gt;
+  &lt;thead&gt;
+    &lt;tr&gt;
+      &lt;th&gt; &lt;/th&gt;
+      &lt;th&gt;Kylin version which supports Glue&lt;/th&gt;
+      &lt;th&gt;Issue Link&lt;/th&gt;
+    &lt;/tr&gt;
+  &lt;/thead&gt;
+  &lt;tbody&gt;
+    &lt;tr&gt;
+      &lt;td&gt;Kylin on HBase (Before Kylin 4)&lt;/td&gt;
+      &lt;td&gt;2.6.6 or higher&lt;br /&gt;3.1.0 or higher&lt;/td&gt;
+      &lt;td&gt;https://issues.apache.org/jira/browse/KYLIN-4206&lt;br 
/&gt;https://zhuanlan.zhihu.com/p/99481373&lt;/td&gt;
+    &lt;/tr&gt;
+    &lt;tr&gt;
+      &lt;td&gt;Kylin on Parquet&lt;/td&gt;
+      &lt;td&gt;4.0.1 or higher&lt;/td&gt;
+      &lt;td&gt;This article.&lt;/td&gt;
+    &lt;/tr&gt;
+  &lt;/tbody&gt;
+&lt;/table&gt;
+
+&lt;h2 id=&quot;prerequisites-for-deployment&quot;&gt;Prerequisites for 
deployment&lt;/h2&gt;
+
+&lt;h3 id=&quot;software-version&quot;&gt;Software Version&lt;/h3&gt;
+
+&lt;table&gt;
+  &lt;thead&gt;
+    &lt;tr&gt;
+      &lt;th&gt;&lt;strong&gt;Software&lt;/strong&gt;&lt;/th&gt;
+      &lt;th&gt;&lt;strong&gt;Version&lt;/strong&gt;&lt;/th&gt;
+      &lt;th&gt;Reference&lt;/th&gt;
+    &lt;/tr&gt;
+  &lt;/thead&gt;
+  &lt;tbody&gt;
+    &lt;tr&gt;
+      &lt;td&gt;Apache Kylin&lt;/td&gt;
+      &lt;td&gt;4.0.1 or higher&lt;/td&gt;
+      &lt;td&gt;&lt;a 
href=&quot;https://cwiki.apache.org/confluence/display/KYLIN/KIP+10+refactor+hive+and+hadoop+dependency&quot;&gt;KIP
 10 refactor hive and hadoop dependency&lt;/a&gt;.&lt;/td&gt;
+    &lt;/tr&gt;
+    &lt;tr&gt;
+      &lt;td&gt;AWS EMR&lt;/td&gt;
+      &lt;td&gt;6.5.0 or higher&lt;br /&gt;5.33.1 or higher&lt;/td&gt;
+      &lt;td&gt;&lt;a 
href=&quot;https://docs.amazonaws.cn/en_us/emr/latest/ReleaseGuide/emr-650-release.html&quot;&gt;Amazon
 EMR release 6.5.0 - Amazon EMR&lt;/a&gt;.&lt;/td&gt;
+    &lt;/tr&gt;
+  &lt;/tbody&gt;
+&lt;/table&gt;
+
+&lt;h3 id=&quot;prepare-aws-glue-database-and-tables&quot;&gt;Prepare AWS Glue 
database and tables&lt;/h3&gt;
+
+&lt;p&gt;&lt;img 
src=&quot;/images/blog/kylin4_support_aws_glue/1_prepare_aws_glue_table_en.png&quot;
 alt=&quot;&quot; /&gt;&lt;/p&gt;
+
+&lt;p&gt;&lt;img 
src=&quot;/images/blog/kylin4_support_aws_glue/2_prepare_aws_glue_table_en.png&quot;
 alt=&quot;&quot; /&gt;&lt;/p&gt;
+
+&lt;ul&gt;
+  &lt;li&gt;Create an EMR cluster.&lt;/li&gt;
+&lt;/ul&gt;
+
+&lt;p&gt;Note: Parameter hive.metastore.client.factory.class is configured to 
enable AWS Glue. For details, you may refer to the commands below.&lt;/p&gt;
+
+&lt;div class=&quot;highlighter-rouge&quot;&gt;&lt;pre 
class=&quot;highlight&quot;&gt;&lt;code&gt;aws emr create-cluster 
--applications &lt;span class=&quot;nv&quot;&gt;Name&lt;/span&gt;&lt;span 
class=&quot;o&quot;&gt;=&lt;/span&gt;Hadoop &lt;span 
class=&quot;nv&quot;&gt;Name&lt;/span&gt;&lt;span 
class=&quot;o&quot;&gt;=&lt;/span&gt;Hive &lt;span 
class=&quot;nv&quot;&gt;Name&lt;/span&gt;&lt;span 
class=&quot;o&quot;&gt;=&lt;/span&gt;Spark &lt;span 
class=&quot;nv&quot;&gt;Name&lt;/span&gt;&lt;span 
class=&quot;o&quot;&gt;=&lt;/span&gt;ZooKeeper &lt;span 
class=&quot;nv&quot;&gt;Name&lt;/span&gt;&lt;span 
class=&quot;o&quot;&gt;=&lt;/span&gt;Tez &lt;span 
class=&quot;nv&quot;&gt;Name&lt;/span&gt;&lt;span 
class=&quot;o&quot;&gt;=&lt;/span&gt;Ganglia &lt;span 
class=&quot;se&quot;&gt;\&lt;/span&gt;
+  --ec2-attributes &lt;span class=&quot;k&quot;&gt;${}&lt;/span&gt; &lt;span 
class=&quot;se&quot;&gt;\&lt;/span&gt;
+  --release-label emr-6.5.0 &lt;span class=&quot;se&quot;&gt;\&lt;/span&gt;
+  --log-uri &lt;span class=&quot;k&quot;&gt;${}&lt;/span&gt; &lt;span 
class=&quot;se&quot;&gt;\&lt;/span&gt;
+  --instance-groups &lt;span class=&quot;k&quot;&gt;${}&lt;/span&gt; &lt;span 
class=&quot;se&quot;&gt;\&lt;/span&gt;
+  --configurations &lt;span 
class=&quot;s1&quot;&gt;&#39;[{&quot;Classification&quot;:&quot;hive-site&quot;,&quot;Properties&quot;:{&quot;hive.metastore.client.factory.class&quot;:&quot;com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory&quot;}}]&#39;&lt;/span&gt;
 &lt;span class=&quot;se&quot;&gt;\&lt;/span&gt;
+  --auto-scaling-role EMR_AutoScaling_DefaultRole &lt;span 
class=&quot;se&quot;&gt;\&lt;/span&gt;
+  --ebs-root-volume-size 100 &lt;span class=&quot;se&quot;&gt;\&lt;/span&gt;
+  --service-role EMR_DefaultRole &lt;span 
class=&quot;se&quot;&gt;\&lt;/span&gt;
+  --enable-debugging &lt;span class=&quot;se&quot;&gt;\&lt;/span&gt;
+  --name &lt;span 
class=&quot;s1&quot;&gt;&#39;Kylin4_on_EMR65_with_Glue&#39;&lt;/span&gt; 
&lt;span class=&quot;se&quot;&gt;\&lt;/span&gt;
+  --region cn-northwest-1
+&lt;/code&gt;&lt;/pre&gt;
+&lt;/div&gt;
+
+&lt;ul&gt;
+  &lt;li&gt;Log in to the Master node. Check the Hadoop version and whether 
the Hadoop cluster is successfully started.&lt;/li&gt;
+&lt;/ul&gt;
+
+&lt;p&gt;&lt;img 
src=&quot;/images/blog/kylin4_support_aws_glue/3_prepare_hadoop_cluster_en.png&quot;
 alt=&quot;&quot; /&gt;&lt;/p&gt;
+
+&lt;p&gt;&lt;img 
src=&quot;/images/blog/kylin4_support_aws_glue/4_prepare_hadoop_cluster_en.png&quot;
 alt=&quot;&quot; /&gt;&lt;/p&gt;
+
+&lt;h3 id=&quot;optionalget-environmental-information&quot;&gt;(Optional) Get 
environment information&lt;/h3&gt;
+
+&lt;blockquote&gt;
+  &lt;p&gt;If you are using RDS or other metadata storage, you may skip this 
step.&lt;/p&gt;
+&lt;/blockquote&gt;
+
+&lt;p&gt;Kylin 4 recommends an RDBMS as its metastore. For testing purposes, 
this article uses the MariaDB instance that ships with the Master node. The 
MariaDB hostname, account, and password can be found in &lt;code 
class=&quot;highlighter-rouge&quot;&gt;/etc/hive/conf/hive-site.xml&lt;/code&gt;.&lt;/p&gt;
+
+&lt;div class=&quot;highlighter-rouge&quot;&gt;&lt;pre 
class=&quot;highlight&quot;&gt;&lt;code&gt;kylin.metadata.url&lt;span 
class=&quot;o&quot;&gt;=&lt;/span&gt;kylin4_on_cloud@jdbc,url&lt;span 
class=&quot;o&quot;&gt;=&lt;/span&gt;jdbc:mysql://&lt;span 
class=&quot;k&quot;&gt;${&lt;/span&gt;&lt;span 
class=&quot;nv&quot;&gt;HOSTNAME&lt;/span&gt;&lt;span 
class=&quot;k&quot;&gt;}&lt;/span&gt;:3306/hue,username&lt;span 
class=&quot;o&quot;&gt;=&lt;/span&gt;hive,password&lt;span 
class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span 
class=&quot;k&quot;&gt;${&lt;/span&gt;&lt;span 
class=&quot;nv&quot;&gt;PASSWORD&lt;/span&gt;&lt;span 
class=&quot;k&quot;&gt;}&lt;/span&gt;,maxActive&lt;span 
class=&quot;o&quot;&gt;=&lt;/span&gt;10,maxIdle&lt;span 
class=&quot;o&quot;&gt;=&lt;/span&gt;10,driverClassName&lt;span 
class=&quot;o&quot;&gt;=&lt;/span&gt;org.mariadb.jdbc.Driver  
+kylin.env.zookeeper-connect-string&lt;span 
class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span 
class=&quot;k&quot;&gt;${&lt;/span&gt;&lt;span 
class=&quot;nv&quot;&gt;HOSTNAME&lt;/span&gt;&lt;span 
class=&quot;k&quot;&gt;}&lt;/span&gt;
+&lt;/code&gt;&lt;/pre&gt;
+&lt;/div&gt;
+
+&lt;p&gt;Replace the variables with your actual values (for example, replace 
${PASSWORD} with the real password) and save the result locally; it will be 
needed to start Kylin.&lt;/p&gt;
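+
+&lt;p&gt;A minimal sketch of this substitution, assuming the two properties above 
are saved in a template file. The helper name render_kylin_props and the file 
names are hypothetical and not part of Kylin; the sketch also assumes that the 
values contain no '|', backslash, or ampersand characters, which are special to 
sed.&lt;/p&gt;
+

```shell
# Hypothetical helper (not part of Kylin): fill in the ${HOSTNAME} and
# ${PASSWORD} placeholders of a template file and print the result.
# Usage: render_kylin_props TEMPLATE HOST PASSWORD
render_kylin_props() {
  template="$1"
  host="$2"
  password="$3"
  # The bracket expression matches a literal dollar sign; '|' is the sed
  # delimiter so that slashes in the values need no escaping.
  sed -e "s|[\$]{HOSTNAME}|${host}|g" \
      -e "s|[\$]{PASSWORD}|${password}|g" \
      "$template"
}
```

+
+&lt;p&gt;For example, render_kylin_props kylin.properties.template ip-10-0-0-1 
secret prints the filled-in properties, which can then be redirected to a local 
file (the host name and password here are made up).&lt;/p&gt;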
+
+&lt;h3 
id=&quot;test-the-connectivity-between-spark-sql-and-aws-glue&quot;&gt;Test the 
connectivity between Spark SQL and AWS Glue&lt;/h3&gt;
+
+&lt;p&gt;Use the Spark SQL CLI to test whether AWS Spark can access database and 
table metadata through AWS Glue. On the first attempt, the startup fails with 
an error.&lt;/p&gt;
+
+&lt;p&gt;&lt;img 
src=&quot;/images/blog/kylin4_support_aws_glue/5_test_sparksql_glue_en.png&quot;
 alt=&quot;&quot; /&gt;&lt;/p&gt;
+
+&lt;p&gt;Replace &lt;code 
class=&quot;highlighter-rouge&quot;&gt;hive-site.xml&lt;/code&gt; used by Spark 
with the following commands.&lt;/p&gt;
+
+&lt;div class=&quot;highlighter-rouge&quot;&gt;&lt;pre 
class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span 
class=&quot;nb&quot;&gt;cd&lt;/span&gt; /etc/spark/conf
+sudo mv hive-site.xml hive-site.xml.bak
+sudo cp /etc/hive/conf/hive-site.xml .
+&lt;/code&gt;&lt;/pre&gt;
+&lt;/div&gt;
+
+&lt;p&gt;Then change the value of &lt;code 
class=&quot;highlighter-rouge&quot;&gt;hive.execution.engine&lt;/code&gt; in 
file &lt;code 
class=&quot;highlighter-rouge&quot;&gt;/etc/spark/conf/hive-site.xml&lt;/code&gt;
 to &lt;code class=&quot;highlighter-rouge&quot;&gt;mr&lt;/code&gt;, restart 
the Spark SQL CLI, and verify that a query against AWS Glue table data 
succeeds.&lt;/p&gt;
+
+&lt;p&gt;&lt;img 
src=&quot;/images/blog/kylin4_support_aws_glue/6_test_sparksql_glue_en.png&quot;
 alt=&quot;&quot; /&gt;&lt;/p&gt;
+
+&lt;p&gt;&lt;img 
src=&quot;/images/blog/kylin4_support_aws_glue/7_test_sparksql_glue_en.png&quot;
 alt=&quot;&quot; /&gt;&lt;/p&gt;
+
+&lt;h3 id=&quot;optional-prepare-kylin-spark-enginejar&quot;&gt;(Optional) 
Prepare kylin-spark-engine.jar&lt;/h3&gt;
+
+&lt;blockquote&gt;
+  &lt;p&gt;This issue will be fixed in Apache Kylin 4.0.2, so you can skip 
this step after upgrading to that version. If you are on Kylin 4.0.1, follow 
the steps below to replace kylin-spark-engine.jar:&lt;/p&gt;
+&lt;/blockquote&gt;
+
+&lt;p&gt;Clone the Kylin git repository and run &lt;code 
class=&quot;highlighter-rouge&quot;&gt;mvn clean package 
-DskipTests&lt;/code&gt; to build a new &lt;code 
class=&quot;highlighter-rouge&quot;&gt;kylin-spark-project/kylin-spark-engine/target/kylin-spark-engine-4.0.0-SNAPSHOT.jar&lt;/code&gt;.&lt;/p&gt;
+
+&lt;div class=&quot;highlighter-rouge&quot;&gt;&lt;pre 
class=&quot;highlight&quot;&gt;&lt;code&gt;git clone 
https://github.com/hit-lacus/kylin.git
+&lt;span class=&quot;nb&quot;&gt;cd &lt;/span&gt;kylin
+git checkout KYLIN-5160
+mvn clean package -DskipTests
+
+&lt;span class=&quot;c&quot;&gt;# find kylin-spark-project/kylin-spark-engine/target -name kylin-spark-engine-4.0.0-SNAPSHOT.jar&lt;/span&gt;
+&lt;/code&gt;&lt;/pre&gt;
+&lt;/div&gt;
+
+&lt;p&gt;Patch link: &lt;a 
href=&quot;https://github.com/apache/kylin/pull/1819&quot;&gt;https://github.com/apache/kylin/pull/1819&lt;/a&gt;&lt;/p&gt;
+
+&lt;h2 id=&quot;deploy-kylin-and-connect-to-aws-glue&quot;&gt;Deploy Kylin and 
connect to AWS Glue&lt;/h2&gt;
+
+&lt;h3 id=&quot;download-kylin&quot;&gt;Download Kylin&lt;/h3&gt;
+
+&lt;ol&gt;
+  &lt;li&gt;
+    &lt;p&gt;Download and decompress Kylin. Choose the Kylin package that 
matches your EMR version: the Spark 2 package for EMR 5.X, or the Spark 3 
package for EMR 6.X.&lt;/p&gt;
+
+    &lt;div class=&quot;highlighter-rouge&quot;&gt;&lt;pre 
class=&quot;highlight&quot;&gt;&lt;code&gt;# aws s3 cp s3://${BUCKET}/apache-kylin-4.0.1-bin-spark3.tar.gz .
+# wget apache-kylin-4.0.1-bin-spark3.tar.gz
+tar zxvf apache-kylin-4.0.1-bin-spark3.tar.gz
+cd apache-kylin-4.0.1-bin-spark3
+export KYLIN_HOME=/home/hadoop/apache-kylin-4.0.1-bin-spark3
+&lt;/code&gt;&lt;/pre&gt;
+    &lt;/div&gt;
+  &lt;/li&gt;
+  &lt;li&gt;
+    &lt;p&gt;(Optional) Get the MariaDB driver jar&lt;/p&gt;
+
+    &lt;blockquote&gt;
+      &lt;p&gt;If you are using another database for the metastore, skip this 
step.&lt;/p&gt;
+    &lt;/blockquote&gt;
+
+    &lt;div class=&quot;highlighter-rouge&quot;&gt;&lt;pre 
class=&quot;highlight&quot;&gt;&lt;code&gt;cd $KYLIN_HOME
+mkdir ext
+cp /usr/lib/hive/lib/mariadb-connector-java.jar $KYLIN_HOME/ext
+&lt;/code&gt;&lt;/pre&gt;
+    &lt;/div&gt;
+  &lt;/li&gt;
+&lt;/ol&gt;
+
+&lt;h3 id=&quot;prepare-spark&quot;&gt;Prepare Spark&lt;/h3&gt;
+
+&lt;p&gt;AWS Spark has built-in support for AWS Glue, so use AWS Spark when 
loading table metadata and running build jobs. Kylin 4.0.1 officially supports 
Apache Spark, and because AWS Spark is not fully compatible with Apache Spark, 
use Apache Spark for cube queries. In short, you need to switch between AWS 
Spark and Apache Spark depending on the task (build task or query 
task).&lt;/p&gt;
+
+&lt;ul&gt;
+  &lt;li&gt;Prepare AWS Spark&lt;/li&gt;
+&lt;/ul&gt;
+
+&lt;div class=&quot;highlighter-rouge&quot;&gt;&lt;pre 
class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span 
class=&quot;nb&quot;&gt;cd&lt;/span&gt; &lt;span 
class=&quot;nv&quot;&gt;$KYLIN_HOME&lt;/span&gt;
+mkdir ext
+cp /usr/lib/hive/lib/mariadb-connector-java.jar &lt;span 
class=&quot;nv&quot;&gt;$KYLIN_HOME&lt;/span&gt;/ext
+&lt;/code&gt;&lt;/pre&gt;
+&lt;/div&gt;
+
+&lt;ul&gt;
+  &lt;li&gt;Download Apache Spark
+    &lt;ul&gt;
+      &lt;li&gt;Choose the Spark installation package that matches your EMR 
version: Spark 2.4.7 for EMR 5.X, or Spark 3.1.2 for EMR 6.X.
+        &lt;div class=&quot;highlighter-rouge&quot;&gt;&lt;pre 
class=&quot;highlight&quot;&gt;&lt;code&gt;cd $KYLIN_HOME
+# Or download spark-2.4.7-bin-hadoop2.7.tgz from the official website
+aws s3 cp s3://${BUCKET}/spark-2.4.7-bin-hadoop2.7.tgz $KYLIN_HOME
+tar zxvf spark-2.4.7-bin-hadoop2.7.tgz
+mv spark-2.4.7-bin-hadoop2.7 spark-apache
+&lt;/code&gt;&lt;/pre&gt;
+        &lt;/div&gt;
+      &lt;/li&gt;
+    &lt;/ul&gt;
+  &lt;/li&gt;
+  &lt;li&gt;Because you first need to load the AWS Glue tables, point &lt;code 
class=&quot;highlighter-rouge&quot;&gt;$KYLIN_HOME/spark&lt;/code&gt; at AWS 
Spark with a soft link. Note: you do not need to set &lt;code 
class=&quot;highlighter-rouge&quot;&gt;SPARK_HOME&lt;/code&gt;. If 
&lt;code class=&quot;highlighter-rouge&quot;&gt;$KYLIN_HOME/spark&lt;/code&gt; 
exists and &lt;code 
class=&quot;highlighter-rouge&quot;&gt;SPARK_HOME&lt;/code&gt; is not set, 
Kylin uses &lt;code 
class=&quot;highlighter-rouge&quot;&gt;$KYLIN_HOME/spark&lt;/code&gt; as 
&lt;code class=&quot;highlighter-rouge&quot;&gt;SPARK_HOME&lt;/code&gt; by 
default.&lt;/li&gt;
+&lt;/ul&gt;
+
+&lt;div class=&quot;highlighter-rouge&quot;&gt;&lt;pre 
class=&quot;highlight&quot;&gt;&lt;code&gt;ln -s spark-aws spark
+&lt;/code&gt;&lt;/pre&gt;
+&lt;/div&gt;
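+
+&lt;p&gt;The fallback described above can be sketched as follows. This only 
illustrates the rule as stated in this article and is not the code in Kylin's 
scripts; the function name resolve_spark_home is hypothetical, and giving an 
explicitly set SPARK_HOME precedence is an assumption.&lt;/p&gt;
+

```shell
# Illustration only: mirrors the rule described in the text, not Kylin's
# actual startup code. Prints the Spark directory that would be used.
resolve_spark_home() {
  if [ -n "${SPARK_HOME:-}" ]; then
    # Assumption: an explicitly set SPARK_HOME takes precedence.
    printf '%s\n' "$SPARK_HOME"
  elif [ -e "${KYLIN_HOME}/spark" ]; then
    # Otherwise fall back to $KYLIN_HOME/spark (a directory or a soft link).
    printf '%s\n' "${KYLIN_HOME}/spark"
  else
    # Neither is available.
    return 1
  fi
}
```

+
+&lt;p&gt;With SPARK_HOME unset and the soft link in place, the function prints 
the path of $KYLIN_HOME/spark, which is why switching the link is enough to 
switch the Spark that Kylin uses.&lt;/p&gt;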
+
+&lt;h3 id=&quot;modify-kylin-startup-script&quot;&gt;Modify Kylin startup 
script&lt;/h3&gt;
+
+&lt;ol&gt;
+  &lt;li&gt;Start Spark SQL CLI and keep it in running status.&lt;/li&gt;
+  &lt;li&gt;
+    &lt;p&gt;Get the PID of &lt;code 
class=&quot;highlighter-rouge&quot;&gt;SparkSQLCLIDriver&lt;/code&gt; with 
&lt;code class=&quot;highlighter-rouge&quot;&gt;jps -ml&lt;/code&gt;, then read 
&lt;code 
class=&quot;highlighter-rouge&quot;&gt;spark.driver.extraClassPath&lt;/code&gt; 
of the &lt;strong&gt;Driver&lt;/strong&gt; with &lt;code 
class=&quot;highlighter-rouge&quot;&gt;jinfo ${PID}&lt;/code&gt;. Alternatively, 
you can read this value from /etc/spark/conf/spark-defaults.conf.&lt;/p&gt;
+
+    &lt;div class=&quot;highlighter-rouge&quot;&gt;&lt;pre 
class=&quot;highlight&quot;&gt;&lt;code&gt;jps -ml | grep SparkSubmit
+jinfo ${PID} | grep &quot;spark.driver.extraClassPath&quot;
+&lt;/code&gt;&lt;/pre&gt;
+    &lt;/div&gt;
+
+    &lt;p&gt;&lt;img 
src=&quot;/images/blog/kylin4_support_aws_glue/8_kylin_start_up_script_en.png&quot;
 alt=&quot;&quot; /&gt;&lt;/p&gt;
+  &lt;/li&gt;
+  &lt;li&gt;Edit &lt;code 
class=&quot;highlighter-rouge&quot;&gt;bin/kylin.sh&lt;/code&gt;, modify 
&lt;code 
class=&quot;highlighter-rouge&quot;&gt;KYLIN_TOMCAT_CLASSPATH&lt;/code&gt;  and 
add &lt;code 
class=&quot;highlighter-rouge&quot;&gt;kylin_driver_classpath&lt;/code&gt;; 
save bin/kylin.sh, then exit Spark SQL CLI.&lt;/li&gt;
+&lt;/ol&gt;
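+
+&lt;p&gt;If you take the value from /etc/spark/conf/spark-defaults.conf instead 
of a running process, a sketch like the following can extract it. The function 
name get_driver_classpath is hypothetical, and the sketch assumes each line of 
the file has the form key, whitespace, value, with no spaces inside the 
classpath.&lt;/p&gt;
+

```shell
# Hypothetical helper: print the value of spark.driver.extraClassPath
# from a Spark properties file (defaults to the path used on EMR).
get_driver_classpath() {
  conf="${1:-/etc/spark/conf/spark-defaults.conf}"
  # The first field is the property name, the second its value;
  # stop at the first match.
  awk '$1 == "spark.driver.extraClassPath" { print $2; exit }' "$conf"
}
```

+
+&lt;p&gt;The printed value is the one to use for kylin_driver_classpath in 
bin/kylin.sh.&lt;/p&gt;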
+
+&lt;ul&gt;
+  &lt;li&gt;kylin.sh before modifying&lt;/li&gt;
+&lt;/ul&gt;
+
+&lt;p&gt;&lt;img 
src=&quot;/images/blog/kylin4_support_aws_glue/9_kylin_start_up_script_en.png&quot;
 alt=&quot;&quot; /&gt;&lt;/p&gt;
+
+&lt;ul&gt;
+  &lt;li&gt;For EMR 6.5.0, in the modified &lt;code 
class=&quot;highlighter-rouge&quot;&gt;kylin.sh&lt;/code&gt;, &lt;code 
class=&quot;highlighter-rouge&quot;&gt;kylin_driver_classpath&lt;/code&gt; is 
at the end of the code.&lt;/li&gt;
+&lt;/ul&gt;
+
+&lt;p&gt;&lt;img 
src=&quot;/images/blog/kylin4_support_aws_glue/10_kylin_start_up_script_en.png&quot;
 alt=&quot;&quot; /&gt;&lt;/p&gt;
+
+&lt;ul&gt;
+  &lt;li&gt;For EMR 5.33.1, in the modified &lt;code 
class=&quot;highlighter-rouge&quot;&gt;kylin.sh&lt;/code&gt;, &lt;code 
class=&quot;highlighter-rouge&quot;&gt;kylin_driver_classpath&lt;/code&gt; is 
placed before &lt;code 
class=&quot;highlighter-rouge&quot;&gt;$SPARK_HOME/jars&lt;/code&gt;.&lt;/li&gt;
+&lt;/ul&gt;
+
+&lt;p&gt;&lt;img 
src=&quot;/images/blog/kylin4_support_aws_glue/11_kylin_start_up_script_en.png&quot;
 alt=&quot;&quot; /&gt;&lt;/p&gt;
+
+&lt;h3 id=&quot;configure-kylin&quot;&gt;Configure Kylin&lt;/h3&gt;
+
+&lt;div class=&quot;highlighter-rouge&quot;&gt;&lt;pre 
class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span 
class=&quot;nb&quot;&gt;cd&lt;/span&gt; &lt;span 
class=&quot;nv&quot;&gt;$KYLIN_HOME&lt;/span&gt;
+vim conf/kylin.properties 
+&lt;/code&gt;&lt;/pre&gt;
+&lt;/div&gt;
+
+&lt;h4 id=&quot;minimal-kylin-configuration&quot;&gt;Minimal Kylin 
Configuration&lt;/h4&gt;
+
+&lt;table&gt;
+  &lt;thead&gt;
+    &lt;tr&gt;
+      &lt;th&gt;Property Key&lt;/th&gt;
+      &lt;th&gt;Property Value (Example)&lt;/th&gt;
+      &lt;th&gt;Notes&lt;/th&gt;
+    &lt;/tr&gt;
+  &lt;/thead&gt;
+  &lt;tbody&gt;
+    &lt;tr&gt;
+      &lt;td&gt;kylin.metadata.url&lt;/td&gt;
+      
&lt;td&gt;kylin4_on_cloud@jdbc,url=jdbc:mysql://${HOSTNAME}:3306/hue,username=hive,password=${PASSWORD},maxActive=10,maxIdle=10,driverClassName=org.mariadb.jdbc.Driver&lt;/td&gt;
+      &lt;td&gt;N/A&lt;/td&gt;
+    &lt;/tr&gt;
+    &lt;tr&gt;
+      &lt;td&gt;kylin.env.zookeeper-connect-string&lt;/td&gt;
+      &lt;td&gt;${HOSTNAME}&lt;/td&gt;
+      &lt;td&gt;N/A&lt;/td&gt;
+    &lt;/tr&gt;
+    &lt;tr&gt;
+      &lt;td&gt;kylin.engine.spark-conf.spark.driver.extraClassPath&lt;/td&gt;
+      
&lt;td&gt;/usr/lib/hadoop-lzo/lib/*:/usr/lib/hadoop/hadoop-aws.jar:/usr/share/aws/aws-java-sdk/*:/usr/share/aws/emr/goodies/lib/emr-spark-goodies.jar:/usr/share/aws/emr/security/conf:/usr/share/aws/emr/security/lib/*:/usr/share/aws/hmclient/lib/aws-glue-datacatalog-spark-client.jar:/usr/share/java/Hive-JSON-Serde/hive-openx-serde.jar:/usr/share/aws/sagemaker-spark-sdk/lib/sagemaker-spark-sdk.jar:/usr/share/aws/emr/s3select/lib/emr-s3-select-spark-connector.jar&lt;/td&gt;
+      &lt;td&gt;Copied from spark.driver.extraClassPath in 
/etc/spark/conf/spark-defaults.conf&lt;/td&gt;
+    &lt;/tr&gt;
+  &lt;/tbody&gt;
+&lt;/table&gt;
+
+&lt;h3 id=&quot;start-kylin-and-verify-the-building-job&quot;&gt;Start Kylin 
and verify the building job&lt;/h3&gt;
+
+&lt;h4 id=&quot;start-kylin&quot;&gt;Start Kylin&lt;/h4&gt;
+
+&lt;div class=&quot;highlighter-rouge&quot;&gt;&lt;pre 
class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span 
class=&quot;nb&quot;&gt;cd&lt;/span&gt; &lt;span 
class=&quot;nv&quot;&gt;$KYLIN_HOME&lt;/span&gt;
+ln -s spark-aws spark &lt;span class=&quot;c&quot;&gt;# skip this step if the soft link &#39;spark&#39; already exists&lt;/span&gt;
+bin/kylin.sh restart
+&lt;/code&gt;&lt;/pre&gt;
+&lt;/div&gt;
+
+&lt;p&gt;&lt;img 
src=&quot;/images/blog/kylin4_support_aws_glue/12_start_kylin_en.png&quot; 
alt=&quot;&quot; /&gt;&lt;/p&gt;
+
+&lt;p&gt;&lt;img 
src=&quot;/images/blog/kylin4_support_aws_glue/13_start_kylin_en.png&quot; 
alt=&quot;&quot; /&gt;&lt;/p&gt;
+
+&lt;h4 id=&quot;optional-replace-kylin-spark-enginejar&quot;&gt;(Optional) 
Replace kylin-spark-engine.jar&lt;/h4&gt;
+
+&lt;blockquote&gt;
+  &lt;p&gt;This step is only required for Kylin 4.0.1 users.&lt;/p&gt;
+&lt;/blockquote&gt;
+
+&lt;div class=&quot;highlighter-rouge&quot;&gt;&lt;pre 
class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span 
class=&quot;nb&quot;&gt;cd&lt;/span&gt; &lt;span 
class=&quot;nv&quot;&gt;$KYLIN_HOME&lt;/span&gt;/tomcat/webapps/kylin/WEB-INF/lib/
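+mv kylin-spark-engine-4.0.1.jar kylin-spark-engine-4.0.1.jar.bak &lt;span 
class=&quot;c&quot;&gt;# back up the old jar&lt;/span&gt;
+cp /path/to/kylin-spark-engine-4.0.0-SNAPSHOT.jar . &lt;span class=&quot;c&quot;&gt;# /path/to is a placeholder for the directory holding the jar built earlier&lt;/span&gt;
+
+&lt;span class=&quot;nv&quot;&gt;$KYLIN_HOME&lt;/span&gt;/bin/kylin.sh restart &lt;span class=&quot;c&quot;&gt;# restart Kylin so that the new jar is loaded&lt;/span&gt;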
+&lt;/code&gt;&lt;/pre&gt;
+&lt;/div&gt;
+
+&lt;h4 id=&quot;load-aws-glue-table-and-build&quot;&gt;Load AWS Glue table and 
build&lt;/h4&gt;
+
+&lt;ul&gt;
+  &lt;li&gt;Load AWS Glue table metadata&lt;/li&gt;
+&lt;/ul&gt;
+
+&lt;p&gt;&lt;img 
src=&quot;/images/blog/kylin4_support_aws_glue/14_load_glue_meta_en.png&quot; 
alt=&quot;&quot; /&gt;&lt;/p&gt;
+
+&lt;p&gt;&lt;img 
src=&quot;/images/blog/kylin4_support_aws_glue/15_load_glue_meta_en.png&quot; 
alt=&quot;&quot; /&gt;&lt;/p&gt;
+
+&lt;ul&gt;
+  &lt;li&gt;Create Model and Cube, then trigger a building job.&lt;/li&gt;
+&lt;/ul&gt;
+
+&lt;p&gt;&lt;img 
src=&quot;/images/blog/kylin4_support_aws_glue/16_load_glue_meta_en.png&quot; 
alt=&quot;&quot; /&gt;&lt;/p&gt;
+
+&lt;h3 id=&quot;verify-the-query&quot;&gt;Verify the query&lt;/h3&gt;
+
+&lt;p&gt;Switch the Spark used by Kylin and restart Kylin.&lt;/p&gt;
+
+&lt;div class=&quot;highlighter-rouge&quot;&gt;&lt;pre 
class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span 
class=&quot;nb&quot;&gt;cd&lt;/span&gt; &lt;span 
class=&quot;nv&quot;&gt;$KYLIN_HOME&lt;/span&gt;
+rm spark &lt;span class=&quot;c&quot;&gt;# &#39;spark&#39; is a soft link that points to AWS Spark&lt;/span&gt;
+ln -s spark-apache spark &lt;span class=&quot;c&quot;&gt;# switch from AWS Spark to Apache Spark&lt;/span&gt;
+bin/kylin.sh restart
+&lt;/code&gt;&lt;/pre&gt;
+&lt;/div&gt;
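+
+&lt;p&gt;Because this switch is repeated every time you alternate between 
building and querying, it can be wrapped in a small helper. The sketch below is 
hypothetical and not part of Kylin; it assumes the Spark directories sit 
directly under $KYLIN_HOME, as in the Prepare Spark section.&lt;/p&gt;
+

```shell
# Hypothetical helper: point $KYLIN_HOME/spark at the given Spark directory.
# Usage: use_spark spark-aws      (before loading tables or building)
#        use_spark spark-apache   (before querying)
use_spark() {
  target="$1"
  if [ ! -d "${KYLIN_HOME}/${target}" ]; then
    echo "no such Spark directory: ${KYLIN_HOME}/${target}"
    return 1
  fi
  # -s: create a symbolic link, -f: replace an existing link,
  # -n: do not follow an existing link that points to a directory
  ln -sfn "$target" "${KYLIN_HOME}/spark"
}
```

+
+&lt;p&gt;Kylin still has to be restarted with bin/kylin.sh restart after each 
switch.&lt;/p&gt;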
+
+&lt;p&gt;Run a test query and confirm that it succeeds.&lt;/p&gt;
+
+&lt;p&gt;&lt;img 
src=&quot;/images/blog/kylin4_support_aws_glue/17_verify_query_en.png&quot; 
alt=&quot;&quot; /&gt;&lt;/p&gt;
+
+&lt;h2 id=&quot;discussion-and-qa&quot;&gt;Discussion and 
Q&amp;amp;A&lt;/h2&gt;
+
+&lt;h3 id=&quot;why-we-must-use-both-aws-spark-and-apache-spark&quot;&gt;Why 
must we use both AWS Spark and Apache Spark?&lt;/h3&gt;
+
+&lt;p&gt;AWS Spark has built-in support for AWS Glue, so use AWS Spark when 
loading table metadata and running build jobs. Kylin 4.0.1 officially supports 
Apache Spark, and because AWS Spark is not fully compatible with Apache Spark, 
use Apache Spark for cube queries. In short, you need to switch between AWS 
Spark and Apache Spark depending on the task (build task or query 
task).&lt;/p&gt;
+
+&lt;h3 id=&quot;why-do-users-need-to-modify-kylinsh&quot;&gt;Why do users need 
to modify kylin.sh?&lt;/h3&gt;
+
+&lt;p&gt;Kylin acts as the Spark driver and needs to load table metadata through &lt;code 
class=&quot;highlighter-rouge&quot;&gt;aws-glue-datacatalog-spark-client.jar&lt;/code&gt;,
 so you must modify kylin.sh to add the relevant jars to the classpath of the 
Kylin process.&lt;/p&gt;
+
+&lt;h3 id=&quot;if-i-faced-more-questions-where-should-i-asked&quot;&gt;If I 
have more questions, where should I ask?&lt;/h3&gt;
+
+&lt;p&gt;If you have any questions about using Kylin on AWS, please contact us 
via the mailing list (&lt;a 
href=&quot;&amp;#109;&amp;#097;&amp;#105;&amp;#108;&amp;#116;&amp;#111;:&amp;#117;&amp;#115;&amp;#101;&amp;#114;&amp;#064;&amp;#107;&amp;#121;&amp;#108;&amp;#105;&amp;#110;&amp;#046;&amp;#097;&amp;#112;&amp;#097;&amp;#099;&amp;#104;&amp;#101;&amp;#046;&amp;#111;&amp;#114;&amp;#103;&quot;&gt;&amp;#117;&amp;#115;&amp;#101;&amp;#114;&amp;#064;&amp;#107;&amp;#121;&amp;#108;&amp;#105;&amp;#110;&amp;#046;&amp;#097;&amp;#112;&amp;#097;&amp;#099;&amp;#104;&amp;#101;&amp;#046;&amp;#111;&amp;#114;&amp;#103;&lt;/a&gt;),
 and see the community page for details: &lt;a 
href=&quot;https://kylin.apache.org/community/&quot;&gt;https://kylin.apache.org/community/&lt;/a&gt;
 .&lt;/p&gt;
+</description>
+        <pubDate>Thu, 17 Mar 2022 04:00:00 -0700</pubDate>
+        
<link>http://kylin.apache.org/blog/2022/03/17/kylin4-now-supporting-aws-glue-catalog/</link>
+        <guid 
isPermaLink="true">http://kylin.apache.org/blog/2022/03/17/kylin4-now-supporting-aws-glue-catalog/</guid>
+        
+        
+        <category>blog</category>
+        
+      </item>
+    
+      <item>
         <title>The future of Apache Kylin:More powerful and easy-to-use 
OLAP</title>
         <description>&lt;h2 id=&quot;apache-kylin-today&quot;&gt;01 Apache 
Kylin Today&lt;/h2&gt;
 
@@ -287,6 +1015,137 @@ If users use cloud object storage as Kyl
       </item>
     
       <item>
+        <title>How Meituan Dominates Online Shopping with Apache Kylin</title>
+        <description>&lt;p&gt;Let’s face it, online shopping now affects 
nearly every part of our shopping lives. From ordering groceries to &lt;a 
href=&quot;https://www.carvana.com/&quot;&gt;purchasing a car&lt;/a&gt;, 
we’re living in an age of limitless choices when it comes to online commerce. 
Nowhere is this more the case than with the world’s 2nd largest consumer 
market: China.&lt;/p&gt;
+
+&lt;p&gt;Leading the online shopping revolution in China is Meituan, who since 
2016 has grown to support nearly 460 million consumers from over 2,000 
industries, regularly processing hundreds of billions of dollars in transactions. To 
support these staggering operations, Meituan has invested heavily in its data 
analytics system and employs more than 10,000 engineers to ensure a stable and 
reliable experience for their customers.&lt;/p&gt;
+
+&lt;p&gt;But the driving force behind Meituan’s success is not simply a 
robust analytics system. While the organization’s executives might think so, 
its engineers understand that it is the OLAP engine that system is built upon 
that has empowered the company to move quickly and win in the market.&lt;/p&gt;
+
+&lt;h2 
id=&quot;meituans-secret-weapon-apache-kylin&quot;&gt;&lt;strong&gt;Meituan’s 
Secret Weapon: Apache Kylin&lt;/strong&gt;&lt;/h2&gt;
+
+&lt;p&gt;Since 2016, Meituan’s technical team has relied on &lt;a 
href=&quot;https://kyligence.io/apache-kylin-overview/&quot;&gt;Apache 
Kylin&lt;/a&gt; to power their &lt;a 
href=&quot;https://kyligence.io/resources/extreme-olap-with-apache-kylin/&quot;&gt;OLAP engine&lt;/a&gt;. Apache Kylin, an open source OLAP engine built on the 
Hadoop platform, resolves complex queries at sub-second speeds through 
multidimensional precomputation, allowing for blazing-fast analysis on even the 
largest datasets.&lt;/p&gt;
+
+&lt;p&gt;However, the limitations of this open source solution became apparent 
as the company’s business grew, becoming less and less efficient as cubes and 
queries became larger and more complex. To solve this problem, the engineering 
team leveraged Kylin’s open source foundations to dig into the engine, 
understand its underlying principles, and develop an implementation strategy 
that other organizations using Kylin can adopt to greatly improve their data 
output efficiency.&lt;/p&gt;
+
+&lt;p&gt;Meituan’s technical team has graciously shared their story of this 
process below so that you can apply it toward solving your own big data 
challenges.&lt;/p&gt;
+
+&lt;h2 
id=&quot;a-global-pandemic-and-a-new-normal-for-business&quot;&gt;&lt;strong&gt;A
 Global Pandemic and a New Normal for Business&lt;/strong&gt;&lt;/h2&gt;
+
+&lt;p&gt;For the last four years, Meituan’s Qingtian sales system has served 
as the company’s data processing workhorse, handling massive amounts of daily 
sales data involving a wide range of highly complex technical scenarios. The 
stability and efficiency of this system are paramount, which is why 
Meituan’s engineers have made significant investments in optimizing the OLAP 
engine Qingtian is built upon.&lt;/p&gt;
+
+&lt;p&gt;After a thorough investigation, the team identified Apache Kylin as 
the only OLAP engine that could meet their needs and scale with anticipated 
growth. The engine was rolled out in 2016 and, over the next few years, Kylin 
played an important role in the company’s evolving data analytics 
system.&lt;/p&gt;
+
+&lt;p&gt;Growth expectations, however, turned out to be severely 
underestimated, as a global pandemic quickly drove major changes in how 
consumers shopped and how businesses sold their goods. Such a massive shift in 
online shopping led to even faster growth for Meituan as well as a nearly 
untenable amount of new business data.&lt;/p&gt;
+
+&lt;p&gt;This caused efficiency bottlenecks that even their Kylin-based system 
started to struggle with. Cube building and query performance were unable to 
keep up with these changes in consumer behavior, slowing down data analysis 
and decision-making and creating a major obstacle to improving the user 
experience.&lt;/p&gt;
+
+&lt;p&gt;Meituan’s technical team would spend the next six months carrying 
out optimizations and iterations for Kylin, including dimension pruning, model 
design, resource adaptation, and improving SLA compliance.&lt;/p&gt;
+
+&lt;h2 
id=&quot;responding-to-new-consumer-behaviors-with-apache-kylin&quot;&gt;&lt;strong&gt;Responding
 to New Consumer Behaviors with Apache Kylin&lt;/strong&gt;&lt;/h2&gt;
+
+&lt;p&gt;In order to understand the approach taken when optimizing Meituan’s 
data architecture, it’s important to understand how the business is managed. 
The company’s sales force operates with two business models – in-store 
sales and phone sales – and is then further broken down by various 
territories and corporate departments. All analytics data must be communicated 
across both business models.&lt;/p&gt;
+
+&lt;p&gt;With this in mind, Meituan engineers incorporated Kylin into their 
design of the data architecture as follows:&lt;/p&gt;
+
+&lt;p&gt;&lt;img src=&quot;/images/blog/meituan/chart-01.jpeg&quot; 
alt=&quot;&quot; /&gt;&lt;/p&gt;
+
+&lt;p&gt;Figure 3. Apache Kylin’s layer-by-layer building data flow&lt;/p&gt;
+
+&lt;p&gt;While this design addressed many of Meituan’s initial concerns 
around scalability and efficiency, continued shifts in consumer behaviors and 
the organization’s response to dramatic changes in the market put enormous 
pressure on Kylin when it came to building cubes. This led to an unsustainable 
level of consumption of both resources and time.&lt;/p&gt;
+
+&lt;p&gt;It became clear that Kylin’s MOLAP model was presenting the 
following challenges:&lt;/p&gt;
+
+&lt;ul&gt;
+  &lt;li&gt;The build process involved many steps that were highly correlated, 
making it difficult to identify the root cause of problems.&lt;/li&gt;
+  &lt;li&gt;MapReduce - instead of the more efficient Spark - was still being 
used as the build engine for historical tasks.&lt;/li&gt;
+  &lt;li&gt;The platform’s default dynamic resource adaptation method demanded 
considerable resources for small tasks. Data was sharded unnecessarily and a 
large number of small files were generated, resulting in a waste of 
resources.&lt;/li&gt;
+  &lt;li&gt;Data volumes Meituan was now having to work with were well beyond 
the original architectural plan, resulting in two hours of cube building every 
day.&lt;/li&gt;
+  &lt;li&gt;The overall SLA fulfillment rate remained lower than 
expected.&lt;/li&gt;
+&lt;/ul&gt;
+
+&lt;p&gt;Recognizing these problems, the team set a goal of improving the 
platform’s efficiency (you can see the quantitative targets below). Finding a 
solution would involve classifying Kylin’s build process, digging into how 
Kylin worked under the hood, breaking down that process, and finally 
implementing a solution.&lt;/p&gt;
+
+&lt;p&gt;&lt;img src=&quot;/images/blog/meituan/chart-02.png&quot; 
alt=&quot;&quot; /&gt;&lt;/p&gt;
+
+&lt;p&gt;Figure 4. Implementation path diagram&lt;/p&gt;
+
+&lt;h2 
id=&quot;optimization-understanding-how-apache-kylin-builds-cubes&quot;&gt;&lt;strong&gt;Optimization:
 Understanding How Apache Kylin Builds Cubes&lt;/strong&gt;&lt;/h2&gt;
+
+&lt;p&gt;Understanding the cube building process is critical for pinpointing 
efficiency and performance issues. In the case of Kylin, a solid grasp of its 
precomputation approach and its “by layer” cubing algorithm is necessary 
when formulating a solution.&lt;/p&gt;
+
+&lt;p&gt;&lt;strong&gt;Precomputation with Apache 
Kylin&lt;/strong&gt;&lt;/p&gt;
+
+&lt;p&gt;Apache Kylin generates all possible dimensional combinations and 
pre-calculates the metrics that may be used in future multidimensional 
analysis, saving the results as a cube. Metric aggregation results are saved on 
&lt;em&gt;cuboids&lt;/em&gt; (a logical branch of the cube), and during queries 
relevant cuboids are found through SQL statements, and then read and quickly 
returned as metric values.&lt;/p&gt;
+
+&lt;p&gt;&lt;img src=&quot;/images/blog/meituan/chart-03.jpeg&quot; 
alt=&quot;&quot; /&gt;&lt;/p&gt;
+
+&lt;p&gt;Figure 5. Precomputation across four dimensions example&lt;/p&gt;
+
+&lt;p&gt;&lt;strong&gt;Apache Kylin’s By-Layer Cubing 
Algorithm&lt;/strong&gt;&lt;/p&gt;
+
+&lt;p&gt;An N-dimensional cube is composed of one N-dimensional sub-cube, N 
(N-1)-dimensional sub-cubes, N*(N-1)/2 (N-2)-dimensional sub-cubes, …, N 
1-dimensional sub-cubes, and one 0-dimensional sub-cube, for a total of 2^N 
sub-cubes. In Kylin’s by-layer cubing algorithm, the number of dimensions 
decreases with each layer, and each layer is calculated from the result of its 
parent layer (except the first layer, which is calculated from the source 
data).&lt;/p&gt;
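+
+&lt;p&gt;As a quick check of the counts above, the following sketch prints the 
number of sub-cubes in each layer for a given N. It only illustrates the 
arithmetic; the function name cuboid_layers is hypothetical, and nothing here 
reflects Kylin's actual build code.&lt;/p&gt;
+

```shell
# Print the number of sub-cubes in each layer of an N-dimensional cube,
# from the N-dimensional layer down to the 0-dimensional one.
cuboid_layers() {
  n="$1"
  k="$n"
  count=1   # C(n, n) = 1: the single N-dimensional sub-cube
  total=0
  while [ "$k" -ge 0 ]; do
    echo "${k}-dimensional sub-cubes: ${count}"
    total=$((total + count))
    # C(n, k-1) = C(n, k) * k / (n - k + 1); the division is always exact
    count=$((count * k / (n - k + 1)))
    k=$((k - 1))
  done
  echo "total: ${total}"
}
```

+
+&lt;p&gt;For N = 4, it reports 1, 4, 6, 4, and 1 sub-cubes for the 4- through 
0-dimensional layers, 16 in total, which matches 2^4.&lt;/p&gt;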
+
+&lt;p&gt;&lt;img src=&quot;/images/blog/meituan/chart-04.png&quot; 
alt=&quot;&quot; /&gt;&lt;/p&gt;
+
+&lt;p&gt;Figure 6. Cuboid example&lt;/p&gt;
+
+&lt;h2 id=&quot;the-proof-is-in-the-process&quot;&gt;&lt;strong&gt;The Proof 
Is in the Process&lt;/strong&gt;&lt;/h2&gt;
+
+&lt;p&gt;Understanding the principles outlined above, the Meituan team 
identified five key areas to focus on for optimization: engine selection, data 
reading, dictionary building, layer-by-layer build, and file conversion. 
Addressing these areas would lead to the greatest gains in reducing the 
required resources for calculation and shortening processing time.&lt;/p&gt;
+
+&lt;p&gt;The team outlined the challenges, their solutions, and key objectives 
in the following table:&lt;/p&gt;
+
+&lt;p&gt;&lt;img src=&quot;/images/blog/meituan/chart-05.jpeg&quot; 
alt=&quot;&quot; /&gt;&lt;/p&gt;
+
+&lt;p&gt;Figure 7. Breakdown of Apache Kylin’s process&lt;/p&gt;
+
+&lt;h2 
id=&quot;putting-apache-kylin-to-the-test&quot;&gt;&lt;strong&gt;Putting Apache 
Kylin to the Test&lt;/strong&gt;&lt;/h2&gt;
+
+&lt;p&gt;With their solutions in place, the next step was to test if Kylin’s 
build process had actually improved. To do this, the team selected a set of 
critical sales tasks and ran a pilot (outlined below):&lt;/p&gt;
+
+&lt;p&gt;&lt;img src=&quot;/images/blog/meituan/chart-06.jpeg&quot; 
alt=&quot;&quot; /&gt;&lt;/p&gt;
+
+&lt;p&gt;Figure 8. Meituan’s pilot program for their Apache Kylin 
optimizations&lt;/p&gt;
+
+&lt;p&gt;The results of the pilot were astonishing. Ultimately, the team was 
able to realize a significant reduction in resource consumption as seen in the 
following chart:&lt;/p&gt;
+
+&lt;p&gt;&lt;img src=&quot;/images/blog/meituan/chart-07.jpeg&quot; 
alt=&quot;&quot; /&gt;&lt;/p&gt;
+
+&lt;p&gt;Figure 9. Resource usage and performance of Apache Kylin before and 
after pilot&lt;/p&gt;
+
+&lt;h2 id=&quot;analytics-optimized&quot;&gt;&lt;strong&gt;Analytics 
Optimized&lt;/strong&gt;&lt;/h2&gt;
+
+&lt;p&gt;Today, Meituan’s Qingtian system is processing over 20 different 
Kylin tasks, and after six months of constant optimization, the monthly CU 
usage for Kylin’s resource queue and the CU usage for pending tasks have seen 
significant reductions.&lt;/p&gt;
+
+&lt;p&gt;&lt;img src=&quot;/images/blog/meituan/chart-08.jpeg&quot; 
alt=&quot;&quot; /&gt;&lt;/p&gt;
+
+&lt;p&gt;Figure 10. Current performance of Apache Kylin after solution 
implementation&lt;/p&gt;
+
+&lt;p&gt;Resource usage isn’t the only area of impressive improvement. The 
Qingtian system’s SLA compliance also reached 100% as of June 
2020.&lt;/p&gt;
+
+&lt;p&gt;&lt;img src=&quot;/images/blog/meituan/chart-09.jpeg&quot; 
alt=&quot;&quot; /&gt;&lt;/p&gt;
+
+&lt;p&gt;Figure 11. Meituan SLA compliance after Apache Kylin 
optimization&lt;/p&gt;
+
+&lt;h2 
id=&quot;taking-on-the-future-with-apache-kylin&quot;&gt;&lt;strong&gt;Taking 
on the Future with Apache Kylin&lt;/strong&gt;&lt;/h2&gt;
+
+&lt;p&gt;Over the past four years, Meituan’s technical team has accumulated 
a great deal of experience in optimizing query performance and build efficiency 
with Apache Kylin. But Meituan’s success is also the story of open source’s 
success.&lt;/p&gt;
+
+&lt;p&gt;The &lt;a href=&quot;http://kylin.apache.org/community/&quot;&gt;Apache 
Kylin community&lt;/a&gt; has many active and outstanding code 
contributors (&lt;a 
href=&quot;https://kyligence.io/comparing-kylin-vs-kyligence/&quot;&gt;including
 Kyligence&lt;/a&gt;), who are relentlessly working to expand the Kylin 
ecosystem and add more new features. It’s in sharing success stories like 
this that Apache Kylin is able to remain the leading open source solution for 
analytics on massive datasets.&lt;/p&gt;
+
+&lt;p&gt;Together, with the entire Apache Kylin community, Meituan is making 
sure critical analytics work can remain unburdened by growing datasets, and 
that when the next major shift in business takes place, industry leaders like 
Meituan will be able to analyze what’s happening and quickly take 
action.&lt;/p&gt;
+</description>
+        <pubDate>Tue, 03 Aug 2021 08:00:00 -0700</pubDate>
+        
<link>http://kylin.apache.org/blog/2021/08/03/How-Meituan-Dominates-Online-Shopping-with-Apache-Kylin/</link>
+        <guid 
isPermaLink="true">http://kylin.apache.org/blog/2021/08/03/How-Meituan-Dominates-Online-Shopping-with-Apache-Kylin/</guid>
+        
+        
+        <category>blog</category>
+        
+      </item>
+    
+      <item>
         <title>Practice and Optimization of Kylin in Meituan's In-Store Dining Business</title>
         
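<description>&lt;p&gt;Since 2016, Meituan's in-store dining technical team has used Apache 
Kylin as its OLAP engine. As the business grew rapidly, however, efficiency 
problems emerged in both building and querying. The team therefore started from 
an analysis of the underlying principles, broke the process down layer by 
layer, and drew up an implementation roadmap that moves from individual points 
to the whole. This article summarizes some of the experience and lessons 
learned, in the hope of helping more technical teams in the industry improve 
the efficiency of their data output.&lt;/p&gt;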
 
@@ -516,137 +1375,6 @@ If users use cloud object storage as Kyl
       </item>
     
       <item>
-        <title>How Meituan Dominates Online Shopping with Apache Kylin</title>
-        <description>&lt;p&gt;Let’s face it, online shopping now affects 
nearly every part of our shopping lives. From ordering groceries to &lt;a 
href=&quot;https://www.carvana.com/&quot;&gt;purchasing a car&lt;/a&gt;, 
we’re living in an age of limitless choices when it comes to online commerce. 
Nowhere is this more the case than with the world’s 2nd largest consumer 
market: China.&lt;/p&gt;
-
-&lt;p&gt;Leading the online shopping revolution in China is Meituan, who since 
2016 has grown to support nearly 460 million consumers from over 2,000 
industries, regularly processing hundreds of billions of dollars in transactions. To 
support these staggering operations, Meituan has invested heavily in its data 
analytics system and employs more than 10,000 engineers to ensure a stable and 
reliable experience for their customers.&lt;/p&gt;
-
-&lt;p&gt;But the driving force behind Meituan’s success is not simply a 
robust analytics system. While the organization’s executives might think so, 
its engineers understand that it is the OLAP engine underlying that system 
that has empowered the company to move quickly and win in the market.&lt;/p&gt;
-
-&lt;h2 
id=&quot;meituans-secret-weapon-apache-kylin&quot;&gt;&lt;strong&gt;Meituan’s 
Secret Weapon: Apache Kylin&lt;/strong&gt;&lt;/h2&gt;
-
-&lt;p&gt;Since 2016, Meituan’s technical team has relied on &lt;a 
href=&quot;https://kyligence.io/apache-kylin-overview/&quot;&gt;Apache 
Kylin&lt;/a&gt; to power their &lt;a 
href=&quot;https://kyligence.io/resources/extreme-olap-with-apache-kylin/&quot;&gt;OLAP 
engine&lt;/a&gt;. Apache Kylin, an open source OLAP engine built on the 
Hadoop platform, resolves complex queries at sub-second speeds through 
multidimensional precomputation, allowing for blazing-fast analysis on even the 
largest datasets.&lt;/p&gt;
-
-&lt;p&gt;However, the limitations of this open source solution became apparent 
as the company’s business grew: the engine became less and less efficient as cubes and 
queries became larger and more complex. To solve this problem, the engineering 
team leveraged Kylin’s open source foundations to dig into the engine, 
understand its underlying principles, and develop an implementation strategy 
that other organizations using Kylin can adopt to greatly improve their data 
output efficiency.&lt;/p&gt;
-
-&lt;p&gt;Meituan’s technical team has graciously shared their story of this 
process below so that you can apply it toward solving your own big data 
challenges.&lt;/p&gt;
-
-&lt;h2 
id=&quot;a-global-pandemic-and-a-new-normal-for-business&quot;&gt;&lt;strong&gt;A
 Global Pandemic and a New Normal for Business&lt;/strong&gt;&lt;/h2&gt;
-
-&lt;p&gt;For the last four years, Meituan’s Qingtian sales system has served 
as the company’s data processing workhorse, handling massive amounts of daily 
sales data involving a wide range of highly complex technical scenarios. The 
stability and efficiency of this system are paramount, which is why 
Meituan’s engineers have made significant investments in optimizing the OLAP 
engine Qingtian is built upon.&lt;/p&gt;
-
-&lt;p&gt;After a thorough investigation, the team identified Apache Kylin as 
the only OLAP engine that could meet their needs and scale with anticipated 
growth. The engine was rolled out in 2016 and, over the next few years, Kylin 
played an important role in the company’s evolving data analytics 
system.&lt;/p&gt;
-
-&lt;p&gt;Growth expectations, however, turned out to be severely 
underestimated, as a global pandemic quickly drove major changes in how 
consumers shopped and how businesses sold their goods. Such a massive shift in 
online shopping led to even faster growth for Meituan as well as a nearly 
untenable amount of new business data.&lt;/p&gt;
-
-&lt;p&gt;This caused efficiency bottlenecks that even their Kylin-based system 
started to struggle with. Cube building and query performance were unable to 
keep up with these changes in consumer behaviors, slowing down data analysis 
and decision-making and creating a major obstacle to improving the user 
experience.&lt;/p&gt;
-
-&lt;p&gt;Meituan’s technical team would spend the next six months carrying 
out optimizations and iterations for Kylin, including dimension pruning, model 
design, resource adaptation, and improving SLA compliance.&lt;/p&gt;
-
-&lt;h2 
id=&quot;responding-to-new-consumer-behaviors-with-apache-kylin&quot;&gt;&lt;strong&gt;Responding
 to New Consumer Behaviors with Apache Kylin&lt;/strong&gt;&lt;/h2&gt;
-
-&lt;p&gt;In order to understand the approach taken when optimizing Meituan’s 
data architecture, it’s important to understand how the business is managed. 
The company’s sales force operates with two business models (in-store 
sales and phone sales) and is then further broken down by various 
territories and corporate departments. All analytics data must be communicated 
across both business models.&lt;/p&gt;
-
-&lt;p&gt;With this in mind, Meituan engineers incorporated Kylin into their 
design of the data architecture as follows:&lt;/p&gt;
-
-&lt;p&gt;&lt;img src=&quot;/images/blog/meituan/chart-01.jpeg&quot; 
alt=&quot;&quot; /&gt;&lt;/p&gt;
-
-&lt;p&gt;Figure 3. Apache Kylin’s layer-by-layer building data flow&lt;/p&gt;
-
-&lt;p&gt;While this design addressed many of Meituan’s initial concerns 
around scalability and efficiency, continued shifts in consumer behaviors and 
the organization’s response to dramatic changes in the market put enormous 
pressure on Kylin when it came to building cubes. This led to an unsustainable 
level of consumption of both resources and time.&lt;/p&gt;
-
-&lt;p&gt;It became clear that Kylin’s MOLAP model was presenting the 
following challenges:&lt;/p&gt;
-
-&lt;ul&gt;
-  &lt;li&gt;The build process involved many steps that were highly correlated, 
making it difficult to find the root cause of problems.&lt;/li&gt;
-  &lt;li&gt;MapReduce - instead of the more efficient Spark - was still being 
used as the build engine for historical tasks.&lt;/li&gt;
-  &lt;li&gt;The platform’s default dynamic resource adaptation method demanded 
considerable resources for small tasks. Data was sharded unnecessarily and a 
large number of small files were generated, resulting in a waste of 
resources.&lt;/li&gt;
-  &lt;li&gt;Data volumes Meituan was now having to work with were well beyond 
the original architectural plan, resulting in two hours of cube building every 
day.&lt;/li&gt;
-  &lt;li&gt;The overall SLA fulfillment rate remained lower than 
expected.&lt;/li&gt;
-&lt;/ul&gt;
-
-&lt;p&gt;Recognizing these problems, the team set a goal of improving the 
platform’s efficiency (you can see the quantitative targets below). Finding a 
solution would involve classifying Kylin’s build process, digging into how 
Kylin worked under the hood, breaking down that process, and finally 
implementing a solution.&lt;/p&gt;
-
-&lt;p&gt;&lt;img src=&quot;/images/blog/meituan/chart-02.png&quot; 
alt=&quot;&quot; /&gt;&lt;/p&gt;
-
-&lt;p&gt;Figure 4. Implementation path diagram&lt;/p&gt;
-
-&lt;h2 
id=&quot;optimization-understanding-how-apache-kylin-builds-cubes&quot;&gt;&lt;strong&gt;Optimization:
 Understanding How Apache Kylin Builds Cubes&lt;/strong&gt;&lt;/h2&gt;
-
-&lt;p&gt;Understanding the cube building process is critical for pinpointing 
efficiency and performance issues. In the case of Kylin, a solid grasp of its 
precomputation approach and its “by layer” cubing algorithm is necessary 
when formulating a solution.&lt;/p&gt;
-
-&lt;p&gt;&lt;strong&gt;Precomputation with Apache 
Kylin&lt;/strong&gt;&lt;/p&gt;
-
-&lt;p&gt;Apache Kylin generates all possible dimensional combinations and 
pre-calculates the metrics that may be used in future multidimensional 
analysis, saving the results as a cube. Metric aggregation results are saved on 
&lt;em&gt;cuboids&lt;/em&gt; (a logical branch of the cube), and during queries 
relevant cuboids are found through SQL statements, and then read and quickly 
returned as metric values.&lt;/p&gt;
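&lt;p&gt;To make the idea of “all possible dimensional combinations” concrete, here is a minimal Python sketch. It is not Kylin code: the function name and the four dimension names are hypothetical, chosen only to mirror the four-dimension example in the figure that follows.&lt;/p&gt;

```python
from itertools import combinations


def enumerate_cuboids(dimensions):
    """Return every subset of the dimensions, i.e. every cuboid of the cube."""
    cuboids = []
    # Walk from the full dimension set down to the empty (grand total) cuboid.
    for size in range(len(dimensions), -1, -1):
        cuboids.extend(combinations(dimensions, size))
    return cuboids


# Hypothetical dimension names, used only for illustration.
dimensions = ["region", "department", "channel", "day"]
cuboids = enumerate_cuboids(dimensions)
print(len(cuboids))
```

&lt;p&gt;With four dimensions the sketch lists 16 cuboids, from the full four-dimension combination down to the empty grand-total combination.&lt;/p&gt;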
-
-&lt;p&gt;&lt;img src=&quot;/images/blog/meituan/chart-03.jpeg&quot; 
alt=&quot;&quot; /&gt;&lt;/p&gt;
-
-&lt;p&gt;Figure 5. Precomputation across four dimensions example&lt;/p&gt;
-
-&lt;p&gt;&lt;strong&gt;Apache Kylin’s By-Layer Cubing 
Algorithm&lt;/strong&gt;&lt;/p&gt;
-
-&lt;p&gt;An N-dimensional cube is composed of one N-dimensional sub-cube, N 
(N-1)-dimensional sub-cubes, N*(N-1)/2 (N-2)-dimensional sub-cubes, …, N 
1-dimensional sub-cubes, and one 0-dimensional sub-cube, for a total 
of 2^N sub-cubes. In Kylin’s by-layer cubing algorithm, the number of 
dimensions decreases with the calculation of each layer, and each layer’s 
calculation is based on the calculation result of its parent layer (except the 
first layer, which is calculated from the source data).&lt;/p&gt;
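&lt;p&gt;The by-layer idea can be sketched in a few lines of Python. This is a simplified in-memory illustration, not Kylin’s implementation: it assumes a single SUM measure, identifies dimensions by position, and picks the first available parent cuboid, all of which are assumptions made for the example. In Kylin each layer runs as a distributed job.&lt;/p&gt;

```python
from collections import defaultdict
from itertools import combinations


def build_by_layer(rows, n_dims):
    """Build every cuboid layer by layer, summing one measure.

    rows is a list of (dimension_values, measure) pairs. Each cuboid is
    keyed by the tuple of dimension positions it keeps.
    """
    # First layer: the base cuboid is aggregated directly from the source data.
    base = tuple(range(n_dims))
    base_cuboid = defaultdict(int)
    for key, measure in rows:
        base_cuboid[key] += measure
    cube = {base: dict(base_cuboid)}

    # Every later layer drops one dimension and reads only its parent layer.
    for size in range(n_dims - 1, -1, -1):
        for child in combinations(range(n_dims), size):
            # Assumption: any parent with one extra dimension will do.
            parent = next(
                p for p in cube
                if len(p) == size + 1 and set(child).issubset(p)
            )
            positions = [parent.index(d) for d in child]
            cuboid = defaultdict(int)
            for key, measure in cube[parent].items():
                cuboid[tuple(key[i] for i in positions)] += measure
            cube[child] = dict(cuboid)
    return cube


# Hypothetical sales rows: (city, channel) and an amount.
rows = [(("bj", "phone"), 10), (("bj", "store"), 5), (("sh", "phone"), 7)]
cube = build_by_layer(rows, 2)
print(len(cube), cube[()])
```

&lt;p&gt;Because each child cuboid is derived from an already aggregated parent rather than from the raw rows, the amount of data read shrinks from layer to layer.&lt;/p&gt;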
-
-&lt;p&gt;&lt;img src=&quot;/images/blog/meituan/chart-04.png&quot; 
alt=&quot;&quot; /&gt;&lt;/p&gt;
-
-&lt;p&gt;Figure 6. Cuboid example&lt;/p&gt;
-
-&lt;h2 id=&quot;the-proof-is-in-the-process&quot;&gt;&lt;strong&gt;The Proof 
Is in the Process&lt;/strong&gt;&lt;/h2&gt;
-
-&lt;p&gt;Understanding the principles outlined above, the Meituan team 
identified five key areas to focus on for optimization: engine selection, data 
reading, dictionary building, layer-by-layer build, and file conversion. 
Addressing these areas would lead to the greatest gains in reducing the 
required resources for calculation and shortening processing time.&lt;/p&gt;
-
-&lt;p&gt;The team outlined the challenges, their solutions, and key objectives 
in the following table:&lt;/p&gt;
-
-&lt;p&gt;&lt;img src=&quot;/images/blog/meituan/chart-05.jpeg&quot; 
alt=&quot;&quot; /&gt;&lt;/p&gt;
-
-&lt;p&gt;Figure 7. Breakdown of Apache Kylin’s process&lt;/p&gt;
-
-&lt;h2 
id=&quot;putting-apache-kylin-to-the-test&quot;&gt;&lt;strong&gt;Putting Apache 
Kylin to the Test&lt;/strong&gt;&lt;/h2&gt;
-
-&lt;p&gt;With their solutions in place, the next step was to test if Kylin’s 
build process had actually improved. To do this, the team selected a set of 
critical sales tasks and ran a pilot (outlined below):&lt;/p&gt;
-
-&lt;p&gt;&lt;img src=&quot;/images/blog/meituan/chart-06.jpeg&quot; 
alt=&quot;&quot; /&gt;&lt;/p&gt;
-
-&lt;p&gt;Figure 8. Meituan’s pilot program for their Apache Kylin 
optimizations&lt;/p&gt;
-
-&lt;p&gt;The results of the pilot were astonishing. Ultimately, the team was 
able to realize a significant reduction in resource consumption as seen in the 
following chart:&lt;/p&gt;
-
-&lt;p&gt;&lt;img src=&quot;/images/blog/meituan/chart-07.jpeg&quot; 
alt=&quot;&quot; /&gt;&lt;/p&gt;
-
-&lt;p&gt;Figure 9. Resource usage and performance of Apache Kylin before and 
after pilot&lt;/p&gt;
-
-&lt;h2 id=&quot;analytics-optimized&quot;&gt;&lt;strong&gt;Analytics 
Optimized&lt;/strong&gt;&lt;/h2&gt;
-
-&lt;p&gt;Today, Meituan’s Qingtian system is processing over 20 different 
Kylin tasks, and after six months of constant optimization, the monthly CU 
usage for Kylin’s resource queue and the CU usage for pending tasks have seen 
significant reductions.&lt;/p&gt;
-
-&lt;p&gt;&lt;img src=&quot;/images/blog/meituan/chart-08.jpeg&quot; 
alt=&quot;&quot; /&gt;&lt;/p&gt;
-
-&lt;p&gt;Figure 10. Current performance of Apache Kylin after solution 
implementation&lt;/p&gt;
-
-&lt;p&gt;Resource usage isn’t the only area of impressive improvement. The 
Qingtian system’s SLA compliance also was able to reach 100% as of June 
2020.&lt;/p&gt;
-
-&lt;p&gt;&lt;img src=&quot;/images/blog/meituan/chart-09.jpeg&quot; 
alt=&quot;&quot; /&gt;&lt;/p&gt;
-
-&lt;p&gt;Figure 11. Meituan SLA compliance after Apache Kylin 
optimization&lt;/p&gt;
-
-&lt;h2 
id=&quot;taking-on-the-future-with-apache-kylin&quot;&gt;&lt;strong&gt;Taking 
on the Future with Apache Kylin&lt;/strong&gt;&lt;/h2&gt;
-
-&lt;p&gt;Over the past four years, Meituan’s technical team has accumulated 
a great deal of experience in optimizing query performance and build efficiency 
with Apache Kylin. But Meituan’s success is also the story of open source’s 
success.&lt;/p&gt;
-
-&lt;p&gt;The &lt;a href=&quot;http://kylin.apache.org/community/&quot;&gt;Apache 
Kylin community&lt;/a&gt; has many active and outstanding code 
contributors (&lt;a 
href=&quot;https://kyligence.io/comparing-kylin-vs-kyligence/&quot;&gt;including 
Kyligence&lt;/a&gt;), who are relentlessly working to expand the Kylin 
ecosystem and add more new features. It’s in sharing success stories like 
this that Apache Kylin is able to remain the leading open source solution for 
analytics on massive datasets.&lt;/p&gt;
-
-&lt;p&gt;Together, with the entire Apache Kylin community, Meituan is making 
sure critical analytics work can remain unburdened by growing datasets, and 
that when the next major shift in business takes place, industry leaders like 
Meituan will be able to analyze what’s happening and quickly take 
action.&lt;/p&gt;
-</description>
-        <pubDate>Tue, 03 Aug 2021 08:00:00 -0700</pubDate>
-        
<link>http://kylin.apache.org/blog/2021/08/03/How-Meituan-Dominates-Online-Shopping-with-Apache-Kylin/</link>
-        <guid 
isPermaLink="true">http://kylin.apache.org/blog/2021/08/03/How-Meituan-Dominates-Online-Shopping-with-Apache-Kylin/</guid>
-        
-        
-        <category>blog</category>
-        
-      </item>
-    
-      <item>
         <title>Sharing the New Architecture of Apache Kylin 4</title>
         <description>&lt;p&gt;This article is divided into the following parts:&lt;br /&gt;
 - Apache Kylin use cases&lt;br /&gt;
@@ -836,314 +1564,6 @@ For example, a query joins two subquerie
         
         
         <category>blog</category>
-        
-      </item>
-    
-      <item>
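-        <title>Why Did Youzan Choose Kylin 4</title>
-        <description>&lt;p&gt;At the QCon Global Software Development 
Conference held on May 29, 2021, Zheng Shengjun, head of the data infrastructure 
platform at Youzan, spoke in the big data open source frameworks and 
applications track about how Youzan uses and optimizes Kylin 4.0 internally. 
For the many long-time Kylin users, the talk also serves as a practical guide to 
upgrading to Kylin 4.&lt;/p&gt;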
-
-&lt;p&gt;The talk is divided into the following four parts:&lt;/p&gt;
-
-&lt;ul&gt;
-  &lt;li&gt;Why Youzan chose Kylin 4&lt;/li&gt;
-  &lt;li&gt;How Kylin 4 works&lt;/li&gt;
-  &lt;li&gt;Kylin 4 performance optimization&lt;/li&gt;
-  &lt;li&gt;Kylin 4 in practice at Youzan&lt;/li&gt;
-&lt;/ul&gt;
-
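-&lt;h2 id=&quot;kylin-4-&quot;&gt;01 Why Youzan Chose Kylin 4&lt;/h2&gt;
-&lt;p&gt;First, why did Youzan choose to upgrade to Kylin 4? Here is a brief 
review of how OLAP at Youzan has evolved. In the early days, Youzan chose 
precomputation plus MySQL so that it could iterate quickly. In 2018 it 
introduced Druid for query flexibility and development efficiency, but Druid had 
problems such as a low degree of pre-aggregation and no support for exact count 
distinct or detail-level OLAP. Against this background, Youzan introduced Apache 
Kylin, which offers a high degree of aggregation, exact count distinct, and the 
lowest response time (RT), together with ClickHouse, a ROLAP engine with very 
flexible queries.&lt;/p&gt;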
-
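-&lt;p&gt;Youzan has used Kylin for more than three years since introducing it 
in 2018. As business scenarios have multiplied and data has accumulated, Youzan 
now has 6 million existing merchants, its GMV in 2020 was 107.3 billion, and its 
daily build volume exceeds 10 billion. Kylin now covers essentially all of the 
business lines at Youzan.&lt;/p&gt;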
-
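-&lt;p&gt;As Youzan has grown rapidly and used Kylin more and more deeply, we 
have also run into some challenges:&lt;br /&gt;
-- First, the build performance of Kylin on HBase did not meet our 
expectations, and build performance affects both failure recovery time and the 
stability that users experience;&lt;br /&gt;
-- Second, as more large merchants came on board (with tens of millions of 
members and hundreds of thousands of products in a single store), our queries 
faced major challenges as well. Kylin on HBase is limited by the single-node 
query model of QueryServer and cannot support these complex scenarios 
well;&lt;br /&gt;
-- Finally, because HBase is not a cloud-native system, elastic resource 
scaling is hard to achieve. As data volumes keep growing, merchant usage of the 
system has peaks and troughs over time, so average resource utilization is not 
high enough.&lt;/p&gt;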
-
-&lt;p&gt;Faced with these challenges, Youzan chose to move toward and upgrade 
to the more cloud-native Apache Kylin 4.&lt;/p&gt;
-
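-&lt;h2 id=&quot;kylin-4--1&quot;&gt;02 How Kylin 4 Works&lt;/h2&gt;
-&lt;p&gt;First, the main advantages of Kylin 4. Apache Kylin 4 builds and 
queries entirely on Spark, so it can take full advantage of Spark techniques 
such as parallelization, vectorization, and global dynamic code generation to 
make large queries more efficient.&lt;br /&gt;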

[... 282 lines stripped ...]
