Modified: kylin/site/feed.xml
URL: 
http://svn.apache.org/viewvc/kylin/site/feed.xml?rev=1901905&r1=1901904&r2=1901905&view=diff
==============================================================================
--- kylin/site/feed.xml (original)
+++ kylin/site/feed.xml Tue Jun 14 14:08:04 2022
@@ -19,8 +19,8 @@
     <description>Apache Kylin Home</description>
     <link>http://kylin.apache.org/</link>
     <atom:link href="http://kylin.apache.org/feed.xml"; rel="self" 
type="application/rss+xml"/>
-    <pubDate>Fri, 13 May 2022 06:59:22 -0700</pubDate>
-    <lastBuildDate>Fri, 13 May 2022 06:59:22 -0700</lastBuildDate>
+    <pubDate>Tue, 14 Jun 2022 06:59:09 -0700</pubDate>
+    <lastBuildDate>Tue, 14 Jun 2022 06:59:09 -0700</lastBuildDate>
     <generator>Jekyll v2.5.3</generator>
     
       <item>
@@ -298,6 +298,317 @@ FROM [covid_trip_dataset]
       </item>
     
       <item>
+        <title>Kylin on Cloud — Build A Data Analysis Platform on the Cloud 
in Two Hours Part 2</title>
+        <description>&lt;p&gt;This is the second part of this blog series. For part 1, see &lt;a href=&quot;../kylin4-on-cloud-part1/&quot;&gt;Kylin on Cloud — Build A Data Analysis Platform on the Cloud in Two Hours Part 1&lt;/a&gt;.&lt;/p&gt;
+
+&lt;h3 id=&quot;video-tutorials&quot;&gt;Video Tutorials&lt;/h3&gt;
+
+&lt;p&gt;&lt;a href=&quot;https://youtu.be/LPHxqZ-au4w&quot;&gt;Kylin on Cloud 
— Build A Data Analysis Platform on the Cloud in Two Hours Part 
2&lt;/a&gt;&lt;/p&gt;
+
+&lt;h3 id=&quot;kylin-query-cluster&quot;&gt;Kylin query cluster&lt;/h3&gt;
+
+&lt;h4 id=&quot;start-kylin-query-cluster&quot;&gt;Start Kylin query 
cluster&lt;/h4&gt;
+
+&lt;ol&gt;
+  &lt;li&gt;
+    &lt;p&gt;In the same &lt;code class=&quot;highlighter-rouge&quot;&gt;kylin_configs.yaml&lt;/code&gt; file we used for starting the build cluster, we will additionally enable MDX with the setting below:&lt;/p&gt;
+
+    &lt;div class=&quot;highlighter-rouge&quot;&gt;&lt;pre 
class=&quot;highlight&quot;&gt;&lt;code&gt;ENABLE_MDX: &amp;amp;ENABLE_MDX 
&#39;true&#39;
+&lt;/code&gt;&lt;/pre&gt;
+    &lt;/div&gt;
+  &lt;/li&gt;
+  &lt;li&gt;
+    &lt;p&gt;Then execute the deploy command to start the cluster:&lt;/p&gt;
+
+    &lt;div class=&quot;highlighter-rouge&quot;&gt;&lt;pre 
class=&quot;highlight&quot;&gt;&lt;code&gt;python deploy.py --type deploy 
--mode query
+&lt;/code&gt;&lt;/pre&gt;
+    &lt;/div&gt;
+  &lt;/li&gt;
+&lt;/ol&gt;
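+
+&lt;p&gt;Put together, the two steps above can be scripted as follows (a sketch only; it assumes you run it from the deployment tool&#39;s directory and that &lt;code class=&quot;highlighter-rouge&quot;&gt;ENABLE_MDX&lt;/code&gt; is currently set to &#39;false&#39;):&lt;/p&gt;
+
+&lt;div class=&quot;highlighter-rouge&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;# enable MDX in kylin_configs.yaml, then deploy the query cluster
+sed -i &quot;s/ENABLE_MDX: &amp;amp;ENABLE_MDX &#39;false&#39;/ENABLE_MDX: \&amp;amp;ENABLE_MDX &#39;true&#39;/&quot; kylin_configs.yaml
+python deploy.py --type deploy --mode query
+&lt;/code&gt;&lt;/pre&gt;
+&lt;/div&gt;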
+
+&lt;h4 id=&quot;query-with-kylin&quot;&gt;Query with Kylin&lt;/h4&gt;
+
+&lt;ol&gt;
+  &lt;li&gt;
+    &lt;p&gt;After the query cluster is successfully started, first execute &lt;code class=&quot;highlighter-rouge&quot;&gt;python deploy.py --type list&lt;/code&gt; to get the information of all nodes, then open &lt;code class=&quot;highlighter-rouge&quot;&gt;http://${kylin_node_public_ip}:7070/kylin&lt;/code&gt; in your browser to log in to the Kylin web UI:&lt;/p&gt;
+
+    &lt;p&gt;&lt;img 
src=&quot;/images/blog/kylin4_on_cloud/14_kylin_web_ui.png&quot; 
alt=&quot;&quot; /&gt;&lt;/p&gt;
+  &lt;/li&gt;
+  &lt;li&gt;
+    &lt;p&gt;On the Insight page, execute the same SQL we ran earlier with Spark SQL:&lt;/p&gt;
+
+    &lt;div class=&quot;highlighter-rouge&quot;&gt;&lt;pre 
class=&quot;highlight&quot;&gt;&lt;code&gt;select 
TAXI_TRIP_RECORDS_VIEW.PICKUP_DATE, NEWYORK_ZONE.BOROUGH, count(*), 
sum(TAXI_TRIP_RECORDS_VIEW.TRIP_TIME_HOUR), 
sum(TAXI_TRIP_RECORDS_VIEW.TOTAL_AMOUNT)
+from TAXI_TRIP_RECORDS_VIEW
+left join NEWYORK_ZONE
+on TAXI_TRIP_RECORDS_VIEW.PULOCATIONID = NEWYORK_ZONE.LOCATIONID
+group by TAXI_TRIP_RECORDS_VIEW.PICKUP_DATE, NEWYORK_ZONE.BOROUGH;
+&lt;/code&gt;&lt;/pre&gt;
+    &lt;/div&gt;
+
+    &lt;p&gt;&lt;img 
src=&quot;/images/blog/kylin4_on_cloud/15_query_in_kylin.png&quot; 
alt=&quot;&quot; /&gt;&lt;/p&gt;
+  &lt;/li&gt;
+&lt;/ol&gt;
+
+&lt;p&gt;As we can see, when the query hits the cube, that is, when it is answered directly by the pre-computed data, the result is returned in about 4 seconds, a great reduction from the 100+ seconds of query latency we saw with Spark SQL.&lt;/p&gt;
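+
+&lt;p&gt;If you want to script such checks instead of using the web UI, the same kind of query can be sent with curl (a sketch; the endpoint and payload follow Kylin&#39;s documented REST query API, and the SQL here is shortened for illustration):&lt;/p&gt;
+
+&lt;div class=&quot;highlighter-rouge&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;curl -X POST http://${kylin_node_public_ip}:7070/kylin/api/query \
+  -u ADMIN:KYLIN \
+  -H &#39;Content-Type: application/json&#39; \
+  -d &#39;{&quot;sql&quot;: &quot;select count(*) from TAXI_TRIP_RECORDS_VIEW&quot;, &quot;project&quot;: &quot;covid_trip_project&quot;}&#39;
+&lt;/code&gt;&lt;/pre&gt;
+&lt;/div&gt;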
+
+&lt;h3 id=&quot;pre-computation-reduces-query-cost&quot;&gt;Pre-computation 
reduces query cost&lt;/h3&gt;
+
+&lt;p&gt;In this test, we used the New York taxi order data, whose fact table contains 200+ million entries. As the result shows, Kylin significantly improves query efficiency in this big data analysis scenario involving hundreds of millions of entries. Moreover, the built data can be reused to answer thousands of subsequent queries, thereby reducing the cost per query.&lt;/p&gt;
+
+&lt;h3 id=&quot;configure-semantic-layer&quot;&gt;Configure semantic 
layer&lt;/h3&gt;
+
+&lt;h4 id=&quot;import-dataset-into-mdx-for-kylin&quot;&gt;Import Dataset into 
MDX for Kylin&lt;/h4&gt;
+
+&lt;p&gt;With &lt;code class=&quot;highlighter-rouge&quot;&gt;MDX for Kylin&lt;/code&gt;, you can create a &lt;code class=&quot;highlighter-rouge&quot;&gt;Dataset&lt;/code&gt; based on the Kylin Cube, define Cube relations, and create business metrics. To make it easy for beginners, you can directly download the dataset file from S3 and import it into &lt;code class=&quot;highlighter-rouge&quot;&gt;MDX for Kylin&lt;/code&gt;:&lt;/p&gt;
+
+&lt;ol&gt;
+  &lt;li&gt;
+    &lt;p&gt;Download the dataset to your local machine from S3.&lt;/p&gt;
+
+    &lt;div class=&quot;highlighter-rouge&quot;&gt;&lt;pre 
class=&quot;highlight&quot;&gt;&lt;code&gt;wget 
https://s3.cn-north-1.amazonaws.com.cn/public.kyligence.io/kylin/kylin_demo/covid_trip_project_covid_trip_dataset.json
+&lt;/code&gt;&lt;/pre&gt;
+    &lt;/div&gt;
+  &lt;/li&gt;
+  &lt;li&gt;
+    &lt;p&gt;Access &lt;code class=&quot;highlighter-rouge&quot;&gt;MDX for 
Kylin&lt;/code&gt; web UI&lt;/p&gt;
+
+    &lt;p&gt;Enter &lt;code class=&quot;highlighter-rouge&quot;&gt;http://${kylin_node_public_ip}:7080&lt;/code&gt; in your browser to access the &lt;code class=&quot;highlighter-rouge&quot;&gt;MDX for Kylin&lt;/code&gt; web UI and log in with the default username and password &lt;code class=&quot;highlighter-rouge&quot;&gt;ADMIN/KYLIN&lt;/code&gt;:&lt;/p&gt;
+
+    &lt;p&gt;&lt;img 
src=&quot;/images/blog/kylin4_on_cloud/16_mdx_web_ui.png&quot; alt=&quot;&quot; 
/&gt;&lt;/p&gt;
+  &lt;/li&gt;
+  &lt;li&gt;
+    &lt;p&gt;Confirm Kylin connection&lt;/p&gt;
+
+    &lt;p&gt;&lt;code class=&quot;highlighter-rouge&quot;&gt;MDX for 
Kylin&lt;/code&gt; is already configured with the information of the Kylin node 
to be connected. You only need to type in the username and password (&lt;code 
class=&quot;highlighter-rouge&quot;&gt;ADMIN/KYLIN&lt;/code&gt;) for the Kylin 
node when logging in for the first time.&lt;/p&gt;
+
+    &lt;p&gt;&lt;img 
src=&quot;/images/blog/kylin4_on_cloud/17_connect_to_kylin.png&quot; 
alt=&quot;&quot; /&gt;&lt;/p&gt;
+
+    &lt;p&gt;&lt;img 
src=&quot;/images/blog/kylin4_on_cloud/18_exit_management.png&quot; 
alt=&quot;&quot; /&gt;&lt;/p&gt;
+  &lt;/li&gt;
+  &lt;li&gt;
+    &lt;p&gt;Import Dataset&lt;/p&gt;
+
+    &lt;p&gt;After Kylin is successfully connected, click the icon in the 
upper right corner to exit the management page:&lt;/p&gt;
+
+    &lt;p&gt;&lt;img 
src=&quot;/images/blog/kylin4_on_cloud/19_kylin_running.png&quot; 
alt=&quot;&quot; /&gt;&lt;/p&gt;
+
+    &lt;p&gt;Switch to the &lt;code 
class=&quot;highlighter-rouge&quot;&gt;covid_trip_project&lt;/code&gt; project 
and click &lt;code class=&quot;highlighter-rouge&quot;&gt;Import 
Dataset&lt;/code&gt; on &lt;code 
class=&quot;highlighter-rouge&quot;&gt;Dataset&lt;/code&gt; page:&lt;/p&gt;
+
+    &lt;p&gt;&lt;img 
src=&quot;/images/blog/kylin4_on_cloud/20_import_dataset.png&quot; 
alt=&quot;&quot; /&gt;&lt;/p&gt;
+
+    &lt;p&gt;Select and import the &lt;code class=&quot;highlighter-rouge&quot;&gt;covid_trip_project_covid_trip_dataset.json&lt;/code&gt; file we just downloaded from S3.&lt;/p&gt;
+
+    &lt;p&gt;&lt;code 
class=&quot;highlighter-rouge&quot;&gt;covid_trip_dataset&lt;/code&gt; contains 
specific dimensions and measures for each atomic metric, such as YTD, MTD, 
annual growth, monthly growth, time hierarchy, and regional hierarchy; as well 
as various business metrics including COVID-19 death rate, the average speed of 
taxi trips, etc. For more information on how to manually create a dataset, see 
Create dataset in &lt;code class=&quot;highlighter-rouge&quot;&gt;MDX for 
Kylin&lt;/code&gt; or &lt;a 
href=&quot;https://kyligence.github.io/mdx-kylin/&quot;&gt;MDX for Kylin User 
Manual&lt;/a&gt;.&lt;/p&gt;
+  &lt;/li&gt;
+&lt;/ol&gt;
+
+&lt;h2 id=&quot;data-analysis-with-bi-and-excel&quot;&gt;Data analysis with BI 
and Excel&lt;/h2&gt;
+
+&lt;h3 id=&quot;data-analysis-using-tableau&quot;&gt;Data analysis using 
Tableau&lt;/h3&gt;
+
+&lt;p&gt;Let’s take Tableau installed on a local Windows machine as an 
example to connect to MDX for Kylin for data analysis.&lt;/p&gt;
+
+&lt;ol&gt;
+  &lt;li&gt;
+    &lt;p&gt;Select Tableau’s built-in &lt;code 
class=&quot;highlighter-rouge&quot;&gt;Microsoft Analysis Service&lt;/code&gt; 
to connect to &lt;code class=&quot;highlighter-rouge&quot;&gt;MDX for 
Kylin&lt;/code&gt;. (Note: Please install the &lt;a 
href=&quot;https://www.tableau.com/support/drivers?_ga=2.104833284.564621013.1647953885-1839825424.1608198275&quot;&gt;&lt;code
 class=&quot;highlighter-rouge&quot;&gt;Microsoft Analysis 
Services&lt;/code&gt; driver&lt;/a&gt; in advance, which can be downloaded from 
Tableau).&lt;/p&gt;
+
+    &lt;p&gt;&lt;img 
src=&quot;/images/blog/kylin4_on_cloud/21_tableau_connect.png&quot; 
alt=&quot;&quot; /&gt;&lt;/p&gt;
+  &lt;/li&gt;
+  &lt;li&gt;
+    &lt;p&gt;In the pop-up settings page, enter the &lt;code 
class=&quot;highlighter-rouge&quot;&gt;MDX for Kylin&lt;/code&gt; server 
address, the username and password. The server address is &lt;code 
class=&quot;highlighter-rouge&quot;&gt;http://${kylin_node_public_ip}:7080/mdx/xmla/covid_trip_project&lt;/code&gt;:&lt;/p&gt;
+
+    &lt;p&gt;&lt;img 
src=&quot;/images/blog/kylin4_on_cloud/22_tableau_server.png&quot; 
alt=&quot;&quot; /&gt;&lt;/p&gt;
+  &lt;/li&gt;
+  &lt;li&gt;
+    &lt;p&gt;Select covid_trip_dataset as the dataset:&lt;/p&gt;
+
+    &lt;p&gt;&lt;img 
src=&quot;/images/blog/kylin4_on_cloud/23_tableau_dataset.png&quot; 
alt=&quot;&quot; /&gt;&lt;/p&gt;
+  &lt;/li&gt;
+  &lt;li&gt;
+    &lt;p&gt;Then we can run data analysis with the worksheet. Since we have 
defined the business metrics with &lt;code 
class=&quot;highlighter-rouge&quot;&gt;MDX for Kylin&lt;/code&gt;, when we want 
to generate a business report with Tableau, we can directly drag the 
pre-defined business metrics into the worksheet to create a report.&lt;/p&gt;
+  &lt;/li&gt;
+  &lt;li&gt;
+    &lt;p&gt;Firstly, we will analyze the pandemic data and draw the 
national-level pandemic map with the number of confirmed cases and mortality 
rate. We only need to drag and drop &lt;code 
class=&quot;highlighter-rouge&quot;&gt;COUNTRY_SHORT_NAME&lt;/code&gt; under 
&lt;code class=&quot;highlighter-rouge&quot;&gt;REGION_HIERARCHY&lt;/code&gt; 
to the Columns field and drag and drop &lt;code class=&quot;highlighter-rouge&quot;&gt;SUM_NEW_POSITIVE_CASES&lt;/code&gt; and
&lt;code class=&quot;highlighter-rouge&quot;&gt;CFR_COVID19&lt;/code&gt; 
(fatality rate) under Measures to the Rows field, and then select to display 
the data results as a map:&lt;/p&gt;
+
+    &lt;p&gt;&lt;img 
src=&quot;/images/blog/kylin4_on_cloud/24_tableau_covid19_map.png&quot; 
alt=&quot;&quot; /&gt;&lt;/p&gt;
+
+    &lt;p&gt;The size of the symbols represents the number of confirmed COVID-19 cases and the shade of the color represents the level of the mortality rate.
According to the pandemic map, the United States and India have more confirmed 
cases, but the mortality rates in the two countries are not significantly 
different from the other countries. However, countries with much fewer 
confirmed cases, such as Peru, Vanuatu, and Mexico, have persistently high 
death rates. You can continue to explore the reasons behind this if you are 
interested.&lt;/p&gt;
+
+    &lt;p&gt;Since we have set up a regional hierarchy, we can break down the 
country-level situation to the provincial/state level to see the pandemic 
situation in different regions of each country:&lt;/p&gt;
+
+    &lt;p&gt;&lt;img 
src=&quot;/images/blog/kylin4_on_cloud/25_tableau_province.png&quot; 
alt=&quot;&quot; /&gt;&lt;/p&gt;
+
+    &lt;p&gt;Zoom in on the COVID map to see the status in each state of the 
United States:&lt;/p&gt;
+
+    &lt;p&gt;&lt;img 
src=&quot;/images/blog/kylin4_on_cloud/26_tableau_us_covid19.png&quot; 
alt=&quot;&quot; /&gt;&lt;/p&gt;
+
+    &lt;p&gt;It can be concluded that there is no significant difference in 
the mortality rate in each state of the United States, which is around 0.01. In 
terms of the number of confirmed cases, it is significantly higher in 
California, Texas, Florida, and New York. These states are economically
developed and have a large population. This might be the reason behind the 
higher number of confirmed COVID-19 cases. In the following part, we will 
combine the pandemic data with the New York taxi dataset to analyze the impact 
of the pandemic on the New York Taxi industry.&lt;/p&gt;
+  &lt;/li&gt;
+  &lt;li&gt;
+    &lt;p&gt;For the New York taxi order dataset, we want to compare the order 
numbers and travel speed in different boroughs.&lt;/p&gt;
+  &lt;/li&gt;
+&lt;/ol&gt;
+
+&lt;p&gt;Drag and drop &lt;code 
class=&quot;highlighter-rouge&quot;&gt;BOROUGH&lt;/code&gt; under &lt;code 
class=&quot;highlighter-rouge&quot;&gt;PICKUP_NEWYORK_ZONE&lt;/code&gt; to 
Columns, and drag and drop &lt;code 
class=&quot;highlighter-rouge&quot;&gt;ORDER_COUNT&lt;/code&gt; and &lt;code 
class=&quot;highlighter-rouge&quot;&gt;trip_mean_speed&lt;/code&gt; under 
Measures to Rows, and display the results as a map. The color shade represents 
the average speed and the size of the symbol represents the order number. We 
can see that taxi orders departing from Manhattan are higher than all the other 
boroughs combined, but the average speed is the lowest. Queens ranks second in 
terms of order number while Staten Island has the lowest amount of taxi 
activities. The average speed of taxi trips departing from the Bronx is 82 mph, 
several times higher than that of the other boroughs. This also reflects the 
population density and the level of economic development in different New York 
boroughs.&lt;/p&gt;
+
+&lt;p&gt;&lt;img 
src=&quot;/images/blog/kylin4_on_cloud/27_tableau_taxi_1.png&quot; 
alt=&quot;&quot; /&gt;&lt;/p&gt;
+
+&lt;p&gt;Then we will replace the field &lt;code 
class=&quot;highlighter-rouge&quot;&gt;BOROUGH&lt;/code&gt; from &lt;code 
class=&quot;highlighter-rouge&quot;&gt;PICKUP_NEWYORK_ZONE&lt;/code&gt; with 
&lt;code class=&quot;highlighter-rouge&quot;&gt;BOROUGH&lt;/code&gt; from 
&lt;code 
class=&quot;highlighter-rouge&quot;&gt;DROPOFF_NEWYORK_ZONE&lt;/code&gt;, to 
analyze the number of taxi orders and average speed by drop-off ID:&lt;/p&gt;
+
+&lt;p&gt;&lt;img 
src=&quot;/images/blog/kylin4_on_cloud/27_tableau_taxi_2.png&quot; 
alt=&quot;&quot; /&gt;&lt;/p&gt;
+
+&lt;p&gt;The pick-up and drop-off data of Brooklyn, Queens, and Bronx differ 
greatly, for example, the taxi orders to Brooklyn or Bronx are much higher than 
those departing from there, while there are much fewer taxi trips to Queens 
than those starting from it.&lt;/p&gt;
+
+&lt;ul&gt;
+  &lt;li&gt;Travel habits change after the pandemic (long-distance vs. 
short-distance travels)&lt;/li&gt;
+&lt;/ul&gt;
+
+&lt;p&gt;To see how residents’ travel habits have changed, we can analyze the average trip mileage: drag and drop the dimension &lt;code class=&quot;highlighter-rouge&quot;&gt;MONTH_START&lt;/code&gt; to Rows, and drag and drop the metric &lt;code class=&quot;highlighter-rouge&quot;&gt;trip_mean_distance&lt;/code&gt; to Columns:&lt;/p&gt;
+
+&lt;p&gt;&lt;img 
src=&quot;/images/blog/kylin4_on_cloud/28_tableau_taxi_3.png&quot; 
alt=&quot;&quot; /&gt;&lt;/p&gt;
+
+&lt;p&gt;The histogram shows a significant change in people’s travel behavior before and after the outbreak of COVID-19: the average trip mileage has increased markedly since March 2020, in some months by several times, and it fluctuates greatly from month to month. To combine these data with the pandemic data at the month level, we drag and drop &lt;code class=&quot;highlighter-rouge&quot;&gt;SUM_NEW_POSITIVE_CASES&lt;/code&gt; and &lt;code class=&quot;highlighter-rouge&quot;&gt;MTD_ORDER_COUNT&lt;/code&gt; to Rows and add &lt;code class=&quot;highlighter-rouge&quot;&gt;PROVINCE_STATE_NAME=New York&lt;/code&gt; as the filter condition:&lt;/p&gt;
+
+&lt;p&gt;&lt;img 
src=&quot;/images/blog/kylin4_on_cloud/29_tableau_taxi_4.png&quot; 
alt=&quot;&quot; /&gt;&lt;/p&gt;
+
+&lt;p&gt;It is interesting to see that the number of taxi orders decreased sharply at the beginning of the outbreak while the average trip mileage increased, indicating that people cut unnecessary short-distance trips or switched to safer means of transportation. Comparing the two curves, we can see that the severity of the pandemic and people’s travel patterns are highly correlated: taxi orders drop and average trip mileage increases when the pandemic worsens; when the situation improves, taxi orders increase while average trip mileage drops.&lt;/p&gt;
+
+&lt;h3 id=&quot;data-analysis-via-excel&quot;&gt;Data analysis via 
Excel&lt;/h3&gt;
+
+&lt;p&gt;With &lt;code class=&quot;highlighter-rouge&quot;&gt;MDX for Kylin&lt;/code&gt;, we can also use Kylin for big data analysis in Excel. In this test, we will use Excel installed on a local Windows machine to connect to MDX for Kylin.&lt;/p&gt;
+
+&lt;ol&gt;
+  &lt;li&gt;
+    &lt;p&gt;Open Excel, select &lt;code 
class=&quot;highlighter-rouge&quot;&gt;Data&lt;/code&gt; -&amp;gt; &lt;code 
class=&quot;highlighter-rouge&quot;&gt;Get Data&lt;/code&gt; -&amp;gt; &lt;code 
class=&quot;highlighter-rouge&quot;&gt;From Database&lt;/code&gt; -&amp;gt; 
&lt;code class=&quot;highlighter-rouge&quot;&gt;From Analysis 
Services&lt;/code&gt;:&lt;/p&gt;
+
+    &lt;p&gt;&lt;img 
src=&quot;/images/blog/kylin4_on_cloud/30_excel_connect.png&quot; 
alt=&quot;&quot; /&gt;&lt;/p&gt;
+  &lt;/li&gt;
+  &lt;li&gt;
+    &lt;p&gt;In the &lt;code class=&quot;highlighter-rouge&quot;&gt;Data Connection Wizard&lt;/code&gt;, enter the server name &lt;code class=&quot;highlighter-rouge&quot;&gt;http://${kylin_node_public_ip}:7080/mdx/xmla/covid_trip_project&lt;/code&gt;:&lt;/p&gt;
+
+    &lt;p&gt;&lt;img 
src=&quot;/images/blog/kylin4_on_cloud/31_excel_server.png&quot; 
alt=&quot;&quot; /&gt;&lt;/p&gt;
+
+    &lt;p&gt;&lt;img 
src=&quot;/images/blog/kylin4_on_cloud/32_tableau_dataset.png&quot; 
alt=&quot;&quot; /&gt;&lt;/p&gt;
+  &lt;/li&gt;
+  &lt;li&gt;
+    &lt;p&gt;Then create a PivotTable for this data connection. The data listed here is the same as what we saw in Tableau. So whether analysts use Tableau or Excel, they work on identical sets of data models, dimensions, and business metrics, thereby realizing unified semantics.&lt;/p&gt;
+  &lt;/li&gt;
+  &lt;li&gt;
+    &lt;p&gt;We have just created a pandemic map and run a trend analysis 
using &lt;code class=&quot;highlighter-rouge&quot;&gt;covid19&lt;/code&gt; and 
&lt;code class=&quot;highlighter-rouge&quot;&gt;newyork_trip_data&lt;/code&gt; 
with Tableau. In Excel, we can check more details for the same datasets and 
data scenarios.&lt;/p&gt;
+  &lt;/li&gt;
+&lt;/ol&gt;
+
+&lt;ul&gt;
+  &lt;li&gt;For COVID-19 related data, we add &lt;code 
class=&quot;highlighter-rouge&quot;&gt;REGION_HIERARCHY&lt;/code&gt; and 
pre-defined &lt;code 
class=&quot;highlighter-rouge&quot;&gt;SUM_NEW_POSITIVE_CASES&lt;/code&gt; and 
mortality rate &lt;code 
class=&quot;highlighter-rouge&quot;&gt;CFR_COVID19&lt;/code&gt; to the 
PivotTable:&lt;/li&gt;
+&lt;/ul&gt;
+
+&lt;p&gt;&lt;img 
src=&quot;/images/blog/kylin4_on_cloud/33_tableau_covid19_1.png&quot; 
alt=&quot;&quot; /&gt;&lt;/p&gt;
+
+&lt;p&gt;The highest level of the regional hierarchy is &lt;code 
class=&quot;highlighter-rouge&quot;&gt;CONTINENT_NAME&lt;/code&gt;, which 
includes the number of confirmed cases and mortality rate in each continent. We 
can see that Europe has the highest number of confirmed cases while Africa has 
the highest mortality rate. In this PivotTable, we can easily drill down to 
lower regional levels to check more fine-grained data, such as data from 
different Asian countries, and sort them in descending order according to the 
number of confirmed cases:&lt;/p&gt;
+
+&lt;p&gt;&lt;img 
src=&quot;/images/blog/kylin4_on_cloud/34_excel_covid20_2.png&quot; 
alt=&quot;&quot; /&gt;&lt;/p&gt;
+
+&lt;p&gt;The data shows that India, Turkey, and Iran are the countries with 
the highest number of confirmed cases.&lt;/p&gt;
+
+&lt;ul&gt;
+  &lt;li&gt;To answer the question of whether the pandemic has a significant impact on taxi orders, we first look at the YTD and growth rate of taxi orders at the year level by creating a PivotTable with &lt;code class=&quot;highlighter-rouge&quot;&gt;TIME_HIERARCHY&lt;/code&gt; as the dimension plus the measures &lt;code class=&quot;highlighter-rouge&quot;&gt;YOY_ORDER_COUNT&lt;/code&gt; and &lt;code class=&quot;highlighter-rouge&quot;&gt;YTD_ORDER_COUNT&lt;/code&gt;:&lt;/li&gt;
+&lt;/ul&gt;
+
+&lt;p&gt;&lt;img 
src=&quot;/images/blog/kylin4_on_cloud/35_excel_taxi_1.png&quot; 
alt=&quot;&quot; /&gt;&lt;/p&gt;
+
+&lt;p&gt;It can be seen that since the outbreak of the pandemic in 2020, there has been a sharp decrease in taxi orders. The growth rate for 2020 is -0.7079, that is, a roughly 70% reduction in taxi orders. The growth rate for 2021 is still negative, but the decline is less pronounced than in 2020 when the pandemic had just started.&lt;/p&gt;
+
+&lt;p&gt;Click to expand the time hierarchy to view the data at quarter, 
month, and even day levels. By selecting &lt;code 
class=&quot;highlighter-rouge&quot;&gt;MOM_ORDER_COUNT&lt;/code&gt; and 
&lt;code class=&quot;highlighter-rouge&quot;&gt;ORDER_COUNT&lt;/code&gt;, we 
can check the monthly order growth rate and order numbers in different time 
hierarchies:&lt;/p&gt;
+
+&lt;p&gt;&lt;img 
src=&quot;/images/blog/kylin4_on_cloud/36_excel_taxi_2.png&quot; 
alt=&quot;&quot; /&gt;&lt;/p&gt;
+
+&lt;p&gt;The order growth rate in March 2020 was -0.52, already a significant fall. The rate dropped even further to -0.92 in April, that is, a roughly 90% reduction in orders. After that the decline moderated, but taxi orders remained much lower than before the outbreak.&lt;/p&gt;
+
+&lt;h3 
id=&quot;use-api-to-integrate-kylin-with-data-analysis-platform&quot;&gt;Use 
API to integrate Kylin with data analysis platform&lt;/h3&gt;
+
+&lt;p&gt;In addition to mainstream BI tools such as Excel and Tableau, many companies also develop in-house data analysis platforms. For such self-developed platforms, users can still use Kylin + MDX for Kylin as the base of the analysis platform by calling its API, which ensures a unified data definition. Below, we will show how to send a query to MDX for Kylin through Olap4j, a Java library similar to a JDBC driver that can access any OLAP service.&lt;/p&gt;
+
+&lt;p&gt;We also provide a simple demo; you may click &lt;a href=&quot;https://github.com/apache/kylin/tree/mdx-query-demo&quot;&gt;mdx query demo&lt;/a&gt; to download the source code.&lt;/p&gt;
+
+&lt;ol&gt;
+  &lt;li&gt;
+    &lt;p&gt;Download jar package for the demo:&lt;/p&gt;
+
+    &lt;div class=&quot;highlighter-rouge&quot;&gt;&lt;pre 
class=&quot;highlight&quot;&gt;&lt;code&gt;wget 
https://s3.cn-north-1.amazonaws.com.cn/public.kyligence.io/kylin/kylin_demo/mdx_query_demo.tgz
+tar -xvf mdx_query_demo.tgz
+cd mdx_query_demo
+&lt;/code&gt;&lt;/pre&gt;
+    &lt;/div&gt;
+  &lt;/li&gt;
+  &lt;li&gt;
+    &lt;p&gt;Run demo&lt;/p&gt;
+  &lt;/li&gt;
+&lt;/ol&gt;
+
+&lt;p&gt;Make sure Java 8 is installed before running the demo:&lt;/p&gt;
+
+&lt;p&gt;&lt;img src=&quot;/images/blog/kylin4_on_cloud/37_jdk_8.png&quot; 
alt=&quot;&quot; /&gt;&lt;/p&gt;
+
+&lt;p&gt;Two parameters are needed to run the demo: the IP of the MDX node and 
the MDX query to be run. The default port is 7080. The MDX node IP here is the 
public IP of the Kylin node.&lt;/p&gt;
+
+&lt;div class=&quot;highlighter-rouge&quot;&gt;&lt;pre 
class=&quot;highlight&quot;&gt;&lt;code&gt;java -cp 
olap4j-xmla-1.2.0.jar:olap4j-1.2.0.jar:xercesImpl-2.9.1.jar:mdx-query-demo-0.0.1.jar
 io.kyligence.mdxquerydemo.MdxQueryDemoApplication 
&quot;${kylin_node_public_ip}&quot; &quot;${mdx_query}&quot;
+&lt;/code&gt;&lt;/pre&gt;
+&lt;/div&gt;
+
+&lt;p&gt;Or you can enter only the IP of the MDX node; the demo will then automatically run the following MDX statement, which counts the order number and average trip mileage of each borough according to the pick-up ID:&lt;/p&gt;
+
+&lt;div class=&quot;highlighter-rouge&quot;&gt;&lt;pre 
class=&quot;highlight&quot;&gt;&lt;code&gt;SELECT
+{[Measures].[ORDER_COUNT],
+[Measures].[trip_mean_distance]}
+DIMENSION PROPERTIES [MEMBER_UNIQUE_NAME],[MEMBER_ORDINAL],[MEMBER_CAPTION] ON 
COLUMNS,
+NON EMPTY [PICKUP_NEWYORK_ZONE].[BOROUGH].[BOROUGH].AllMembers
+DIMENSION PROPERTIES [MEMBER_UNIQUE_NAME],[MEMBER_ORDINAL],[MEMBER_CAPTION] ON 
ROWS
+FROM [covid_trip_dataset]
+&lt;/code&gt;&lt;/pre&gt;
+&lt;/div&gt;
+
+&lt;p&gt;We will use the default query in this tutorial. After the execution completes, we can see the query result in the command line:&lt;/p&gt;
+
+&lt;p&gt;&lt;img 
src=&quot;/images/blog/kylin4_on_cloud/38_demo_result.png&quot; 
alt=&quot;&quot; /&gt;&lt;/p&gt;
+
+&lt;p&gt;As you can see, we have successfully obtained the data we need. The result shows that the largest number of taxi orders come from Manhattan, with an average trip distance of only about 2.4 miles, which is reasonable given Manhattan’s small area and dense population; meanwhile, the average distance of orders departing from the Bronx is 33 miles, much higher than in any other borough, probably due to the Bronx’s remote location.&lt;/p&gt;
+
+&lt;p&gt;As with Tableau and Excel, the MDX statement here can directly use 
the metrics defined in Kylin and MDX for Kylin. Users can do further analysis 
of the data with their own data analysis platform.&lt;/p&gt;
+
+&lt;h3 id=&quot;unified-data-definition&quot;&gt;Unified data 
definition&lt;/h3&gt;
+
+&lt;p&gt;We have demonstrated three ways to work with Kylin + MDX for Kylin. With the Kylin multidimensional database and the MDX for Kylin semantic layer, no matter which data analysis tool you use, you always work with the same data model and business metrics and enjoy the advantages of unified semantics.&lt;/p&gt;
+
+&lt;h2 id=&quot;delete-clusters&quot;&gt;Delete clusters&lt;/h2&gt;
+
+&lt;h3 id=&quot;delete-query-cluster&quot;&gt;Delete query cluster&lt;/h3&gt;
+
+&lt;p&gt;After the analysis, we can execute the cluster destroy command to delete the query cluster. If you also want to delete the RDS metadata database, the monitor node, and the VPC used by Kylin and MDX for Kylin, execute the following destroy-all command:&lt;/p&gt;
+
+&lt;div class=&quot;highlighter-rouge&quot;&gt;&lt;pre 
class=&quot;highlight&quot;&gt;&lt;code&gt;python deploy.py --type destroy-all
+&lt;/code&gt;&lt;/pre&gt;
+&lt;/div&gt;
+
+&lt;h3 id=&quot;check-aws-resources&quot;&gt;Check AWS resources&lt;/h3&gt;
+
+&lt;p&gt;After all cluster resources are deleted, no stacks related to the Kylin deployment tool should remain in &lt;code class=&quot;highlighter-rouge&quot;&gt;CloudFormation&lt;/code&gt;. If you also want to delete the deployment-related files and data from S3, you can manually delete the following folders under the S3 working directory:&lt;/p&gt;
+
+&lt;p&gt;&lt;img 
src=&quot;/images/blog/kylin4_on_cloud/39_check_s3_demo.png&quot; 
alt=&quot;&quot; /&gt;&lt;/p&gt;
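+
+&lt;p&gt;These checks can also be done from the command line with the AWS CLI (a sketch; the bucket name and prefix below are placeholders for your own S3 working directory):&lt;/p&gt;
+
+&lt;div class=&quot;highlighter-rouge&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;# list remaining CloudFormation stacks (expect no Kylin-related entries)
+aws cloudformation list-stacks --stack-status-filter CREATE_COMPLETE UPDATE_COMPLETE
+# remove the deployment files under your S3 working directory
+aws s3 rm s3://${your_bucket}/${working_dir}/ --recursive
+&lt;/code&gt;&lt;/pre&gt;
+&lt;/div&gt;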
+
+&lt;h2 id=&quot;summary&quot;&gt;Summary&lt;/h2&gt;
+
+&lt;p&gt;You only need an AWS account to follow the steps in this tutorial and explore our Kylin deployment tool on the cloud. Kylin + MDX for Kylin, with pre-computation technology, multidimensional models, and basic metrics management capabilities, enables users to build a big data analysis platform on the cloud in a convenient way. In addition, it supports seamless connection to mainstream BI tools, helping users better leverage their data with higher efficiency and a lower TCO.&lt;/p&gt;
+</description>
+        <pubDate>Wed, 20 Apr 2022 04:00:00 -0700</pubDate>
+        
<link>http://kylin.apache.org/blog/2022/04/20/kylin4-on-cloud-part2/</link>
+        <guid 
isPermaLink="true">http://kylin.apache.org/blog/2022/04/20/kylin4-on-cloud-part2/</guid>
+        
+        
+        <category>blog</category>
+        
+      </item>
+    
+      <item>
         <title>Kylin on Cloud —— 
两小时快速搭建云上数据分析平台(上)</title>
         <description>&lt;h2 id=&quot;section&quot;&gt;背景&lt;/h2&gt;
 
@@ -661,6 +972,403 @@ sed -i &lt;span class=&quot;s2&quot;&gt;
       </item>
     
       <item>
+        <title>Kylin on Cloud — Build A Data Analysis Platform on the Cloud 
in Two Hours Part 1</title>
+        <description>&lt;h2 id=&quot;video-tutorials&quot;&gt;Video 
Tutorials&lt;/h2&gt;
+
+&lt;p&gt;&lt;a href=&quot;https://youtu.be/5kKXEMjO1Sc&quot;&gt;Kylin on Cloud 
— Build A Data Analysis Platform on the Cloud in Two Hours Part 
1&lt;/a&gt;&lt;/p&gt;
+
+&lt;h2 id=&quot;background&quot;&gt;Background&lt;/h2&gt;
+
+&lt;p&gt;Apache Kylin is a multidimensional database based on pre-computation and multidimensional models, and it supports a standard SQL query interface. In Kylin, users can define table relationships by creating Models, define dimensions and measures by creating Cubes, and run data aggregation with Cube building. The pre-computed data is saved to answer user queries, and users can also perform further aggregation on the pre-computed data, significantly improving query performance.&lt;/p&gt;
+
+&lt;p&gt;With the release of Kylin 4.0, Kylin can now be deployed without a Hadoop environment. To make it easier for users to deploy Kylin on the cloud, the Kylin community recently developed a cloud deployment tool that allows users to obtain a complete Kylin cluster by executing a single command, delivering a fast and efficient analysis experience. Moreover, in January 2022, the Kylin community released MDX for Kylin to enhance Kylin’s semantic capability as a multidimensional database. MDX for Kylin provides an MDX query interface; users can define business metrics based on the multidimensional model and translate Kylin data models into a business-friendly language, giving the data business value and making it easier to integrate with Excel, Tableau, and other BI tools.&lt;/p&gt;
+
+&lt;p&gt;With all these innovations, users can easily and quickly deploy Kylin 
clusters on the cloud, create multi-dimensional models, and enjoy the short 
query latency brought by pre-computation; what’s more, users can also use MDX 
for Kylin to define and manage business metrics, leveraging both the advantages 
of data warehouse and business semantics.&lt;/p&gt;
+
+&lt;p&gt;With Kylin + MDX for Kylin, users can directly work with BI tools for 
multidimensional data analysis, or use it as the basis to build complex 
applications such as metrics platforms. Compared with the solution of building 
a metrics platform directly with computing engines such as Spark and Hive that 
perform Join and aggregated query computation at runtime, Kylin, with our 
multidimensional modeling, pre-computation technology, and semantics layer 
capabilities empowered by MDX for Kylin, provides users with key functions such 
as massive data computation, extremely fast query response, unified 
multidimensional model, interface to a variety of BI tools, and basic business 
metrics management capabilities.&lt;/p&gt;
+
+&lt;p&gt;This tutorial starts from a data engineer’s perspective to show 
how to build a Kylin on Cloud data analysis platform, which delivers a 
high-performance query experience over hundreds of millions of rows of data at 
a lower TCO, the capability to manage business metrics through MDX for Kylin, 
and direct connection to BI tools for quick report generation.&lt;/p&gt;
+
+&lt;p&gt;Each step of this tutorial is explained in detail with illustrations 
and checkpoints to help newcomers. All you need to get started is an AWS account 
and 2 hours. Note: the cloud cost to finish this tutorial is around 
$15.&lt;/p&gt;
+
+&lt;p&gt;&lt;img 
src=&quot;/images/blog/kylin4_on_cloud/0_deploy_kylin.png&quot; 
alt=&quot;&quot; /&gt;&lt;/p&gt;
+
+&lt;h2 id=&quot;business-scenario&quot;&gt;Business scenario&lt;/h2&gt;
+
+&lt;p&gt;Since the beginning of 2020, COVID-19 has spread rapidly all over the 
world, greatly changing people’s daily lives, especially their travel 
habits. Based on the pandemic data and New York taxi trip data since 2018, this 
tutorial examines the impact of the pandemic on the New York taxi industry, 
analyzing indicators such as positive cases, fatality rate, taxi orders, and 
average travel mileage. We hope this analysis can provide some insights for 
future decision-making.&lt;/p&gt;
+
+&lt;h3 id=&quot;business-issues&quot;&gt;Business issues&lt;/h3&gt;
+
+&lt;ul&gt;
+  &lt;li&gt;The severity of the pandemic in different countries and 
regions&lt;/li&gt;
+  &lt;li&gt;Travel metrics of different blocks in New York City, such as order 
number, travel mileage, etc.&lt;/li&gt;
+  &lt;li&gt;Does the pandemic have a significant impact on taxi 
orders?&lt;/li&gt;
+  &lt;li&gt;Travel habits change after the pandemic (long-distance vs. 
short-distance travels)&lt;/li&gt;
+  &lt;li&gt;Is the severity of the pandemic strongly related to taxi 
travel?&lt;/li&gt;
+&lt;/ul&gt;
+
+&lt;h3 id=&quot;dataset&quot;&gt;Dataset&lt;/h3&gt;
+
+&lt;h4 id=&quot;covid-19-dataset&quot;&gt;COVID-19 Dataset&lt;/h4&gt;
+
+&lt;p&gt;The COVID-19 dataset includes a fact table &lt;code 
class=&quot;highlighter-rouge&quot;&gt;covid_19_activity&lt;/code&gt; and a 
dimension table &lt;code 
class=&quot;highlighter-rouge&quot;&gt;lookup_calendar&lt;/code&gt;.&lt;/p&gt;
+
+&lt;p&gt;&lt;code 
class=&quot;highlighter-rouge&quot;&gt;covid_19_activity&lt;/code&gt; contains 
the number of confirmed cases and deaths reported each day in different regions 
around the world. &lt;code 
class=&quot;highlighter-rouge&quot;&gt;lookup_calendar&lt;/code&gt; is a date 
dimension table that holds time-extended information, such as the beginning of 
the year, and the beginning of the month for each date. &lt;code 
class=&quot;highlighter-rouge&quot;&gt;covid_19_activity&lt;/code&gt; and 
&lt;code class=&quot;highlighter-rouge&quot;&gt;lookup_calendar&lt;/code&gt; 
are associated by date.&lt;br /&gt;
+The COVID-19 dataset information is as follows:&lt;/p&gt;
+
+&lt;table&gt;
+  &lt;tbody&gt;
+    &lt;tr&gt;
+      &lt;td&gt;Data size&lt;/td&gt;
+      &lt;td&gt;235 MB&lt;/td&gt;
+    &lt;/tr&gt;
+    &lt;tr&gt;
+      &lt;td&gt;Fact table row count&lt;/td&gt;
+      &lt;td&gt;2,753,688&lt;/td&gt;
+    &lt;/tr&gt;
+    &lt;tr&gt;
+      &lt;td&gt;Data range&lt;/td&gt;
+      &lt;td&gt;2020-01-21~2022-03-07&lt;/td&gt;
+    &lt;/tr&gt;
+    &lt;tr&gt;
+      &lt;td&gt;Download address provided by the dataset provider&lt;/td&gt;
+      
&lt;td&gt;https://data.world/covid-19-data-resource-hub/covid-19-case-counts/workspace/file?filename=COVID-19+Activity.csv&lt;/td&gt;
+    &lt;/tr&gt;
+    &lt;tr&gt;
+      &lt;td&gt;S3 directory of the dataset&lt;/td&gt;
+      
&lt;td&gt;s3://public.kyligence.io/kylin/kylin_demo/data/covid19_data/&lt;/td&gt;
+    &lt;/tr&gt;
+  &lt;/tbody&gt;
+&lt;/table&gt;
+
+&lt;h4 id=&quot;nyc-taxi-order-dataset&quot;&gt;NYC taxi order 
dataset&lt;/h4&gt;
+
+&lt;p&gt;The NYC taxi order dataset consists of a fact table &lt;code 
class=&quot;highlighter-rouge&quot;&gt;taxi_trip_records_view&lt;/code&gt;, and 
two dimension tables, &lt;code 
class=&quot;highlighter-rouge&quot;&gt;newyork_zone&lt;/code&gt; and &lt;code 
class=&quot;highlighter-rouge&quot;&gt;lookup_calendar&lt;/code&gt;.&lt;/p&gt;
+
+&lt;p&gt;Among them, each record in &lt;code 
class=&quot;highlighter-rouge&quot;&gt;taxi_trip_records_view&lt;/code&gt; 
corresponds to one taxi trip and contains information such as the pick-up 
location ID, drop-off location ID, trip duration, order amount, and travel 
mileage. &lt;code 
class=&quot;highlighter-rouge&quot;&gt;newyork_zone&lt;/code&gt; records the 
administrative district corresponding to each location ID. &lt;code 
class=&quot;highlighter-rouge&quot;&gt;taxi_trip_records_view&lt;/code&gt; is 
joined with &lt;code 
class=&quot;highlighter-rouge&quot;&gt;newyork_zone&lt;/code&gt; through the 
columns PULocationID and DOLocationID to get information about the pick-up and 
drop-off blocks. &lt;code 
class=&quot;highlighter-rouge&quot;&gt;lookup_calendar&lt;/code&gt; is the same 
dimension table as in the COVID-19 dataset. &lt;code 
class=&quot;highlighter-rouge&quot;&gt;taxi_trip_records_view&lt;/code&gt; and 
&lt;code class=&quot;highlighter-rouge&quot;&gt;lookup_calendar&lt;/code&gt; 
are joined by date.&lt;/p&gt;
+
+&lt;p&gt;NYC taxi order dataset information:&lt;/p&gt;
+
+&lt;table&gt;
+  &lt;tbody&gt;
+    &lt;tr&gt;
+      &lt;td&gt;Data size&lt;/td&gt;
+      &lt;td&gt;19 G&lt;/td&gt;
+    &lt;/tr&gt;
+    &lt;tr&gt;
+      &lt;td&gt;Fact table row count&lt;/td&gt;
+      &lt;td&gt;226,849,274&lt;/td&gt;
+    &lt;/tr&gt;
+    &lt;tr&gt;
+      &lt;td&gt;Data range&lt;/td&gt;
+      &lt;td&gt;2018-01-01~2021-07-31&lt;/td&gt;
+    &lt;/tr&gt;
+    &lt;tr&gt;
+      &lt;td&gt;Download address provided by the dataset provider&lt;/td&gt;
+      
&lt;td&gt;https://www1.nyc.gov/site/tlc/about/tlc-trip-record-data.page&lt;/td&gt;
+    &lt;/tr&gt;
+    &lt;tr&gt;
+      &lt;td&gt;S3 directory of the dataset&lt;/td&gt;
+      
&lt;td&gt;s3://public.kyligence.io/kylin/kylin_demo/data/trip_data_2018-2021/&lt;/td&gt;
+    &lt;/tr&gt;
+  &lt;/tbody&gt;
+&lt;/table&gt;
+
+&lt;h4 id=&quot;er-diagram&quot;&gt;ER Diagram&lt;/h4&gt;
+
+&lt;p&gt;The ER diagram of the COVID-19 dataset and NYC taxi order dataset is 
as follows:&lt;/p&gt;
+
+&lt;p&gt;&lt;img src=&quot;/images/blog/kylin4_on_cloud/1_table_ER.png&quot; 
alt=&quot;&quot; /&gt;&lt;/p&gt;
+
+&lt;h3 id=&quot;metrics-design&quot;&gt;Metrics design&lt;/h3&gt;
+
+&lt;p&gt;Based on the business issues this model aims to answer, we designed 
the following atomic metrics and business metrics:&lt;/p&gt;
+
+&lt;h6 id=&quot;atomic-metrics&quot;&gt;1. Atomic metrics&lt;/h6&gt;
+
+&lt;p&gt;Atomic metrics refer to measures created in Kylin Cube, which are 
relatively simple, as they only run aggregated calculations on one 
column.&lt;/p&gt;
+
+&lt;ul&gt;
+  &lt;li&gt;Covid19 case count: &lt;code 
class=&quot;highlighter-rouge&quot;&gt;sum(covid_19_activity.people_positive_cases_count)&lt;/code&gt;&lt;/li&gt;
+  &lt;li&gt;Covid19 fatality: &lt;code 
class=&quot;highlighter-rouge&quot;&gt;sum(covid_19_activity.people_death_count)&lt;/code&gt;&lt;/li&gt;
+  &lt;li&gt;Covid19 new positive case count: &lt;code 
class=&quot;highlighter-rouge&quot;&gt;sum(covid_19_activity.people_positive_new_cases_count)&lt;/code&gt;&lt;/li&gt;
+  &lt;li&gt;Covid19 new death count: &lt;code 
class=&quot;highlighter-rouge&quot;&gt;sum(covid_19_activity.people_death_new_count)&lt;/code&gt;&lt;/li&gt;
+  &lt;li&gt;Taxi trip mileage: &lt;code 
class=&quot;highlighter-rouge&quot;&gt;sum(taxi_trip_records_view.trip_distance)&lt;/code&gt;&lt;/li&gt;
+  &lt;li&gt;Taxi order amount: &lt;code 
class=&quot;highlighter-rouge&quot;&gt;sum(taxi_trip_records_view.total_amount)&lt;/code&gt;&lt;/li&gt;
+  &lt;li&gt;Taxi trip count: &lt;code 
class=&quot;highlighter-rouge&quot;&gt;count()&lt;/code&gt;&lt;/li&gt;
+  &lt;li&gt;Taxi trip duration: &lt;code 
class=&quot;highlighter-rouge&quot;&gt;sum(taxi_trip_records_view.trip_time_hour)&lt;/code&gt;&lt;/li&gt;
+&lt;/ul&gt;
+
+&lt;h6 id=&quot;business-metrics&quot;&gt;2. Business metrics&lt;/h6&gt;
+
+&lt;p&gt;Business metrics are various compound operations based on atomic 
metrics that have specific business meanings.&lt;/p&gt;
+
+&lt;ul&gt;
+  &lt;li&gt;MTD, YTD of each atomic metric&lt;/li&gt;
+  &lt;li&gt;MOM, YOY of each atomic metric&lt;/li&gt;
+  &lt;li&gt;Covid19 fatality rate: death count/positive case count&lt;/li&gt;
+  &lt;li&gt;Average taxi trip speed: taxi trip distance/taxi trip 
duration&lt;/li&gt;
+  &lt;li&gt;Average taxi trip mileage: taxi trip distance/taxi trip 
count&lt;/li&gt;
+&lt;/ul&gt;
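As a quick illustration of how a business metric composes atomic metrics, the sketch below computes the Covid19 fatality rate (death count / positive case count) over a couple of regional totals. The numbers are made up for illustration, not taken from the dataset:

```shell
# Business metric = compound of atomic metrics.
# Illustrative totals per region (made-up numbers, not from the dataset):
#   region, sum(people_positive_cases_count), sum(people_death_count)
fatality=$(printf 'NY,1000,20\nNJ,500,5\n' |
  awk -F, '{ printf "%s %.1f%%\n", $1, 100 * $3 / $2 }')
echo "$fatality"
```

In Kylin + MDX for Kylin, this same division would be defined once as a calculated measure over the two atomic sums rather than recomputed by each consumer.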
+
+&lt;h2 id=&quot;operation-overview&quot;&gt;Operation Overview&lt;/h2&gt;
+
+&lt;p&gt;The diagram below shows the main steps to build a cloud data analysis 
platform with Apache Kylin and to perform data analysis on it:&lt;/p&gt;
+
+&lt;p&gt;&lt;img 
src=&quot;/images/blog/kylin4_on_cloud/2_step_overview.jpg&quot; 
alt=&quot;&quot; /&gt;&lt;/p&gt;
+
+&lt;h2 id=&quot;cluster-architecture&quot;&gt;Cluster architecture&lt;/h2&gt;
+
+&lt;p&gt;Here is the architecture of the Kylin cluster deployed by the cloud 
deployment tool:&lt;/p&gt;
+
+&lt;p&gt;&lt;img 
src=&quot;/images/blog/kylin4_on_cloud/3_kylin_cluster.jpg&quot; 
alt=&quot;&quot; /&gt;&lt;/p&gt;
+
+&lt;h2 id=&quot;kylin-on-cloud-deployment&quot;&gt;Kylin on Cloud 
deployment&lt;/h2&gt;
+
+&lt;h3 id=&quot;prerequisites&quot;&gt;Prerequisites&lt;/h3&gt;
+
+&lt;ul&gt;
+  &lt;li&gt;GitHub Desktop: for downloading the deployment tool&lt;/li&gt;
+  &lt;li&gt;Python 3.6.6: for running the deployment tool&lt;/li&gt;
+&lt;/ul&gt;
+
+&lt;h3 id=&quot;aws-permission-check-and-initialization&quot;&gt;AWS 
permission check and initialization&lt;/h3&gt;
+
+&lt;p&gt;Log in to AWS with your account to check the permission status and 
then create the Access Key, IAM Role, Key Pair, and S3 working directory 
according to the document &lt;a 
href=&quot;https://github.com/apache/kylin/blob/kylin4_on_cloud/readme/prerequisites.md&quot;&gt;Prerequisites&lt;/a&gt;.
 Subsequent AWS operations will be performed with this account.&lt;/p&gt;
+
+&lt;h3 id=&quot;configure-the-deployment-tool&quot;&gt;Configure the 
deployment tool&lt;/h3&gt;
+
+&lt;ol&gt;
+  &lt;li&gt;
+    &lt;p&gt;Execute the following command to clone the code for the Kylin on 
AWS deployment tool.&lt;/p&gt;
+
+    &lt;div class=&quot;highlighter-rouge&quot;&gt;&lt;pre 
class=&quot;highlight&quot;&gt;&lt;code&gt;git clone -b kylin4_on_cloud --single-branch 
https://github.com/apache/kylin.git &amp;amp;&amp;amp; cd kylin
+&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
+  &lt;/li&gt;
+  &lt;li&gt;
+    &lt;p&gt;Initialize the virtual environment for your Python on local 
machine.&lt;/p&gt;
+
+    &lt;p&gt;Run the command below to check the Python version. Note: Python 
3.6.6 or above is needed:&lt;/p&gt;
+
+    &lt;div class=&quot;highlighter-rouge&quot;&gt;&lt;pre 
class=&quot;highlight&quot;&gt;&lt;code&gt;python --version
+&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
+
+    &lt;p&gt;Initialize the virtual environment for Python and install 
dependencies:&lt;/p&gt;
+
+    &lt;div class=&quot;highlighter-rouge&quot;&gt;&lt;pre 
class=&quot;highlight&quot;&gt;&lt;code&gt;bin/init.sh
+source venv/bin/activate
+&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
+  &lt;/li&gt;
+  &lt;li&gt;
+    &lt;p&gt;Modify the configuration file &lt;code 
class=&quot;highlighter-rouge&quot;&gt;kylin_configs.yaml&lt;/code&gt;&lt;/p&gt;
+  &lt;/li&gt;
+&lt;/ol&gt;
+
+&lt;p&gt;Open kylin_configs.yaml file, and replace the configuration items 
with the actual values:&lt;/p&gt;
+
+&lt;ul&gt;
+  &lt;li&gt;&lt;code 
class=&quot;highlighter-rouge&quot;&gt;AWS_REGION&lt;/code&gt;: Region for EC2 
instance, the default value is &lt;code 
class=&quot;highlighter-rouge&quot;&gt;cn-northwest-1&lt;/code&gt;&lt;/li&gt;
+  &lt;li&gt;&lt;code 
class=&quot;highlighter-rouge&quot;&gt;${IAM_ROLE_NAME}&lt;/code&gt;: IAM Role 
just created, e.g. &lt;code 
class=&quot;highlighter-rouge&quot;&gt;kylin_deploy_role&lt;/code&gt;&lt;/li&gt;
+  &lt;li&gt;&lt;code 
class=&quot;highlighter-rouge&quot;&gt;${S3_URI}&lt;/code&gt;: S3 working 
directory for deploying Kylin, e.g. s3://kylindemo/kylin_demo_dir/&lt;/li&gt;
+  &lt;li&gt;&lt;code 
class=&quot;highlighter-rouge&quot;&gt;${KEY_PAIR}&lt;/code&gt;: Key pairs just 
created, e.g. kylin_deploy_key&lt;/li&gt;
+  &lt;li&gt;&lt;code class=&quot;highlighter-rouge&quot;&gt;${Cidr 
Ip}&lt;/code&gt;: IP address range that is allowed to access EC2 instances, 
e.g. 10.1.0.0/32, usually set as your external IP address to ensure that only 
you can access these EC2 instances&lt;/li&gt;
+&lt;/ul&gt;
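For reference, a filled-in configuration covering the items above might look like the following sketch. The key names and values here are placeholders assumed for illustration; check the actual `kylin_configs.yaml` in the repository for the precise keys:

```yaml
# Placeholder values -- replace with your own AWS resources
AWS_REGION: cn-northwest-1
IAM_ROLE_NAME: kylin_deploy_role
S3_URI: s3://kylindemo/kylin_demo_dir/
KEY_PAIR: kylin_deploy_key
CIDR_IP: 10.1.0.0/32
```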
+
+&lt;p&gt;As Kylin adopts a read-write separation architecture to separate 
build and query resources, in the following steps we will first start a build 
cluster to connect to Glue to create tables, load data sources, and submit 
build jobs for pre-computation, then delete the build cluster while keeping the 
metadata. After that, we will start a query cluster with MDX for Kylin to 
create business metrics, connect to BI tools for queries, and perform data 
analysis.&lt;/p&gt;
+
+&lt;p&gt;The Kylin on AWS cluster uses RDS to store metadata and S3 to store 
the built data. It also supports loading data sources from AWS Glue. Except for 
the EC2 nodes, the other resources used are permanent and will not disappear 
with the deletion of nodes. Therefore, when there is no query or build job, 
users can delete the build or query clusters and keep only the metadata and S3 
working directory.&lt;/p&gt;
+
+&lt;h3 id=&quot;kylin-build-cluster&quot;&gt;Kylin build cluster&lt;/h3&gt;
+
+&lt;h4 id=&quot;start-kylin-build-cluster&quot;&gt;Start Kylin build 
cluster&lt;/h4&gt;
+
+&lt;ol&gt;
+  &lt;li&gt;
+    &lt;p&gt;Start the build cluster with the following command. The whole 
process may take 15-30 minutes depending on your network conditions.&lt;/p&gt;
+
+    &lt;div class=&quot;highlighter-rouge&quot;&gt;&lt;pre 
class=&quot;highlight&quot;&gt;&lt;code&gt;python deploy.py --type deploy --mode job
+&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
+  &lt;/li&gt;
+  &lt;li&gt;
+    &lt;p&gt;You may check the terminal to see if the build cluster is 
successfully deployed:&lt;/p&gt;
+  &lt;/li&gt;
+&lt;/ol&gt;
+
+&lt;p&gt;&lt;img 
src=&quot;/images/blog/kylin4_on_cloud/4_deploy_cluster_successfully.png&quot; 
alt=&quot;&quot; /&gt;&lt;/p&gt;
+
+&lt;h4 id=&quot;check-aws-service&quot;&gt;Check AWS Service&lt;/h4&gt;
+
+&lt;ol&gt;
+  &lt;li&gt;
+    &lt;p&gt;Go to CloudFormation on AWS console, where you can see 7 stacks 
are created by the Kylin deployment tool:&lt;/p&gt;
+
+    &lt;p&gt;&lt;img 
src=&quot;/images/blog/kylin4_on_cloud/5_check_aws_stacks.png&quot; 
alt=&quot;&quot; /&gt;&lt;/p&gt;
+  &lt;/li&gt;
+  &lt;li&gt;
+    &lt;p&gt;Users can view the details of EC2 nodes through the AWS console 
or use the command below to check the names, private IPs, and public IPs of all 
EC2 nodes.&lt;/p&gt;
+  &lt;/li&gt;
+&lt;/ol&gt;
+
+&lt;div class=&quot;highlighter-rouge&quot;&gt;&lt;pre 
class=&quot;highlight&quot;&gt;&lt;code&gt;python deploy.py --type list
+&lt;/code&gt;&lt;/pre&gt;
+&lt;/div&gt;
+
+&lt;p&gt;&lt;img 
src=&quot;/images/blog/kylin4_on_cloud/6_list_cluster_node.png&quot; 
alt=&quot;&quot; /&gt;&lt;/p&gt;
+
+&lt;h4 id=&quot;spark-sql-query-response-time&quot;&gt;Spark-SQL query 
response time&lt;/h4&gt;
+
+&lt;p&gt;Let’s first check the query response time in the Spark-SQL 
environment as a comparison.&lt;/p&gt;
+
+&lt;ol&gt;
+  &lt;li&gt;
+    &lt;p&gt;First, log in to the EC2 instance where Kylin is deployed using 
the public IP of the Kylin node, switch to the root user, and source &lt;code 
class=&quot;highlighter-rouge&quot;&gt;~/.bash_profile&lt;/code&gt; to load the 
environment variables set beforehand.&lt;/p&gt;
+
+    &lt;div class=&quot;highlighter-rouge&quot;&gt;&lt;pre 
class=&quot;highlight&quot;&gt;&lt;code&gt;ssh -i &quot;${KEY_PAIR}&quot; ec2-user@${kylin_node_public_ip}
+sudo su
+source ~/.bash_profile
+&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
+  &lt;/li&gt;
+  &lt;li&gt;
+    &lt;p&gt;Go to &lt;code 
class=&quot;highlighter-rouge&quot;&gt;$SPARK_HOME&lt;/code&gt; and modify the 
configuration file &lt;code 
class=&quot;highlighter-rouge&quot;&gt;conf/spark-defaults.conf&lt;/code&gt;, 
changing spark_master_node_private_ip to the private IP of the Spark master 
node:&lt;/p&gt;
+
+    &lt;div class=&quot;highlighter-rouge&quot;&gt;&lt;pre 
class=&quot;highlight&quot;&gt;&lt;code&gt;cd $SPARK_HOME
+vim conf/spark-defaults.conf
+
+## Replace spark_master_node_private_ip with the private IP of the real Spark master node
+spark.master spark://spark_master_node_private_ip:7077
+&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
+
+    &lt;p&gt;In &lt;code 
class=&quot;highlighter-rouge&quot;&gt;spark-defaults.conf&lt;/code&gt;, the 
resource allocation for driver and executor is the same as that for Kylin query 
cluster.&lt;/p&gt;
+  &lt;/li&gt;
+  &lt;li&gt;
+    &lt;p&gt;Create table in Spark-SQL&lt;/p&gt;
+
+    &lt;p&gt;All data from the test dataset is stored in S3 bucket of &lt;code 
class=&quot;highlighter-rouge&quot;&gt;cn-north-1&lt;/code&gt; and &lt;code 
class=&quot;highlighter-rouge&quot;&gt;us-east-1&lt;/code&gt;. If your S3 
bucket is in &lt;code 
class=&quot;highlighter-rouge&quot;&gt;cn-north-1&lt;/code&gt; or &lt;code 
class=&quot;highlighter-rouge&quot;&gt;us-east-1&lt;/code&gt;, you can directly 
run SQL to create the table; Or, you will need to execute the following script 
to copy the data to the S3 working directory set up in &lt;code 
class=&quot;highlighter-rouge&quot;&gt;kylin_configs.yaml&lt;/code&gt;, and 
modify your SQL for creating the table:&lt;/p&gt;
+
+    &lt;div class=&quot;highlighter-rouge&quot;&gt;&lt;pre 
class=&quot;highlight&quot;&gt;&lt;code&gt;## AWS CN user
+aws s3 sync s3://public.kyligence.io/kylin/kylin_demo/data/ ${S3_DATA_DIR} --region cn-north-1
+
+## AWS Global user
+aws s3 sync s3://public.kyligence.io/kylin/kylin_demo/data/ ${S3_DATA_DIR} --region us-east-1
+
+## Modify create table SQL
+sed -i &quot;s#s3://public.kyligence.io/kylin/kylin_demo/data/#${S3_DATA_DIR}#g&quot; /home/ec2-user/kylin_demo/create_kylin_demo_table.sql
+&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
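Before running `sed -i` against the real file, you can sanity-check the substitution on a sample line; using `#` as the delimiter avoids having to escape the slashes in the S3 URIs. The `LOCATION` line below is a made-up example, not the actual contents of `create_kylin_demo_table.sql`:

```shell
# Hypothetical target directory and a sample line to rewrite
S3_DATA_DIR="s3://kylindemo/kylin_demo_dir/data/"
line="LOCATION 's3://public.kyligence.io/kylin/kylin_demo/data/covid19_data/'"
rewritten=$(echo "$line" |
  sed "s#s3://public.kyligence.io/kylin/kylin_demo/data/#${S3_DATA_DIR}#g")
echo "$rewritten"
```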
+
+    &lt;p&gt;Execute SQL for creating table:&lt;/p&gt;
+
+    &lt;div class=&quot;highlighter-rouge&quot;&gt;&lt;pre 
class=&quot;highlight&quot;&gt;&lt;code&gt;bin/spark-sql -f /home/ec2-user/kylin_demo/create_kylin_demo_table.sql
+&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
+  &lt;/li&gt;
+  &lt;li&gt;
+    &lt;p&gt;Execute query in Spark-SQL&lt;/p&gt;
+
+    &lt;p&gt;Go to Spark-SQL:&lt;/p&gt;
+
+    &lt;div class=&quot;highlighter-rouge&quot;&gt;&lt;pre 
class=&quot;highlight&quot;&gt;&lt;code&gt;bin/spark-sql
+&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
+
+    &lt;p&gt;Run query in Spark-SQL:&lt;/p&gt;
+
+    &lt;div class=&quot;highlighter-rouge&quot;&gt;&lt;pre 
class=&quot;highlight&quot;&gt;&lt;code&gt;use kylin_demo;
+select TAXI_TRIP_RECORDS_VIEW.PICKUP_DATE, NEWYORK_ZONE.BOROUGH, count(*), 
sum(TAXI_TRIP_RECORDS_VIEW.TRIP_TIME_HOUR), 
sum(TAXI_TRIP_RECORDS_VIEW.TOTAL_AMOUNT)
+from TAXI_TRIP_RECORDS_VIEW
+left join NEWYORK_ZONE
+on TAXI_TRIP_RECORDS_VIEW.PULOCATIONID = NEWYORK_ZONE.LOCATIONID
+group by TAXI_TRIP_RECORDS_VIEW.PICKUP_DATE, NEWYORK_ZONE.BOROUGH;
+&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
+
+    &lt;p&gt;We can see that with the same configuration as Kylin query 
cluster, direct query using Spark-SQL takes over 100s:&lt;/p&gt;
+
+    &lt;p&gt;&lt;img 
src=&quot;/images/blog/kylin4_on_cloud/7_query_in_spark_sql.png&quot; 
alt=&quot;&quot; /&gt;&lt;/p&gt;
+  &lt;/li&gt;
+  &lt;li&gt;
+    &lt;p&gt;After the query is successfully executed, we should exit 
Spark-SQL to free resources before proceeding to the following steps.&lt;/p&gt;
+  &lt;/li&gt;
+&lt;/ol&gt;
+
+&lt;h4 id=&quot;import-kylin-metadata&quot;&gt;Import Kylin metadata&lt;/h4&gt;
+
+&lt;ol&gt;
+  &lt;li&gt;
+    &lt;p&gt;Go to &lt;code 
class=&quot;highlighter-rouge&quot;&gt;$KYLIN_HOME&lt;/code&gt;&lt;/p&gt;
+
+    &lt;div class=&quot;highlighter-rouge&quot;&gt;&lt;pre 
class=&quot;highlight&quot;&gt;&lt;code&gt;cd $KYLIN_HOME
+&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
+  &lt;/li&gt;
+  &lt;li&gt;
+    &lt;p&gt;Import metadata&lt;/p&gt;
+
+    &lt;div class=&quot;highlighter-rouge&quot;&gt;&lt;pre 
class=&quot;highlight&quot;&gt;&lt;code&gt;bin/metastore.sh restore /home/ec2-user/meta_backups/
+&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
+  &lt;/li&gt;
+  &lt;li&gt;
+    &lt;p&gt;Reload metadata&lt;/p&gt;
+  &lt;/li&gt;
+&lt;/ol&gt;
+
+&lt;p&gt;Type &lt;code 
class=&quot;highlighter-rouge&quot;&gt;http://${kylin_node_public_ip}:7070/kylin&lt;/code&gt;
 (replace the IP with the public IP of the EC2 node) in your browser to open 
the Kylin web UI, and log in with the default username and password 
ADMIN/KYLIN:&lt;/p&gt;
+
+&lt;p&gt;&lt;img 
src=&quot;/images/blog/kylin4_on_cloud/8_kylin_web_ui.png&quot; 
alt=&quot;&quot; /&gt;&lt;/p&gt;
+
+&lt;p&gt;Reload Kylin metadata by clicking System -&amp;gt; Configuration 
-&amp;gt; Reload Metadata:&lt;/p&gt;
+
+&lt;p&gt;&lt;img 
src=&quot;/images/blog/kylin4_on_cloud/9_reload_kylin_metadata.png&quot; 
alt=&quot;&quot; /&gt;&lt;/p&gt;
+
+&lt;p&gt;If you’d like to learn how to manually create the Model and Cube 
included in Kylin metadata, please refer to &lt;a 
href=&quot;https://cwiki.apache.org/confluence/display/KYLIN/Create+Model+and+Cube+in+Kylin&quot;&gt;Create
 model and cube in Kylin&lt;/a&gt;.&lt;/p&gt;
+
+&lt;h4 id=&quot;run-build&quot;&gt;Run build&lt;/h4&gt;
+
+&lt;p&gt;Submit the Cube build job. Since no partition column is set in the 
model, we will directly perform a full build for the two cubes:&lt;/p&gt;
+
+&lt;p&gt;&lt;img 
src=&quot;/images/blog/kylin4_on_cloud/10_full_build_cube.png.png&quot; 
alt=&quot;&quot; /&gt;&lt;/p&gt;
+
+&lt;p&gt;&lt;img 
src=&quot;/images/blog/kylin4_on_cloud/11_kylin_job_complete.png&quot; 
alt=&quot;&quot; /&gt;&lt;/p&gt;
+
+&lt;h4 id=&quot;destroy-build-cluster&quot;&gt;Destroy build cluster&lt;/h4&gt;
+
+&lt;p&gt;After the build job is completed, execute the cluster delete 
command to shut down the build cluster. By default, the RDS stack, monitor 
stack, and VPC stack will be kept.&lt;/p&gt;
+
+&lt;div class=&quot;highlighter-rouge&quot;&gt;&lt;pre 
class=&quot;highlight&quot;&gt;&lt;code&gt;python deploy.py --type destroy
+&lt;/code&gt;&lt;/pre&gt;
+&lt;/div&gt;
+
+&lt;p&gt;Cluster is successfully closed:&lt;/p&gt;
+
+&lt;p&gt;&lt;img 
src=&quot;/images/blog/kylin4_on_cloud/12_destroy_job_cluster.png&quot; 
alt=&quot;&quot; /&gt;&lt;/p&gt;
+
+&lt;h4 id=&quot;check-aws-resource&quot;&gt;Check AWS resource&lt;/h4&gt;
+
+&lt;p&gt;After the cluster is successfully deleted, you can go to the &lt;code 
class=&quot;highlighter-rouge&quot;&gt;CloudFormation&lt;/code&gt; page in AWS 
console to confirm whether there are remaining resources. Since the metadata 
RDS, monitor nodes, and VPC nodes are kept by default, you will see only the 
following three stacks on the page.&lt;/p&gt;
+
+&lt;p&gt;&lt;img 
src=&quot;/images/blog/kylin4_on_cloud/13_check_aws_stacks.png&quot; 
alt=&quot;&quot; /&gt;&lt;/p&gt;
+
+&lt;p&gt;The resources in the three stacks will still be used when we start 
the query cluster, to ensure that the query cluster and the build cluster use 
the same set of metadata.&lt;/p&gt;
+
+&lt;h4 id=&quot;intro-to-next-part&quot;&gt;Intro to next part&lt;/h4&gt;
+
+&lt;p&gt;That’s all for the first part of Kylin on Cloud — Build A Data 
Analysis Platform on the Cloud in Two Hours. For the second part, see: &lt;a 
href=&quot;../kylin4-on-cloud-part2/&quot;&gt;Kylin on Cloud — Build A Data 
Analysis Platform on the Cloud in Two Hours Part 2&lt;/a&gt;&lt;/p&gt;
+</description>
+        <pubDate>Wed, 20 Apr 2022 04:00:00 -0700</pubDate>
+        
<link>http://kylin.apache.org/blog/2022/04/20/kylin4-on-cloud-part1/</link>
+        <guid 
isPermaLink="true">http://kylin.apache.org/blog/2022/04/20/kylin4-on-cloud-part1/</guid>
+        
+        
+        <category>blog</category>
+        
+      </item>
+    
+      <item>
         <title>如何使用 Excel 查询 Kylin?MDX for Kylin!</title>
         <description>&lt;h2 id=&quot;kylin--mdx&quot;&gt;Kylin 为什么需要 
MDX?&lt;/h2&gt;
 
@@ -2337,157 +3045,6 @@ Kylin 4.0 对构建和查
         
         
         <category>cn_blog</category>
-        
-      </item>
-    
-      <item>
-        <title>Performance optimization of Kylin 4.0 in cloud -- local cache 
and soft affinity scheduling</title>
-        <description>&lt;h2 id=&quot;background-introduction&quot;&gt;01 
Background Introduction&lt;/h2&gt;
-&lt;p&gt;Recently, the Apache Kylin community released Kylin 4.0.0 with a new 
architecture. The architecture of Kylin 4.0 supports the separation of storage 
and computing, which enables kylin users to run Kylin 4.0 in a more flexible 
cloud deployment mode with flexible computing resources. With the cloud 
infrastructure, users can choose to use cheap and reliable object storage to 
store cube data, such as S3. However, in the architecture of separation of 
storage and computing, we need to consider that reading data from remote 
storage by computing nodes through the network is still a costly operation, 
which often leads to performance loss.&lt;br /&gt;
-In order to improve the query performance of Kylin 4.0 when using cloud object 
storage as the storage, we try to introduce the local cache mechanism into the 
Kylin 4.0 query engine. When executing the query, the frequently used data is 
cached on the local disk to reduce the delay caused by pulling data from the 
remote object storage and achieve faster query response. In addition, in order 
to avoid wasting disk space when the same data is cached on a large number of 
spark executors at the same time, and the computing node can read more required 
data from the local cache, we introduce the scheduling strategy of soft 
affinity. The soft affinity strategy is to establish a corresponding 
relationship between the spark executor and the data file through some method, 
In most cases, the same data can always be read on the same executor, so as to 
improve the hit rate of the cache.&lt;/p&gt;
-
-&lt;h2 id=&quot;implementation-principle&quot;&gt;02 Implementation 
Principle&lt;/h2&gt;
-
-&lt;h4 id=&quot;local-cache&quot;&gt;1. Local Cache&lt;/h4&gt;
-
-&lt;p&gt;When Kylin 4.0 executes a query, it mainly goes through the following 
stages, in which the stages where local cache can be used to improve 
performance are marked with dotted lines:&lt;/p&gt;
-
-&lt;p&gt;&lt;img 
src=&quot;/images/blog/local-cache/Local_cache_stage.png&quot; alt=&quot;&quot; 
/&gt;&lt;/p&gt;
-
-&lt;ul&gt;
-  &lt;li&gt;File list cache: Cache the file status on the spark driver side. 
When executing the query, the spark driver needs to read the file list and 
obtain some file information for subsequent scheduling execution. Here, the 
file status information will be cached locally to avoid frequent reading of 
remote file directories.&lt;/li&gt;
-  &lt;li&gt;Data cache: Cache the data on the spark executor side. You can 
set the data cache to memory or disk. If it is set to cache to memory, you need 
to appropriately increase the executor memory to ensure that the executor has 
enough memory for data cache; If it is cached to disk, you need to set the data 
cache directory, preferably SSD disk directory.&lt;/li&gt;
-&lt;/ul&gt;
-
-&lt;p&gt;Based on the above design, different types of caches are made on the 
driver side and the executor side of the query engine of kylin 4.0. The basic 
architecture is as follows:&lt;/p&gt;
-
-&lt;p&gt;&lt;img 
src=&quot;/images/blog/local-cache/kylin4_local_cache.png&quot; 
alt=&quot;&quot; /&gt;&lt;/p&gt;
-
-&lt;h4 id=&quot;soft-affinity-scheduling&quot;&gt;2. Soft Affinity 
Scheduling&lt;/h4&gt;
-
-&lt;p&gt;When doing data cache on the executor side, if all data is cached on 
all executors, the size of cached data will be very considerable, wasting a 
great deal of disk space and easily causing frequent eviction of cached data. 
In order to maximize the cache hit rate of the spark executor, the spark driver 
needs to schedule the tasks on the same file to the same executor as far as 
possible when the resource conditions are met, so as to ensure that the data of 
the same file can be cached on one or several specific executors and the 
data can be read through the cache when it is read again.&lt;br /&gt;
-To this end, we calculate the target executor list by calculating the hash 
according to the file name and then modulo with the executor num. The number of 
executors to cache is determined by the number of data cache replications 
configured by the user. Generally, the larger the number of cache replications, 
the higher the probability of hitting the cache. When the target executors are 
unreachable or have no resources for scheduling, the scheduler will fall back 
to the random scheduling mechanism of spark. This scheduling method is called 
soft affinity scheduling strategy. Although it can not guarantee 100% hit to 
the cache, it can effectively improve the cache hit rate and avoid a large 
amount of disk space wasted by full cache on the premise of minimizing 
performance loss.&lt;/p&gt;
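The hash-and-modulo rule described above can be sketched in a few lines of shell. This is only an illustration of the idea (file name hashed, taken modulo the executor count, with `replications` consecutive executors as candidates), not Kylin's actual implementation:

```shell
# Sketch: map a file name to candidate executor IDs (not Kylin's real code)
pick_executors() {
  file=$1; num_executors=$2; replications=$3
  # Hash the file name, then take consecutive slots modulo the executor count
  hash=$(printf '%s' "$file" | cksum | cut -d' ' -f1)
  i=0
  while [ "$i" -lt "$replications" ]; do
    echo $(( (hash + i) % num_executors ))
    i=$((i + 1))
  done
}
candidates=$(pick_executors part-00000.parquet 4 2)
echo "$candidates"
```

Because the mapping depends only on the file name and the executor count, the same file is always routed to the same candidate executors, which is what raises the cache hit rate.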
-
-&lt;h2 id=&quot;related-configuration&quot;&gt;03 Related 
Configuration&lt;/h2&gt;
-
-&lt;p&gt;According to the above principles, we implemented the basic function 
of local cache + soft affinity scheduling in Kylin 4.0, and tested the query 
performance based on SSB data set and TPCH data set respectively.&lt;br /&gt;
-Several important configuration items are listed here for users to understand. 
The actual configuration will be given in the attachment at the end:&lt;/p&gt;
-
-&lt;ul&gt;
-  &lt;li&gt;Enable soft affinity 
scheduling: kylin.query.spark-conf.spark.kylin.soft-affinity.enabled&lt;/li&gt;
-  &lt;li&gt;Enable local 
cache: kylin.query.spark-conf.spark.hadoop.spark.kylin.local-cache.enabled&lt;/li&gt;
-  &lt;li&gt;The number of data cache replications, that is, how many executors 
cache the same data 
file: kylin.query.spark-conf.spark.kylin.soft-affinity.replications.num&lt;/li&gt;
-  &lt;li&gt;Cache to memory or local directory. Set cache to memory as buff 
and cache to local as local: 
kylin.query.spark-conf.spark.hadoop.alluxio.user.client.cache.store.type&lt;/li&gt;
-  &lt;li&gt;Maximum cache 
capacity: kylin.query.spark-conf.spark.hadoop.alluxio.user.client.cache.size&lt;/li&gt;
-&lt;/ul&gt;
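Put together, a configuration fragment enabling these options might look like the sketch below. The keys are the ones listed above; the values (and the exact casing of the cache store type) are illustrative assumptions, so check the attachment mentioned in the text for the actual configuration:

```properties
kylin.query.spark-conf.spark.kylin.soft-affinity.enabled=true
kylin.query.spark-conf.spark.hadoop.spark.kylin.local-cache.enabled=true
kylin.query.spark-conf.spark.kylin.soft-affinity.replications.num=1
# LOCAL = cache to a local directory; BUFF = cache to memory (illustrative values)
kylin.query.spark-conf.spark.hadoop.alluxio.user.client.cache.store.type=LOCAL
kylin.query.spark-conf.spark.hadoop.alluxio.user.client.cache.size=10GB
```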
-
-&lt;h2 id=&quot;performance-benchmark&quot;&gt;04 Performance 
Benchmark&lt;/h2&gt;
-
-&lt;p&gt;We conducted performance tests in three scenarios under AWS EMR 
environment. When scale factor = 10, we conducted single concurrent query test 
on SSB dataset, single concurrent query test and 4 concurrent query test on 
TPCH dataset. S3 was configured as storage in the experimental group and the 
control group. Local cache and soft affinity scheduling were enabled in the 
experimental group, but not in the control group. In addition, we also compare 
the results of the experimental group with the results when HDFS is used as 
storage in the same environment, so that users can intuitively feel the 
optimization effect of local cache + soft affinity scheduling on deploying 
Kylin 4.0 on the cloud and using object storage as storage.&lt;/p&gt;
-
-&lt;p&gt;&lt;img 
src=&quot;/images/blog/local-cache/local_cache_benchmark_result_ssb.png&quot; 
alt=&quot;&quot; /&gt;&lt;/p&gt;
-
-&lt;p&gt;&lt;img 
src=&quot;/images/blog/local-cache/local_cache_benchmark_result_tpch1.png&quot; 
alt=&quot;&quot; /&gt;&lt;/p&gt;
-
-&lt;p&gt;&lt;img 
src=&quot;/images/blog/local-cache/local_cache_benchmark_result_tpch4.png&quot; 
alt=&quot;&quot; /&gt;&lt;/p&gt;
-
-&lt;p&gt;As can be seen from the above results:&lt;/p&gt;
-
-&lt;ol&gt;
-  &lt;li&gt;In the single concurrency scenario of the SSB data set, when S3 is 
used as storage, turning on local cache and soft affinity scheduling can 
achieve about a threefold performance improvement, on par with or even better 
than HDFS as storage.&lt;/li&gt;
-  &lt;li&gt;Under TPCH data set, when S3 is used as storage, whether single 
concurrent query or multiple concurrent query, after local cache and soft 
affinity scheduling are enabled, the performance of all queries can be greatly 
improved.&lt;/li&gt;
-&lt;/ol&gt;
-
-&lt;p&gt;However, for Q21 in the 4-concurrency test on the TPC-H dataset, we observed that the result with local cache and soft affinity scheduling enabled is worse than with S3 alone as storage. The data may not have been read through the cache for some reason; the underlying cause was not analyzed further in this test and will be addressed in subsequent optimization. Moreover, because TPC-H queries are complex and the SQL types vary, some SQL statements perform better than with HDFS while others fall slightly short, but overall the results are very close to those with HDFS as storage.&lt;br /&gt;
-This performance test is a preliminary verification of the improvement brought by local cache + soft affinity scheduling. Overall, local cache + soft affinity scheduling achieves a significant performance improvement for both simple and complex queries, though there is some performance loss in highly concurrent query scenarios.&lt;br /&gt;
-If users use cloud object storage as Kylin 4.0 storage, they can get a good performance experience with local cache + soft affinity scheduling enabled, which provides a performance guarantee for Kylin 4.0&#39;s storage-compute separation architecture on the cloud.&lt;/p&gt;
-
-&lt;h2 id=&quot;code-implementation&quot;&gt;05 Code Implementation&lt;/h2&gt;
-
-&lt;p&gt;Since the current code implementation is still at a basic stage, many details remain to be improved, such as implementing consistent hashing and handling the existing cache when the number of executors changes, so the author has not yet submitted a PR to the community code base. Developers who want an early preview can view the source code through the following link:&lt;/p&gt;
-
-&lt;p&gt;&lt;a 
href=&quot;https://github.com/zzcclp/kylin/commit/4e75b7fa4059dd2eaed24061fda7797fecaf2e35&quot;&gt;The
 code implementation of local cache and soft affinity 
scheduling&lt;/a&gt;&lt;/p&gt;
-
-&lt;h2 id=&quot;related-link&quot;&gt;06 Related Links&lt;/h2&gt;
-
-&lt;p&gt;The performance test result data and detailed configuration are available at the link below:&lt;br /&gt;
-&lt;a href=&quot;https://github.com/Kyligence/kylin-tpch/issues/9&quot;&gt;The benchmark of Kylin 4.0 with local cache and soft affinity scheduling&lt;/a&gt;&lt;/p&gt;
-</description>
-        <pubDate>Thu, 21 Oct 2021 04:00:00 -0700</pubDate>
-        
<link>http://kylin.apache.org/blog/2021/10/21/Local-Cache-and-Soft-Affinity-Scheduling/</link>
-        <guid 
isPermaLink="true">http://kylin.apache.org/blog/2021/10/21/Local-Cache-and-Soft-Affinity-Scheduling/</guid>
-        
-        
-        <category>blog</category>
-        
-      </item>
-    
-      <item>
-        <title>Kylin 4 Performance Optimization on the Cloud: Local Cache and Soft Affinity Scheduling</title>
-        <description>&lt;h2 id=&quot;section&quot;&gt;01 Background&lt;/h2&gt;
-&lt;p&gt;Recently, the Apache Kylin community released Kylin 4.0 with a brand-new architecture. Kylin 4.0 supports the separation of storage and compute, which allows Kylin users to run Kylin 4.0 in a more flexible cloud deployment with elastically scalable compute resources. With cloud infrastructure, users can store cube data in cheap and reliable object storage such as S3. However, under the storage-compute separation architecture, we need to take into account that reading data from remote storage over the network is still a costly operation for compute nodes and often causes performance loss.&lt;br /&gt;
-To improve the query performance of Kylin 4.0 when cloud object storage is used as its storage, we tried to introduce a local cache mechanism into the Kylin 4.0 query engine: during query execution, frequently used data is cached on the local disk to reduce the latency of pulling data from remote object storage and achieve faster query response. In addition, to avoid wasting disk space by caching the same data on a large number of Spark executors, and to let compute nodes read more of the required data from the local cache, we introduced a soft affinity scheduling strategy. The soft affinity strategy establishes a correspondence between Spark executors and data files, so that the same data can in most cases always be read on the same executor, thus improving the cache hit rate.&lt;/p&gt;
-
-&lt;h2 id=&quot;section-1&quot;&gt;02 Implementation Principle&lt;/h2&gt;
-
-&lt;h4 id=&quot;section-2&quot;&gt;1. Local Cache&lt;/h4&gt;
-&lt;p&gt;Query execution in Kylin 4.0 mainly goes through the following stages; the stages where the local cache can be used to improve performance are marked with dotted lines:&lt;/p&gt;
-
-&lt;p&gt;&lt;img 
src=&quot;/images/blog/local-cache/Local_cache_stage.png&quot; alt=&quot;&quot; 
/&gt;&lt;/p&gt;
-
-&lt;ul&gt;
-  &lt;li&gt;File list cache: cache file status on the Spark driver side. During query execution, the Spark driver needs to read the file list and obtain file information for subsequent scheduling; caching the file status locally avoids frequently listing the remote file directory.&lt;/li&gt;
-  &lt;li&gt;Data cache: cache data on the Spark executor side. Users can choose to cache data in memory or on disk. If caching in memory, increase the executor memory appropriately so that executors have enough memory for the data cache; if caching on disk, set a data cache directory, preferably on an SSD. In addition, the maximum cache capacity, the number of cache replicas, and other options can all be configured by the user.&lt;/li&gt;
-&lt;/ul&gt;
-
-&lt;p&gt;Based on the above design, different types of cache are implemented on the driver side and the executor side of Sparder, the query engine of Kylin 4.0. The basic architecture is as follows:&lt;/p&gt;
-
-&lt;p&gt;&lt;img 
src=&quot;/images/blog/local-cache/kylin4_local_cache.png&quot; 
alt=&quot;&quot; /&gt;&lt;/p&gt;
-
-&lt;h4 id=&quot;section-3&quot;&gt;2. Soft Affinity Scheduling&lt;/h4&gt;
-&lt;p&gt;When caching data on the executor side, if every executor cached all the data, the cached data would grow very large, greatly wasting disk space and easily causing frequent cache eviction. To maximize the cache hit rate of the Spark executors, the Spark driver needs to schedule the tasks for the same file onto the same executors whenever resources allow, so that the data of the same file is cached on one or a few specific executors and subsequent reads can be served from the cache.&lt;br /&gt;
-To this end, we compute the hash of the file name modulo the number of executors to obtain the list of target executors. How many executors cache a file is determined by the user-configured number of cache replicas; generally, the more replicas, the higher the probability of a cache hit. When none of the target executors is reachable or has resources available for scheduling, the scheduler falls back to Spark&#39;s random scheduling mechanism. This scheduling approach is called the soft affinity scheduling strategy: although it cannot guarantee a 100% cache hit, it effectively improves the cache hit rate and avoids the large disk waste of a full cache while sacrificing as little performance as possible.&lt;/p&gt;
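The hash-modulo selection described above can be sketched as follows (a hypothetical illustration, not Kylin's actual code; function and variable names are invented for the sketch):

```python
import zlib

def target_executors(file_name: str, executor_ids: list, replications: int) -> list:
    """Return the executors preferred for caching `file_name`."""
    n = len(executor_ids)
    # A stable hash (CRC32) so the file->executor mapping is deterministic
    # across queries; Python's built-in hash() is salted per process.
    start = zlib.crc32(file_name.encode("utf-8")) % n
    # Pick `replications` distinct executors starting from the hash slot.
    return [executor_ids[(start + i) % n] for i in range(min(replications, n))]

# The driver would try these executors first and fall back to Spark's
# normal scheduling when none of them is reachable or has free resources.
```

Because the mapping depends only on the file name and the executor count, the same file keeps landing on the same executors while the cluster size is stable; a consistent-hashing scheme (mentioned in section 05 as future work) would reduce cache churn when executors are added or removed.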
-
-&lt;h2 id=&quot;section-4&quot;&gt;03 Configuration&lt;/h2&gt;
-&lt;p&gt;Based on the above principles, we implemented the basic local cache + soft affinity scheduling features in Kylin 4.0 and ran query performance tests on the SSB dataset and the TPC-H dataset.&lt;br /&gt;
-A few important configuration options are listed here for reference; the actual configuration used is given in the link at the end:&lt;br /&gt;
-- Whether to enable the soft affinity scheduling strategy: kylin.query.spark-conf.spark.kylin.soft-affinity.enabled&lt;br /&gt;
-- Whether to enable the local cache: kylin.query.spark-conf.spark.hadoop.spark.kylin.local-cache.enabled&lt;br /&gt;
-- Number of data cache replicas, i.e. on how many executors the same data file is cached: kylin.query.spark-conf.spark.kylin.soft-affinity.replications.num&lt;br /&gt;
-- Whether to cache in memory or in a local directory: set BUFF to cache in memory, LOCAL to cache on local disk: kylin.query.spark-conf.spark.hadoop.alluxio.user.client.cache.store.type&lt;br /&gt;
-- Maximum cache capacity: kylin.query.spark-conf.spark.hadoop.alluxio.user.client.cache.size&lt;/p&gt;
-
-&lt;h2 id=&quot;section-5&quot;&gt;04 Performance Comparison&lt;/h2&gt;
-&lt;p&gt;We conducted performance tests in three scenarios in an AWS EMR environment. With scale factor = 10, we ran a single-concurrency query test on the SSB dataset, and both a single-concurrency and a 4-concurrency query test on the TPC-H dataset. Both the experimental group and the control group used S3 as storage; local cache and soft affinity scheduling were enabled only in the experimental group. In addition, we compared the experimental group with the results of using HDFS as storage in the same environment, so that users can see the optimization effect of local cache + soft affinity scheduling when Kylin 4.0 is deployed on the cloud with object storage.&lt;/p&gt;
-
-&lt;p&gt;&lt;img 
src=&quot;/images/blog/local-cache/local_cache_benchmark_result_ssb.png&quot; 
alt=&quot;&quot; /&gt;&lt;/p&gt;
-
-&lt;p&gt;&lt;img 
src=&quot;/images/blog/local-cache/local_cache_benchmark_result_tpch1.png&quot; 
alt=&quot;&quot; /&gt;&lt;/p&gt;
-
-&lt;p&gt;&lt;img 
src=&quot;/images/blog/local-cache/local_cache_benchmark_result_tpch4.png&quot; 
alt=&quot;&quot; /&gt;&lt;/p&gt;
-
-&lt;p&gt;The above results show that:&lt;br /&gt;
-1. In the single-concurrency scenario on the SSB scale-10 dataset with S3 as storage, enabling local cache and soft affinity scheduling brings roughly a 3x performance improvement, reaching the same performance as HDFS storage and even exceeding it by about 5%.&lt;br /&gt;
-2. On the TPC-H scale-10 dataset with S3 as storage, whether for single-concurrency or multi-concurrency queries, enabling local cache and soft affinity scheduling greatly improves performance in almost all queries.&lt;/p&gt;
-
-&lt;p&gt;However, for Q21 in the 4-concurrency test on the TPC-H scale-10 dataset, we observed that the result with local cache and soft affinity scheduling enabled is worse than with S3 alone as storage. The data may not have been read through the cache for some reason; the underlying cause was not analyzed further in this test and will be addressed in subsequent optimization. Because TPC-H queries are complex and the SQL types vary, some SQL statements still perform slightly worse than with HDFS as storage, but overall the results are already quite close to those of HDFS.&lt;br /&gt;
-This performance test is a preliminary verification of the improvement brought by local cache + soft affinity scheduling. Overall, local cache + soft affinity scheduling achieves a clear performance improvement for both simple and complex queries, though there is some performance loss in highly concurrent query scenarios.&lt;br /&gt;
-If users use cloud object storage as the storage of Kylin 4.0, they can get a good performance experience with local cache + soft affinity scheduling enabled, which provides a performance guarantee for Kylin 4.0&#39;s storage-compute separation architecture on the cloud.&lt;/p&gt;
-
-&lt;h2 id=&quot;section-6&quot;&gt;05 Code Implementation&lt;/h2&gt;
-&lt;p&gt;Since the current code implementation is still at a basic stage, many details remain to be improved, such as implementing consistent hashing and handling the existing cache when the number of executors changes, so the author has not yet submitted a PR to the community code base. Developers who want an early preview can view the source code through the following link:&lt;br /&gt;
-&lt;a href=&quot;https://github.com/zzcclp/kylin/commit/4e75b7fa4059dd2eaed24061fda7797fecaf2e35&quot;&gt;Code implementation of Kylin 4.0 local cache + soft affinity scheduling&lt;/a&gt;&lt;/p&gt;
-
-&lt;h2 id=&quot;section-7&quot;&gt;06 Related Links&lt;/h2&gt;
-&lt;p&gt;The performance test result data and detailed configuration are available at the link below:&lt;br /&gt;
-&lt;a href=&quot;https://github.com/Kyligence/kylin-tpch/issues/9&quot;&gt;Benchmark of Kylin 4.0 with local cache and soft affinity scheduling&lt;/a&gt;&lt;/p&gt;
-</description>
-        <pubDate>Thu, 21 Oct 2021 04:00:00 -0700</pubDate>
-        
<link>http://kylin.apache.org/cn_blog/2021/10/21/Local-Cache-and-Soft-Affinity-Scheduling/</link>
-        <guid 
isPermaLink="true">http://kylin.apache.org/cn_blog/2021/10/21/Local-Cache-and-Soft-Affinity-Scheduling/</guid>
-        
-        
-        <category>cn_blog</category>
         
       </item>
     

