This is an automated email from the ASF dual-hosted git repository.

xxyu pushed a commit to branch document
in repository https://gitbox.apache.org/repos/asf/kylin.git

commit b640f0a29b03a35f1b79faa11ecc750e7c0386c1
Author: Mukvin <boyboys...@163.com>
AuthorDate: Mon Jun 13 16:59:18 2022 +0800

    Add new blog: kylin on cloud EN version
---
 .../blog/2022-04-20-kylin4-on-cloud-part1.md       | 319 +++++++++++++++++++++
 .../blog/2022-04-20-kylin4-on-cloud-part2.md       | 265 +++++++++++++++++
 2 files changed, 584 insertions(+)

diff --git a/website/_posts/blog/2022-04-20-kylin4-on-cloud-part1.md 
b/website/_posts/blog/2022-04-20-kylin4-on-cloud-part1.md
new file mode 100644
index 0000000000..62e17ccad2
--- /dev/null
+++ b/website/_posts/blog/2022-04-20-kylin4-on-cloud-part1.md
@@ -0,0 +1,319 @@
+---
+layout: post-blog
+title: Kylin on Cloud — Build A Data Analysis Platform on the Cloud in Two 
Hours Part 1
+date: 2022-04-20 11:00:00
+author: Yaqian Zhang
+categories: blog
+---
+
+## Video Tutorials
+
+[Kylin on Cloud — Build A Data Analysis Platform on the Cloud in Two Hours 
Part 1](https://youtu.be/5kKXEMjO1Sc)
+
+## Background
+
+Apache Kylin is a multidimensional database based on pre-computation and multidimensional models. It also provides a standard SQL query interface. In Kylin, users can define table relationships by creating Models, define dimensions and measures by creating Cubes, and run data aggregation with Cube building. The pre-computed data is saved to answer user queries, and users can also perform further aggregation on the pre-computed data, significantly improving query performance.
+
+With the release of Kylin 4.0, Kylin can now be deployed without a Hadoop 
environment. To make it easier for users to deploy Kylin on the cloud, Kylin 
community recently developed a cloud deployment tool that allows users to 
obtain a complete Kylin cluster by executing just one line of command, 
delivering a fast and efficient analysis experience for the users. Moreover, in 
January 2022, the Kylin community released MDX for Kylin to enhance the 
semantic capability of Kylin as a multidimen [...]
+
+With all these innovations, users can easily and quickly deploy Kylin clusters 
on the cloud, create multi-dimensional models, and enjoy the short query 
latency brought by pre-computation; what's more, users can also use MDX for 
Kylin to define and manage business metrics, leveraging both the advantages of 
data warehouse and business semantics.
+
+With Kylin + MDX for Kylin, users can directly work with BI tools for 
multidimensional data analysis, or use it as the basis to build complex 
applications such as metrics platforms. Compared with the solution of building 
a metrics platform directly with computing engines such as Spark and Hive that 
perform Join and aggregated query computation at runtime, Kylin, with our 
multidimensional modeling, pre-computation technology, and semantics layer 
capabilities empowered by MDX for Kylin, pr [...]
+
+This tutorial starts from a data engineer's perspective to show how to build a Kylin on Cloud data analysis platform that delivers a high-performance query experience over hundreds of millions of rows of data at a lower TCO, the ability to manage business metrics through MDX for Kylin, and direct connection to BI tools for quick report generation.
+
+Each step of this tutorial is explained in detail with illustrations and checkpoints to help newcomers. All you need to start is an AWS account and 2 hours. Note: the cloud cost to finish this tutorial is around $15.
+
+![](/images/blog/kylin4_on_cloud/0_deploy_kylin.png)
+
+## Business scenario
+
+Since the beginning of 2020, COVID-19 has spread rapidly all over the world, greatly changing people’s daily life, especially their travel habits. Based on the pandemic data and New York taxi trip data since 2018, this tutorial studies the impact of the pandemic on the New York taxi industry, analyzing indicators such as positive cases, fatality rate, taxi orders, and average travel mileage. We hope this analysis can provide some insights for future decision-making.
+
+### Business issues
+
+- The severity of the pandemic in different countries and regions
+- Travel metrics of different blocks in New York City, such as order number, 
travel mileage, etc.
+- Does the pandemic have a significant impact on taxi orders?
+- Travel habits change after the pandemic (long-distance vs. short-distance 
travels)
+- Is the severity of the pandemic strongly related to taxi travel?
+
+### Dataset
+
+#### COVID-19 Dataset
+
+The COVID-19 dataset includes a fact table `covid_19_activity` and a dimension 
table `lookup_calendar`.
+
+`covid_19_activity` contains the number of confirmed cases and deaths reported 
each day in different regions around the world. `lookup_calendar` is a date 
dimension table that holds time-extended information, such as the beginning of 
the year, and the beginning of the month for each date. `covid_19_activity` and 
`lookup_calendar` are associated by date.
+COVID-19 dataset information:
+
+| Item | Value |
+|------|-------|
+| Data size | 235 MB |
+| Fact table row count | 2,753,688 |
+| Data range | 2020-01-21~2022-03-07 |
+| Download address provided by the dataset provider | https://data.world/covid-19-data-resource-hub/covid-19-case-counts/workspace/file?filename=COVID-19+Activity.csv |
+| S3 directory of the dataset | s3://public.kyligence.io/kylin/kylin_demo/data/covid19_data/ |
+
+#### NYC taxi order dataset
+
+The NYC taxi order dataset consists of a fact table `taxi_trip_records_view`, 
and two dimension tables, `newyork_zone` and `lookup_calendar`.
+
+Among them, each record in `taxi_trip_records_view` corresponds to one taxi trip and contains information like the pick-up ID, drop-off ID, trip duration, order amount, travel mileage, etc. `newyork_zone` records the administrative district corresponding to each location ID. `taxi_trip_records_view` is connected with `newyork_zone` through the columns PULocationID and DOLocationID to get information about the pick-up and drop-off blocks. `lookup_calendar` is the same dimension table as in th [...]
+
+NYC taxi order dataset information:
+
+| Item | Value |
+|------|-------|
+| Data size | 19 G |
+| Fact table row count | 226,849,274 |
+| Data range | 2018-01-01~2021-07-31 |
+| Download address provided by the dataset provider | https://www1.nyc.gov/site/tlc/about/tlc-trip-record-data.page |
+| S3 directory of the dataset | s3://public.kyligence.io/kylin/kylin_demo/data/trip_data_2018-2021/ |
+
+
+
+#### ER Diagram
+
+The ER diagram of the COVID-19 dataset and NYC taxi order dataset is as 
follows:
+
+![](/images/blog/kylin4_on_cloud/1_table_ER.png)
+
+### Metrics design
+
+Based on the business issues we want to solve with this model, we designed the following atomic metrics and business metrics:
+
+###### 1. Atomic metrics
+
+Atomic metrics refer to measures created in Kylin Cube, which are relatively 
simple, as they only run aggregated calculations on one column.
+
+- Covid19 case count: `sum(covid_19_activity.people_positive_cases_count)`
+- Covid19 fatality: `sum(covid_19_activity.people_death_count)`
+- Covid19 new positive case count: `sum(covid_19_activity.people_positive_new_cases_count)`
+- Covid19 new death count: `sum(covid_19_activity.people_death_new_count)`
+- Taxi trip mileage: `sum(taxi_trip_records_view.trip_distance)`
+- Taxi order amount: `sum(taxi_trip_records_view.total_amount)`
+- Taxi trip count: `count()`
+- Taxi trip duration: `sum(taxi_trip_records_view.trip_time_hour)`
+
+###### 2. Business metrics
+
+Business metrics are various compound operations based on atomic metrics that 
have specific business meanings.
+
+- MTD, YTD of each atomic metric
+- MOM, YOY of each atomic metric
+- Covid19 fatality rate: death count/positive case count
+- Average taxi trip speed: taxi trip distance/taxi trip duration
+- Average taxi trip mileage: taxi trip distance/taxi trip count
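As a quick illustration, the business metrics are simple arithmetic over the atomic metric values; the sketch below uses made-up numbers (not taken from the datasets above), and YOY/MOM share one growth-rate formula:

```python
# Hypothetical atomic metric values, standing in for what Kylin's
# pre-computed measures would return (numbers are made up).
atomic = {
    "covid19_case_count": 1_000_000,  # sum(people_positive_cases_count)
    "covid19_fatality": 20_000,       # sum(people_death_count)
    "taxi_trip_distance": 50_000.0,   # sum(trip_distance), miles
    "taxi_trip_duration": 4_000.0,    # sum(trip_time_hour), hours
    "taxi_trip_count": 25_000,        # count()
}

# Business metrics are compound operations over atomic metrics.
cfr_covid19 = atomic["covid19_fatality"] / atomic["covid19_case_count"]
trip_mean_speed = atomic["taxi_trip_distance"] / atomic["taxi_trip_duration"]
trip_mean_distance = atomic["taxi_trip_distance"] / atomic["taxi_trip_count"]

def yoy(current: float, prior: float) -> float:
    """Year-over-year growth rate; MOM uses the same formula on months."""
    return current / prior - 1

print(cfr_covid19)         # 0.02
print(trip_mean_speed)     # 12.5 (mph)
print(trip_mean_distance)  # 2.0 (miles)
print(yoy(30_000, 100_000))  # about -0.7, i.e. a 70% drop
```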
+
+## Operation Overview
+
+The diagram below shows the main steps to build a cloud data analysis platform with Apache Kylin and how to perform data analysis:
+
+![](/images/blog/kylin4_on_cloud/2_step_overview.jpg)
+
+## Cluster architecture
+
+Here is the architecture of the Kylin cluster deployed by the cloud deployment 
tool:
+
+![](/images/blog/kylin4_on_cloud/3_kylin_cluster.jpg)
+
+## Kylin on Cloud deployment
+
+### Prerequisites
+
+- GitHub Desktop: for downloading the deployment tool
+- Python 3.6.6 or above: for running the deployment tool
+
+### AWS permission check and initialization
+
+Log in to AWS with your account to check the permission status and then create 
the Access Key, IAM Role, Key Pair, and S3 working directory according to the 
document 
[Prerequisites](https://github.com/apache/kylin/blob/kylin4_on_cloud/readme/prerequisites.md).
 Subsequent AWS operations will be performed with this account.
+
+### Configure the deployment tool
+
+1. Execute the following command to clone the code for the Kylin on AWS 
deployment tool.
+
+    ```shell
+    git clone -b kylin4_on_cloud --single-branch 
https://github.com/apache/kylin.git && cd kylin
+    ```
+
+2. Initialize a Python virtual environment on your local machine.
+
+    Run the command below to check the Python version. Note: Python 3.6.6 or 
above is needed:
+    
+    ```shell
+    python --version
+    ```
+    
+    Initialize the virtual environment for Python and install dependencies:
+    
+    ```shell
+    bin/init.sh
+    source venv/bin/activate
+    ```
+
+3. Modify the configuration file `kylin_configs.yaml`
+
+Open the `kylin_configs.yaml` file and replace the following configuration items with actual values:
+
+- `AWS_REGION`: region for the EC2 instances; the default value is `cn-northwest-1`
+- `${IAM_ROLE_NAME}`: the IAM Role just created, e.g. `kylin_deploy_role`
+- `${S3_URI}`: the S3 working directory for deploying Kylin, e.g. `s3://kylindemo/kylin_demo_dir/`
+- `${KEY_PAIR}`: the key pair just created, e.g. `kylin_deploy_key`
+- `${Cidr Ip}`: the IP address range allowed to access the EC2 instances, e.g. `10.1.0.0/32`; usually set to your external IP address so that only you can access these EC2 instances
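If you prefer to script the edit, here is a minimal Python sketch, assuming the placeholders listed above appear verbatim in `kylin_configs.yaml` (the helper name is ours, not part of the tool):

```python
from pathlib import Path

# Example values from the list above. The placeholder names are the ones
# the tutorial lists; we assume they appear verbatim in kylin_configs.yaml.
values = {
    "${IAM_ROLE_NAME}": "kylin_deploy_role",
    "${S3_URI}": "s3://kylindemo/kylin_demo_dir/",
    "${KEY_PAIR}": "kylin_deploy_key",
    "${Cidr Ip}": "10.1.0.0/32",
}

def fill_placeholders(text: str, values: dict) -> str:
    """Substitute each ${...} placeholder with its configured value."""
    for placeholder, value in values.items():
        text = text.replace(placeholder, value)
    return text

path = Path("kylin_configs.yaml")
if path.exists():  # guard so the sketch is a no-op outside the repo
    path.write_text(fill_placeholders(path.read_text(), values))
```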
+
+As Kylin adopts a read-write separation architecture to separate build and 
query resources, in the following steps, we will first start a build cluster to 
connect to Glue to create tables, load data sources, and submit build jobs for 
pre-computation, then delete the build cluster but save the metadata. Then we 
will start a query cluster with MDX for Kylin to create business metrics, 
connect to BI tools for queries, and perform data analysis. Kylin on AWS 
cluster uses RDS to store metadat [...]
+
+### Kylin build cluster
+
+#### Start Kylin build cluster
+
+1. Start the build cluster with the following command. The whole process may 
take 15-30 minutes depending on your network conditions.
+
+    ```shell
+    python deploy.py --type deploy --mode job
+    ```
+
+2. You may check the terminal to see if the build cluster is successfully 
deployed:
+
+![](/images/blog/kylin4_on_cloud/4_deploy_cluster_successfully.png)
+
+#### Check AWS Service
+
+1. Go to CloudFormation on AWS console, where you can see 7 stacks are created 
by the Kylin deployment tool:
+
+    ![](/images/blog/kylin4_on_cloud/5_check_aws_stacks.png)
+
+2. Users can view the details of EC2 nodes through the AWS console or use the 
command below to check the names, private IPs, and public IPs of all EC2 nodes.
+
+    ```shell
+    python deploy.py --type list
+    ```
+
+    ![](/images/blog/kylin4_on_cloud/6_list_cluster_node.png)
+
+#### Spark-SQL query response time
+
+Let's first check the query response time in Spark-SQL environment as a 
comparison.
+
+1. First, log in to the EC2 instance where Kylin is deployed using the public IP of the Kylin node, switch to the root user, and source `~/.bash_profile` to load the environment variables set beforehand.
+
+    ```shell
+    ssh -i "${KEY_PAIR}" ec2-user@${kylin_node_public_ip}
+    sudo su
+    source ~/.bash_profile
+    ```
+
+2. Go to `$SPARK_HOME` and modify the configuration file `conf/spark-defaults.conf`, changing `spark_master_node_private_ip` to the private IP of the Spark master node:
+
+    ```shell
+    cd $SPARK_HOME
+    vim conf/spark-defaults.conf
+    
+     ## Replace spark_master_node_private_ip with the private IP of the real 
Spark master node
+     spark.master spark://spark_master_node_private_ip:7077
+    ```
+    
+    In `spark-defaults.conf`, the resource allocation for driver and executor 
is the same as that for Kylin query cluster.
+
+3. Create table in Spark-SQL
+
+    All data from the test dataset is stored in S3 buckets in `cn-north-1` and `us-east-1`. If your S3 bucket is in `cn-north-1` or `us-east-1`, you can directly run the SQL to create the tables; otherwise, you will need to execute the following script to copy the data to the S3 working directory set up in `kylin_configs.yaml` and modify the table creation SQL accordingly:
+    
+    ```shell
+    ## AWS CN user
+    aws s3 sync s3://public.kyligence.io/kylin/kylin_demo/data/ ${S3_DATA_DIR} 
--region cn-north-1
+    
+    ## AWS Global user
+    aws s3 sync s3://public.kyligence.io/kylin/kylin_demo/data/ ${S3_DATA_DIR} 
--region us-east-1
+    
+    ## Modify create table SQL
+    sed -i 
"s#s3://public.kyligence.io/kylin/kylin_demo/data/#${S3_DATA_DIR}#g" 
/home/ec2-user/kylin_demo/create_kylin_demo_table.sql
+    ```
+
+    Execute SQL for creating table:
+    
+    ```shell
+    bin/spark-sql -f /home/ec2-user/kylin_demo/create_kylin_demo_table.sql
+    ```
+
+4. Execute query in Spark-SQL
+
+    Go to Spark-SQL:
+    
+    ```shell
+    bin/spark-sql
+    ```
+    
+    Run query in Spark-SQL:
+    
+    ```sql
+    use kylin_demo;
+    select TAXI_TRIP_RECORDS_VIEW.PICKUP_DATE, NEWYORK_ZONE.BOROUGH, count(*), 
sum(TAXI_TRIP_RECORDS_VIEW.TRIP_TIME_HOUR), 
sum(TAXI_TRIP_RECORDS_VIEW.TOTAL_AMOUNT)
+    from TAXI_TRIP_RECORDS_VIEW
+    left join NEWYORK_ZONE
+    on TAXI_TRIP_RECORDS_VIEW.PULOCATIONID = NEWYORK_ZONE.LOCATIONID
+    group by TAXI_TRIP_RECORDS_VIEW.PICKUP_DATE, NEWYORK_ZONE.BOROUGH;
+    ```
+
+    We can see that with the same configuration as Kylin query cluster, direct 
query using Spark-SQL takes over 100s:
+    
+    ![](/images/blog/kylin4_on_cloud/7_query_in_spark_sql.png)
+
+5. After the query is successfully executed, exit Spark-SQL before proceeding to the following steps, to free up the resources.
+
+#### Import Kylin metadata
+
+1. Go to `$KYLIN_HOME`
+
+    ```shell
+    cd $KYLIN_HOME
+    ```
+
+2. Import metadata
+
+    ```shell
+    bin/metastore.sh restore /home/ec2-user/meta_backups/
+    ```
+
+3. Reload metadata
+
+Enter `http://${kylin_node_public_ip}:7070/kylin` (replace the IP with the public IP of the EC2 node) in your browser, and log in to the Kylin web UI with the default username and password ADMIN/KYLIN:
+
+![](/images/blog/kylin4_on_cloud/8_kylin_web_ui.png)
+
+Reload Kylin metadata by clicking System -> Configuration -> Reload Metadata:
+
+![](/images/blog/kylin4_on_cloud/9_reload_kylin_metadata.png)
+
+If you'd like to learn how to manually create the Model and Cube included in 
Kylin metadata, please refer to [Create model and cube in 
Kylin](https://cwiki.apache.org/confluence/display/KYLIN/Create+Model+and+Cube+in+Kylin).
+
+#### Run build
+
+Submit the Cube build job. Since no partition column is set in the model, we 
will directly perform a full build for the two cubes:
+
+![](/images/blog/kylin4_on_cloud/10_full_build_cube.png.png)
+
+![](/images/blog/kylin4_on_cloud/11_kylin_job_complete.png)
+
+#### Destroy build cluster
+
+After the build job is completed, execute the cluster destroy command to shut down the build cluster. By default, the RDS stack, monitor stack, and VPC stack will be kept.
+
+```shell
+python deploy.py --type destroy
+```
+
+The cluster is successfully shut down:
+
+![](/images/blog/kylin4_on_cloud/12_destroy_job_cluster.png)
+
+#### Check AWS resource
+
+After the cluster is successfully deleted, you can go to the `CloudFormation` 
page in AWS console to confirm whether there are remaining resources. Since the 
metadata RDS, monitor nodes, and VPC nodes are kept by default, you will see 
only the following three stacks on the page.
+
+![](/images/blog/kylin4_on_cloud/13_check_aws_stacks.png)
+
+The resources in the three stacks will still be used when we start the query 
cluster, to ensure that the query cluster and the build cluster use the same 
set of metadata.
+
+#### Intro to next part
+
+That’s all for the first part of Kylin on Cloud — Build A Data Analysis Platform on the Cloud in Two Hours. Please see part 2 here: [Kylin on Cloud — Build A Data Analysis Platform on the Cloud in Two Hours Part 2](../kylin4-on-cloud-part2/)
diff --git a/website/_posts/blog/2022-04-20-kylin4-on-cloud-part2.md 
b/website/_posts/blog/2022-04-20-kylin4-on-cloud-part2.md
new file mode 100644
index 0000000000..0f27341c08
--- /dev/null
+++ b/website/_posts/blog/2022-04-20-kylin4-on-cloud-part2.md
@@ -0,0 +1,265 @@
+---
+layout: post-blog
+title: Kylin on Cloud — Build A Data Analysis Platform on the Cloud in Two 
Hours Part 2
+date: 2022-04-20 11:00:00
+author: Yaqian Zhang
+categories: blog
+---
+
+
+This is the second part of the blog series. For part 1, see [Kylin on Cloud — Build A Data Analysis Platform on the Cloud in Two Hours Part 1](../kylin4-on-cloud-part1/).
+
+### Video Tutorials
+
+[Kylin on Cloud — Build A Data Analysis Platform on the Cloud in Two Hours 
Part 2](https://youtu.be/LPHxqZ-au4w)
+
+
+### Kylin query cluster
+
+#### Start Kylin query cluster
+
+1. In addition to the `kylin_configs.yaml` configuration used for the build cluster, enable MDX with the setting below:
+
+   ```
+   ENABLE_MDX: &ENABLE_MDX 'true'
+   ```
+
+2. Then execute the deploy command to start the cluster:
+
+   ```
+   python deploy.py --type deploy --mode query
+   ```
+
+#### Query with Kylin
+
+1. After the query cluster is successfully started, first execute `python deploy.py --type list` to get all node information, then enter `http://${kylin_node_public_ip}:7070/kylin` in your browser to open the Kylin web UI:
+
+   ![](/images/blog/kylin4_on_cloud/14_kylin_web_ui.png)
+
+2. On the Insight page, execute the same SQL as we ran in Spark-SQL:
+
+   ```
+   select TAXI_TRIP_RECORDS_VIEW.PICKUP_DATE, NEWYORK_ZONE.BOROUGH, count(*), 
sum(TAXI_TRIP_RECORDS_VIEW.TRIP_TIME_HOUR), 
sum(TAXI_TRIP_RECORDS_VIEW.TOTAL_AMOUNT)
+   from TAXI_TRIP_RECORDS_VIEW
+   left join NEWYORK_ZONE
+   on TAXI_TRIP_RECORDS_VIEW.PULOCATIONID = NEWYORK_ZONE.LOCATIONID
+   group by TAXI_TRIP_RECORDS_VIEW.PICKUP_DATE, NEWYORK_ZONE.BOROUGH;
+   ```
+
+   ![](/images/blog/kylin4_on_cloud/15_query_in_kylin.png)
+
+As we can see, when the query hits the cube, that is, when it is answered directly by the pre-computed data, the result is returned in about 4s, a great reduction from the over 100s of query latency in Spark-SQL.
+
+### Pre-computation reduces query cost
+
+In this test, we used the New York taxi order data, whose fact table contains 200+ million rows. As the results show, Kylin significantly improves query efficiency in big data analysis scenarios involving hundreds of millions of rows. Moreover, the built data can be reused to answer thousands of subsequent queries, thereby reducing the query cost.
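The mechanism behind this saving can be sketched in a few lines of Python (a toy illustration of pre-aggregation, not Kylin's actual implementation): aggregate once at build time, then answer repeated queries from the much smaller pre-computed result.

```python
from collections import defaultdict

# Toy fact table standing in for taxi_trip_records_view (made-up rows).
trips = [
    {"pickup_date": "2021-07-01", "borough": "Manhattan", "total_amount": 12.5},
    {"pickup_date": "2021-07-01", "borough": "Manhattan", "total_amount": 30.0},
    {"pickup_date": "2021-07-01", "borough": "Queens",    "total_amount": 41.0},
    {"pickup_date": "2021-07-02", "borough": "Queens",    "total_amount": 9.5},
]

# "Build": pre-aggregate once per (pickup_date, borough), like a cuboid.
cuboid = defaultdict(lambda: {"count": 0, "sum_amount": 0.0})
for t in trips:
    cell = cuboid[(t["pickup_date"], t["borough"])]
    cell["count"] += 1
    cell["sum_amount"] += t["total_amount"]

# "Query": later queries scan the small cuboid, not the raw fact table.
def order_count(date: str, borough: str) -> int:
    return cuboid[(date, borough)]["count"]

print(order_count("2021-07-01", "Manhattan"))  # 2
```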
+
+### Configure semantic layer
+
+#### Import Dataset into MDX for Kylin
+
+With `MDX for Kylin`, you can create a `Dataset` based on Kylin Cubes, define Cube relations, and create business metrics. To make it easy for beginners, you can directly download the dataset file from S3 and import it into `MDX for Kylin`:
+
+1. Download the dataset to your local machine from S3.
+
+   ```
+   wget 
https://s3.cn-north-1.amazonaws.com.cn/public.kyligence.io/kylin/kylin_demo/covid_trip_project_covid_trip_dataset.json
+   ```
+
+2. Access `MDX for Kylin` web UI
+
+   Enter `http://${kylin_node_public_ip}:7080` in your browser to access `MDX 
for Kylin` web UI and log in with the default username and password 
`ADMIN/KYLIN`:
+   
+   ![](/images/blog/kylin4_on_cloud/16_mdx_web_ui.png)
+
+3. Confirm Kylin connection
+
+   `MDX for Kylin` is already configured with the information of the Kylin 
node to be connected. You only need to type in the username and password 
(`ADMIN/KYLIN`) for the Kylin node when logging in for the first time.
+   
+   ![](/images/blog/kylin4_on_cloud/17_connect_to_kylin.png)
+   
+   ![](/images/blog/kylin4_on_cloud/18_exit_management.png)
+
+4. Import Dataset
+
+   After Kylin is successfully connected, click the icon in the upper right 
corner to exit the management page:
+   
+   ![](/images/blog/kylin4_on_cloud/19_kylin_running.png)
+   
+   Switch to the `covid_trip_project` project and click `Import Dataset` on 
`Dataset` page:
+   
+   ![](/images/blog/kylin4_on_cloud/20_import_dataset.png)
+
+   Select and import the `covid_trip_project_covid_trip_dataset.json` file we just downloaded from S3.
+
+   `covid_trip_dataset` contains specific dimensions and measures for each 
atomic metric, such as YTD, MTD, annual growth, monthly growth, time hierarchy, 
and regional hierarchy; as well as various business metrics including COVID-19 
death rate, the average speed of taxi trips, etc. For more information on how 
to manually create a dataset, see Create dataset in `MDX for Kylin` or [MDX for 
Kylin User Manual](https://kyligence.github.io/mdx-kylin/).
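To make the YTD/MTD semantics concrete, here is a toy Python sketch (made-up numbers, not the tutorial datasets) of a to-date total, i.e. a running sum that resets at each year or month boundary:

```python
# Toy daily order counts (made up) keyed by ISO date, to illustrate what
# the YTD/MTD measures in covid_trip_dataset mean.
daily = {
    "2021-01-30": 10,
    "2021-01-31": 12,
    "2021-02-01": 7,
    "2021-02-02": 5,
}

def to_date_total(daily: dict, prefix_len: int) -> dict:
    """Running total that resets when the date prefix (year or month) changes."""
    out, running, current = {}, 0, None
    for date in sorted(daily):
        prefix = date[:prefix_len]
        if prefix != current:
            current, running = prefix, 0
        running += daily[date]
        out[date] = running
    return out

ytd = to_date_total(daily, 4)  # prefix "2021"
mtd = to_date_total(daily, 7)  # prefixes "2021-01", "2021-02"
print(mtd["2021-01-31"])  # 22 (10 + 12)
print(mtd["2021-02-02"])  # 12 (7 + 5, reset at the month start)
print(ytd["2021-02-02"])  # 34 (all of 2021 so far)
```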
+
+## Data analysis with BI and Excel
+
+### Data analysis using Tableau
+
+Let's take Tableau installed on a local Windows machine as an example to 
connect to MDX for Kylin for data analysis.
+
+1. Select Tableau's built-in `Microsoft Analysis Service` to connect to `MDX 
for Kylin`. (Note: Please install the [`Microsoft Analysis Services` 
driver](https://www.tableau.com/support/drivers?_ga=2.104833284.564621013.1647953885-1839825424.1608198275)
 in advance, which can be downloaded from Tableau).
+
+   ![](/images/blog/kylin4_on_cloud/21_tableau_connect.png)
+
+2. On the pop-up settings page, enter the `MDX for Kylin` server address, username, and password. The server address is `http://${kylin_node_public_ip}:7080/mdx/xmla/covid_trip_project`:
+
+   ![](/images/blog/kylin4_on_cloud/22_tableau_server.png)
+
+3. Select `covid_trip_dataset` as the dataset:
+
+   ![](/images/blog/kylin4_on_cloud/23_tableau_dataset.png)
+
+4. Then we can run data analysis with the worksheet. Since we have defined the 
business metrics with `MDX for Kylin`, when we want to generate a business 
report with Tableau, we can directly drag the pre-defined business metrics into 
the worksheet to create a report.
+
+5. First, we will analyze the pandemic data and draw the national-level pandemic map with the number of confirmed cases and the mortality rate. We only need to drag and drop `COUNTRY_SHORT_NAME` under `REGION_HIERARCHY` to the Columns field, drag and drop `SUM_NEW_POSITIVE_CASES` and `CFR_COVID19` (fatality rate) under Measures to the Rows field, and then choose to display the results as a map:
+
+   ![](/images/blog/kylin4_on_cloud/24_tableau_covid19_map.png)
+   
+   The size of the symbols represents the level of COVID-19 death count and 
the shade of the color represents the level of the mortality rate. According to 
the pandemic map, the United States and India have more confirmed cases, but 
the mortality rates in the two countries are not significantly different from 
the other countries. However, countries with much fewer confirmed cases, such 
as Peru, Vanuatu, and Mexico, have persistently high death rates. You can 
continue to explore the reaso [...]
+   
+   Since we have set up a regional hierarchy, we can break down the 
country-level situation to the provincial/state level to see the pandemic 
situation in different regions of each country:
+   
+   ![](/images/blog/kylin4_on_cloud/25_tableau_province.png)
+   
+   Zoom in on the COVID map to see the status in each state of the United 
States:
+   
+   ![](/images/blog/kylin4_on_cloud/26_tableau_us_covid19.png)
+   
+   It can be concluded that there is no significant difference in the 
mortality rate in each state of the United States, which is around 0.01. In 
terms of the number of confirmed cases, it is significantly higher in 
California, Texas, Florida, and New York City. These regions are economically 
developed and have a large population. This might be the reason behind the 
higher number of confirmed COVID-19 cases. In the following part, we will 
combine the pandemic data with the New York taxi  [...]
+
+6. For the New York taxi order dataset, we want to compare the order numbers 
and travel speed in different boroughs.
+
+Drag and drop `BOROUGH` under `PICKUP_NEWYORK_ZONE` to Columns, and drag and 
drop `ORDER_COUNT` and `trip_mean_speed` under Measures to Rows, and display 
the results as a map. The color shade represents the average speed and the size 
of the symbol represents the order number. We can see that taxi orders 
departing from Manhattan are higher than all the other boroughs combined, but 
the average speed is the lowest. Queens ranks second in terms of order number 
while Staten Island has the low [...]
+
+![](/images/blog/kylin4_on_cloud/27_tableau_taxi_1.png)
+
+Then we will replace the field `BOROUGH` from `PICKUP_NEWYORK_ZONE` with 
`BOROUGH` from `DROPOFF_NEWYORK_ZONE`, to analyze the number of taxi orders and 
average speed by drop-off ID:
+
+![](/images/blog/kylin4_on_cloud/27_tableau_taxi_2.png)
+
+The pick-up and drop-off data of Brooklyn, Queens, and Bronx differ greatly, 
for example, the taxi orders to Brooklyn or Bronx are much higher than those 
departing from there, while there are much fewer taxi trips to Queens than 
those starting from it.
+
+- Travel habits change after the pandemic (long-distance vs. short-distance 
travels)
+
+To learn how residents’ travel habits changed, we can analyze the average trip mileage: drag and drop the dimension `MONTH_START` to Rows, and drag and drop the metric `trip_mean_distance` to Columns:
+
+![](/images/blog/kylin4_on_cloud/28_tableau_taxi_3.png)
+
+Based on the histogram we can see that there have been significant changes in 
people’s travel behavior before and after the outbreak of COVID-19, as the 
average trip mileage has increased significantly since March 2020 and in some 
months is even several times higher, and the trip mileage of each month 
fluctuated greatly. We can combine these data with the pandemic data in the 
month dimension, so we drag and drop `SUM_NEW_POSITIVE_CASES` and 
`MTD_ORDER_COUNT` to Rows and add `PROVINCE_STA [...]
+
+![](/images/blog/kylin4_on_cloud/29_tableau_taxi_4.png)
+
+It is interesting to see that the number of taxi orders decreased sharply at 
the beginning of the outbreak while the average trip mileage increased, 
indicating people have cut unnecessary short-distance travels or switched to a 
safer means of transportation. By comparing the data curves, we can see that 
the severity of the pandemic and people’s travel patterns are highly related, 
taxi orders drop and average trip mileage increases when the pandemic worsens, 
while when the situation impro [...]
+
+### Data analysis via Excel
+
+With `MDX for Kylin`, we can also use Kylin for big data analysis in Excel. In this test, we will use Excel installed on a local Windows machine to connect to MDX for Kylin.
+
+1. Open Excel, select `Data` -> `Get Data` -> `From Database` -> `From 
Analysis Services`:
+   
+   ![](/images/blog/kylin4_on_cloud/30_excel_connect.png)
+
+2. In the `Data Connection Wizard`, enter `http://${kylin_node_public_ip}:7080/mdx/xmla/covid_trip_project` as the server name:
+
+   ![](/images/blog/kylin4_on_cloud/31_excel_server.png)
+   
+   ![](/images/blog/kylin4_on_cloud/32_tableau_dataset.png)
+
+3. Then create a PivotTable for this data connection. We can see the data 
listed here is the same as that when we are using Tableau. So no matter whether 
analysts are using Tableau or Excel, they are working on identical sets of data 
models, dimensions, and business metrics, thereby realizing unified semantics.
+
+4. We have just created a pandemic map and run a trend analysis using 
`covid19` and `newyork_trip_data` with Tableau. In Excel, we can check more 
details for the same datasets and data scenarios.
+
+- For COVID-19 related data, we add `REGION_HIERARCHY` and pre-defined 
`SUM_NEW_POSITIVE_CASES` and mortality rate `CFR_COVID19` to the PivotTable:
+
+![](/images/blog/kylin4_on_cloud/33_tableau_covid19_1.png)
+
+The highest level of the regional hierarchy is `CONTINENT_NAME`, which 
includes the number of confirmed cases and mortality rate in each continent. We 
can see that Europe has the highest number of confirmed cases while Africa has 
the highest mortality rate. In this PivotTable, we can easily drill down to 
lower regional levels to check more fine-grained data, such as data from 
different Asian countries, and sort them in descending order according to the 
number of confirmed cases:
+
+![](/images/blog/kylin4_on_cloud/34_excel_covid20_2.png)
+
+The data shows that India, Turkey, and Iran are the countries with the highest 
number of confirmed cases.
+
+- Regarding the question of whether the pandemic has a significant impact on taxi orders, we first look at the YTD and growth rate of taxi orders at the year level by creating a PivotTable with `TIME_HIERARCHY` as the time-hierarchy dimension and `YOY_ORDER_COUNT` and `YTD_ORDER_COUNT` as measures:
+
+![](/images/blog/kylin4_on_cloud/35_excel_taxi_1.png)
+
+It can be seen that since the outbreak of the pandemic in 2020, there has been a sharp decrease in taxi orders. The growth rate in 2020 was -0.7079, that is, a reduction of about 70% in taxi orders. The growth rate in 2021 was still negative, but the decrease was less pronounced than in 2020, when the pandemic had just started.
+
+Click to expand the time hierarchy to view the data at quarter, month, and 
even day levels. By selecting `MOM_ORDER_COUNT` and `ORDER_COUNT`, we can check 
the monthly order growth rate and order numbers in different time hierarchies:
+
+![](/images/blog/kylin4_on_cloud/36_excel_taxi_2.png)
+
+The order growth rate in March 2020 was -0.52, already a significant fall. The rate dropped even further to -0.92 in April, that is, a reduction of over 90% in orders. After that, the decrease became less pronounced, but taxi orders were still much lower than before the outbreak.
+
+### Use API to integrate Kylin with data analysis platform
+
+In addition to mainstream BI tools such as Excel and Tableau, many companies also develop their own in-house data analysis platforms. For such self-developed platforms, users can still use Kylin + MDX for Kylin as the base of the analysis platform by calling its API, ensuring a unified data definition. In the following part, we will show you how to send a query to MDX for Kylin through olap4j, a Java library similar to a JDBC driver that can access any OLAP service.
+
+We also provide a simple demo for our users; you may click [mdx query demo](https://github.com/apache/kylin/tree/mdx-query-demo) to download the source code.
+
+1. Download jar package for the demo:
+
+   ```
+   wget 
https://s3.cn-north-1.amazonaws.com.cn/public.kyligence.io/kylin/kylin_demo/mdx_query_demo.tgz
+   tar -xvf mdx_query_demo.tgz
+   cd mdx_query_demo
+   ```
+
+2. Run demo
+
+Make sure Java 8 is installed before running the demo:
+
+![](/images/blog/kylin4_on_cloud/37_jdk_8.png)
+
+Two parameters are needed to run the demo: the IP of the MDX node and the MDX 
query to be run. The default port is 7080. The MDX node IP here is the public 
IP of the Kylin node.
+
+```
+java -cp 
olap4j-xmla-1.2.0.jar:olap4j-1.2.0.jar:xercesImpl-2.9.1.jar:mdx-query-demo-0.0.1.jar
 io.kyligence.mdxquerydemo.MdxQueryDemoApplication "${kylin_node_public_ip}" 
"${mdx_query}"
+```
+
+Or you could enter only the IP of the MDX node; the demo will then automatically run the following MDX statement, which counts the order number and average trip mileage of each borough according to the pick-up ID:
+
+```
+SELECT
+{[Measures].[ORDER_COUNT],
+[Measures].[trip_mean_distance]}
+DIMENSION PROPERTIES [MEMBER_UNIQUE_NAME],[MEMBER_ORDINAL],[MEMBER_CAPTION] ON 
COLUMNS,
+NON EMPTY [PICKUP_NEWYORK_ZONE].[BOROUGH].[BOROUGH].AllMembers
+DIMENSION PROPERTIES [MEMBER_UNIQUE_NAME],[MEMBER_ORDINAL],[MEMBER_CAPTION] ON 
ROWS
+FROM [covid_trip_dataset]
+```
+
+We will also use the default query in this tutorial. After the execution is 
completed, we can get the query result in the command line:
+
+![](/images/blog/kylin4_on_cloud/38_demo_result.png)
+
+As you can see, we have successfully obtained the data we need. The result shows that the largest number of taxi orders came from Manhattan, with an average order distance of only about 2.4 miles, which is reasonable considering Manhattan's small area and dense population; the average distance of orders departing from the Bronx is 33 miles, much higher than in any other borough, probably due to the Bronx's remote location.
+
+As with Tableau and Excel, the MDX statement here can directly use the metrics 
defined in Kylin and MDX for Kylin. Users can do further analysis of the data 
with their own data analysis platform.
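Under the hood, olap4j speaks XMLA (SOAP over HTTP) to the `/mdx/xmla/covid_trip_project` endpoint. As a rough illustration of what goes over the wire, the sketch below builds a generic XMLA `Execute` envelope around an MDX statement; the envelope shape is standard XMLA, not copied from the demo, and the exact properties MDX for Kylin expects may differ:

```python
# Build a generic XMLA "Execute" SOAP envelope around an MDX statement.
# Illustrative only: standard XMLA shape, not MDX for Kylin's exact contract.
from xml.sax.saxutils import escape

def xmla_execute_envelope(mdx: str, catalog: str) -> str:
    return f"""<soap:Envelope xmlns:soap="http://schemas.xmlsoap.org/soap/envelope/">
  <soap:Body>
    <Execute xmlns="urn:schemas-microsoft-com:xml-analysis">
      <Command><Statement>{escape(mdx)}</Statement></Command>
      <Properties><PropertyList>
        <Catalog>{escape(catalog)}</Catalog>
      </PropertyList></Properties>
    </Execute>
  </soap:Body>
</soap:Envelope>"""

mdx = "SELECT {[Measures].[ORDER_COUNT]} ON COLUMNS FROM [covid_trip_dataset]"
envelope = xmla_execute_envelope(mdx, "covid_trip_dataset")
# An XMLA client would POST this envelope to
# http://${kylin_node_public_ip}:7080/mdx/xmla/covid_trip_project
# with Content-Type: text/xml.
```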
+
+### Unified data definition
+
+We have demonstrated 3 ways to work with Kylin + MDX for Kylin, from which we can see that with the help of the Kylin multidimensional database and the MDX for Kylin semantic layer, no matter which data analytics system you are using, you can always work with the same data models and business metrics and enjoy the advantages brought by unified semantics.
+
+## Delete clusters
+
+### Delete query cluster
+
+After the analysis, we can delete the query cluster. If you also want to delete the RDS metadata database, the monitor node, and the VPC used by Kylin and MDX for Kylin, execute the following cluster destroy command:
+
+```
+python deploy.py --type destroy-all
+```
+
+### Check AWS resources
+
+After all cluster resources are deleted, there should be no Kylin deployment 
tool-related Stack on `CloudFormation`. If you also want to delete the 
deployment-related files and data from S3, you can manually delete the 
following folders under the S3 working directory:
+
+![](/images/blog/kylin4_on_cloud/39_check_s3_demo.png)
+
+## Summary
+
+You only need an AWS account to follow the steps in this tutorial to explore 
our Kylin deployment tool on the Cloud. Kylin + MDX for Kylin, with our 
pre-computation technology, multi-dimensional models, and basic metrics 
management capabilities, enables users to build a big data analysis platform on 
the cloud in a convenient way. In addition, we also support seamless connection 
to mainstream BI tools, helping our users to better leverage their data with 
higher efficiency and the lowest TCO.
