Repository: kylin
Updated Branches:
  refs/heads/document 4037fef89 -> bd23d7f25
Add How to optimize cube build

Project: http://git-wip-us.apache.org/repos/asf/kylin/repo
Commit: http://git-wip-us.apache.org/repos/asf/kylin/commit/bd23d7f2
Tree: http://git-wip-us.apache.org/repos/asf/kylin/tree/bd23d7f2
Diff: http://git-wip-us.apache.org/repos/asf/kylin/diff/bd23d7f2

Branch: refs/heads/document
Commit: bd23d7f257b88abcdbf1013c22c42d3f26c86638
Parents: 4037fef
Author: shaofengshi <shaofeng...@apache.org>
Authored: Wed Jan 25 16:28:34 2017 +0800
Committer: shaofengshi <shaofeng...@apache.org>
Committed: Wed Jan 25 16:28:34 2017 +0800

----------------------------------------------------------------------
 website/_data/docs16.yml                      |   1 +
 website/_docs16/howto/howto_ldap_and_sso.md   |  13 +-
 website/_docs16/howto/howto_optimize_build.md | 186 +++++++++++++++++++
 website/_docs16/howto/howto_optimize_cubes.md |   2 +-
 .../howto/howto_use_distributed_scheduler.md  |   2 +-
 website/_docs16/tutorial/cube_streaming.md    |  10 +-
 6 files changed, 204 insertions(+), 10 deletions(-)
----------------------------------------------------------------------


http://git-wip-us.apache.org/repos/asf/kylin/blob/bd23d7f2/website/_data/docs16.yml
----------------------------------------------------------------------
diff --git a/website/_data/docs16.yml b/website/_data/docs16.yml
index 25be101..22f9141 100644
--- a/website/_data/docs16.yml
+++ b/website/_data/docs16.yml
@@ -52,6 +52,7 @@
 - howto/howto_use_restapi_in_js
 - howto/howto_use_restapi
 - howto/howto_optimize_cubes
+- howto/howto_optimize_build
 - howto/howto_backup_metadata
 - howto/howto_cleanup_storage
 - howto/howto_jdbc


http://git-wip-us.apache.org/repos/asf/kylin/blob/bd23d7f2/website/_docs16/howto/howto_ldap_and_sso.md
----------------------------------------------------------------------
diff --git a/website/_docs16/howto/howto_ldap_and_sso.md b/website/_docs16/howto/howto_ldap_and_sso.md
index d8988dc..0a377b1 100644
--- a/website/_docs16/howto/howto_ldap_and_sso.md
+++ b/website/_docs16/howto/howto_ldap_and_sso.md
@@ -11,19 +11,26 @@ Kylin supports LDAP authentication for enterprise or production deployment; This

 #### Configure LDAP server info

-Firstly, provide LDAP URL, and username/password if the LDAP server is secured; The password in kylin.properties need be salted; You can run "org.apache.kylin.rest.security.PasswordPlaceholderConfigurer AES your_password" to get a hash.
+Firstly, provide the LDAP URL, and a username/password if the LDAP server is secured; The password in kylin.properties needs to be encrypted; You can run the following command to get the encrypted value (please note, the password's length should be less than 16 characters, see [KYLIN-2416](https://issues.apache.org/jira/browse/KYLIN-2416)):
+
+```
+cd $KYLIN_HOME/tomcat/webapps/kylin/WEB-INF/lib
+java -classpath kylin-server-base-1.6.0.jar:spring-beans-3.2.17.RELEASE.jar:spring-core-3.2.17.RELEASE.jar:commons-codec-1.7.jar org.apache.kylin.rest.security.PasswordPlaceholderConfigurer AES <your_password>
+```
+
+Configure them in conf/kylin.properties:

 ```
 ldap.server=ldap://<your_ldap_host>:<port>
 ldap.username=<your_user_name>
-ldap.password=<your_password_hash>
+ldap.password=<your_password_encrypted>
 ```

 Secondly, provide the user search patterns, this is by LDAP design, here is just a sample:

 ```
 ldap.user.searchBase=OU=UserAccounts,DC=mycompany,DC=com
-ldap.user.searchPattern=(&(AccountName={0})(memberOf=CN=MYCOMPANY-USERS,DC=mycompany,DC=com))
+ldap.user.searchPattern=(&(cn={0})(memberOf=CN=MYCOMPANY-USERS,DC=mycompany,DC=com))
 ldap.user.groupSearchBase=OU=Group,DC=mycompany,DC=com
 ```


http://git-wip-us.apache.org/repos/asf/kylin/blob/bd23d7f2/website/_docs16/howto/howto_optimize_build.md
----------------------------------------------------------------------
diff --git a/website/_docs16/howto/howto_optimize_build.md b/website/_docs16/howto/howto_optimize_build.md
new file mode 100644
index 0000000..68becd5
--- /dev/null
+++ b/website/_docs16/howto/howto_optimize_build.md
@@ -0,0 +1,186 @@
+---
+layout: docs16
+title: Optimize Cube Build
+categories: howto
+permalink: /docs16/howto/howto_optimize_build.html
+---
+
+Kylin decomposes a Cube build task into several steps and executes them in sequence. These steps include Hive operations, MapReduce jobs, and other types of jobs. When you have many Cubes to build daily, you will definitely want to speed up this process. Here are some practices that you probably want to know; they are organized in the same order as the build steps.
+
+
+
+## Create Intermediate Flat Hive Table
+
+This step extracts data from the source Hive tables (with all tables joined) and inserts it into an intermediate flat table. If the Cube is partitioned, Kylin adds a time condition so that only the data in the range is fetched. You can check the related Hive command in the log of this step, e.g.:
+
+```
+hive -e "USE default;
+DROP TABLE IF EXISTS kylin_intermediate_airline_cube_v3610f668a3cdb437e8373c034430f6c34;
+
+CREATE EXTERNAL TABLE IF NOT EXISTS kylin_intermediate_airline_cube_v3610f668a3cdb437e8373c034430f6c34
+(AIRLINE_FLIGHTDATE date,AIRLINE_YEAR int,AIRLINE_QUARTER int,...,AIRLINE_ARRDELAYMINUTES int)
+STORED AS SEQUENCEFILE
+LOCATION 'hdfs:///kylin/kylin200instance/kylin-0a8d71e8-df77-495f-b501-03c06f785b6c/kylin_intermediate_airline_cube_v3610f668a3cdb437e8373c034430f6c34';
+
+SET dfs.replication=2;
+SET hive.exec.compress.output=true;
+SET hive.auto.convert.join.noconditionaltask=true;
+SET hive.auto.convert.join.noconditionaltask.size=100000000;
+SET mapreduce.job.split.metainfo.maxsize=-1;
+
+INSERT OVERWRITE TABLE kylin_intermediate_airline_cube_v3610f668a3cdb437e8373c034430f6c34 SELECT
+AIRLINE.FLIGHTDATE
+,AIRLINE.YEAR
+,AIRLINE.QUARTER
+,...
+,AIRLINE.ARRDELAYMINUTES
+FROM AIRLINE.AIRLINE as AIRLINE
+WHERE (AIRLINE.FLIGHTDATE >= '1987-10-01' AND AIRLINE.FLIGHTDATE < '2017-01-01');
+"
+
+```
+
+Kylin applies the configuration in conf/kylin\_hive\_conf.xml while the Hive commands are running, for instance, using less replication and enabling Hive's mapper-side join. If needed, you can add other configurations that suit your cluster.
+
+If the Cube's partition column ("FLIGHTDATE" in this case) is the same as the Hive table's partition column, filtering on it lets Hive smartly skip the non-matched partitions. So it is highly recommended to use the Hive table's partition column (if it is a date column) as the Cube's partition column. This is almost a requirement for very large tables; otherwise Hive has to scan all the files each time in this step, which takes a terribly long time.
+
+If file merge is enabled in your Hive, you can disable it in "conf/kylin\_hive\_conf.xml", as Kylin has its own way to merge files (in the next step):
+
+    <property>
+        <name>hive.merge.mapfiles</name>
+        <value>false</value>
+        <description>Disable Hive's auto merge</description>
+    </property>
+
+
+## Redistribute intermediate table
+
+After the previous step, Hive generates the data files in an HDFS folder: some files are large while some are small or even empty. This imbalanced file distribution leads the subsequent MR jobs to be imbalanced as well: some mappers finish quickly, while others are very slow. To balance them, Kylin adds this step to "redistribute" the data; here is a sample output:
+
+```
+total input rows = 159869711
+expected input rows per mapper = 1000000
+num reducers for RedistributeFlatHiveTableStep = 160
+
+```
+
+
+The command to redistribute the table:
+
+```
+hive -e "USE default;
+SET dfs.replication=2;
+SET hive.exec.compress.output=true;
+SET hive.auto.convert.join.noconditionaltask=true;
+SET hive.auto.convert.join.noconditionaltask.size=100000000;
+SET mapreduce.job.split.metainfo.maxsize=-1;
+set mapreduce.job.reduces=160;
+set hive.merge.mapredfiles=false;
+
+INSERT OVERWRITE TABLE kylin_intermediate_airline_cube_v3610f668a3cdb437e8373c034430f6c34 SELECT * FROM kylin_intermediate_airline_cube_v3610f668a3cdb437e8373c034430f6c34 DISTRIBUTE BY RAND();
+"
+
+```
+
+
+
+Firstly, Kylin gets the row count of this intermediate table; then, based on the row count, it calculates the number of files needed to redistribute the data. By default, Kylin allocates one file per 1 million rows. In this sample, there are about 160 million rows, so 160 reducers are started, and each reducer writes 1 file. In the following MR steps over this table, Hadoop will start as many mappers as there are files to process (usually 1 million rows' data is smaller than one HDFS block). If your daily data scale isn't so large, or your Hadoop cluster has enough resources, you may want more concurrency. Setting "kylin.job.mapreduce.mapper.input.rows" in "conf/kylin.properties" to a smaller value will achieve that, e.g.:
+
+`kylin.job.mapreduce.mapper.input.rows=500000`
+
+
+Secondly, Kylin runs an *"INSERT OVERWRITE TABLE .... DISTRIBUTE BY "* HiveQL statement to distribute the rows among the specified number of reducers.
+
+In most cases, Kylin asks Hive to distribute the rows randomly among the reducers, which yields files that are very close in size; in that case the distribute clause is "DISTRIBUTE BY RAND()".
+
+If your Cube has a "shard by" dimension specified (in the Cube's "Advanced setting" page), which is a high-cardinality column (like "USER\_ID"), Kylin will ask Hive to redistribute the data by that column's value.
Rows that have the same value in that column will then go to the same file. This is much better than random distribution, because the data is not only redistributed but also pre-categorized at no additional cost, which benefits the subsequent Cube build process. In a typical scenario, this optimization can cut the build time by 40%. In this case the distribute clause will be "DISTRIBUTE BY USER_ID".
+
+**Please note:** 1) The "shard by" column should be a high-cardinality dimension column, and it should appear in many cuboids (not just a few). Distributing by it properly yields an even distribution in every time range; otherwise it causes data skew, which reduces the build speed. Typical good candidates are "USER\_ID", "SELLER\_ID", "PRODUCT", "CELL\_NUMBER" and so forth, whose cardinality is higher than one thousand (which should be much more than the number of reducers). 2) Using "shard by" has another advantage in Cube storage, but that is outside this doc's scope.
+
+
+
+## Extract Fact Table Distinct Columns
+
+In this step Kylin runs an MR job to fetch the distinct values of the dimensions that use dictionary encoding.
+
+Actually this step does more: it also collects the Cube statistics by using HyperLogLog counters to estimate the row count of each Cuboid. If you find that the mappers work incredibly slowly, it usually indicates that the Cube design is too complex; please check [optimize cube design](/docs16/howto/howto_optimize_cubes.html) to make the Cube thinner. If the reducers get an OutOfMemory error, it indicates that the Cuboid combinations explode, or that the default YARN memory allocation cannot meet the demand. If this step can't finish in a reasonable time, you should give up and revisit the design, as the real building will take even longer.
+
+You can reduce the sampling percentage (kylin.job.cubing.inmem.sampling.percent in kylin.properties) to accelerate this step, but it may not help much and it impacts the accuracy of the Cube statistics, so we don't recommend it.
+
+
+
+## Build Dimension Dictionary
+
+With the distinct values fetched in the previous step, Kylin builds the dictionaries in memory (in a future version this will be moved to MR). Usually this step is fast, but if the value set is large, Kylin may report an error like "Too high cardinality is not suitable for dictionary". For such ultra-high-cardinality (UHC) columns, please use another encoding method, such as "fixed_length" or "integer".
+
+
+
+## Save Cuboid Statistics and Create HTable
+
+These two steps are lightweight and fast.
+
+
+
+## Build Base Cuboid
+
+This step builds the base cuboid from the intermediate table. The mapper number equals the reducer number of step 2; the reducer number is estimated from the cube statistics: by default, one reducer per 500 MB of output. If you observe that the reducer number is small, you can set "kylin.job.mapreduce.default.reduce.input.mb" in kylin.properties to a smaller value to get more resources, e.g.: `kylin.job.mapreduce.default.reduce.input.mb=200`
+
+
+## Build N-Dimension Cuboid
+
+These steps are the "by-layer" cubing process; each step uses the output of the previous step as its input, and each is similar to the "build base cuboid" step. Usually the building is slow from the N-D layer down to the (N/2)-D layer, because this is the cuboid explosion process: the N-D layer has 1 Cuboid, the (N-1)-D layer has N cuboids, the (N-2)-D layer has N*(N-1)/2 cuboids, and so on (for example, with N=10 dimensions the layers hold 1, 10, 45, 120, 210, 252, ... cuboids). After the (N/2)-D step, the building gets faster gradually.
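+
+As a recap of the tuning knobs mentioned in the steps above, here is a sketch of how they could look together in "conf/kylin.properties". The property names are the ones referenced on this page; the values are illustrative assumptions for a cluster with spare capacity, not recommendations:
+
+```
+# one mapper per 0.5 million rows instead of the default 1 million
+kylin.job.mapreduce.mapper.input.rows=500000
+# statistics sampling percentage for "Extract Fact Table Distinct Columns" (default: 100)
+kylin.job.cubing.inmem.sampling.percent=100
+# one reducer per 200 MB of cuboid output instead of the default 500 MB
+kylin.job.mapreduce.default.reduce.input.mb=200
+```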
+
+
+## Build Cube
+
+This step uses a new algorithm to build the Cube: "by-split" cubing (also called "in-mem" cubing). It uses a single round of MR to calculate all cuboids, but it requires more memory than normal. The "conf/kylin\_job\_conf\_inmem.xml" file is dedicated to this step. By default it requests 3GB memory for each mapper. If your cluster has enough memory, you can allocate more in "conf/kylin\_job\_conf\_inmem.xml", so it will use as much memory as possible to hold the data and gain better performance, e.g.:
+
+    <property>
+        <name>mapreduce.map.memory.mb</name>
+        <value>6144</value>
+        <description></description>
+    </property>
+
+    <property>
+        <name>mapreduce.map.java.opts</name>
+        <value>-Xmx5632m</value>
+        <description></description>
+    </property>
+
+
+Please note, Kylin automatically selects the best algorithm based on the data distribution (gathered in the Cube statistics). The steps of the algorithm that is not selected will be skipped. You don't need to select the algorithm explicitly.
+
+
+
+## Convert Cuboid Data to HFile
+
+This step starts an MR job to convert the Cuboid files (sequence file format) into HBase's HFile format. Kylin calculates the HBase region number from the Cube statistics, by default 1 region per 5GB. The more regions there are, the more reducers will be utilized. If you observe that the reducer number is small and performance is poor, you can set the following parameters in "conf/kylin.properties" to smaller values, as follows:
+
+```
+kylin.hbase.region.cut=2
+kylin.hbase.hfile.size.gb=1
+```
+
+If you're not sure what size a region should be, contact your HBase administrator.
+
+
+## Load HFile to HBase Table
+
+This step uses the HBase API to load the HFiles to the region servers; it is lightweight and fast.
+
+
+
+## Update Cube Info
+
+After loading data into HBase, Kylin marks this Cube segment as ready in the metadata. This step is very fast.
+
+
+
+## Cleanup
+
+Drop the intermediate table from Hive. This step doesn't block anything, as the segment has been marked ready in the previous step. If this step errors out, no need to worry: the garbage can be collected later when Kylin executes the [StorageCleanupJob](/docs16/howto/howto_cleanup_storage.html); a sample invocation is sketched at the end of this page.
+
+
+## Summary
+There are many other methods to boost the performance as well. If you have practices to share, you are welcome to discuss them at [d...@kylin.apache.org](mailto:d...@kylin.apache.org).
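+
+As a footnote to the "Cleanup" step: the leftover garbage can be removed later by running the StorageCleanupJob mentioned above. Below is a sketch of the invocation; the exact class name and flags may vary across Kylin versions, so check the [cleanup storage](/docs16/howto/howto_cleanup_storage.html) page for your release first:
+
+```
+# first run with "--delete false" to only list the unused intermediate tables, HDFS files and HTables
+${KYLIN_HOME}/bin/kylin.sh org.apache.kylin.storage.hbase.util.StorageCleanupJob --delete false
+# then run again with "--delete true" to actually drop them
+${KYLIN_HOME}/bin/kylin.sh org.apache.kylin.storage.hbase.util.StorageCleanupJob --delete true
+```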
\ No newline at end of file


http://git-wip-us.apache.org/repos/asf/kylin/blob/bd23d7f2/website/_docs16/howto/howto_optimize_cubes.md
----------------------------------------------------------------------
diff --git a/website/_docs16/howto/howto_optimize_cubes.md b/website/_docs16/howto/howto_optimize_cubes.md
index fbc3586..17e7fb8 100644
--- a/website/_docs16/howto/howto_optimize_cubes.md
+++ b/website/_docs16/howto/howto_optimize_cubes.md
@@ -1,6 +1,6 @@
 ---
 layout: docs16
-title: Optimize Cube
+title: Optimize Cube Design
 categories: howto
 permalink: /docs16/howto/howto_optimize_cubes.html
 ---


http://git-wip-us.apache.org/repos/asf/kylin/blob/bd23d7f2/website/_docs16/howto/howto_use_distributed_scheduler.md
----------------------------------------------------------------------
diff --git a/website/_docs16/howto/howto_use_distributed_scheduler.md b/website/_docs16/howto/howto_use_distributed_scheduler.md
index 6d92584..01bb097 100644
--- a/website/_docs16/howto/howto_use_distributed_scheduler.md
+++ b/website/_docs16/howto/howto_use_distributed_scheduler.md
@@ -5,7 +5,7 @@ categories: howto
 permalink: /docs16/howto/howto_use_distributed_scheduler.html
 ---

-Since Kylin 1.6.1, Kylin support distributed job scheduler.
+Since Kylin 2.0, Kylin supports distributed job scheduler.
 Which is more extensible, available and reliable than default job scheduler.
 To enable the distributed job scheduler, you need to set or update three configs in the kylin.properties:


http://git-wip-us.apache.org/repos/asf/kylin/blob/bd23d7f2/website/_docs16/tutorial/cube_streaming.md
----------------------------------------------------------------------
diff --git a/website/_docs16/tutorial/cube_streaming.md b/website/_docs16/tutorial/cube_streaming.md
index 4cbd14c..6909545 100644
--- a/website/_docs16/tutorial/cube_streaming.md
+++ b/website/_docs16/tutorial/cube_streaming.md
@@ -14,15 +14,15 @@ In this tutorial, we will use Hortonworks HDP 2.2.4 Sandbox VM + Kafka v0.10.0(S
 ## Install Kafka 0.10.0.0 and Kylin
 Don't use HDP 2.2.4's build-in Kafka as it is too old, stop it first if it is running.
 {% highlight Groff markup %}
-curl -s http://mirrors.tuna.tsinghua.edu.cn/apache/kafka/0.10.0.0/kafka_2.10-0.10.0.0.tgz | tar -xz -C /usr/hdp/current/
+curl -s http://mirrors.tuna.tsinghua.edu.cn/apache/kafka/0.10.0.0/kafka_2.10-0.10.0.0.tgz | tar -xz -C /usr/local/

-cd /usr/hdp/current/kafka_2.10-0.10.0.0/
+cd /usr/local/kafka_2.10-0.10.0.0/
 bin/kafka-server-start.sh config/server.properties &
 {% endhighlight %}

-Download the Kylin v1.6 from download page, expand the tar ball in /root/ folder.
+Download the Kylin v1.6 from download page, expand the tar ball in /usr/local/ folder.

 ## Create sample Kafka topic and populate data

@@ -37,8 +37,8 @@ Created topic "kylindemo".

 Put sample data to this topic; Kylin has an utility class which can do this;
 {% highlight Groff markup %}
-export KAFKA_HOME=/usr/hdp/current/kafka_2.10-0.10.0.0
-export KYLIN_HOME=/root/apache-kylin-1.6.0-SNAPSHOT-bin
+export KAFKA_HOME=/usr/local/kafka_2.10-0.10.0.0
+export KYLIN_HOME=/usr/local/apache-kylin-1.6.0-bin
 cd $KYLIN_HOME
 ./bin/kylin.sh org.apache.kylin.source.kafka.util.KafkaSampleProducer --topic kylindemo --broker localhost:9092