Repository: kylin
Updated Branches:
  refs/heads/document 4037fef89 -> bd23d7f25
Add How to optimize cube build

Project: http://git-wip-us.apache.org/repos/asf/kylin/repo
Commit: http://git-wip-us.apache.org/repos/asf/kylin/commit/bd23d7f2
Tree: http://git-wip-us.apache.org/repos/asf/kylin/tree/bd23d7f2
Diff: http://git-wip-us.apache.org/repos/asf/kylin/diff/bd23d7f2

Branch: refs/heads/document
Commit: bd23d7f257b88abcdbf1013c22c42d3f26c86638
Parents: 4037fef
Author: shaofengshi <shaofeng...@apache.org>
Authored: Wed Jan 25 16:28:34 2017 +0800
Committer: shaofengshi <shaofeng...@apache.org>
Committed: Wed Jan 25 16:28:34 2017 +0800

----------------------------------------------------------------------
 website/_data/docs16.yml                      |   1 +
 website/_docs16/howto/howto_ldap_and_sso.md   |  13 +-
 website/_docs16/howto/howto_optimize_build.md | 186 +++++++++++++++++++
 website/_docs16/howto/howto_optimize_cubes.md |   2 +-
 .../howto/howto_use_distributed_scheduler.md  |   2 +-
 website/_docs16/tutorial/cube_streaming.md    |  10 +-
 6 files changed, 204 insertions(+), 10 deletions(-)
----------------------------------------------------------------------


http://git-wip-us.apache.org/repos/asf/kylin/blob/bd23d7f2/website/_data/docs16.yml
----------------------------------------------------------------------
diff --git a/website/_data/docs16.yml b/website/_data/docs16.yml
index 25be101..22f9141 100644
--- a/website/_data/docs16.yml
+++ b/website/_data/docs16.yml
@@ -52,6 +52,7 @@
 - howto/howto_use_restapi_in_js
 - howto/howto_use_restapi
 - howto/howto_optimize_cubes
+- howto/howto_optimize_build
 - howto/howto_backup_metadata
 - howto/howto_cleanup_storage
 - howto/howto_jdbc


http://git-wip-us.apache.org/repos/asf/kylin/blob/bd23d7f2/website/_docs16/howto/howto_ldap_and_sso.md
----------------------------------------------------------------------
diff --git a/website/_docs16/howto/howto_ldap_and_sso.md b/website/_docs16/howto/howto_ldap_and_sso.md
index d8988dc..0a377b1 100644
--- a/website/_docs16/howto/howto_ldap_and_sso.md
+++ b/website/_docs16/howto/howto_ldap_and_sso.md
@@ -11,19 +11,26 @@ Kylin supports LDAP authentication for enterprise or production deployment; This

 #### Configure LDAP server info

-Firstly, provide LDAP URL, and username/password if the LDAP server is secured; The password in kylin.properties need be salted; You can run "org.apache.kylin.rest.security.PasswordPlaceholderConfigurer AES your_password" to get a hash.
+Firstly, provide the LDAP URL, and a username/password if the LDAP server is secured; The password in kylin.properties needs to be encrypted; You can run the following command to get the encrypted value (please note, the password's length should be less than 16 characters, see [KYLIN-2416](https://issues.apache.org/jira/browse/KYLIN-2416)):
+
+```
+cd $KYLIN_HOME/tomcat/webapps/kylin/WEB-INF/lib
+java -classpath kylin-server-base-1.6.0.jar:spring-beans-3.2.17.RELEASE.jar:spring-core-3.2.17.RELEASE.jar:commons-codec-1.7.jar org.apache.kylin.rest.security.PasswordPlaceholderConfigurer AES <your_password>
+```
+
+Configure them in conf/kylin.properties:

 ```
 ldap.server=ldap://<your_ldap_host>:<port>
 ldap.username=<your_user_name>
-ldap.password=<your_password_hash>
+ldap.password=<your_password_encrypted>
 ```

 Secondly, provide the user search patterns, this is by LDAP design, here is just a sample:

 ```
 ldap.user.searchBase=OU=UserAccounts,DC=mycompany,DC=com
-ldap.user.searchPattern=(&(AccountName={0})(memberOf=CN=MYCOMPANY-USERS,DC=mycompany,DC=com))
+ldap.user.searchPattern=(&(cn={0})(memberOf=CN=MYCOMPANY-USERS,DC=mycompany,DC=com))
 ldap.user.groupSearchBase=OU=Group,DC=mycompany,DC=com
 ```


http://git-wip-us.apache.org/repos/asf/kylin/blob/bd23d7f2/website/_docs16/howto/howto_optimize_build.md
----------------------------------------------------------------------
diff --git a/website/_docs16/howto/howto_optimize_build.md b/website/_docs16/howto/howto_optimize_build.md
new file mode 100644
index 0000000..68becd5
--- /dev/null
+++ b/website/_docs16/howto/howto_optimize_build.md
@@ -0,0 +1,186 @@
+---
+layout: docs16
+title: Optimize Cube Build
+categories: howto
+permalink: /docs16/howto/howto_optimize_build.html
+---
+
+Kylin decomposes a Cube build task into several steps and executes them in sequence. These steps include Hive operations, MapReduce jobs, and other types of jobs. When you have many Cubes to build daily, you will definitely want to speed up this process. Here are some practices that you probably want to know; they are organized in the same order as the build steps.
+
+
+
+## Create Intermediate Flat Hive Table
+
+This step extracts data from the source Hive tables (with all tables joined) and inserts it into an intermediate flat table. If the Cube is partitioned, Kylin adds a time condition so that only the data in the range is fetched. You can check the related Hive command in the log of this step, e.g.:
+
+```
+hive -e "USE default;
+DROP TABLE IF EXISTS kylin_intermediate_airline_cube_v3610f668a3cdb437e8373c034430f6c34;
+
+CREATE EXTERNAL TABLE IF NOT EXISTS kylin_intermediate_airline_cube_v3610f668a3cdb437e8373c034430f6c34
+(AIRLINE_FLIGHTDATE date,AIRLINE_YEAR int,AIRLINE_QUARTER int,...,AIRLINE_ARRDELAYMINUTES int)
+STORED AS SEQUENCEFILE
+LOCATION 'hdfs:///kylin/kylin200instance/kylin-0a8d71e8-df77-495f-b501-03c06f785b6c/kylin_intermediate_airline_cube_v3610f668a3cdb437e8373c034430f6c34';
+
+SET dfs.replication=2;
+SET hive.exec.compress.output=true;
+SET hive.auto.convert.join.noconditionaltask=true;
+SET hive.auto.convert.join.noconditionaltask.size=100000000;
+SET mapreduce.job.split.metainfo.maxsize=-1;
+
+INSERT OVERWRITE TABLE kylin_intermediate_airline_cube_v3610f668a3cdb437e8373c034430f6c34 SELECT
+AIRLINE.FLIGHTDATE
+,AIRLINE.YEAR
+,AIRLINE.QUARTER
+,...
+,AIRLINE.ARRDELAYMINUTES
+FROM AIRLINE.AIRLINE as AIRLINE
+WHERE (AIRLINE.FLIGHTDATE >= '1987-10-01' AND AIRLINE.FLIGHTDATE < '2017-01-01');
+"
+
+```
+
+Kylin applies the configuration in conf/kylin\_hive\_conf.xml while the Hive commands are running, for instance, using less replication and enabling Hive's mapper-side join. If needed, you can add other configurations that suit your cluster.
+
+If the Cube's partition column ("FLIGHTDATE" in this case) is the same as the Hive table's partition column, filtering on it lets Hive smartly skip the non-matched partitions. So it is highly recommended to use the Hive table's partition column (if it is a date column) as the Cube's partition column. This is almost a requirement for very large tables; otherwise Hive has to scan all the files each time in this step, which takes a terribly long time.
+
+If file merge is enabled in your Hive, you can disable it in "conf/kylin\_hive\_conf.xml", as Kylin has its own way to merge files (in the next step):
+
+    <property>
+        <name>hive.merge.mapfiles</name>
+        <value>false</value>
+        <description>Disable Hive's auto merge</description>
+    </property>
+
+
+## Redistribute intermediate table
+
+After the previous step, Hive generates the data files in an HDFS folder: some files are large while some are small or even empty. This imbalanced file distribution leads the subsequent MR jobs to be imbalanced as well: some mappers finish quickly, while others are very slow. To balance them, Kylin adds this step to "redistribute" the data; here is a sample output:
+
+```
+total input rows = 159869711
+expected input rows per mapper = 1000000
+num reducers for RedistributeFlatHiveTableStep = 160
+
+```
+
+
+The command to redistribute the table:
+
+```
+hive -e "USE default;
+SET dfs.replication=2;
+SET hive.exec.compress.output=true;
+SET hive.auto.convert.join.noconditionaltask=true;
+SET hive.auto.convert.join.noconditionaltask.size=100000000;
+SET mapreduce.job.split.metainfo.maxsize=-1;
+set mapreduce.job.reduces=160;
+set hive.merge.mapredfiles=false;
+
+INSERT OVERWRITE TABLE kylin_intermediate_airline_cube_v3610f668a3cdb437e8373c034430f6c34 SELECT * FROM kylin_intermediate_airline_cube_v3610f668a3cdb437e8373c034430f6c34 DISTRIBUTE BY RAND();
+"
+
+```
+
+
+
+Firstly, Kylin gets the row count of this intermediate table; then, based on the row count, it calculates the number of files needed to redistribute the data. By default, Kylin allocates one file per 1 million rows. In this sample, there are about 160 million rows, so 160 reducers are started, and each reducer writes 1 file. In the following MR steps over this table, Hadoop will start as many mappers as there are files to process (usually 1 million rows' data is smaller than one HDFS block). If your daily data scale isn't so large, or your Hadoop cluster has enough resources, you may want more concurrency. Setting "kylin.job.mapreduce.mapper.input.rows" in "conf/kylin.properties" to a smaller value will achieve that, e.g.:
+
+`kylin.job.mapreduce.mapper.input.rows=500000`
+
+
+Secondly, Kylin runs an *"INSERT OVERWRITE TABLE .... DISTRIBUTE BY "* HiveQL statement to distribute the rows among the specified number of reducers.
+
+In most cases, Kylin asks Hive to distribute the rows randomly among the reducers, which yields files that are very close in size; in that case the distribute clause is "DISTRIBUTE BY RAND()".
+
+If your Cube has a "shard by" dimension specified (in the Cube's "Advanced setting" page), which is a high-cardinality column (like "USER\_ID"), Kylin will ask Hive to redistribute the data by that column's value.
Rows that have the same value in that column will then go to the same file. This is much better than random distribution, because the data is not only redistributed but also pre-categorized at no additional cost, which benefits the subsequent Cube build process. In a typical scenario, this optimization can cut the build time by 40%. In this case the distribute clause will be "DISTRIBUTE BY USER_ID".
+
+**Please note:** 1) The "shard by" column should be a high-cardinality dimension column, and it should appear in many cuboids (not just a few). Distributing by it properly yields an even distribution in every time range; otherwise it causes data skew, which reduces the build speed. Typical good candidates are "USER\_ID", "SELLER\_ID", "PRODUCT", "CELL\_NUMBER" and so forth, whose cardinality is higher than one thousand (which should be much more than the number of reducers). 2) Using "shard by" has another advantage in Cube storage, but that is outside this doc's scope.
+
+
+
+## Extract Fact Table Distinct Columns
+
+In this step Kylin runs an MR job to fetch the distinct values of the dimensions that use dictionary encoding.
+
+Actually this step does more: it also collects the Cube statistics by using HyperLogLog counters to estimate the row count of each Cuboid. If you find that the mappers work incredibly slowly, it usually indicates that the Cube design is too complex; please check [optimize cube design](/docs16/howto/howto_optimize_cubes.html) to make the Cube thinner. If the reducers get an OutOfMemory error, it indicates that the Cuboid combinations explode, or that the default YARN memory allocation cannot meet the demand. If this step can't finish in a reasonable time, you should give up and revisit the design, as the real building will take even longer.
+
+You can reduce the sampling percentage (kylin.job.cubing.inmem.sampling.percent in kylin.properties) to accelerate this step, but it may not help much and it impacts the accuracy of the Cube statistics, so we don't recommend it.
+
+
+
+## Build Dimension Dictionary
+
+With the distinct values fetched in the previous step, Kylin builds the dictionaries in memory (in a future version this will be moved to MR). Usually this step is fast, but if the value set is large, Kylin may report an error like "Too high cardinality is not suitable for dictionary". For such ultra-high-cardinality (UHC) columns, please use another encoding method, such as "fixed_length" or "integer".
+
+
+
+## Save Cuboid Statistics and Create HTable
+
+These two steps are lightweight and fast.
+
+
+
+## Build Base Cuboid
+
+This step builds the base cuboid from the intermediate table. The mapper number equals the reducer number of step 2; the reducer number is estimated from the cube statistics: by default, one reducer per 500 MB of output. If you observe that the reducer number is small, you can set "kylin.job.mapreduce.default.reduce.input.mb" in kylin.properties to a smaller value to get more resources, e.g.: `kylin.job.mapreduce.default.reduce.input.mb=200`
+
+
+## Build N-Dimension Cuboid
+
+These steps are the "by-layer" cubing process; each step uses the output of the previous step as its input, and each is similar to the "build base cuboid" step. Usually the building is slow from the N-D layer down to the (N/2)-D layer, because this is the cuboid explosion process: the N-D layer has 1 Cuboid, the (N-1)-D layer has N cuboids, the (N-2)-D layer has N*(N-1)/2 cuboids, and so on (for example, with N=10 dimensions the layers hold 1, 10, 45, 120, 210, 252, ... cuboids). After the (N/2)-D step, the building gets faster gradually.
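+
+As a recap of the tuning knobs mentioned in the steps above, here is a sketch of how they could look together in "conf/kylin.properties". The property names are the ones referenced on this page; the values are illustrative assumptions for a cluster with spare capacity, not recommendations:
+
+```
+# one mapper per 0.5 million rows instead of the default 1 million
+kylin.job.mapreduce.mapper.input.rows=500000
+# statistics sampling percentage for "Extract Fact Table Distinct Columns" (default: 100)
+kylin.job.cubing.inmem.sampling.percent=100
+# one reducer per 200 MB of cuboid output instead of the default 500 MB
+kylin.job.mapreduce.default.reduce.input.mb=200
+```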
+
+
+## Build Cube
+
+This step uses a new algorithm to build the Cube: "by-split" cubing (also called "in-mem" cubing). It uses a single round of MR to calculate all cuboids, but it requires more memory than normal. The "conf/kylin\_job\_conf\_inmem.xml" file is dedicated to this step. By default it requests 3GB memory for each mapper. If your cluster has enough memory, you can allocate more in "conf/kylin\_job\_conf\_inmem.xml", so it will use as much memory as possible to hold the data and gain better performance, e.g.:
+
+    <property>
+        <name>mapreduce.map.memory.mb</name>
+        <value>6144</value>
+        <description></description>
+    </property>
+
+    <property>
+        <name>mapreduce.map.java.opts</name>
+        <value>-Xmx5632m</value>
+        <description></description>
+    </property>
+
+
+Please note, Kylin automatically selects the best algorithm based on the data distribution (gathered in the Cube statistics). The steps of the algorithm that is not selected will be skipped. You don't need to select the algorithm explicitly.
+
+
+
+## Convert Cuboid Data to HFile
+
+This step starts an MR job to convert the Cuboid files (sequence file format) into HBase's HFile format. Kylin calculates the HBase region number from the Cube statistics, by default 1 region per 5GB. The more regions there are, the more reducers will be utilized. If you observe that the reducer number is small and performance is poor, you can set the following parameters in "conf/kylin.properties" to smaller values, as follows:
+
+```
+kylin.hbase.region.cut=2
+kylin.hbase.hfile.size.gb=1
+```
+
+If you're not sure what size a region should be, contact your HBase administrator.
+
+
+## Load HFile to HBase Table
+
+This step uses the HBase API to load the HFiles to the region servers; it is lightweight and fast.
+
+
+
+## Update Cube Info
+
+After loading data into HBase, Kylin marks this Cube segment as ready in the metadata. This step is very fast.
+
+
+
+## Cleanup
+
+Drop the intermediate table from Hive. This step doesn't block anything, as the segment has been marked ready in the previous step. If this step errors out, no need to worry: the garbage can be collected later when Kylin executes the [StorageCleanupJob](/docs16/howto/howto_cleanup_storage.html); a sample invocation is sketched at the end of this page.
+
+
+## Summary
+There are many other methods to boost the performance as well. If you have practices to share, you are welcome to discuss them at [d...@kylin.apache.org](mailto:d...@kylin.apache.org).
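+
+As a footnote to the "Cleanup" step: the leftover garbage can be removed later by running the StorageCleanupJob mentioned above. Below is a sketch of the invocation; the exact class name and flags may vary across Kylin versions, so check the [cleanup storage](/docs16/howto/howto_cleanup_storage.html) page for your release first:
+
+```
+# first run with "--delete false" to only list the unused intermediate tables, HDFS files and HTables
+${KYLIN_HOME}/bin/kylin.sh org.apache.kylin.storage.hbase.util.StorageCleanupJob --delete false
+# then run again with "--delete true" to actually drop them
+${KYLIN_HOME}/bin/kylin.sh org.apache.kylin.storage.hbase.util.StorageCleanupJob --delete true
+```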
\ No newline at end of file


http://git-wip-us.apache.org/repos/asf/kylin/blob/bd23d7f2/website/_docs16/howto/howto_optimize_cubes.md
----------------------------------------------------------------------
diff --git a/website/_docs16/howto/howto_optimize_cubes.md b/website/_docs16/howto/howto_optimize_cubes.md
index fbc3586..17e7fb8 100644
--- a/website/_docs16/howto/howto_optimize_cubes.md
+++ b/website/_docs16/howto/howto_optimize_cubes.md
@@ -1,6 +1,6 @@
 ---
 layout: docs16
-title: Optimize Cube
+title: Optimize Cube Design
 categories: howto
 permalink: /docs16/howto/howto_optimize_cubes.html
 ---


http://git-wip-us.apache.org/repos/asf/kylin/blob/bd23d7f2/website/_docs16/howto/howto_use_distributed_scheduler.md
----------------------------------------------------------------------
diff --git a/website/_docs16/howto/howto_use_distributed_scheduler.md b/website/_docs16/howto/howto_use_distributed_scheduler.md
index 6d92584..01bb097 100644
--- a/website/_docs16/howto/howto_use_distributed_scheduler.md
+++ b/website/_docs16/howto/howto_use_distributed_scheduler.md
@@ -5,7 +5,7 @@ categories: howto
 permalink: /docs16/howto/howto_use_distributed_scheduler.html
 ---

-Since Kylin 1.6.1, Kylin support distributed job scheduler.
+Since Kylin 2.0, Kylin supports distributed job scheduler.
 Which is more extensible, available and reliable than default job scheduler.
 To enable the distributed job scheduler, you need to set or update three configs in the kylin.properties:


http://git-wip-us.apache.org/repos/asf/kylin/blob/bd23d7f2/website/_docs16/tutorial/cube_streaming.md
----------------------------------------------------------------------
diff --git a/website/_docs16/tutorial/cube_streaming.md b/website/_docs16/tutorial/cube_streaming.md
index 4cbd14c..6909545 100644
--- a/website/_docs16/tutorial/cube_streaming.md
+++ b/website/_docs16/tutorial/cube_streaming.md
@@ -14,15 +14,15 @@ In this tutorial, we will use Hortonworks HDP 2.2.4 Sandbox VM + Kafka v0.10.0(S
 ## Install Kafka 0.10.0.0 and Kylin
 Don't use HDP 2.2.4's build-in Kafka as it is too old, stop it first if it is running.
 {% highlight Groff markup %}
-curl -s http://mirrors.tuna.tsinghua.edu.cn/apache/kafka/0.10.0.0/kafka_2.10-0.10.0.0.tgz | tar -xz -C /usr/hdp/current/
+curl -s http://mirrors.tuna.tsinghua.edu.cn/apache/kafka/0.10.0.0/kafka_2.10-0.10.0.0.tgz | tar -xz -C /usr/local/

-cd /usr/hdp/current/kafka_2.10-0.10.0.0/
+cd /usr/local/kafka_2.10-0.10.0.0/
 bin/kafka-server-start.sh config/server.properties &
 {% endhighlight %}

-Download the Kylin v1.6 from download page, expand the tar ball in /root/ folder.
+Download the Kylin v1.6 from download page, expand the tar ball in /usr/local/ folder.

 ## Create sample Kafka topic and populate data

@@ -37,8 +37,8 @@ Created topic "kylindemo".

 Put sample data to this topic; Kylin has an utility class which can do this;
 {% highlight Groff markup %}
-export KAFKA_HOME=/usr/hdp/current/kafka_2.10-0.10.0.0
-export KYLIN_HOME=/root/apache-kylin-1.6.0-SNAPSHOT-bin
+export KAFKA_HOME=/usr/local/kafka_2.10-0.10.0.0
+export KYLIN_HOME=/usr/local/apache-kylin-1.6.0-bin
 cd $KYLIN_HOME
 ./bin/kylin.sh org.apache.kylin.source.kafka.util.KafkaSampleProducer --topic kylindemo --broker localhost:9092