Author: lidong Date: Tue Apr 4 14:00:31 2017 New Revision: 1790118 URL: http://svn.apache.org/viewvc?rev=1790118&view=rev Log: add blog about percentile measure
Added: kylin/site/blog/2017/04/ kylin/site/blog/2017/04/01/ kylin/site/blog/2017/04/01/percentile-measure/ kylin/site/blog/2017/04/01/percentile-measure/index.html kylin/site/images/blog/percentile_1.png (with props) kylin/site/images/blog/percentile_2.png (with props) kylin/site/images/blog/percentile_3.png (with props) Modified: kylin/site/blog/index.html kylin/site/feed.xml Added: kylin/site/blog/2017/04/01/percentile-measure/index.html URL: http://svn.apache.org/viewvc/kylin/site/blog/2017/04/01/percentile-measure/index.html?rev=1790118&view=auto ============================================================================== --- kylin/site/blog/2017/04/01/percentile-measure/index.html (added) +++ kylin/site/blog/2017/04/01/percentile-measure/index.html Tue Apr 4 14:00:31 2017 @@ -0,0 +1,294 @@ +<!-- +* Licensed to the Apache Software Foundation (ASF) under one +* or more contributor license agreements. See the NOTICE file +* distributed with this work for additional information +* regarding copyright ownership. The ASF licenses this file +* to you under the Apache License, Version 2.0 (the +* "License"); you may not use this file except in compliance +* with the License. You may obtain a copy of the License at +* +* http://www.apache.org/licenses/LICENSE-2.0 +* +* Unless required by applicable law or agreed to in writing, software +* distributed under the License is distributed on an "AS IS" BASIS, +* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +* See the License for the specific language governing permissions and +* limitations under the License. +--> +<!doctype html> +<html> + <!-- +* Licensed to the Apache Software Foundation (ASF) under one +* or more contributor license agreements. See the NOTICE file +* distributed with this work for additional information +* regarding copyright ownership. 
The ASF licenses this file +* to you under the Apache License, Version 2.0 (the +* "License"); you may not use this file except in compliance +* with the License. You may obtain a copy of the License at +* +* http://www.apache.org/licenses/LICENSE-2.0 +* +* Unless required by applicable law or agreed to in writing, software +* distributed under the License is distributed on an "AS IS" BASIS, +* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +* See the License for the specific language governing permissions and +* limitations under the License. +--> + +<head> + <meta charset="utf-8"> + <meta http-equiv="X-UA-Compatible" content="IE=edge"> + <meta name="viewport" content="width=device-width, initial-scale=1"> + + <title>Apache Kylin | A new measure for Percentile precalculation</title> + <meta name="description" content="Introduction"> + <meta name="author" content="Apache Kylin"> + <link rel="shortcut icon" href="fav.png" type="image/png"> + + + +<link rel="stylesheet" href="/assets/css/animate.css"> +<!-- Bootstrap --> +<link rel="stylesheet" href="/assets/css/bootstrap.min.css"> + +<!-- Fonts --> +<!-- <link rel="stylesheet" href="http://fonts.googleapis.com/css?family=Alice|Open+Sans:400,300,700"> --> + +<!-- Icons --> +<link rel="stylesheet" href="/assets/css/font-awesome.min.css"> + + <!-- Custom styles --> + <link rel="stylesheet" href="/assets/css/styles.css"> + <link rel="stylesheet" href="/assets/css/docs.css"> + <link rel="stylesheet" href="/assets/css/pygments.css"> + + <link rel="canonical" href="http://kylin.apache.org/blog/2017/04/01/percentile-measure/"> + <link rel="alternate" type="application/rss+xml" title="Apache Kylin" href="http://kylin.apache.org/feed.xml" /> + +<!--[if lt IE 9]> <script src="assets/js/html5shiv.js"></script> <![endif]--> +<script> + (function(i,s,o,g,r,a,m){i['GoogleAnalyticsObject']=r;i[r]=i[r]||function(){ + (i[r].q=i[r].q||[]).push(arguments)},i[r].l=1*new Date();a=s.createElement(o), + 
m=s.getElementsByTagName(o)[0];a.async=1;a.src=g;m.parentNode.insertBefore(a,m) + })(window,document,'script','//www.google-analytics.com/analytics.js','ga'); + + //oringal tracker for kylin.io + ga('create', 'UA-55534813-1', 'auto'); + //new tracker for kylin.apache.org + ga('create', 'UA-55534813-2', 'auto', {'name':'toplevel'}); + + ga('send', 'pageview'); + ga('toplevel.send', 'pageview'); + + +</script> +<script type="text/javascript" src="/assets/js/jquery-1.9.1.min.js"></script> +<script type="text/javascript" src="/assets/js/nside.js"></script> </script> +<script type="text/javascript" src="/assets/js/nnav.js"></script> </script> +</head> + + <body> + <!-- +* Licensed to the Apache Software Foundation (ASF) under one +* or more contributor license agreements. See the NOTICE file +* distributed with this work for additional information +* regarding copyright ownership. The ASF licenses this file +* to you under the Apache License, Version 2.0 (the +* "License"); you may not use this file except in compliance +* with the License. You may obtain a copy of the License at +* +* http://www.apache.org/licenses/LICENSE-2.0 +* +* Unless required by applicable law or agreed to in writing, software +* distributed under the License is distributed on an "AS IS" BASIS, +* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +* See the License for the specific language governing permissions and +* limitations under the License. 
+--> + +<header id="header" > + + <div id="head" class="parallax" parallax-speed="3" > + <div id="logo" class="text-center"> <img class="img-circle" id="circlelogo" src="/assets/images/kylin_logo.jpg"> <span class="title" >Apache Kylin™</span> <span class="tagline">Extreme OLAP Engine for Big Data</span> + </div> + <div class="text-center" style=" + position: relative; + top: 66px; + width: 1080px; + margin: 0 auto; + z-index: 11; + margin-top: -253px; + text-align: right;" + > + <a href="http://apache.org/foundation/contributing.html" title="Support Apache" style="margin-left: 150px;"> + <img src="https://www.apache.org/images/SupportApache-small.png" style="height: 150px; width: 150px;"> + </a> + </div> + </div> + + + <!-- Main Menu --> + <nav class="navbar navbar-default" role="navigation" id="nav-wrapper"> + <div class="container-fluid" id="nav"> + <!-- + <img class="img-circle" width="40px" height="40px" id="circlelogo" src="/assets/images/kylin_logo.jpg"> + --> + <!-- Brand and toggle get grouped for better mobile display --> + <div class="navbar-header"> + <button type="button" class="navbar-toggle collapsed" data-toggle="collapse" data-target="#bs-example-navbar-collapse-1"> + <span class="sr-only">Toggle navigation</span> + <span class="icon-bar"></span> + <span class="icon-bar"></span> + <span class="icon-bar"></span> + </button> + + </div> + + <!-- Collect the nav links, forms, and other content for toggling --> + <div class="collapse navbar-collapse" id="bs-example-navbar-collapse-1"> + <ul class="nav navbar-nav"> + <li><a href="/">Home</a></li> + <li><a href="/docs20" >Docs</a></li> + <li><a href="/download">Download</li> + <li><a href="/community" >Community</a></li> + <li><a href="/development" >Development</a></li> + <li><a href="/blog">Blog</li> + <li><a href="/cn" >中文版</a></li> + <li><a href="https://twitter.com/apachekylin" target="_blank" class="fa fa-twitter fa-lg" title="Twitter: @ApacheKylin" ></a></li> + <li><a 
href="https://github.com/apache/kylin" target="_blank" class="fa fa-github-alt fa-lg" title="Github: apache/kylin" ></a></li> + <li><a href="https://www.facebook.com/kylinio" target="_blank" class="fa fa-facebook fa-lg" title="Facebook: kylin.io" ></a></li> + </ul> + </div><!-- /.navbar-collapse --> + </div><!-- /.container-fluid --> +</nav> + </header> + + <div class="page-content"> + <header style=" padding:2em 0 0 0"> + <div class="container" > + <h4 class="section-title"><span>Apache Kylin™ Technical Blog</span></h4> + </div> + </div> + + <div class="container"> + <div> + <article class="post-content" > + <!-- +* Licensed to the Apache Software Foundation (ASF) under one +* or more contributor license agreements. See the NOTICE file +* distributed with this work for additional information +* regarding copyright ownership. The ASF licenses this file +* to you under the Apache License, Version 2.0 (the +* "License"); you may not use this file except in compliance +* with the License. You may obtain a copy of the License at +* +* http://www.apache.org/licenses/LICENSE-2.0 +* +* Unless required by applicable law or agreed to in writing, software +* distributed under the License is distributed on an "AS IS" BASIS, +* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +* See the License for the specific language governing permissions and +* limitations under the License. +--> + +<div class="post" style=" padding:2em 4em 4em 4em"> + + <header class="post-header"> + <h1 class="post-title">A new measure for Percentile precalculation</h1> + <p class="post-meta" >Apr 1, 2017 • Dong Li</p> + </header> + + <article class="post-content" > + <h2 id="introduction">Introduction</h2> + +<p>Since Apache Kylin 2.0, there’s a new measure for percentile precalculation, which aims at (sub-)second latency for <strong>approximate</strong> percentile analytics SQL queries. 
The implementation is based on the <a href="https://github.com/tdunning/t-digest">t-digest</a> library, released under the Apache 2.0 license, which provides a highly efficient data structure for storing aggregation counters and an algorithm for calculating approximate percentile results.</p> + +<h3 id="percentile">Percentile</h3> +<p><em>From <a href="https://en.wikipedia.org/wiki/Percentile">Wikipedia</a></em>: A <strong>percentile</strong> (or a <strong>centile</strong>) is a measure used in statistics indicating the value below which a given percentage of observations in a group of observations fall. For example, the 20th percentile is the value (or score) below which 20% of the observations may be found.</p> + +<p>Apache Kylin supports a SQL syntax similar to Apache Hive’s, with an aggregation function called <strong>percentile(&lt;Number Column&gt;, &lt;Double&gt;)</strong>:</p> + +<div class="highlighter-rouge"><pre class="highlight"><code><span class="k">SELECT</span> <span class="n">seller_id</span><span class="p">,</span> <span class="n">percentile</span><span class="p">(</span><span class="n">price</span><span class="p">,</span> <span class="mi">0</span><span class="p">.</span><span class="mi">5</span><span class="p">)</span> +<span class="k">FROM</span> <span class="n">test_kylin_fact</span> +<span class="k">GROUP</span> <span class="k">BY</span> <span class="n">seller_id</span> +</code></pre> +</div> + +<h3 id="how-to-use">How to use</h3> +<p>If you are new to <em>Cubes</em>, please read the <a href="http://kylin.apache.org/docs20/tutorial/kylin_sample.html">QuickStart</a> first to learn the basics.</p> + +<p>First, add the column as a measure in the data model.</p> + +<p><img src="/images/blog/percentile_1.png" alt="" /></p> + +<p>Second, create a cube and add a PERCENTILE measure.</p> + +<p><img src="/images/blog/percentile_2.png" alt="" /></p> + +<p>Finally, build the cube and run some queries.</p> + +<p><img src="/images/blog/percentile_3.png" alt="" /></p> + + 
</article> + +</div> + + + + + + </article> + </div> + </div> + <!-- +* Licensed to the Apache Software Foundation (ASF) under one +* or more contributor license agreements. See the NOTICE file +* distributed with this work for additional information +* regarding copyright ownership. The ASF licenses this file +* to you under the Apache License, Version 2.0 (the +* "License"); you may not use this file except in compliance +* with the License. You may obtain a copy of the License at +* +* http://www.apache.org/licenses/LICENSE-2.0 +* +* Unless required by applicable law or agreed to in writing, software +* distributed under the License is distributed on an "AS IS" BASIS, +* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +* See the License for the specific language governing permissions and +* limitations under the License. +--> + +<footer id="underfooter"> + <div class="container"> + <div class="row"> + <div class="col-md-12 widget"> + <div class="widget-body" style="text-align:center"> + <a href="http://www.apache.org"> + <img id="asf-logo" alt="Apache Software Foundation" src="/assets/images/feather-small.gif"> + </a> + + <div> + The contents of this website are © 2015 Apache Software Foundation under the terms of the <a + href="http://www.apache.org/licenses/LICENSE-2.0"> Apache License v2 </a>. Apache Kylin and + its logo are trademarks of the Apache Software Foundation. 
+ </div> + + </div> + </div> + </div> + <!-- /row of widgets --> + + </div> + <div></div> + +</footer> + + <script src="/assets/js/jquery-1.9.1.min.js"></script> + <script src="/assets/js/bootstrap.min.js"></script> + <script src="/assets/js/main.js"></script> + </body> +</html> + + + + Modified: kylin/site/blog/index.html URL: http://svn.apache.org/viewvc/kylin/site/blog/index.html?rev=1790118&r1=1790117&r2=1790118&view=diff ============================================================================== --- kylin/site/blog/index.html (original) +++ kylin/site/blog/index.html Tue Apr 4 14:00:31 2017 @@ -187,6 +187,12 @@ <li> <h2 align="left" style="margin:0px"> + <a class="post-link" href="/blog/2017/04/01/percentile-measure/">A new measure for Percentile precalculation</a></h2><div align="left" class="post-meta">posted: Apr 1, 2017</div> + + </li> + + <li> + <h2 align="left" style="margin:0px"> <a class="post-link" href="/blog/2017/02/25/v2.0.0-beta-ready/">Apache Kylin v2.0.0 Beta Announcement</a></h2><div align="left" class="post-meta">posted: Feb 25, 2017</div> </li> @@ -277,13 +283,13 @@ <li> <h2 align="left" style="margin:0px"> - <a class="post-link" href="/cn/blog/2016/05/26/release-v1.5.2/">Apache Kylin v1.5.2 正式发布</a></h2><div align="left" class="post-meta">posted: May 26, 2016</div> + <a class="post-link" href="/blog/2016/05/26/release-v1.5.2/">Apache Kylin v1.5.2 Release Announcement</a></h2><div align="left" class="post-meta">posted: May 26, 2016</div> </li> <li> <h2 align="left" style="margin:0px"> - <a class="post-link" href="/blog/2016/05/26/release-v1.5.2/">Apache Kylin v1.5.2 Release Announcement</a></h2><div align="left" class="post-meta">posted: May 26, 2016</div> + <a class="post-link" href="/cn/blog/2016/05/26/release-v1.5.2/">Apache Kylin v1.5.2 正式发布</a></h2><div align="left" class="post-meta">posted: May 26, 2016</div> </li> @@ -307,13 +313,13 @@ <li> <h2 align="left" style="margin:0px"> - <a class="post-link" 
href="/blog/2016/03/17/release-v1.5.0/">Apache Kylin v1.5.0 Release Announcement</a></h2><div align="left" class="post-meta">posted: Mar 17, 2016</div> + <a class="post-link" href="/cn/blog/2016/03/17/release-v1.5.0/">Apache Kylin v1.5.0 正式发布</a></h2><div align="left" class="post-meta">posted: Mar 17, 2016</div> </li> <li> <h2 align="left" style="margin:0px"> - <a class="post-link" href="/cn/blog/2016/03/17/release-v1.5.0/">Apache Kylin v1.5.0 正式发布</a></h2><div align="left" class="post-meta">posted: Mar 17, 2016</div> + <a class="post-link" href="/blog/2016/03/17/release-v1.5.0/">Apache Kylin v1.5.0 Release Announcement</a></h2><div align="left" class="post-meta">posted: Mar 17, 2016</div> </li> Modified: kylin/site/feed.xml URL: http://svn.apache.org/viewvc/kylin/site/feed.xml?rev=1790118&r1=1790117&r2=1790118&view=diff ============================================================================== --- kylin/site/feed.xml (original) +++ kylin/site/feed.xml Tue Apr 4 14:00:31 2017 @@ -19,11 +19,52 @@ <description>Apache Kylin Home</description> <link>http://kylin.apache.org/</link> <atom:link href="http://kylin.apache.org/feed.xml" rel="self" type="application/rss+xml"/> - <pubDate>Wed, 29 Mar 2017 06:59:03 -0700</pubDate> - <lastBuildDate>Wed, 29 Mar 2017 06:59:03 -0700</lastBuildDate> + <pubDate>Tue, 04 Apr 2017 06:59:04 -0700</pubDate> + <lastBuildDate>Tue, 04 Apr 2017 06:59:04 -0700</lastBuildDate> <generator>Jekyll v2.5.3</generator> <item> + <title>A new measure for Percentile precalculation</title> + <description><h2 id="introduction">Introduction</h2> + +<p>Since Apache Kylin 2.0, there’s a new measure for percentile precalculation, which aims at (sub-)second latency for <strong>approximate</strong> percentile analytics SQL queries. 
The implementation is based on the <a href="https://github.com/tdunning/t-digest">t-digest</a> library, released under the Apache 2.0 license, which provides a highly efficient data structure for storing aggregation counters and an algorithm for calculating approximate percentile results.</p> + +<h3 id="percentile">Percentile</h3> +<p><em>From <a href="https://en.wikipedia.org/wiki/Percentile">Wikipedia</a></em>: A <strong>percentile</strong> (or a <strong>centile</strong>) is a measure used in statistics indicating the value below which a given percentage of observations in a group of observations fall. For example, the 20th percentile is the value (or score) below which 20% of the observations may be found.</p> + +<p>Apache Kylin supports a SQL syntax similar to Apache Hive’s, with an aggregation function called <strong>percentile(&lt;Number Column&gt;, &lt;Double&gt;)</strong>:</p> + +<div class="highlighter-rouge"><pre class="highlight"><code><span class="k">SELECT</span> <span class="n">seller_id</span><span class="p">,</span> <span class="n">percentile</span><span class="p">(</span><span class="n">price</span><span class="p">,</span> <span class="mi">0</span><span class="p">.</span><span class="mi">5</span><span class="p">)</span> +<span class="k">FROM</span> <span class="n">test_kylin_fact</span> +<span class="k">GROUP</span> <span class="k">BY</span> <span class="n">seller_id</span> +</code></pre> +</div> + +<h3 id="how-to-use">How to use</h3> +<p>If you are new to <em>Cubes</em>, please read the <a href="http://kylin.apache.org/docs20/tutorial/kylin_sample.html">QuickStart</a> first to learn the basics.</p> + +<p>First, add the column as a measure in the data model.</p> + +<p><img src="/images/blog/percentile_1.png" alt="" /></p> + +<p>Second, create a cube and add a PERCENTILE measure.</p> + +<p><img src="/images/blog/percentile_2.png" alt="" /></p> + +<p>Finally, build the cube and run some queries.</p> + +<p><img src="/images/blog/percentile_3.png" alt="" 
/></p> +</description> + <pubDate>Sat, 01 Apr 2017 15:22:22 -0700</pubDate> + <link>http://kylin.apache.org/blog/2017/04/01/percentile-measure/</link> + <guid isPermaLink="true">http://kylin.apache.org/blog/2017/04/01/percentile-measure/</guid> + + + <category>blog</category> + + </item> + + <item> <title>Apache Kylin v2.0.0 Beta Announcement</title> <description><p>The Apache Kylin community is pleased to announce the <a href="http://kylin.apache.org/download/">v2.0.0 beta package</a> is ready for download and test.</p> @@ -599,173 +640,6 @@ kylin_sales_cube is a cube name.<br / <category>blog</category> - - </item> - - <item> - <title>Use Count Distinct in Apache Kylin</title> - <description><p>Since v.1.5.3</p> - -<h2 id="background">Background</h2> -<p>Count Distinct is a common measure in OLAP analysis, typically used for metrics such as UV. Apache Kylin offers two kinds of count distinct, approximate and precise, which differ in resource usage and performance.</p> - -<h2 id="approximately-count-distinct">Approximately Count Distinct</h2> -<p>Apache Kylin implements approximate count distinct using the HyperLogLog algorithm, offering several precision levels with error rates from 9.75% to 1.22%. <br /> -The measure result has a theoretical upper limit in size of 2^N bytes. For the maximum precision N=16, the upper limit is 64KB, and the maximum error rate is 1.22%. <br /> -The advantages of this implementation are fast calculation and low storage cost, but it can’t be used when precise results are required.</p> - -<h2 id="precisely-count-distinct">Precisely Count Distinct</h2> -<p>Apache Kylin also implements precise count distinct based on bitmaps. For data of type tiny int(byte), small int(short) and int, the value is projected into the bitmap directly. For data of type long, string and others, the value is encoded as a String into a dict, and the dict id is projected into the bitmap.<br /> -The measure result is the serialized bitmap data, not just the count value. 
This ensures that the result is always correct with any roll-up, even across segments.<br /> -The advantage of this implementation is a precise result with no error, but it needs more storage resources. A single result might be hundreds of MB when the distinct count exceeds millions.</p> - -<h2 id="global-dictionary">Global Dictionary</h2> -<p>Apache Kylin encodes values into a dictionary at the segment level by default. That means one value may be encoded into different IDs in different segments, so the result of count distinct will be incorrect.</p> - -<p>In v1.5.3 we introduced “Global Dictionary”, which ensures that one value is always encoded into the same ID across different segments. Meanwhile, the dictionary capacity has expanded dramatically, supporting up to 2 billion values in one dictionary. It can also be used to replace the default dictionary, which has a limit of 5 million values.</p> - -<p>The current version (v1.5.3) has no GUI for defining a global dictionary yet, so you need to manually edit the cube desc JSON like this:</p> - -<div class="highlighter-rouge"><pre class="highlight"><code>"dictionaries": [ - { - "column": "SUCPAY_USERID", - "reuse": "USER_ID", - "builder": "org.apache.kylin.dict.GlobalDictionaryBuilder" - } -] -</code></pre> -</div> - -<p>The <code class="highlighter-rouge">column</code> is the column to be encoded, and the <code class="highlighter-rouge">builder</code> specifies the dictionary builder; only <code class="highlighter-rouge">org.apache.kylin.dict.GlobalDictionaryBuilder</code> is available for now.<br /> -The <code class="highlighter-rouge">reuse</code> option optimizes the dicts of multiple columns based on one dataset; please refer to the next section, “Example”, for more details.</p> - -<p>Later versions (v1.5.4 or above) provide a GUI for global dictionary definition: the “Advanced Dictionaries” part in the “Advanced Setting” step of the cube designer.</p> - -<p>The global dictionary cannot be used for dimension encoding for now, which means that if one column is used for both 
dimension and count distinct measure in one cube, its dimension encoding should be something other than dict.</p> - -<h2 id="example">Example</h2> -<p>Here’s some example data:</p> - -<table> - <thead> - <tr> - <th style="text-align: left">DT</th> - <th style="text-align: center">USER_ID</th> - <th style="text-align: center">FLAG1</th> - <th style="text-align: center">FLAG2</th> - <th style="text-align: center">USER_ID_FLAG1</th> - <th style="text-align: center">USER_ID_FLAG2</th> - </tr> - </thead> - <tbody> - <tr> - <td style="text-align: left">2016-06-08</td> - <td style="text-align: center">AAA</td> - <td style="text-align: center">1</td> - <td style="text-align: center">1</td> - <td style="text-align: center">AAA</td> - <td style="text-align: center">AAA</td> - </tr> - <tr> - <td style="text-align: left">2016-06-08</td> - <td style="text-align: center">BBB</td> - <td style="text-align: center">1</td> - <td style="text-align: center">1</td> - <td style="text-align: center">BBB</td> - <td style="text-align: center">BBB</td> - </tr> - <tr> - <td style="text-align: left">2016-06-08</td> - <td style="text-align: center">CCC</td> - <td style="text-align: center">0</td> - <td style="text-align: center">1</td> - <td style="text-align: center">NULL</td> - <td style="text-align: center">CCC</td> - </tr> - <tr> - <td style="text-align: left">2016-06-09</td> - <td style="text-align: center">AAA</td> - <td style="text-align: center">0</td> - <td style="text-align: center">1</td> - <td style="text-align: center">NULL</td> - <td style="text-align: center">AAA</td> - </tr> - <tr> - <td style="text-align: left">2016-06-09</td> - <td style="text-align: center">CCC</td> - <td style="text-align: center">1</td> - <td style="text-align: center">0</td> - <td style="text-align: center">CCC</td> - <td style="text-align: center">NULL</td> - </tr> - <tr> - <td style="text-align: left">2016-06-10</td> - <td style="text-align: center">BBB</td> - <td style="text-align: center">0</td> - <td 
style="text-align: center">1</td> - <td style="text-align: center">NULL</td> - <td style="text-align: center">BBB</td> - </tr> - </tbody> -</table> - -<p>There are basic columns <code class="highlighter-rouge">DT</code>, <code class="highlighter-rouge">USER_ID</code>, <code class="highlighter-rouge">FLAG1</code>, <code class="highlighter-rouge">FLAG2</code>, and conditional columns <code class="highlighter-rouge">USER_ID_FLAG1=if(FLAG1=1,USER_ID,null)</code>, <code class="highlighter-rouge">USER_ID_FLAG2=if(FLAG2=1,USER_ID,null)</code>. Suppose the cube is built by day and has 3 segments.</p> - -<p>Without the global dictionary, the precise count distinct within a segment is correct, but the roll-up across segments will be wrong. Here’s an example:</p> - -<div class="highlighter-rouge"><pre class="highlight"><code>select count(distinct user_id_flag1) from table where dt in ('2016-06-08', '2016-06-09') -</code></pre> -</div> -<p>The result is 2 rather than 3. The reason is that the dict in the 2016-06-08 segment is AAA=&gt;1, BBB=&gt;2, and the dict in the 2016-06-09 segment is CCC=&gt;1.<br /> -With the global dictionary configured as below, the dict becomes AAA=&gt;1, BBB=&gt;2, CCC=&gt;3, which will produce the correct result.<br /> -<code class="highlighter-rouge"> -"dictionaries": [ - { - "column": "USER_ID_FLAG1", - "builder": "org.apache.kylin.dict.GlobalDictionaryBuilder" - } -] -</code></p> - -<p>In fact, the data of USER_ID_FLAG1 and USER_ID_FLAG2 are both subsets of the USER_ID dataset, which makes dictionary reuse possible. 
Just encode the USER_ID dataset, and configure USER_ID_FLAG1 and USER_ID_FLAG2 to reuse the USER_ID dict:<br /> -<code class="highlighter-rouge"> -"dictionaries": [ - { - "column": "USER_ID", - "builder": "org.apache.kylin.dict.GlobalDictionaryBuilder" - }, - { - "column": "USER_ID_FLAG1", - "reuse": "USER_ID", - "builder": "org.apache.kylin.dict.GlobalDictionaryBuilder" - }, - { - "column": "USER_ID_FLAG2", - "reuse": "USER_ID", - "builder": "org.apache.kylin.dict.GlobalDictionaryBuilder" - } -] -</code></p> - -<h2 id="performance-tunning">Performance Tuning</h2> -<p>When using a global dictionary and the dictionary is large, the step “Build Base Cuboid Data” may take a long time. This is mainly caused by the cost of dictionary cache loading and eviction, since the dictionary size is bigger than the mapper memory size. To solve this problem, override the cube configuration as follows to adjust the mapper memory to 8GB:<br /> -<code class="highlighter-rouge"> -kylin.job.mr.config.override.mapred.map.child.java.opts=-Xmx8g -kylin.job.mr.config.override.mapreduce.map.memory.mb=8500 -</code></p> - -<h2 id="conclusions">Conclusions</h2> -<p>Here are some basic principles for deciding which kind of count distinct to use:<br /> - - If a result with some error rate is acceptable, the approximate way is always the better choice<br /> - - If you need a precise result, the only way is precise count distinct<br /> - - If you don’t need roll-up across segments (like a non-partitioned cube), or the column data type is tinyint/smallint/int, or the value count is less than 5M, just use the default dictionary; otherwise the global dictionary should be configured, and also consider the “reuse” column optimization</p> -</description> - <pubDate>Mon, 01 Aug 2016 11:30:00 -0700</pubDate> - <link>http://kylin.apache.org/blog/2016/08/01/count-distinct-in-kylin/</link> - <guid isPermaLink="true">http://kylin.apache.org/blog/2016/08/01/count-distinct-in-kylin/</guid> - - - <category>blog</category> </item> Added: 
kylin/site/images/blog/percentile_1.png URL: http://svn.apache.org/viewvc/kylin/site/images/blog/percentile_1.png?rev=1790118&view=auto ============================================================================== Binary file - no diff available. Propchange: kylin/site/images/blog/percentile_1.png ------------------------------------------------------------------------------ svn:mime-type = application/octet-stream Added: kylin/site/images/blog/percentile_2.png URL: http://svn.apache.org/viewvc/kylin/site/images/blog/percentile_2.png?rev=1790118&view=auto ============================================================================== Binary file - no diff available. Propchange: kylin/site/images/blog/percentile_2.png ------------------------------------------------------------------------------ svn:mime-type = application/octet-stream Added: kylin/site/images/blog/percentile_3.png URL: http://svn.apache.org/viewvc/kylin/site/images/blog/percentile_3.png?rev=1790118&view=auto ============================================================================== Binary file - no diff available. Propchange: kylin/site/images/blog/percentile_3.png ------------------------------------------------------------------------------ svn:mime-type = application/octet-stream