Author: lidong
Date: Tue Apr  4 14:00:31 2017
New Revision: 1790118

URL: http://svn.apache.org/viewvc?rev=1790118&view=rev
Log:
add blog about percentile measure

Added:
    kylin/site/blog/2017/04/
    kylin/site/blog/2017/04/01/
    kylin/site/blog/2017/04/01/percentile-measure/
    kylin/site/blog/2017/04/01/percentile-measure/index.html
    kylin/site/images/blog/percentile_1.png   (with props)
    kylin/site/images/blog/percentile_2.png   (with props)
    kylin/site/images/blog/percentile_3.png   (with props)
Modified:
    kylin/site/blog/index.html
    kylin/site/feed.xml

Added: kylin/site/blog/2017/04/01/percentile-measure/index.html
URL: 
http://svn.apache.org/viewvc/kylin/site/blog/2017/04/01/percentile-measure/index.html?rev=1790118&view=auto
==============================================================================
--- kylin/site/blog/2017/04/01/percentile-measure/index.html (added)
+++ kylin/site/blog/2017/04/01/percentile-measure/index.html Tue Apr  4 
14:00:31 2017
@@ -0,0 +1,294 @@
+<!--
+* Licensed to the Apache Software Foundation (ASF) under one
+* or more contributor license agreements.  See the NOTICE file
+* distributed with this work for additional information
+* regarding copyright ownership.  The ASF licenses this file
+* to you under the Apache License, Version 2.0 (the
+* "License"); you may not use this file except in compliance
+* with the License.  You may obtain a copy of the License at
+*
+*     http://www.apache.org/licenses/LICENSE-2.0
+*
+* Unless required by applicable law or agreed to in writing, software
+* distributed under the License is distributed on an "AS IS" BASIS,
+* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+* See the License for the specific language governing permissions and
+* limitations under the License.
+-->
+<!doctype html>
+<html>
+       <!--
+* Licensed to the Apache Software Foundation (ASF) under one
+* or more contributor license agreements.  See the NOTICE file
+* distributed with this work for additional information
+* regarding copyright ownership.  The ASF licenses this file
+* to you under the Apache License, Version 2.0 (the
+* "License"); you may not use this file except in compliance
+* with the License.  You may obtain a copy of the License at
+*
+*     http://www.apache.org/licenses/LICENSE-2.0
+*
+* Unless required by applicable law or agreed to in writing, software
+* distributed under the License is distributed on an "AS IS" BASIS,
+* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+* See the License for the specific language governing permissions and
+* limitations under the License.
+-->
+
+<head>
+  <meta charset="utf-8">
+  <meta http-equiv="X-UA-Compatible" content="IE=edge">
+  <meta name="viewport" content="width=device-width, initial-scale=1">
+
+  <title>Apache Kylin | A new measure for Percentile precalculation</title>
+  <meta name="description" content="Introduction">
+  <meta name="author"      content="Apache Kylin">
+  <link rel="shortcut icon" href="fav.png" type="image/png">
+
+
+
+<link rel="stylesheet" href="/assets/css/animate.css">
+<!-- Bootstrap -->
+<link rel="stylesheet" href="/assets/css/bootstrap.min.css">
+
+<!-- Fonts -->
+<!-- <link rel="stylesheet" 
href="http://fonts.googleapis.com/css?family=Alice|Open+Sans:400,300,700"> -->
+
+<!-- Icons -->
+<link rel="stylesheet" href="/assets/css/font-awesome.min.css">
+
+  <!-- Custom styles -->
+  <link rel="stylesheet" href="/assets/css/styles.css">
+  <link rel="stylesheet" href="/assets/css/docs.css">
+  <link rel="stylesheet" href="/assets/css/pygments.css">
+
+  <link rel="canonical" 
href="http://kylin.apache.org/blog/2017/04/01/percentile-measure/">
+  <link rel="alternate" type="application/rss+xml" title="Apache Kylin" 
href="http://kylin.apache.org/feed.xml" />
+
+<!--[if lt IE 9]> <script src="assets/js/html5shiv.js"></script> <![endif]-->
+<script>
+  (function(i,s,o,g,r,a,m){i['GoogleAnalyticsObject']=r;i[r]=i[r]||function(){
+  (i[r].q=i[r].q||[]).push(arguments)},i[r].l=1*new 
Date();a=s.createElement(o),
+  
m=s.getElementsByTagName(o)[0];a.async=1;a.src=g;m.parentNode.insertBefore(a,m)
+  })(window,document,'script','//www.google-analytics.com/analytics.js','ga');
+
+  //original tracker for kylin.io
+  ga('create', 'UA-55534813-1', 'auto');
+  //new tracker for kylin.apache.org
+  ga('create', 'UA-55534813-2', 'auto', {'name':'toplevel'});
+
+  ga('send', 'pageview');
+  ga('toplevel.send', 'pageview');
+
+
+</script>
+<script type="text/javascript" src="/assets/js/jquery-1.9.1.min.js"></script>
+<script type="text/javascript" src="/assets/js/nside.js"></script>
+<script type="text/javascript" src="/assets/js/nnav.js"></script>
+</head>
+
+       <body>
+               <!--
+* Licensed to the Apache Software Foundation (ASF) under one
+* or more contributor license agreements.  See the NOTICE file
+* distributed with this work for additional information
+* regarding copyright ownership.  The ASF licenses this file
+* to you under the Apache License, Version 2.0 (the
+* "License"); you may not use this file except in compliance
+* with the License.  You may obtain a copy of the License at
+*
+*     http://www.apache.org/licenses/LICENSE-2.0
+*
+* Unless required by applicable law or agreed to in writing, software
+* distributed under the License is distributed on an "AS IS" BASIS,
+* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+* See the License for the specific language governing permissions and
+* limitations under the License.
+-->
+
+<header id="header" >
+  
+  <div id="head" class="parallax" parallax-speed="3" >
+    <div id="logo" class="text-center"> <img class="img-circle" 
id="circlelogo" src="/assets/images/kylin_logo.jpg"> <span class="title" 
>Apache Kylin™</span> <span class="tagline">Extreme OLAP Engine for Big 
Data</span> 
+    </div>
+    <div class="text-center" style="
+      position: relative;
+      top: 66px;
+      width: 1080px;
+      margin: 0 auto;
+      z-index: 11;
+      margin-top: -253px;
+      text-align: right;"
+    >
+      <a href="http://apache.org/foundation/contributing.html" title="Support 
Apache" style="margin-left: 150px;">
+          <img src="https://www.apache.org/images/SupportApache-small.png" 
style="height: 150px; width: 150px;">
+      </a>
+    </div>  
+  </div>
+  
+
+  <!-- Main Menu -->
+  <nav class="navbar navbar-default" role="navigation" id="nav-wrapper">
+  <div class="container-fluid" id="nav">
+    <!--
+    <img class="img-circle" width="40px" height="40px" id="circlelogo" 
src="/assets/images/kylin_logo.jpg">
+    -->
+    <!-- Brand and toggle get grouped for better mobile display -->
+    <div class="navbar-header">
+      <button type="button" class="navbar-toggle collapsed" 
data-toggle="collapse" data-target="#bs-example-navbar-collapse-1">
+        <span class="sr-only">Toggle navigation</span>
+        <span class="icon-bar"></span>
+        <span class="icon-bar"></span>
+        <span class="icon-bar"></span>
+      </button>
+     
+    </div>
+
+    <!-- Collect the nav links, forms, and other content for toggling -->
+    <div class="collapse navbar-collapse" id="bs-example-navbar-collapse-1">
+      <ul class="nav navbar-nav">
+     <li><a href="/">Home</a></li>
+          <li><a href="/docs20" >Docs</a></li>
+          <li><a href="/download">Download</a></li>
+          <li><a href="/community" >Community</a></li>
+          <li><a href="/development" >Development</a></li>
+          <li><a href="/blog">Blog</a></li>
+          <li><a href="/cn" >中文版</a></li>  
+          <li><a href="https://twitter.com/apachekylin" target="_blank" 
class="fa fa-twitter fa-lg" title="Twitter: @ApacheKylin" ></a></li>
+          <li><a href="https://github.com/apache/kylin" target="_blank" 
class="fa fa-github-alt fa-lg" title="Github: apache/kylin" ></a></li>          
+          <li><a href="https://www.facebook.com/kylinio" target="_blank" 
class="fa fa-facebook fa-lg" title="Facebook: kylin.io" ></a></li>   
+      </ul>      
+    </div><!-- /.navbar-collapse -->
+  </div><!-- /.container-fluid -->
+</nav>
+ </header>
+
+               <div class="page-content">
+                       <header style=" padding:2em 0 0 0">
+                       <div class="container" >
+                               <h4 class="section-title"><span>Apache Kylin™ 
Technical Blog</span></h4>
+                       </div></header>
+               </div>
+
+               <div class="container">
+                       <div>
+                               <article class="post-content" > 
+                               <!--
+* Licensed to the Apache Software Foundation (ASF) under one
+* or more contributor license agreements.  See the NOTICE file
+* distributed with this work for additional information
+* regarding copyright ownership.  The ASF licenses this file
+* to you under the Apache License, Version 2.0 (the
+* "License"); you may not use this file except in compliance
+* with the License.  You may obtain a copy of the License at
+*
+*     http://www.apache.org/licenses/LICENSE-2.0
+*
+* Unless required by applicable law or agreed to in writing, software
+* distributed under the License is distributed on an "AS IS" BASIS,
+* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+* See the License for the specific language governing permissions and
+* limitations under the License.
+-->
+
+<div class="post" style=" padding:2em 4em 4em 4em">
+
+  <header class="post-header">
+    <h1 class="post-title">A new measure for Percentile precalculation</h1>
+    <p class="post-meta" >Apr 1, 2017 • Dong Li</p>
+  </header>
+
+  <article class="post-content" >
+    <h2 id="introduction">Introduction</h2>
+
+<p>Since Apache Kylin 2.0, there’s a new measure for percentile 
precalculation, which aims at (sub-)second latency for 
<strong>approximate</strong> percentile analytics SQL queries. The 
implementation is based on the <a 
href="https://github.com/tdunning/t-digest">t-digest</a> library, released 
under the Apache 2.0 license, which provides a highly efficient data structure 
for storing aggregation counters and an algorithm for calculating approximate 
percentiles.</p>
+
+<h3 id="percentile">Percentile</h3>
+<p><em>From <a 
href="https://en.wikipedia.org/wiki/Percentile">Wikipedia</a></em>: A 
<strong>percentile</strong> (or a <strong>centile</strong>) is a measure used 
in statistics indicating the value below which a given percentage of 
observations in a group of observations fall. For example, the 20th percentile 
is the value (or score) below which 20% of the observations may be found.</p>
+
+<p>Apache Kylin supports a SQL syntax similar to Apache Hive's, with an 
aggregation function called <strong>percentile(&lt;Number Column&gt;, 
&lt;Double&gt;)</strong>:</p>
+
+<div class="highlighter-rouge"><pre class="highlight"><code><span 
class="k">SELECT</span> <span class="n">seller_id</span><span 
class="p">,</span> <span class="n">percentile</span><span 
class="p">(</span><span class="n">price</span><span class="p">,</span> <span 
class="mi">0</span><span class="p">.</span><span class="mi">5</span><span 
class="p">)</span>
+<span class="k">FROM</span> <span class="n">test_kylin_fact</span>
+<span class="k">GROUP</span> <span class="k">BY</span> <span 
class="n">seller_id</span>
+</code></pre>
+</div>
+
+<h3 id="how-to-use">How to use</h3>
+<p>If you are not familiar with <em>Cubes</em>, please read the <a 
href="http://kylin.apache.org/docs20/tutorial/kylin_sample.html">QuickStart</a> 
first to learn the basics.</p>
+
+<p>Firstly, add the column you want percentiles for as a measure in the data model.</p>
+
+<p><img src="/images/blog/percentile_1.png" alt="" /></p>
+
+<p>Secondly, create a cube and add a PERCENTILE measure.</p>
+
+<p><img src="/images/blog/percentile_2.png" alt="" /></p>
+
+<p>Finally, build the cube and run some queries.</p>
+
+<p><img src="/images/blog/percentile_3.png" alt="" /></p>
+
+  </article>
+
+</div>
+
+
+
+
+
+                               </article>
+                       </div>
+               </div>          
+               <!--
+* Licensed to the Apache Software Foundation (ASF) under one
+* or more contributor license agreements.  See the NOTICE file
+* distributed with this work for additional information
+* regarding copyright ownership.  The ASF licenses this file
+* to you under the Apache License, Version 2.0 (the
+* "License"); you may not use this file except in compliance
+* with the License.  You may obtain a copy of the License at
+*
+*     http://www.apache.org/licenses/LICENSE-2.0
+*
+* Unless required by applicable law or agreed to in writing, software
+* distributed under the License is distributed on an "AS IS" BASIS,
+* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+* See the License for the specific language governing permissions and
+* limitations under the License.
+-->
+
+<footer id="underfooter">
+    <div class="container">
+        <div class="row">
+            <div class="col-md-12 widget">
+                <div class="widget-body" style="text-align:center">
+                    <a href="http://www.apache.org">
+                        <img id="asf-logo" alt="Apache Software Foundation" 
src="/assets/images/feather-small.gif">
+                    </a>
+
+                    <div>
+                        The contents of this website are © 2015 Apache 
Software Foundation under the terms of the <a
+                            href="http://www.apache.org/licenses/LICENSE-2.0"> 
Apache License v2 </a>. Apache Kylin and
+                        its logo are trademarks of the Apache Software 
Foundation.
+                    </div>
+
+                </div>
+            </div>
+        </div>
+        <!-- /row of widgets -->
+
+    </div>
+    <div></div>
+
+</footer>
+
+       <script src="/assets/js/jquery-1.9.1.min.js"></script> 
+       <script src="/assets/js/bootstrap.min.js"></script> 
+       <script src="/assets/js/main.js"></script>
+       </body>
+</html>
+
+
+
+

Modified: kylin/site/blog/index.html
URL: 
http://svn.apache.org/viewvc/kylin/site/blog/index.html?rev=1790118&r1=1790117&r2=1790118&view=diff
==============================================================================
--- kylin/site/blog/index.html (original)
+++ kylin/site/blog/index.html Tue Apr  4 14:00:31 2017
@@ -187,6 +187,12 @@
             
             <li>
         <h2 align="left" style="margin:0px">
+          <a class="post-link" href="/blog/2017/04/01/percentile-measure/">A 
new measure for Percentile precalculation</a></h2><div align="left" 
class="post-meta">posted: Apr 1, 2017</div>
+        
+      </li>
+    
+            <li>
+        <h2 align="left" style="margin:0px">
           <a class="post-link" 
href="/blog/2017/02/25/v2.0.0-beta-ready/">Apache Kylin v2.0.0 Beta 
Announcement</a></h2><div align="left" class="post-meta">posted: Feb 25, 
2017</div>
         
       </li>
@@ -277,13 +283,13 @@
     
             <li>
         <h2 align="left" style="margin:0px">
-          <a class="post-link" 
href="/cn/blog/2016/05/26/release-v1.5.2/">Apache Kylin v1.5.2 
正式发布</a></h2><div align="left" class="post-meta">posted: May 26, 
2016</div>
+          <a class="post-link" href="/blog/2016/05/26/release-v1.5.2/">Apache 
Kylin v1.5.2 Release Announcement</a></h2><div align="left" 
class="post-meta">posted: May 26, 2016</div>
         
       </li>
     
             <li>
         <h2 align="left" style="margin:0px">
-          <a class="post-link" href="/blog/2016/05/26/release-v1.5.2/">Apache 
Kylin v1.5.2 Release Announcement</a></h2><div align="left" 
class="post-meta">posted: May 26, 2016</div>
+          <a class="post-link" 
href="/cn/blog/2016/05/26/release-v1.5.2/">Apache Kylin v1.5.2 
正式发布</a></h2><div align="left" class="post-meta">posted: May 26, 
2016</div>
         
       </li>
     
@@ -307,13 +313,13 @@
     
             <li>
         <h2 align="left" style="margin:0px">
-          <a class="post-link" href="/blog/2016/03/17/release-v1.5.0/">Apache 
Kylin v1.5.0 Release Announcement</a></h2><div align="left" 
class="post-meta">posted: Mar 17, 2016</div>
+          <a class="post-link" 
href="/cn/blog/2016/03/17/release-v1.5.0/">Apache Kylin v1.5.0 
正式发布</a></h2><div align="left" class="post-meta">posted: Mar 17, 
2016</div>
         
       </li>
     
             <li>
         <h2 align="left" style="margin:0px">
-          <a class="post-link" 
href="/cn/blog/2016/03/17/release-v1.5.0/">Apache Kylin v1.5.0 
正式发布</a></h2><div align="left" class="post-meta">posted: Mar 17, 
2016</div>
+          <a class="post-link" href="/blog/2016/03/17/release-v1.5.0/">Apache 
Kylin v1.5.0 Release Announcement</a></h2><div align="left" 
class="post-meta">posted: Mar 17, 2016</div>
         
       </li>
     

Modified: kylin/site/feed.xml
URL: 
http://svn.apache.org/viewvc/kylin/site/feed.xml?rev=1790118&r1=1790117&r2=1790118&view=diff
==============================================================================
--- kylin/site/feed.xml (original)
+++ kylin/site/feed.xml Tue Apr  4 14:00:31 2017
@@ -19,11 +19,52 @@
     <description>Apache Kylin Home</description>
     <link>http://kylin.apache.org/</link>
     <atom:link href="http://kylin.apache.org/feed.xml" rel="self" 
type="application/rss+xml"/>
-    <pubDate>Wed, 29 Mar 2017 06:59:03 -0700</pubDate>
-    <lastBuildDate>Wed, 29 Mar 2017 06:59:03 -0700</lastBuildDate>
+    <pubDate>Tue, 04 Apr 2017 06:59:04 -0700</pubDate>
+    <lastBuildDate>Tue, 04 Apr 2017 06:59:04 -0700</lastBuildDate>
     <generator>Jekyll v2.5.3</generator>
     
       <item>
+        <title>A new measure for Percentile precalculation</title>
+        <description>&lt;h2 
id=&quot;introduction&quot;&gt;Introduction&lt;/h2&gt;
+
+&lt;p&gt;Since Apache Kylin 2.0, there’s a new measure for percentile 
precalculation, which aims at (sub-)second latency for 
&lt;strong&gt;approximate&lt;/strong&gt; percentile analytics SQL queries. The 
implementation is based on the &lt;a 
href=&quot;https://github.com/tdunning/t-digest&quot;&gt;t-digest&lt;/a&gt; 
library, released under the Apache 2.0 license, which provides a highly 
efficient data structure for storing aggregation counters and an algorithm for 
calculating approximate percentiles.&lt;/p&gt;
+
+&lt;h3 id=&quot;percentile&quot;&gt;Percentile&lt;/h3&gt;
+&lt;p&gt;&lt;em&gt;From &lt;a 
href=&quot;https://en.wikipedia.org/wiki/Percentile&quot;&gt;Wikipedia&lt;/a&gt;&lt;/em&gt;:
 A &lt;strong&gt;percentile&lt;/strong&gt; (or a 
&lt;strong&gt;centile&lt;/strong&gt;) is a measure used in statistics 
indicating the value below which a given percentage of observations in a 
group of observations fall. For example, the 20th percentile is the value (or 
score) below which 20% of the observations may be found.&lt;/p&gt;
+
+&lt;p&gt;Apache Kylin supports a SQL syntax similar to Apache Hive's, 
with an aggregation function called &lt;strong&gt;percentile(&amp;lt;Number 
Column&amp;gt;, &amp;lt;Double&amp;gt;)&lt;/strong&gt;:&lt;/p&gt;
+
+&lt;div class=&quot;highlighter-rouge&quot;&gt;&lt;pre 
class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span 
class=&quot;k&quot;&gt;SELECT&lt;/span&gt; &lt;span 
class=&quot;n&quot;&gt;seller_id&lt;/span&gt;&lt;span 
class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span 
class=&quot;n&quot;&gt;percentile&lt;/span&gt;&lt;span 
class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span 
class=&quot;n&quot;&gt;price&lt;/span&gt;&lt;span 
class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span 
class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span 
class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span 
class=&quot;mi&quot;&gt;5&lt;/span&gt;&lt;span 
class=&quot;p&quot;&gt;)&lt;/span&gt;
+&lt;span class=&quot;k&quot;&gt;FROM&lt;/span&gt; &lt;span 
class=&quot;n&quot;&gt;test_kylin_fact&lt;/span&gt;
+&lt;span class=&quot;k&quot;&gt;GROUP&lt;/span&gt; &lt;span 
class=&quot;k&quot;&gt;BY&lt;/span&gt; &lt;span 
class=&quot;n&quot;&gt;seller_id&lt;/span&gt;
+&lt;/code&gt;&lt;/pre&gt;
+&lt;/div&gt;
+
+&lt;h3 id=&quot;how-to-use&quot;&gt;How to use&lt;/h3&gt;
+&lt;p&gt;If you are not familiar with &lt;em&gt;Cubes&lt;/em&gt;, please read the 
&lt;a 
href=&quot;http://kylin.apache.org/docs20/tutorial/kylin_sample.html&quot;&gt;QuickStart&lt;/a&gt;
 first to learn the basics.&lt;/p&gt;
+
+&lt;p&gt;Firstly, add the column you want percentiles for as a measure in the data 
model.&lt;/p&gt;
+
+&lt;p&gt;&lt;img src=&quot;/images/blog/percentile_1.png&quot; 
alt=&quot;&quot; /&gt;&lt;/p&gt;
+
+&lt;p&gt;Secondly, create a cube and add a PERCENTILE measure.&lt;/p&gt;
+
+&lt;p&gt;&lt;img src=&quot;/images/blog/percentile_2.png&quot; 
alt=&quot;&quot; /&gt;&lt;/p&gt;
+
+&lt;p&gt;Finally, build the cube and run some queries.&lt;/p&gt;
+
+&lt;p&gt;&lt;img src=&quot;/images/blog/percentile_3.png&quot; 
alt=&quot;&quot; /&gt;&lt;/p&gt;
+</description>
+        <pubDate>Sat, 01 Apr 2017 15:22:22 -0700</pubDate>
+        
<link>http://kylin.apache.org/blog/2017/04/01/percentile-measure/</link>
+        <guid 
isPermaLink="true">http://kylin.apache.org/blog/2017/04/01/percentile-measure/</guid>
+        
+        
+        <category>blog</category>
+        
+      </item>
+    
+      <item>
         <title>Apache Kylin v2.0.0 Beta Announcement</title>
         <description>&lt;p&gt;The Apache Kylin community is pleased to 
announce the &lt;a href=&quot;http://kylin.apache.org/download/&quot;&gt;v2.0.0 
beta package&lt;/a&gt; is ready for download and test.&lt;/p&gt;
 
@@ -599,173 +640,6 @@ kylin_sales_cube is a cube name.&lt;br /
         
         
         <category>blog</category>
-        
-      </item>
-    
-      <item>
-        <title>Use Count Distinct in Apache Kylin</title>
-        <description>&lt;p&gt;Since v.1.5.3&lt;/p&gt;
-
-&lt;h2 id=&quot;background&quot;&gt;Background&lt;/h2&gt;
-&lt;p&gt;Count Distinct is a common measure in OLAP analysis, usually used 
for UV, etc. Apache Kylin offers two kinds of count distinct, approximate and 
precise, which differ in resource usage and performance.&lt;/p&gt;
-
-&lt;h2 id=&quot;approximately-count-distinct&quot;&gt;Approximate Count 
Distinct&lt;/h2&gt;
-&lt;p&gt;Apache Kylin implements approximate count distinct using the 
HyperLogLog algorithm, offering several precision levels, with error rates from 
9.75% to 1.22%. &lt;br /&gt;
-The measure result has a theoretical upper limit in size of 2^N bytes. For 
the max precision N=16, the upper limit is 64KB, and the max error rate is 
1.22%. &lt;br /&gt;
-The advantages of this implementation are fast calculation and low storage 
cost, but it can’t be used when precise results are required.&lt;/p&gt;
-
-&lt;h2 id=&quot;precisely-count-distinct&quot;&gt;Precise Count 
Distinct&lt;/h2&gt;
-&lt;p&gt;Apache Kylin also implements precise count distinct based on 
bitmaps. For data of type tinyint (byte), smallint (short) and int, 
the value is projected into the bitmap directly. For data of type long, string 
and others, the value is encoded as a String into a dict, and the dict id is 
projected into the bitmap.&lt;br /&gt;
-The measure result is the serialized bitmap, not just the count 
value. This ensures that the result is always correct under any roll-up, even 
across segments.&lt;br /&gt;
-The advantage of this implementation is a precise result with no error, but it 
needs more storage. A single result might be hundreds of MB when the count 
distinct value is in the millions.&lt;/p&gt;
-
-&lt;h2 id=&quot;global-dictionary&quot;&gt;Global Dictionary&lt;/h2&gt;
-&lt;p&gt;Apache Kylin encodes values into a dictionary at the segment level by 
default. That means one value may be encoded into different IDs in different 
segments, so the result of count distinct across segments will be incorrect.&lt;/p&gt;
-
-&lt;p&gt;v1.5.3 introduces the “Global Dictionary”, which ensures that 
one value is always encoded into the same ID across different segments. 
Meanwhile, the capacity of the dictionary has expanded dramatically, 
supporting up to 2 billion values in one dictionary. It can also be used to 
replace the default dictionary, which has a limit of 5 million values.&lt;/p&gt;
-
-&lt;p&gt;The current version (v1.5.3) has no GUI for defining a global dictionary 
yet; you need to manually edit the cube desc JSON like this:&lt;/p&gt;
-
-&lt;div class=&quot;highlighter-rouge&quot;&gt;&lt;pre 
class=&quot;highlight&quot;&gt;&lt;code&gt;&quot;dictionaries&quot;: [
-    {
-          &quot;column&quot;: &quot;SUCPAY_USERID&quot;,
-                  &quot;reuse&quot;: &quot;USER_ID&quot;,
-          &quot;builder&quot;: 
&quot;org.apache.kylin.dict.GlobalDictionaryBuilder&quot;
-    }
-]
-&lt;/code&gt;&lt;/pre&gt;
-&lt;/div&gt;
-
-&lt;p&gt;The &lt;code 
class=&quot;highlighter-rouge&quot;&gt;column&lt;/code&gt; is the column 
to be encoded, and the &lt;code 
class=&quot;highlighter-rouge&quot;&gt;builder&lt;/code&gt; specifies the 
dictionary builder; only &lt;code 
class=&quot;highlighter-rouge&quot;&gt;org.apache.kylin.dict.GlobalDictionaryBuilder&lt;/code&gt;
 is available for now.&lt;br /&gt;
-The ‘reuse’ option is used to optimize the dicts of multiple columns based on 
one dataset; please refer to the next section, ‘Example’, for more 
details.&lt;/p&gt;
-
-&lt;p&gt;Later versions (v1.5.4 or above) provide a GUI for global dictionary 
definition, in the ‘Advanced Dictionaries’ part of the ‘Advanced Setting’ step 
of the cube designer.&lt;/p&gt;
-
-&lt;p&gt;The global dictionary cannot be used for dimension encoding for now, 
which means that if one column is used as both a dimension and a count distinct 
measure in one cube, its dimension encoding must be something other than dict.&lt;/p&gt;
-
-&lt;h2 id=&quot;example&quot;&gt;Example&lt;/h2&gt;
-&lt;p&gt;Here’s some example data:&lt;/p&gt;
-
-&lt;table&gt;
-  &lt;thead&gt;
-    &lt;tr&gt;
-      &lt;th style=&quot;text-align: left&quot;&gt;DT&lt;/th&gt;
-      &lt;th style=&quot;text-align: center&quot;&gt;USER_ID&lt;/th&gt;
-      &lt;th style=&quot;text-align: center&quot;&gt;FLAG1&lt;/th&gt;
-      &lt;th style=&quot;text-align: center&quot;&gt;FLAG2&lt;/th&gt;
-      &lt;th style=&quot;text-align: center&quot;&gt;USER_ID_FLAG1&lt;/th&gt;
-      &lt;th style=&quot;text-align: center&quot;&gt;USER_ID_FLAG2&lt;/th&gt;
-    &lt;/tr&gt;
-  &lt;/thead&gt;
-  &lt;tbody&gt;
-    &lt;tr&gt;
-      &lt;td style=&quot;text-align: left&quot;&gt;2016-06-08&lt;/td&gt;
-      &lt;td style=&quot;text-align: center&quot;&gt;AAA&lt;/td&gt;
-      &lt;td style=&quot;text-align: center&quot;&gt;1&lt;/td&gt;
-      &lt;td style=&quot;text-align: center&quot;&gt;1&lt;/td&gt;
-      &lt;td style=&quot;text-align: center&quot;&gt;AAA&lt;/td&gt;
-      &lt;td style=&quot;text-align: center&quot;&gt;AAA&lt;/td&gt;
-    &lt;/tr&gt;
-    &lt;tr&gt;
-      &lt;td style=&quot;text-align: left&quot;&gt;2016-06-08&lt;/td&gt;
-      &lt;td style=&quot;text-align: center&quot;&gt;BBB&lt;/td&gt;
-      &lt;td style=&quot;text-align: center&quot;&gt;1&lt;/td&gt;
-      &lt;td style=&quot;text-align: center&quot;&gt;1&lt;/td&gt;
-      &lt;td style=&quot;text-align: center&quot;&gt;BBB&lt;/td&gt;
-      &lt;td style=&quot;text-align: center&quot;&gt;BBB&lt;/td&gt;
-    &lt;/tr&gt;
-    &lt;tr&gt;
-      &lt;td style=&quot;text-align: left&quot;&gt;2016-06-08&lt;/td&gt;
-      &lt;td style=&quot;text-align: center&quot;&gt;CCC&lt;/td&gt;
-      &lt;td style=&quot;text-align: center&quot;&gt;0&lt;/td&gt;
-      &lt;td style=&quot;text-align: center&quot;&gt;1&lt;/td&gt;
-      &lt;td style=&quot;text-align: center&quot;&gt;NULL&lt;/td&gt;
-      &lt;td style=&quot;text-align: center&quot;&gt;CCC&lt;/td&gt;
-    &lt;/tr&gt;
-    &lt;tr&gt;
-      &lt;td style=&quot;text-align: left&quot;&gt;2016-06-09&lt;/td&gt;
-      &lt;td style=&quot;text-align: center&quot;&gt;AAA&lt;/td&gt;
-      &lt;td style=&quot;text-align: center&quot;&gt;0&lt;/td&gt;
-      &lt;td style=&quot;text-align: center&quot;&gt;1&lt;/td&gt;
-      &lt;td style=&quot;text-align: center&quot;&gt;NULL&lt;/td&gt;
-      &lt;td style=&quot;text-align: center&quot;&gt;AAA&lt;/td&gt;
-    &lt;/tr&gt;
-    &lt;tr&gt;
-      &lt;td style=&quot;text-align: left&quot;&gt;2016-06-09&lt;/td&gt;
-      &lt;td style=&quot;text-align: center&quot;&gt;CCC&lt;/td&gt;
-      &lt;td style=&quot;text-align: center&quot;&gt;1&lt;/td&gt;
-      &lt;td style=&quot;text-align: center&quot;&gt;0&lt;/td&gt;
-      &lt;td style=&quot;text-align: center&quot;&gt;CCC&lt;/td&gt;
-      &lt;td style=&quot;text-align: center&quot;&gt;NULL&lt;/td&gt;
-    &lt;/tr&gt;
-    &lt;tr&gt;
-      &lt;td style=&quot;text-align: left&quot;&gt;2016-06-10&lt;/td&gt;
-      &lt;td style=&quot;text-align: center&quot;&gt;BBB&lt;/td&gt;
-      &lt;td style=&quot;text-align: center&quot;&gt;0&lt;/td&gt;
-      &lt;td style=&quot;text-align: center&quot;&gt;1&lt;/td&gt;
-      &lt;td style=&quot;text-align: center&quot;&gt;NULL&lt;/td&gt;
-      &lt;td style=&quot;text-align: center&quot;&gt;BBB&lt;/td&gt;
-    &lt;/tr&gt;
-  &lt;/tbody&gt;
-&lt;/table&gt;
-
-&lt;p&gt;There are basic columns &lt;code 
class=&quot;highlighter-rouge&quot;&gt;DT&lt;/code&gt;, &lt;code 
class=&quot;highlighter-rouge&quot;&gt;USER_ID&lt;/code&gt;, &lt;code 
class=&quot;highlighter-rouge&quot;&gt;FLAG1&lt;/code&gt;, &lt;code 
class=&quot;highlighter-rouge&quot;&gt;FLAG2&lt;/code&gt;, and conditional 
columns &lt;code 
class=&quot;highlighter-rouge&quot;&gt;USER_ID_FLAG1=if(FLAG1=1,USER_ID,null)&lt;/code&gt;,
 &lt;code 
class=&quot;highlighter-rouge&quot;&gt;USER_ID_FLAG2=if(FLAG2=1,USER_ID,null)&lt;/code&gt;.
 Suppose the cube is built by day and has 3 segments.&lt;/p&gt;
-
-&lt;p&gt;Without the global dictionary, the precise count distinct within a 
segment is correct, but the roll-up across segments will be wrong. Here’s an 
example:&lt;/p&gt;
-
-&lt;div class=&quot;highlighter-rouge&quot;&gt;&lt;pre 
class=&quot;highlight&quot;&gt;&lt;code&gt;select count(distinct user_id_flag1) 
from table where dt in (&#39;2016-06-08&#39;, &#39;2016-06-09&#39;)
-&lt;/code&gt;&lt;/pre&gt;
-&lt;/div&gt;
-&lt;p&gt;The result is 2 rather than 3. The reason is that the dict in the 2016-06-08 
segment is AAA=&amp;gt;1, BBB=&amp;gt;2, and the dict in the 2016-06-09 segment is 
CCC=&amp;gt;1, so CCC collides with AAA when the bitmaps are merged.&lt;br /&gt;
-With the global dictionary configured as below, the dict becomes AAA=&amp;gt;1, 
BBB=&amp;gt;2, CCC=&amp;gt;3, which produces the correct result.&lt;br /&gt;
-&lt;code class=&quot;highlighter-rouge&quot;&gt;
-&quot;dictionaries&quot;: [
-    {
-      &quot;column&quot;: &quot;USER_ID_FLAG1&quot;,
-      &quot;builder&quot;: 
&quot;org.apache.kylin.dict.GlobalDictionaryBuilder&quot;
-    }
-]
-&lt;/code&gt;&lt;/p&gt;
-
-&lt;p&gt;In fact, the data of USER_ID_FLAG1 and USER_ID_FLAG2 are both 
subsets of the USER_ID dataset, which makes dictionary reuse possible. Just 
encode the USER_ID dataset, and configure USER_ID_FLAG1 and USER_ID_FLAG2 to 
reuse the USER_ID dict:&lt;br /&gt;
-&lt;code class=&quot;highlighter-rouge&quot;&gt;
-&quot;dictionaries&quot;: [
-    {
-      &quot;column&quot;: &quot;USER_ID&quot;,
-      &quot;builder&quot;: 
&quot;org.apache.kylin.dict.GlobalDictionaryBuilder&quot;
-    },
-    {
-      &quot;column&quot;: &quot;USER_ID_FLAG1&quot;,
-      &quot;reuse&quot;: &quot;USER_ID&quot;,
-      &quot;builder&quot;: 
&quot;org.apache.kylin.dict.GlobalDictionaryBuilder&quot;
-    },
-    {
-      &quot;column&quot;: &quot;USER_ID_FLAG2&quot;,
-      &quot;reuse&quot;: &quot;USER_ID&quot;,
-      &quot;builder&quot;: 
&quot;org.apache.kylin.dict.GlobalDictionaryBuilder&quot;
-    }
-]
-&lt;/code&gt;&lt;/p&gt;
-
-&lt;h2 id=&quot;performance-tunning&quot;&gt;Performance Tuning&lt;/h2&gt;
-&lt;p&gt;When using a global dictionary and the dictionary is large, the step 
‘Build Base Cuboid Data’ may take a long time. This is mainly caused by the 
cost of dictionary cache loading and eviction, since the dictionary is bigger 
than the mapper memory. To solve this problem, override the cube 
configuration as follows, adjusting the mapper memory to 8GB:&lt;br /&gt;
-&lt;code class=&quot;highlighter-rouge&quot;&gt;
-kylin.job.mr.config.override.mapred.map.child.java.opts=-Xmx8g
-kylin.job.mr.config.override.mapreduce.map.memory.mb=8500
-&lt;/code&gt;&lt;/p&gt;
-
-&lt;h2 id=&quot;conclusions&quot;&gt;Conclusions&lt;/h2&gt;
-&lt;p&gt;Here are some basic principles for deciding which kind of count 
distinct to use:&lt;br /&gt;
- - If a result with some error is acceptable, the approximate way is always the 
better choice&lt;br /&gt;
- - If you need a precise result, the only way is precise count distinct&lt;br 
/&gt;
- - If you don’t need roll-up across segments (like a non-partitioned cube), or 
the column data type is tinyint/smallint/int, or the number of values is less than 
5M, just use the default dictionary; otherwise the global dictionary should be 
configured, and also consider the “reuse” column optimization&lt;/p&gt;
-</description>
-        <pubDate>Mon, 01 Aug 2016 11:30:00 -0700</pubDate>
-        
<link>http://kylin.apache.org/blog/2016/08/01/count-distinct-in-kylin/</link>
-        <guid 
isPermaLink="true">http://kylin.apache.org/blog/2016/08/01/count-distinct-in-kylin/</guid>
-        
-        
-        <category>blog</category>
         
       </item>
     

Added: kylin/site/images/blog/percentile_1.png
URL: 
http://svn.apache.org/viewvc/kylin/site/images/blog/percentile_1.png?rev=1790118&view=auto
==============================================================================
Binary file - no diff available.

Propchange: kylin/site/images/blog/percentile_1.png
------------------------------------------------------------------------------
    svn:mime-type = application/octet-stream

Added: kylin/site/images/blog/percentile_2.png
URL: 
http://svn.apache.org/viewvc/kylin/site/images/blog/percentile_2.png?rev=1790118&view=auto
==============================================================================
Binary file - no diff available.

Propchange: kylin/site/images/blog/percentile_2.png
------------------------------------------------------------------------------
    svn:mime-type = application/octet-stream

Added: kylin/site/images/blog/percentile_3.png
URL: 
http://svn.apache.org/viewvc/kylin/site/images/blog/percentile_3.png?rev=1790118&view=auto
==============================================================================
Binary file - no diff available.

Propchange: kylin/site/images/blog/percentile_3.png
------------------------------------------------------------------------------
    svn:mime-type = application/octet-stream
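
Note on the percentile query in the added blog post: the sketch below shows, in plain Python, roughly what `SELECT seller_id, percentile(price, 0.5) ... GROUP BY seller_id` asks for. It is not Kylin's implementation. Kylin returns an approximate value from a precalculated t-digest, whereas this sketch computes an exact percentile with the nearest-rank method, which is only one of several common definitions (some engines interpolate between neighbouring values instead). The function names and the sample rows are hypothetical and chosen purely for illustration.

```python
import math
from collections import defaultdict


def nearest_rank_percentile(values, q):
    """Exact percentile by the nearest-rank method; q is a fraction in (0, 1]."""
    if not values:
        raise ValueError("values must not be empty")
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    ordered = sorted(values)
    rank = math.ceil(q * len(ordered))  # 1-based rank into the sorted values
    return ordered[rank - 1]


def percentile_by_group(rows, q):
    """Group (key, value) pairs by key and compute the percentile per group."""
    groups = defaultdict(list)
    for key, value in rows:
        groups[key].append(value)
    return {key: nearest_rank_percentile(vals, q) for key, vals in groups.items()}


# Hypothetical (seller_id, price) rows standing in for test_kylin_fact.
rows = [(1, 10.0), (1, 20.0), (1, 30.0), (2, 5.0), (2, 15.0)]
result = percentile_by_group(rows, 0.5)
print(result)
```

An exact computation like this needs every raw value of each group at query time. A precalculated sketch such as t-digest avoids that by storing a compact, mergeable summary per cuboid row, at the cost of an approximate answer.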

