Author: lidong
Date: Wed Oct 19 06:02:59 2016
New Revision: 1765533

URL: http://svn.apache.org/viewvc?rev=1765533&view=rev
Log:
minor update on the blog
Modified:
    kylin/site/blog/2016/10/18/new-nrt-streaming/index.html
    kylin/site/feed.xml

Modified: kylin/site/blog/2016/10/18/new-nrt-streaming/index.html
URL: http://svn.apache.org/viewvc/kylin/site/blog/2016/10/18/new-nrt-streaming/index.html?rev=1765533&r1=1765532&r2=1765533&view=diff
==============================================================================
--- kylin/site/blog/2016/10/18/new-nrt-streaming/index.html (original)
+++ kylin/site/blog/2016/10/18/new-nrt-streaming/index.html Wed Oct 19 06:02:59 2016
@@ -205,15 +205,15 @@
 </li>
 </ul>

-<p>To overcome these limitations, the Apache Kylin team developed the new streaming (<a href="https://issues.apache.org/jira/browse/KYLIN-1726">KYLIN-1726</a>) with Kafka 0.10 API, it has been tested internally for some time, will release to public soon.</p>
+<p>To overcome these limitations, the Apache Kylin team developed the new streaming (<a href="https://issues.apache.org/jira/browse/KYLIN-1726">KYLIN-1726</a>) with Kafka 0.10; it has been tested internally for some time and will be released to the public soon.</p>

-<p>The new design is a perfect implementation under Kylin 1.5’s “Plug-in” architecture: treat Kafka topic as a “Data Source” like Hive table, using an adapter to extract the data to HDFS; the next steps are almost the same as from Hive. Figure 1 is a high level architecture of the new design.</p>
+<p>The new design is a perfect implementation under Kylin 1.5’s “plug-in” architecture: it treats a Kafka topic as a “Data Source” just like a Hive table, using an adapter to extract the data to HDFS; the next steps are almost the same as for other cubes. 
Figure 1 is a high-level architecture of the new design.</p>

 <p><img src="/images/blog/new-streaming.png" alt="Kylin New Streaming Framework Architecture" /></p>

-<p>The adapter to read Kafka messages is modified from <a href="https://github.com/amient/kafka-hadoop-loader">kafka-hadoop-loader</a>, which is open sourced under Apache License V2.0; it starts a mapper for each Kafka partition, reading and then saving the messages to HDFS; in next steps Kylin will be able to leverage existing framework like MR to do the processing, this makes the solution scalable and fault-tolerant.</p>
+<p>The adapter to read Kafka messages is modified from <a href="https://github.com/amient/kafka-hadoop-loader">kafka-hadoop-loader</a>, which its author, Michal Harish, open sourced under Apache License V2.0; it starts a mapper for each Kafka partition, reading the messages and then saving them to HDFS, so Kylin can leverage existing frameworks like MapReduce to do the processing; this makes the solution scalable and fault-tolerant.</p>

-<p>To overcome the “data loss” problem, Kylin adds the start/end offset information on each Cube segment, and then use the offsets as the partition value (no overlap is allowed); this ensures no data be lost and 1 message be consumed at most once. To let the late/early message can be queried, Cube segments allow overlap for the partition time dimension: Kylin will scan all segments which include the queried time. Figure 2 illurates this.</p>
+<p>To overcome the “data loss” limitation, Kylin adds the start/end offset information to each Cube segment, and then uses the offsets as the partition value (no overlap allowed); this ensures that no data is lost and each message is consumed at most once. To let late/early messages be queried, Cube segments are allowed to overlap on the partition time dimension: each segment has a “min” date/time and a “max” date/time, and Kylin will scan all segments that match the queried time scope. 
Figure 2 illustrates this.</p>

 <p><img src="/images/blog/offset-as-partition-value.png" alt="Use Offset to Cut Segments" /></p>

@@ -227,23 +227,25 @@
 <li>Add REST API to check and fill the segment holes</li>
 </ul>

-<p>The integration test result shows big improvements than the previous version:</p>
+<p>The integration test result is promising:</p>

 <ul>
 <li>Scalability: it can easily process up to hundreds of millions of records in one build;</li>
-  <li>Flexibility: trigger the build at any time with the frequency you want, e.g: every 5 minutes in day and every hour in night; Kylin manages the offsets so it can resume from the last position;</li>
-  <li>Stability: pretty stable, no OutOfMemory error;</li>
+  <li>Flexibility: you can trigger the build at any time, with the frequency you want; for example, every 5 minutes in the daytime but every hour at night, and you can even pause it when you need to do maintenance; Kylin manages the offsets, so it can automatically continue from the last position;</li>
+  <li>Stability: pretty stable, no OutOfMemoryError;</li>
 <li>Management: users can check all jobs’ status through Kylin’s “Monitor” page or REST API;</li>
 <li>Build Performance: in a testing cluster (8 AWS instances consuming Twitter streams, about 10 thousand messages arriving per second), with a 9-dimension cube with 3 measures defined, when the build interval is 2 minutes the job finishes in around 3 minutes; when the interval is changed to 5 minutes, the build finishes in around 4 minutes;</li>
 </ul>

-<p>Here are a couple of screenshots in this test:<br />
+<p>Here are a couple of screenshots from this test; we may compose them into a step-by-step tutorial in the future:<br />
 <img src="/images/blog/streaming-monitor.png" alt="Streaming Job Monitoring" /></p>

 <p><img src="/images/blog/streaming-adapter.png" alt="Streaming Adapter" /></p>

 <p><img src="/images/blog/streaming-twitter.png" alt="Streaming Twitter Sample" /></p>

+<p>In short, this is a more robust Near Real Time Streaming OLAP solution (compared with the previous version). Next, the Apache Kylin team will move toward a Real Time engine.</p>
+
 </article>
 </div>

Modified: kylin/site/feed.xml
URL: http://svn.apache.org/viewvc/kylin/site/feed.xml?rev=1765533&r1=1765532&r2=1765533&view=diff
==============================================================================
--- kylin/site/feed.xml (original)
+++ kylin/site/feed.xml Wed Oct 19 06:02:59 2016
@@ -19,8 +19,8 @@
 <description>Apache Kylin Home</description>
 <link>http://kylin.apache.org/</link>
 <atom:link href="http://kylin.apache.org/feed.xml" rel="self" type="application/rss+xml"/>
-    <pubDate>Tue, 18 Oct 2016 07:59:25 -0700</pubDate>
-    <lastBuildDate>Tue, 18 Oct 2016 07:59:25 -0700</lastBuildDate>
+    <pubDate>Wed, 19 Oct 2016 06:59:18 -0700</pubDate>
+    <lastBuildDate>Wed, 19 Oct 2016 06:59:18 -0700</lastBuildDate>
 <generator>Jekyll v2.5.3</generator>

 <item>
@@ -44,15 +44,15 @@
 </li>
 </ul>

-<p>To overcome these limitations, the Apache Kylin team developed the new streaming (<a href="https://issues.apache.org/jira/browse/KYLIN-1726">KYLIN-1726</a>) with Kafka 0.10 API, it has been tested internally for some time, will release to public soon.</p>
+<p>To overcome these limitations, the Apache Kylin team developed the new streaming (<a href="https://issues.apache.org/jira/browse/KYLIN-1726">KYLIN-1726</a>) with Kafka 0.10; it has been tested internally for some time and will be released to the public soon.</p>

-<p>The new design is a perfect implementation under Kylin 1.5’s “Plug-in” architecture: treat Kafka topic as a “Data Source” like Hive table, using an adapter to extract the data to HDFS; the next steps are almost the same as from Hive. Figure 1 is a high level architecture of the new design.</p>
+<p>The new design is a perfect implementation under Kylin 1.5’s “plug-in” architecture: it treats a Kafka topic as a “Data Source” just like a Hive table, using an adapter to extract the data to HDFS; the next steps are almost the same as for other cubes. 
Figure 1 is a high-level architecture of the new design.</p>

 <p><img src="/images/blog/new-streaming.png" alt="Kylin New Streaming Framework Architecture" /></p>

-<p>The adapter to read Kafka messages is modified from <a href="https://github.com/amient/kafka-hadoop-loader">kafka-hadoop-loader</a>, which is open sourced under Apache License V2.0; it starts a mapper for each Kafka partition, reading and then saving the messages to HDFS; in next steps Kylin will be able to leverage existing framework like MR to do the processing, this makes the solution scalable and fault-tolerant.</p>
+<p>The adapter to read Kafka messages is modified from <a href="https://github.com/amient/kafka-hadoop-loader">kafka-hadoop-loader</a>, which its author, Michal Harish, open sourced under Apache License V2.0; it starts a mapper for each Kafka partition, reading the messages and then saving them to HDFS, so Kylin can leverage existing frameworks like MapReduce to do the processing; this makes the solution scalable and fault-tolerant.</p>

-<p>To overcome the “data loss” problem, Kylin adds the start/end offset information on each Cube segment, and then use the offsets as the partition value (no overlap is allowed); this ensures no data be lost and 1 message be consumed at most once. To let the late/early message can be queried, Cube segments allow overlap for the partition time dimension: Kylin will scan all segments which include the queried time. Figure 2 illurates this.</p>
+<p>To overcome the “data loss” limitation, Kylin adds the start/end offset information to each Cube segment, and then uses the offsets as the partition value (no overlap allowed); this ensures that no data is lost and each message is consumed at most once. To let late/early messages be queried, Cube segments are allowed to overlap on the partition time dimension: each segment has a “min” date/time and a “max” date/time, and Kylin will scan all segments that match the queried time scope. 
Figure 2 illustrates this.</p>

 <p><img src="/images/blog/offset-as-partition-value.png" alt="Use Offset to Cut Segments" /></p>

@@ -66,22 +66,24 @@
 <li>Add REST API to check and fill the segment holes</li>
 </ul>

-<p>The integration test result shows big improvements than the previous version:</p>
+<p>The integration test result is promising:</p>

 <ul>
 <li>Scalability: it can easily process up to hundreds of millions of records in one build;</li>
-  <li>Flexibility: trigger the build at any time with the frequency you want, e.g: every 5 minutes in day and every hour in night; Kylin manages the offsets so it can resume from the last position;</li>
-  <li>Stability: pretty stable, no OutOfMemory error;</li>
+  <li>Flexibility: you can trigger the build at any time, with the frequency you want; for example, every 5 minutes in the daytime but every hour at night, and you can even pause it when you need to do maintenance; Kylin manages the offsets, so it can automatically continue from the last position;</li>
+  <li>Stability: pretty stable, no OutOfMemoryError;</li>
 <li>Management: users can check all jobs’ status through Kylin’s “Monitor” page or REST API;</li>
 <li>Build Performance: in a testing cluster (8 AWS instances consuming Twitter streams, about 10 thousand messages arriving per second), with a 9-dimension cube with 3 measures defined, when the build interval is 2 minutes the job finishes in around 3 minutes; when the interval is changed to 5 minutes, the build finishes in around 4 minutes;</li>
 </ul>

-<p>Here are a couple of screenshots in this test:<br />
+<p>Here are a couple of screenshots from this test; we may compose them into a step-by-step tutorial in the future:<br />
 <img src="/images/blog/streaming-monitor.png" alt="Streaming Job Monitoring" /></p>

 <p><img src="/images/blog/streaming-adapter.png" alt="Streaming Adapter" /></p>

 <p><img src="/images/blog/streaming-twitter.png" alt="Streaming Twitter Sample" /></p>

+
+<p>In short, this is a more robust Near Real Time Streaming OLAP solution (compared with the previous version). Next, the Apache Kylin team will move toward a Real Time engine.</p>
 </description>
 <pubDate>Tue, 18 Oct 2016 10:30:00 -0700</pubDate>
 <link>http://kylin.apache.org/blog/2016/10/18/new-nrt-streaming/</link>
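The segment-cutting scheme in the blog text above can be sketched in a few lines. This is a minimal illustration only, under the assumptions stated in the post; the names (`Segment`, `owning_segment`, `segments_to_scan`) are hypothetical and are not Kylin's actual classes or APIs:

```python
from dataclasses import dataclass

@dataclass
class Segment:
    start_offset: int  # inclusive Kafka offset; offset ranges never overlap
    end_offset: int    # exclusive Kafka offset
    min_time: int      # earliest event timestamp seen in the segment
    max_time: int      # latest event timestamp seen in the segment

def owning_segment(segments, offset):
    # Offsets are the partition value, so at most one segment owns a
    # given message: no data loss, each message consumed at most once.
    matches = [s for s in segments
               if s.start_offset <= offset < s.end_offset]
    return matches[0] if matches else None

def segments_to_scan(segments, query_start, query_end):
    # Event-time ranges MAY overlap (late/early messages), so a query
    # scans every segment whose [min_time, max_time] range intersects
    # the queried time scope.
    return [s for s in segments
            if s.min_time <= query_end and s.max_time >= query_start]
```

Because offset ranges are disjoint, each message is counted exactly once; because time ranges may overlap, a late-arriving message is still found, since the query scans every segment whose time range intersects the queried scope.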