Repository: kylin Updated Branches: refs/heads/document fe0d56898 -> ed810ebea
minor changes on documents Project: http://git-wip-us.apache.org/repos/asf/kylin/repo Commit: http://git-wip-us.apache.org/repos/asf/kylin/commit/ed810ebe Tree: http://git-wip-us.apache.org/repos/asf/kylin/tree/ed810ebe Diff: http://git-wip-us.apache.org/repos/asf/kylin/diff/ed810ebe Branch: refs/heads/document Commit: ed810ebea8f06bbeeb432469866da56a43762caf Parents: fe0d568 Author: honma <ho...@ebay.com> Authored: Thu Feb 11 20:44:56 2016 +0800 Committer: honma <ho...@ebay.com> Committed: Wed Feb 17 10:36:03 2016 +0800 ---------------------------------------------------------------------- website/_dev/howto_test.md | 35 +++++++++++++++---- .../_posts/blog/2016-02-03-streaming-cubing.md | 28 +++++++++++++++ website/images/develop/streaming.png | Bin 0 -> 211683 bytes 3 files changed, 56 insertions(+), 7 deletions(-) ---------------------------------------------------------------------- http://git-wip-us.apache.org/repos/asf/kylin/blob/ed810ebe/website/_dev/howto_test.md ---------------------------------------------------------------------- diff --git a/website/_dev/howto_test.md b/website/_dev/howto_test.md index 150ece8..6aaa056 100644 --- a/website/_dev/howto_test.md +++ b/website/_dev/howto_test.md @@ -7,21 +7,19 @@ permalink: /development/howto_test.html In general, there should be unit tests to cover individual classes; there must be integration test to cover end-to-end scenarios like build, merge, and query. Unit test must run independently (does not require an external sandbox). - -## 2.x branches +## Test 2.x branches * `mvn test` to run unit tests, which has a limited test coverage. * Unit tests has no external dependency and can run on any machine. * The unit tests do not cover end-to-end scenarios like build, merge, and query. * The unit tests take a few minutes to complete. * `dev-support/test_all_against_hdp_2_2_4_2_2.sh` to run integration tests, which has the best test coverage. - * Integration tests __must run on a Hadoop sandbox__. 
Make sure all changes you want to test are avaiable on sandbox. + * Integration tests __are better run on a Hadoop sandbox__. We suggest checking out a copy of the code on your sandbox and running test_all_against_hdp_2_2_4_2_2.sh there directly. If you don't want to put the code on the sandbox, refer to __More on 2.x UT/IT separation__ * As the name indicates, the script is only for hdp 2.2.4.2, but you get the idea of how the integration tests run from it. * The integration tests start by generating random data, then build the cube, merge the cube, and finally query the result and compare it to H2 DB. - * The integration tests take a few hours to complete. - + * The integration tests take one to two hours to complete. -## 1.x branches +## Test 1.x branches * `mvn test` to run unit tests, which has a limited test coverage. * What's special about 1.x is that a hadoop/hbase mini cluster is used to cover queries in unit tests. @@ -32,12 +30,34 @@ In general, there should be unit tests to cover individual classes; there must b * `mvn test -fae -P sandbox` * `mvn test -fae -Dtest=org.apache.kylin.query.test.IIQueryTest -Dhdp.version=2.2.0.0-2041 -DfailIfNoTests=false -P sandbox` +## More on 2.x UT/IT separation + +From Kylin 2.0 you can run UT (unit test), environment cube provision, and IT (integration test) separately. +Running `mvn verify -Dhdp.version=2.2.4.2-2` (assuming you're on your sandbox) is all you need to run the complete test suite. + +It will execute the following steps sequentially: + + 1. Build artifacts + 2. Run all UTs (takes a few minutes) + 3. Provision cubes on the sandbox environment for IT usage (takes 1~2 hours) + 4. Run all ITs (takes tens of minutes) + 5. 
Verify the built jars + +If your code change is minor and merely requires running the UTs, use: +`mvn test` +If your sandbox is already provisioned and your code change will not affect the result of sandbox provision (and you don't want to wait hours for provisioning), just run the following commands to run UT and IT separately: +`mvn test` +`mvn failsafe:integration-test` + +### Cube Provision + +Environment cube provision essentially runs Kylin cubing jobs to prepare example cubes in the sandbox. These prepared cubes will be used by the ITs. Currently the provision step is bound to the maven pre-integration-test phase, and it includes running BuildCubeWithEngine (HBase required), BuildCubeWithStream (Kafka required) and BuildIIWithStream (Kafka required). You can run the mvn commands on your sandbox or your development computer. For the latter case you need to set kylin.job.run.as.remote.cmd=true in __$KYLIN_HOME/examples/test_case_data/sandbox/kylin.properties__. +Try appending `-DfastBuildMode=true` to the mvn verify command to speed up provision by skipping incremental cubing. ## More on 1.x Mini Cluster Kylin 1.x tried to move as many unit test cases as possible from the sandbox to an HBase mini cluster (no longer the case in 2.x), so that users can easily run tests locally without a Hadoop sandbox. Two maven profiles are created in the root pom.xml, "default" and "sandbox". The default profile will start up an HBase Mini Cluster to prepare the test data and run the unit tests (the test cases that are not supported by the Mini cluster have been added to the "exclude" list). 
If you want to keep using the sandbox to run tests, just run `mvn test -P sandbox` - ### When using the "default" profile, Kylin will * Start up an HBase mini cluster and update KylinConfig with the dynamic HBase configurations @@ -46,4 +66,5 @@ Kylin 1.x used to move as many as possible unit test cases from sandbox to HBase * After all test cases complete, shut down the mini cluster and clean up the KylinConfig cache ### To ensure the Mini cluster can run successfully, you need + * Make sure JAVA_HOME is properly set http://git-wip-us.apache.org/repos/asf/kylin/blob/ed810ebe/website/_posts/blog/2016-02-03-streaming-cubing.md ---------------------------------------------------------------------- diff --git a/website/_posts/blog/2016-02-03-streaming-cubing.md b/website/_posts/blog/2016-02-03-streaming-cubing.md new file mode 100644 index 0000000..525f4b8 --- /dev/null +++ b/website/_posts/blog/2016-02-03-streaming-cubing.md @@ -0,0 +1,28 @@ +--- +layout: post-blog +title: Streaming cubing (Prototype) +date: 2016-02-03 16:30:00 +author: Hongbin Ma +categories: blog +--- + + +One of the most important features in the 2.x branches is streaming cubing, which enables OLAP analysis on streaming data. Streaming cubing delivers faster insights on the data to support more prompt business decisions. Even though there are already many real-time analysis tools in the open source community, Kylin streaming cubing still differs from them in multiple respects: +Firstly, Kylin Streaming Cubing aligns with Kylin traditional cubing to provide a unified, ANSI SQL interface. Kylin Streaming actually shares the storage engine and query engine with traditional Kylin cubes, so in theory all of the optimization techniques to save storage and speed up query performance can also be applied to streaming cubes. Besides, all the supported aggregations/filters/UDFs still work for streaming cubes. By unifying the storage engine and query engine we are also freed from doubling the maintenance work. 
+ +Secondly, Kylin Streaming Cubing does not require a large amount of memory to store real-time data, nor does it attempt to provide truly "real time" analysis. From our customer survey we found that minutes of visualization latency is acceptable for OLAP analysts, so our streaming cubing adopts the micro-batch approach. Incoming streaming data are partitioned into different time windows and we build a micro batch for each time window. The cube output for each micro batch is directly saved to HBase. The query engine goes to HBase to retrieve data instead of going to the data ingestion server. The benefit of such a design is that we don't have to maintain a large in-memory index, which could easily require tens of gigabytes of memory. In the future Kylin might need to consider truly "real time" support, too. + +Thirdly, Kylin Streaming Cubing data is persisted and gradually converted to traditional cubes, so customers can still query "cold data" without any compromise on performance. As discussed above, the output of streaming cubing is directly saved to HBase as a new segment. The traditional job engine will be notified of the new segment and take over to schedule merge jobs as the segments accumulate. Day after day the segments of the streaming cube get merged and become a very large traditional cube. + + + +With these major differences in mind we will introduce the modules of Kylin Streaming cubing. Kylin Streaming cubing consists of three major parts: + +* Streaming Input to retrieve data from a replayable data queue (currently Kafka) within a given time window. Streaming Input is also responsible for primary data cleaning and normalization. Kylin Streaming provides a default implementation to parse the messages from the source queue. Customers can choose to configure the parser or provide a brand new one based on their requirements. +* Streaming Batch Ingestion to ingest the incoming data batch and transform it into a micro cube. 
Thanks to the latest Kylin in-memory cubing technology, this step is now several times faster and more space-efficient than before. The micro cube is directly saved to HBase. +* Job Scheduling Module to trigger Streaming Batch Ingestion. Kylin does not put too much effort into job scheduling, and streaming cubing is no exception. Currently we provide a simple implementation based on Linux Crontab. + +We'll publish more detailed documents on how to use Kylin Streaming soon. In the latest 2.x branch we are also working on more sophisticated load balancing schemes for streaming cubing. Please stay tuned. + \ No newline at end of file http://git-wip-us.apache.org/repos/asf/kylin/blob/ed810ebe/website/images/develop/streaming.png ---------------------------------------------------------------------- diff --git a/website/images/develop/streaming.png b/website/images/develop/streaming.png new file mode 100644 index 0000000..1123e14 Binary files /dev/null and b/website/images/develop/streaming.png differ
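The 2.x test entry points added to howto_test.md above can be summarized as a small shell sketch. This wrapper is hypothetical (not part of the Kylin repo); it only echoes the mvn invocations quoted in the document, and the hdp.version value is specific to the hdp 2.2.4.2 sandbox:

```shell
#!/bin/sh
# Hypothetical helper summarizing the 2.x test entry points from howto_test.md.
# It only prints the mvn commands quoted in the document; run them on the
# sandbox (or set kylin.job.run.as.remote.cmd=true for remote provisioning).
run_tests() {
  case "$1" in
    ut)   echo "mvn test" ;;                           # unit tests only, a few minutes
    full) echo "mvn verify -Dhdp.version=2.2.4.2-2" ;; # UT + cube provision + IT, hours
    it)   echo "mvn failsafe:integration-test" ;;      # ITs on an already provisioned sandbox
    *)    echo "usage: run_tests ut|full|it" >&2; return 1 ;;
  esac
}

run_tests full
```

Per the document, `-DfastBuildMode=true` may be appended to the `full` invocation to skip incremental cubing during provision.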