This is an automated email from the ASF dual-hosted git repository. kishoreg pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/incubator-pinot.git
The following commit(s) were added to refs/heads/master by this push: new 9477e01 Add benchmark documentation. (#5683) 9477e01 is described below commit 9477e014ed7e053e9d098b15742dbbdfe7e85493 Author: Nikola Grcevski <6207777+grcev...@users.noreply.github.com> AuthorDate: Sat Jul 11 15:05:26 2020 -0400 Add benchmark documentation. (#5683) --- contrib/pinot-druid-benchmark/README.md | 297 ++++++++++++++++++++++++++++++++ 1 file changed, 297 insertions(+) diff --git a/contrib/pinot-druid-benchmark/README.md b/contrib/pinot-druid-benchmark/README.md new file mode 100644 index 0000000..f7ac5f1 --- /dev/null +++ b/contrib/pinot-druid-benchmark/README.md @@ -0,0 +1,297 @@ +# Running the benchmark + +For instructions on how to run the Pinot/Druid benchmark please refer to the +```run_benchmark.sh``` file. + +In order to run the Apache Pinot benchmark you'll need to create the appropriate +data segments, which are too large to be included in this github repository and +they may need to be recreated with new Apache Pinot versions. + +To create the neccessary segment data for the benchmark please follow the +instructions below. + +# Creating Apache Pinot benchmark segments from TPC-H data + +To run the Pinot/Druid benchmark with Apache Pinot you'll need to download and run +the TPC-H tools to generate the benchmark data sets. + +## Downloading and building the TPC-H tools + +The TPC-H tools can be downloaded from the [TPC-H Website](http://www.tpc.org/tpch/default5.asp). +Registration is required. + +**Note:**: The instructions below for dbgen assume a Linux OS. + +After downloading and extracing the TPC-H tools, you'll need to build the +db generator tool: ```dbgen```. To do so, extract the package that you have +downloaded from TPC-H's website and inside the dbgen sub directory edit the +```makefile``` file. + +Set the following variables in the makefile to: + +``` +CC = gcc +... +DATABASE= SQLSERVER +MACHINE = LINUX +WORKLOAD = TPCH +``` + +Next, build the dbgen tool as per the README instructions in the dbgen directory. + +## Generating the TPC-H data and converting them for use in Apache Pinot + +After building ```dbgen``` run the following command line in the ```dbgen``` directory: + +``` +./dbgen -TL -s8 +``` + +The command above will generate a single large file called ```lineitem.tbl```. +This is the data file for the TPC-H benchmark, which we'll need to post-process +a bit to be imported into Apache Pinot. + +Next, build the Pinot/Druid Benchmark code if you haven't done so already. + +**Note:** Apache Pinot has JDK11 support, however for now it's +best to use JDK8 for all build and run operations in this manual. + +Inside ```pinot_directory/contrib/pinot-druid-benchmark``` run: + +``` +mvn clean install +``` + +Next, inside the same directory split the ```lineitem``` table: + +``` +./target/appassembler/bin/data-separator.sh <Path to lineitem.tbl> <Output Directory> +``` + +Use the output directory from the split as the input directory for the merge +command below: + +``` +./target/appassembler/bin/data-merger.sh <Input Directory> <Output Directory> YEAR +``` + +If all ran well you should see a few CSV files produced, 1992.csv through 1998.csv. + +These files are the starting point for creating our Apache Pinot segments. + +## Create the Apache Pinot segments + +The first step in the process is to launch a standalone Apache Pinot Cluster on one +single server. This cluster will serve as a host to hold the initial segments, +which we'll extract and copy for later re-use in the benchmark. + +Follow the steps outlined in the Apache Pinot Manual Cluster setup to launch the +cluster: + +https://docs.pinot.apache.org/basics/getting-started/advanced-pinot-setup + +You don't need the Kafka service as we won't be using it. + +Next, we need to follow the instructions similar to the ones described in +the [Batch Import Example](https://docs.pinot.apache.org/basics/getting-started/pushing-your-data-to-pinot) +in the Apache Pinot documentation. + +### Create the Apache Pinot tables + +Run: + +``` +pinot-admin.sh AddTable \ + -tableConfigFile /absolute/path/to/table-config.json \ + -schemaFile /absolute/path/to/schema.json -exec +``` + +For this command above you'll need the following configuration files: + +```table_config.json``` +``` +{ + "tableName": "tpch_lineitem", + "segmentsConfig" : { + "replication" : "1", + "schemaName" : "tpch_lineitem", + "segmentAssignmentStrategy" : "BalanceNumSegmentAssignmentStrategy" + }, + "tenants" : { + "broker":"DefaultTenant", + "server":"DefaultTenant" + }, + "tableIndexConfig" : { + "starTreeIndexConfigs":[{ + "maxLeafRecords": 100, + "functionColumnPairs": ["SUM__l_extendedprice", "SUM__l_discount", "SUM__l_quantity"], + "dimensionsSplitOrder": ["l_receiptdate", "l_shipdate", "l_shipmode", "l_returnflag"], + "skipStarNodeCreationForDimensions": [], + "skipMaterializationForDimensions": ["l_partkey", "l_commitdate", "l_linestatus", "l_comment", "l_orderkey", "l_shipinstruct", "l_linenumber", "l_suppkey"] + }] + }, + "tableType":"OFFLINE", + "metadata": {} +} +``` + +```schema.json``` +``` +{ + "schemaName": "tpch_lineitem", + "dimensionFieldSpecs": [ + { + "name": "l_orderkey", + "dataType": "INT" + }, + { + "name": "l_partkey", + "dataType": "INT" + }, + { + "name": "l_suppkey", + "dataType": "INT" + }, + { + "name": "l_linenumber", + "dataType": "INT" + }, + { + "name": "l_returnflag", + "dataType": "STRING" + }, + { + "name": "l_linestatus", + "dataType": "STRING" + }, + { + "name": "l_shipdate", + "dataType": "STRING" + }, + { + "name": "l_commitdate", + "dataType": "STRING" + }, + { + "name": "l_receiptdate", + "dataType": "STRING" + }, + { + "name": "l_shipinstruct", + "dataType": "STRING" + }, + { + "name": "l_shipmode", + "dataType": "STRING" + }, + { + "name": "l_comment", + "dataType": "STRING" + } + ], + "metricFieldSpecs": [ + { + "name": "l_quantity", + "dataType": "LONG" + }, + { + "name": "l_extendedprice", + "dataType": "DOUBLE" + }, + { + "name": "l_discount", + "dataType": "DOUBLE" + }, + { + "name": "l_tax", + "dataType": "DOUBLE" + } + ] +} +``` + +**Note:** The configuration as specified above will give you +the data with the **optimal star tree index**. The index configuration is +specified in the ```tableIndexConfig``` section in the ```table_config.json``` file. If +you want to generate a different type of indexed segment, then you +should modify the tableIndexConfig section to reflect the correct index +type as described in the [Indexing Section](https://docs.pinot.apache.org/basics/features/indexing) +of the Apache Pinot Documentation. + +### Create the Apache Pinot segments + +Next, we'll create the segments for this Apache Pinot table using the optimal +star tree index configuration. + +For this purpose you'll need a job specification YAML file. Here's an example +that does the TPC-H data import: + +```job-spec.yml``` +``` +executionFrameworkSpec: + name: 'standalone' + segmentGenerationJobRunnerClassName: 'org.apache.pinot.plugin.ingestion.batch.standalone.SegmentGenerationJobRunner' + segmentTarPushJobRunnerClassName: 'org.apache.pinot.plugin.ingestion.batch.standalone.SegmentTarPushJobRunner' + segmentUriPushJobRunnerClassName: 'org.apache.pinot.plugin.ingestion.batch.standalone.SegmentUriPushJobRunner' +jobType: SegmentCreationAndTarPush +inputDirURI: '/absolute/path/to/incubator-pinot/contrib/pinot-druid-benchmark/data_out/raw_data/' +includeFileNamePattern: 'glob:**/*.csv' +outputDirURI: '/absolute/path/to/incubator-pinot/contrib/pinot-druid-benchmark/data_out/segments/' +overwriteOutput: true +pinotFSSpecs: + - scheme: file + className: org.apache.pinot.spi.filesystem.LocalPinotFS +recordReaderSpec: + dataFormat: 'csv' + className: 'org.apache.pinot.plugin.inputformat.csv.CSVRecordReader' + configClassName: 'org.apache.pinot.plugin.inputformat.csv.CSVRecordReaderConfig' + configs: + delimiter: '|' + header: 'l_orderkey|l_partkey|l_suppkey|l_linenumber|l_quantity|l_extendedprice|l_discount|l_tax|l_returnflag|l_linestatus|l_shipdate|l_commitdate|l_receiptdate|l_shipinstruct|l_shipmode|l_comment|' +tableSpec: + tableName: 'tpch_lineitem' + schemaURI: 'http://localhost:9000/tables/tpch_lineitem/schema' + tableConfigURI: 'http://localhost:9000/tables/tpch_lineitem' +pinotClusterSpecs: + - controllerURI: 'http://localhost:9000' +``` + +**Note:** Make sure you modify the absolute path for **inputDirURI** and **outputDirURI** +above. The inputDirURI should be pointing to the directory where you have +generated the 7 YEAR CSV files, 1992.csv through 1998.csv. + +After you have modified the input and output dir, run the job as described in the +[Batch Import Example](https://docs.pinot.apache.org/basics/getting-started/pushing-your-data-to-pinot) document: + + +``` +pinot-admin.sh LaunchDataIngestionJob \ + -jobSpecFile /absolute/path/to/job-spec.yml +``` + +The segment creation output on the console will tell you where Apache Pinot will +store the created segments (it should be your output dir). You should see a +line appear in the output as: + +``` +... +outputDirURI: /absolute/path/to/incubator-pinot/contrib/pinot-druid-benchmark/data_out/segments/ +... +``` + +Inside there you'll find the tpch_lineitem_OFFLINE directory with 7 separate +segments, 0 through 6. Tar/GZip the whole directory and this will be your +optimal_startree_small_yearly temp segment that the benchmark requires. However, +wait first for the segment creation to finish. + +Try few queries to ensure that the segments are working. You can find some +sample queries under the benchmark directory ```src/main/resources/pinot_queries```. +Watch the console output from the Apache Pinot cluster as you run the queries, and make sure +there are no complaints in there that the queries were slow since index wasn't found. +If you see a message saying the query was slow, it means that the indexes weren't +created properly. With the optimal star tree index your total query time should be +few milliseconds at most. + +You can now shutdown the Apache Pinot cluster which you started manually and when you +launch the benchmark server cluster it will pick up your new segments. + --------------------------------------------------------------------- To unsubscribe, e-mail: commits-unsubscr...@pinot.apache.org For additional commands, e-mail: commits-h...@pinot.apache.org