npawar opened a new pull request #5934: URL: https://github.com/apache/incubator-pinot/pull/5934
## Description

A Segment Processing Framework to convert "m" segments to "n" segments.

The phases of the Segment Processor are:
1. **Map**
   - RecordTransformation (using transform functions)
   - Partitioning (column value based, transform function based, or based on the table config's partition config)
   - PartitionFiltering (using a filter function)
2. **Reduce**
   - Rollup/concat records
   - Split into parts
3. **Segment generation**

A SegmentProcessorFrameworkCommand is provided to run this on demand:
```
bin/pinot-admin.sh SegmentProcessorFramework -segmentProcessorFrameworkSpec /<path>/spec.json
```
where spec.json is
```
{
  "inputSegmentsDir": "/<base_dir>/segmentsDir",
  "outputSegmentsDir": "/<base_dir>/outputDir/",
  "schemaFile": "/<base_dir>/schema.json",
  "tableConfigFile": "/<base_dir>/table.json",
  "recordTransformerConfig": {
    "transformFunctionsMap": {
      "epochMillis": "round(epochMillis, 86400000)" // round to nearest day
    }
  },
  "partitioningConfig": {
    "partitionerType": "COLUMN_VALUE", // partition on epochMillis
    "columnName": "epochMillis"
  },
  "collectorConfig": {
    "collectorType": "ROLLUP", // rollup clicks by summing
    "aggregatorTypeMap": {
      "clicks": "SUM"
    }
  },
  "segmentConfig": {
    "maxNumRecordsPerSegment": 200000
  }
}
```
Notes:
1. Currently this framework does no parallelism in the map/reduce/segment-creation jobs: each input file is processed sequentially in the map stage, each part is processed sequentially in the reduce stage, and segments are built one after another. We can change this in the future if the need arises.
2. The framework assumes there is enough memory to hold all records of a partition in memory during rollup in the reducer. As a safety measure, a limit of 5M records has been set on the reducer as the number of records to collect before forcing a flush. In the future we could consider off-heap processing if memory becomes a problem.
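To illustrate the two key operations described above, here is a minimal sketch in plain Java of a map-phase transform (rounding an epoch-millis value to a day bucket) and the reduce-phase ROLLUP collector (summing `clicks` across records whose other columns match). All class and method names here are invented for illustration; this is not the actual framework API, and the real `round` transform function's semantics may differ.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class RollupSketch {

  /**
   * Map-phase transform sketch (assumed semantics): round an epoch-millis
   * value down to the nearest multiple of bucketMillis, e.g. 86400000 for a day.
   */
  static long round(long epochMillis, long bucketMillis) {
    return (epochMillis / bucketMillis) * bucketMillis;
  }

  /**
   * Reduce-phase ROLLUP sketch: merge records that share identical values in
   * all columns except the metric column "clicks", summing "clicks".
   * A record is modeled as a column-name -> value map.
   */
  static List<Map<String, Object>> rollup(List<Map<String, Object>> records) {
    // Key: the dimension columns (everything except "clicks") -> summed clicks.
    Map<Map<String, Object>, Long> aggregated = new LinkedHashMap<>();
    for (Map<String, Object> record : records) {
      Map<String, Object> key = new LinkedHashMap<>(record);
      long clicks = ((Number) key.remove("clicks")).longValue();
      aggregated.merge(key, clicks, Long::sum);
    }
    // Re-attach the aggregated metric to each unique dimension key.
    List<Map<String, Object>> out = new ArrayList<>();
    aggregated.forEach((key, clicks) -> {
      Map<String, Object> merged = new LinkedHashMap<>(key);
      merged.put("clicks", clicks);
      out.add(merged);
    });
    return out;
  }
}
```

With the spec above, two records that land in the same day bucket after the transform would be collapsed into one, with their `clicks` values summed.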
This framework will typically be used by minion tasks that need to perform some processing on segments (e.g. a task that merges segments, or a task that aligns segments to time boundaries). The existing segment merge jobs can be changed to use this framework.

**Pending**
- An end-to-end test for the framework (WIP)