JohnTortugo opened a new pull request #6287: URL: https://github.com/apache/incubator-pinot/pull/6287
I'm from a Microsoft team working with LinkedIn engineers to improve Pinot performance. In an earlier conversation, @mayankshriv mentioned that segment creation is slow and asked whether we could look into improving it. CPU profiling of the segment creation code, using synthetic data, produced the following flame graph:

[Full resolution image](https://cesarshare.blob.core.windows.net/pinot-investigation/SegmentCreation-Flames.png)

Segment creation consists of two roughly equal parts: init and build. The two methods are very similar: each has a main loop that iterates over the rows of the input data and applies some transformation to each row.

This PR introduces two main changes:

1) A new pinot-perf benchmark for measuring segment creation performance.
2) The parallelization of the init->gatherStats and build methods mentioned above. The main loop of each method was parallelized using a technique called Decoupled Software Pipelining ([DSWP](https://liberty.princeton.edu/Publications/micro05_dswp.pdf)), with a [Disruptor RingBuffer](https://github.com/LMAX-Exchange/disruptor) used for communication between threads.

## Benchmark results:

### Original code

```
# Run progress: 0.00% complete, ETA 00:07:00
# Fork: 1 of 2
# Warmup Iteration   1: 31097.686 ms/op
# Warmup Iteration   2: 26007.428 ms/op
# Warmup Iteration   3: 26816.007 ms/op
Iteration   1: 25951.170 ms/op
Iteration   2: 26076.096 ms/op
Iteration   3: 26045.939 ms/op

# Run progress: 50.00% complete, ETA 00:05:19
# Fork: 2 of 2
# Warmup Iteration   1: 31711.546 ms/op
# Warmup Iteration   2: 26587.875 ms/op
# Warmup Iteration   3: 27360.283 ms/op
Iteration   1: 26208.574 ms/op
Iteration   2: 26316.409 ms/op
Iteration   3: 26194.492 ms/op


Result "org.apache.pinot.perf.BenchmarkSegmentCreation.segmentCreationFromCSV":
  26132.113 ±(99.9%) 369.912 ms/op [Average]
  (min, avg, max) = (25951.170, 26132.113, 26316.409), stdev = 131.914
  CI (99.9%): [25762.202, 26502.025] (assumes normal distribution)


# Run complete. Total time: 00:10:42

Benchmark                                        Mode  Cnt      Score     Error  Units
BenchmarkSegmentCreation.segmentCreationFromCSV  avgt    6  26132.113 ± 369.912  ms/op
```

### New code

```
# Run progress: 0.00% complete, ETA 00:03:20
# Fork: 1 of 2
# Warmup Iteration   1: 23004.364 ms/op
# Warmup Iteration   2: 19380.296 ms/op
# Warmup Iteration   3: 20914.349 ms/op
# Warmup Iteration   4: 19469.886 ms/op
# Warmup Iteration   5: 19461.024 ms/op
Iteration   1: 19523.648 ms/op
Iteration   2: 19582.673 ms/op
Iteration   3: 19409.540 ms/op
Iteration   4: 19419.701 ms/op
Iteration   5: 19386.130 ms/op

# Run progress: 50.00% complete, ETA 00:03:20
# Fork: 2 of 2
# Warmup Iteration   1: 23344.723 ms/op
# Warmup Iteration   2: 19335.702 ms/op
# Warmup Iteration   3: 20535.619 ms/op
# Warmup Iteration   4: 19512.260 ms/op
# Warmup Iteration   5: 19461.238 ms/op
Iteration   1: 19510.350 ms/op
Iteration   2: 19453.281 ms/op
Iteration   3: 19444.863 ms/op
Iteration   4: 19399.972 ms/op
Iteration   5: 19380.142 ms/op


Result "org.apache.pinot.perf.BenchmarkSegmentCreation.segmentCreationFromCSV":
  19451.030 ±(99.9%) 101.684 ms/op [Average]
  (min, avg, max) = (19380.142, 19451.030, 19582.673), stdev = 67.258
  CI (99.9%): [19349.346, 19552.714] (assumes normal distribution)


# Run complete. Total time: 00:06:41

Benchmark                                        Mode  Cnt      Score     Error  Units
BenchmarkSegmentCreation.segmentCreationFromCSV  avgt   10  19451.030 ± 101.684  ms/op
```
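
As a rough illustration of the DSWP-style pipelining described in the PR description, the sketch below splits a row-processing loop into two stages that run on separate threads and hand rows over through a bounded queue. It is not the code in this PR: the class name `PipelinedLoopSketch`, the `transform` step, and the queue capacity are all hypothetical, and a `java.util.concurrent.ArrayBlockingQueue` stands in for the Disruptor RingBuffer so that the example has no external dependencies.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

// Hypothetical illustration only; none of these names come from the Pinot code base.
class PipelinedLoopSketch {
  // Unique sentinel object telling the consumer stage that the input is exhausted.
  private static final String END_OF_INPUT = new String("<end-of-input>");

  // Stage 2 work: a stand-in for the per-row transformation.
  static String transform(String row) {
    return row.toUpperCase(Locale.ROOT);
  }

  // Runs the loop as a two-stage pipeline and returns the transformed rows in input order.
  static List<String> run(List<String> rows) {
    // Bounded queue between the stages (assumed capacity; a stand-in for a ring buffer).
    BlockingQueue<String> queue = new ArrayBlockingQueue<>(1024);
    List<String> out = new ArrayList<>();

    // Stage 2: takes rows off the queue and transforms them on its own thread.
    Thread consumer = new Thread(() -> {
      try {
        while (true) {
          String row = queue.take();
          if (row == END_OF_INPUT) {  // identity comparison on purpose
            break;
          }
          out.add(transform(row));
        }
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
      }
    });
    consumer.start();

    try {
      // Stage 1: the calling thread reads rows and publishes them to the queue.
      for (String row : rows) {
        queue.put(row);
      }
      queue.put(END_OF_INPUT);
      // join() makes the consumer's writes to `out` visible to this thread.
      consumer.join();
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      throw new IllegalStateException("pipeline interrupted", e);
    }
    return out;
  }

  public static void main(String[] args) {
    System.out.println(run(Arrays.asList("alpha", "beta")));  // prints [ALPHA, BETA]
  }
}
```

With a single consumer and a FIFO queue, the output order matches the input order. A ring buffer plays the same hand-off role as the queue here, typically with less locking and allocation on the hot path.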