richardstartin opened a new pull request #8097: URL: https://github.com/apache/pinot/pull/8097
I noticed that the FST Like benchmark results vary a lot:

```
Benchmark                                                 Mode  Cnt  Score   Error  Units
BenchmarkNativeAndLuceneBasedLike.testLuceneBasedFSTLike  avgt   25  3.127 ± 0.578   s/op
BenchmarkNativeAndLuceneBasedLike.testNativeBasedFSTLike  avgt   25  3.440 ± 0.555   s/op
```

When running these benchmarks with `-prof jfr` to figure out what they were actually measuring, I noticed a couple of things:

1. Most of the time is spent parsing SQL and creating the query context.

   <img width="1530" alt="Screenshot 2022-01-31 at 17 54 48" src="https://user-images.githubusercontent.com/16439049/151846946-0b3aa09b-a0ba-43bb-af13-a120406e311e.png">

2. The benchmark neither segregates the FST types nor cleans up after itself properly (it uses a testng `AfterClass` annotation, which JMH doesn't know anything about), so `testNativeBasedFSTLike` sometimes measures the Lucene implementation. When it does measure the Native implementation, the SQL parsing frames are much narrower relative to the construction of the filter operator.
<img width="1546" alt="Screenshot 2022-01-31 at 17 58 54" src="https://user-images.githubusercontent.com/16439049/151847574-9073b823-d2ea-4f03-82c8-84e33aca94e2.png">

After a couple of changes to use proper JMH lifecycle and to factor out SQL parsing into the setup, most of the benchmark time is spent in construction of the filter operator:

Lucene:

<img width="1509" alt="Screenshot 2022-01-31 at 17 48 12" src="https://user-images.githubusercontent.com/16439049/151845914-9b5440f8-c643-43e1-b873-6fd2e031fcbd.png">

Native:

<img width="1540" alt="Screenshot 2022-01-31 at 17 48 48" src="https://user-images.githubusercontent.com/16439049/151846014-a8d5cd66-d61d-419d-8f5e-d0c72807dc4d.png">

The results are more stable and tell a different story, which should help drive future improvement in this space:

```
Benchmark                                (_fstType)  (_intBaseValue)  (_numRows)  (_query)                                                                 Mode  Cnt    Score    Error  Units
BenchmarkNativeAndLuceneBasedLike.query  LUCENE      1000             2500000     SELECT INT_COL, URL_COL FROM MyTable WHERE DOMAIN_NAMES LIKE '%domain%'  avgt   25   65.626 ±  1.454  us/op
BenchmarkNativeAndLuceneBasedLike.query  NATIVE      1000             2500000     SELECT INT_COL, URL_COL FROM MyTable WHERE DOMAIN_NAMES LIKE '%domain%'  avgt   25  232.908 ± 17.302  us/op
```

Future improvements may include iterating over more than one block.
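For readers who have not seen the diff, here is a rough plain-Java sketch of the lifecycle idea described above: prepare the query once in a setup step, keep the measured method down to the work under test, and release resources in an explicit teardown. This is not the code in this PR and it uses neither Pinot nor JMH. `LifecycleSketch`, `BenchmarkState`, `parseLikeTerm`, and the sample rows are all hypothetical names invented for illustration. In a real JMH benchmark the three methods would presumably carry `@Setup(Level.Trial)`, `@Benchmark`, and `@TearDown(Level.Trial)` on a `@State` class, with the FST type supplied through `@Param`.

```java
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

// Plain-Java sketch; every name here is hypothetical and none of it is Pinot or JMH code.
final class LifecycleSketch {

  // Counts how often the expensive preparation step runs.
  static final AtomicInteger PARSE_COUNT = new AtomicInteger();

  // Stand-in for SQL parsing: extracts the text between the '%' wildcards of a LIKE pattern.
  static String parseLikeTerm(String sql) {
    PARSE_COUNT.incrementAndGet();
    int start = sql.indexOf('%');
    int end = sql.lastIndexOf('%');
    if (start < 0 || end <= start) {
      throw new IllegalArgumentException("expected a '%term%' pattern in: " + sql);
    }
    return sql.substring(start + 1, end);
  }

  // Plays the role of a JMH @State class.
  static final class BenchmarkState implements AutoCloseable {
    String term;        // prepared once, reused by every invocation
    List<String> rows;  // stand-in for the segment under test
    boolean closed;

    // In JMH this would be a @Setup(Level.Trial) method.
    void setUp(String sql) {
      term = parseLikeTerm(sql);
      rows = List.of("www.domain1.com", "www.example.org", "www.domain2.net");
    }

    // In JMH this would be the @Benchmark method: only the work under test.
    long query() {
      return rows.stream().filter(r -> r.contains(term)).count();
    }

    // In JMH this would be a @TearDown(Level.Trial) method rather than a testng @AfterClass.
    @Override
    public void close() {
      rows = null;
      closed = true;
    }
  }

  public static void main(String[] args) {
    try (BenchmarkState state = new BenchmarkState()) {
      state.setUp("SELECT INT_COL, URL_COL FROM MyTable WHERE DOMAIN_NAMES LIKE '%domain%'");
      long matches = 0;
      for (int i = 0; i < 1_000; i++) {
        matches = state.query();
      }
      System.out.println("matches=" + matches + ", parses=" + PARSE_COUNT.get());
    }
  }
}
```

However many times `query()` runs, the parse counter stays at 1, which is the property the reworked benchmark relies on: parsing cost no longer leaks into the measured time, and cleanup happens through a hook the harness actually invokes.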