gortiz commented on PR #8818:
URL: https://github.com/apache/pinot/pull/8818#issuecomment-1152506922

   My last commit includes a JMH benchmark. I found several interesting things. 
Please take a look, some of my conclusions are so shocking that it wouldn't 
surprise me if another pair of eyes discovered that I did something wrong.
   
   TL;DR: I think we should apply this optimization only when executed in 
columns without indexes. This may make the implementation more difficult, but 
the performance gain may be very large.
   
   Benchmarks are:
   - optimal1Like: `SELECT INT_COL FROM MyTable WHERE DOMAIN_NAMES LIKE 
'domain0'`
   - optimal1Regex: `SELECT INT_COL FROM MyTable WHERE 
regexp_like(DOMAIN_NAMES, 'domain0')`
   - optimal10: `SELECT INT_COL FROM MyTable WHERE regexp_like(DOMAIN_NAMES, 
'domain\d')`
   - increasing10Like: `SELECT INT_COL FROM MyTable WHERE DOMAIN_NAMES LIKE 
'%domain0%' or DOMAIN_NAMES LIKE '%domain1%' or ... or DOMAIN_NAMES LIKE 
'%domain9%'`
   - increasing10Regex: `SELECT INT_COL FROM MyTable WHERE 
regexp_like(DOMAIN_NAMES LIKE, '^.domain0.*$') or ... or 
regexp_like(DOMAIN_NAMES LIKE, '^.domain9.*$')`
   - increasing10Fusing: `SELECT INT_COL FROM MyTable WHERE 
regexp_like(DOMAIN_NAMES, '(?:^.*domain0.*$)|...|(?:^.*domain9.*$)')`
   - decreasing9Like, decreasing9Regex and decreasing9Fusing: Are like 
`increasing10Like`, `increasing10Regex` and `increasing10Fusing` but going from 
9 to 1
      - Note: they have in fact 9 elements. This a mistake. Initially queries 
where created dynamically and I did it wrong. But it was a lucky mistake that 
shows that Lucene fails when there are 10 fused expression but it doesn't fail 
when there are 9.
   
   Last results I´ve got were:
   ```
   Benchmark                               (_fstType)  Mode  Cnt     Score      
Error  Units
   BenchmarkFuseRegexp.decreasing9Fusing       LUCENE  avgt    5    24.198 ±   
13.439  ms/op
   BenchmarkFuseRegexp.decreasing9Fusing       NATIVE  avgt    5     0.263 ±    
0.019  ms/op
   BenchmarkFuseRegexp.decreasing9Fusing         null  avgt    5     0.476 ±    
0.040  ms/op
   BenchmarkFuseRegexp.decreasing9Like         LUCENE  avgt    5     0.615 ±    
0.056  ms/op
   BenchmarkFuseRegexp.decreasing9Like         NATIVE  avgt    5     0.515 ±    
0.111  ms/op
   BenchmarkFuseRegexp.decreasing9Like           null  avgt    5  3596.864 ± 
1335.035  ms/op
   BenchmarkFuseRegexp.decreasing9Regex        LUCENE  avgt    5     0.593 ±    
0.362  ms/op
   BenchmarkFuseRegexp.decreasing9Regex        NATIVE  avgt    5     0.422 ±    
0.222  ms/op
   BenchmarkFuseRegexp.decreasing9Regex          null  avgt    5  1053.673 ±  
379.729  ms/op
   BenchmarkFuseRegexp.increasing10Fusing      LUCENE  avgt    5     ERROR
   BenchmarkFuseRegexp.increasing10Fusing      NATIVE  avgt    5     0.326 ±    
0.138  ms/op
   BenchmarkFuseRegexp.increasing10Fusing        null  avgt    5     0.226 ±    
0.062  ms/op
   BenchmarkFuseRegexp.increasing10Like        LUCENE  avgt    5     0.696 ±    
0.121  ms/op
   BenchmarkFuseRegexp.increasing10Like        NATIVE  avgt    5     0.540 ±    
0.049  ms/op
   BenchmarkFuseRegexp.increasing10Like          null  avgt    5  3543.339 ±  
183.168  ms/op
   BenchmarkFuseRegexp.increasing10Regex       LUCENE  avgt    5     0.590 ±    
0.208  ms/op
   BenchmarkFuseRegexp.increasing10Regex       NATIVE  avgt    5     0.435 ±    
0.159  ms/op
   BenchmarkFuseRegexp.increasing10Regex         null  avgt    5  1096.238 ±  
289.928  ms/op
   BenchmarkFuseRegexp.optimal10               LUCENE  avgt    5     0.169 ±    
0.066  ms/op
   BenchmarkFuseRegexp.optimal10               NATIVE  avgt    5     0.136 ±    
0.011  ms/op
   BenchmarkFuseRegexp.optimal10                 null  avgt    5     0.149 ±    
0.082  ms/op
   BenchmarkFuseRegexp.optimal1Like            LUCENE  avgt    5     0.179 ±    
0.066  ms/op
   BenchmarkFuseRegexp.optimal1Like            NATIVE  avgt    5     0.134 ±    
0.029  ms/op
   BenchmarkFuseRegexp.optimal1Like              null  avgt    5   470.387 ±  
187.235  ms/op
   BenchmarkFuseRegexp.optimal1Regex           LUCENE  avgt    5     0.172 ±    
0.029  ms/op
   BenchmarkFuseRegexp.optimal1Regex           NATIVE  avgt    5     0.141 ±    
0.027  ms/op
   BenchmarkFuseRegexp.optimal1Regex             null  avgt    5   243.160 ±   
59.722  ms/op
   ```
   
   After analyzing it, there are some points that are quite clear:
   
   ## We cannot use this optimization with Lucene
   
   The cost is too high when the expression gets complex. It even with `
   java.lang.RuntimeException: 
org.apache.lucene.util.automaton.TooComplexToDeterminizeException: 
Determinizing 
.*(?:^.*domain0.*$)|(?:^.*domain1.*$)|(?:^.*domain2.*$)|(?:^.*domain3.*$)|(?:^.*domain4.*$)|(?:^.*domain5.*$)|(?:^.*domain6.*$)|(?:^.*domain7.*$)|(?:^.*domain8.*$)|(?:^.*domain9.*$).*
 would result in more than 10000 states.` when the expression have 10 branches.
   
   ## In general `COL like 'whatever'` expression is 2x-3x worse than 
`regexp_like(COL, 'similar expression')`
   
   Reading the code it looks to me that Pinot automatically translate one to 
the other, but it that transformation is not applied when 
`BaseQueriesTest.getBrokerResponse` is called (which is great, because we can 
test the difference).
   
   I've also tried the controller UI interface and this issue is repeated 
there. Using the github_events dataset, 
   ```
   select * from githubEvents where regexp_like(actor_url, '^.*victor\d.*$') 
limit 1
   ```
   is quite faster than
   ```
   select * from githubEvents where actor_url LIKE '%victor7%' limit 1
   ```
   
   ## Pinot regex execute faster wildcards than a single values
   
   It is quite clear in the optimal cases:
   ```
   -- this is just a `where regexp_like(DOMAIN_NAMES, 'domain\d')`
   BenchmarkFuseRegexp.optimal10                 null  avgt    5     0.149 ±    
0.082  ms/op
   -- this is just a `where regexp_like(DOMAIN_NAMES, 'domain0')`
   BenchmarkFuseRegexp.optimal1Regex             null  avgt    5   243.160 ±   
59.722  ms/op
   ```
   
   At the beginning I thought it was an error in my code, but it is easy to try 
that using the github_events dataset.
   ```
   select * from githubEvents where regexp_like(actor_url, 'victor\d') limit 1
   ```
   Responds quite faster than
    
   ```
   select * from githubEvents where regexp_like(actor_url, 'victor7') limit 1
   ```
   
   The difference is not the 2.000x that the benchmark shows, but is 
consistently between 2-3x.
   
   I'm guessing that the issue actually comes from Java regex engine, but it is 
just a hunch, I didn't actually test it.
   
   ## Native or Lucene indexes don't actually increase the performance of this 
regexp
   
   This benchmark wasn't focused on that and more studies should be done, but 
it seems clear that:
   
   ### When there are several regexp in the actual expression, indexes are very 
useful.
   
   Examples:
   ```
   BenchmarkFuseRegexp.decreasing9Regex        LUCENE  avgt    5     0.593 ±    
0.362  ms/op
   BenchmarkFuseRegexp.decreasing9Regex        NATIVE  avgt    5     0.422 ±    
0.222  ms/op
   BenchmarkFuseRegexp.decreasing9Regex          null  avgt    5  1053.673 ±  
379.729  ms/op
   
   BenchmarkFuseRegexp.increasing10Regex       LUCENE  avgt    5     0.590 ±    
0.208  ms/op
   BenchmarkFuseRegexp.increasing10Regex       NATIVE  avgt    5     0.435 ±    
0.159  ms/op
   BenchmarkFuseRegexp.increasing10Regex         null  avgt    5  1096.238 ±  
289.928  ms/op
   ```
   
   - When there is only one regexp in the actual expressions, indexes are not 
very useful.
   For example:
   ```
   -- this actually has a single, quite complex regex expression
   BenchmarkFuseRegexp.decreasing9Fusing       LUCENE  avgt    5    24.198 ±   
13.439  ms/op
   BenchmarkFuseRegexp.decreasing9Fusing       NATIVE  avgt    5     0.263 ±    
0.019  ms/op
   BenchmarkFuseRegexp.decreasing9Fusing         null  avgt    5     0.476 ±    
0.040  ms/op
   -- this actually has a single, quite complex regex expression
   BenchmarkFuseRegexp.increasing10Fusing      LUCENE  avgt    5     ERROR
   BenchmarkFuseRegexp.increasing10Fusing      NATIVE  avgt    5     0.326 ±    
0.138  ms/op
   BenchmarkFuseRegexp.increasing10Fusing        null  avgt    5     0.226 ±    
0.062  ms/op
   -- this is just a `where regexp_like(DOMAIN_NAMES, 'domain\d')`
   BenchmarkFuseRegexp.optimal10               LUCENE  avgt    5     0.169 ±    
0.066  ms/op
   BenchmarkFuseRegexp.optimal10               NATIVE  avgt    5     0.136 ±    
0.011  ms/op
   BenchmarkFuseRegexp.optimal10                 null  avgt    5     0.149 ±    
0.082  ms/op
   -- this is just a `where regexp_like(DOMAIN_NAMES, 'domain0')`
   BenchmarkFuseRegexp.optimal1Regex           LUCENE  avgt    5     0.172 ±    
0.029  ms/op
   BenchmarkFuseRegexp.optimal1Regex           NATIVE  avgt    5     0.141 ±    
0.027  ms/op
   BenchmarkFuseRegexp.optimal1Regex             null  avgt    5   243.160 ±   
59.722  ms/op
   ```
   
   The last one shows a big difference, but it may be related to the issue with 
regex without wildcards that I discussed above.
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@pinot.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscr...@pinot.apache.org
For additional commands, e-mail: commits-h...@pinot.apache.org

Reply via email to