[GitHub] [lucene] iverase commented on pull request #709: LUCENE-10311: remove complex cost estimation and abstraction leakage around it

2022-02-27 Thread GitBox


iverase commented on pull request #709:
URL: https://github.com/apache/lucene/pull/709#issuecomment-1053335430


   +1 That would hide the implementation details from users.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] wjp719 edited a comment on pull request #687: LUCENE-10425:speed up IndexSortSortedNumericDocValuesRangeQuery#BoundedDocSetIdIterator construction using bkd binary search

2022-02-27 Thread GitBox


wjp719 edited a comment on pull request #687:
URL: https://github.com/apache/lucene/pull/687#issuecomment-1053190677


   > This looks very similar to the implementation of `Weight#count` on `PointRangeQuery` and should only perform marginally faster? It's unclear to me whether this PR buys us much.
   
   Hi @jpountz, I refactored the code. Now, when the conditions are met, I use BKD binary search to find the min/max docId and build the **IndexSortSortedNumericDocValuesRangeQuery.BoundedDocSetIdIterator** when creating the **Scorer**, instead of binary-searching over doc values to find the min/max docId.
   
   As we know, doc values can only advance forward, but binary search may need to walk back to read the value of the middle doc. Every backward probe of a doc-values binary search therefore has to create a new **SortedNumericDocValues** instance and advance it from the first doc again, which consumes more CPU and IO.
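   To make that cost concrete, here is a toy sketch (hypothetical classes, not Lucene's actual API) of a binary search over a forward-only values iterator: roughly half the probes of a binary search move backwards, and each backward probe forces a fresh iterator.

   ```java
   // Toy model (NOT Lucene's real API) of why a binary search over a
   // forward-only doc-values iterator is expensive: any probe that lands
   // before the iterator's current position must throw the iterator away
   // and re-advance a fresh one from the first doc.
   final class ForwardOnlyValues {
       private final long[] values; // value per docId; index == docId
       private int doc = -1;        // current position, may only increase
       ForwardOnlyValues(long[] values) { this.values = values; }
       boolean canAdvanceTo(int target) { return target >= doc; }
       long advanceExact(int target) { // forward-only, like SortedNumericDocValues
           if (target < doc) throw new IllegalStateException("cannot rewind");
           doc = target;
           return values[doc];
       }
   }

   public final class DocValuesBinarySearch {
       /** Returns the first docId whose value is >= bound, counting how many
        *  fresh iterators the backward probes force us to create. */
       static int firstDocAtLeast(long[] values, long bound, int[] iteratorsCreated) {
           ForwardOnlyValues it = new ForwardOnlyValues(values);
           iteratorsCreated[0] = 1;
           int lo = 0, hi = values.length - 1, result = values.length;
           while (lo <= hi) {
               int mid = (lo + hi) >>> 1;
               if (!it.canAdvanceTo(mid)) { // probe moved backwards: recreate
                   it = new ForwardOnlyValues(values);
                   iteratorsCreated[0]++;
               }
               if (it.advanceExact(mid) >= bound) { result = mid; hi = mid - 1; }
               else lo = mid + 1;
           }
           return result;
       }

       public static void main(String[] args) {
           long[] sorted = new long[1000];
           for (int i = 0; i < sorted.length; i++) sorted[i] = 2L * i;
           int[] created = new int[1];
           int first = firstDocAtLeast(sorted, 999, created);
           System.out.println("first doc: " + first + ", iterators created: " + created[0]);
       }
   }
   ```

   In real Lucene each recreation also re-advances from the first doc, so the toy understates the IO cost; the BKD index avoids all of this because its leaves can be probed in any order.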
   
   I also added a flag **allDocExist** to **BoundedDocSetIdIterator** that marks whether every doc between the min and max doc exists; when it is set, the **BoundedDocSetIdIterator#advance()** method can skip calling **delegate.advance()** to check whether the target doc exists.
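   A hedged sketch of that shortcut (the names follow the PR description, but this is a hypothetical stand-in, not Lucene's exact code): the iterator is bounded to [firstDoc, lastDoc), and when allDocExist is true, advance() returns the clamped target directly instead of consulting the delegate.

   ```java
   // Sketch of the allDocExist shortcut in a bounded docId iterator.
   final class BoundedDocSetIdIterator {
       static final int NO_MORE_DOCS = Integer.MAX_VALUE;
       interface Delegate { int advance(int target); } // stand-in for DocIdSetIterator

       private final int firstDoc, lastDoc; // bounds found via BKD binary search
       private final boolean allDocExist;   // true if every doc in the range exists
       private final Delegate delegate;
       private int doc = -1;

       BoundedDocSetIdIterator(int firstDoc, int lastDoc,
                               boolean allDocExist, Delegate delegate) {
           this.firstDoc = firstDoc;
           this.lastDoc = lastDoc;
           this.allDocExist = allDocExist;
           this.delegate = delegate;
       }

       int advance(int target) {
           if (target < firstDoc) target = firstDoc;       // clamp into range
           if (target >= lastDoc) return doc = NO_MORE_DOCS;
           if (allDocExist) return doc = target;           // skip delegate.advance()
           int d = delegate.advance(target);               // confirm doc exists
           return doc = (d < lastDoc) ? d : NO_MORE_DOCS;
       }

       int docID() { return doc; }
   }
   ```

   When allDocExist holds, the delegate is never touched, so the per-advance cost drops to a bounds check.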
   
   ### benchmark result
   I also compared this PR's performance against the main branch.
   #### dataset
   I used two datasets: the small one is [httpLog](https://github.com/elastic/rally-tracks/tree/master/http_logs) with about 200 million docs; the big one is our application log with 1.4 billion docs.
   #### query
   The query is a boolean query with a range clause and a term clause. For the small dataset it is:
   ```
   "query": {
     "bool": {
       "must": [
         {
           "range": {
             "@timestamp": {
               "gte": "1998-06-08T05:00:01Z",
               "lt": "1998-06-15T00:00:00Z"
             }
           }
         },
         {
           "match": {
             "status": "200"
           }
         }
       ]
     }
   }
   ```
   #### result
   1. With the ES Rally tool (it runs many times, so the disk data is cached). Rally comparison on the httpLog small dataset:
   ```
   | Metric                   | Task          | Baseline | Contender | Diff     | Unit  |
   | Min Throughput           | 200s-in-range |  9.92683 |   10.0551 |  0.12825 | ops/s |
   | Mean Throughput          | 200s-in-range |  9.94556 |   10.0642 |  0.11868 | ops/s |
   | Median Throughput        | 200s-in-range |  9.94556 |   10.0633 |   0.1177 | ops/s |
   | Max Throughput           | 200s-in-range |  9.96398 |   10.0737 |  0.10974 | ops/s |
   | 50th percentile latency  | 200s-in-range |  38664.7 |   38022.7 | -641.967 | ms    |
   | 90th percentile latency  | 200s-in-range |  41349.8 |     40704 | -645.858 | ms    |
   | 99th percentile latency  | 200s-in-range |  41954.2 |   41308.7 | -645.491 | ms    |
   | 100th percentile latency | 200s-in-range |  42021.6 |   41377.6 | -643.989 | ms    |
   ```
   2. Manually run once (all caches cleared):
   ```
   | dataset             | main branch latency | this PR latency | latency improvement |
   | httpLog             |               267ms |           167ms |                -38% |
   | our application log |              2829ms |          1093ms |                -62% |
   ```
   




[jira] [Created] (LUCENE-10445) Reproducible assertion failure in o.a.l.luke.models.documents.TestDocumentsImpl.testNextTermDoc

2022-02-27 Thread Tomoko Uchida (Jira)
Tomoko Uchida created LUCENE-10445:
--

 Summary: Reproducible assertion failure in 
o.a.l.luke.models.documents.TestDocumentsImpl.testNextTermDoc
 Key: LUCENE-10445
 URL: https://issues.apache.org/jira/browse/LUCENE-10445
 Project: Lucene - Core
  Issue Type: Bug
  Components: modules/luke
Reporter: Tomoko Uchida


Command
{code}
gradlew :lucene:luke:test --tests "org.apache.lucene.luke.models.documents.TestDocumentsImpl.testNextTermDoc" -Ptests.seed=64C33B756A050564 -Ptests.multiplier=2 -Ptests.nightly=true
{code}

Stacktrace
{code}
org.apache.lucene.luke.models.documents.TestDocumentsImpl > testNextTermDoc FAILED
java.lang.AssertionError: expected:<4> but was:<2>
    at __randomizedtesting.SeedInfo.seed([64C33B756A050564:6EFB0CF0167683A3]:0)
    at junit@4.13.1/org.junit.Assert.fail(Assert.java:89)
    at junit@4.13.1/org.junit.Assert.failNotEquals(Assert.java:835)
    at junit@4.13.1/org.junit.Assert.assertEquals(Assert.java:647)
    at junit@4.13.1/org.junit.Assert.assertEquals(Assert.java:633)
    at org.apache.lucene.luke.models.documents.TestDocumentsImpl.testNextTermDoc(TestDocumentsImpl.java:207)

{code}

 



--
This message was sent by Atlassian Jira
(v8.20.1#820001)




[jira] [Commented] (LUCENE-10427) OLAP likewise rollup during segment merge process

2022-02-27 Thread Suhan Mao (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10427?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17498591#comment-17498591
 ] 

Suhan Mao commented on LUCENE-10427:


[~jpountz] Thanks for your reply!

As far as I know, the current rollup implementation in ES periodically runs a composite aggregation and inserts the aggregated result into another index.

*But this approach has several disadvantages:*
 # It still needs to store the detailed data, which the user may not need if they only run aggregate queries. ClickHouse's AggregatingMergeTree and Druid only store aggregated data. But as you mentioned, we could create a sidecar index beside the raw-data index; I think that is acceptable as a first step.
 # Composite aggregation can cause OOM problems and may be very slow when the data volume is big. For example, if the query granularity is 1h, the composite aggregation extracts all the buckets within one hour from the raw data.
 # Cronjob-like scheduled queries cannot handle late-arriving data: if data belonging to a previous time interval arrives after the query has run, it is missing from the rollup index, which causes accuracy issues.
 # From a resource-consumption perspective, if we must merge segments anyway, why not do the rollup within the same IO round as the merge?
 # If the ES rollup granularity is one hour, the latest hour of data is not visible in the rollup index, because the hourly scheduled composite aggregate query has not run yet.

*To answer your questions*
 - Q: Different segments would have different granularities
 - A: Different segments within one index all share the same granularity which 
is an index level settings, this granularity is probably the minimum query 
granularity required by the user.
 - Q:merges would no longer combine segments but also perform lossy compression.
 - A:yes. doc count will be heavily reduced after merge and this is as expected 
because smaller data volume will speed up the query performance.
 - Q:all file formats would need to be aware of rollups?
 - A:Currently,I have implemented several formats of docvalues/BKD tree/FST ... 
It is the most commonly used in OLAP scenarios.
 - Q: numeric doc values would need to be able to store multiple fields under the hood (min, max, etc.)
 - A: docvalues will not need to store semantics under the hood; we can store that information in the index settings. All supported aggregate operators should satisfy the associative and commutative properties, for example max(a,b,c) = max(a, max(b,c)), sum(a,b,c) = sum(a, sum(b,c)), and hll_union(a,b,c) = hll_union(a, hll_union(b,c)) if the data type is binary. So the format of a given field in docvalues is always the same, and docvalues cannot tell whether a doc holds raw data or aggregated data.
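For example, merging two segments' partial aggregates with the same combine function yields the same result as aggregating the raw values directly, which is exactly what the associative and commutative properties guarantee. A minimal sketch (the class and method names are hypothetical, not Lucene APIs):

```java
import java.util.function.LongBinaryOperator;

public class RollupMergeSketch {
    // Combine functions for two rollup operators; the same function is used
    // both on raw values and on already-aggregated values.
    static long combineSum(long a, long b) { return a + b; }
    static long combineMax(long a, long b) { return Math.max(a, b); }

    // Fold an array of values with a combine function.
    static long fold(long[] values, LongBinaryOperator op) {
        long acc = values[0];
        for (int i = 1; i < values.length; i++) {
            acc = op.applyAsLong(acc, values[i]);
        }
        return acc;
    }

    public static void main(String[] args) {
        long[] segment1 = {3, 7};  // raw metric values in segment 1
        long[] segment2 = {5, 2};  // raw metric values in segment 2

        // Aggregating every raw value at once ...
        long rawSum = fold(new long[] {3, 7, 5, 2}, RollupMergeSketch::combineSum);
        // ... equals merging the two per-segment partial sums (associativity).
        long mergedSum = combineSum(fold(segment1, RollupMergeSketch::combineSum),
                                    fold(segment2, RollupMergeSketch::combineSum));
        System.out.println(rawSum + " == " + mergedSum);  // prints 17 == 17

        long mergedMax = combineMax(fold(segment1, RollupMergeSketch::combineMax),
                                    fold(segment2, RollupMergeSketch::combineMax));
        System.out.println(mergedMax);  // prints 7
    }
}
```

This is why a merge can freely combine raw segments with previously rolled-up segments: the combine function never needs to know which kind of value it is given.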

*How we can start from scratch*

I think we can start from a sidecar solution first. Assume that index A is the index storing raw data, and index A' is a sidecar index that is continuously rolled up.

Assume that the schema of index A is:

d0 time, d1 long, d2 keyword, m1 long, m2 long, m3 binary(hll),x1,x2,x3 ..

x1, x2 and x3 fields are not related to rollup; they are just additional normal fields.

d0 is the event time, d1 and d2 are all dimensions and m1, m2 and m3 are all 
metrics.

If we want to roll the data up to hourly granularity, we can create a rollup sidecar index A' which only contains the d0, d1, d2, m1, m2, m3 fields and does the rollup during the merge process. Users can submit queries to A or A' accordingly.
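The rollup from A to A' then amounts to grouping rows on (hour(d0), d1, d2) and folding the metric columns during merge. A minimal in-memory sketch of that grouping step (the class, record and method names are hypothetical; only the field names come from the schema above):

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class HourlyRollupSketch {
    // One raw row: event time in epoch millis, two dimensions, two metrics.
    record Row(long d0, long d1, String d2, long m1, long m2) {}

    // Aggregated state for one (hourBucket, d1, d2) group: sum(m1), max(m2).
    record Agg(long sumM1, long maxM2) {}

    static final long HOUR_MS = 3_600_000L;

    static Map<String, Agg> rollupHourly(List<Row> rows) {
        Map<String, Agg> groups = new LinkedHashMap<>();
        for (Row r : rows) {
            long hourBucket = r.d0() / HOUR_MS * HOUR_MS;  // truncate d0 to the hour
            String key = hourBucket + "|" + r.d1() + "|" + r.d2();
            // merge() folds the new row into the existing partial aggregate
            groups.merge(key, new Agg(r.m1(), r.m2()),
                (a, b) -> new Agg(a.sumM1() + b.sumM1(), Math.max(a.maxM2(), b.maxM2())));
        }
        return groups;
    }

    public static void main(String[] args) {
        // Two rows in the same hour and group collapse into one aggregated row;
        // the third row falls into the next hour, so it forms its own group.
        List<Row> rows = List.of(
            new Row(10 * HOUR_MS + 5, 1L, "us", 4, 9),
            new Row(10 * HOUR_MS + 99, 1L, "us", 6, 3),
            new Row(11 * HOUR_MS, 1L, "us", 1, 1));
        System.out.println(rollupHourly(rows).size());  // prints 2
    }
}
```

In the real thing this folding would happen inside the segment merge rather than in memory, but the grouping key and the per-metric combine functions are the same.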

What's more, we can create several rollup indices which is often called 
"materialized view" in OLAP scenarios.

For example, if we need another view that only stores d0, d1, m3 and whose rollup granularity is daily, we can create an additional sidecar index A''.

Users only need to write raw data once to index A, and all the rollup calculation happens inside Lucene. Users should submit queries to the appropriate level of index.

 

What do you think?

> OLAP likewise rollup during segment merge process
> -
>
> Key: LUCENE-10427
> URL: https://issues.apache.org/jira/browse/LUCENE-10427
> Project: Lucene - Core
>  Issue Type: New Feature
>Reporter: Suhan Mao
>Priority: Major
>
> Currently, many OLAP engines support rollup feature like 
> clickhouse(AggregateMergeTree)/druid. 
> Rollup definition: [https://athena.ecs.csus.edu/~mei/olap/OLAPoperations.php]
> One of the ways to do rollup is to merge buckets with the same dimensions into
> one and apply sum()/min()/max() operations on metric fields during the segment
> compact/merge process. This can significantly reduce the size of the data and
> speed up queries a lot.
>  
> *Abstraction of how to do*
>  # Define ro


[GitHub] [lucene-solr] kiranchitturi opened a new pull request #2644: SOLR-16009 Add custom udfs for filtering inside multi-valued fields

2022-02-27 Thread GitBox


kiranchitturi opened a new pull request #2644:
URL: https://github.com/apache/lucene-solr/pull/2644


   * Since Solr supports multi-valued fields, it would be crucial for Solr SQL 
to let users filter on multi-valued fields via SQL
   * Syntax like `WHERE mv_field = 'a' and mv_field = 'b'` would not make sense 
in standard SQL, but we can support this via custom udfs
   * Add new udfs `array_contains_all` and `array_contains_any` to filter 
inside multi-valued fields, as this provides a clean interface and avoids 
having to modify the Calcite rules
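The intended semantics of the two udfs can be sketched as plain predicates (the udf names come from this pull request; the Java code below is an illustrative sketch, not the actual Solr implementation):

```java
import java.util.List;
import java.util.Set;

public class MultiValuedFilterSketch {
    // array_contains_all matches when the multi-valued field holds every
    // required value; array_contains_any matches when it holds at least one.
    static boolean arrayContainsAll(List<String> fieldValues, List<String> required) {
        return Set.copyOf(fieldValues).containsAll(required);
    }

    static boolean arrayContainsAny(List<String> fieldValues, List<String> wanted) {
        Set<String> values = Set.copyOf(fieldValues);
        return wanted.stream().anyMatch(values::contains);
    }

    public static void main(String[] args) {
        List<String> mvField = List.of("a", "b", "c");
        System.out.println(arrayContainsAll(mvField, List.of("a", "b")));  // prints true
        System.out.println(arrayContainsAny(mvField, List.of("x", "c")));  // prints true
        System.out.println(arrayContainsAll(mvField, List.of("a", "x")));  // prints false
    }
}
```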


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] iverase commented on pull request #687: LUCENE-10425:speed up IndexSortSortedNumericDocValuesRangeQuery#BoundedDocSetIdIterator construction using bkd binary search

2022-02-27 Thread GitBox


iverase commented on pull request #687:
URL: https://github.com/apache/lucene/pull/687#issuecomment-1053981720


   I like this idea, but I agree with Adrien that the API change does not look 
right. Moreover, I don't think we need to add it to PointValues, as this is 
something specific to index sorting and it won't apply in all situations. I 
think it makes little sense when we use the kd-tree as an effective r-tree, 
like in range fields.
   
   I think it is possible to add a specific implementation on 
`IndexSortSortedNumericDocValuesRangeQuery` that does not require an API 
change. I came up with this implementation; something close to this should 
work:
   
   ```
   /**
    * Returns the first document whose packed value is greater than or equal to
    * the provided packed value, or -1 if all packed values are smaller than
    * the provided one.
    */
   public final int nextDoc(PointValues values, byte[] packedValue) throws IOException {
     final int numIndexDimensions = values.getNumIndexDimensions();
     final int bytesPerDim = values.getBytesPerDimension();
     final ByteArrayComparator comparator = ArrayUtil.getUnsignedComparator(bytesPerDim);
     // true when the target packed value is strictly greater than the tested
     // value in every index dimension
     final Predicate<byte[]> biggerThan = testPackedValue -> {
       for (int dim = 0; dim < numIndexDimensions; dim++) {
         final int offset = dim * bytesPerDim;
         if (comparator.compare(packedValue, offset, testPackedValue, offset) <= 0) {
           return false;
         }
       }
       return true;
     };
     return nextDoc(values.getPointTree(), biggerThan);
   }

   private int nextDoc(PointValues.PointTree pointTree, Predicate<byte[]> biggerThan) throws IOException {
     if (biggerThan.test(pointTree.getMaxPackedValue())) {
       // every value in this cell is smaller than the target, skip it
       return -1;
     } else if (pointTree.moveToChild()) {
       // navigate down
       do {
         final int doc = nextDoc(pointTree, biggerThan);
         if (doc != -1) {
           return doc;
         }
       } while (pointTree.moveToSibling());
       pointTree.moveToParent();
       return -1;
     } else {
       // doc is in this leaf
       final int[] doc = {-1};
       pointTree.visitDocValues(new IntersectVisitor() {
         @Override
         public void visit(int docID) {
           throw new AssertionError("Invalid call to visit(docID)");
         }

         @Override
         public void visit(int docID, byte[] packedValue) {
           if (doc[0] == -1 && biggerThan.test(packedValue) == false) {
             doc[0] = docID;
           }
         }

         @Override
         public Relation compare(byte[] minPackedValue, byte[] maxPackedValue) {
           return Relation.CELL_CROSSES_QUERY;
         }
       });
       assert doc[0] != -1;
       return doc[0];
     }
   }
   ```
   
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org