[GitHub] [lucene] rmuir commented on pull request #709: LUCENE-10311: remove complex cost estimation and abstraction leakage around it

2022-02-26 Thread GitBox


rmuir commented on pull request #709:
URL: https://github.com/apache/lucene/pull/709#issuecomment-1052127011


   if we add `grow(long)` that simply truncates and forwards, then it 
encapsulates this within this class. The code stays simple and the caller 
doesn't need to know about it.
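A minimal sketch of the idea (the class, field, and clamping strategy here are invented for illustration, not the actual patch):

```java
// Hypothetical sketch: a grow(long) overload that truncates and forwards to
// the existing grow(int), keeping the int limit encapsulated in the class.
public class GrowSketch {
  int capacity; // stand-in for the real growable state

  void grow(int count) {
    capacity = Math.max(capacity, count); // existing int-based path
  }

  void grow(long count) {
    // truncate: anything beyond Integer.MAX_VALUE is capped, so the caller
    // never needs to know about the int limit
    grow((int) Math.min(count, Integer.MAX_VALUE));
  }

  public static void main(String[] args) {
    GrowSketch g = new GrowSketch();
    g.grow(5_000_000_000L);         // larger than any int
    System.out.println(g.capacity); // prints 2147483647 (Integer.MAX_VALUE)
  }
}
```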


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Updated] (LUCENE-10430) Literal double quotes cause exception in class RegExp

2022-02-26 Thread Holger Rehn (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-10430?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Holger Rehn updated LUCENE-10430:
-
Description: 
Class org.apache.lucene.util.automaton.RegExp fails to parse valid regular 
expressions that contain double quotes (except in character classes). This, of 
course, affects corresponding RegexpQuery instances as well.

Example: 
{code:java}
Query  q = new RegexpQuery( new Term( "field", "a\"b" ) );
RegExp r = new RegExp( "a\"b" );{code}
Both fail with:
{code:java}
java.lang.IllegalArgumentException: expected '"' at position 3
    at org.apache.lucene.util.automaton.RegExp.parseSimpleExp(RegExp.java:1299)
    at 
org.apache.lucene.util.automaton.RegExp.parseCharClassExp(RegExp.java:1229)
    at org.apache.lucene.util.automaton.RegExp.parseComplExp(RegExp.java:1218)
    at org.apache.lucene.util.automaton.RegExp.parseRepeatExp(RegExp.java:1192)
    at org.apache.lucene.util.automaton.RegExp.parseConcatExp(RegExp.java:1185)
    at org.apache.lucene.util.automaton.RegExp.parseConcatExp(RegExp.java:1187)
    at org.apache.lucene.util.automaton.RegExp.parseInterExp(RegExp.java:1179)
    at org.apache.lucene.util.automaton.RegExp.parseUnionExp(RegExp.java:1173)
    at org.apache.lucene.util.automaton.RegExp.<init>(RegExp.java:496)
...{code}
As a workaround we currently replace all double quotes with a dot.

  was:
Class org.apache.lucene.util.automaton.RegExp fails to parse valid regular 
expressions that contain double quotes. This of course affects corresponding 
RegexpQuerys, as well.

Example: 
{code:java}
Query  q = new RegexpQuery( new Term( "field", "a\"b" ) );
RegExp r = new RegExp( "a\"b" );{code}
Both fail with:
{code:java}
java.lang.IllegalArgumentException: expected '"' at position 3
    at org.apache.lucene.util.automaton.RegExp.parseSimpleExp(RegExp.java:1299)
    at 
org.apache.lucene.util.automaton.RegExp.parseCharClassExp(RegExp.java:1229)
    at org.apache.lucene.util.automaton.RegExp.parseComplExp(RegExp.java:1218)
    at org.apache.lucene.util.automaton.RegExp.parseRepeatExp(RegExp.java:1192)
    at org.apache.lucene.util.automaton.RegExp.parseConcatExp(RegExp.java:1185)
    at org.apache.lucene.util.automaton.RegExp.parseConcatExp(RegExp.java:1187)
    at org.apache.lucene.util.automaton.RegExp.parseInterExp(RegExp.java:1179)
    at org.apache.lucene.util.automaton.RegExp.parseUnionExp(RegExp.java:1173)
    at org.apache.lucene.util.automaton.RegExp.<init>(RegExp.java:496)
...{code}
As a workaround we currently replace all double quotes with a dot.


> Literal double quotes cause exception in class RegExp
> -
>
> Key: LUCENE-10430
> URL: https://issues.apache.org/jira/browse/LUCENE-10430
> Project: Lucene - Core
>  Issue Type: Bug
>  Components: core/other
>Affects Versions: 9.0
>Reporter: Holger Rehn
>Priority: Major
>
> Class org.apache.lucene.util.automaton.RegExp fails to parse valid regular 
> expressions that contain double quotes (except in character classes). This of 
> course affects corresponding RegexpQuerys, as well.
> Example: 
> {code:java}
> Query  q = new RegexpQuery( new Term( "field", "a\"b" ) );
> RegExp r = new RegExp( "a\"b" );{code}
> Both fail with:
> {code:java}
> java.lang.IllegalArgumentException: expected '"' at position 3
>     at 
> org.apache.lucene.util.automaton.RegExp.parseSimpleExp(RegExp.java:1299)
>     at 
> org.apache.lucene.util.automaton.RegExp.parseCharClassExp(RegExp.java:1229)
>     at org.apache.lucene.util.automaton.RegExp.parseComplExp(RegExp.java:1218)
>     at 
> org.apache.lucene.util.automaton.RegExp.parseRepeatExp(RegExp.java:1192)
>     at 
> org.apache.lucene.util.automaton.RegExp.parseConcatExp(RegExp.java:1185)
>     at 
> org.apache.lucene.util.automaton.RegExp.parseConcatExp(RegExp.java:1187)
>     at org.apache.lucene.util.automaton.RegExp.parseInterExp(RegExp.java:1179)
>     at org.apache.lucene.util.automaton.RegExp.parseUnionExp(RegExp.java:1173)
>     at org.apache.lucene.util.automaton.RegExp.<init>(RegExp.java:496)
> ...{code}
> As a workaround we currently replace all double quotes with a dot.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Resolved] (LUCENE-10430) Literal double quotes cause exception in class RegExp

2022-02-26 Thread Robert Muir (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-10430?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robert Muir resolved LUCENE-10430.
--
Resolution: Not A Problem

You aren't escaping the quote. You're sending a Java string in which the quote 
is of length 1 ("). You need to create the string as "\\\"" in Java so that it 
is of length 2 (\").
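The distinction is at the Java string-literal level; a quick sketch showing the lengths involved (plain Java, no Lucene dependency, and the class name is invented for illustration):

```java
// Sketch: what RegExp actually receives for each Java string literal.
public class QuoteEscaping {
  public static void main(String[] args) {
    String unescaped = "a\"b";   // 3 chars: a " b  -> RegExp sees a bare quote
    String escaped   = "a\\\"b"; // 4 chars: a \ " b -> RegExp sees an escaped quote
    System.out.println(unescaped.length()); // prints 3
    System.out.println(escaped.length());   // prints 4
  }
}
```

Per the resolution above, passing `escaped` (not `unescaped`) to `new RegExp(...)` is what matches a literal double quote.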

> Literal double quotes cause exception in class RegExp
> -
>
> Key: LUCENE-10430
> URL: https://issues.apache.org/jira/browse/LUCENE-10430
> Project: Lucene - Core
>  Issue Type: Bug
>  Components: core/other
>Affects Versions: 9.0
>Reporter: Holger Rehn
>Priority: Major
>
> Class org.apache.lucene.util.automaton.RegExp fails to parse valid regular 
> expressions that contain double quotes (except in character classes). This of 
> course affects corresponding RegexpQuerys, as well.
> Example: 
> {code:java}
> Query  q = new RegexpQuery( new Term( "field", "a\"b" ) );
> RegExp r = new RegExp( "a\"b" );{code}
> Both fail with:
> {code:java}
> java.lang.IllegalArgumentException: expected '"' at position 3
>     at 
> org.apache.lucene.util.automaton.RegExp.parseSimpleExp(RegExp.java:1299)
>     at 
> org.apache.lucene.util.automaton.RegExp.parseCharClassExp(RegExp.java:1229)
>     at org.apache.lucene.util.automaton.RegExp.parseComplExp(RegExp.java:1218)
>     at 
> org.apache.lucene.util.automaton.RegExp.parseRepeatExp(RegExp.java:1192)
>     at 
> org.apache.lucene.util.automaton.RegExp.parseConcatExp(RegExp.java:1185)
>     at 
> org.apache.lucene.util.automaton.RegExp.parseConcatExp(RegExp.java:1187)
>     at org.apache.lucene.util.automaton.RegExp.parseInterExp(RegExp.java:1179)
>     at org.apache.lucene.util.automaton.RegExp.parseUnionExp(RegExp.java:1173)
>     at org.apache.lucene.util.automaton.RegExp.<init>(RegExp.java:496)
> ...{code}
> As a workaround we currently replace all double quotes with a dot.






[GitHub] [lucene] wjp719 commented on pull request #687: LUCENE-10425:speed up IndexSortSortedNumericDocValuesRangeQuery#count using bkd binary search

2022-02-26 Thread GitBox


wjp719 commented on pull request #687:
URL: https://github.com/apache/lucene/pull/687#issuecomment-1053190677


   > This looks very similar to the implementation of `Weight#count` on `PointRangeQuery` and should only perform marginally faster? It's unclear to me whether this PR buys us much.
   
   Hi @jpountz, I refactored the code. Now, when the conditions are met, I use BKD binary search to find the min/max docId and build the **IndexSortSortedNumericDocValuesRangeQuery.BoundedDocSetIdIterator** when creating the **Scorer**, instead of binary-searching over doc values for the min/max docId.
   
   As we know, doc values can only advance forward, but binary search may need to walk backward to read the value of the middle doc. Every probe of a doc-values binary search therefore has to create a new **SortedNumericDocValues** instance and advance from the first doc, which costs more CPU and IO.
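To illustrate that cost, a toy sketch (not Lucene code; the array-backed "iterator", counter, and search loop are invented for illustration):

```java
// Toy sketch: binary search over a forward-only source. Whenever the search
// needs an earlier doc than the current position, a real forward-only
// iterator must be re-opened and advanced from doc 0 again -- the CPU/IO
// cost described above. Here we just count how often that would happen.
public class ForwardOnlySearch {
  static int iteratorsCreated = 0;

  // stand-in for SortedNumericDocValues: values[doc], forward-only access
  static long valueAt(long[] values, int doc) {
    iteratorsCreated++; // a real impl would re-open doc values and advance()
    return values[doc];
  }

  // find the first doc whose value >= target (values are index-sorted)
  static int firstDocAtLeast(long[] values, long target) {
    int lo = 0, hi = values.length;
    while (lo < hi) {
      int mid = (lo + hi) >>> 1;
      if (valueAt(values, mid) < target) lo = mid + 1;
      else hi = mid;
    }
    return lo;
  }

  public static void main(String[] args) {
    long[] values = {1, 3, 3, 7, 9, 12, 15, 20};
    System.out.println(firstDocAtLeast(values, 7)); // prints 3
    System.out.println(iteratorsCreated);           // one fresh pass per probe
  }
}
```

A BKD-tree binary search avoids this because the tree is random-access, so no per-probe re-advancing is needed.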
   
   ### benchmark result
   I also compared this PR's performance with the main branch:
   #### dataset
   I used two datasets: the small one is [httpLog](https://github.com/elastic/rally-tracks/tree/master/http_logs), with about 200 million docs; the big one is our application log, with 1.4 billion docs.
   #### query
   The query is a boolean query with a range clause and a term clause; for the small dataset it is:
   ```
   "query": {
     "bool": {
       "must": [
         {
           "range": {
             "@timestamp": {
               "gte": "1998-06-08T05:00:01Z",
               "lt": "1998-06-15T00:00:00Z"
             }
           }
         },
         {
           "match": {
             "status": "200"
           }
         }
       ]
     }
   }
   ```
   #### result
   1. With the ES Rally tool (it runs many times, so the disk data is cached):
   ```
   |                   Metric |          Task | Baseline | Contender |     Diff |  Unit |
   |           Min Throughput | 200s-in-range |  9.92683 |   10.0551 |  0.12825 | ops/s |
   |          Mean Throughput | 200s-in-range |  9.94556 |   10.0642 |  0.11868 | ops/s |
   |        Median Throughput | 200s-in-range |  9.94556 |   10.0633 |   0.1177 | ops/s |
   |           Max Throughput | 200s-in-range |  9.96398 |   10.0737 |  0.10974 | ops/s |
   |  50th percentile latency | 200s-in-range |  38664.7 |   38022.7 | -641.967 |    ms |
   |  90th percentile latency | 200s-in-range |  41349.8 |     40704 | -645.858 |    ms |
   |  99th percentile latency | 200s-in-range |  41954.2 |   41308.7 | -645.491 |    ms |
   | 100th percentile latency | 200s-in-range |  42021.6 |   41377.6 | -643.989 |    ms |
   ```
   2. Manual single run (all caches cleared):
   ```
   | dataSet             | main branch latency | this PR latency | latency improvement |
   | httpLog             |               267ms |           167ms |                -38% |
   | our application log |              2829ms |          1093ms |                -62% |
   ```
   








[GitHub] [lucene] wjp719 edited a comment on pull request #687: LUCENE-10425:speed up IndexSortSortedNumericDocValuesRangeQuery#BoundedDocSetIdIterator construction using bkd binary search

2022-02-26 Thread GitBox


wjp719 edited a comment on pull request #687:
URL: https://github.com/apache/lucene/pull/687#issuecomment-1053190677


   > This looks very similar to the implementation of `Weight#count` on `PointRangeQuery` and should only perform marginally faster? It's unclear to me whether this PR buys us much.
   
   Hi @jpountz, I refactored the code. Now, when the conditions are met, I use BKD binary search to find the min/max docId and build the **IndexSortSortedNumericDocValuesRangeQuery.BoundedDocSetIdIterator** when creating the **Scorer**, instead of binary-searching over doc values for the min/max docId.
   
   As we know, doc values can only advance forward, but binary search may need to walk backward to read the value of the middle doc. Every probe of a doc-values binary search therefore has to create a new **SortedNumericDocValues** instance and advance from the first doc, which costs more CPU and IO.
   
   I also added a variable **allDocExist** in **BoundedDocSetIdIterator** to record whether every doc between the min and max doc exists; when it is true, the **BoundedDocSetIdIterator#advance()** method skips the **delegate.advance()** call that checks whether the doc exists.
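A sketch of that shortcut (the field and method names follow the comment above, but the surrounding class is simplified and the delegate is a stand-in, not the actual PR code):

```java
// Simplified sketch of BoundedDocSetIdIterator#advance() with the allDocExist
// shortcut: when every doc in [minDocId, maxDocId] is known to exist, there is
// no need to ask the delegate whether the target doc exists.
public class BoundedAdvanceSketch {
  static final int NO_MORE_DOCS = Integer.MAX_VALUE;

  int minDocId, maxDocId;
  boolean allDocExist;
  java.util.function.IntUnaryOperator delegateAdvance; // stand-in for delegate.advance()

  int advance(int target) {
    if (target < minDocId) target = minDocId;    // clamp into the bounded range
    if (target > maxDocId) return NO_MORE_DOCS;  // past the range: exhausted
    if (allDocExist) {
      return target;                             // shortcut: no existence check
    }
    int doc = delegateAdvance.applyAsInt(target);
    return doc > maxDocId ? NO_MORE_DOCS : doc;
  }
}
```

With `allDocExist` set, `advance()` becomes pure arithmetic on the bounds, which is where the latency win in the single-run benchmark below plausibly comes from.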
   
   ### benchmark result
   I also compared this PR's performance with the main branch:
   #### dataset
   I used two datasets: the small one is [httpLog](https://github.com/elastic/rally-tracks/tree/master/http_logs), with about 200 million docs; the big one is our application log, with 1.4 billion docs.
   #### query
   The query is a boolean query with a range clause and a term clause; for the small dataset it is:
   ```
   "query": {
     "bool": {
       "must": [
         {
           "range": {
             "@timestamp": {
               "gte": "1998-06-08T05:00:01Z",
               "lt": "1998-06-15T00:00:00Z"
             }
           }
         },
         {
           "match": {
             "status": "200"
           }
         }
       ]
     }
   }
   ```
   #### result
   1. With the ES Rally tool (it runs many times, so the disk data is cached):
   ```
   |                   Metric |          Task | Baseline | Contender |     Diff |  Unit |
   |           Min Throughput | 200s-in-range |  9.92683 |   10.0551 |  0.12825 | ops/s |
   |          Mean Throughput | 200s-in-range |  9.94556 |   10.0642 |  0.11868 | ops/s |
   |        Median Throughput | 200s-in-range |  9.94556 |   10.0633 |   0.1177 | ops/s |
   |           Max Throughput | 200s-in-range |  9.96398 |   10.0737 |  0.10974 | ops/s |
   |  50th percentile latency | 200s-in-range |  38664.7 |   38022.7 | -641.967 |    ms |
   |  90th percentile latency | 200s-in-range |  41349.8 |     40704 | -645.858 |    ms |
   |  99th percentile latency | 200s-in-range |  41954.2 |   41308.7 | -645.491 |    ms |
   | 100th percentile latency | 200s-in-range |  42021.6 |   41377.6 | -643.989 |    ms |
   ```
   2. Manual single run (all caches cleared):
   ```
   | dataSet             | main branch latency | this PR latency | latency improvement |
   | httpLog             |               267ms |           167ms |                -38% |
   | our application log |              2829ms |          1093ms |                -62% |
   ```
   

