[GitHub] [lucene] rmuir commented on pull request #709: LUCENE-10311: remove complex cost estimation and abstraction leakage around it
rmuir commented on pull request #709:
URL: https://github.com/apache/lucene/pull/709#issuecomment-1052127011

If we add `grow(long)` that simply truncates and forwards, then it encapsulates this within this class. The code stays simple and the caller doesn't need to know about it.
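For illustration, a minimal sketch of what such an overload could look like, assuming the class already exposes an int-based `grow(int)` (the actual class and signatures in the PR may differ):

```java
// Sketch only: truncate the long cost and forward to the existing grow(int),
// so callers can pass a long cost without knowing the class tracks growth
// with an int internally.
public void grow(long numDocs) {
  // Saturating cast: anything above Integer.MAX_VALUE is clamped.
  grow((int) Math.min(numDocs, Integer.MAX_VALUE));
}
```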
[jira] [Updated] (LUCENE-10430) Literal double quotes cause exception in class RegExp
[ https://issues.apache.org/jira/browse/LUCENE-10430?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Holger Rehn updated LUCENE-10430:
---------------------------------
Description: Class org.apache.lucene.util.automaton.RegExp fails to parse valid regular expressions that contain double quotes (except in character classes). This of course affects corresponding RegexpQuerys as well.

Example:
{code:java}
Query q = new RegexpQuery( new Term( "field", "a\"b" ) );
RegExp r = new RegExp( "a\"b" );
{code}

Both fail with:
{code:java}
java.lang.IllegalArgumentException: expected '"' at position 3
    at org.apache.lucene.util.automaton.RegExp.parseSimpleExp(RegExp.java:1299)
    at org.apache.lucene.util.automaton.RegExp.parseCharClassExp(RegExp.java:1229)
    at org.apache.lucene.util.automaton.RegExp.parseComplExp(RegExp.java:1218)
    at org.apache.lucene.util.automaton.RegExp.parseRepeatExp(RegExp.java:1192)
    at org.apache.lucene.util.automaton.RegExp.parseConcatExp(RegExp.java:1185)
    at org.apache.lucene.util.automaton.RegExp.parseConcatExp(RegExp.java:1187)
    at org.apache.lucene.util.automaton.RegExp.parseInterExp(RegExp.java:1179)
    at org.apache.lucene.util.automaton.RegExp.parseUnionExp(RegExp.java:1173)
    at org.apache.lucene.util.automaton.RegExp.<init>(RegExp.java:496)
    ...
{code}

As a workaround we currently replace all double quotes with a dot.

was: (the same description, without the "(except in character classes)" qualifier)

> Literal double quotes cause exception in class RegExp
> -----------------------------------------------------
>
>                 Key: LUCENE-10430
>                 URL: https://issues.apache.org/jira/browse/LUCENE-10430
>             Project: Lucene - Core
>          Issue Type: Bug
>          Components: core/other
>    Affects Versions: 9.0
>            Reporter: Holger Rehn
>           Priority: Major
[jira] [Resolved] (LUCENE-10430) Literal double quotes cause exception in class RegExp
[ https://issues.apache.org/jira/browse/LUCENE-10430?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Robert Muir resolved LUCENE-10430.
----------------------------------
    Resolution: Not A Problem

You aren't escaping the quote: you're sending a Java string of length 1 ("). You need to create the string as "\\\"" in Java so that it is of length 2 (\").
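To make the resolution concrete, a small sketch of the escaping involved (written from the explanation above: RegExp's syntax treats a bare `"` as the start of a quoted literal, so a literal quote needs a regexp-level backslash, which doubles up in Java source):

{code:java}
import org.apache.lucene.index.Term;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.RegexpQuery;
import org.apache.lucene.util.automaton.RegExp;

public class EscapedQuoteExample {
  public static void main(String[] args) {
    // The Java literal "a\"b" is the three-char string a"b, so RegExp sees a
    // bare quote and expects the closing quote of a "..." literal. To match a
    // literal quote, the regexp must contain the two chars \" which, written
    // as a Java string literal, needs a doubled backslash:
    RegExp r = new RegExp("a\\\"b");                        // regexp: a\"b
    Query q = new RegexpQuery(new Term("field", "a\\\"b")); // same escaping
  }
}
{code}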
[GitHub] [lucene] wjp719 edited a comment on pull request #687: LUCENE-10425:speed up IndexSortSortedNumericDocValuesRangeQuery#BoundedDocSetIdIterator construction using bkd binary search
wjp719 edited a comment on pull request #687:
URL: https://github.com/apache/lucene/pull/687#issuecomment-1053190677

> This looks very similar to the implementation of `Weight#count` on `PointRangeQuery` and should only perform marginally faster? It's unclear to me whether this PR buys us much.

Hi @jpountz, I refactored the code. Now, if the conditions are met, I use a BKD binary search to find the min/max docIds and build the **IndexSortSortedNumericDocValuesRangeQuery.BoundedDocSetIdIterator** when creating the **Scorer**, instead of binary-searching with doc values. As we know, doc values can only advance forward, but a binary search may need to walk back to read the value of the middle doc, so every probe of a doc-values-based binary search has to create a new **SortedNumericDocValues** instance and advance from the first doc, which costs more CPU and IO.

I also added an **allDocExist** flag to **BoundedDocSetIdIterator** that records whether every doc between the min/max docIds exists; when it is set, **BoundedDocSetIdIterator#advance()** skips the **delegate.advance()** call that would otherwise check that the doc exists (see the sketch after the benchmark results).

### Benchmark results

I compared this PR against the main branch.

**Dataset.** Two datasets: the small one is [httpLog](https://github.com/elastic/rally-tracks/tree/master/http_logs) with about 200 million docs; the big one is our application log with 1.4 billion docs.

**Query.** A boolean query with a range clause and a term clause; for the small dataset:

```
"query": {
  "bool": {
    "must": [
      {
        "range": {
          "@timestamp": {
            "gte": "1998-06-08T05:00:01Z",
            "lt": "1998-06-15T00:00:00Z"
          }
        }
      },
      {
        "match": {
          "status": "200"
        }
      }
    ]
  }
}
```

**Results.**

1. With the es-rally tool (it runs many times, so the disk data is cached):

```
| Metric                   | Task          | Baseline | Contender |     Diff | Unit  |
| Min Throughput           | 200s-in-range |  9.92683 |   10.0551 |  0.12825 | ops/s |
| Mean Throughput          | 200s-in-range |  9.94556 |   10.0642 |  0.11868 | ops/s |
| Median Throughput        | 200s-in-range |  9.94556 |   10.0633 |   0.1177 | ops/s |
| Max Throughput           | 200s-in-range |  9.96398 |   10.0737 |  0.10974 | ops/s |
| 50th percentile latency  | 200s-in-range |  38664.7 |   38022.7 | -641.967 | ms    |
| 90th percentile latency  | 200s-in-range |  41349.8 |     40704 | -645.858 | ms    |
| 99th percentile latency  | 200s-in-range |  41954.2 |   41308.7 | -645.491 | ms    |
| 100th percentile latency | 200s-in-range |  42021.6 |   41377.6 | -643.989 | ms    |
```

2. A single manual run (all caches cleared):

```
| dataSet             | main branch latency | this pr latency | latency improvement |
| httpLog             |               267ms |           167ms |                -38% |
| our application log |              2829ms |          1093ms |                -62% |
```
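For illustration, a hedged sketch of the `allDocExist` shortcut described above. The field and method names follow the comment (BoundedDocSetIdIterator, allDocExist, delegate), not the actual PR diff, so the real implementation may differ:

```java
import java.io.IOException;
import org.apache.lucene.search.DocIdSetIterator;

// Sketch: iterate docIds in [firstDoc, lastDoc), skipping the per-doc
// existence check when the BKD search proved the range is dense.
class BoundedDocSetIdIterator extends DocIdSetIterator {
  private final int firstDoc;        // min matching docId from the BKD binary search
  private final int lastDoc;         // one past the max matching docId
  private final DocIdSetIterator delegate;
  private final boolean allDocExist; // true when every doc in [firstDoc, lastDoc) matches
  private int doc = -1;

  BoundedDocSetIdIterator(
      int firstDoc, int lastDoc, DocIdSetIterator delegate, boolean allDocExist) {
    this.firstDoc = firstDoc;
    this.lastDoc = lastDoc;
    this.delegate = delegate;
    this.allDocExist = allDocExist;
  }

  @Override
  public int docID() {
    return doc;
  }

  @Override
  public int nextDoc() throws IOException {
    return advance(doc + 1);
  }

  @Override
  public int advance(int target) throws IOException {
    if (target < firstDoc) {
      target = firstDoc;
    }
    if (target >= lastDoc) {
      return doc = NO_MORE_DOCS;
    }
    // Dense case: the BKD search already proved every doc in the range
    // exists, so return the target directly instead of delegate.advance().
    int result = allDocExist ? target : delegate.advance(target);
    return doc = (result < lastDoc) ? result : NO_MORE_DOCS;
  }

  @Override
  public long cost() {
    return lastDoc - firstDoc;
  }
}
```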