[GitHub] [lucene] romseygeek commented on issue #12318: Async Usage of Lucene Monitor through a Reactive Programming based application

2023-06-05 Thread via GitHub


romseygeek commented on issue #12318:
URL: https://github.com/apache/lucene/issues/12318#issuecomment-1576322597

   Yes, I'd say `match` is unlikely to do any actual IO unless you're 
explicitly using an on-disk directory (ie not MMapDirectory or 
ByteBuffersDirectory) for the query index.  So it will all be CPU-bound and 
there won't really be any opportunities to hand off to another thread while 
waiting for disk.
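   For reference, a minimal sketch of that setup (illustrative only, not from the thread): with the default Monitor configuration the query index is kept in memory, so `match` is a plain synchronous, CPU-bound call. The class name, field names, and query below are assumptions.

   ```java
   import org.apache.lucene.analysis.standard.StandardAnalyzer;
   import org.apache.lucene.document.Document;
   import org.apache.lucene.document.Field;
   import org.apache.lucene.document.TextField;
   import org.apache.lucene.index.Term;
   import org.apache.lucene.monitor.MatchingQueries;
   import org.apache.lucene.monitor.Monitor;
   import org.apache.lucene.monitor.MonitorQuery;
   import org.apache.lucene.monitor.QueryMatch;
   import org.apache.lucene.search.TermQuery;

   public class MonitorMatchSketch {
     public static void main(String[] args) throws Exception {
       // Default Monitor configuration keeps the query index in memory,
       // so match() does no disk IO and is effectively CPU-bound.
       try (Monitor monitor = new Monitor(new StandardAnalyzer())) {
         monitor.register(new MonitorQuery("q1", new TermQuery(new Term("text", "lucene"))));

         Document doc = new Document();
         doc.add(new TextField("text", "async usage of the lucene monitor", Field.Store.NO));

         // Plain synchronous call on the current thread.
         MatchingQueries<QueryMatch> matches = monitor.match(doc, QueryMatch.SIMPLE_MATCHER);
         System.out.println("matched queries: " + matches.getMatchCount());
       }
     }
   }
   ```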





[GitHub] [lucene] fudongyingluck commented on pull request #12339: feat: soft delete optimize

2023-06-05 Thread via GitHub


fudongyingluck commented on PR #12339:
URL: https://github.com/apache/lucene/pull/12339#issuecomment-1576330887

   This is the esrally result. The command is like `esrally race --track=http_logs --target-hosts=*:9201 --pipeline=benchmark-only --offline --user-tag=softdelete:baseline --challenge=update`
| Metric | Task | Baseline | Contender | Diff | Unit | Diff % |
|---|---|---:|---:|---:|---|---:|
| Cumulative indexing time of primary shards | | 515.49 | 504.15 | -11.3398 | min | -2.20% |
| Min cumulative indexing time across primary shard | | 0 | 0 | 0 | min | 0.00% |
| Median cumulative indexing time across primary shard | | 17.7529 | 17.9699 | 0.2169 | min | +1.22% |
| Max cumulative indexing time across primary shard | | 404.723 | 393.369 | -11.3536 | min | -2.81% |
| Cumulative indexing throttle time of primary shards | | 0 | 0 | 0 | min | 0.00% |
| Min cumulative indexing throttle time across primary shard | | 0 | 0 | 0 | min | 0.00% |
| Median cumulative indexing throttle time across primary shard | | 0 | 0 | 0 | min | 0.00% |
| Max cumulative indexing throttle time across primary shard | | 0 | 0 | 0 | min | 0.00% |
| Cumulative merge time of primary shards | | 133.81 | 127.489 | -6.32017 | min | -4.72% |
| Cumulative merge count of primary shards | | 173 | 172 | -1 | | -0.58% |
| Min cumulative merge time across primary shard | | 0 | 0 | 0 | min | 0.00% |
| Median cumulative merge time across primary shard | | 2.61536 | 2.96084 | 0.34548 | min | +13.21% |
| Max cumulative merge time across primary shard | | 118.648 | 110.923 | -7.7245 | min | -6.51% |
| Cumulative merge throttle time of primary shards | | 57.0305 | 55.1042 | -1.92633 | min | -3.38% |
| Min cumulative merge throttle time across primary shard | | 0 | 0 | 0 | min | 0.00% |
| Median cumulative merge throttle time across primary shard | | 0.215533 | 0.307242 | 0.09171 | min | +42.55% |
| Max cumulative merge throttle time across primary shard | | 55.2842 | 53.1749 | -2.10932 | min | -3.82% |
| Cumulative refresh time of primary shards | | 21.5803 | 20.5713 | -1.009 | min | -4.68% |
| Cumulative refresh count of primary shards | | 668 | 674 | 6 | | +0.90% |
| Min cumulative refresh time across primary shard | | 0 | 0 | 0 | min | 0.00% |
| Median cumulative refresh time across primary shard | | 0.542333 | 0.508642 | -0.03369 | min | -6.21% |
| Max cumulative refresh time across primary shard | | 18.1363 | 17.4352 | -0.70113 | min | -3.87% |
| Cumulative flush time of primary shards | | 9.37332 | 10.4646 | 1.09132 | min | +11.64% |
| Cumulative flush count of primary shards | | 63 | 64 | 1 | | +1.59% |
| Min cumulative flush time across primary shard | | 0.00296667 | 0.0001 | -0.00287 | min | -96.63% |
| Median cumulative flush time across primary shard | | 0.0971583 | 0.0769667 | -0.02019 | min | -20.78% |
| Max cumulative flush time across primary shard | | 8.6855 | 9.83638 | 1.15088 | min | +13.25% |
| Total Young Gen GC time | | 1070.97 | 1065.08 | -5.889 | s | -0.55% |
| Total Young Gen GC count | | 8254 | 8187 | -67 | | -0.81% |
| Total Old Gen GC time |

[GitHub] [lucene] eliaporciani commented on a diff in pull request #12253: GITHUB-12252: Add function queries for computing similarity scores between knn vectors

2023-06-05 Thread via GitHub


eliaporciani commented on code in PR #12253:
URL: https://github.com/apache/lucene/pull/12253#discussion_r1217704204


##
lucene/queries/src/java/org/apache/lucene/queries/function/valuesource/KnnVectorValueSource.java:
##
@@ -0,0 +1,90 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.lucene.queries.function.valuesource;
+
+import java.io.IOException;
+import java.util.List;
+import java.util.Map;
+import java.util.Objects;
+import org.apache.lucene.index.LeafReaderContext;
+import org.apache.lucene.queries.function.FunctionValues;
+import org.apache.lucene.queries.function.ValueSource;
+
+/** function that returns a constant vector value for every document. */
+public class KnnVectorValueSource extends ValueSource {
+
+  List vector;
+
+  public KnnVectorValueSource(List constVector) {
+    this.vector = constVector;
+  }
+
+  @Override
+  public FunctionValues getValues(Map context, LeafReaderContext readerContext)
+      throws IOException {
+    return new FunctionValues() {
+      @Override
+      public float[] floatVectorVal(int doc) {

Review Comment:
   This method is called for each document, which introduces overhead. We should try to create the arrays at most once.
   The problem is that both the byte[] and the float[] arrays are possibilities, and we cannot know in advance which one will be needed.
   Maybe we can create the float[] vector (and the byte[]) lazily during the first call of this method and store it in a variable.
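   A standalone sketch of the create-at-most-once idea (names are illustrative; in the PR this caching would live inside the anonymous `FunctionValues`, and the byte[] accessor is assumed to mirror the float[] one):

   ```java
   import java.util.List;

   // Illustrative only: converts the constant vector lazily, on first use,
   // and caches the result so repeated per-document calls just return the array.
   class LazyConstVector {
     private final List<Number> vector;
     private float[] floats;
     private byte[] bytes;

     LazyConstVector(List<Number> vector) {
       this.vector = vector;
     }

     float[] asFloats() {
       if (floats == null) {
         floats = new float[vector.size()];
         for (int i = 0; i < floats.length; i++) {
           floats[i] = vector.get(i).floatValue();
         }
       }
       return floats;
     }

     byte[] asBytes() {
       if (bytes == null) {
         bytes = new byte[vector.size()];
         for (int i = 0; i < bytes.length; i++) {
           bytes[i] = vector.get(i).byteValue();
         }
       }
       return bytes;
     }
   }
   ```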






[GitHub] [lucene] almogtavor commented on issue #12318: Async Usage of Lucene Monitor through a Reactive Programming based application

2023-06-05 Thread via GitHub


almogtavor commented on issue #12318:
URL: https://github.com/apache/lucene/issues/12318#issuecomment-1576648070

   @romseygeek Oh, that sounds even better than what I thought. In this case I can treat the `match` operation as a purely synchronous operation and use it in Reactor without confining it to its own thread pool, similar to the way I treat Jackson's ObjectMapper string/object conversions, for example.
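   A minimal sketch of that usage in Reactor (illustrative only; assumes reactor-core on the classpath and an already-registered `monitor`). The match call runs inline on whatever thread the pipeline is already using, with no `publishOn`/`subscribeOn` hand-off:

   ```java
   import java.io.IOException;
   import java.io.UncheckedIOException;
   import org.apache.lucene.document.Document;
   import org.apache.lucene.monitor.MatchingQueries;
   import org.apache.lucene.monitor.Monitor;
   import org.apache.lucene.monitor.QueryMatch;
   import reactor.core.publisher.Flux;

   public class ReactiveMonitorSketch {
     // Hypothetical helper: match every incoming document against the registered queries.
     static Flux<MatchingQueries<QueryMatch>> matchAll(Flux<Document> documents, Monitor monitor) {
       return documents.map(doc -> {
         try {
           // CPU-bound, in-memory call, so it can run on the pipeline thread.
           return monitor.match(doc, QueryMatch.SIMPLE_MATCHER);
         } catch (IOException e) {
           throw new UncheckedIOException(e);
         }
       });
     }
   }
   ```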





[GitHub] [lucene] jpountz commented on a diff in pull request #12249: Restrict GraphTokenStreamFiniteStrings#articulationPointsRecurse recursion depth

2023-06-05 Thread via GitHub


jpountz commented on code in PR #12249:
URL: https://github.com/apache/lucene/pull/12249#discussion_r1218211772


##
lucene/core/src/test/org/apache/lucene/util/graph/TestGraphTokenStreamFiniteStrings.java:
##
@@ -660,4 +663,27 @@ public void testMultipleSidePathsWithGaps() throws Exception {
         it.next(), new String[] {"king", "alfred", "saxons", "ruled"}, new int[] {1, 1, 3, 1});
     assertFalse(it.hasNext());
   }
+
+  public void testLongTokenStreamStackOverflowError() throws Exception {
+
+    ArrayList<Token> tokens =
+        new ArrayList<>() {
+          {
+            add(token("fast", 1, 1));
+            add(token("wi", 1, 1));
+            add(token("wifi", 0, 2));
+            add(token("fi", 1, 1));
+          }
+        };
+
+    // Add in too many tokens to get a high depth graph
+    for (int i = 0; i < 1024 * 10; i++) {
+      tokens.add(token("network", 1, 1));
+    }
+
+    TokenStream ts = new CannedTokenStream(tokens.toArray(new Token[0]));
+    GraphTokenStreamFiniteStrings graph = new GraphTokenStreamFiniteStrings(ts);
+
+    assertThrows(IllegalArgumentException.class, () -> graph.articulationPoints());

Review Comment:
   nit: we prefer method refs to lambdas whenever possible
   
   ```suggestion
   assertThrows(IllegalArgumentException.class, graph::articulationPoints);
   ```



##
lucene/core/src/test/org/apache/lucene/util/graph/TestGraphTokenStreamFiniteStrings.java:
##
@@ -660,4 +661,27 @@ public void testMultipleSidePathsWithGaps() throws Exception {
         it.next(), new String[] {"king", "alfred", "saxons", "ruled"}, new int[] {1, 1, 3, 1});
     assertFalse(it.hasNext());
   }
+
+  public void testLongTokenStreamStackOverflowError() throws Exception {
+    ArrayList<Token> tokens =
+        new ArrayList<>() {
+          {
+            add(token("turbo", 1, 1));
+            add(token("fast", 0, 2));
+            add(token("charged", 1, 1));
+            add(token("wi", 1, 1));
+            add(token("wifi", 0, 2));
+            add(token("fi", 1, 1));
+          }
+        };
+
+    // Add in too many tokens to get a high depth graph
+    for (int i = 0; i < 1024 * 10; i++) {

Review Comment:
   @cfournie I think that Erik's point is that your test would not catch an 
issue where the exception only gets thrown with a depth of 4000 or more, so 
changing the number of iterations to 1024+1 would help make sure that the 
exception gets thrown as early as we expect. I would prefer changing the number 
of iterations to 1024+1 as well.
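   A sketch of that adjustment to the loop in the diff above (assuming 1024 is the depth limit being tested against):

   ```java
   // Just over the limit, so the test verifies the exception is thrown as early as expected.
   for (int i = 0; i < 1024 + 1; i++) {
     tokens.add(token("network", 1, 1));
   }
   ```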



##
lucene/CHANGES.txt:
##
@@ -80,6 +80,8 @@ Bug Fixes
 
 * GITHUB#12220: Hunspell: disallow hidden title-case entries from compound 
middle/end
 
+* LUCENE-10181: Restrict 
GraphTokenStreamFiniteStrings#articulationPointsRecurse recursion depth. (Chris 
Fournier)

Review Comment:
   Can you move it to the 9.7 section instead of 10.0? This feels like a change 
that could go in a minor.






[GitHub] [lucene] jpountz commented on a diff in pull request #12249: Restrict GraphTokenStreamFiniteStrings#articulationPointsRecurse recursion depth

2023-06-05 Thread via GitHub


jpountz commented on code in PR #12249:
URL: https://github.com/apache/lucene/pull/12249#discussion_r1218216466


##
lucene/core/src/test/org/apache/lucene/util/graph/TestGraphTokenStreamFiniteStrings.java:
##
@@ -16,6 +16,9 @@
  */
 package org.apache.lucene.util.graph;
 
+import static org.apache.lucene.util.automaton.Operations.MAX_RECURSION_LEVEL;

Review Comment:
   FYI the build complains that this import is never used.






[GitHub] [lucene] sohami opened a new issue, #12347: Allow extensions of IndexSearcher to provide custom SliceExecutor and slices computation

2023-06-05 Thread via GitHub


sohami opened a new issue, #12347:
URL: https://github.com/apache/lucene/issues/12347

   ### Description
   
   For concurrent segment search, Lucene uses the `slices` method to compute the number of work units that can be processed concurrently.
   
   a) It calculates slices in the [constructor of IndexSearcher](https://github.com/apache/lucene/blob/main/lucene/core/src/java/org/apache/lucene/search/IndexSearcher.java#L239) with default thresholds for document count and segment count.
   b) It provides an implementation of [SliceExecutor (i.e. QueueSizeBasedExecutor)](https://github.com/apache/lucene/blob/main/lucene/core/src/java/org/apache/lucene/search/IndexSearcher.java#L1008), chosen based on the executor type, which applies backpressure to concurrent execution using a limiting factor of 1.5 times the passed-in threadpool's maxPoolSize.
   
   In OpenSearch, there is a search threadpool that serves search requests for all the Lucene indices (or OpenSearch shards) assigned to a node, and each node can receive requests for some or all of the indices it hosts.
   I am exploring a mechanism to dynamically control the maximum number of slices for each Lucene index search request. For example, search requests to some indices on a node could be allowed at most 4 slices each while others get at most 2 slices each, and the threadpool shared to execute these slices would then not need any limiting factor. In this model, the top-level search threadpool limits the number of active search requests, which in turn limits the number of work units in the SliceExecutor threadpool.
   
   For this, a derived implementation of IndexSearcher could take an input value in its constructor to control the slice count computation. However, even though the `slices` method is `protected`, it is called from the constructor of the base `IndexSearcher` class, which prevents the derived class from using the passed-in input (see the sketch below).
   
   To achieve this I am making a change along the lines suggested in the discussion thread on the dev mailing list, to get some feedback.
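   A minimal sketch of the constructor-ordering problem described above (the class and field names are hypothetical):

   ```java
   import java.util.List;
   import java.util.concurrent.Executor;
   import org.apache.lucene.index.IndexReader;
   import org.apache.lucene.index.LeafReaderContext;
   import org.apache.lucene.search.IndexSearcher;

   // Hypothetical subclass illustrating why a constructor argument cannot
   // currently drive the slice computation.
   class ConfigurableSliceSearcher extends IndexSearcher {
     private final int maxSlices;

     ConfigurableSliceSearcher(IndexReader reader, Executor executor, int maxSlices) {
       super(reader, executor); // the base constructor already computes the slices
       this.maxSlices = maxSlices; // assigned only after that computation has run
     }

     @Override
     protected LeafSlice[] slices(List<LeafReaderContext> leaves) {
       // When invoked from the super constructor, maxSlices still holds its
       // default value (0), so the configured limit cannot be applied here.
       return super.slices(leaves);
     }
   }
   ```

   Because the override runs before `maxSlices` is assigned, the configured value cannot influence the slice computation, which is exactly the limitation described above.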





[GitHub] [lucene] fudongyingluck commented on pull request #12339: feat: soft delete optimize

2023-06-05 Thread via GitHub


fudongyingluck commented on PR #12339:
URL: https://github.com/apache/lucene/pull/12339#issuecomment-1577903918

   lucene benchmark result, `python3.10 src/python/localrun.py -source wikimediumall`:

| Task | QPS baseline | StdDev | QPS my_modified_version | StdDev | Pct diff | p-value |
|---|---:|---:|---:|---:|---|---:|
| BrowseDateSSDVFacets | 1.54 | (11.4%) | 1.46 | (16.1%) | -5.2% (-29% - 25%) | 0.242 |
| OrHighMedDayTaxoFacets | 5.38 | (5.6%) | 5.24 | (5.0%) | -2.6% (-12% - 8%) | 0.127 |
| PKLookup | 279.48 | (3.0%) | 273.06 | (3.1%) | -2.3% (-8% - 3%) | 0.018 |
| MedTermDayTaxoFacets | 35.78 | (2.2%) | 35.10 | (1.8%) | -1.9% (-5% - 2%) | 0.002 |
| BrowseDateTaxoFacets | 7.23 | (22.3%) | 7.10 | (23.8%) | -1.8% (-39% - 56%) | 0.802 |
| HighIntervalsOrdered | 10.59 | (8.9%) | 10.42 | (8.6%) | -1.6% (-17% - 17%) | 0.568 |
| BrowseDayOfYearTaxoFacets | 7.30 | (21.8%) | 7.19 | (23.9%) | -1.6% (-38% - 56%) | 0.829 |
| LowIntervalsOrdered | 4.55 | (7.1%) | 4.48 | (7.1%) | -1.5% (-14% - 13%) | 0.495 |
| MedIntervalsOrdered | 6.90 | (8.1%) | 6.81 | (7.3%) | -1.4% (-15% - 15%) | 0.565 |
| Fuzzy2 | 118.84 | (2.2%) | 117.28 | (2.5%) | -1.3% (-5% - 3%) | 0.078 |
| Respell | 82.74 | (3.1%) | 81.79 | (4.0%) | -1.2% (-7% - 6%) | 0.308 |
| HighTermMonthSort | 3093.29 | (5.8%) | 3057.85 | (6.7%) | -1.1% (-12% - 12%) | 0.562 |
| BrowseRandomLabelTaxoFacets | 6.40 | (38.8%) | 6.33 | (40.9%) | -1.1% (-58% - 128%) | 0.930 |
| HighTerm | 791.45 | (5.1%) | 783.46 | (4.7%) | -1.0% (-10% - 9%) | 0.517 |
| HighPhrase | 30.44 | (2.3%) | 30.16 | (2.2%) | -0.9% (-5% - 3%) | 0.190 |
| Fuzzy1 | 108.68 | (2.7%) | 107.67 | (3.6%) | -0.9% (-7% - 5%) | 0.359 |
| OrHighNotMed | 320.94 | (6.6%) | 318.02 | (5.3%) | -0.9% (-11% - 11%) | 0.629 |
| OrNotHighHigh | 468.36 | (5.3%) | 464.33 | (4.2%) | -0.9% (-9% - 9%) | 0.568 |
| LowSloppyPhrase | 34.97 | (4.1%) | 34.69 | (4.2%) | -0.8% (-8% - 7%) | 0.534 |
| MedPhrase | 242.27 | (2.5%) | 240.32 | (1.9%) | -0.8% (-5% - 3%) | 0.248 |
| AndHighMed | 77.34 | (6.0%) | 76.76 | (5.7%) | -0.8% (-11% - 11%) | 0.686 |
| OrHighNotLow | 744.00 | (6.5%) | 738.66 | (5.8%) | -0.7% (-12% - 12%) | 0.711 |
| AndHighLow | 586.58 | (3.5%) | 582.51 | (4.2%) | -0.7% (-8% - 7%) | 0.573 |
| HighSloppyPhrase | 3.91 | (4.5%) | 3.89 | (3.9%) | -0.6% (-8% - 8%) | 0.670 |
| MedSpanNear | 37.46 | (2.1%) | 37.26 | (2.5%) | -0.6% (-5% - 4%) | 0.441 |
| LowPhrase | 153.02 | (2.2%) | 152.17 | (2.1%) | -0.6% (-4% - 3%) | 0.417 |
| OrNotHighLow | 1030.00 | (3.2%) | 1025.40 | (3.5%) | -0.4% (-6% - 6%) | 0.675 |
| Wildcard | 35.75 | (3.2%) | 35.59 | (4.5%) | -0.4% (-7% - 7%) | 0.723 |
| MedTerm | 761.12 | (5.8%) | 757.86 | (6.0%) | -0.4% (-11% - 12%) | 0.819 |
| AndHighHigh | 22.42 | (6.5%) | 22.33 | (5.7%) | -0.4% (-11% - 12%) | 0.830 |
| LowTerm | 689.41 | (3.9%) | 686.65 | (4.6%) | -0.4% (-8% - 8%) | 0.768 |
| HighSpanNear | 2.47 | (4.2%) | 2.46 | (5.0%) | -0.4% (-9% - 9%) | 0.789 |
| AndHighHighDayTaxoFacets | 7.97 | (1.6%) | 7.94 | (1.9%) | -0.4% (-3% - 3%) | 0.522 |
| OrHighNotHigh | 352.84 | (6.6%) | 351.68 | (4.9%) | -0.3% (-11% - 11%) | 0.859 |
| AndHighMedDayTaxoFacets | 48.80 | (1.6%) | 48.65 | (2.3%) | -0.3% (-4% - 3%) | 0.611 |
| MedSloppyPhrase | 24.12 | (2.4%) | 24.04 | (2.5%) | -0.3% (-5% - 4%) | 0.684 |
| OrHighMed | 37.82 | (6.3%) | 37.72 | (5.5%) | -0.3% (-11% - 12%) | 0.891 |
| HighTermTitleBDVSort | 7.13 | (8.7%) | 7.11 | (8.1%) | -0.2% (-15% - 18%) | 0.927 |
| LowSpanNear | 26.13 | (3.7%) | 26.08 | (3.3%) | -0.2% (-6% - 7%) | 0.866 |
| Prefix3 | 408.84 | (1.3%) | 408.62 | (2.1%) | -0.1% (-3% - 3%) | 0.923 |
| OrNotHighMed | 469.82 | (4.2%) | 470.09 | (3.6%) | 0.1% ( |