[GitHub] [lucene] romseygeek commented on issue #12318: Async Usage of Lucene Monitor through a Reactive Programming based application
romseygeek commented on issue #12318: URL: https://github.com/apache/lucene/issues/12318#issuecomment-1576322597 Yes, I'd say `match` is unlikely to do any actual IO unless you're explicitly using an on-disk directory (ie not MMapDirectory or ByteBuffersDirectory) for the query index. So it will all be CPU-bound and there won't really be any opportunities to hand off to another thread while waiting for disk. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] fudongyingluck commented on pull request #12339: feat: soft delete optimize
fudongyingluck commented on PR #12339: URL: https://github.com/apache/lucene/pull/12339#issuecomment-1576330887

This is the esrally result. The command is like `esrally race --track=http_logs --target-hosts=*:9201 --pipeline=benchmark-only --offline --user-tag=softdelete:baseline --challenge=update`

| Metric | Task | Baseline | Contender | Diff | Unit | Diff % |
|---:|---:|---:|---:|---:|---:|---:|
| Cumulative indexing time of primary shards | | 515.49 | 504.15 | -11.3398 | min | -2.20% |
| Min cumulative indexing time across primary shard | | 0 | 0 | 0 | min | 0.00% |
| Median cumulative indexing time across primary shard | | 17.7529 | 17.9699 | 0.2169 | min | +1.22% |
| Max cumulative indexing time across primary shard | | 404.723 | 393.369 | -11.3536 | min | -2.81% |
| Cumulative indexing throttle time of primary shards | | 0 | 0 | 0 | min | 0.00% |
| Min cumulative indexing throttle time across primary shard | | 0 | 0 | 0 | min | 0.00% |
| Median cumulative indexing throttle time across primary shard | | 0 | 0 | 0 | min | 0.00% |
| Max cumulative indexing throttle time across primary shard | | 0 | 0 | 0 | min | 0.00% |
| Cumulative merge time of primary shards | | 133.81 | 127.489 | -6.32017 | min | -4.72% |
| Cumulative merge count of primary shards | | 173 | 172 | -1 | | -0.58% |
| Min cumulative merge time across primary shard | | 0 | 0 | 0 | min | 0.00% |
| Median cumulative merge time across primary shard | | 2.61536 | 2.96084 | 0.34548 | min | +13.21% |
| Max cumulative merge time across primary shard | | 118.648 | 110.923 | -7.7245 | min | -6.51% |
| Cumulative merge throttle time of primary shards | | 57.0305 | 55.1042 | -1.92633 | min | -3.38% |
| Min cumulative merge throttle time across primary shard | | 0 | 0 | 0 | min | 0.00% |
| Median cumulative merge throttle time across primary shard | | 0.215533 | 0.307242 | 0.09171 | min | +42.55% |
| Max cumulative merge throttle time across primary shard | | 55.2842 | 53.1749 | -2.10932 | min | -3.82% |
| Cumulative refresh time of primary shards | | 21.5803 | 20.5713 | -1.009 | min | -4.68% |
| Cumulative refresh count of primary shards | | 668 | 674 | 6 | | +0.90% |
| Min cumulative refresh time across primary shard | | 0 | 0 | 0 | min | 0.00% |
| Median cumulative refresh time across primary shard | | 0.542333 | 0.508642 | -0.03369 | min | -6.21% |
| Max cumulative refresh time across primary shard | | 18.1363 | 17.4352 | -0.70113 | min | -3.87% |
| Cumulative flush time of primary shards | | 9.37332 | 10.4646 | 1.09132 | min | +11.64% |
| Cumulative flush count of primary shards | | 63 | 64 | 1 | | +1.59% |
| Min cumulative flush time across primary shard | | 0.00296667 | 0.0001 | -0.00287 | min | -96.63% |
| Median cumulative flush time across primary shard | | 0.0971583 | 0.0769667 | -0.02019 | min | -20.78% |
| Max cumulative flush time across primary shard | | 8.6855 | 9.83638 | 1.15088 | min | +13.25% |
| Total Young Gen GC time | | 1070.97 | 1065.08 | -5.889 | s | -0.55% |
| Total Young Gen GC count | | 8254 | 8187 | -67 | | -0.81% |
| Total Old Gen GC time |
[GitHub] [lucene] eliaporciani commented on a diff in pull request #12253: GITHUB-12252: Add function queries for computing similarity scores between knn vectors
eliaporciani commented on code in PR #12253: URL: https://github.com/apache/lucene/pull/12253#discussion_r1217704204

## lucene/queries/src/java/org/apache/lucene/queries/function/valuesource/KnnVectorValueSource.java: ##
@@ -0,0 +1,90 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.lucene.queries.function.valuesource;
+
+import java.io.IOException;
+import java.util.List;
+import java.util.Map;
+import java.util.Objects;
+import org.apache.lucene.index.LeafReaderContext;
+import org.apache.lucene.queries.function.FunctionValues;
+import org.apache.lucene.queries.function.ValueSource;
+
+/** function that returns a constant vector value for every document. */
+public class KnnVectorValueSource extends ValueSource {
+
+  List vector;
+
+  public KnnVectorValueSource(List constVector) {
+    this.vector = constVector;
+  }
+
+  @Override
+  public FunctionValues getValues(Map context, LeafReaderContext readerContext)
+      throws IOException {
+    return new FunctionValues() {
+      @Override
+      public float[] floatVectorVal(int doc) {

Review Comment:
This method is called for each document and it introduces an overhead. We should try to create the arrays at most once. The problem is that we have both the byte[] and the float[] arrays as possibilities and we cannot know which one will be needed in advance. Maybe we can create the float[] vector (and byte[]) lazily during the first call of this method and store it in a variable.
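The lazy-creation idea in the review comment could look roughly like the sketch below. This is an illustration only, not the PR's code: the class `LazyConstVector` and its methods `floatVector()` and `byteVector()` are hypothetical names, and it assumes the constant vector arrives as a list of `Number` values.

```java
import java.util.List;

/** Hypothetical holder that converts a constant vector at most once per representation. */
final class LazyConstVector {
  private final List<? extends Number> source;
  private float[] floatCache; // built on the first floatVector() call
  private byte[] byteCache; // built on the first byteVector() call

  LazyConstVector(List<? extends Number> source) {
    this.source = List.copyOf(source);
  }

  /** Returns the cached float[] view, creating it on first use. */
  float[] floatVector() {
    if (floatCache == null) {
      float[] v = new float[source.size()];
      for (int i = 0; i < v.length; i++) {
        v[i] = source.get(i).floatValue();
      }
      floatCache = v;
    }
    return floatCache;
  }

  /** Returns the cached byte[] view, creating it on first use. */
  byte[] byteVector() {
    if (byteCache == null) {
      byte[] v = new byte[source.size()];
      for (int i = 0; i < v.length; i++) {
        v[i] = source.get(i).byteValue();
      }
      byteCache = v;
    }
    return byteCache;
  }
}
```

Only the representation that is actually requested gets allocated, and repeated per-document calls return the same array. The sketch is not thread-safe and hands out the shared array, so callers would have to treat it as read-only.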
[GitHub] [lucene] almogtavor commented on issue #12318: Async Usage of Lucene Monitor through a Reactive Programming based application
almogtavor commented on issue #12318: URL: https://github.com/apache/lucene/issues/12318#issuecomment-1576648070

@romseygeek Oh, so that sounds even better than what I thought. In this case, I can treat the `match` operation as a fully synchronous operation and use it in Reactor without binding it to its own thread pool, similar to the way I treat Jackson's ObjectMapper string-to-object operations, for example.
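A rough sketch of that approach follows, using the JDK's `CompletableFuture` as a stand-in for Reactor so the example stays self-contained. `CpuBoundStage`, `matchDocument`, and `pipeline` are hypothetical names, and `matchDocument` is only a placeholder for a CPU-bound, in-memory call; it is not the Lucene Monitor API.

```java
import java.util.List;
import java.util.concurrent.CompletableFuture;

final class CpuBoundStage {
  /** Hypothetical placeholder for a CPU-bound, in-memory match call. */
  static List<String> matchDocument(String doc) {
    return doc.contains("lucene") ? List.of("query-1") : List.of();
  }

  /**
   * Chains the synchronous step with thenApply, which runs it on whichever thread
   * completes the upstream stage instead of handing it to a dedicated pool.
   */
  static CompletableFuture<List<String>> pipeline(CompletableFuture<String> docs) {
    return docs.thenApply(CpuBoundStage::matchDocument);
  }
}
```

The design choice is the same as for any short CPU-bound transformation: since there is no blocking wait to hide, a separate scheduler adds a thread hop without freeing anything up.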
[GitHub] [lucene] jpountz commented on a diff in pull request #12249: Restrict GraphTokenStreamFiniteStrings#articulationPointsRecurse recursion depth
jpountz commented on code in PR #12249: URL: https://github.com/apache/lucene/pull/12249#discussion_r1218211772

## lucene/core/src/test/org/apache/lucene/util/graph/TestGraphTokenStreamFiniteStrings.java: ##
@@ -660,4 +663,27 @@ public void testMultipleSidePathsWithGaps() throws Exception {
         it.next(), new String[] {"king", "alfred", "saxons", "ruled"}, new int[] {1, 1, 3, 1});
     assertFalse(it.hasNext());
   }
+
+  public void testLongTokenStreamStackOverflowError() throws Exception {
+
+    ArrayList tokens =
+        new ArrayList() {
+          {
+            add(token("fast", 1, 1));
+            add(token("wi", 1, 1));
+            add(token("wifi", 0, 2));
+            add(token("fi", 1, 1));
+          }
+        };
+
+    // Add in too many tokens to get a high depth graph
+    for (int i = 0; i < 1024 * 10; i++) {
+      tokens.add(token("network", 1, 1));
+    }
+
+    TokenStream ts = new CannedTokenStream(tokens.toArray(new Token[0]));
+    GraphTokenStreamFiniteStrings graph = new GraphTokenStreamFiniteStrings(ts);
+
+    assertThrows(IllegalArgumentException.class, () -> graph.articulationPoints());

Review Comment:
nit: we prefer method refs to lambdas whenever possible

```suggestion
    assertThrows(IllegalArgumentException.class, graph::articulationPoints);
```

## lucene/core/src/test/org/apache/lucene/util/graph/TestGraphTokenStreamFiniteStrings.java: ##
@@ -660,4 +661,27 @@ public void testMultipleSidePathsWithGaps() throws Exception {
         it.next(), new String[] {"king", "alfred", "saxons", "ruled"}, new int[] {1, 1, 3, 1});
     assertFalse(it.hasNext());
   }
+
+  public void testLongTokenStreamStackOverflowError() throws Exception {
+    ArrayList tokens =
+        new ArrayList() {
+          {
+            add(token("turbo", 1, 1));
+            add(token("fast", 0, 2));
+            add(token("charged", 1, 1));
+            add(token("wi", 1, 1));
+            add(token("wifi", 0, 2));
+            add(token("fi", 1, 1));
+          }
+        };
+
+    // Add in too many tokens to get a high depth graph
+    for (int i = 0; i < 1024 * 10; i++) {

Review Comment:
@cfournie I think that Erik's point is that your test would not catch an issue where the exception only gets thrown with a depth of 4000 or more, so changing the number of iterations to 1024+1 would help make sure that the exception gets thrown as early as we expect. I would prefer changing the number of iterations to 1024+1 as well.

## lucene/CHANGES.txt: ##
@@ -80,6 +80,8 @@ Bug Fixes
 * GITHUB#12220: Hunspell: disallow hidden title-case entries from compound middle/end
+* LUCENE-10181: Restrict GraphTokenStreamFiniteStrings#articulationPointsRecurse recursion depth. (Chris Fournier)

Review Comment:
Can you move it to the 9.7 section instead of 10.0? This feels like a change that could go in a minor.
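The argument for testing at exactly `1024+1` can be illustrated with a small depth-limited recursion. This sketch is not Lucene's implementation: `DepthLimitedWalk`, `MAX_DEPTH`, and `walk` are hypothetical names, and the limit of 1024 is an assumption taken from the iteration count suggested in the review.

```java
final class DepthLimitedWalk {
  static final int MAX_DEPTH = 1024; // assumed limit, for illustration only

  /** Recurses once per remaining element and rejects inputs deeper than MAX_DEPTH. */
  static int walk(int remaining, int depth) {
    if (depth > MAX_DEPTH) {
      throw new IllegalArgumentException("Exceeded maximum recursion level " + MAX_DEPTH);
    }
    return remaining == 0 ? depth : walk(remaining - 1, depth + 1);
  }
}
```

With this shape, `walk(1024, 0)` succeeds and `walk(1024 + 1, 0)` throws. A test that only tries `1024 * 10` elements would also pass if the limit were mistakenly set to, say, 4000, which is why probing the first input past the boundary gives the stronger check.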
[GitHub] [lucene] jpountz commented on a diff in pull request #12249: Restrict GraphTokenStreamFiniteStrings#articulationPointsRecurse recursion depth
jpountz commented on code in PR #12249: URL: https://github.com/apache/lucene/pull/12249#discussion_r1218216466

## lucene/core/src/test/org/apache/lucene/util/graph/TestGraphTokenStreamFiniteStrings.java: ##
@@ -16,6 +16,9 @@
  */
 package org.apache.lucene.util.graph;

+import static org.apache.lucene.util.automaton.Operations.MAX_RECURSION_LEVEL;

Review Comment:
FYI the build complains that this import is never used.
[GitHub] [lucene] sohami opened a new issue, #12347: Allow extensions of IndexSearcher to provide custom SliceExecutor and slices computation
sohami opened a new issue, #12347: URL: https://github.com/apache/lucene/issues/12347

### Description

For concurrent segment search, Lucene uses the `slices` method to compute the number of work units that can be processed concurrently.

a) It calculates slices in the [constructor of IndexSearcher](https://github.com/apache/lucene/blob/main/lucene/core/src/java/org/apache/lucene/search/IndexSearcher.java#L239) with default thresholds for document count and segment count.
b) It provides an implementation of [SliceExecutor (i.e. QueueSizeBasedExecutor)](https://github.com/apache/lucene/blob/main/lucene/core/src/java/org/apache/lucene/search/IndexSearcher.java#L1008), chosen based on the executor type, which applies backpressure in concurrent execution using a limiting factor of 1.5 times the passed-in thread pool's maxPoolSize.

In OpenSearch, there is a search thread pool that serves search requests to all the Lucene indices (or OpenSearch shards) assigned to a node. Each node can get requests for some or all of the indices on that node. I am exploring a mechanism that lets me dynamically control the maximum number of slices for each Lucene index search request. For example, search requests to some indices on that node could have at most 4 slices each and others at most 2 slices each. The thread pool shared to execute these slices would then not have any limiting factor. In this model, the top-level search thread pool limits the number of active search requests, which in turn limits the number of work units in the SliceExecutor thread pool.

For this, the derived implementation of IndexSearcher can take an input value in its constructor to control the slice count computation. Even though the `slices` method is `protected`, it gets called from the constructor of the base `IndexSearcher` class, which prevents the derived class from using the passed-in input.

To achieve this, I am making a change along the lines suggested in the discussion thread on the dev mailing list, to get some feedback.
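The constructor-ordering problem described in the issue can be reproduced in plain Java. The sketch below uses hypothetical classes (`BaseSearcher`, `CustomSearcher`) and a hypothetical `computeSlices` method rather than Lucene's `IndexSearcher`; it only shows that an overridable method invoked from a base-class constructor runs before the subclass's fields are assigned.

```java
class BaseSearcher {
  final int sliceCount;

  BaseSearcher() {
    // Runs before any subclass constructor body has executed.
    this.sliceCount = computeSlices();
  }

  protected int computeSlices() {
    return 1;
  }
}

class CustomSearcher extends BaseSearcher {
  private final int maxSlices;

  CustomSearcher(int maxSlices) {
    super(); // computeSlices() is invoked here, while maxSlices still holds 0
    this.maxSlices = maxSlices;
  }

  @Override
  protected int computeSlices() {
    return maxSlices; // sees the default value during base-class construction
  }
}
```

Here `new CustomSearcher(4).sliceCount` ends up as 0 rather than 4, because the override reads `maxSlices` before the subclass constructor has stored the argument. Passing the value to the base class, or computing slices lazily after construction, are two general ways around this ordering.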
[GitHub] [lucene] fudongyingluck commented on pull request #12339: feat: soft delete optimize
fudongyingluck commented on PR #12339: URL: https://github.com/apache/lucene/pull/12339#issuecomment-1577903918

lucene benchmark result, `python3.10 src/python/localrun.py -source wikimediumall`

```
Task                        QPS baseline  StdDev  QPS my_modified_version  StdDev  Pct diff  p-value
BrowseDateSSDVFacets           1.54 (11.4%)     1.46 (16.1%)   -5.2% (-29% -  25%) 0.242
OrHighMedDayTaxoFacets         5.38  (5.6%)     5.24  (5.0%)   -2.6% (-12% -   8%) 0.127
PKLookup                     279.48  (3.0%)   273.06  (3.1%)   -2.3% ( -8% -   3%) 0.018
MedTermDayTaxoFacets          35.78  (2.2%)    35.10  (1.8%)   -1.9% ( -5% -   2%) 0.002
BrowseDateTaxoFacets           7.23 (22.3%)     7.10 (23.8%)   -1.8% (-39% -  56%) 0.802
HighIntervalsOrdered          10.59  (8.9%)    10.42  (8.6%)   -1.6% (-17% -  17%) 0.568
BrowseDayOfYearTaxoFacets      7.30 (21.8%)     7.19 (23.9%)   -1.6% (-38% -  56%) 0.829
LowIntervalsOrdered            4.55  (7.1%)     4.48  (7.1%)   -1.5% (-14% -  13%) 0.495
MedIntervalsOrdered            6.90  (8.1%)     6.81  (7.3%)   -1.4% (-15% -  15%) 0.565
Fuzzy2                       118.84  (2.2%)   117.28  (2.5%)   -1.3% ( -5% -   3%) 0.078
Respell                       82.74  (3.1%)    81.79  (4.0%)   -1.2% ( -7% -   6%) 0.308
HighTermMonthSort           3093.29  (5.8%)  3057.85  (6.7%)   -1.1% (-12% -  12%) 0.562
BrowseRandomLabelTaxoFacets    6.40 (38.8%)     6.33 (40.9%)   -1.1% (-58% - 128%) 0.930
HighTerm                     791.45  (5.1%)   783.46  (4.7%)   -1.0% (-10% -   9%) 0.517
HighPhrase                    30.44  (2.3%)    30.16  (2.2%)   -0.9% ( -5% -   3%) 0.190
Fuzzy1                       108.68  (2.7%)   107.67  (3.6%)   -0.9% ( -7% -   5%) 0.359
OrHighNotMed                 320.94  (6.6%)   318.02  (5.3%)   -0.9% (-11% -  11%) 0.629
OrNotHighHigh                468.36  (5.3%)   464.33  (4.2%)   -0.9% ( -9% -   9%) 0.568
LowSloppyPhrase               34.97  (4.1%)    34.69  (4.2%)   -0.8% ( -8% -   7%) 0.534
MedPhrase                    242.27  (2.5%)   240.32  (1.9%)   -0.8% ( -5% -   3%) 0.248
AndHighMed                    77.34  (6.0%)    76.76  (5.7%)   -0.8% (-11% -  11%) 0.686
OrHighNotLow                 744.00  (6.5%)   738.66  (5.8%)   -0.7% (-12% -  12%) 0.711
AndHighLow                   586.58  (3.5%)   582.51  (4.2%)   -0.7% ( -8% -   7%) 0.573
HighSloppyPhrase               3.91  (4.5%)     3.89  (3.9%)   -0.6% ( -8% -   8%) 0.670
MedSpanNear                   37.46  (2.1%)    37.26  (2.5%)   -0.6% ( -5% -   4%) 0.441
LowPhrase                    153.02  (2.2%)   152.17  (2.1%)   -0.6% ( -4% -   3%) 0.417
OrNotHighLow                1030.00  (3.2%)  1025.40  (3.5%)   -0.4% ( -6% -   6%) 0.675
Wildcard                      35.75  (3.2%)    35.59  (4.5%)   -0.4% ( -7% -   7%) 0.723
MedTerm                      761.12  (5.8%)   757.86  (6.0%)   -0.4% (-11% -  12%) 0.819
AndHighHigh                   22.42  (6.5%)    22.33  (5.7%)   -0.4% (-11% -  12%) 0.830
LowTerm                      689.41  (3.9%)   686.65  (4.6%)   -0.4% ( -8% -   8%) 0.768
HighSpanNear                   2.47  (4.2%)     2.46  (5.0%)   -0.4% ( -9% -   9%) 0.789
AndHighHighDayTaxoFacets       7.97  (1.6%)     7.94  (1.9%)   -0.4% ( -3% -   3%) 0.522
OrHighNotHigh                352.84  (6.6%)   351.68  (4.9%)   -0.3% (-11% -  11%) 0.859
AndHighMedDayTaxoFacets       48.80  (1.6%)    48.65  (2.3%)   -0.3% ( -4% -   3%) 0.611
MedSloppyPhrase               24.12  (2.4%)    24.04  (2.5%)   -0.3% ( -5% -   4%) 0.684
OrHighMed                     37.82  (6.3%)    37.72  (5.5%)   -0.3% (-11% -  12%) 0.891
HighTermTitleBDVSort           7.13  (8.7%)     7.11  (8.1%)   -0.2% (-15% -  18%) 0.927
LowSpanNear                   26.13  (3.7%)    26.08  (3.3%)   -0.2% ( -6% -   7%) 0.866
Prefix3                      408.84  (1.3%)   408.62  (2.1%)   -0.1% ( -3% -   3%) 0.923
OrNotHighMed                 469.82  (4.2%)   470.09  (3.6%)    0.1% (