[GitHub] [lucene] zacharymorn commented on pull request #12194: [GITHUB-11915] [Discussion Only] Make Lucene smarter about long runs of matches via new API on DISI
zacharymorn commented on PR #12194: URL: https://github.com/apache/lucene/pull/12194#issuecomment-1463441417

Thanks @gsmiller for your review and suggestions!

> What about updating FixedBitSet#or(disi) to use this? That's used when rewriting MultiTermQuery instances, and I would think we'd see a performance improvement to prefix3 and wildcard benchmark tasks. I guess the thinking there would be to "flip" an entire long to -1L at once if a run of 64 docs is included in a dense DISI sent to the or (and then subsequently advance beyond the "dense run" in the DISI).
>
> This idea will probably have diminishing returns after https://github.com/apache/lucene/pull/12055, since we now prioritize only building the more sparse iterators into the bitset upfront, but it could still help. If you really want to look for impact, try switching RegexpQuery and PrefixQuery to use CONSTANT_SCORE_REWRITE instead of CONSTANT_SCORE_BLENDED_REWRITE (in their constructors). That should highlight an impact in the benchmark.

I think @mdmarshmallow might be working on this as per https://github.com/apache/lucene/issues/11915#issuecomment-1459502217. As part of this PR, I've also added a new `BitSet#nextClearBit` API, just like the JDK's `BitSet` API, which might be useful here as well.

-- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
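The "flip an entire long to -1L" idea above can be sketched as a word-aligned range set: once a dense run of matching docs is known (e.g. via the hypothetical `peekNextNonMatchingDocID()` this PR proposes), every fully covered 64-bit word is written in one store instead of 64 individual bit sets. This is an illustrative sketch, not Lucene's actual `FixedBitSet#or` code.

```java
import java.util.Arrays;

final class RunAwareBits {
  // Set bits [runStart, runEnd), flipping fully covered words straight to -1L.
  // Mirrors the word-aligned logic a run-aware FixedBitSet#or(disi) could use.
  static void setRun(long[] bits, int runStart, int runEnd) {
    int startWord = runStart >> 6;
    int endWord = (runEnd - 1) >> 6;
    long startMask = -1L << runStart;  // clears the low bits below runStart
    long endMask = -1L >>> -runEnd;    // clears the high bits at/above runEnd
    if (startWord == endWord) {
      bits[startWord] |= (startMask & endMask);
      return;
    }
    bits[startWord] |= startMask;
    Arrays.fill(bits, startWord + 1, endWord, -1L); // the dense middle: whole words at once
    bits[endWord] |= endMask;
  }
}
```

The caller would then `advance()` the DISI past `runEnd` rather than iterating doc by doc through the dense run.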
[GitHub] [lucene] zacharymorn commented on a diff in pull request #12194: [GITHUB-11915] [Discussion Only] Make Lucene smarter about long runs of matches via new API on DISI
zacharymorn commented on code in PR #12194: URL: https://github.com/apache/lucene/pull/12194#discussion_r1132077572

## lucene/core/src/java/org/apache/lucene/codecs/lucene90/Lucene90PostingsReader.java:
## @@ -479,6 +481,31 @@ private void refillDocs() throws IOException {
     assert docBuffer[BLOCK_SIZE] == NO_MORE_DOCS;
   }

+  @Override
+  public int peekNextNonMatchingDocID() throws IOException {

Review Comment: Thanks for the suggestion! This should indeed give us a good speedup, since we wouldn't have to linearly scan docBuffer to locate the end of the run. However, as I'm trying not to load the subsequent block (callers still need the current block's state for nextDoc/advance), we may not be able to optimize for the scenario where consecutive docs span multiple blocks.
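The constraint discussed here can be sketched as follows: the peek may only look inside the already-decoded doc buffer of the current block, so a run that continues into the next (not-yet-loaded) block is capped at the block boundary. Buffer layout and names are illustrative, not `Lucene90PostingsReader`'s actual fields; the O(1) fast path is the "no linear scan" idea from the review.

```java
final class BlockRunPeek {
  // Returns the first docID that breaks the consecutive run starting at
  // docBuffer[upto], looking only within this block's entries [upto, limit).
  static int peekNextNonMatching(int[] docBuffer, int upto, int limit) {
    int last = limit - 1;
    // O(1) fast path: if the tail is fully dense (last - first == count - 1),
    // the run extends to the end of the block, no per-element scan needed.
    if (docBuffer[last] - docBuffer[upto] == last - upto) {
      return docBuffer[last] + 1;
    }
    // Otherwise scan for the first gap; guaranteed to find one before `last`.
    int i = upto;
    while (i + 1 < limit && docBuffer[i + 1] == docBuffer[i] + 1) {
      i++;
    }
    return docBuffer[i] + 1;
  }
}
```

Note that even when the fast path fires, the returned value only proves density up to the block's last doc; the next block might continue the run, which is exactly the multi-block case the comment says cannot be optimized without loading it.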
[GitHub] [lucene] alessandrobenedetti commented on pull request #12169: Introduced the Word2VecSynonymFilter
alessandrobenedetti commented on PR #12169: URL: https://github.com/apache/lucene/pull/12169#issuecomment-1463774563

I'll leave the BoostAttribute discussion for another time as I don't have access to the code right now, but JavaDocs or not, it seems extremely suspicious to have a public class that shouldn't be used; it smells of bad practice, quite frankly.

Regarding "should this be a token filter or something else": I don't see any particular difference from the regular synonym token filter (it's just a different way of providing synonyms, from a machine-learned model rather than an explicit file). The boost side of things (which should be optional) is designed for query-time usage only (if I remember correctly, we no longer have any kind of index-time boost, nor was there ever an index-time per-term boost). Not much different from the weighted-synonyms approach.
[GitHub] [lucene] jpountz opened a new pull request, #12198: Reduce contention in DocumentsWriterFlushControl.
jpountz opened a new pull request, #12198: URL: https://github.com/apache/lucene/pull/12198

lucene-util's `IndexGeoNames` benchmark is heavily contended when running with many indexing threads, 20 in my case. The main offender is `DocumentsWriterFlushControl#doAfterDocument`, which runs after every index operation to update doc and RAM accounting. This change reduces contention by only updating RAM accounting if the amount of RAM consumption that has not been committed yet by a single DWPT is at least 0.1% of the total RAM buffer size. This effectively batches updates to RAM accounting, similarly to what happens when using `IndexWriter#addDocuments` to index multiple documents at once.

Since updates to RAM accounting may be batched, `FlushPolicy` can no longer distinguish between inserts, updates and deletes, so all 3 methods got merged into a single one.

With this change, `IndexGeoNames` goes from ~22s to ~19s and the main offender for contention is now `DocumentsWriterPerThreadPool#getAndLock`.
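The batching described above can be sketched as follows: each DWPT accumulates uncommitted RAM deltas in a thread-confined counter and only publishes to the shared (contended) counter once they reach ~0.1% of the total RAM buffer. Names are illustrative, not the actual `DocumentsWriterFlushControl` fields.

```java
import java.util.concurrent.atomic.AtomicLong;

final class BatchedRamAccounting {
  private final AtomicLong globalRamBytes = new AtomicLong(); // shared, contended
  private final long publishThreshold;
  private long uncommittedBytes; // confined to the one thread holding this DWPT

  BatchedRamAccounting(long totalRamBufferBytes) {
    this.publishThreshold = totalRamBufferBytes / 1000; // 0.1% of the total buffer
  }

  // Called after every index operation; touches shared state only occasionally.
  void afterDocument(long ramBytesDelta) {
    uncommittedBytes += ramBytesDelta;
    if (Math.abs(uncommittedBytes) >= publishThreshold) {
      globalRamBytes.addAndGet(uncommittedBytes); // one shared update per batch
      uncommittedBytes = 0;
    }
  }

  long publishedRamBytes() {
    return globalRamBytes.get();
  }
}
```

Because several operations' deltas can be folded into one published update, the flush policy observing `globalRamBytes` can no longer attribute changes to individual inserts, updates or deletes, which is why the PR merges the three `FlushPolicy` callbacks into one.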
[GitHub] [lucene] gsmiller closed pull request #12089: Modify TermInSetQuery to "self optimize" if doc values are available
gsmiller closed pull request #12089: Modify TermInSetQuery to "self optimize" if doc values are available URL: https://github.com/apache/lucene/pull/12089
[GitHub] [lucene] gsmiller commented on pull request #12089: Modify TermInSetQuery to "self optimize" if doc values are available
gsmiller commented on PR #12089: URL: https://github.com/apache/lucene/pull/12089#issuecomment-1463886798 Closing this out for now. I think my last comment provides a pretty good summary of where things landed here. I did experiment a bit with other ways of estimating cost within `TermInSetQuery`'s score supplier, but couldn't come up with anything useful. The core problem remains that using field-level statistics just isn't granular enough to make the "right decision" in all cases between postings or doc values, so unless we do some term seeking, I don't see a way to improve it. Since there was significant push-back on doing this, I'll close out for now. Thanks again everyone!
[GitHub] [lucene] gsmiller commented on a diff in pull request #12184: [nocommit] Introduce ExpressionFacets along with a demo.
gsmiller commented on code in PR #12184: URL: https://github.com/apache/lucene/pull/12184#discussion_r1132530610

## lucene/facet/src/java/org/apache/lucene/facet/taxonomy/ExpressionFacets.java:
## @@ -0,0 +1,253 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.lucene.facet.taxonomy;
+
+import java.io.IOException;
+import java.text.ParseException;
+import java.util.Arrays;
+import java.util.HashMap;
+import java.util.HashSet;
+import java.util.List;
+import java.util.Locale;
+import java.util.Map;
+import java.util.Objects;
+import java.util.Set;
+import java.util.regex.Matcher;
+import java.util.regex.Pattern;
+import org.apache.lucene.expressions.Bindings;
+import org.apache.lucene.expressions.SimpleBindings;
+import org.apache.lucene.expressions.js.JavascriptCompiler;
+import org.apache.lucene.facet.FacetsCollector;
+import org.apache.lucene.facet.FacetsConfig;
+import org.apache.lucene.index.DocValues;
+import org.apache.lucene.index.LeafReaderContext;
+import org.apache.lucene.index.SortedNumericDocValues;
+import org.apache.lucene.search.ConjunctionUtils;
+import org.apache.lucene.search.DocIdSetIterator;
+import org.apache.lucene.search.DoubleValues;
+import org.apache.lucene.search.DoubleValuesSource;
+import org.apache.lucene.search.IndexSearcher;
+import org.apache.lucene.util.FixedBitSet;
+
+/**
+ * Facet by a provided expression string. Variables in the expression string are expected to be of
+ * the form: {@code val_<ref>_[sum|max]}, where {@code <ref>} must be a valid
+ * reference in the provided {@link Bindings}.
+ *
+ * nocommit: This is a simple demo and is incomplete and not tested.
+ */
+public class ExpressionFacets extends FloatTaxonomyFacets {
+  private static final Pattern RE = Pattern.compile("(?=(val_.+_(max|sum)))");
+  private static final Map<String, AssociationAggregationFunction> F_MAP = new HashMap<>();
+
+  static {
+    F_MAP.put("sum", AssociationAggregationFunction.SUM);
+    F_MAP.put("max", AssociationAggregationFunction.MAX);
+  }
+
+  public ExpressionFacets(
+      String indexFieldName,
+      TaxonomyReader taxoReader,
+      FacetsConfig config,
+      FacetsCollector facetsCollector,
+      String expression,
+      Bindings bindings)
+      throws IOException, ParseException {
+    super(indexFieldName, taxoReader, AssociationAggregationFunction.SUM, config);
+    Set<FieldAggregation> aggregations = parseAggregations(expression);
+    SimpleBindings aggregationBindings = new SimpleBindings();
+    aggregate(expression, bindings, aggregationBindings, aggregations, facetsCollector);
+  }
+
+  /**
+   * Identify the component aggregations needed by the expression. String parsing is not my strong
+   * suit, so I'm sure this is clunky at best and buggy at worst.
+   */
+  private static Set<FieldAggregation> parseAggregations(String expression) {
+    Set<FieldAggregation> result = new HashSet<>();
+    Matcher m = RE.matcher(expression);
+    while (m.find()) {
+      int start = m.start();
+      int sumPos = expression.indexOf("sum", start);
+      if (sumPos == -1) {
+        sumPos = Integer.MAX_VALUE;
+      }
+      int maxPos = expression.indexOf("max", start);
+      if (maxPos == -1) {
+        maxPos = Integer.MAX_VALUE;
+      }
+      int end = Math.min(sumPos, maxPos);
+      if (end == Integer.MAX_VALUE) {
+        throw new IllegalArgumentException("Invalid syntax");
+      }
+      end += 3;
+      String component = expression.substring(start, end);
+      String[] tokens = component.split("_");
+      if (tokens.length < 3 || "val".equals(tokens[0]) == false) {
+        throw new IllegalArgumentException("Invalid syntax");
+      }
+      AssociationAggregationFunction func =
+          F_MAP.get(tokens[tokens.length - 1].toLowerCase(Locale.ROOT));
+      String ref;
+      if (tokens.length > 3) {
+        StringBuilder sb = new StringBuilder();
+        for (int i = 1; i < tokens.length - 1; i++) {
+          sb.append(tokens[i]);
+          if (i < tokens.length - 2) {
+            sb.append("_");
+          }
+        }
+        ref = sb.toString();
+      } else {
+        ref = tokens[1];
+      }
+      result.add(new FieldAggregation(component, ref, func));
+    }
+
+    return result;
+  }
+
+  private void aggregate(
+      String expr
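The hand-rolled scan in `parseAggregations` above could be expressed more directly with a single capturing regex. This is an illustrative alternative, not part of the PR; it assumes the same `val_<ref>_(sum|max)` naming convention, with the reluctant `.+?` expanding just enough to let the trailing `_(sum|max)` match, so underscores inside the reference survive.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class AggregationParser {
  // One capturing regex: group(1) = binding reference, group(2) = aggregation function.
  private static final Pattern VAR = Pattern.compile("val_(.+?)_(sum|max)");

  // Returns "ref:func" pairs in the order they appear in the expression.
  static List<String> parse(String expression) {
    List<String> result = new ArrayList<>();
    Matcher m = VAR.matcher(expression);
    while (m.find()) {
      result.add(m.group(1) + ":" + m.group(2));
    }
    return result;
  }
}
```

For example, `val_units_sold_max` parses as reference `units_sold` with function `max`, the same split the manual token walk produces.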
[GitHub] [lucene] gsmiller commented on pull request #11746: Deprecate LongValueFacetCounts#getTopChildrenSortByCount since it provides redundant functionality
gsmiller commented on PR #11746: URL: https://github.com/apache/lucene/pull/11746#issuecomment-1463976493 I backported this manually. Closing out the PR.
[GitHub] [lucene] gsmiller commented on pull request #12171: Add APIs to get ordinal and category cache hit/miss count and hit rate in DirectoryTaxonomyReader
gsmiller commented on PR #12171: URL: https://github.com/apache/lucene/pull/12171#issuecomment-1463977181 Closing this out for now since I don't think we want to add these APIs.
[GitHub] [lucene] gsmiller closed pull request #12171: Add APIs to get ordinal and category cache hit/miss count and hit rate in DirectoryTaxonomyReader
gsmiller closed pull request #12171: Add APIs to get ordinal and category cache hit/miss count and hit rate in DirectoryTaxonomyReader URL: https://github.com/apache/lucene/pull/12171
[GitHub] [lucene] gsmiller closed pull request #11746: Deprecate LongValueFacetCounts#getTopChildrenSortByCount since it provides redundant functionality
gsmiller closed pull request #11746: Deprecate LongValueFacetCounts#getTopChildrenSortByCount since it provides redundant functionality URL: https://github.com/apache/lucene/pull/11746
[GitHub] [lucene] gsmiller commented on pull request #12171: Add APIs to get ordinal and category cache hit/miss count and hit rate in DirectoryTaxonomyReader
gsmiller commented on PR #12171: URL: https://github.com/apache/lucene/pull/12171#issuecomment-1463977754 Thanks again @shubhamvishu for raising the idea and looking into the effectiveness of the caches more generally!
[GitHub] [lucene] gsmiller commented on pull request #11739: DRAFT: TermInSetQuery refactored to extend MultiTermsQuery
gsmiller commented on PR #11739: URL: https://github.com/apache/lucene/pull/11739#issuecomment-1463987636 We finally did this in #12156!
[GitHub] [lucene] gsmiller closed pull request #11739: DRAFT: TermInSetQuery refactored to extend MultiTermsQuery
gsmiller closed pull request #11739: DRAFT: TermInSetQuery refactored to extend MultiTermsQuery URL: https://github.com/apache/lucene/pull/11739
[GitHub] [lucene] gsmiller closed pull request #11741: DRAFT: Experiment with intersecting TermInSetQuery terms up-front to better estimate cost
gsmiller closed pull request #11741: DRAFT: Experiment with intersecting TermInSetQuery terms up-front to better estimate cost URL: https://github.com/apache/lucene/pull/11741
[GitHub] [lucene] jpountz opened a new pull request, #12199: Reduce contention in DocumentsWriterPerThreadPool.
jpountz opened a new pull request, #12199: URL: https://github.com/apache/lucene/pull/12199 Obtaining a DWPT and putting it back into the pool is subject to contention. This change reduces contention by using 8 sub pools that are tried sequentially. When applied on top of #12198, this reduces the time to index geonames with 20 threads from ~19s to ~16-17s.
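The striping idea can be sketched as follows: several independently locked sub-pools are tried in sequence, so concurrent threads usually contend on different monitors. This is illustrative only; `DocumentsWriterPerThreadPool`'s real free list differs in its details.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.function.Supplier;

final class StripedPool<T> {
  private static final int STRIPES = 8; // power of two so we can mask instead of mod
  private final Deque<T>[] stripes;     // each deque guarded by its own monitor

  @SuppressWarnings("unchecked")
  StripedPool() {
    stripes = new Deque[STRIPES];
    for (int i = 0; i < STRIPES; i++) {
      stripes[i] = new ArrayDeque<>();
    }
  }

  T acquire(Supplier<T> factory) {
    int start = (int) Thread.currentThread().getId(); // spread threads over stripes
    for (int i = 0; i < STRIPES; i++) {
      Deque<T> stripe = stripes[(start + i) & (STRIPES - 1)];
      synchronized (stripe) {
        T pooled = stripe.poll();
        if (pooled != null) {
          return pooled;
        }
      }
    }
    return factory.get(); // every stripe was empty: create a fresh instance
  }

  void release(T t) {
    Deque<T> stripe = stripes[(int) Thread.currentThread().getId() & (STRIPES - 1)];
    synchronized (stripe) {
      stripe.push(t);
    }
  }
}
```

Each lock now protects roughly 1/8th of the traffic, and a thread that finds its "home" stripe contended or empty simply moves on to the next one instead of blocking on a single global monitor.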
[GitHub] [lucene] gsmiller commented on pull request #11780: GH#11601: Add ability to compute reader states after refresh
gsmiller commented on PR #11780: URL: https://github.com/apache/lucene/pull/11780#issuecomment-1464091101

Sorry @stefanvodita, just now coming back to this after a break.

OK, so current state-of-the-world: Right now, users have to instantiate reader state instances for each field they're using for SSDV faceting so they can provide those to SSDVFacets. These reader state instances must be re-created when the index is refreshed, and users have to manage this on their own. Creating these state instances is expensive, so users need a way to keep track of them, but also refresh them when necessary. To be fair though, users have all the capabilities they need to do efficient SSDV faceting; it's just a lot to keep track of (in my opinion anyway). So I think the real question here is whether we can do something that makes users' lives easier w.r.t. keeping track of these states and refreshing them when necessary. I think Mike's point-of-view is that maybe this isn't such a big deal and may not be necessary?

If we _do_ want to explore ways to make this interaction easier for users, I think `FacetsConfig` is the only realistic "hook" in some sense. One issue is that there is actually no current place where we can know what SSDV fields in the index are being used for faceting. We can't rely on the "dim config" exposed by FacetsConfig since users are not required to register any specific dim config for a faceting field if they're happy with the defaults.

I wonder if one option to explore is to have a `FacetsConfig` instance keep track of the SSDV fields being used for faceting when it builds docs? That's the one place we actually know that an SSDV field is being added for the purpose of faceting. If the `FacetsConfig` instance kept track of SSDV faceting fields in this way, we could add a helper method to `FacetsConfig` that takes an `IndexReader` as input and creates a new set of reader states.
Then, we could consider a different ctor for SSDVFacets that takes a `FacetsConfig` instance instead of taking a state instance directly, and the `FacetsConfig` instance could provide the correct state instance. I think there's some complexity in here though, since a single `FacetsConfig` instance could be used with different index readers while a refresh is happening, so it would probably need to keep track of a state instance for each SSDV field + IndexReader or something. I don't know. This might all just be too complicated to be worth it, but it would be simpler for users if all they had to do was provide their `FacetsConfig` instance to SSDVFacets and then call some method on that `FacetsConfig` instance whenever they refresh their index.
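The bookkeeping being proposed could be sketched like this: a tracker records each SSDV facet field as documents are built, then creates (and caches) one state per field for a given reader generation, keyed by that reader's cache key. This is a hypothetical sketch only; nothing like `SsdvStateTracker` exists in Lucene, and all names and types here are placeholders.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

final class SsdvStateTracker<S> {
  private final Set<String> ssdvFields = ConcurrentHashMap.newKeySet();
  private final Map<Object, Map<String, S>> statesByReader = new ConcurrentHashMap<>();

  // Would be called from FacetsConfig#build when an SSDV facet field is added,
  // the one place we know a field is being used for SSDV faceting.
  void onSsdvFieldIndexed(String field) {
    ssdvFields.add(field);
  }

  // Returns the per-field states for one reader generation, building them at
  // most once per (reader, field); the expensive work happens only on refresh.
  Map<String, S> statesFor(Object readerCacheKey, Function<String, S> stateFactory) {
    return statesByReader.computeIfAbsent(
        readerCacheKey,
        k -> {
          Map<String, S> states = new HashMap<>();
          for (String field : ssdvFields) {
            states.put(field, stateFactory.apply(field));
          }
          return states;
        });
  }
}
```

Keying on the reader's cache key is what handles the "different index readers while a refresh is happening" concern: old and new readers each get their own state map, and the old one can be evicted when its reader closes.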
[GitHub] [lucene] gsmiller commented on a diff in pull request #11780: GH#11601: Add ability to compute reader states after refresh
gsmiller commented on code in PR #11780: URL: https://github.com/apache/lucene/pull/11780#discussion_r1132608504

## lucene/facet/src/java/org/apache/lucene/facet/sortedset/SSDVReaderStatesCalculator.java:
## @@ -0,0 +1,51 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.lucene.facet.sortedset;
+
+import java.io.IOException;
+import java.util.List;
+import java.util.stream.Collectors;
+import org.apache.lucene.facet.FacetsConfig;
+import org.apache.lucene.search.IndexSearcher;
+import org.apache.lucene.search.ReferenceManager;
+
+/**
+ * Get reader states after a refresh. Example call: readerStates =
+ * searcherManager.maybeRefreshAndReturnState(new SSDVReaderStatesCalculator(), config);
+ */
+public class SSDVReaderStatesCalculator
+    implements ReferenceManager.StateCalculator<
+        List<SortedSetDocValuesReaderState>, IndexSearcher, FacetsConfig> {
+
+  /** No arguments needed */
+  public SSDVReaderStatesCalculator() {}
+
+  @Override
+  public List<SortedSetDocValuesReaderState> calculate(IndexSearcher current, FacetsConfig config) {
+    return config.getDimConfigs().keySet().stream()

Review Comment: One issue here is that the user may have indexed SSDVFacetFields without making any configuration changes (i.e., the default dim config was fine for them). In this case, the dim configs you reference here won't know about the fields.
[GitHub] [lucene] dweiss commented on a diff in pull request #12199: Reduce contention in DocumentsWriterPerThreadPool.
dweiss commented on code in PR #12199: URL: https://github.com/apache/lucene/pull/12199#discussion_r1132628795

## lucene/core/src/java/org/apache/lucene/index/DocumentsWriterPerThreadPool.java:
## @@ -113,15 +113,19 @@ private synchronized DocumentsWriterPerThread newWriter() {
    * operation (add/updateDocument).
    */
   DocumentsWriterPerThread getAndLock() {
-    synchronized (this) {
-      ensureOpen();
-      DocumentsWriterPerThread dwpt = freeList.poll(DocumentsWriterPerThread::tryLock);
-      if (dwpt == null) {
-        dwpt = newWriter();
+    ensureOpen();
+    DocumentsWriterPerThread dwpt = freeList.poll(DocumentsWriterPerThread::tryLock);
+    if (dwpt == null) {

Review Comment: For some reason, double locking always makes me cringe. There are lengthy discussions about this idiom all around the web (visibility of partially constructed objects).
[GitHub] [lucene] jpountz commented on a diff in pull request #12199: Reduce contention in DocumentsWriterPerThreadPool.
jpountz commented on code in PR #12199: URL: https://github.com/apache/lucene/pull/12199#discussion_r1132647604

## lucene/core/src/java/org/apache/lucene/index/DocumentsWriterPerThreadPool.java:
## @@ -113,15 +113,19 @@ private synchronized DocumentsWriterPerThread newWriter() {
    * operation (add/updateDocument).
    */
   DocumentsWriterPerThread getAndLock() {
-    synchronized (this) {
-      ensureOpen();
-      DocumentsWriterPerThread dwpt = freeList.poll(DocumentsWriterPerThread::tryLock);
-      if (dwpt == null) {
-        dwpt = newWriter();
+    ensureOpen();
+    DocumentsWriterPerThread dwpt = freeList.poll(DocumentsWriterPerThread::tryLock);
+    if (dwpt == null) {

Review Comment: I don't think that this is a case of double-checked locking. In general double-checked locking tries to reduce the overhead of acquiring a lock by adding a quick check before the lock. Here the logic is different, it would be legal to remove the call to `poll` under lock, I only added it because there is a chance that a thread had to wait on the lock, so we could check if another DWPT was added to the queue in the meantime in order to save creating a new DWPT. I don't think it's actually important, I could remove the second call to `poll` to make it look less like double-checked locking.
[GitHub] [lucene] dweiss commented on a diff in pull request #12199: Reduce contention in DocumentsWriterPerThreadPool.
dweiss commented on code in PR #12199: URL: https://github.com/apache/lucene/pull/12199#discussion_r1132698882

## lucene/core/src/java/org/apache/lucene/index/DocumentsWriterPerThreadPool.java:
## @@ -113,15 +113,19 @@ private synchronized DocumentsWriterPerThread newWriter() {
    * operation (add/updateDocument).
    */
   DocumentsWriterPerThread getAndLock() {
-    synchronized (this) {
-      ensureOpen();
-      DocumentsWriterPerThread dwpt = freeList.poll(DocumentsWriterPerThread::tryLock);
-      if (dwpt == null) {
-        dwpt = newWriter();
+    ensureOpen();
+    DocumentsWriterPerThread dwpt = freeList.poll(DocumentsWriterPerThread::tryLock);
+    if (dwpt == null) {

Review Comment: It's not about the poll. It's about whether `dwpt = newWriter();` can be expanded and reordered by the compiler so that dwpt gets assigned a value before the constructor (or whatever initialization inside newWriter) takes place. Any thread checking dwpt == null outside synchronized could, at least in theory, see such a "partially constructed" object.
[GitHub] [lucene] dweiss commented on a diff in pull request #12199: Reduce contention in DocumentsWriterPerThreadPool.
dweiss commented on code in PR #12199: URL: https://github.com/apache/lucene/pull/12199#discussion_r1132703519

## lucene/core/src/java/org/apache/lucene/index/DocumentsWriterPerThreadPool.java:
## @@ -113,15 +113,19 @@ private synchronized DocumentsWriterPerThread newWriter() {
    * operation (add/updateDocument).
    */
   DocumentsWriterPerThread getAndLock() {
-    synchronized (this) {
-      ensureOpen();
-      DocumentsWriterPerThread dwpt = freeList.poll(DocumentsWriterPerThread::tryLock);
-      if (dwpt == null) {
-        dwpt = newWriter();
+    ensureOpen();
+    DocumentsWriterPerThread dwpt = freeList.poll(DocumentsWriterPerThread::tryLock);
+    if (dwpt == null) {

Review Comment: Alex Shipilev had a nice writeup about it - found it here: https://shipilev.net/blog/2014/safe-public-construction/
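For reference, the classic double-checked locking idiom under discussion looks like this. Without `volatile` on the field, the unsynchronized first read may observe a reference to a partially constructed object, because the store of the reference can be reordered before the constructor's writes; since Java 5, `volatile` establishes the happens-before edge that makes the idiom safe. Whether `getAndLock` actually matches this pattern is exactly what the thread debates; this block only illustrates the idiom itself.

```java
final class LazySingleton {
  private static volatile LazySingleton instance; // volatile is the load-bearing part
  private final int value;

  private LazySingleton() {
    this.value = 42;
  }

  static LazySingleton get() {
    LazySingleton local = instance; // first check: no lock taken
    if (local == null) {
      synchronized (LazySingleton.class) {
        local = instance; // second check: under the lock
        if (local == null) {
          instance = local = new LazySingleton();
        }
      }
    }
    return local;
  }

  int value() {
    return value;
  }
}
```

Reading `instance` into the `local` variable once also avoids a second volatile read on the fast path, a detail the Shipilëv post linked above walks through in depth.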
[GitHub] [lucene] mdmarshmallow commented on pull request #12194: [GITHUB-11915] [Discussion Only] Make Lucene smarter about long runs of matches via new API on DISI
mdmarshmallow commented on PR #12194: URL: https://github.com/apache/lucene/pull/12194#issuecomment-1464188468

> I think @mdmarshmallow might be working on this as per [#11915 (comment)](https://github.com/apache/lucene/issues/11915#issuecomment-1459502217). As part of this PR, I've also added a new API to `BitSet#nextClearBit` just like JDK's `BitSet` API, which might be useful here as well.

Yeah I am working on it! It should hopefully be done by today as it's not that big of a change :)
[GitHub] [lucene] jpountz commented on a diff in pull request #12199: Reduce contention in DocumentsWriterPerThreadPool.
jpountz commented on code in PR #12199: URL: https://github.com/apache/lucene/pull/12199#discussion_r1132726733

## lucene/core/src/java/org/apache/lucene/index/DocumentsWriterPerThreadPool.java

Review Comment: Thanks for sharing this blog post - I remember reading it in the past, and it was a good re-read. I mentioned the polls because I thought they were what made you see this code as a case of double-checked locking: there is a first call before the lock and another one under the lock, like the null checks in double-checked locking with singletons. I need to head off for the weekend; I'll try to post a convincing explanation of why this is not a case of double-checked locking, and why it is safe, next week. By the way, thanks for looking!
[GitHub] [lucene] dweiss commented on a diff in pull request #12199: Reduce contention in DocumentsWriterPerThreadPool.
dweiss commented on code in PR #12199: URL: https://github.com/apache/lucene/pull/12199#discussion_r1132739002

## lucene/core/src/java/org/apache/lucene/index/DocumentsWriterPerThreadPool.java

Review Comment: No need to convince me, @jpountz - I just expressed my reluctance at it because, well, it requires convincing. :) Unless there's really a huge gain, I typically just go with what I understand works (making the field volatile, for example).
[GitHub] [lucene] frzhanguber opened a new issue, #12200: Lucene94HnswVectorsReader integer overflow when calculating graphOffsetsByLevel
frzhanguber opened a new issue, #12200: URL: https://github.com/apache/lucene/issues/12200

### Description

When loading a dense graph built with M=64 and beamWidth=400, the `graphOffsetsByLevel` calculation below overflows: since `M`, `Integer.BYTES`, and `numNodesOnLevel0` are all `int`, the multiplication is performed in 32-bit arithmetic, even though the result is a `long` (a file offset).

```
graphOffsetsByLevel[level] = (1 + (M * 2)) * Integer.BYTES * numNodesOnLevel0;
```

A fix could be:

```
graphOffsetsByLevel[level] = (1 + (M * 2)) * Integer.BYTES * (long) numNodesOnLevel0;
```

The code snippet:

```
// calculate for each level the start offsets in vectorIndex file from where to read
// neighbours
graphOffsetsByLevel = new long[numLevels];
for (int level = 0; level < numLevels; level++) {
  if (level == 0) {
    graphOffsetsByLevel[level] = 0;
  } else if (level == 1) {
    int numNodesOnLevel0 = size;
    graphOffsetsByLevel[level] = (1 + (M * 2)) * Integer.BYTES * numNodesOnLevel0;
  } else {
    int numNodesOnPrevLevel = nodesByLevel[level - 1].length;
    graphOffsetsByLevel[level] =
        graphOffsetsByLevel[level - 1] + (1 + M) * Integer.BYTES * numNodesOnPrevLevel;
  }
}
```

### Version and environment details

Lucene 9.4.1
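The overflow is easy to reproduce in isolation. The sketch below assumes a level-0 node count of 17 million (any count large enough that the product exceeds `Integer.MAX_VALUE` will do); it demonstrates the arithmetic only and is not the reader code itself:

```java
public class GraphOffsetOverflowDemo {
    public static void main(String[] args) {
        // Hypothetical dense graph: M=64, ~17M nodes on level 0.
        int M = 64;
        int numNodesOnLevel0 = 17_000_000;

        // All operands are int, so the product is computed in 32-bit
        // arithmetic and wraps before the widening assignment to long.
        long broken = (1 + (M * 2)) * Integer.BYTES * numNodesOnLevel0;

        // Casting one operand to long promotes the whole expression to
        // 64-bit arithmetic, which is the proposed fix.
        long fixed = (1 + (M * 2)) * Integer.BYTES * (long) numNodesOnLevel0;

        System.out.println(broken); // wrapped 32-bit value, far from the true offset
        System.out.println(fixed);  // 8772000000
    }
}
```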
[GitHub] [lucene] gsmiller commented on pull request #12194: [GITHUB-11915] [Discussion Only] Make Lucene smarter about long runs of matches via new API on DISI
gsmiller commented on PR #12194: URL: https://github.com/apache/lucene/pull/12194#issuecomment-1464329664 Woohoo! Thanks @zacharymorn / @mdmarshmallow! I suspect we may not really see any benefit though if the DISI can only expose the next non-matching doc within its current block. I think the real advantage here would come from being able to actually skip blocks in the DISI, which would rely on knowing that there are actually multiple consecutive "dense" blocks in a DISI. But... if our only way to know there are multiple consecutive "dense" blocks involves decoding those blocks anyway, maybe there isn't much gain to be had? Hmm... not sure. Excited to see what we learn though!
[GitHub] [lucene] shaikhu opened a new pull request, #12201: Github 10633 Update Javadoc comment to mention gradle instead of ant
shaikhu opened a new pull request, #12201: URL: https://github.com/apache/lucene/pull/12201 I believe this is a fix for issue [10633](https://github.com/apache/lucene/issues/10633). Although the path is slightly different, this is the only class I could find with the name `TestBackwardsCompatibility`.
[GitHub] [lucene] rmuir closed issue #12200: Lucene94HnswVectorsReader integer overflow when calculating graphOffsetsByLevel
rmuir closed issue #12200: Lucene94HnswVectorsReader integer overflow when calculating graphOffsetsByLevel URL: https://github.com/apache/lucene/issues/12200
[GitHub] [lucene] rmuir commented on issue #12200: Lucene94HnswVectorsReader integer overflow when calculating graphOffsetsByLevel
rmuir commented on issue #12200: URL: https://github.com/apache/lucene/issues/12200#issuecomment-1464410196 duplicate of #11905
[GitHub] [lucene] erikhatcher commented on pull request #12159: Remove the Now Unused Class `pointInPolygon`.
erikhatcher commented on PR #12159: URL: https://github.com/apache/lucene/pull/12159#issuecomment-1464504770 Looks good to me. Eagle eye, Marcus.
[GitHub] [lucene] gsmiller commented on pull request #12153: Unrelated code in TestIndexSortSortedNumericDocValuesRangeQuery
gsmiller commented on PR #12153: URL: https://github.com/apache/lucene/pull/12153#issuecomment-1464578469 +1 Did a little digging, and it appears to have been carried over from some copy/paste of `TestDocValuesQueries#testToString`. Let's remove. We've already got this (exact) coverage over in `TestDocValuesQueries`.
[GitHub] [lucene] zacharymorn commented on pull request #12194: [GITHUB-11915] [Discussion Only] Make Lucene smarter about long runs of matches via new API on DISI
zacharymorn commented on PR #12194: URL: https://github.com/apache/lucene/pull/12194#issuecomment-1464741062

Thanks @mdmarshmallow for working on it! Btw, I just pushed a commit (https://github.com/apache/lucene/pull/12194/commits/f78182bae7e92b23136b5975dbb9d3199e5e3065) that fixed some bugs identified by randomized tests; you may want to pull that for your work, especially if you are using the `BitSet#nextClearBit` method.

> I suspect we may not really see any benefit though if the DISI can only expose the next non-matching doc within its current block. I think the real advantage here would come from being able to actually skip blocks in the DISI, which would rely on knowing that there are actually multiple consecutive "dense" blocks in a DISI. But... if our only way to know there are multiple consecutive "dense" blocks involves decoding those blocks anyway, maybe there isn't much gain to be had? Hmm... not sure. Excited to see what we learn though!

@gsmiller Yeah I'm guessing that as well, especially for postings and sparse/dense blocks, as it would take at least one pass to identify the next candidate. I have tried caching the result as well, and would like to see if that helps and how it performs under benchmark.
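For context, the new Lucene `BitSet#nextClearBit` is described as mirroring the JDK API of the same name, whose semantics can be illustrated with `java.util.BitSet` (the bit positions here are made up for the example):

```java
import java.util.BitSet;

public class NextClearBitDemo {
    public static void main(String[] args) {
        BitSet bits = new BitSet();
        bits.set(0, 64); // a dense run: bits 0..63 all set
        bits.set(65);

        // nextClearBit(fromIndex) returns the index of the first clear bit
        // at or after fromIndex, so a caller can hop over a dense run of
        // matches in a single call instead of testing bit by bit.
        System.out.println(bits.nextClearBit(0));  // 64
        System.out.println(bits.nextClearBit(65)); // 66
    }
}
```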
[GitHub] [lucene] mdmarshmallow commented on pull request #12194: [GITHUB-11915] [Discussion Only] Make Lucene smarter about long runs of matches via new API on DISI
mdmarshmallow commented on PR #12194: URL: https://github.com/apache/lucene/pull/12194#issuecomment-1464803846 Yeah I was doing some of my own debugging and saw some of those issues. I think this fixed a decent amount of the issues I was seeing but I'm still seeing problems with some tests. I'm not sure if it's an issue with my code or not though, so I'll need to dig a bit deeper.