[GitHub] [lucene] zacharymorn commented on pull request #12194: [GITHUB-11915] [Discussion Only] Make Lucene smarter about long runs of matches via new API on DISI

2023-03-10 Thread via GitHub


zacharymorn commented on PR #12194:
URL: https://github.com/apache/lucene/pull/12194#issuecomment-1463441417

   Thanks @gsmiller for your review and suggestions!
   
   > What about updating FixedBitSet#or(disi) to use this? That's used when 
rewriting MultiTermQuery instances, and I would think we'd see a performance 
improvement to prefix3 and wildcard benchmark tasks. I guess the thinking there 
would be to "flip" an entire long to -1L at once if a run of 64 docs is 
included in a dense DISI sent to the or (and then subsequently advance beyond 
the "dense run" in the DISI).
   > 
   > This idea will probably have diminishing returns after 
https://github.com/apache/lucene/pull/12055, since we now prioritize only 
building the more sparse iterators into the bitset upfront, but it could still 
help. If you really want to look for impact, try switching RegexpQuery and 
PrefixQuery to use CONSTANT_SCORE_REWRITE instead of 
CONSTANT_SCORE_BLENDED_REWRITE (in their constructors). That should highlight 
an impact in the benchmark.
   
   I think @mdmarshmallow might be working on this as per 
https://github.com/apache/lucene/issues/11915#issuecomment-1459502217.  As part 
of this PR, I've also added a new API to `BitSet#nextClearBit` just like JDK's 
`BitSet` API,  which might be useful here as well. 
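   The "flip an entire long to -1L" idea quoted above can be sketched outside 
Lucene. The following is an illustrative standalone version (not Lucene's 
actual `FixedBitSet` code) of OR-ing a dense run `[runStart, runEnd)` into a 
`long[]` bitset, writing fully covered 64-bit words in one store:

```java
import java.util.Arrays;

// Illustrative sketch (not Lucene's actual FixedBitSet code): OR a dense run of
// doc IDs [runStart, runEnd) into a long[] bitset. Fully covered 64-bit words
// are set to -1L in a single write instead of 64 single-bit updates.
public class DenseRunOr {
  public static void orRange(long[] bits, int runStart, int runEnd) {
    if (runStart >= runEnd) {
      return; // empty run, nothing to set
    }
    int startWord = runStart >> 6;
    int endWord = (runEnd - 1) >> 6;
    long startMask = -1L << runStart;  // bits >= runStart within the first word
    long endMask = -1L >>> -runEnd;    // bits < runEnd within the last word
    if (startWord == endWord) {
      bits[startWord] |= (startMask & endMask);
      return;
    }
    bits[startWord] |= startMask;
    Arrays.fill(bits, startWord + 1, endWord, -1L); // the "flip a whole long" step
    bits[endWord] |= endMask;
  }
}
```

   A DISI that can report where its current run of matches ends (e.g. via the 
proposed `peekNextNonMatchingDocID`) would let `or(disi)` call something like 
this once per run and then advance past the run.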
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] zacharymorn commented on a diff in pull request #12194: [GITHUB-11915] [Discussion Only] Make Lucene smarter about long runs of matches via new API on DISI

2023-03-10 Thread via GitHub


zacharymorn commented on code in PR #12194:
URL: https://github.com/apache/lucene/pull/12194#discussion_r1132077572


##
lucene/core/src/java/org/apache/lucene/codecs/lucene90/Lucene90PostingsReader.java:
##
@@ -479,6 +481,31 @@ private void refillDocs() throws IOException {
   assert docBuffer[BLOCK_SIZE] == NO_MORE_DOCS;
 }
 
+@Override
+public int peekNextNonMatchingDocID() throws IOException {

Review Comment:
   Thanks for the suggestion! This should indeed give us a good speedup if we 
don't have to linearly scan docBuffer to locate the next non-matching doc. 
However, since I'm trying not to load the subsequent block (callers still need 
the iterator to remain positioned in the current block's state for 
nextDoc/advance), we may not be able to optimize the case where a consecutive 
run of docs spans multiple blocks. 
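   As a standalone sketch of the single-block variant being discussed (the 
method name echoes the PR, but the logic here is illustrative, not the actual 
Lucene90PostingsReader code): a fully dense remainder of the block can be 
detected in O(1) by comparing the doc-ID delta to the index delta, falling back 
to a linear scan otherwise, and the method never looks past the current block:

```java
// Hypothetical single-block sketch of peekNextNonMatchingDocID: docBuffer holds
// the docs of the current block, upto is the current position, count is the
// number of buffered docs. The method never loads the next block, so a run that
// spans blocks is reported as ending at the block boundary.
public class PeekDemo {
  public static int peekNextNonMatching(int[] docBuffer, int upto, int count) {
    // O(1) check: if (last doc - current doc) equals the number of slots
    // between them, every doc in between is present and the rest of the
    // block is one consecutive run.
    if (docBuffer[count - 1] - docBuffer[upto] == count - 1 - upto) {
      return docBuffer[count - 1] + 1;
    }
    // Partially dense block: linear scan to the first gap.
    int i = upto;
    while (i + 1 < count && docBuffer[i + 1] == docBuffer[i] + 1) {
      i++;
    }
    return docBuffer[i] + 1;
  }
}
```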






[GitHub] [lucene] alessandrobenedetti commented on pull request #12169: Introduced the Word2VecSynonymFilter

2023-03-10 Thread via GitHub


alessandrobenedetti commented on PR #12169:
URL: https://github.com/apache/lucene/pull/12169#issuecomment-1463774563

   I'll leave the BoostAttribute discussion for another time as I don't have 
access to the code right now, but javadocs or not, it seems extremely 
suspicious to have a public class that shouldn't be used; quite frankly, it 
smells of bad practice.
   
   Regarding "should this be a token filter or something else": I don't see 
any particular difference from the regular synonym token filter (it's just a 
different way of providing synonyms, from a machine-learned model rather than 
an explicit file).
   The boost side of things (which should be optional) is designed for 
query-time usage only (if I remember correctly, we no longer have any kind of 
index-time boost, nor was there ever an index-time per-term boost).
   Not much difference from the weighted-synonyms approach. 





[GitHub] [lucene] jpountz opened a new pull request, #12198: Reduce contention in DocumentsWriterFlushControl.

2023-03-10 Thread via GitHub


jpountz opened a new pull request, #12198:
URL: https://github.com/apache/lucene/pull/12198

   lucene-util's `IndexGeoNames` benchmark is heavily contended when running 
with many indexing threads, 20 in my case. The main offender is 
`DocumentsWriterFlushControl#doAfterDocument`, which runs after every index 
operation to update doc and RAM accounting.
   
   This change reduces contention by only updating RAM accounting if the amount 
of RAM consumption that has not been committed yet by a single DWPT is at least 
0.1% of the total RAM buffer size. This effectively batches updates to RAM 
accounting, similarly to what happens when using `IndexWriter#addDocuments` to 
index multiple documents at once. Since updates to RAM accounting may be 
batched, `FlushPolicy` can no longer distinguish between inserts, updates and 
deletes, so all 3 methods got merged into a single one.
   
   With this change, `IndexGeoNames` goes from ~22s to ~19s and the main 
offender for contention is now `DocumentsWriterPerThreadPool#getAndLock`.
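   The batching described above can be sketched with a per-thread uncommitted 
counter. This is an illustrative standalone version, not the actual 
`DocumentsWriterFlushControl` code; the 0.1%-of-budget threshold matches the 
description, everything else is an assumption:

```java
import java.util.concurrent.atomic.AtomicLong;

// Sketch of batched RAM accounting: each writer thread accumulates uncommitted
// bytes locally and only folds them into the shared counter once they exceed
// 0.1% of the total RAM budget, trading accounting granularity for fewer
// contended updates to the shared counter.
public class BatchedRamAccounting {
  /** Per-thread (uncontended) accumulator. */
  public static final class ThreadState {
    long uncommitted;
  }

  private final AtomicLong globalBytes = new AtomicLong();
  private final long commitThreshold;

  public BatchedRamAccounting(long ramBudgetBytes) {
    this.commitThreshold = Math.max(1, ramBudgetBytes / 1000); // 0.1% of the budget
  }

  /** Returns the bytes folded into the shared counter, or 0 if still batched locally. */
  public long addUncommitted(ThreadState state, long bytes) {
    state.uncommitted += bytes;
    if (state.uncommitted < commitThreshold) {
      return 0; // below threshold: skip the contended shared update
    }
    long delta = state.uncommitted;
    state.uncommitted = 0;
    globalBytes.addAndGet(delta); // one shared update covers many documents
    return delta;
  }

  public long committedBytes() {
    return globalBytes.get();
  }
}
```

   This also shows why a flush policy on top of such a counter can no longer 
tell individual inserts, updates, and deletes apart: by the time the shared 
counter moves, several operations have been coalesced into one delta.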
   





[GitHub] [lucene] gsmiller closed pull request #12089: Modify TermInSetQuery to "self optimize" if doc values are available

2023-03-10 Thread via GitHub


gsmiller closed pull request #12089: Modify TermInSetQuery to "self optimize" 
if doc values are available
URL: https://github.com/apache/lucene/pull/12089





[GitHub] [lucene] gsmiller commented on pull request #12089: Modify TermInSetQuery to "self optimize" if doc values are available

2023-03-10 Thread via GitHub


gsmiller commented on PR #12089:
URL: https://github.com/apache/lucene/pull/12089#issuecomment-1463886798

   Closing this out for now. I think my last comment provides a pretty good 
summary of where things landed here. I did experiment a bit with other ways of 
estimating cost within `TermInSetQuery`'s score supplier, but couldn't come up 
with anything useful. The core problem remains that using field-level 
statistics just isn't granular enough to make the "right decision" in all cases 
between postings or doc values, so unless we do some term seeking, I don't see 
a way to improve it. Since there was significant push-back on doing this, I'll 
close out for now. Thanks again everyone!





[GitHub] [lucene] gsmiller commented on a diff in pull request #12184: [nocommit] Introduce ExpressionFacets along with a demo.

2023-03-10 Thread via GitHub


gsmiller commented on code in PR #12184:
URL: https://github.com/apache/lucene/pull/12184#discussion_r1132530610


##
lucene/facet/src/java/org/apache/lucene/facet/taxonomy/ExpressionFacets.java:
##
@@ -0,0 +1,253 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.lucene.facet.taxonomy;
+
+import java.io.IOException;
+import java.text.ParseException;
+import java.util.Arrays;
+import java.util.HashMap;
+import java.util.HashSet;
+import java.util.List;
+import java.util.Locale;
+import java.util.Map;
+import java.util.Objects;
+import java.util.Set;
+import java.util.regex.Matcher;
+import java.util.regex.Pattern;
+import org.apache.lucene.expressions.Bindings;
+import org.apache.lucene.expressions.SimpleBindings;
+import org.apache.lucene.expressions.js.JavascriptCompiler;
+import org.apache.lucene.facet.FacetsCollector;
+import org.apache.lucene.facet.FacetsConfig;
+import org.apache.lucene.index.DocValues;
+import org.apache.lucene.index.LeafReaderContext;
+import org.apache.lucene.index.SortedNumericDocValues;
+import org.apache.lucene.search.ConjunctionUtils;
+import org.apache.lucene.search.DocIdSetIterator;
+import org.apache.lucene.search.DoubleValues;
+import org.apache.lucene.search.DoubleValuesSource;
+import org.apache.lucene.search.IndexSearcher;
+import org.apache.lucene.util.FixedBitSet;
+
+/**
+ * Facet by a provided expression string. Variables in the expression string are expected to be of
+ * the form: {@code val_<ref>_[sum|max]}, where {@code <ref>} must be a valid
+ * reference in the provided {@link Bindings}.
+ *
+ * nocommit: This is a simple demo and is incomplete and not tested.
+ */
+public class ExpressionFacets extends FloatTaxonomyFacets {
+  private static final Pattern RE = Pattern.compile("(?=(val_.+_(max|sum)))");
+  private static final Map<String, AssociationAggregationFunction> F_MAP = new HashMap<>();
+
+  static {
+F_MAP.put("sum", AssociationAggregationFunction.SUM);
+F_MAP.put("max", AssociationAggregationFunction.MAX);
+  }
+
+  public ExpressionFacets(
+  String indexFieldName,
+  TaxonomyReader taxoReader,
+  FacetsConfig config,
+  FacetsCollector facetsCollector,
+  String expression,
+  Bindings bindings)
+  throws IOException, ParseException {
+super(indexFieldName, taxoReader, AssociationAggregationFunction.SUM, 
config);
+Set<FieldAggregation> aggregations = parseAggregations(expression);
+SimpleBindings aggregationBindings = new SimpleBindings();
+aggregate(expression, bindings, aggregationBindings, aggregations, 
facetsCollector);
+  }
+
+  /**
+   * Identify the component aggregations needed by the expression. String 
parsing is not my strong
+   * suit, so I'm sure this is clunky at best and buggy at worst.
+   */
+  private static Set<FieldAggregation> parseAggregations(String expression) {
+Set<FieldAggregation> result = new HashSet<>();
+Matcher m = RE.matcher(expression);
+while (m.find()) {
+  int start = m.start();
+  int sumPos = expression.indexOf("sum", start);
+  if (sumPos == -1) {
+sumPos = Integer.MAX_VALUE;
+  }
+  int maxPos = expression.indexOf("max", start);
+  if (maxPos == -1) {
+maxPos = Integer.MAX_VALUE;
+  }
+  int end = Math.min(sumPos, maxPos);
+  if (end == Integer.MAX_VALUE) {
+throw new IllegalArgumentException("Invalid syntax");
+  }
+  end += 3;
+  String component = expression.substring(start, end);
+  String[] tokens = component.split("_");
+  if (tokens.length < 3 || "val".equals(tokens[0]) == false) {
+throw new IllegalArgumentException("Invalid syntax");
+  }
+  AssociationAggregationFunction func =
+  F_MAP.get(tokens[tokens.length - 1].toLowerCase(Locale.ROOT));
+  String ref;
+  if (tokens.length > 3) {
+StringBuilder sb = new StringBuilder();
+for (int i = 1; i < tokens.length - 1; i++) {
+  sb.append(tokens[i]);
+  if (i < tokens.length - 2) {
+sb.append("_");
+  }
+}
+ref = sb.toString();
+  } else {
+ref = tokens[1];
+  }
+  result.add(new FieldAggregation(component, ref, func));
+}
+
+return result;
+  }
+
+  private void aggregate(
+  String expr

[GitHub] [lucene] gsmiller commented on pull request #11746: Deprecate LongValueFacetCounts#getTopChildrenSortByCount since it provides redundant functionality

2023-03-10 Thread via GitHub


gsmiller commented on PR #11746:
URL: https://github.com/apache/lucene/pull/11746#issuecomment-1463976493

   I back ported this manually. Closing out the PR.





[GitHub] [lucene] gsmiller commented on pull request #12171: Add APIs to get ordinal and category cache hit/miss count and hit rate in DirectoryTaxonomyReader

2023-03-10 Thread via GitHub


gsmiller commented on PR #12171:
URL: https://github.com/apache/lucene/pull/12171#issuecomment-1463977181

   Closing this out for now since I don't think we want to add these APIs.





[GitHub] [lucene] gsmiller closed pull request #12171: Add APIs to get ordinal and category cache hit/miss count and hit rate in DirectoryTaxonomyReader

2023-03-10 Thread via GitHub


gsmiller closed pull request #12171: Add APIs to get ordinal and category cache 
hit/miss count and hit rate in DirectoryTaxonomyReader
URL: https://github.com/apache/lucene/pull/12171





[GitHub] [lucene] gsmiller closed pull request #11746: Deprecate LongValueFacetCounts#getTopChildrenSortByCount since it provides redundant functionality

2023-03-10 Thread via GitHub


gsmiller closed pull request #11746: Deprecate 
LongValueFacetCounts#getTopChildrenSortByCount since it provides redundant 
functionality
URL: https://github.com/apache/lucene/pull/11746





[GitHub] [lucene] gsmiller commented on pull request #12171: Add APIs to get ordinal and category cache hit/miss count and hit rate in DirectoryTaxonomyReader

2023-03-10 Thread via GitHub


gsmiller commented on PR #12171:
URL: https://github.com/apache/lucene/pull/12171#issuecomment-1463977754

   Thanks again @shubhamvishu for raising the idea and looking into the 
effectiveness of the caches more generally!





[GitHub] [lucene] gsmiller commented on pull request #11739: DRAFT: TermInSetQuery refactored to extend MultiTermsQuery

2023-03-10 Thread via GitHub


gsmiller commented on PR #11739:
URL: https://github.com/apache/lucene/pull/11739#issuecomment-1463987636

   We finally did this in #12156!





[GitHub] [lucene] gsmiller closed pull request #11739: DRAFT: TermInSetQuery refactored to extend MultiTermsQuery

2023-03-10 Thread via GitHub


gsmiller closed pull request #11739: DRAFT: TermInSetQuery refactored to extend 
MultiTermsQuery
URL: https://github.com/apache/lucene/pull/11739





[GitHub] [lucene] gsmiller closed pull request #11741: DRAFT: Experiment with intersecting TermInSetQuery terms up-front to better estimate cost

2023-03-10 Thread via GitHub


gsmiller closed pull request #11741: DRAFT: Experiment with intersecting 
TermInSetQuery terms up-front to better estimate cost
URL: https://github.com/apache/lucene/pull/11741





[GitHub] [lucene] jpountz opened a new pull request, #12199: Reduce contention in DocumentsWriterPerThreadPool.

2023-03-10 Thread via GitHub


jpountz opened a new pull request, #12199:
URL: https://github.com/apache/lucene/pull/12199

   Obtaining a DWPT and putting it back into the pool is subject to contention. 
This change reduces contention by using 8 sub pools that are tried 
sequentially. When applied on top of #12198, this reduces the time to index 
geonames with 20 threads from ~19s to ~16-17s.
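   The sub-pool idea can be illustrated with a minimal lock-striped free list. 
This is a sketch of the technique, not the actual 
`DocumentsWriterPerThreadPool` code; picking the starting stripe from the 
thread ID is an assumption for illustration:

```java
import java.util.ArrayDeque;
import java.util.function.Supplier;

// Sketch of lock striping: N sub-pools, each guarded by its own lock. A thread
// starts at a stripe derived from its ID and tries the pools sequentially, so
// concurrent threads mostly contend on different locks.
public class StripedPool<T> {
  private static final int STRIPES = 8;
  private final ArrayDeque<T>[] pools;

  @SuppressWarnings("unchecked")
  public StripedPool() {
    pools = new ArrayDeque[STRIPES];
    for (int i = 0; i < STRIPES; i++) {
      pools[i] = new ArrayDeque<>();
    }
  }

  /** Try each sub-pool starting at this thread's stripe; create a new instance if all are empty. */
  public T getAndLock(Supplier<T> newInstance) {
    int start = (int) (Thread.currentThread().getId() % STRIPES);
    for (int i = 0; i < STRIPES; i++) {
      ArrayDeque<T> pool = pools[(start + i) % STRIPES];
      synchronized (pool) { // per-stripe lock, not one global lock
        T t = pool.poll();
        if (t != null) {
          return t;
        }
      }
    }
    return newInstance.get(); // all sub-pools empty
  }

  /** Return an instance to this thread's stripe. */
  public void release(T t) {
    ArrayDeque<T> pool = pools[(int) (Thread.currentThread().getId() % STRIPES)];
    synchronized (pool) {
      pool.push(t);
    }
  }
}
```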





[GitHub] [lucene] gsmiller commented on pull request #11780: GH#11601: Add ability to compute reader states after refresh

2023-03-10 Thread via GitHub


gsmiller commented on PR #11780:
URL: https://github.com/apache/lucene/pull/11780#issuecomment-1464091101

   Sorry @stefanvodita, just now coming back to this after a break.
   
   OK, so current state-of-the-world:
   Right now, users have to instantiate reader state instances for each field 
they're using for SSDV faceting so they can provide those to SSDVFacets. These 
reader state instances must be re-created when the index is refreshed, and 
users have to manage this on their own. Creating these state instances is 
expensive, so users need a way to keep track of them, but also refresh them 
when necessary. To be fair though, users have all the capabilities they need to 
do efficient SSDV faceting, it's just a lot to keep track of for them (in my 
opinion anyway).
   
   So I think the real question here is if we can do something that makes 
users' lives easier w.r.t. keeping track of these states and refreshing them 
when necessary. I think Mike's point-of-view is that maybe this isn't such a 
big deal and may not be necessary?
   
   If we _do_ want to explore ways to make this interaction easier for users, I 
think `FacetsConfig` is the only realistic "hook" in some sense. One issue is 
that there is actually no current place where we can know what SSDV fields in 
the index are being used for faceting. We can't rely on the "dim config" 
exposed by FacetsConfig since users are not required to register any specific 
dim config for a faceting field if they're happy with the defaults.
   
   I wonder if one option to explore is to have a `FacetsConfig` instance keep 
track of the SSDV fields being used for faceting when it builds docs? That's 
the one place we actually know that an SSDV field is being added for the 
purpose of faceting. If the `FacetsConfig` instance kept track of SSDV faceting 
fields in this way, we could add a helper method to `FacetsConfig` that takes 
an `IndexReader` as input and creates a new set of reader states. Then, we 
could consider a different ctor for SSDVFacets that takes a `FacetsConfig` 
instance instead of taking a state instance directly, and the `FacetsConfig` 
instance could provide the correct state instance. I think there's some 
complexity in here though since a single `FacetsConfig` instance could be used 
with different index readers while a refresh is happening, so it would probably 
need to keep track of a state instance for each SSDV field + IndexReader or 
something. I don't know. This might all just be too complicated to be worth it, 
but it would be simpler for users if all they had to do was provide their 
`FacetsConfig` instance to SSDVFacets and then call some method on that 
`FacetsConfig` instance whenever they refresh their index.
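   A purely hypothetical sketch of that option (none of these hooks exist in 
Lucene's `FacetsConfig`; field tracking, the reader-key cache, and every method 
name here are assumptions): the config remembers which SSDV fields it indexed 
and caches one state object per (reader key, field), so a refreshed reader with 
a new key naturally gets fresh states.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

// Hypothetical sketch only: a FacetsConfig-like object that tracks SSDV
// faceting fields as documents are built and hands out per-reader states.
// Object stands in for the real reader-state type to keep this self-contained.
public class TrackingFacetsConfigSketch {
  private final Set<String> ssdvFields = ConcurrentHashMap.newKeySet();
  private final Map<Object, Map<String, Object>> statesByReader = new ConcurrentHashMap<>();

  /** Would be called from build(doc) whenever an SSDV faceting field is added. */
  public void onSsdvFieldIndexed(String field) {
    ssdvFields.add(field);
  }

  /** Returns (building if needed) per-field states for the given reader cache key. */
  public Map<String, Object> readerStates(Object readerCacheKey, Function<String, Object> stateFactory) {
    return statesByReader.computeIfAbsent(
        readerCacheKey,
        k -> {
          Map<String, Object> states = new HashMap<>();
          for (String field : ssdvFields) {
            states.put(field, stateFactory.apply(field)); // expensive in real life
          }
          return states;
        });
  }
}
```

   Keying the cache by reader rather than holding a single state set is what 
addresses the concern above about one config instance being used across a 
refresh; evicting entries for closed readers is left out of the sketch.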





[GitHub] [lucene] gsmiller commented on a diff in pull request #11780: GH#11601: Add ability to compute reader states after refresh

2023-03-10 Thread via GitHub


gsmiller commented on code in PR #11780:
URL: https://github.com/apache/lucene/pull/11780#discussion_r1132608504


##
lucene/facet/src/java/org/apache/lucene/facet/sortedset/SSDVReaderStatesCalculator.java:
##
@@ -0,0 +1,51 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.lucene.facet.sortedset;
+
+import java.io.IOException;
+import java.util.List;
+import java.util.stream.Collectors;
+import org.apache.lucene.facet.FacetsConfig;
+import org.apache.lucene.search.IndexSearcher;
+import org.apache.lucene.search.ReferenceManager;
+
+/**
+ * Get reader states after a refresh. Example call: readerStates =
+ * searcherManager.maybeRefreshAndReturnState(new 
SSDVReaderStatesCalculator(), config);
+ */
+public class SSDVReaderStatesCalculator
+implements ReferenceManager.StateCalculator<
+List<SortedSetDocValuesReaderState>, IndexSearcher, FacetsConfig> {
+
+  /** No arguments needed */
+  public SSDVReaderStatesCalculator() {}
+
+  @Override
+  public List<SortedSetDocValuesReaderState> calculate(IndexSearcher current, 
+FacetsConfig config) {
+return config.getDimConfigs().keySet().stream()

Review Comment:
   One issue here is that the user may have indexed SSDVFacetFields without 
making any configuration changes (i.e., the default dim config was fine for 
them). In this case, the dim configs you reference here won't know about the 
fields.






[GitHub] [lucene] dweiss commented on a diff in pull request #12199: Reduce contention in DocumentsWriterPerThreadPool.

2023-03-10 Thread via GitHub


dweiss commented on code in PR #12199:
URL: https://github.com/apache/lucene/pull/12199#discussion_r1132628795


##
lucene/core/src/java/org/apache/lucene/index/DocumentsWriterPerThreadPool.java:
##
@@ -113,15 +113,19 @@ private synchronized DocumentsWriterPerThread newWriter() 
{
* operation (add/updateDocument).
*/
   DocumentsWriterPerThread getAndLock() {
-synchronized (this) {
-  ensureOpen();
-  DocumentsWriterPerThread dwpt = 
freeList.poll(DocumentsWriterPerThread::tryLock);
-  if (dwpt == null) {
-dwpt = newWriter();
+ensureOpen();
+DocumentsWriterPerThread dwpt = 
freeList.poll(DocumentsWriterPerThread::tryLock);
+if (dwpt == null) {

Review Comment:
   For some reason, double locking always makes me cringe. There are lengthy 
discussions about this idiom all around the web (visibility of partially 
constructed objects).






[GitHub] [lucene] jpountz commented on a diff in pull request #12199: Reduce contention in DocumentsWriterPerThreadPool.

2023-03-10 Thread via GitHub


jpountz commented on code in PR #12199:
URL: https://github.com/apache/lucene/pull/12199#discussion_r1132647604


##
lucene/core/src/java/org/apache/lucene/index/DocumentsWriterPerThreadPool.java:
##
@@ -113,15 +113,19 @@ private synchronized DocumentsWriterPerThread newWriter() 
{
* operation (add/updateDocument).
*/
   DocumentsWriterPerThread getAndLock() {
-synchronized (this) {
-  ensureOpen();
-  DocumentsWriterPerThread dwpt = 
freeList.poll(DocumentsWriterPerThread::tryLock);
-  if (dwpt == null) {
-dwpt = newWriter();
+ensureOpen();
+DocumentsWriterPerThread dwpt = 
freeList.poll(DocumentsWriterPerThread::tryLock);
+if (dwpt == null) {

Review Comment:
   I don't think that this is a case of double-checked locking. In general 
double-checked locking tries to reduce the overhead of acquiring a lock by 
adding a quick check before the lock. Here the logic is different, it would be 
legal to remove the call to `poll` under lock, I only added it because there is 
a chance that a thread had to wait on the lock, so we could check if another 
DWPT was added to the queue in the meantime in order to save creating a new 
DWPT. I don't think it's actually important, I could remove the second call to 
`poll` to make it look less like double-checked locking.






[GitHub] [lucene] dweiss commented on a diff in pull request #12199: Reduce contention in DocumentsWriterPerThreadPool.

2023-03-10 Thread via GitHub


dweiss commented on code in PR #12199:
URL: https://github.com/apache/lucene/pull/12199#discussion_r1132698882


##
lucene/core/src/java/org/apache/lucene/index/DocumentsWriterPerThreadPool.java:
##
@@ -113,15 +113,19 @@ private synchronized DocumentsWriterPerThread newWriter() 
{
* operation (add/updateDocument).
*/
   DocumentsWriterPerThread getAndLock() {
-synchronized (this) {
-  ensureOpen();
-  DocumentsWriterPerThread dwpt = 
freeList.poll(DocumentsWriterPerThread::tryLock);
-  if (dwpt == null) {
-dwpt = newWriter();
+ensureOpen();
+DocumentsWriterPerThread dwpt = 
freeList.poll(DocumentsWriterPerThread::tryLock);
+if (dwpt == null) {

Review Comment:
   It's not about the poll. It's about whether dwpt = newWriter(); can be 
expanded and reordered by the compiler so that dwpt gets assigned a value 
before the constructor (or whatever initialization inside newWriter) takes 
place. Any thread checking dwpt == null outside synchronized could, at least in 
theory, see such a "partially constructed" object.
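   The hazard described here is the classic double-checked-locking publication 
problem. As a generic illustration (deliberately unrelated to the actual DWPT 
code), `volatile` on the shared reference is what forbids the unsafe 
reordering under the Java Memory Model:

```java
// Miniature of the publication hazard: without `volatile`, the write to
// `instance` may be reordered before the constructor finishes, so a reader
// doing the first (unlocked) check could observe a partially constructed
// object. Declaring the field volatile rules that reordering out.
public class Lazy {
  private static volatile Lazy instance;

  private final int value;

  private Lazy() {
    this.value = 42;
  }

  public static Lazy get() {
    Lazy local = instance;       // first check, outside the lock
    if (local == null) {
      synchronized (Lazy.class) {
        local = instance;        // second check, under the lock
        if (local == null) {
          instance = local = new Lazy();
        }
      }
    }
    return local;
  }

  public int value() {
    return value;
  }
}
```

   (Since Java 5, `final` fields also get safe-publication guarantees on their 
own, which is the nuance Shipilev's writeup linked below explores in depth.)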






[GitHub] [lucene] dweiss commented on a diff in pull request #12199: Reduce contention in DocumentsWriterPerThreadPool.

2023-03-10 Thread via GitHub


dweiss commented on code in PR #12199:
URL: https://github.com/apache/lucene/pull/12199#discussion_r1132703519


##
lucene/core/src/java/org/apache/lucene/index/DocumentsWriterPerThreadPool.java:
##
@@ -113,15 +113,19 @@ private synchronized DocumentsWriterPerThread newWriter() {
   * operation (add/updateDocument).
   */
  DocumentsWriterPerThread getAndLock() {
-    synchronized (this) {
-      ensureOpen();
-      DocumentsWriterPerThread dwpt = freeList.poll(DocumentsWriterPerThread::tryLock);
-      if (dwpt == null) {
-        dwpt = newWriter();
+    ensureOpen();
+    DocumentsWriterPerThread dwpt = freeList.poll(DocumentsWriterPerThread::tryLock);
+    if (dwpt == null) {

Review Comment:
   Aleksey Shipilëv had a nice writeup about it - found it here: 
https://shipilev.net/blog/2014/safe-public-construction/
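   For reference, one of the safe-publication idioms that writeup covers is the initialization-on-demand holder, sketched below under hypothetical names. Class initialization is guaranteed to run at most once, under the JVM's internal class-init lock, so no `volatile` or explicit synchronization is needed:

```java
// Initialization-on-demand holder idiom (one of the safe idioms from the
// linked writeup). Names here are hypothetical.
class HolderSingleton {

    private final int state;

    private HolderSingleton() { state = 42; }

    int state() { return state; }

    // The JVM initializes Holder (and thus INSTANCE) lazily, exactly once,
    // under the class-initialization lock, so publication is safe.
    private static class Holder {
        static final HolderSingleton INSTANCE = new HolderSingleton();
    }

    static HolderSingleton getInstance() {
        return Holder.INSTANCE;
    }

    public static void main(String[] args) {
        if (HolderSingleton.getInstance() != HolderSingleton.getInstance())
            throw new AssertionError("expected a single instance");
        System.out.println(HolderSingleton.getInstance().state());
    }
}
```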






[GitHub] [lucene] mdmarshmallow commented on pull request #12194: [GITHUB-11915] [Discussion Only] Make Lucene smarter about long runs of matches via new API on DISI

2023-03-10 Thread via GitHub


mdmarshmallow commented on PR #12194:
URL: https://github.com/apache/lucene/pull/12194#issuecomment-1464188468

   
   > I think @mdmarshmallow might be working on this as per [#11915 (comment)](https://github.com/apache/lucene/issues/11915#issuecomment-1459502217). As part of this PR, I've also added a new API to `BitSet#nextClearBit` just like JDK's `BitSet` API, which might be useful here as well.
   
   Yeah, I am working on it! I should hopefully be done by today, as it's not that big of a change :)
   





[GitHub] [lucene] jpountz commented on a diff in pull request #12199: Reduce contention in DocumentsWriterPerThreadPool.

2023-03-10 Thread via GitHub


jpountz commented on code in PR #12199:
URL: https://github.com/apache/lucene/pull/12199#discussion_r1132726733


##
lucene/core/src/java/org/apache/lucene/index/DocumentsWriterPerThreadPool.java:
##
@@ -113,15 +113,19 @@ private synchronized DocumentsWriterPerThread newWriter() {
   * operation (add/updateDocument).
   */
  DocumentsWriterPerThread getAndLock() {
-    synchronized (this) {
-      ensureOpen();
-      DocumentsWriterPerThread dwpt = freeList.poll(DocumentsWriterPerThread::tryLock);
-      if (dwpt == null) {
-        dwpt = newWriter();
+    ensureOpen();
+    DocumentsWriterPerThread dwpt = freeList.poll(DocumentsWriterPerThread::tryLock);
+    if (dwpt == null) {

Review Comment:
   Thanks for sharing this blog post; I remember reading it in the past, and it was a good re-read. I mentioned the polls because I thought they were what made you see this code as a case of double-checked locking: there is a first call before the lock and another one under the lock, like the null checks in double-checked locking with singletons. I need to head off for the weekend, but I'll try to post a convincing explanation next week of why this is not a case of double-checked locking and why it is safe. By the way, thanks for looking!






[GitHub] [lucene] dweiss commented on a diff in pull request #12199: Reduce contention in DocumentsWriterPerThreadPool.

2023-03-10 Thread via GitHub


dweiss commented on code in PR #12199:
URL: https://github.com/apache/lucene/pull/12199#discussion_r1132739002


##
lucene/core/src/java/org/apache/lucene/index/DocumentsWriterPerThreadPool.java:
##
@@ -113,15 +113,19 @@ private synchronized DocumentsWriterPerThread newWriter() {
   * operation (add/updateDocument).
   */
  DocumentsWriterPerThread getAndLock() {
-    synchronized (this) {
-      ensureOpen();
-      DocumentsWriterPerThread dwpt = freeList.poll(DocumentsWriterPerThread::tryLock);
-      if (dwpt == null) {
-        dwpt = newWriter();
+    ensureOpen();
+    DocumentsWriterPerThread dwpt = freeList.poll(DocumentsWriterPerThread::tryLock);
+    if (dwpt == null) {

Review Comment:
   No need to convince me, @jpountz - I just expressed my reluctance at it 
because, well, it requires convincing. :) Unless there's really a huge gain, I 
typically just go with what I understand works (making the field volatile, for 
example).






[GitHub] [lucene] frzhanguber opened a new issue, #12200: Lucene94HnswVectorsReader integer overflow when calculating graphOffsetsByLevel

2023-03-10 Thread via GitHub


frzhanguber opened a new issue, #12200:
URL: https://github.com/apache/lucene/issues/12200

   ### Description
   
   When loading a dense graph with M=64 and beamWidth=400, the `graphOffsetsByLevel` calculation below overflows: M, `Integer.BYTES`, and `numNodesOnLevel0` are all ints, so the product is computed in 32-bit arithmetic and wraps before it is widened to the long offset.
   
   ```
   graphOffsetsByLevel[level] = (1 + (M * 2)) * Integer.BYTES * numNodesOnLevel0;
   ```
   
   A fix could be:
   
   ```
   graphOffsetsByLevel[level] = (1 + (M * 2)) * Integer.BYTES * (long) numNodesOnLevel0;
   ```
   
   The code snippet:
   ```
   // calculate for each level the start offsets in vectorIndex file from where to read
   // neighbours
   graphOffsetsByLevel = new long[numLevels];
   for (int level = 0; level < numLevels; level++) {
     if (level == 0) {
       graphOffsetsByLevel[level] = 0;
     } else if (level == 1) {
       int numNodesOnLevel0 = size;
       graphOffsetsByLevel[level] = (1 + (M * 2)) * Integer.BYTES * numNodesOnLevel0;
     } else {
       int numNodesOnPrevLevel = nodesByLevel[level - 1].length;
       graphOffsetsByLevel[level] =
           graphOffsetsByLevel[level - 1] + (1 + M) * Integer.BYTES * numNodesOnPrevLevel;
     }
   }
   ```
   
   
   ### Version and environment details
   
   Lucene 9.4.1
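   The overflow is easy to reproduce in isolation. A minimal sketch (the node count is hypothetical; this is not the Lucene reader itself): with M=64, each level-0 node contributes (1 + 128) * 4 = 516 bytes of offsets, so any graph with more than Integer.MAX_VALUE / 516 ≈ 4.16 million nodes wraps the 32-bit product, while the `(long)` cast computes it correctly.

```java
public class GraphOffsetOverflowDemo {

    // Mirrors the reported expression: every operand is an int, so the
    // product wraps in 32-bit arithmetic before being widened to long.
    static long brokenOffset(int m, int numNodesOnLevel0) {
        return (1 + (m * 2)) * Integer.BYTES * numNodesOnLevel0;
    }

    // The proposed fix: casting one operand to long makes the whole
    // product use 64-bit arithmetic.
    static long fixedOffset(int m, int numNodesOnLevel0) {
        return (1 + (m * 2)) * Integer.BYTES * (long) numNodesOnLevel0;
    }

    public static void main(String[] args) {
        int m = 64;                 // M from the report
        int nodes = 10_000_000;     // hypothetical dense level-0 size
        // 129 * 4 * 10_000_000 = 5_160_000_000, which exceeds Integer.MAX_VALUE
        System.out.println(brokenOffset(m, nodes)); // 865032704 (wrapped, silently wrong)
        System.out.println(fixedOffset(m, nodes));  // 5160000000 (correct)
    }
}
```

   Note the wrapped value is still positive here, so the bug would not necessarily surface as a negative offset.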





[GitHub] [lucene] gsmiller commented on pull request #12194: [GITHUB-11915] [Discussion Only] Make Lucene smarter about long runs of matches via new API on DISI

2023-03-10 Thread via GitHub


gsmiller commented on PR #12194:
URL: https://github.com/apache/lucene/pull/12194#issuecomment-1464329664

   Woohoo! Thanks @zacharymorn / @mdmarshmallow! I suspect we may not really 
see any benefit though if the DISI can only expose the next non-matching doc 
within its current block. I think the real advantage here would come from being 
able to actually skip blocks in the DISI, which would rely on knowing that 
there are actually multiple consecutive "dense" blocks in a DISI. But... if our 
only way to know there are multiple consecutive "dense" blocks involves 
decoding those blocks anyway, maybe there isn't much gain to be had? Hmm... not 
sure. Excited to see what we learn though!
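   To illustrate the shape of the idea with the JDK's `java.util.BitSet` (which already has the `nextClearBit` API the PR mirrors), a consumer that knows docs form a dense run can jump straight to its end instead of iterating doc by doc. This is only a sketch of the access pattern, not Lucene's DISI machinery:

```java
import java.util.BitSet;

public class DenseRunDemo {

    // Returns the first doc at or after 'start' that does NOT match.
    // Everything in [start, result) is a dense run the caller can accept
    // wholesale (e.g. OR whole 64-bit words into a bit set) rather than
    // visiting each matching doc individually.
    static int endOfDenseRun(BitSet matches, int start) {
        return matches.nextClearBit(start);
    }

    public static void main(String[] args) {
        BitSet matches = new BitSet();
        matches.set(0, 1024);  // docs 0..1023 all match
        matches.set(1500);     // an isolated match after a gap
        int runEnd = endOfDenseRun(matches, 0);
        System.out.println(runEnd); // 1024: skip straight past the dense run
    }
}
```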





[GitHub] [lucene] shaikhu opened a new pull request, #12201: Github 10633 Update Javadoc comment to mention gradle instead of ant

2023-03-10 Thread via GitHub


shaikhu opened a new pull request, #12201:
URL: https://github.com/apache/lucene/pull/12201

   I believe this is a fix for issue 
[10633](https://github.com/apache/lucene/issues/10633).
   
   Although the path is slightly different, this is the only class I could find 
with the name `TestBackwardsCompatibility`. 





[GitHub] [lucene] rmuir closed issue #12200: Lucene94HnswVectorsReader integer overflow when calculating graphOffsetsByLevel

2023-03-10 Thread via GitHub


rmuir closed issue #12200: Lucene94HnswVectorsReader integer overflow when 
calculating graphOffsetsByLevel
URL: https://github.com/apache/lucene/issues/12200





[GitHub] [lucene] rmuir commented on issue #12200: Lucene94HnswVectorsReader integer overflow when calculating graphOffsetsByLevel

2023-03-10 Thread via GitHub


rmuir commented on issue #12200:
URL: https://github.com/apache/lucene/issues/12200#issuecomment-1464410196

   duplicate of #11905





[GitHub] [lucene] erikhatcher commented on pull request #12159: Remove the Now Unused Class `pointInPolygon`.

2023-03-10 Thread via GitHub


erikhatcher commented on PR #12159:
URL: https://github.com/apache/lucene/pull/12159#issuecomment-1464504770

   Looks good to me. Eagle eye, Marcus.





[GitHub] [lucene] gsmiller commented on pull request #12153: Unrelated code in TestIndexSortSortedNumericDocValuesRangeQuery

2023-03-10 Thread via GitHub


gsmiller commented on PR #12153:
URL: https://github.com/apache/lucene/pull/12153#issuecomment-1464578469

   +1
   
   Did a little digging, and it appears to have been carried over from some 
copy/paste of `TestDocValuesQueries#testToString`.  Let's remove. We've already 
got this (exact) coverage over in `TestDocValuesQueries`.





[GitHub] [lucene] zacharymorn commented on pull request #12194: [GITHUB-11915] [Discussion Only] Make Lucene smarter about long runs of matches via new API on DISI

2023-03-10 Thread via GitHub


zacharymorn commented on PR #12194:
URL: https://github.com/apache/lucene/pull/12194#issuecomment-1464741062

   Thanks @mdmarshmallow for working on it! Btw, I just pushed a commit (https://github.com/apache/lucene/pull/12194/commits/f78182bae7e92b23136b5975dbb9d3199e5e3065) that fixes some bugs identified by randomized tests; you may want to pull it into your work, especially if you are using the `BitSet#nextClearBit` method.
   
   >I suspect we may not really see any benefit though if the DISI can only 
expose the next non-matching doc within its current block. I think the real 
advantage here would come from being able to actually skip blocks in the DISI, 
which would rely on knowing that there are actually multiple consecutive 
"dense" blocks in a DISI. But... if our only way to know there are multiple 
consecutive "dense" blocks involves decoding those blocks anyway, maybe there 
isn't much gain to be had? Hmm... not sure. Excited to see what we learn though!
   
   @gsmiller Yeah, I'm guessing that as well, especially for postings and sparse/dense blocks, as it would take at least one pass to identify the next candidate. I have also tried caching the result and would like to see whether that helps and how it performs under benchmarks.





[GitHub] [lucene] mdmarshmallow commented on pull request #12194: [GITHUB-11915] [Discussion Only] Make Lucene smarter about long runs of matches via new API on DISI

2023-03-10 Thread via GitHub


mdmarshmallow commented on PR #12194:
URL: https://github.com/apache/lucene/pull/12194#issuecomment-1464803846

   Yeah, I was doing some debugging of my own and saw some of those issues. I think this fixed a decent number of the problems I was seeing, but some tests are still failing. I'm not sure whether it's an issue with my code though, so I'll need to dig a bit deeper.

