[GitHub] [lucene] jpountz merged pull request #11726: Prevent term vectors from exceeding the maximum dictionary size.

2022-09-08 Thread GitBox


jpountz merged PR #11726:
URL: https://github.com/apache/lucene/pull/11726


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] jpountz commented on issue #11702: Multi-Value Support for Binary DocValues [LUCENE-10666]

2022-09-08 Thread GitBox


jpountz commented on issue #11702:
URL: https://github.com/apache/lucene/issues/11702#issuecomment-1240622777

   The historical objection against multi-value binary support is that it could 
be easily implemented on top of binary doc values. So multi-value binary 
support would add API surface and push more complexity on codecs and 
`IndexingChain` while mostly providing syntactic sugar for users.
   
   We've been following this approach of encoding multi-valued fields in a single `BinaryDocValuesField` in Elasticsearch for some less common field types like range and geo-shape fields, and it's not always easy in practice: you need to collect all values before you can encode them into a `BinaryDocValuesField` that can be added to a `Document`, which is a bit more involved than adding new `Field` instances to a `Document` as we see them.
   
   I wonder if there are intermediate options worth exploring, like adding tooling to make it easier to encode multiple values into a single `BinaryDocValuesField` on the write side, and creating wrappers around `BinaryDocValues` instances on the read side that expose multiple values, similar to how `FeatureField` encodes floats as term frequencies of postings without adding more surface to codec APIs.
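To make the write-side tooling idea concrete, here is a minimal, self-contained sketch of packing several values into one payload using vInt length prefixes. `MultiValuePacker` and its methods are hypothetical helpers, not a Lucene API; the resulting `byte[]` is what you would hand to a single `BinaryDocValuesField`.

```java
import java.io.ByteArrayOutputStream;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper sketch (not part of Lucene): packs several byte[]
// values into one payload suitable for a single BinaryDocValuesField.
// Each value is written as a vInt length prefix followed by its bytes.
class MultiValuePacker {
  static byte[] pack(List<byte[]> values) {
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    for (byte[] v : values) {
      int len = v.length;
      // vInt: 7 bits per byte, high bit set marks a continuation byte
      while ((len & ~0x7F) != 0) {
        out.write((len & 0x7F) | 0x80);
        len >>>= 7;
      }
      out.write(len);
      out.write(v, 0, v.length);
    }
    return out.toByteArray();
  }

  static List<byte[]> unpack(byte[] packed) {
    List<byte[]> values = new ArrayList<>();
    int pos = 0;
    while (pos < packed.length) {
      int len = 0, shift = 0;
      byte b;
      do {
        b = packed[pos++];
        len |= (b & 0x7F) << shift;
        shift += 7;
      } while ((b & 0x80) != 0);
      byte[] v = new byte[len];
      System.arraycopy(packed, pos, v, 0, len);
      pos += len;
      values.add(v);
    }
    return values;
  }
}
```

The read-side wrapper jpountz mentions would essentially run `unpack` over the `BytesRef` returned by `BinaryDocValues#binaryValue()`.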




[GitHub] [lucene] rmuir commented on issue #11702: Multi-Value Support for Binary DocValues [LUCENE-10666]

2022-09-08 Thread GitBox


rmuir commented on issue #11702:
URL: https://github.com/apache/lucene/issues/11702#issuecomment-1240646341

   The use case here is also not great, talking about a doc having multiple locations. It's a pet peeve of mine; I don't think we should add a new major docvalues type for such crap :)




[GitHub] [lucene] msokolov commented on a diff in pull request #11756: LUCENE-10577: Remove LeafReader#searchNearestVectorsExhaustively

2022-09-08 Thread GitBox


msokolov commented on code in PR #11756:
URL: https://github.com/apache/lucene/pull/11756#discussion_r965934198


##
lucene/core/src/java/org/apache/lucene/search/KnnVectorQuery.java:
##
@@ -175,9 +176,42 @@ private TopDocs approximateSearch(LeafReaderContext context, Bits acceptDocs, in
   }
 
   // We allow this to be overridden so that tests can check what search strategy is used
-  protected TopDocs exactSearch(LeafReaderContext context, DocIdSetIterator acceptDocs)
+  protected TopDocs exactSearch(LeafReaderContext context, DocIdSetIterator acceptIterator)
       throws IOException {
-    return context.reader().searchNearestVectorsExhaustively(field, target, k, acceptDocs);
+    FieldInfo fi = context.reader().getFieldInfos().fieldInfo(field);
+    if (fi == null || fi.getVectorDimension() == 0) {
+      // The field does not exist or does not index vectors
+      return NO_RESULTS;
+    }
+
+    VectorScorer vectorScorer = VectorScorer.create(context, fi, target);
+    HitQueue queue = new HitQueue(k, true);
+    ScoreDoc topDoc = queue.top();
+    int doc;
+    while ((doc = acceptIterator.nextDoc()) != DocIdSetIterator.NO_MORE_DOCS) {
+      boolean advanced = vectorScorer.advanceExact(doc);
+      assert advanced;
+
+      float score = vectorScorer.score();
+      if (score >= topDoc.score) {

Review Comment:
   Yes I think this was done with the docid-tie-breaking in mind, but that is 
not guaranteed, so we might as well relax this too.





[GitHub] [lucene] msokolov commented on a diff in pull request #11756: LUCENE-10577: Remove LeafReader#searchNearestVectorsExhaustively

2022-09-08 Thread GitBox


msokolov commented on code in PR #11756:
URL: https://github.com/apache/lucene/pull/11756#discussion_r965936802


##
lucene/core/src/java/org/apache/lucene/search/VectorScorer.java:
##
@@ -0,0 +1,104 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.lucene.search;
+
+import java.io.IOException;
+import org.apache.lucene.index.FieldInfo;
+import org.apache.lucene.index.LeafReaderContext;
+import org.apache.lucene.index.VectorSimilarityFunction;
+import org.apache.lucene.index.VectorValues;
+import org.apache.lucene.util.BytesRef;
+import org.apache.lucene.util.VectorUtil;
+
+/**
+ * Computes the similarity score between a given query vector and different document vectors. This
+ * is primarily used by {@link org.apache.lucene.search.KnnVectorQuery} to run an exact, exhaustive
+ * search over the vectors.
+ */
+abstract class VectorScorer {
+  protected final VectorValues values;
+
+  /**
+   * Create a new vector scorer instance.
+   *
+   * @param context the reader context
+   * @param fi the FieldInfo for the field containing document vectors
+   * @param query the query vector to compute the similarity for
+   */
+  static VectorScorer create(LeafReaderContext context, FieldInfo fi, float[] query)
+      throws IOException {
+    VectorValues values = context.reader().getVectorValues(fi.name);
+    VectorSimilarityFunction similarity = fi.getVectorSimilarityFunction();
+    return switch (fi.getVectorEncoding()) {
+      case BYTE -> new ByteVectorScorer(values, query, similarity);
+      case FLOAT32 -> new FloatVectorScorer(values, query, similarity);
+    };
+  }
+
+  VectorScorer(VectorValues values) {
+    this.values = values;
+  }
+
+  /**
+   * Advance the instance to the given document ID and return true if there is a value for that
+   * document.
+   */
+  public boolean advanceExact(int doc) throws IOException {
+    int vectorDoc = values.docID();
+    if (vectorDoc < doc) {
+      vectorDoc = values.advance(doc);
+    }
+    return vectorDoc == doc;
+  }
+
+  /** Compute the similarity score for the current document. */
+  abstract float score() throws IOException;
+
+  private static class ByteVectorScorer extends VectorScorer {
+    private final BytesRef query;
+    private final VectorSimilarityFunction similarity;

Review Comment:
   nit: we could move the similarity to the base class 





[GitHub] [lucene] jpountz commented on pull request #11741: DRAFT: Experiment with intersecting TermInSetQuery terms up-front to better estimate cost

2022-09-08 Thread GitBox


jpountz commented on PR #11741:
URL: https://github.com/apache/lucene/pull/11741#issuecomment-1240710691

   I guess that the main downside of this approach is that the term lookups are the bottleneck of a `TermInSetQuery` when the included terms have low docFreqs. So moving that cost to `scorerSupplier` would kill the benefits of using an `IndexOrDocValuesQuery`?





[GitHub] [lucene] jpountz commented on pull request #11738: Optimize MultiTermQueryConstantScoreWrapper for case when a term matches all docs in a segment.

2022-09-08 Thread GitBox


jpountz commented on PR #11738:
URL: https://github.com/apache/lucene/pull/11738#issuecomment-1240714080

   It might not be a big win in practice, but it should be enough to compare the `docFreq` with the `docCount` (rather than `maxDoc`) and use the postings list whose `docFreq` is equal to `docCount` as an iterator of matches.





[GitHub] [lucene] jpountz commented on pull request #11731: 11730 removed unused fst load mode

2022-09-08 Thread GitBox


jpountz commented on PR #11731:
URL: https://github.com/apache/lucene/pull/11731#issuecomment-1240744318

   Changing the default for the completion postings format might be 
controversial. Maybe we should close this PR since we're still using the 
ON_HEAP mode and have a separate discussion about whether completion postings 
formats could move to OFF_HEAP?





[GitHub] [lucene] gsmiller commented on a diff in pull request #11738: Optimize MultiTermQueryConstantScoreWrapper for case when a term matches all docs in a segment.

2022-09-08 Thread GitBox


gsmiller commented on code in PR #11738:
URL: https://github.com/apache/lucene/pull/11738#discussion_r966070400


##
lucene/core/src/java/org/apache/lucene/search/MultiTermQueryConstantScoreWrapper.java:
##
@@ -165,9 +143,46 @@ private WeightOrDocIdSet rewrite(LeafReaderContext context) throws IOException {
 
     PostingsEnum docs = null;
 
-    final List<TermAndState> collectedTerms = new ArrayList<>();
-    if (collectTerms(context, termsEnum, collectedTerms)) {
-      // build a boolean query
+    // We will first try to collect up to 'threshold' terms into 'matchingTerms';
+    // if there are too many terms, we will fall back to building the 'builder'

Review Comment:
   Thanks for the suggestion @rmuir. Let me see if I can use the existing code structure a bit more in this change. The reason I didn't want to just call `collectTerms` as-is is that we could unnecessarily seek and load term states when we've already found a term covering all docs. For example, if the first term we visit covers all docs, we can just stop there.
   
   I'm also not sure I'm following your point about reallocating `collectedTerms` as part of this change? That's certainly not my intention with this code, but maybe I'm staring at a bug and not realizing it? As soon as we hit the size threshold, we should be nulling out `collectedTerms`, initializing a builder and just using that for the remaining term iteration. Apologies if I'm overlooking something though. Entirely possible.






[GitHub] [lucene] rmuir commented on a diff in pull request #11738: Optimize MultiTermQueryConstantScoreWrapper for case when a term matches all docs in a segment.

2022-09-08 Thread GitBox


rmuir commented on code in PR #11738:
URL: https://github.com/apache/lucene/pull/11738#discussion_r966085670


##
lucene/core/src/java/org/apache/lucene/search/MultiTermQueryConstantScoreWrapper.java:
##
@@ -165,9 +143,46 @@ private WeightOrDocIdSet rewrite(LeafReaderContext context) throws IOException {
 
     PostingsEnum docs = null;
 
-    final List<TermAndState> collectedTerms = new ArrayList<>();
-    if (collectTerms(context, termsEnum, collectedTerms)) {
-      // build a boolean query
+    // We will first try to collect up to 'threshold' terms into 'matchingTerms';
+    // if there are too many terms, we will fall back to building the 'builder'

Review Comment:
   > Thanks for the suggestion @rmuir. Let me see if I can use the existing code structure a bit more in this change. The reason I didn't want to just call collectTerms as-is is that we could unnecessarily seek and load term states when we've already found a term covering all docs. For example, if the first term we visit covers all docs, we can just stop there.
   
   I don't think it's worth completely restructuring the code for that case? It's only 16 terms at most.
   
   > I'm also not sure I'm following your point about reallocating `collectedTerms` as part of this change? That's certainly not my intention with this code, but maybe I'm staring at a bug and not realizing it? As soon as we hit the size threshold, we should be nulling out `collectedTerms`, initializing a builder and just using that for the remaining term iteration. Apologies if I'm overlooking something though. Entirely possible.
   
   Yeah, I think I could have read it wrong, my bad. I did try to stare for a while but I think I got confused by the decision tree.






[GitHub] [lucene] gsmiller commented on pull request #11738: Optimize MultiTermQueryConstantScoreWrapper for case when a term matches all docs in a segment.

2022-09-08 Thread GitBox


gsmiller commented on PR #11738:
URL: https://github.com/apache/lucene/pull/11738#issuecomment-1240856191

   @jpountz:
   
   > It might not be a big win in practice, but it should be enough to compare the `docFreq` with the `docCount` (rather than `maxDoc`) and use the postings list whose `docFreq` is equal to `docCount` as an iterator of matches.
   
   I like that idea. I wonder if checking for both conditions makes sense? If a term contains all docs in the segment, it should be more efficient to use `DocIdSet#all`, right? (rather than iterating the actual postings). But if a term doesn't contain all docs in the segment but _does_ contain all docs in the field (i.e., the field isn't completely dense), we could add an additional optimization here to use that single term's postings. Is that what you had in mind?
   
   Here's what I'm thinking:
   ```
   int docFreq = termsEnum.docFreq();
   if (reader.maxDoc() == docFreq) {
     return new WeightOrDocIdSet(DocIdSet.all(docFreq));
   } else if (terms.getDocCount() == docFreq) {
     TermStates termStates = new TermStates(searcher.getTopReaderContext());
     termStates.register(termsEnum.termState(), context.ord, docFreq, termsEnum.totalTermFreq());
     Query q = new ConstantScoreQuery(new TermQuery(new Term(query.field, term), termStates));
     Weight weight = searcher.rewrite(q).createWeight(searcher, scoreMode, score());
     return new WeightOrDocIdSet(weight);
   }
   ```





[GitHub] [lucene] rmuir commented on pull request #11738: Optimize MultiTermQueryConstantScoreWrapper for case when a term matches all docs in a segment.

2022-09-08 Thread GitBox


rmuir commented on PR #11738:
URL: https://github.com/apache/lucene/pull/11738#issuecomment-1240897004

   @gsmiller I think the question is, is it worth adding all those extra 
conditionals? I don't think the `DocIdSet#all` will really be that much faster 
in practice (I'm not even sure how often this optimization will hit for users 
anyway).
   
   Usually if a term is so dense that it is matching all the documents, then we 
aren't really reading actual postings anyway: we basically just read a single 
byte header that means "all ones" for the 128-doc FOR block and don't do any 
actual read/decoding.
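To illustrate rmuir's point, here is a toy model of that "all ones" shortcut. The class and method names are illustrative only; Lucene's actual `ForUtil` block encoding differs in its details.

```java
// Toy sketch of why an all-dense postings block is nearly free to encode and
// read: when every doc-id delta in a 128-doc block equals 1, a codec can emit
// a single header byte (bitsPerValue == 0, meaning "all ones") and no payload.
// Illustrative only; this is not Lucene's actual ForUtil implementation.
class ForBlockSketch {
  static final int BLOCK_SIZE = 128;

  // Returns the number of bytes needed to encode one block of doc-id deltas.
  static int encodedLength(int[] deltas) {
    boolean allOnes = true;
    int maxBits = 1;
    for (int d : deltas) {
      if (d != 1) {
        allOnes = false;
      }
      maxBits = Math.max(maxBits, 32 - Integer.numberOfLeadingZeros(d));
    }
    if (allOnes) {
      return 1; // single header byte meaning "all ones", no packed payload
    }
    // header byte plus bit-packed deltas at maxBits bits per value
    return 1 + (BLOCK_SIZE * maxBits + 7) / 8;
  }
}
```

A fully dense block costs one byte, while a block of deltas needing 3 bits each costs 49 bytes, which is why skipping the decode matters so little for the dense case.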





[GitHub] [lucene] jpountz commented on pull request #1068: LUCENE-10674: Update subiterators when BitSetConjDISI exhausts

2022-09-08 Thread GitBox


jpountz commented on PR #1068:
URL: https://github.com/apache/lucene/pull/1068#issuecomment-1240949394

   The fact that it should be legal for `ConjunctionDISI` to return `NO_MORE_DOCS` when the lead iterator advances to `NO_MORE_DOCS`, without advancing the other iterators, makes me wonder if this change is hiding another bug. Do you understand why it's important for clauses under the `ConjunctionDISI` to advance to `NO_MORE_DOCS`?
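The contract jpountz describes can be shown with a stripped-down model: the conjunction may exhaust as soon as its lead does, leaving other sub-iterators positioned mid-stream. This is an illustrative sketch, not Lucene's `ConjunctionDISI`.

```java
// Minimal model of a two-clause conjunction. When the lead iterator runs out,
// the conjunction returns NO_MORE_DOCS without touching the other clause, so
// consumers must not assume every sub-iterator ends on NO_MORE_DOCS.
// Illustrative only; not Lucene's ConjunctionDISI.
class ConjunctionSketch {
  static final int NO_MORE_DOCS = Integer.MAX_VALUE;

  static class IntIterator {
    final int[] docs;
    int idx = -1;
    IntIterator(int... docs) { this.docs = docs; }
    int docID() {
      return idx < 0 ? -1 : idx >= docs.length ? NO_MORE_DOCS : docs[idx];
    }
    int nextDoc() { idx++; return docID(); }
    int advance(int target) {
      while (docID() < target) {
        nextDoc();
      }
      return docID();
    }
  }

  // Advance to the next doc both clauses agree on, leading with 'lead'.
  static int nextMatch(IntIterator lead, IntIterator other) {
    int doc = lead.nextDoc();
    while (doc != NO_MORE_DOCS) {
      if (other.advance(doc) == doc) {
        return doc;
      }
      doc = lead.advance(other.docID());
    }
    return NO_MORE_DOCS; // 'other' may still sit on an earlier doc here
  }
}
```

After the lead exhausts, `other.docID()` can legitimately be less than `NO_MORE_DOCS`, which is why forcing sub-iterators to `NO_MORE_DOCS` looks like it papers over a bug elsewhere.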





[GitHub] [lucene] jpountz commented on a diff in pull request #1068: LUCENE-10674: Update subiterators when BitSetConjDISI exhausts

2022-09-08 Thread GitBox


jpountz commented on code in PR #1068:
URL: https://github.com/apache/lucene/pull/1068#discussion_r966192242


##
lucene/core/src/java/org/apache/lucene/search/ConjunctionDISI.java:
##
@@ -281,6 +281,12 @@ private int doNext(int doc) throws IOException {
     advanceLead:
     for (; ; doc = lead.nextDoc()) {
       if (doc >= minLength) {
+        if (doc != NO_MORE_DOCS) {
+          lead.advance(NO_MORE_DOCS);
+        }
+        for (BitSetIterator iterator : bitSetIterators) {
+          iterator.setDocId(NO_MORE_DOCS);
+        }

Review Comment:
   The `if` statement makes sense to me, but I'm curious how you managed to hit this case. This suggests that we create BitSets whose size is not `maxDoc`; do you know where this happens?
   
   The `for` loop should be unnecessary: there is no guarantee that all sub-iterators advance to `NO_MORE_DOCS`. If this causes problems, then it means we have another bug somewhere else?






[GitHub] [lucene] gsmiller commented on pull request #11741: DRAFT: Experiment with intersecting TermInSetQuery terms up-front to better estimate cost

2022-09-08 Thread GitBox


gsmiller commented on PR #11741:
URL: https://github.com/apache/lucene/pull/11741#issuecomment-1241017654

   @jpountz thanks for the feedback! If we assume a scenario where we have a 
`TermInSetQuery` over very selective terms (low docFreqs for each), we'd want 
to use the index query unless there's another clause that can lead that query 
that's significantly more restrictive. So there's no benefit of using 
`IndexOrDocValuesQuery` in that scenario, but moving the term lookup to the 
`scorerSupplier` also shouldn't hurt this case since we have to do it anyway.
   
   On the other hand, with today's implementation, we may not know that the 
`TermInSetQuery` is very selective (e.g., maybe there are terms in the field, 
but not in the query, that match a large number of documents). In this case, it 
may be beneficial to use the index-based query, but—because of our naive cost 
heuristic—we could end up using a doc-values query because we're significantly 
over-estimating the cost of the index query.
   
   I think the case where this approach would really hurt us is when the `TermInSetQuery` is not particularly restrictive (to the point that we end up using the doc-values query), but we have to pay this up-front cost to look up terms just to decide we don't want to use the index-based query after all.




[GitHub] [lucene] rmuir commented on pull request #11757: Fix TestIndexWriterOnDiskFull.testAddDocumentOnDiskFull to handle IllegalStateException from startCommit()

2022-09-08 Thread GitBox


rmuir commented on PR #11757:
URL: https://github.com/apache/lucene/pull/11757#issuecomment-1241023176

   Thank you for reviewing @vigyasharma. 





[GitHub] [lucene] rmuir closed issue #11755: TestIndexWriterOnDiskFull.testAddDocumentOnDiskFull failure

2022-09-08 Thread GitBox


rmuir closed issue #11755: TestIndexWriterOnDiskFull.testAddDocumentOnDiskFull 
failure
URL: https://github.com/apache/lucene/issues/11755





[GitHub] [lucene] rmuir merged pull request #11757: Fix TestIndexWriterOnDiskFull.testAddDocumentOnDiskFull to handle IllegalStateException from startCommit()

2022-09-08 Thread GitBox


rmuir merged PR #11757:
URL: https://github.com/apache/lucene/pull/11757





[GitHub] [lucene] navneet1v commented on pull request #11753: Added interface to relate a LatLonShape with another shape represented as Component2D

2022-09-08 Thread GitBox


navneet1v commented on PR #11753:
URL: https://github.com/apache/lucene/pull/11753#issuecomment-1241043924

   The function is publicly exposed, but the class `LatLonShape` cannot be created publicly, so we need a way to create it.
   
   Please let me know what the right way to do this would be.





[GitHub] [lucene] mayya-sharipova commented on a diff in pull request #11743: LUCENE-10592 Better estimate memory for HNSW graph

2022-09-08 Thread GitBox


mayya-sharipova commented on code in PR #11743:
URL: https://github.com/apache/lucene/pull/11743#discussion_r966264433


##
lucene/core/src/test/org/apache/lucene/util/TestRamUsageEstimator.java:
##
@@ -222,6 +229,28 @@ public void testPrintValues() {
     System.out.println("LONG_SIZE = " + LONG_SIZE);
   }
 
+  public void testHnswGraph() throws IOException {
+    int size = atLeast(2000);
+    int dim = randomIntBetween(100, 1024);
+    int M = randomIntBetween(4, 96);
+    VectorSimilarityFunction similarityFunction =
+        VectorSimilarityFunction.values()[
+            random().nextInt(VectorSimilarityFunction.values().length - 1) + 1];
+    VectorEncoding vectorEncoding =
+        VectorEncoding.values()[random().nextInt(VectorEncoding.values().length - 1) + 1];
+    TestHnswGraph.RandomVectorValues vectors =
+        new TestHnswGraph.RandomVectorValues(size, dim, vectorEncoding, random());
+
+    HnswGraphBuilder builder =
+        HnswGraphBuilder.create(
+            vectors, vectorEncoding, similarityFunction, M, M * 2, random().nextLong());
+    OnHeapHnswGraph hnsw = builder.build(vectors.copy());
+    long estimated = RamUsageEstimator.sizeOfObject(hnsw);
+    long actual = ramUsed(hnsw);
+
+    assertEquals((double) actual, (double) estimated, (double) actual * 0.3);
+  }
+

Review Comment:
   Addressed in 45ced1e5beac8eb9932933967f3a2ca00d0c7835






[GitHub] [lucene] mayya-sharipova commented on a diff in pull request #11743: LUCENE-10592 Better estimate memory for HNSW graph

2022-09-08 Thread GitBox


mayya-sharipova commented on code in PR #11743:
URL: https://github.com/apache/lucene/pull/11743#discussion_r966265177


##
lucene/core/src/test/org/apache/lucene/util/hnsw/TestHnswGraph.java:
##
@@ -74,12 +74,8 @@ public void setup() {
     similarityFunction =
         VectorSimilarityFunction.values()[
             random().nextInt(VectorSimilarityFunction.values().length - 1) + 1];
-    if (similarityFunction == VectorSimilarityFunction.DOT_PRODUCT) {
-      vectorEncoding =
-          VectorEncoding.values()[random().nextInt(VectorEncoding.values().length - 1) + 1];
-    } else {
-      vectorEncoding = VectorEncoding.FLOAT32;
-    }
+    vectorEncoding =
+        VectorEncoding.values()[random().nextInt(VectorEncoding.values().length - 1) + 1];

Review Comment:
   Great feedback! Thanks Adrien & Dawid!  Addressed in 
45ced1e5beac8eb9932933967f3a2ca00d0c7835






[GitHub] [lucene] mayya-sharipova commented on a diff in pull request #11743: LUCENE-10592 Better estimate memory for HNSW graph

2022-09-08 Thread GitBox


mayya-sharipova commented on code in PR #11743:
URL: https://github.com/apache/lucene/pull/11743#discussion_r966265508


##
lucene/core/src/java/org/apache/lucene/util/hnsw/NeighborArray.java:
##
@@ -46,7 +46,7 @@ public NeighborArray(int maxSize, boolean descOrder) {
    * nodes.
    */
   public void add(int newNode, float newScore) {
-    if (size == node.length - 1) {
+    if (size == node.length) {
       node = ArrayUtil.grow(node, (size + 1) * 3 / 2);

Review Comment:
   Great comment! Addressed in 45ced1e5beac8eb9932933967f3a2ca00d0c7835






[GitHub] [lucene] gsmiller commented on pull request #11738: Optimize MultiTermQueryConstantScoreWrapper for case when a term matches all docs in a segment.

2022-09-08 Thread GitBox


gsmiller commented on PR #11738:
URL: https://github.com/apache/lucene/pull/11738#issuecomment-1241061002

   @rmuir that's a fair point. I'll put up another iteration shortly that tries 
to address this feedback. Hopefully it will converge on something that makes 
sense to everyone :)





[GitHub] [lucene] jmazanec15 commented on a diff in pull request #1068: LUCENE-10674: Update subiterators when BitSetConjDISI exhausts

2022-09-08 Thread GitBox


jmazanec15 commented on code in PR #1068:
URL: https://github.com/apache/lucene/pull/1068#discussion_r966298807


##
lucene/core/src/java/org/apache/lucene/search/ConjunctionDISI.java:
##
@@ -281,6 +281,12 @@ private int doNext(int doc) throws IOException {
   advanceLead:
   for (; ; doc = lead.nextDoc()) {
 if (doc >= minLength) {
+  if (doc != NO_MORE_DOCS) {
+lead.advance(NO_MORE_DOCS);
+  }
+  for (BitSetIterator iterator : bitSetIterators) {
+iterator.setDocId(NO_MORE_DOCS);
+  }

Review Comment:
   > The if statement makes sense to me, but I'm curious how you managed to hit this case. This suggests that we create BitSets whose size is not maxDoc, do you know where this happens?
   
   I think I might be misunderstanding the question. Each BitSetIterator can have a different bitset length, potentially as an optimization ([minLength](https://github.com/apache/lucene/blob/branch_9_4/lucene/core/src/java/org/apache/lucene/search/ConjunctionDISI.java#L256) suggests this is expected). If a BitSetIterator's top match is doc 10 and there are 1M docs in the index, there is no reason to store 1M bits - the BitSetIterator can simply exhaust after doc 10.
   
   > The for loop should be unnecessary, there is no guarantee that all sub iterators advance to NO_MORE_DOCS. If this causes problems, then it means we have another bug somewhere else?
   
   Agreed, this is probably unnecessary. I added it to ensure that [this statement](https://github.com/apache/lucene/blob/branch_9_4/lucene/core/src/java/org/apache/lucene/search/ConjunctionDISI.java#L31-L32) holds: "Requires that all of its sub-iterators must be on the same document all the time."






[GitHub] [lucene] jmazanec15 commented on pull request #1068: LUCENE-10674: Update subiterators when BitSetConjDISI exhausts

2022-09-08 Thread GitBox


jmazanec15 commented on PR #1068:
URL: https://github.com/apache/lucene/pull/1068#issuecomment-1241094561

   > The fact that it should be legal for ConjunctionDISI to return 
NO_MORE_DOCS when the lead iterator advances to NO_MORE_DOCS without advancing 
other iterators makes me wonder if this change is hiding another bug. 
   
   Is it valid for an iterator to advance outside of the ConjunctionDISI? I can't find anywhere that prevents this, but I was under the assumption that it should be invalid to do so.
   
   > Do you understand why it's important for clauses under the ConjunctionDISI to advance to NO_MORE_DOCS?
   
   In general I was trying to keep all sub-iterators on the same doc at all times. In terms of importance, I guess it depends on whether the iterators can be used outside the ConjunctionDISI.
   
   





[GitHub] [lucene] jtibshirani commented on a diff in pull request #11756: LUCENE-10577: Remove LeafReader#searchNearestVectorsExhaustively

2022-09-08 Thread GitBox


jtibshirani commented on code in PR #11756:
URL: https://github.com/apache/lucene/pull/11756#discussion_r966310331


##
lucene/core/src/java/org/apache/lucene/search/VectorScorer.java:
##
@@ -0,0 +1,104 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.lucene.search;
+
+import java.io.IOException;
+import org.apache.lucene.index.FieldInfo;
+import org.apache.lucene.index.LeafReaderContext;
+import org.apache.lucene.index.VectorSimilarityFunction;
+import org.apache.lucene.index.VectorValues;
+import org.apache.lucene.util.BytesRef;
+import org.apache.lucene.util.VectorUtil;
+
+/**
+ * Computes the similarity score between a given query vector and different 
document vectors. This
+ * is primarily used by {@link org.apache.lucene.search.KnnVectorQuery} to run 
an exact, exhaustive
+ * search over the vectors.
+ */
+abstract class VectorScorer {
+  protected final VectorValues values;
+
+  /**
+   * Create a new vector scorer instance.
+   *
+   * @param context the reader context
+   * @param fi the FieldInfo for the field containing document vectors
+   * @param query the query vector to compute the similarity for
+   */
+  static VectorScorer create(LeafReaderContext context, FieldInfo fi, float[] 
query)
+  throws IOException {
+VectorValues values = context.reader().getVectorValues(fi.name);
+VectorSimilarityFunction similarity = fi.getVectorSimilarityFunction();
+return switch (fi.getVectorEncoding()) {
+  case BYTE -> new ByteVectorScorer(values, query, similarity);
+  case FLOAT32 -> new FloatVectorScorer(values, query, similarity);
+};
+  }
+
+  VectorScorer(VectorValues values) {
+this.values = values;
+  }
+
+  /**
+   * Advance the instance to the given document ID and return true if there is 
a value for that
+   * document.
+   */
+  public boolean advanceExact(int doc) throws IOException {
+int vectorDoc = values.docID();
+if (vectorDoc < doc) {
+  vectorDoc = values.advance(doc);
+}
+return vectorDoc == doc;
+  }
+
+  /** Compute the similarity score for the current document. */
+  abstract float score() throws IOException;
+
+  private static class ByteVectorScorer extends VectorScorer {
+private final BytesRef query;
+private final VectorSimilarityFunction similarity;

Review Comment:
   👍 






[GitHub] [lucene] jtibshirani commented on pull request #11756: LUCENE-10577: Remove LeafReader#searchNearestVectorsExhaustively

2022-09-08 Thread GitBox


jtibshirani commented on PR #11756:
URL: https://github.com/apache/lucene/pull/11756#issuecomment-1241095296

   Thanks for the reviews, I'll merge and backport to 9.4.





[GitHub] [lucene] rmuir commented on a diff in pull request #11738: Optimize MultiTermQueryConstantScoreWrapper for case when a term matches all docs in a segment.

2022-09-08 Thread GitBox


rmuir commented on code in PR #11738:
URL: https://github.com/apache/lucene/pull/11738#discussion_r966326978


##
lucene/core/src/java/org/apache/lucene/search/MultiTermQueryConstantScoreWrapper.java:
##
@@ -179,8 +189,29 @@ private WeightOrDocIdSet rewrite(LeafReaderContext 
context) throws IOException {
   return new WeightOrDocIdSet(weight);
 }
 
-// Too many terms: go back to the terms we already collected and start 
building the bit set
+// Too many terms: we'll evaluate the term disjunction and populate a 
bitset. We start with
+// the terms we haven't seen yet in case one of them matches all docs 
and lets us optimize
+// (likely rare in practice):

Review Comment:
   i'd prefer we didn't do this "backwards" because it means we no longer 
traverse everything sequentially like we previously did?






[GitHub] [lucene] jtibshirani merged pull request #11756: LUCENE-10577: Remove LeafReader#searchNearestVectorsExhaustively

2022-09-08 Thread GitBox


jtibshirani merged PR #11756:
URL: https://github.com/apache/lucene/pull/11756





[GitHub] [lucene] gsmiller commented on a diff in pull request #11738: Optimize MultiTermQueryConstantScoreWrapper for case when a term matches all docs in a segment.

2022-09-08 Thread GitBox


gsmiller commented on code in PR #11738:
URL: https://github.com/apache/lucene/pull/11738#discussion_r966335478


##
lucene/core/src/java/org/apache/lucene/search/MultiTermQueryConstantScoreWrapper.java:
##
@@ -179,8 +189,29 @@ private WeightOrDocIdSet rewrite(LeafReaderContext 
context) throws IOException {
   return new WeightOrDocIdSet(weight);
 }
 
-// Too many terms: go back to the terms we already collected and start 
building the bit set
+// Too many terms: we'll evaluate the term disjunction and populate a 
bitset. We start with
+// the terms we haven't seen yet in case one of them matches all docs 
and lets us optimize
+// (likely rare in practice):

Review Comment:
   The difference in the ordering change is just that we'll start building our bitset from the 17th term onwards and then come back and "fill in" the first 16. We still iterate the query-provided terms a single time in the order they're provided; it's just a question of when we go back to "fill in" those first 16 (do we "pause" after hitting the 17th term to fill in the first 16, then pick back up, or do we continue on and come back?).
   
   Do you suspect a performance shift or some other impact if we tweak this? I 
don't have a strong opinion.






[GitHub] [lucene] gsmiller commented on a diff in pull request #11738: Optimize MultiTermQueryConstantScoreWrapper for case when a term matches all docs in a segment.

2022-09-08 Thread GitBox


gsmiller commented on code in PR #11738:
URL: https://github.com/apache/lucene/pull/11738#discussion_r966336768


##
lucene/core/src/java/org/apache/lucene/search/MultiTermQueryConstantScoreWrapper.java:
##
@@ -165,9 +143,46 @@ private WeightOrDocIdSet rewrite(LeafReaderContext 
context) throws IOException {
 
 PostingsEnum docs = null;
 
-final List collectedTerms = new ArrayList<>();
-if (collectTerms(context, termsEnum, collectedTerms)) {
-  // build a boolean query
+// We will first try to collect up to 'threshold' terms into 
'matchingTerms'
+// if there are too many terms, we will fall back to building the 
'builder'

Review Comment:
   > I don't think its worth completely restructuring the code for that case? 
Its only 16 terms at most.
   
   Yeah, that's a fair point. I think we can get the best of both potentially 
anyway (see latest revision). Happy to stay more faithful to the current code 
structure.
   
   > Yeah, I think i could have read it wrong, my bad. I did try to stare for a 
while but I think i got confused by the decision tree.
   
   It's nuanced and slightly tricky code. I've stared at it a lot between this 
and `TermInSetQuery` so it's kind of ingrained in my brain right now, but it is 
a bit tricky to read for sure.






[GitHub] [lucene] gsmiller merged pull request #1035: LUCENE-10652: Add a top-n range faceting example to RangeFacetsExample

2022-09-08 Thread GitBox


gsmiller merged PR #1035:
URL: https://github.com/apache/lucene/pull/1035





[GitHub] [lucene] gsmiller commented on pull request #1035: LUCENE-10652: Add a top-n range faceting example to RangeFacetsExample

2022-09-08 Thread GitBox


gsmiller commented on PR #1035:
URL: https://github.com/apache/lucene/pull/1035#issuecomment-1241128652

   LGTM. Thanks @Yuti-G! Merged.





[GitHub] [lucene] rmuir commented on a diff in pull request #11738: Optimize MultiTermQueryConstantScoreWrapper for case when a term matches all docs in a segment.

2022-09-08 Thread GitBox


rmuir commented on code in PR #11738:
URL: https://github.com/apache/lucene/pull/11738#discussion_r966338760


##
lucene/core/src/java/org/apache/lucene/search/MultiTermQueryConstantScoreWrapper.java:
##
@@ -179,8 +189,29 @@ private WeightOrDocIdSet rewrite(LeafReaderContext 
context) throws IOException {
   return new WeightOrDocIdSet(weight);
 }
 
-// Too many terms: go back to the terms we already collected and start 
building the bit set
+// Too many terms: we'll evaluate the term disjunction and populate a 
bitset. We start with
+// the terms we haven't seen yet in case one of them matches all docs 
and lets us optimize
+// (likely rare in practice):

Review Comment:
   right, whereas before we'd always iterate all the terms/postings in sequential order. I feel the optimization is being too invasive on the typical case: e.g., on a big index with an MTQ that matches many terms, we may suffer page faults and such going back to those first 16 terms, when that can be avoided.
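The sequential pattern being argued for can be sketched abstractly. A hedged illustration only (plain Java collections; `THRESHOLD`, the term stream, and the return types are illustrative, not the actual `MultiTermQueryConstantScoreWrapper` code): collect up to a threshold of terms, and once the threshold is crossed, fall back to accumulating all matches, still visiting terms strictly in the order they arrive:

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.Iterator;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

class CollectThenFallBack {
    static final int THRESHOLD = 16;

    // Returns a List when there are few terms (the boolean-query path) or
    // a Set when the threshold is crossed (the bitset path). Either way
    // the incoming terms are consumed exactly once, in order.
    static Collection<String> rewrite(Iterator<String> terms) {
        List<String> collected = new ArrayList<>();
        while (terms.hasNext()) {
            String t = terms.next();
            if (collected.size() < THRESHOLD) {
                collected.add(t); // cheap path: keep individual terms
            } else {
                // Too many terms: seed the "bitset" with what we already
                // collected, then keep streaming sequentially.
                Set<String> matches = new LinkedHashSet<>(collected);
                matches.add(t);
                terms.forEachRemaining(matches::add);
                return matches;
            }
        }
        return collected;
    }

    public static void main(String[] args) {
        List<String> many = new ArrayList<>();
        for (int i = 0; i < 20; i++) many.add("t" + i);
        System.out.println(rewrite(many.iterator()).size()); // prints 20
    }
}
```

The design point under discussion is only about *when* the already-collected terms are folded into the fallback structure; the sketch folds them in at the moment the threshold is crossed, so term/postings traversal stays sequential.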






[GitHub] [lucene] Yuti-G commented on pull request #1035: LUCENE-10652: Add a top-n range faceting example to RangeFacetsExample

2022-09-08 Thread GitBox


Yuti-G commented on PR #1035:
URL: https://github.com/apache/lucene/pull/1035#issuecomment-1241182558

   Thank you so much @gsmiller!





[GitHub] [lucene] gsmiller commented on a diff in pull request #11738: Optimize MultiTermQueryConstantScoreWrapper for case when a term matches all docs in a segment.

2022-09-08 Thread GitBox


gsmiller commented on code in PR #11738:
URL: https://github.com/apache/lucene/pull/11738#discussion_r966403309


##
lucene/core/src/java/org/apache/lucene/search/MultiTermQueryConstantScoreWrapper.java:
##
@@ -179,8 +189,29 @@ private WeightOrDocIdSet rewrite(LeafReaderContext 
context) throws IOException {
   return new WeightOrDocIdSet(weight);
 }
 
-// Too many terms: go back to the terms we already collected and start 
building the bit set
+// Too many terms: we'll evaluate the term disjunction and populate a 
bitset. We start with
+// the terms we haven't seen yet in case one of them matches all docs 
and lets us optimize
+// (likely rare in practice):

Review Comment:
   Got it. Yeah that makes sense to me. I'll flip this back around.






[GitHub] [lucene] mayya-sharipova merged pull request #11743: LUCENE-10592 Better estimate memory for HNSW graph

2022-09-08 Thread GitBox


mayya-sharipova merged PR #11743:
URL: https://github.com/apache/lucene/pull/11743





[GitHub] [lucene] jtibshirani opened a new issue, #11758: Follow-up refactors to 8-bit quantization change

2022-09-08 Thread GitBox


jtibshirani opened a new issue, #11758:
URL: https://github.com/apache/lucene/issues/11758

   ### Description
   
   This issue tracks ideas for refactors as a follow-up to #11613, where we 
added support for 8-bit vector values:
   * We extended `KnnVectorsWriter` to be generic, accepting both `float[]` and 
`BytesRef` values. It'd be good to explore simplifying the write API to remove 
these generics.
   * We now switch on `VectorEncoding` in several places during graph building 
and search. Is there a way to encapsulate this better? Maybe there could be a 
helper class that knows the encoding and can perform distance computations, so 
that `HnswGraphBuilder` and `HnswGraphSearcher` don't need to know about the 
encoding.
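The second bullet could look roughly like the following. This is a hedged sketch only (all names here - `OrdScorer`, the array-backed vectors, the dot-product similarity - are hypothetical, not a proposed Lucene API): construct one scorer per encoding up front, so graph-building and search code would call `score(ord)` without ever switching on `VectorEncoding` themselves:

```java
// Encoding-aware scorer: callers never see the underlying encoding.
interface OrdScorer {
    float score(int ord);
}

// Scorer for float-encoded vectors (dot-product similarity).
final class FloatOrdScorer implements OrdScorer {
    private final float[][] vectors;
    private final float[] query;

    FloatOrdScorer(float[][] vectors, float[] query) {
        this.vectors = vectors;
        this.query = query;
    }

    @Override
    public float score(int ord) {
        float dot = 0;
        for (int i = 0; i < query.length; i++) dot += query[i] * vectors[ord][i];
        return dot;
    }
}

// Scorer for byte-encoded (quantized) vectors.
final class ByteOrdScorer implements OrdScorer {
    private final byte[][] vectors;
    private final byte[] query;

    ByteOrdScorer(byte[][] vectors, byte[] query) {
        this.vectors = vectors;
        this.query = query;
    }

    @Override
    public float score(int ord) {
        int dot = 0;
        for (int i = 0; i < query.length; i++) dot += query[i] * vectors[ord][i];
        return dot;
    }
}

class OrdScorerDemo {
    public static void main(String[] args) {
        OrdScorer f = new FloatOrdScorer(new float[][] {{1f, 2f}}, new float[] {3f, 4f});
        System.out.println(f.score(0)); // 11.0
    }
}
```

The switch on `VectorEncoding` would then happen once, at scorer construction time, instead of being repeated inside the graph traversal loops.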





[GitHub] [lucene] jmazanec15 commented on a diff in pull request #1068: LUCENE-10674: Update subiterators when BitSetConjDISI exhausts

2022-09-08 Thread GitBox


jmazanec15 commented on code in PR #1068:
URL: https://github.com/apache/lucene/pull/1068#discussion_r966532126


##
lucene/core/src/java/org/apache/lucene/search/ConjunctionDISI.java:
##
@@ -281,6 +281,12 @@ private int doNext(int doc) throws IOException {
   advanceLead:
   for (; ; doc = lead.nextDoc()) {
 if (doc >= minLength) {
+  if (doc != NO_MORE_DOCS) {
+lead.advance(NO_MORE_DOCS);
+  }
+  for (BitSetIterator iterator : bitSetIterators) {
+iterator.setDocId(NO_MORE_DOCS);
+  }

Review Comment:
   Oh I see. For this, we created a DocIdSetIterator using DocIdSetBuilder. For [maxDoc](https://github.com/apache/lucene/blob/main/lucene/core/src/java/org/apache/lucene/util/DocIdSetBuilder.java#L116), it was less than the number of docs in the index. This led to the lengths of the bitsets being different.


