Re: [PR] Remove IndexSearcher#search(List, Weight, Collector) [lucene]

2024-09-13 Thread via GitHub


javanna merged PR #13780:
URL: https://github.com/apache/lucene/pull/13780


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Use Arrays.compareUnsigned in IDVersionSegmentTermsEnum and OrdsSegmentTermsEnum. [lucene]

2024-09-13 Thread via GitHub


uschindler commented on PR #13782:
URL: https://github.com/apache/lucene/pull/13782#issuecomment-2348202070

   Backport or not?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Use Arrays.compareUnsigned in IDVersionSegmentTermsEnum and OrdsSegmentTermsEnum. [lucene]

2024-09-13 Thread via GitHub


jpountz commented on PR #13782:
URL: https://github.com/apache/lucene/pull/13782#issuecomment-2348207344

   I would backport as it looks pretty safe, though I doubt we have anyone 
using this postings format, so it likely doesn't matter much in practice.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Use Arrays.compareUnsigned in IDVersionSegmentTermsEnum and OrdsSegmentTermsEnum. [lucene]

2024-09-13 Thread via GitHub


vsop-479 commented on PR #13782:
URL: https://github.com/apache/lucene/pull/13782#issuecomment-2348218644

   > can you merge main back into your branch?
   
   Merged.
   
   > can we find other occurrences using some regex searches?
   
   I searched code with `final int targetUptoMid = targetUpto` which these 
iterating compare start with. 
   There  is only two places remains in 
`org.apache.lucene.backward_codecs.lucene40.blocktree.SegmentTermsEnum`, since 
this is a bakward class, I am not sure we should modify it.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Use Arrays.compareUnsigned in IDVersionSegmentTermsEnum and OrdsSegmentTermsEnum. [lucene]

2024-09-13 Thread via GitHub


uschindler commented on PR #13782:
URL: https://github.com/apache/lucene/pull/13782#issuecomment-2348293165

   > I would backport as it looks pretty safe, though I doubt we have anyone 
using this postings format, so it likely doesn't matter much in practice.
   
   The we should move the changes entry. Do we have a corresponding 9.x section 
already?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] similarities: provide default computeNorm implementation; remove remaining discountOverlaps setters; [lucene]

2024-09-13 Thread via GitHub


cpoerschke merged PR #13757:
URL: https://github.com/apache/lucene/pull/13757


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Use Arrays.compareUnsigned in IDVersionSegmentTermsEnum and OrdsSegmentTermsEnum. [lucene]

2024-09-13 Thread via GitHub


vsop-479 commented on PR #13782:
URL: https://github.com/apache/lucene/pull/13782#issuecomment-2348387100

   > The we should move the changes entry. Do we have a corresponding 9.x 
section already?
   
   We already have a similar change entry for 
https://github.com/apache/lucene/pull/13252 under 9.11.0, so we should move 
this change entry?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Add unit-of-least-precision float comparison [lucene]

2024-09-13 Thread via GitHub


mikemccand commented on code in PR #13723:
URL: https://github.com/apache/lucene/pull/13723#discussion_r1758531891


##
lucene/test-framework/src/test/org/apache/lucene/tests/util/TestFloatingPointComparisons.java:
##
@@ -0,0 +1,35 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.lucene.tests.util;
+
+public class TestFloatingPointComparisons extends LuceneTestCase {
+  public void test() {
+// Test adjacent numbers
+assertFalse(doubleEquals(Double.longBitsToDouble(1L), 
Double.longBitsToDouble(2L), 0));
+assertTrue(doubleEquals(Double.longBitsToDouble(1L), 
Double.longBitsToDouble(2L), 1));
+assertFalse(floatEquals(Float.intBitsToFloat(1), Float.intBitsToFloat(2), 
(short) 0));
+assertTrue(floatEquals(Float.intBitsToFloat(1), Float.intBitsToFloat(2), 
(short) 1));
+
+// Test signed zeros
+assertTrue(doubleEquals(0.0d, -0.0d, 0));
+assertTrue(floatEquals(0.0f, -0.0f, (short) 0));
+
+// Test NaNs
+assertFalse(doubleEquals(Double.NaN, Double.NaN, Integer.MAX_VALUE));
+assertFalse(floatEquals(Float.NaN, Float.NaN, (short) 32767));

Review Comment:
   You could maybe also use `Math.nextAfter` to add one or two ulps to a random 
float/double and then `assert` that `float/doubleEquals` agrees?



##
lucene/CHANGES.txt:
##
@@ -422,7 +422,9 @@ Build
 
 Other
 
-(No changes)
+
+* GITHUB#13720: Add float comparison based on unit of least precision and use 
it to stop test failures caused by float

Review Comment:
   Maybe add `IEEE 754 float summation implemented by Java not being 
commutative` or so?
   
   Mathematically float summation is fine :)



##
lucene/test-framework/src/test/org/apache/lucene/tests/util/TestFloatingPointComparisons.java:
##
@@ -0,0 +1,35 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.lucene.tests.util;
+
+public class TestFloatingPointComparisons extends LuceneTestCase {
+  public void test() {
+// Test adjacent numbers
+assertFalse(doubleEquals(Double.longBitsToDouble(1L), 
Double.longBitsToDouble(2L), 0));
+assertTrue(doubleEquals(Double.longBitsToDouble(1L), 
Double.longBitsToDouble(2L), 1));
+assertFalse(floatEquals(Float.intBitsToFloat(1), Float.intBitsToFloat(2), 
(short) 0));
+assertTrue(floatEquals(Float.intBitsToFloat(1), Float.intBitsToFloat(2), 
(short) 1));
+
+// Test signed zeros
+assertTrue(doubleEquals(0.0d, -0.0d, 0));
+assertTrue(floatEquals(0.0f, -0.0f, (short) 0));

Review Comment:
   Fun.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Add unit-of-least-precision float comparison [lucene]

2024-09-13 Thread via GitHub


mikemccand commented on PR #13723:
URL: https://github.com/apache/lucene/pull/13723#issuecomment-2348502055

   > > I don't like the last commit because it changes from a assert-like 
method to a boolean returning method.
   > 
   > I changed it away from an assertion because I liked this more. It makes it 
so you can assert on floats _not_ being equal or use their equality in a 
condition, without making an assertion statement much longer. Do you not like 
it because it's too generic? We could move it to `TestUtil` or we could provide 
assertion methods alongside the equality methods, if that helps.
   
   Maybe we could do both?
   
   I.e. add the `TestUtil` boolean-returning methods (`float/doubleEquals`) to 
`TestUtil` but then also add the sugar methods `void 
assertFloat/DoubleUlpEquals`?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[I] The "PatternCaptureGroupTokenFilter" generates identical offsets, which causes issues with highlighting the string. [lucene]

2024-09-13 Thread via GitHub


shikhasharma3708 opened a new issue, #13783:
URL: https://github.com/apache/lucene/issues/13783

   ### Description
   
   I am implementing the PatternCaptureGroupTokenFilter in my code to generate 
tokens based on multiple regular expressions, with the goal of highlighting any 
matches found within the string. Currently, I am working with Lucene 9, but I 
am encountering the following error during execution.
   
   I'm using one of the latest Lucene jars (9.11.1) for 
PatternCaptureGroupTokenFilter, but the token positions are not as expected, 
making it difficult to accurately highlight the search results.
   
   Here are the tokens generated for the string:
   
   `{
 "tokens" : [
   {
 "token" : "test:data",
 "start_offset" : 0,
 "end_offset" : 9,
 "type" : "word",
 "position" : 0
   },
   {
 "token" : "test",
 "start_offset" : 0,
 "end_offset" : 9,
 "type" : "word",
 "position" : 0
   },
   {
 "token" : ":data",
 "start_offset" : 0,
 "end_offset" : 9,
 "type" : "word",
 "position" : 0
   },
   {
 "token" : "test:",
 "start_offset" : 0,
 "end_offset" : 9,
 "type" : "word",
 "position" : 0
   },
   {
 "token" : "test",
 "start_offset" : 10,
 "end_offset" : 14,
 "type" : "word",
 "position" : 1
   }
 ]
   }`
   
   The generated offsets are not compatible for highlighting the searchedQuery.
   
   To generate different offsets, i tried using the 
`org.apache.lucene.analysis.tokenattributes.OffsetAttribute` Lucene package. 
which is giving me the different offsets but  i am encountering another error 
while indexing the document.
   
   I am using the below java code.
   `public final class PatternCaptureGroupTokenFilter extends TokenFilter {
   
   private final CharTermAttribute charTermAttr = 
addAttribute(CharTermAttribute.class);
   private final PositionIncrementAttribute posAttr = 
addAttribute(PositionIncrementAttribute.class);
   private final OffsetAttribute offsetAtt = 
addAttribute(OffsetAttribute.class);
   
   private final TypeAttribute typeAttribute = 
addAttribute(TypeAttribute.class);
   private State state;
   private final Matcher[] matchers;
   private final CharsRefBuilder spare = new CharsRefBuilder();
   private final int[] groupCounts;
   private final boolean preserveOriginal;
   private int[] currentGroup;
   private int currentMatcher;
   private int main_token_start;
   private int main_token_end;
   
   public PatternCaptureGroupTokenFilter(TokenStream input,
   boolean preserveOriginal, Pattern... patterns) {
   super(input);
   
   this.preserveOriginal = preserveOriginal;
   this.matchers = new Matcher[patterns.length];
   this.groupCounts = new int[patterns.length];
   this.currentGroup = new int[patterns.length];
   for (int i = 0; i < patterns.length; i++) {
   this.matchers[i] = patterns[i].matcher("");
   this.groupCounts[i] = this.matchers[i].groupCount();
   this.currentGroup[i] = -1;
   }
   }
   
   private boolean nextCapture() {
   
   int min_offset = Integer.MAX_VALUE;
   currentMatcher = -1;
   Matcher matcher;
   
   for (int i = 0; i < matchers.length; i++) {
   matcher = matchers[i];
   if (currentGroup[i] == -1) {
   currentGroup[i] = matcher.find() ? 1 : 0;
   }
   if (currentGroup[i] != 0) {
   while (currentGroup[i] < groupCounts[i] + 1) {
   final int start = matcher.start(currentGroup[i]);
   final int end = matcher.end(currentGroup[i]);
   
   if (start == end || preserveOriginal && start == 0
   && spare.length() == end) {
   currentGroup[i]++;
   continue;
   }
   if (start < min_offset) {
   min_offset = start;
   currentMatcher = i;
   }
   break;
   }
   if (currentGroup[i] == groupCounts[i] + 1) {
   currentGroup[i] = -1;
   i--;
   }
   }
   }
   return currentMatcher != -1;
   }
   
   @Override
   public boolean incrementToken() throws IOException {
   if (currentMatcher != -1 && nextCapture()) {
   assert state != null;
   clearAttributes();
   restoreState(state);
   final int start = matchers[currentMatcher]
   .start(currentGroup[currentMatcher]);
 

[PR] Change docValuesSkipIndex from a boolean to an enum. [lucene]

2024-09-13 Thread via GitHub


jpountz opened a new pull request, #13784:
URL: https://github.com/apache/lucene/pull/13784

   At the moment, our skip indexes record min/max ordinal/value per range of 
doc IDs. It would be natural to extend it to other pre-aggregated data such as 
a sum and value count, which facets could take advantage of. This change 
switches `docValuesSkipIndex` from a boolean to an enum so that we could 
release such changes in the future in an additive fashion, by adding constants 
to this enum and new methods to `DocValuesSkipper`.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] First-class random access API for KnnVectorValues [lucene]

2024-09-13 Thread via GitHub


jpountz commented on PR #13779:
URL: https://github.com/apache/lucene/pull/13779#issuecomment-2348633015

   >  think you had said 9/22 would be a feature freeze date
   
   I was thinking of doing it next week, but we can backport this PR even 
though the branch has been cut if it looks ready/safe.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] First-class random access API for KnnVectorValues [lucene]

2024-09-13 Thread via GitHub


jpountz commented on code in PR #13779:
URL: https://github.com/apache/lucene/pull/13779#discussion_r1758641965


##
lucene/core/src/java/org/apache/lucene/index/KnnVectorValues.java:
##
@@ -0,0 +1,281 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.lucene.index;
+
+import java.io.IOException;
+import org.apache.lucene.codecs.lucene90.IndexedDISI;
+import org.apache.lucene.document.KnnByteVectorField;
+import org.apache.lucene.document.KnnFloatVectorField;
+import org.apache.lucene.search.DocIdSetIterator;
+import org.apache.lucene.util.Bits;
+
+/**
+ * This class abstracts addressing of document vector values indexed as {@link 
KnnFloatVectorField}
+ * or {@link KnnByteVectorField}.
+ *
+ * @lucene.experimental
+ */
+public abstract class KnnVectorValues {
+
+  /** The iterator associated with these values. */
+  protected DocIndexIterator iterator;
+
+  /** Return the dimension of the vectors */
+  public abstract int dimension();
+
+  /**
+   * Return the number of vectors for this field.
+   *
+   * @return the number of vectors returned by this iterator
+   */
+  public abstract int size();
+
+  /**
+   * Return the docid of the document indexed with the given vector ordinal. 
This default
+   * implementation returns the argument and is appropriate for dense values 
implementations where
+   * every doc has a single value.
+   */
+  public int ordToDoc(int ord) {
+return ord;
+  }
+
+  /**
+   * Creates a new copy of this {@link KnnVectorValues}. This is helpful when 
you need to access
+   * different values at once, to avoid overwriting the underlying vector 
returned.
+   */
+  public abstract KnnVectorValues copy() throws IOException;
+
+  /** Returns the vector byte length, defaults to dimension multiplied by 
float byte size */
+  public int getVectorByteLength() {
+return dimension() * getEncoding().byteSize;
+  }
+
+  /** The vector encoding of these values. */
+  public abstract VectorEncoding getEncoding();
+
+  /** Returns a Bits accepting docs accepted by the argument and having a 
vector value */
+  public Bits getAcceptOrds(Bits acceptDocs) {
+// FIXME: change default to return acceptDocs and provide this impl
+// somewhere more specialized (in every non-dense impl).
+if (acceptDocs == null) {
+  return null;
+}
+return new Bits() {
+  @Override
+  public boolean get(int index) {
+return acceptDocs.get(ordToDoc(index));
+  }
+
+  @Override
+  public int length() {
+return size();
+  }
+};
+  }
+
+  /**
+   * Return the iterator for this instance. If you need multiple iterators, 
call 
+   * this.copy().iterator().
+   */
+  public DocIndexIterator iterator() {
+if (iterator == null) {
+  iterator = createIterator();
+}
+return iterator;
+  }
+
+  /**
+   * Create an iterator for this instance; typically called once by 
iterator(). Wrapper
+   * value classes delegate to their inner instance's iterator and shouldn't 
implement this.
+   */
+  protected DocIndexIterator createIterator() {
+throw new UnsupportedOperationException();
+  }
+
+  /**
+   * A DocIdSetIterator that also provides an index() method tracking a 
distinct ordinal for a
+   * vector associated with each doc.
+   */
+  public abstract static class DocIndexIterator extends DocIdSetIterator {
+
+/** return the value index (aka "ordinal" or "ord") corresponding to the 
current doc */
+public abstract int index();
+
+@Override
+public int advance(int target) throws IOException {
+  return slowAdvance(target);
+}
+
+@Override
+public long cost() {
+  throw new UnsupportedOperationException();

Review Comment:
   If we default cost() to returning size(), that would work for me. But I'm 
not comfortable with having implementations of DocIdSetIterator#cost that may 
throw, which means e.g. that they cannot be put in a Conjunction DISI(which 
will want to sort its clauses by cost).



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apach

Re: [PR] Add unit-of-least-precision float comparison [lucene]

2024-09-13 Thread via GitHub


uschindler commented on PR #13723:
URL: https://github.com/apache/lucene/pull/13723#issuecomment-2348681854

   > > > I don't like the last commit because it changes from a assert-like 
method to a boolean returning method.
   > > 
   > > 
   > > I changed it away from an assertion because I liked this more. It makes 
it so you can assert on floats _not_ being equal or use their equality in a 
condition, without making an assertion statement much longer. Do you not like 
it because it's too generic? We could move it to `TestUtil` or we could provide 
assertion methods alongside the equality methods, if that helps.
   > 
   > Maybe we could do both?
   > 
   > I.e. add the `TestUtil` boolean-returning methods (`float/doubleEquals`) 
to `TestUtil` but then also add the sugar methods `void 
assertFloat/DoubleUlpEquals`?
   
   That was my intention.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Use Arrays.compareUnsigned in IDVersionSegmentTermsEnum and OrdsSegmentTermsEnum. [lucene]

2024-09-13 Thread via GitHub


uschindler commented on PR #13782:
URL: https://github.com/apache/lucene/pull/13782#issuecomment-2348768324

   > > The we should move the changes entry. Do we have a corresponding 9.x 
section already?
   > 
   > We already have a similar change entry for #13252 under 9.11.0, so we 
should move this change entry?
   
   Yes, let's merge them an list both PRs.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Use Arrays.compareUnsigned in IDVersionSegmentTermsEnum and OrdsSegmentTermsEnum. [lucene]

2024-09-13 Thread via GitHub


vsop-479 commented on PR #13782:
URL: https://github.com/apache/lucene/pull/13782#issuecomment-2348774168

   Ok, I will do it.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Add unit-of-least-precision float comparison [lucene]

2024-09-13 Thread via GitHub


stefanvodita commented on PR #13723:
URL: https://github.com/apache/lucene/pull/13723#issuecomment-2349071396

   I've moved the methods around and, as I was writing more tests, realised I'm 
not going to be as comprehensive as the originals tests, so I adapted those 
instead.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Replace Map with IntObjectHashMap for KnnVectorsReader [lucene]

2024-09-13 Thread via GitHub


bugmakerr commented on PR #13763:
URL: https://github.com/apache/lucene/pull/13763#issuecomment-2349265068

   @benwtrent @jpountz I have merged main branch into this one, can I get a 
review on this?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[PR] Fix Flaky Test In TestBlockJoinBulkScorer [lucene]

2024-09-13 Thread via GitHub


Mikep86 opened a new pull request, #13785:
URL: https://github.com/apache/lucene/pull/13785

   Fix the `testSetMinCompetitiveScoreWithScoreModeMax` test, which sometimes 
failed due to randomizations in how the docs were scored
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] First-class random access API for KnnVectorValues [lucene]

2024-09-13 Thread via GitHub


ChrisHegarty commented on code in PR #13779:
URL: https://github.com/apache/lucene/pull/13779#discussion_r1759099771


##
lucene/core/src/java/org/apache/lucene/codecs/lucene95/HasIndexSlice.java:
##
@@ -14,23 +14,16 @@
  * See the License for the specific language governing permissions and
  * limitations under the License.
  */
-package org.apache.lucene.util.quantization;
+package org.apache.lucene.codecs.lucene95;
 
-import java.io.IOException;
-import org.apache.lucene.util.hnsw.RandomAccessVectorValues;
+import org.apache.lucene.store.IndexInput;
 
 /**
- * Random access values for byte[], but also includes accessing 
the score correction
- * constant for the current vector in the buffer.
- *
- * @lucene.experimental
+ * Implementors can return the IndexInput from which their values are read. 
For use by vector
+ * quantizers.
  */
-public interface RandomAccessQuantizedByteVectorValues extends 
RandomAccessVectorValues.Bytes {
-
-  ScalarQuantizer getScalarQuantizer();
-
-  float getScoreCorrectionConstant(int vectorOrd) throws IOException;
+public interface HasIndexSlice {
 
-  @Override
-  RandomAccessQuantizedByteVectorValues copy() throws IOException;
+  /** Returns an IndexInput from which to read this instance's values. */
+  IndexInput getSlice();

Review Comment:
   I very much like this, and had something similar in a past unmarked PR. šŸ‘ 



##
lucene/core/src/java/org/apache/lucene/index/KnnVectorValues.java:
##
@@ -0,0 +1,281 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.lucene.index;
+
+import java.io.IOException;
+import org.apache.lucene.codecs.lucene90.IndexedDISI;
+import org.apache.lucene.document.KnnByteVectorField;
+import org.apache.lucene.document.KnnFloatVectorField;
+import org.apache.lucene.search.DocIdSetIterator;
+import org.apache.lucene.util.Bits;
+
+/**
+ * This class abstracts addressing of document vector values indexed as {@link 
KnnFloatVectorField}
+ * or {@link KnnByteVectorField}.
+ *
+ * @lucene.experimental
+ */
+public abstract class KnnVectorValues {
+
+  /** The iterator associated with these values. */
+  protected DocIndexIterator iterator;
+
+  /** Return the dimension of the vectors */
+  public abstract int dimension();
+
+  /**
+   * Return the number of vectors for this field.
+   *
+   * @return the number of vectors returned by this iterator
+   */
+  public abstract int size();
+
+  /**
+   * Return the docid of the document indexed with the given vector ordinal. 
This default
+   * implementation returns the argument and is appropriate for dense values 
implementations where
+   * every doc has a single value.
+   */
+  public int ordToDoc(int ord) {
+return ord;
+  }
+
+  /**
+   * Creates a new copy of this {@link KnnVectorValues}. This is helpful when 
you need to access
+   * different values at once, to avoid overwriting the underlying vector 
returned.
+   */
+  public abstract KnnVectorValues copy() throws IOException;
+
+  /** Returns the vector byte length, defaults to dimension multiplied by 
float byte size */
+  public int getVectorByteLength() {

Review Comment:
   yes. I think that it is needed. For example, for int4 the byte length 
depends on whether or not the vector data is compressed. Maybe update the 
javadoc - it's not always *float*.



##
lucene/core/src/java/org/apache/lucene/index/KnnVectorValues.java:
##
@@ -0,0 +1,281 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * l

Re: [PR] Use range optimizations for "slow" MultiTermQueries when terms happen to be contiguous [lucene]

2024-09-13 Thread via GitHub


iverase commented on PR #13693:
URL: https://github.com/apache/lucene/pull/13693#issuecomment-2349297359

   I have my doubts in this approach and I am unsure we should expose the 
scorer that way. On the other hand, I wonder if we can rewrite the query to a 
range query. I understand that in this case, you need to be able to apply the 
optimisation to all segments, but that's probably the most common case?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Add unit-of-least-precision float comparison [lucene]

2024-09-13 Thread via GitHub


uschindler commented on code in PR #13723:
URL: https://github.com/apache/lucene/pull/13723#discussion_r1759153816


##
lucene/CHANGES.txt:
##
@@ -422,7 +422,9 @@ Build
 
 Other
 
-(No changes)
+
+* GITHUB#13720: Add float comparison based on unit of least precision and use 
it to stop test failures caused by float
+  summation not being commutative in Java's IEEE 754 implementation. (Alex 
Herbert, Stefan Vodita)

Review Comment:
   It is always not commutative regardless of implementation.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Add unit-of-least-precision float comparison [lucene]

2024-09-13 Thread via GitHub


uschindler commented on code in PR #13723:
URL: https://github.com/apache/lucene/pull/13723#discussion_r1759155851


##
lucene/test-framework/src/java/org/apache/lucene/tests/util/LuceneTestCase.java:
##
@@ -864,6 +864,14 @@ public static void assumeNoException(String msg, Exception 
e) {
 RandomizedTest.assumeNoException(msg, e);
   }
 
+  public static void assertFloatUlpEquals(final float x, final float y, final 
short maxUlps) {
+assertTrue(TestUtil.floatUlpEquals(x, y, maxUlps));

Review Comment:
   Maybe add a message showing string of both values.
   The problem with assert true is that you get no useful output.
   This is why I wanted to have the assert methods.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [I] Can we remove `compress` option for quantized KNN vector indexing? [lucene]

2024-09-13 Thread via GitHub


mikemccand commented on issue #13768:
URL: https://github.com/apache/lucene/issues/13768#issuecomment-2349362383

   >> But I don't think we should block removing compress option due to 
non-SIMD results?
   
   Actually, thinking about this more ... I'm changing my mind.  I don't fully 
understand how poor our Panama/SIMD coverage is across CPU types/versions, 
"typically" in use by our users.  E.g. for ARM CPUs (various versions of NEON 
instructions).  What %tg of our users would hit the non-SIMD (non-Panama) path?
   
   It's spooky that the likes of OpenSearch, Elasticsearch, Solr are needing to 
pull in their own Panama FMA wrappers around native code to better optimize for 
certain vectorized instruction cases (see discussion on #13572).  Ideally such 
optimizations would be in Lucene so we could make decisions like this (remove 
`compress` option to simplify our API / reduce surface area) with more 
confidence.
   
   I'd like to run benchmarks across many more CPUs before rushing to a 
decision here, and I think for now we should otherwise respect the non-SIMD 
results?  I love our new `aws-jmh` dev tool (thank you @rmuir)!  I looked at 
its `playbook.yml` to figure out if I could also add "go check out 
`luceneutil`, download this massive 95 GB `.vec` file, and run `knnPerfTest.py` 
and summarize the results" but I haven't made much progress so far ...


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] New JMH benchmark method - vdot8s that implement int8 dotProduct in C… [lucene]

2024-09-13 Thread via GitHub


ChrisHegarty commented on PR #13572:
URL: https://github.com/apache/lucene/pull/13572#issuecomment-2349442717

   > > Anyways: At moment we do not want to have native code in Lucene Core.
   > ..
   > Having the likes of OpenSearch, Elasticsearch, and Solr implement their 
own (high performance) direct native SIMD access seems not ideal to me. It 
really should somehow be an option in Lucene, even if it's not the default.
   
   As someone who wrote and maintains the native scorer in Elasticsearch, it 
gives great speed ups for Scalar Quantised vectors on ARM. Since Panama Vector 
is horrible for arithmetic operations on byte-sized values; both widen and 
accessing various parts of the vector. We have an AVX implementation too, but 
it gives less improvements. I don't see Panama Vector improving in this area 
any time soon :-(.  For floats we don't do any native.
   
   In the latest Binary Quantization 
https://github.com/apache/lucene/pull/13651, Panama Vector performs very 
nicely.  It might be that some of the motivation for a native scorer may just 
go away.
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Add unit-of-least-precision float comparison [lucene]

2024-09-13 Thread via GitHub


stefanvodita commented on code in PR #13723:
URL: https://github.com/apache/lucene/pull/13723#discussion_r1759505470


##
lucene/CHANGES.txt:
##
@@ -422,7 +422,9 @@ Build
 
 Other
 
-(No changes)
+
+* GITHUB#13720: Add float comparison based on unit of least precision and use 
it to stop test failures caused by float
+  summation not being commutative in Java's IEEE 754 implementation. (Alex 
Herbert, Stefan Vodita)

Review Comment:
   Now that I think about it, it's not even a commutativity issue. It's an 
associativity issue.
   `a + b == b + a`
   But
   `(a + b) + c != a + (b + c)`
   I'll fix the entry.
   



##
lucene/test-framework/src/java/org/apache/lucene/tests/util/LuceneTestCase.java:
##
@@ -864,6 +864,14 @@ public static void assumeNoException(String msg, Exception 
e) {
 RandomizedTest.assumeNoException(msg, e);
   }
 
+  public static void assertFloatUlpEquals(final float x, final float y, final 
short maxUlps) {
+assertTrue(TestUtil.floatUlpEquals(x, y, maxUlps));

Review Comment:
   Good point, will do.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Use Arrays.compareUnsigned in IDVersionSegmentTermsEnum and OrdsSegmentTermsEnum. [lucene]

2024-09-13 Thread via GitHub


vsop-479 commented on PR #13782:
URL: https://github.com/apache/lucene/pull/13782#issuecomment-2350779495

   > let's merge them an list both PRs.
   
   I merged them into one entry.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[PR] Remove recurse into sub block when scan leaf block in IDVersionSegmentTermsEnumFrame. [lucene]

2024-09-13 Thread via GitHub


vsop-479 opened a new pull request, #13786:
URL: https://github.com/apache/lucene/pull/13786

   ### Description
   
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Remove usage of IndexSearcher#search(Query, Collector) from join package [lucene]

2024-09-13 Thread via GitHub


msfroh commented on code in PR #13747:
URL: https://github.com/apache/lucene/pull/13747#discussion_r1759648681


##
lucene/join/src/java/org/apache/lucene/search/join/MergeableCollector.java:
##
@@ -0,0 +1,26 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.lucene.search.join;
+
+import java.io.IOException;
+import org.apache.lucene.search.Collector;
+
+interface MergeableCollector> extends 
Collector {
+
+  /** Folds the results from another collector of the same type into this 
collector. */
+  void merge(C collector) throws IOException;
+}

Review Comment:
   Yeah... I think you're right. I'll still need an abstraction to describe 
what the Collector produces, but then either that "thing" will know how to 
merge themselves, or it can be an anonymous function passed to the 
CollectorManager.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Use Arrays.compareUnsigned in IDVersionSegmentTermsEnum and OrdsSegmentTermsEnum. [lucene]

2024-09-13 Thread via GitHub


vsop-479 commented on PR #13782:
URL: https://github.com/apache/lucene/pull/13782#issuecomment-2350800984

   > can we find other occurrences using some regex searches?
   
   Hmm, I also find some suffix's loop comparing in 
`IDVersionSegmentTermsEnumFrame` and `OrdsSegmentTermsEnumFrame`, similar to 
`SegmentTermsEnumFrame`, these code maybe could replaced by 
`Arrays#compareUnsigned` too.
   
   But, firstly I find we attempt to load sub block when scanning a leaf block: 
https://github.com/apache/lucene/pull/13786. 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[PR] Fix comment on compare suffix and target. [lucene]

2024-09-13 Thread via GitHub


vsop-479 opened a new pull request, #13787:
URL: https://github.com/apache/lucene/pull/13787

   ### Description
   
   Since we do not loop over bytes (hand-written) in the suffix, to comaring to 
the target anymore.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Remove usage of IndexSearcher#search(Query, Collector) from join package [lucene]

2024-09-13 Thread via GitHub


msfroh commented on code in PR #13747:
URL: https://github.com/apache/lucene/pull/13747#discussion_r1759668285


##
lucene/join/src/java/org/apache/lucene/search/join/MergeableCollector.java:
##
@@ -0,0 +1,26 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.lucene.search.join;
+
+import java.io.IOException;
+import org.apache.lucene.search.Collector;
+
+interface MergeableCollector> extends 
Collector {
+
+  /** Folds the results from another collector of the same type into this 
collector. */
+  void merge(C collector) throws IOException;
+}

Review Comment:
   Oh, wait, I'm a moron. 
   
   I should just more or  less turn all of these Collectors into 
CollectorManagers. Then I'm not duplicating anything.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[PR] Reduce number of calculations in FSTCompiler [lucene]

2024-09-13 Thread via GitHub


mrhbj opened a new pull request, #13788:
URL: https://github.com/apache/lucene/pull/13788

   ### Description
   
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org