Re: [PR] Allow FST builder to use different writer (#12543) [lucene]

2023-11-07 Thread via GitHub


dungba88 commented on code in PR #12624:
URL: https://github.com/apache/lucene/pull/12624#discussion_r1384518777


##
lucene/core/src/java/org/apache/lucene/util/fst/FSTWriter.java:
##
@@ -0,0 +1,54 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.lucene.util.fst;
+
+import java.io.IOException;
+
+/** Abstract class which provides low-level functionality to write to a FST */
+public interface FSTWriter {
+
+  /**

Review Comment:
   These 2 methods are same as DataOutput, but DataOutput is a class which we 
want to avoid extending from.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [I] Take advantage of bloom filter when delete terms [lucene]

2023-11-07 Thread via GitHub


gf2121 commented on issue #12725:
URL: https://github.com/apache/lucene/issues/12725#issuecomment-1798007152

   > do we have any numbers if it actually helps applying deletes?
   
   I made a naive benchmark comparing `seekCeil` and `seekExact` when deleting 
terms on a field with bloom filter, result shows `seekExact` style ~9% faster 
in average. (All terms are 16 bytes UUID)
   
   https://bytedance.feishu.cn/sheets/Jwjzs3FdXhFbVwtfADJc6L8rnCb";
 data-importRangeRawData-range="'Sheet1'!A1:D12">
   
   round | seekCeil(ms) | candidate(ms) | diff
   -- | -- | -- | --
   1 | 46211 | 42783 | -7%
   2 | 44641 | 40698 | -9%
   3 | 44462 | 43471 | -2%
   4 | 47059 | 43842 | -7%
   5 | 45049 | 43609 | -3%
   6 | 49529 | 40953 | -17%
   7 | 44707 | 41498 | -7%
   8 | 45732 | 40785 | -11%
   9 | 45396 | 41321 | -9%
   10 | 48759 | 40760 | -16%
   avg | 46154.5 | 41972 | -9%
   
   
   
   > wild idea but would it make sense to build an automaton off these terms 
and intersect it? We could reuse it for multiple segments? I am not sure how 
big the costs are for that but it would potentially in a codec agnostic way?
   
   This sounds great. I think intersecting with automation can not take 
advantage of bloom filter too. So it should be a competitive approach comparing 
to `seekExact` or `seekCeil` ? I'd like to give it a try.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Early terminate visit BKD leaf when current value greater than upper point in sorted dim. [lucene]

2023-11-07 Thread via GitHub


vsop-479 commented on PR #12528:
URL: https://github.com/apache/lucene/pull/12528#issuecomment-1798153626

   @iverase I implemented visitWithSortedDim in LatLonPointDistanceQuery. 
Please take a look when you get a chance!
   BTW, The contents of getIntersectVisitor and getInverseIntersectVisitor in 
LatLonPointDistanceQuery and PointRangeQuery are identical. Should we move them 
to a DefaultIntersectVisitor and make match/relate abstract in it, i mean in a 
new PR without this change.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Remove BytesReader#reversed() [lucene]

2023-11-07 Thread via GitHub


mikemccand merged PR #12777:
URL: https://github.com/apache/lucene/pull/12777


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [I] Remove `FST.BytesReader#reversed` method? [lucene]

2023-11-07 Thread via GitHub


mikemccand closed issue #12759: Remove `FST.BytesReader#reversed` method?
URL: https://github.com/apache/lucene/issues/12759


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [I] Should we ban Random#nextInt(int, int)? [lucene]

2023-11-07 Thread via GitHub


mikemccand closed issue #12771: Should we ban Random#nextInt(int, int)?
URL: https://github.com/apache/lucene/issues/12771


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [I] Should we ban Random#nextInt(int, int)? [lucene]

2023-11-07 Thread via GitHub


mikemccand commented on issue #12771:
URL: https://github.com/apache/lucene/issues/12771#issuecomment-1798212632

   Indeed, the javadocs have the same terrifying explanation -- thanks 
@dungba88!: `If bound is a power of two then limiting is a simple masking 
operation. Otherwise, the result is re-calculated by invoking nextInt() until 
the result is greater than or equal zero and less than bound.`
   
   Fortunately, thanks to @rmuir digging into OpenJDK sources, it looks like 
this is a false alarm: these methods first pre-mask to the "next largest power 
of 2" fully containing the requested range, and iteratively pull random numbers 
in that range, until finding one within the requested (not necessarily power of 
2) range.  So the number of iterations of that loop is "typically" very low (1 
or 2).  I cancelled the PR (#12772) and forgot to close this issue -- closing 
now.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Copy directly between 2 ByteBlockPool to avoid double-copy [lucene]

2023-11-07 Thread via GitHub


mikemccand commented on code in PR #12778:
URL: https://github.com/apache/lucene/pull/12778#discussion_r1384692182


##
lucene/core/src/test/org/apache/lucene/util/TestByteBlockPool.java:
##
@@ -25,7 +24,34 @@
 
 public class TestByteBlockPool extends LuceneTestCase {
 
-  public void testReadAndWrite() throws IOException {
+  public void testAppendFromOtherPool() {
+ByteBlockPool pool = new ByteBlockPool(new 
ByteBlockPool.DirectAllocator());
+final int numBytes = atLeast(2 << 16);
+byte[] bytes = new byte[numBytes];
+for (int i = 0; i < numBytes; i++) {
+  bytes[i] = (byte) random().nextInt(256);

Review Comment:
   > It seems we are also banning the `random().nextInt()` call in #12771
   
   Sorry, no need to avoid this API -- it is unbanned now -- it was a false 
alarm!  Sorry for the scare :)



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Copy directly between 2 ByteBlockPool to avoid double-copy [lucene]

2023-11-07 Thread via GitHub


mikemccand commented on code in PR #12778:
URL: https://github.com/apache/lucene/pull/12778#discussion_r1384697358


##
lucene/core/src/java/org/apache/lucene/util/ByteBlockPool.java:
##
@@ -234,6 +234,44 @@ public void append(final BytesRef bytes) {
 append(bytes.bytes, bytes.offset, bytes.length);
   }
 
+  /**
+   * Append the bytes from a source {@link ByteBlockPool} at a given offset 
and length
+   *
+   * @param srcPool the source pool to copy from
+   * @param offset the source pool offset
+   * @param length the number of bytes to copy
+   */
+  public void append(ByteBlockPool srcPool, long offset, int length) {
+int bytesLeft = length;
+while (bytesLeft > 0) {
+  int bufferLeft = BYTE_BLOCK_SIZE - byteUpto;
+  if (bytesLeft < bufferLeft) { // fits within current buffer
+copyBytes(srcPool, offset, bytesLeft);
+break;
+  } else { // fill up this buffer and move to next one
+if (bufferLeft > 0) {
+  copyBytes(srcPool, offset, bufferLeft);
+  bytesLeft -= bufferLeft;
+  offset += bufferLeft;
+}
+nextBuffer();
+  }
+}
+  }
+
+  // copy from source pool until no bytes left. length must be fit within the 
current head buffer
+  private void copyBytes(ByteBlockPool srcPool, long offset, int length) {
+while (length > 0) {

Review Comment:
   Maybe add comment that we need `while` loop here in case the byte slice to 
be copied spans multiple `byte[]` buffers in the `srcPool`?



##
lucene/core/src/java/org/apache/lucene/util/ByteBlockPool.java:
##
@@ -234,6 +234,44 @@ public void append(final BytesRef bytes) {
 append(bytes.bytes, bytes.offset, bytes.length);
   }
 
+  /**
+   * Append the bytes from a source {@link ByteBlockPool} at a given offset 
and length
+   *
+   * @param srcPool the source pool to copy from
+   * @param offset the source pool offset
+   * @param length the number of bytes to copy
+   */
+  public void append(ByteBlockPool srcPool, long offset, int length) {
+int bytesLeft = length;
+while (bytesLeft > 0) {
+  int bufferLeft = BYTE_BLOCK_SIZE - byteUpto;
+  if (bytesLeft < bufferLeft) { // fits within current buffer
+copyBytes(srcPool, offset, bytesLeft);
+break;
+  } else { // fill up this buffer and move to next one
+if (bufferLeft > 0) {
+  copyBytes(srcPool, offset, bufferLeft);
+  bytesLeft -= bufferLeft;
+  offset += bufferLeft;
+}
+nextBuffer();
+  }
+}
+  }
+
+  // copy from source pool until no bytes left. length must be fit within the 
current head buffer
+  private void copyBytes(ByteBlockPool srcPool, long offset, int length) {
+while (length > 0) {

Review Comment:
   Can we add an `assert` confirming `length` does indeed fit within the head 
buffer?



##
lucene/core/src/java/org/apache/lucene/util/ByteBlockPool.java:
##
@@ -234,6 +234,44 @@ public void append(final BytesRef bytes) {
 append(bytes.bytes, bytes.offset, bytes.length);
   }
 
+  /**
+   * Append the bytes from a source {@link ByteBlockPool} at a given offset 
and length
+   *
+   * @param srcPool the source pool to copy from
+   * @param offset the source pool offset
+   * @param length the number of bytes to copy
+   */
+  public void append(ByteBlockPool srcPool, long offset, int length) {
+int bytesLeft = length;
+while (bytesLeft > 0) {
+  int bufferLeft = BYTE_BLOCK_SIZE - byteUpto;
+  if (bytesLeft < bufferLeft) { // fits within current buffer
+copyBytes(srcPool, offset, bytesLeft);
+break;
+  } else { // fill up this buffer and move to next one
+if (bufferLeft > 0) {
+  copyBytes(srcPool, offset, bufferLeft);
+  bytesLeft -= bufferLeft;
+  offset += bufferLeft;
+}
+nextBuffer();
+  }
+}
+  }
+
+  // copy from source pool until no bytes left. length must be fit within the 
current head buffer
+  private void copyBytes(ByteBlockPool srcPool, long offset, int length) {

Review Comment:
   Can we rename to make it clear the length must fit within single buffer?  
Also, it's not a generic `copyBytes` but also an `append` operation?
   
   Maybe `appendBytesSingleBuffer` or something?



##
lucene/core/src/java/org/apache/lucene/util/ByteBlockPool.java:
##
@@ -234,6 +234,44 @@ public void append(final BytesRef bytes) {
 append(bytes.bytes, bytes.offset, bytes.length);
   }
 
+  /**
+   * Append the bytes from a source {@link ByteBlockPool} at a given offset 
and length
+   *
+   * @param srcPool the source pool to copy from
+   * @param offset the source pool offset
+   * @param length the number of bytes to copy
+   */
+  public void append(ByteBlockPool srcPool, long offset, int length) {

Review Comment:
   Rename to `srcOffset`?



##
lucene/core/src/java/org/apache/lucene/util/ByteBlockPool.java:
##
@@ -234,6 +234,44 @@ pu

Re: [PR] Copy directly between 2 ByteBlockPool to avoid double-copy [lucene]

2023-11-07 Thread via GitHub


dungba88 commented on code in PR #12778:
URL: https://github.com/apache/lucene/pull/12778#discussion_r1384749437


##
lucene/core/src/java/org/apache/lucene/util/ByteBlockPool.java:
##
@@ -234,6 +234,44 @@ public void append(final BytesRef bytes) {
 append(bytes.bytes, bytes.offset, bytes.length);
   }
 
+  /**
+   * Append the bytes from a source {@link ByteBlockPool} at a given offset 
and length
+   *
+   * @param srcPool the source pool to copy from
+   * @param offset the source pool offset
+   * @param length the number of bytes to copy
+   */
+  public void append(ByteBlockPool srcPool, long offset, int length) {
+int bytesLeft = length;
+while (bytesLeft > 0) {
+  int bufferLeft = BYTE_BLOCK_SIZE - byteUpto;
+  if (bytesLeft < bufferLeft) { // fits within current buffer
+copyBytes(srcPool, offset, bytesLeft);
+break;
+  } else { // fill up this buffer and move to next one
+if (bufferLeft > 0) {
+  copyBytes(srcPool, offset, bufferLeft);
+  bytesLeft -= bufferLeft;
+  offset += bufferLeft;
+}
+nextBuffer();
+  }
+}
+  }
+
+  // copy from source pool until no bytes left. length must be fit within the 
current head buffer
+  private void copyBytes(ByteBlockPool srcPool, long offset, int length) {
+while (length > 0) {

Review Comment:
   Thank you, I have incorporated the comment into the new rev



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Enable executing using NFA in RegexpQuery [lucene]

2023-11-07 Thread via GitHub


rmuir commented on code in PR #12767:
URL: https://github.com/apache/lucene/pull/12767#discussion_r1384795478


##
lucene/core/src/test/org/apache/lucene/search/TestRegexpQuery.java:
##
@@ -80,7 +80,10 @@ private long caseInsensitiveRegexQueryNrHits(String regex) 
throws IOException {
 newTerm(regex),
 RegExp.ALL,
 RegExp.ASCII_CASE_INSENSITIVE,
-Operations.DEFAULT_DETERMINIZE_WORK_LIMIT);
+RegexpQuery.DEFAULT_PROVIDER,
+Operations.DEFAULT_DETERMINIZE_WORK_LIMIT,
+MultiTermQuery.CONSTANT_SCORE_BLENDED_REWRITE,
+random().nextBoolean());

Review Comment:
   yeah unfortunately this test is incredibly weak...



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [I] Should we explore DiskANN for aKNN vector search? [lucene]

2023-11-07 Thread via GitHub


benwtrent commented on issue #12615:
URL: https://github.com/apache/lucene/issues/12615#issuecomment-1798357836

   @kevindrosendahl this looks really interesting! Thank you for digging in and 
starting the experimentation!
   
   I haven't had a chance to read your branch yet, but hope to soon.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Stop exploring HNSW graph if scores are not getting better. [lucene]

2023-11-07 Thread via GitHub


benwtrent merged PR #12770:
URL: https://github.com/apache/lucene/pull/12770


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Speed up BytesRefHash#sort [lucene]

2023-11-07 Thread via GitHub


gf2121 commented on PR #12775:
URL: https://github.com/apache/lucene/pull/12775#issuecomment-1798365624

   For realistic data I tried to index `wikimedium10m` with ramBuffer = 1024m 
and accumulating the took of all invoking of `BytesRefHash#sort`. Result shows 
the took sum decreased from 27161ms -> 12203ms, which is much better than i 
expected.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Speed up BytesRefHash#sort [lucene]

2023-11-07 Thread via GitHub


dweiss commented on PR #12775:
URL: https://github.com/apache/lucene/pull/12775#issuecomment-1798550955

   Wow. Nice improvement!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Speed up BytesRefHash#sort [lucene]

2023-11-07 Thread via GitHub


mikemccand commented on PR #12775:
URL: https://github.com/apache/lucene/pull/12775#issuecomment-1798603970

   This is incredible speedup :)


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[PR] Fix nested link warning from renderSiteJavadoc [lucene]

2023-11-07 Thread via GitHub


slow-J opened a new pull request, #12779:
URL: https://github.com/apache/lucene/pull/12779

   Very minor change.
   
   ### Description
   Without this change, we are getting this warning in `> Task 
:lucene:core:renderSiteJavadoc`
   
   ```
   
/local/home/jslowins/upstream_bench/lucene_bench_home_2/lucene_candidate/lucene/core/src/java/org/apache/lucene/analysis/WordlistLoader.java:32:
 warning: Tag {@link}: nested link
* @see IOUtils to obtain {@link Reader} instances
 ^
   ```


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Fix nested link warning from renderSiteJavadoc [lucene]

2023-11-07 Thread via GitHub


mikemccand commented on code in PR #12779:
URL: https://github.com/apache/lucene/pull/12779#discussion_r1385198200


##
lucene/core/src/java/org/apache/lucene/analysis/WordlistLoader.java:
##
@@ -29,7 +29,7 @@
 /**
  * Loader for text files that represent a list of stopwords.
  *
- * @see IOUtils to obtain {@link Reader} instances
+ * @see IOUtils to obtain Reader instances

Review Comment:
   Ahh this is a problem because the `to obtain Reader instances` is already a 
link pointing to `IOUtils`?  And so we cannot stick another link for `Reader` 
inside there.  Makes sense.  I'll merge shortly.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Fix nested link warning from renderSiteJavadoc [lucene]

2023-11-07 Thread via GitHub


slow-J commented on code in PR #12779:
URL: https://github.com/apache/lucene/pull/12779#discussion_r1385202002


##
lucene/core/src/java/org/apache/lucene/analysis/WordlistLoader.java:
##
@@ -29,7 +29,7 @@
 /**
  * Loader for text files that represent a list of stopwords.
  *
- * @see IOUtils to obtain {@link Reader} instances
+ * @see IOUtils to obtain Reader instances

Review Comment:
   Yes exactly, see the `see also` section of 
https://lucene.apache.org/core/9_8_0/core/org/apache/lucene/analysis/WordlistLoader.html



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Fix nested link warning from renderSiteJavadoc [lucene]

2023-11-07 Thread via GitHub


mikemccand merged PR #12779:
URL: https://github.com/apache/lucene/pull/12779


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [I] Explore partially decoding blocks (within-block skipping) [lucene]

2023-11-07 Thread via GitHub


Tony-X commented on issue #12749:
URL: https://github.com/apache/lucene/issues/12749#issuecomment-1799460960

   > How would it work? Since blocks are delta-coded, you can't know the value 
at a given index without decoding all previous values and computing their sum? 
Or you need to store some checkpoints separately, but then it may be 
easier/better to simply go with smaller blocks (e.g. 64 doc IDs instead of 128)?
   
   +1. Delta-encoding here is the blocker, unless we change the encoding 
scheme. 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[PR] Normalize written scalar quantized vectors when using cosine similarity [lucene]

2023-11-07 Thread via GitHub


kevindrosendahl opened a new pull request, #12780:
URL: https://github.com/apache/lucene/pull/12780

   ### Description
   
   When using cosine similarity, the `ScalarQuantizer` normalizes vectors when 
calculating quantiles and `ScalarQuantizedRandomVectorScorer` normalizes query 
vectors before scoring them, but `Lucene99ScalarQuantizedVectorsWriter` does 
not normalize the vectors prior to quantizing them when producing the quantized 
vectors to write to disk. This PR normalizes vectors prior to quantizing them 
when writing them to disk.
   
   Recall results on my M1 with the `glove-100-angular` data set (all using 
`maxConn`: 16, `beamWidth` 100, `numCandidates`: 100, `k`: 10, single segment):
   | Configuration | Recall | Average Query Duration |
   |---|---|-| 
   | Pre-patch no quantization | 0.78762 | 0.68 ms |
   | Pre-patch with quantization | 8.717E-5 | 0.45 ms |
   | Post-patch no quantization | 0.78762 | 0.70 ms |
   | Post-patch with quantization | 0.66742 | 0.66 ms |


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Normalize written scalar quantized vectors when using cosine similarity [lucene]

2023-11-07 Thread via GitHub


kevindrosendahl commented on code in PR #12780:
URL: https://github.com/apache/lucene/pull/12780#discussion_r1385606896


##
lucene/core/src/test/org/apache/lucene/codecs/lucene99/TestLucene99HnswQuantizedVectorsFormat.java:
##
@@ -38,8 +38,10 @@
 import org.apache.lucene.store.Directory;
 import org.apache.lucene.tests.index.BaseKnnVectorsFormatTestCase;
 import org.apache.lucene.util.ScalarQuantizer;
+import org.apache.lucene.util.VectorUtil;
 
 public class TestLucene99HnswQuantizedVectorsFormat extends 
BaseKnnVectorsFormatTestCase {

Review Comment:
   This test was always using normalized vectors as input. I changed it to use 
vectors that aren't guaranteed to be normalized, but kept all existing tests 
using normalized vectors so as not to change other tests incidentally.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [I] Should we explore DiskANN for aKNN vector search? [lucene]

2023-11-07 Thread via GitHub


kevindrosendahl commented on issue #12615:
URL: https://github.com/apache/lucene/issues/12615#issuecomment-1800243525

   > I haven't had a chance to read your branch yet, but hope to soon.
   
   Great, thanks! To save you a bit of time, the tl;dr of going from HNSW to 
vamana is that it's actually fairly simple in the end. For the most part, I 
just removed all references to levels and changed the [diversity 
check](https://github.com/kevindrosendahl/lucene/blob/075116396e8af35b0a19a277491be9acab150beb/lucene/core/src/java/org/apache/lucene/util/vamana/VamanaGraphBuilder.java#L376-L404)
 to incorporate vamana's `alpha` parameter (essentially just adding a second, 
outer loop). There were a few small other implementation detail changes as well 
like [pruning down all the way to `M` when passing the overflow threshold 
instead of just removing one 
neighbor](https://github.com/kevindrosendahl/lucene/blob/075116396e8af35b0a19a277491be9acab150beb/lucene/core/src/java/org/apache/lucene/util/vamana/VamanaGraphBuilder.java#L360-L369).
   
   Then on the storage size, you just [write vectors into the graph instead of 
into the `.vec` 
file](https://github.com/kevindrosendahl/lucene/blob/075116396e8af35b0a19a277491be9acab150beb/lucene/core/src/java/org/apache/lucene/codecs/vectorsandbox/VectorSandboxVamanaVectorsWriter.java#L703-L734),
 and [read from the right offsets to get the vector values as 
appropriate](https://github.com/kevindrosendahl/lucene/blob/075116396e8af35b0a19a277491be9acab150beb/lucene/core/src/java/org/apache/lucene/codecs/vectorsandbox/VectorSandboxVamanaVectorsReader.java#L598-L684).


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [I] HnwsGraph creates disconnected components [lucene]

2023-11-07 Thread via GitHub


nitirajrathore commented on issue #12627:
URL: https://github.com/apache/lucene/issues/12627#issuecomment-1800972146

   Hi @msokolov ! Thanks for clarifying. But I think it can help to remove the 
'less important' edge from both sides, since it frees up a degree of "other" 
node to accept a new connection without indulging into diversity check. Most of 
the time the problem that I saw with diversity check is that, the most recently 
added connection to the node itself turns out to be non-diverse and gets 
removed immediately. Because of this the new incoming node does not even get 
enough connections to begin with in the graph. Even in the case where I removed 
the diversity check while connecting incoming node (a) to neighbours (say b,c) 
and allow full max-conn number of connections, most of the connections from 
neighbours (b->a) to the nodes are removed because of diversity check, leaving 
mostly half connections from (a->b). Now when a new node say `x` tries to 
connect to `a` even that is kicked off by `a` because of diversity check. So 
keeping half connections sometimes acts against the connectedness n
 ature of hnsw.
   I have come up with a new huristic which seems to work well against 
disconnectedness, but I am yet to optimize it and conduct performance test on 
it. Will still update the algorithm here and post numbers later.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Speed up BytesRefHash#sort [lucene]

2023-11-07 Thread via GitHub


gf2121 commented on PR #12775:
URL: https://github.com/apache/lucene/pull/12775#issuecomment-1801128536

   Something odd: I accidentally run the benchmark on another mac with `intel 
chip` and the result is disappointing (no improvements)
   ```
   use stable sort: false, sort 5169965 terms, took: 5322ms
   use stable sort: false, sort 5169965 terms, took: 5605ms
   use stable sort: false, sort 5169965 terms, took: 5457ms
   use stable sort: false, sort 5169965 terms, took: 5392ms
   use stable sort: false, sort 5169965 terms, took: 5379ms
   use stable sort: false, sort 5169965 terms, took: 5442ms
   use stable sort: false, sort 5169965 terms, took: 5262ms
   use stable sort: false, sort 5169965 terms, took: 5100ms
   use stable sort: false, sort 5169965 terms, took: 5057ms
   use stable sort: false, sort 5169965 terms, took: 5160ms
   use stable sort: false, sort 5169965 terms, took: 5207ms
   use stable sort: false, sort 3130385 terms, took: 2868ms
   use stable sort: true, sort 5169965 terms, took: 5083ms
   use stable sort: true, sort 5169965 terms, took: 5332ms
   use stable sort: true, sort 5169965 terms, took: 5322ms
   use stable sort: true, sort 5169965 terms, took: 5333ms
   use stable sort: true, sort 5169965 terms, took: 5295ms
   use stable sort: true, sort 5169965 terms, took: 5306ms
   use stable sort: true, sort 5169965 terms, took: 5317ms
   use stable sort: true, sort 5169965 terms, took: 5336ms
   use stable sort: true, sort 5169965 terms, took: 5284ms
   use stable sort: true, sort 5169965 terms, took: 5344ms
   use stable sort: true, sort 5169965 terms, took: 5289ms
   use stable sort: true, sort 3130385 terms, took: 2837ms
   ```
   
   But I can easily reproduce the speedup on the `M2 chip` mac that produced 
all benchmark results above:
   ```
   use stable sort: false, sort 5169965 terms, took: 2340ms
   use stable sort: false, sort 5169965 terms, took: 2197ms
   use stable sort: false, sort 5169965 terms, took: 2288ms
   use stable sort: false, sort 5169965 terms, took: 2320ms
   use stable sort: false, sort 5169965 terms, took: 2227ms
   use stable sort: false, sort 5169965 terms, took: 2238ms
   use stable sort: false, sort 5169965 terms, took: 2256ms
   use stable sort: false, sort 5169965 terms, took: 2287ms
   use stable sort: false, sort 5169965 terms, took: 2244ms
   use stable sort: false, sort 5169965 terms, took: 2240ms
   use stable sort: false, sort 5169965 terms, took: 2320ms
   use stable sort: false, sort 3130385 terms, took: 1082ms
   use stable sort: true, sort 5169965 terms, took: 1418ms
   use stable sort: true, sort 5169965 terms, took: 1303ms
   use stable sort: true, sort 5169965 terms, took: 1290ms
   use stable sort: true, sort 5169965 terms, took: 1349ms
   use stable sort: true, sort 5169965 terms, took: 1293ms
   use stable sort: true, sort 5169965 terms, took: 1329ms
   use stable sort: true, sort 5169965 terms, took: 1332ms
   use stable sort: true, sort 5169965 terms, took: 1307ms
   use stable sort: true, sort 5169965 terms, took: 1280ms
   use stable sort: true, sort 5169965 terms, took: 1336ms
   use stable sort: true, sort 5169965 terms, took: 1302ms
   use stable sort: true, sort 3130385 terms, took: 623ms
   ```


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Speed up BytesRefHash#sort [lucene]

2023-11-07 Thread via GitHub


gf2121 commented on PR #12775:
URL: https://github.com/apache/lucene/pull/12775#issuecomment-1801136851

   Something odd: I accidentally run the benchmark on another mac with `intel 
chip` and the result is disappointing (no obvious improvements)
   ```
   use stable sort: false, sort 5169965 terms, took: 5377ms
   use stable sort: false, sort 5169965 terms, took: 5293ms
   use stable sort: false, sort 5169965 terms, took: 5347ms
   use stable sort: false, sort 5169965 terms, took: 5285ms
   use stable sort: false, sort 5169965 terms, took: 5307ms
   use stable sort: false, sort 5169965 terms, took: 5373ms
   use stable sort: false, sort 5169965 terms, took: 5416ms
   use stable sort: false, sort 5169965 terms, took: 5321ms
   use stable sort: false, sort 5169965 terms, took: 5343ms
   use stable sort: false, sort 5169965 terms, took: 5278ms
   use stable sort: false, sort 5169965 terms, took: 5323ms
   
   use stable sort: true, sort 5169965 terms, took: 5261ms
   use stable sort: true, sort 5169965 terms, took: 5329ms
   use stable sort: true, sort 5169965 terms, took: 5304ms
   use stable sort: true, sort 5169965 terms, took: 5268ms
   use stable sort: true, sort 5169965 terms, took: 5331ms
   use stable sort: true, sort 5169965 terms, took: 5306ms
   use stable sort: true, sort 5169965 terms, took: 5232ms
   use stable sort: true, sort 5169965 terms, took: 5245ms
   use stable sort: true, sort 5169965 terms, took: 5248ms
   use stable sort: true, sort 5169965 terms, took: 5218ms
   use stable sort: true, sort 5169965 terms, took: 5193ms
   ```
   
   But I can easily reproduce the speedup on the `M2 chip` mac that produced 
all benchmark results above:
   ```
   use stable sort: false, sort 5169965 terms, took: 2340ms
   use stable sort: false, sort 5169965 terms, took: 2197ms
   use stable sort: false, sort 5169965 terms, took: 2288ms
   use stable sort: false, sort 5169965 terms, took: 2320ms
   use stable sort: false, sort 5169965 terms, took: 2227ms
   use stable sort: false, sort 5169965 terms, took: 2238ms
   use stable sort: false, sort 5169965 terms, took: 2256ms
   use stable sort: false, sort 5169965 terms, took: 2287ms
   use stable sort: false, sort 5169965 terms, took: 2244ms
   use stable sort: false, sort 5169965 terms, took: 2240ms
   use stable sort: false, sort 5169965 terms, took: 2320ms
   
   use stable sort: true, sort 5169965 terms, took: 1418ms
   use stable sort: true, sort 5169965 terms, took: 1303ms
   use stable sort: true, sort 5169965 terms, took: 1290ms
   use stable sort: true, sort 5169965 terms, took: 1349ms
   use stable sort: true, sort 5169965 terms, took: 1293ms
   use stable sort: true, sort 5169965 terms, took: 1329ms
   use stable sort: true, sort 5169965 terms, took: 1332ms
   use stable sort: true, sort 5169965 terms, took: 1307ms
   use stable sort: true, sort 5169965 terms, took: 1280ms
   use stable sort: true, sort 5169965 terms, took: 1336ms
   use stable sort: true, sort 5169965 terms, took: 1302ms
   ```


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Speed up BytesRefHash#sort [lucene]

2023-11-07 Thread via GitHub


gf2121 commented on PR #12775:
URL: https://github.com/apache/lucene/pull/12775#issuecomment-1801199102

   I checked linux x86, no obvious speedup too. Maybe the stable style somewhat 
helped the arm CPU cache
   ```
   use stable sort: false, sort 5169965 terms, took: 4900ms
   use stable sort: false, sort 5169965 terms, took: 5373ms
   use stable sort: false, sort 5169965 terms, took: 4998ms
   use stable sort: false, sort 5169965 terms, took: 5345ms
   use stable sort: false, sort 5169965 terms, took: 5051ms
   use stable sort: false, sort 5169965 terms, took: 4906ms
   use stable sort: false, sort 5169965 terms, took: 4869ms
   use stable sort: false, sort 5169965 terms, took: 4840ms
   use stable sort: false, sort 5169965 terms, took: 4816ms
   use stable sort: false, sort 5169965 terms, took: 4760ms
   use stable sort: false, sort 5169965 terms, took: 4903ms
   
   use stable sort: true, sort 5169965 terms, took: 4883ms
   use stable sort: true, sort 5169965 terms, took: 4852ms
   use stable sort: true, sort 5169965 terms, took: 4920ms
   use stable sort: true, sort 5169965 terms, took: 4863ms
   use stable sort: true, sort 5169965 terms, took: 4836ms
   use stable sort: true, sort 5169965 terms, took: 4917ms
   use stable sort: true, sort 5169965 terms, took: 4836ms
   use stable sort: true, sort 5169965 terms, took: 4900ms
   use stable sort: true, sort 5169965 terms, took: 4888ms
   use stable sort: true, sort 5169965 terms, took: 4921ms
   use stable sort: true, sort 5169965 terms, took: 4835ms
   ```


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Speed up BytesRefHash#sort [lucene]

2023-11-07 Thread via GitHub


dweiss commented on PR #12775:
URL: https://github.com/apache/lucene/pull/12775#issuecomment-1801236300

   Maybe Apple chips are tuned for sorting (they need to sort out what to do 
with all these income bills, after all)? :)
   
   And seriously - thank you for checking on different hardware. There may be 
no single solution to fit them all - as the PR with vectorization benchmarks 
clearly shows...


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] LUCENE-10002: Deprecate IndexSearch#search(Query, Collector) in favor of IndexSearcher#search(Query, CollectorManager) - TopFieldCollectorManager & TopScoreDocCollectorManager [lucene]

2023-11-07 Thread via GitHub


zacharymorn commented on PR #240:
URL: https://github.com/apache/lucene/pull/240#issuecomment-1801269361

   Hi @mikemccand @jpountz @javanna @gsmiller , I have updated this PR to pick 
up the latest from `main`, as well as revert some changes to save them for 
follow-up PRs that address other collectors. This PR is now focused on the 
following:
   * Refactor out TopScoreDocCollectorManager, TopFieldCollectorManager
   * Refactor some tests to use the above collector manager
   
   Could you please take a look, and let me know if you have any feedback?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org