[GitHub] [lucene] rmuir commented on issue #11676: Can TimeLimitingBulkScorer exponentially grow the window size? [LUCENE-10640]

2022-10-31 Thread GitBox


rmuir commented on issue #11676:
URL: https://github.com/apache/lucene/issues/11676#issuecomment-1296934205

   Why so quick to jump to wall time? Please, no wall time, for any reason 
whatsoever.
   
   Surely, nanoTime can be used.





[GitHub] [lucene] xingdong015 commented on issue #11887: TestDocumentsWriterStallControl takes over a minute with -Ptests.seed=B83F4990EF501F47

2022-10-31 Thread GitBox


xingdong015 commented on issue #11887:
URL: https://github.com/apache/lucene/issues/11887#issuecomment-1296990074

   It looks like Gradle initialization takes a lot of time.
   
   
![image](https://user-images.githubusercontent.com/11306681/199004045-2d675f82-ffdd-43c8-9ce6-0197681a7b6b.png)
   





[GitHub] [lucene] donnerpeter commented on a diff in pull request #11893: hunspell: allow for faster dictionary iteration during 'suggest' by using more memory (opt-in)

2022-10-31 Thread GitBox


donnerpeter commented on code in PR #11893:
URL: https://github.com/apache/lucene/pull/11893#discussion_r1009428049


##
lucene/analysis/common/src/java/org/apache/lucene/analysis/hunspell/GeneratingSuggester.java:
##
@@ -60,7 +64,11 @@ private List<Weighted<Root<String>>> findSimilarDictionaryEntries(
       String word, WordCase originalCase) {
     Comparator<Weighted<Root<String>>> natural = Comparator.naturalOrder();
     PriorityQueue<Weighted<Root<String>>> roots = new PriorityQueue<>(natural.reversed());
-    EntryFilter filter = new EntryFilter(dictionary);
+
+    char[] excludeFlags = dictionary.allNonSuggestibleFlags();
+    FlagEnumerator.Lookup flagLookup = dictionary.flagLookup;
+    IntPredicate isSuggestible = formId -> !flagLookup.hasAnyFlag(formId, excludeFlags);

Review Comment:
   This is basically the inlined `EntryFilter` class with one `for` loop removed.






[GitHub] [lucene] donnerpeter commented on a diff in pull request #11893: hunspell: allow for faster dictionary iteration during 'suggest' by using more memory (opt-in)

2022-10-31 Thread GitBox


donnerpeter commented on code in PR #11893:
URL: https://github.com/apache/lucene/pull/11893#discussion_r1009429083


##
lucene/analysis/common/src/java/org/apache/lucene/analysis/hunspell/GeneratingSuggester.java:
##
@@ -70,10 +78,10 @@ char transformChar(char c) {
           }
         };
 
-    dictionary.words.processSuggestibleWords(
+    processSuggestibleWords(
         Math.max(1, word.length() - MAX_ROOT_LENGTH_DIFF),
         word.length() + MAX_ROOT_LENGTH_DIFF,
-        (rootChars, forms) -> {
+        (rootChars, formSupplier) -> {

Review Comment:
   Instead of an `IntsRef`, we now pass a `Supplier<IntsRef>` to allow for lazier 
deserialization and to exclude unnecessary information (morphological data ids).
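   For readers skimming the thread, a rough sketch of the callback shape described 
above; the interface name and generics here are illustrative, not the PR's actual 
signatures:
   ```
   import java.util.function.Supplier;
   import org.apache.lucene.util.IntsRef;

   // The storage hands the consumer a supplier, so the form ids are only decoded
   // for roots that survive the earlier length/score checks.
   interface SuggestibleWordConsumer {
     void accept(CharSequence rootChars, Supplier<IntsRef> lazyForms);
   }
   ```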






[GitHub] [lucene] donnerpeter commented on a diff in pull request #11893: hunspell: allow for faster dictionary iteration during 'suggest' by using more memory (opt-in)

2022-10-31 Thread GitBox


donnerpeter commented on code in PR #11893:
URL: https://github.com/apache/lucene/pull/11893#discussion_r1009429597


##
lucene/analysis/common/src/java/org/apache/lucene/analysis/hunspell/GeneratingSuggester.java:
##
@@ -87,45 +95,31 @@ char transformChar(char c) {
 
       sc += commonPrefix(word, rootChars) - longerWorsePenalty(word.length(), rootChars.length);
 
-      if (roots.size() == MAX_ROOTS && sc <= roots.peek().score) {
+      boolean overflow = roots.size() == MAX_ROOTS;
+      if (overflow && sc <= roots.peek().score) {
         return;
       }
 
       speller.checkCanceled.run();
 
       String root = rootChars.toString();
-      int suitable = filter.findSuitableFormIndex(forms, 0);
-      do {
-        roots.add(new Weighted<>(new Root<>(root, forms.ints[forms.offset + suitable]), sc));
-        suitable = filter.findSuitableFormIndex(forms, suitable + filter.formStep);
-      } while (suitable > 0);
-      while (roots.size() > MAX_ROOTS) {
-        roots.poll();
+      IntsRef forms = formSupplier.get();
+      for (int i = 0; i < forms.length; i++) {

Review Comment:
   Here's the loop from the inlined `EntryFilter`; it's now a bit simpler.






[GitHub] [lucene] donnerpeter commented on a diff in pull request #11893: hunspell: allow for faster dictionary iteration during 'suggest' by using more memory (opt-in)

2022-10-31 Thread GitBox


donnerpeter commented on code in PR #11893:
URL: https://github.com/apache/lucene/pull/11893#discussion_r1009431207


##
lucene/analysis/common/src/java/org/apache/lucene/analysis/hunspell/WordStorage.java:
##
@@ -179,11 +186,7 @@ void processSuggestibleWords(
     }
 
     if (mightMatch) {
-      int dataLength = in.readVInt();
-      if (forms.ints.length < dataLength) {
-        forms.ints = new int[dataLength];
-      }
-      readForms(forms, in, dataLength);

Review Comment:
   Form id reading is moved into `LazyFormReader`.






[GitHub] [lucene] donnerpeter commented on a diff in pull request #11893: hunspell: allow for faster dictionary iteration during 'suggest' by using more memory (opt-in)

2022-10-31 Thread GitBox


donnerpeter commented on code in PR #11893:
URL: https://github.com/apache/lucene/pull/11893#discussion_r1009432252


##
lucene/analysis/common/src/java/org/apache/lucene/analysis/hunspell/WordStorage.java:
##
@@ -54,7 +55,8 @@ class WordStorage {
   private static final int COLLISION_MASK = 0x40;
   private static final int SUGGESTIBLE_MASK = 0x20;
   private static final int MAX_STORED_LENGTH = SUGGESTIBLE_MASK - 1;
-
+  private final int maxEntryLength;

Review Comment:
   maxEntryLength allows passing Integer.MAX_VALUE as the maximal interesting 
length (to iterate the whole dictionary) without allocating an array of that size.
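   A hedged illustration of that clamping; the names below are made up for the 
sketch, not the actual WordStorage fields:
   ```
   class MaxLengthClampDemo {
     // requestedMaxLength may be Integer.MAX_VALUE ("no upper bound");
     // maxEntryLength is the longest entry actually present in the dictionary.
     static char[] scratchFor(int requestedMaxLength, int maxEntryLength) {
       int effective = Math.min(requestedMaxLength, maxEntryLength);
       return new char[effective];
     }

     public static void main(String[] args) {
       System.out.println(scratchFor(Integer.MAX_VALUE, 64).length); // prints 64
     }
   }
   ```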







[GitHub] [lucene] donnerpeter commented on a diff in pull request #11893: hunspell: allow for faster dictionary iteration during 'suggest' by using more memory (opt-in)

2022-10-31 Thread GitBox


donnerpeter commented on code in PR #11893:
URL: https://github.com/apache/lucene/pull/11893#discussion_r1009433258


##
lucene/analysis/common/src/java/org/apache/lucene/analysis/hunspell/WordStorage.java:
##
@@ -54,7 +55,8 @@ class WordStorage {
   private static final int COLLISION_MASK = 0x40;
   private static final int SUGGESTIBLE_MASK = 0x20;
   private static final int MAX_STORED_LENGTH = SUGGESTIBLE_MASK - 1;
-
+  private final int maxEntryLength;
+  private final boolean hasCustomMorphData;

Review Comment:
   hasCustomMorphData is used to skip morphological data ids (if present) during 
deserialization, so that the cached flattened data is 2x smaller. That data isn't 
used during iteration anyway.






[GitHub] [lucene] donnerpeter commented on pull request #11893: hunspell: allow for faster dictionary iteration during 'suggest' by using more memory (opt-in)

2022-10-31 Thread GitBox


donnerpeter commented on PR #11893:
URL: https://github.com/apache/lucene/pull/11893#issuecomment-1297113922

   With the cache, about 2x memory is used (~850MB for ~190 dictionaries). The 
caching gives me about 1.5x speedup for en/ru/de.





[GitHub] [lucene] donnerpeter commented on a diff in pull request #11893: hunspell: allow for faster dictionary iteration during 'suggest' by using more memory (opt-in)

2022-10-31 Thread GitBox


donnerpeter commented on code in PR #11893:
URL: https://github.com/apache/lucene/pull/11893#discussion_r1009435738


##
lucene/analysis/common/src/test/org/apache/lucene/analysis/hunspell/TestPerformance.java:
##
@@ -86,7 +86,7 @@ public void de() throws Exception {
 
   @Test
   public void de_suggest() throws Exception {
-checkSuggestionPerformance("de", 100);
+checkSuggestionPerformance("de", 150);

Review Comment:
   We can now suggest for more words in reasonable time, yay!






[GitHub] [lucene] benwtrent commented on issue #10665: Benchmark KNN search with ann-benchmarks [LUCENE-9625]

2022-10-31 Thread GitBox


benwtrent commented on issue #10665:
URL: https://github.com/apache/lucene/issues/10665#issuecomment-1297146958

   I opened a PR for ann-benchmarks: 
https://github.com/erikbern/ann-benchmarks/pull/315
   
   I tested PyLucene locally, comparing it to @msokolov's "batch" methodology 
(writing to disk and spinning up a Java process). PyLucene provided faster 
numbers in both batch and iterative mode on my M1 Mac. This indicates that the 
overhead experienced is acceptable for now. 
   
   There are improvements we can make to PyLucene (native numpy support is a 
big one). But it seems prudent to get measurements recorded in ann-benchmarks 
and iteratively improve our Python over time :).
   
   Also, it's for Lucene 9.1. I am going to see about getting a newer PyLucene 
version built that hits the latest release. We can update our benchmark 
implementation to hit the new version when it's available.
   
   @msokolov @jtibshirani IDK if y'all want to review the ann-benchmarks 
implementation or not. Let me know what you think.





[GitHub] [lucene] jtibshirani commented on issue #10665: Benchmark KNN search with ann-benchmarks [LUCENE-9625]

2022-10-31 Thread GitBox


jtibshirani commented on issue #10665:
URL: https://github.com/apache/lucene/issues/10665#issuecomment-1297299587

   Thanks @benwtrent, it's great to see that PyLucene works well and has low 
overhead! It feels more solid than what we were doing before.
   
   +1 to preparing a new version. As I mentioned on the PR, we made a big 
correction to the algorithm in Lucene 9.2 
(https://issues.apache.org/jira/browse/LUCENE-10527). It would be too bad if 
results came out and Lucene did much worse than hnswlib because it was missing 
this fix!
   
   It'd also be great to compare the results against hnswlib as part of the 
submission. We can double-check that recall is the same for a given set of 
parameters. This would give confidence we've interpreted all the parameters 
right in the ann-benchmarks runner. I'm happy to help with this since I have 
hnswlib set up locally.





[GitHub] [lucene] vigyasharma commented on issue #11676: Can TimeLimitingBulkScorer exponentially grow the window size? [LUCENE-10640]

2022-10-31 Thread GitBox


vigyasharma commented on issue #11676:
URL: https://github.com/apache/lucene/issues/11676#issuecomment-1297303476

   Sorry, I meant it'll add a dependence on `nanoTime()`. I thought we used 
"wall time" to refer to both `currentTimeMillis()` and `nanoTime()`. 
   
   If nanoTime is acceptable, I can use something that wraps the time calls 
and lets tests override them, like `QueryProfilerTimer`. It will average out 
the time over all previous calls instead of reporting the last call time. But 
maybe that's okay, given there is noise in such measurements anyway.
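   A minimal sketch of that wrapper idea, with a hypothetical class name (this is 
not Lucene's `QueryProfilerTimer`): keep every measurement on the nanoTime clock 
and hide the clock behind an overridable method so tests can fake it.
   ```
   class NanoTimer {
     private long mark;

     /** Records the starting point. */
     void start() {
       mark = nanos();
     }

     /** Elapsed nanos since start() or the previous sample; callers may average these. */
     long sample() {
       long now = nanos();
       long elapsed = now - mark;
       mark = now;
       return elapsed;
     }

     /** Overridable in tests to supply a deterministic clock. */
     protected long nanos() {
       return System.nanoTime();
     }
   }
   ```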





[GitHub] [lucene] reta commented on pull request #11875: Usability improvements for timeout support in IndexSearcher

2022-10-31 Thread GitBox


reta commented on PR #11875:
URL: https://github.com/apache/lucene/pull/11875#issuecomment-1297316017

   > Looks good to me. I'll wait for a few days before merging, in case people 
have comments/concerns with the public visibility for `TimeLimitingBulkScorer`
   
   Thanks a lot @vigyasharma !





[GitHub] [lucene] rmuir commented on issue #11676: Can TimeLimitingBulkScorer exponentially grow the window size? [LUCENE-10640]

2022-10-31 Thread GitBox


rmuir commented on issue #11676:
URL: https://github.com/apache/lucene/issues/11676#issuecomment-1297328617

   > Sorry, I meant it'll add a dependence on `nanoTime()`. I thought we used 
"wall time" to refer to both `currentTimeMillis()` and `nanoTime()`.
   
   nanoTime (at least on Linux) uses the monotonic clock. 
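   To make the idea in the issue title concrete, a rough sketch (not 
`TimeLimitingBulkScorer`'s actual code, and all numbers are made up): score in 
windows that double each round, checking a deadline taken from the monotonic 
`nanoTime()` clock.
   ```
   import java.util.concurrent.TimeUnit;

   class ExponentialWindowSketch {
     public static void main(String[] args) {
       long deadline = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(50);
       int maxDoc = 1_000_000;   // hypothetical segment size
       int window = 100;         // hypothetical starting window
       int doc = 0;
       while (doc < maxDoc && System.nanoTime() - deadline < 0) { // overflow-safe deadline check
         int upTo = Math.min(maxDoc, doc + window);
         // scorer.score(collector, acceptDocs, doc, upTo) would go here
         doc = upTo;
         window = (int) Math.min((long) window * 2, Integer.MAX_VALUE); // grow exponentially, with a cap
       }
     }
   }
   ```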





[GitHub] [lucene] rmuir commented on issue #11887: TestDocumentsWriterStallControl takes over a minute with -Ptests.seed=B83F4990EF501F47

2022-10-31 Thread GitBox


rmuir commented on issue #11887:
URL: https://github.com/apache/lucene/issues/11887#issuecomment-1297341214

   I don't think a profiler is helpful because the test is not doing anything except 
sleeping on `Object.wait`. I used `jstack` while the test was hung:
   
   ```
   "TEST-TestDocumentsWriterStallControl.testRandom-seed#[B83F4990EF501F47]" 
#20 prio=5 os_prio=0 cpu=118.76ms elapsed=100.21s tid=0x7fff9405cc50 
nid=0x37b3 in Object.wait()  [0x7fffb5ffd000]
  java.lang.Thread.State: WAITING (on object monitor)
at java.lang.Object.wait(java.base@17.0.5/Native Method)
- waiting on 
at java.lang.Thread.join(java.base@17.0.5/Thread.java:1304)
- locked <0xfcbdeee0> (a 
org.apache.lucene.index.TestDocumentsWriterStallControl$1)
at java.lang.Thread.join(java.base@17.0.5/Thread.java:1372)
at 
org.apache.lucene.index.TestDocumentsWriterStallControl.join(TestDocumentsWriterStallControl.java:313)
at 
org.apache.lucene.index.TestDocumentsWriterStallControl.testRandom(TestDocumentsWriterStallControl.java:89)
   ```





[GitHub] [lucene] rmuir commented on issue #11887: TestDocumentsWriterStallControl takes over a minute with -Ptests.seed=B83F4990EF501F47

2022-10-31 Thread GitBox


rmuir commented on issue #11887:
URL: https://github.com/apache/lucene/issues/11887#issuecomment-1297345527

   It runs much faster with this patch:
   ```
   --- a/lucene/core/src/test/org/apache/lucene/index/TestDocumentsWriterStallControl.java
   +++ b/lucene/core/src/test/org/apache/lucene/index/TestDocumentsWriterStallControl.java
   @@ -62,7 +62,7 @@ public class TestDocumentsWriterStallControl extends LuceneTestCase {
          @Override
          public void run() {
    
   -        int iters = atLeast(1000);
   +        int iters = atLeast(100);
            for (int j = 0; j < iters; j++) {
              ctrl.updateStalled(random().nextInt(stallProbability) == 0);
              if (random().nextInt(5) == 0) { // thread 0 only updates
   ```





[GitHub] [lucene] dweiss commented on a diff in pull request #11893: hunspell: allow for faster dictionary iteration during 'suggest' by using more memory (opt-in)

2022-10-31 Thread GitBox


dweiss commented on code in PR #11893:
URL: https://github.com/apache/lucene/pull/11893#discussion_r1009622594


##
lucene/analysis/common/src/java/org/apache/lucene/analysis/hunspell/Hunspell.java:
##
@@ -647,8 +671,23 @@ Root findStem(
 
     if (!hasGoodSuggestions && dictionary.maxNGramSuggestions > 0) {
       List<String> generated =
-          new GeneratingSuggester(suggestionSpeller)
-              .suggest(dictionary.toLowerCase(word), wordCase, suggestions);
+          new GeneratingSuggester(suggestionSpeller) {
+            @Override
+            void processSuggestibleWords(
+                int minLength, int maxLength, BiConsumer<CharsRef, Supplier<IntsRef>> processor) {
+              if (cacheSuggestibleEntries) {
+                SuggestibleEntryCache cache = suggestibleCache;
+                if (cache == null) {
+                  // a benign race:
+                  // https://shipilev.net/blog/2016/close-encounters-of-jmm-kind/#wishful-benign-is-resilient

Review Comment:
   yeah, it's fine... although unless it's a crucial loop I'd just mark that 
field volatile so that advanced knowledge of jmm doesn't have to be explained 
to whoever would like to look at that code later on :)
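   A hedged side-by-side sketch of the two options (the `Cache` type and field names 
are placeholders for Hunspell's `SuggestibleEntryCache` field):
   ```
   // The racy variant is only a "benign" race when the cached object is safely
   // publishable (e.g. all of its fields are final).
   class LazyCacheSketch {
     private Cache plain;           // PR as written: plain field, benign race
     private volatile Cache safe;   // dweiss's suggestion: volatile, no JMM caveats

     Cache getRacy() {
       Cache c = plain;
       if (c == null) {
         c = new Cache();           // several threads may compute; one result wins
         plain = c;
       }
       return c;
     }

     Cache getVolatile() {
       Cache c = safe;
       if (c == null) {
         c = new Cache();           // may still be computed twice, but publication is safe
         safe = c;
       }
       return c;
     }

     static final class Cache {}
   }
   ```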



##
lucene/analysis/common/src/java/org/apache/lucene/analysis/hunspell/Hunspell.java:
##
@@ -72,10 +77,29 @@ public Hunspell(Dictionary dictionary) {
    * or suggestion generation by throwing an exception
    */
   public Hunspell(Dictionary dictionary, TimeoutPolicy policy, Runnable checkCanceled) {
+    this(dictionary, policy, checkCanceled, new Stemmer(dictionary), false);
+  }
+
+  private Hunspell(
+      Dictionary dictionary,
+      TimeoutPolicy policy,
+      Runnable checkCanceled,
+      Stemmer stemmer,
+      boolean cacheSuggestibleEntries) {
     this.dictionary = dictionary;
     this.policy = policy;
     this.checkCanceled = checkCanceled;
-    stemmer = new Stemmer(dictionary);
+    this.stemmer = stemmer;
+    this.cacheSuggestibleEntries = cacheSuggestibleEntries;
+  }
+
+  /**
+   * Returns a copy of this Hunspell instance with better suggestion performance but using more

Review Comment:
   This looks fine to me, although perhaps you'd consider an API similar to 
that used in codecs, where you pass a tradeoff enum hint (MEMORY, SPEED) and 
get an instance that tries to fulfill these hints?  The drawback is that it'd 
be backward-incompatible, unless you make it a new (static) factory method? 






[GitHub] [lucene] ovalhub commented on issue #10665: Benchmark KNN search with ann-benchmarks [LUCENE-9625]

2022-10-31 Thread GitBox


ovalhub commented on issue #10665:
URL: https://github.com/apache/lucene/issues/10665#issuecomment-1297370722

   
   On Mon, 31 Oct 2022, Benjamin Trent wrote:
   
   > I tested PyLucene locally, comparing it to @msokolov's "batch" methodology 
   > (writing to disk and spinning up a Java process). PyLucene provided faster 
   > numbers in both Batch and iterative on my m1 mac. This indicates that the 
   > overhead experienced is acceptable for now.
   >
   > There are improvements we can make to PyLucene (native numpy support is a 
   > big one). But, it seems prudent to get measurements recorded in 
   > ann-benchmarks and iteratively improve our python over time :).
   
   What do you mean by "native numpy"? Are you under the impression that 
   PyLucene is a Python implementation of Lucene? It's not. It's an 
   auto-generated CPython wrapper around an auto-generated C++ wrapper 
   that calls Java Lucene via JNI and that embeds a Java VM into a Python VM. 
   PyLucene _is_ Java Lucene, unchanged, embedded in a Python VM.
   
   Andi..
   





[GitHub] [lucene] rmuir commented on issue #11887: TestDocumentsWriterStallControl takes over a minute with -Ptests.seed=B83F4990EF501F47

2022-10-31 Thread GitBox


rmuir commented on issue #11887:
URL: https://github.com/apache/lucene/issues/11887#issuecomment-1297392725

   The condition where this test takes minutes isn't that rare: I ran the test 
10 times and hit the slow condition in 3 out of 10 executions:
   * 151s
   * 158s
   * 32s
   
   With the patch, I ran the test 50 times and the runtime was always reasonable. Will 
make a PR.





[GitHub] [lucene] rmuir opened a new pull request, #11894: Tone down TestDocumentsWriterStallControl.testRandom, so it does not take minutes

2022-10-31 Thread GitBox


rmuir opened a new pull request, #11894:
URL: https://github.com/apache/lucene/pull/11894

   The current test has runtimes of a minute or more approximately 30% of the time.
   
   Closes #11887





[GitHub] [lucene] benwtrent commented on issue #10665: Benchmark KNN search with ann-benchmarks [LUCENE-9625]

2022-10-31 Thread GitBox


benwtrent commented on issue #10665:
URL: https://github.com/apache/lucene/issues/10665#issuecomment-1297430223

   @ovalhub `numpy` collections are already native. To use them, I have to pull 
them into Python collections and then cast them to be native again. 
   
   Example:
   ```
   X = X.tolist()
   doc.add(KnnVectorField("knn", JArray('float')(X), fieldType))
   ```
   The `.tolist()` call is weird to me, as `X` is already a natively backed float 
array, just a numpy object. 
   
   Admittedly, I may be misunderstanding how the communication between Python 
-> `native` -> Java is done. But it SEEMS that calling `tolist` on a numpy array 
unnecessarily pulls an already native collection into a generic Python list.





[GitHub] [lucene] benwtrent commented on issue #10665: Benchmark KNN search with ann-benchmarks [LUCENE-9625]

2022-10-31 Thread GitBox


benwtrent commented on issue #10665:
URL: https://github.com/apache/lucene/issues/10665#issuecomment-1297440415

   > It'd also be great to compare the results against hnswlib as part of the 
submission. We can double-check that recall is the same for a given set of 
parameters. This would give confidence we've interpreted all the parameters 
right in the ann-benchmarks runner. I'm happy to help with this since I have 
hnswlib set up locally.
   
   Ah, for sure. That would be useful. We would have to wait for a new version 
of PyLucene to be built for this though as right now we still have the bug in: 
https://issues.apache.org/jira/browse/LUCENE-10527





[GitHub] [lucene] ovalhub commented on issue #10665: Benchmark KNN search with ann-benchmarks [LUCENE-9625]

2022-10-31 Thread GitBox


ovalhub commented on issue #10665:
URL: https://github.com/apache/lucene/issues/10665#issuecomment-1297463120

   
   On Mon, 31 Oct 2022, Benjamin Trent wrote:
   
   > @ovalhub `numpy` collections are already native. To use them, I have to 
   > pull them into python collections and then cast them to be native again.
   >
   > Example: ``` X = X.tolist() doc.add(KnnVectorField("knn", 
   > JArray('float')(X), fieldType)) ``` The `.toList()` is weird to me as `X` 
   > is already a natively backed float array, just a numpy object.
   >
   > Admittedly, I may be misunderstanding how the communication between Python 
   > -> `native` -> Java is done. But it SEEMS that calling `tolist` for numpy 
   > unnecessarily pulls an already native collection into a generic python 
   > list.
   
   When you make a JArray with "JArray('float')(X)", the 
   JArray code in JCC's JArray.h will iterate 'X' "as a sequence", 
   calling PySequence_GetItem(sequence, i), and fill a Java float array with the 
   values returned.
   If native numpy collections are exposed to Python as sequences (i.e., ordered 
   things that can be accessed with int indexes), then you should not have to 
   call .tolist() on them before using them with PyLucene/JCC.
   It could also be that numpy's .tolist() just creates a wrapper object around 
   the native array to expose it as a Python sequence? (I don't know numpy.)
   





[GitHub] [lucene] donnerpeter commented on a diff in pull request #11893: hunspell: allow for faster dictionary iteration during 'suggest' by using more memory (opt-in)

2022-10-31 Thread GitBox


donnerpeter commented on code in PR #11893:
URL: https://github.com/apache/lucene/pull/11893#discussion_r1009734992


##
lucene/analysis/common/src/java/org/apache/lucene/analysis/hunspell/Hunspell.java:
##
@@ -72,10 +77,29 @@ public Hunspell(Dictionary dictionary) {
    * or suggestion generation by throwing an exception
    */
   public Hunspell(Dictionary dictionary, TimeoutPolicy policy, Runnable checkCanceled) {
+    this(dictionary, policy, checkCanceled, new Stemmer(dictionary), false);
+  }
+
+  private Hunspell(
+      Dictionary dictionary,
+      TimeoutPolicy policy,
+      Runnable checkCanceled,
+      Stemmer stemmer,
+      boolean cacheSuggestibleEntries) {
     this.dictionary = dictionary;
     this.policy = policy;
     this.checkCanceled = checkCanceled;
-    stemmer = new Stemmer(dictionary);
+    this.stemmer = stemmer;
+    this.cacheSuggestibleEntries = cacheSuggestibleEntries;
+  }
+
+  /**
+   * Returns a copy of this Hunspell instance with better suggestion performance but using more

Review Comment:
   I've considered this, but then again, there can be different time-memory 
tradeoffs in different places (and I'm planning another one soon), so I'd name 
them more specifically.






[GitHub] [lucene] benwtrent commented on a diff in pull request #11860: GITHUB-11830 Better optimize storage for vector connections

2022-10-31 Thread GitBox


benwtrent commented on code in PR #11860:
URL: https://github.com/apache/lucene/pull/11860#discussion_r1009762545


##
lucene/core/src/java/org/apache/lucene/codecs/lucene95/Lucene95HnswVectorsReader.java:
##
   [The quoted new-file diff of Lucene95HnswVectorsReader.java is truncated at this point in the archive, and the review comment itself was not preserved.]

[GitHub] [lucene] jtibshirani commented on a diff in pull request #11860: GITHUB-11830 Better optimize storage for vector connections

2022-10-31 Thread GitBox


jtibshirani commented on code in PR #11860:
URL: https://github.com/apache/lucene/pull/11860#discussion_r1009765612


##
lucene/backward-codecs/src/java/org/apache/lucene/backward_codecs/lucene94/package-info.java:
##
@@ -0,0 +1,422 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+/**
+ * Lucene 9.3 file format.

Review Comment:
   Just noticed, this should say Lucene 9.4 file format (probably just a typo 
from the existing file).



##
lucene/core/src/java/org/apache/lucene/codecs/lucene95/Lucene95HnswVectorsReader.java:
##
   [The quoted diff of Lucene95HnswVectorsReader.java is truncated at this point in the archive; no review comment for this file survived.]

[GitHub] [lucene] jtibshirani commented on pull request #239: LUCENE-10040: Handle deletions in nearest vector search

2022-10-31 Thread GitBox


jtibshirani commented on PR #239:
URL: https://github.com/apache/lucene/pull/239#issuecomment-1297711772

   @harishankar-gopalan sorry for the slow response! Your overall understanding 
is right. In Lucene, deletions are handled by marking a document as deleted 
using a 'tombstone'. The index structures are not actually updated (this 
includes the HNSW graph). 
   
   In response to your questions...
   1. Yes, when there are deletions, we make sure to expand the search space to 
retrieve top `k`. We use a similar strategy that many vector search engines use 
for kNN with filtering. During the HNSW search, we make sure to exclude deleted 
nodes from the final candidate set, but deleted nodes are still used for 
traversing the graph.
   2. No, there is no work in that direction. This is because Lucene segments 
are never updated after they are first written (except under rare 
circumstances). Lucene's immutable segment model is core to its design, and 
it's the reason we use 'tombstones' instead of modifying the graph in place.
   
   For segment merges, we combine the vectors across all segments and build a 
new graph from scratch. We make sure to skip over deleted documents during this 
merge, so they have no effect on the time it takes to build the graph. This 
merge process is quite expensive and we're brainstorming ways of making it 
faster (https://github.com/apache/lucene/issues/11354).
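   As a hedged, usage-level illustration of that behavior (the index path, field name 
and query vector below are placeholders; this relies on the Lucene 9.x 
`KnnVectorQuery` API, which applies live docs so tombstoned documents are never 
returned):
   ```
   import java.nio.file.Paths;
   import org.apache.lucene.index.DirectoryReader;
   import org.apache.lucene.search.IndexSearcher;
   import org.apache.lucene.search.KnnVectorQuery;
   import org.apache.lucene.search.TopDocs;
   import org.apache.lucene.store.Directory;
   import org.apache.lucene.store.FSDirectory;

   public class KnnWithDeletionsSketch {
     public static void main(String[] args) throws Exception {
       try (Directory dir = FSDirectory.open(Paths.get("/tmp/knn-index"));
           DirectoryReader reader = DirectoryReader.open(dir)) {
         IndexSearcher searcher = new IndexSearcher(reader);
         float[] query = {0.1f, 0.2f, 0.3f};
         // Deleted docs may still be traversed inside the HNSW graph, but only live
         // docs come back in the results.
         TopDocs top = searcher.search(new KnnVectorQuery("knn", query, 10), 10);
         System.out.println("live hits: " + top.totalHits.value);
       }
     }
   }
   ```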





[GitHub] [lucene] jpountz commented on pull request #11875: Usability improvements for timeout support in IndexSearcher

2022-10-31 Thread GitBox


jpountz commented on PR #11875:
URL: https://github.com/apache/lucene/pull/11875#issuecomment-1298076313

   Sorry for the lag, I'm on vacation.
   
   The problem with "this class may be useful outside of Lucene" to me is that 
it could apply to any class in Lucene. We did indeed make some classes public 
on this criterion in the past (e.g. FixedBitSet), but I would like the bar to be 
high: is there really not a better way? Is the need to make this class public a 
sign that the functionality is not exposed the right way?
   
   For the case described above by Vigya, I can think of two alternatives we 
might want to consider instead:
 - Copy the code for this time-limiting bulk scorer instead of using the 
Lucene class. If there are only a couple of users of Lucene who would benefit from 
this class being public, maybe it's a better trade-off to let them take full 
ownership of this code and allow Lucene to keep treating it as an implementation 
detail?
 - Figure out different refactors that would help OpenSearch leverage 
timeout support from IndexSearcher as-is. E.g., if it folded the logic that uses 
live docs to lead iteration into the weight wrapper, would that work?

