[GitHub] [lucene] rmuir commented on issue #11676: Can TimeLimitingBulkScorer exponentially grow the window size? [LUCENE-10640]
rmuir commented on issue #11676: URL: https://github.com/apache/lucene/issues/11676#issuecomment-1296934205 Why so quick to jump to wall time? Please, no wall time, for any reason whatsoever. Surely, nanoTime can be used. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] xingdong015 commented on issue #11887: TestDocumentsWriterStallControl takes over a minute with -Ptests.seed=B83F4990EF501F47
xingdong015 commented on issue #11887: URL: https://github.com/apache/lucene/issues/11887#issuecomment-1296990074 It looks like gradle initialization takes a lot of time
[GitHub] [lucene] donnerpeter commented on a diff in pull request #11893: hunspell: allow for faster dictionary iteration during 'suggest' by using more memory (opt-in)
donnerpeter commented on code in PR #11893: URL: https://github.com/apache/lucene/pull/11893#discussion_r1009428049

## lucene/analysis/common/src/java/org/apache/lucene/analysis/hunspell/GeneratingSuggester.java:

```
@@ -60,7 +64,11 @@ private List<Weighted<Root<String>>> findSimilarDictionaryEntries(
       String word, WordCase originalCase) {
     Comparator<Weighted<Root<String>>> natural = Comparator.naturalOrder();
     PriorityQueue<Weighted<Root<String>>> roots = new PriorityQueue<>(natural.reversed());
-    EntryFilter filter = new EntryFilter(dictionary);
+
+    char[] excludeFlags = dictionary.allNonSuggestibleFlags();
+    FlagEnumerator.Lookup flagLookup = dictionary.flagLookup;
+    IntPredicate isSuggestible = formId -> !flagLookup.hasAnyFlag(formId, excludeFlags);
```

Review Comment: This is basically the inlined EntryFilter class with one for loop removed
[GitHub] [lucene] donnerpeter commented on a diff in pull request #11893: hunspell: allow for faster dictionary iteration during 'suggest' by using more memory (opt-in)
donnerpeter commented on code in PR #11893: URL: https://github.com/apache/lucene/pull/11893#discussion_r1009429083

## lucene/analysis/common/src/java/org/apache/lucene/analysis/hunspell/GeneratingSuggester.java:

```
@@ -70,10 +78,10 @@ char transformChar(char c) {
           }
         };
 
-    dictionary.words.processSuggestibleWords(
+    processSuggestibleWords(
         Math.max(1, word.length() - MAX_ROOT_LENGTH_DIFF),
         word.length() + MAX_ROOT_LENGTH_DIFF,
-        (rootChars, forms) -> {
+        (rootChars, formSupplier) -> {
```

Review Comment: Instead of `IntsRef` we now pass a `Supplier<IntsRef>` to allow for lazier deserialization and exclusion of unnecessary information (morphological data ids)
[GitHub] [lucene] donnerpeter commented on a diff in pull request #11893: hunspell: allow for faster dictionary iteration during 'suggest' by using more memory (opt-in)
donnerpeter commented on code in PR #11893: URL: https://github.com/apache/lucene/pull/11893#discussion_r1009429597

## lucene/analysis/common/src/java/org/apache/lucene/analysis/hunspell/GeneratingSuggester.java:

```
@@ -87,45 +95,31 @@ char transformChar(char c) {
           sc += commonPrefix(word, rootChars) - longerWorsePenalty(word.length(), rootChars.length);
 
-          if (roots.size() == MAX_ROOTS && sc <= roots.peek().score) {
+          boolean overflow = roots.size() == MAX_ROOTS;
+          if (overflow && sc <= roots.peek().score) {
             return;
           }
 
           speller.checkCanceled.run();
 
           String root = rootChars.toString();
-          int suitable = filter.findSuitableFormIndex(forms, 0);
-          do {
-            roots.add(new Weighted<>(new Root<>(root, forms.ints[forms.offset + suitable]), sc));
-            suitable = filter.findSuitableFormIndex(forms, suitable + filter.formStep);
-          } while (suitable > 0);
-
-          while (roots.size() > MAX_ROOTS) {
-            roots.poll();
+          IntsRef forms = formSupplier.get();
+          for (int i = 0; i < forms.length; i++) {
```

Review Comment: Here's the loop from the inlined `EntryFilter`, it's now a bit simpler.
[GitHub] [lucene] donnerpeter commented on a diff in pull request #11893: hunspell: allow for faster dictionary iteration during 'suggest' by using more memory (opt-in)
donnerpeter commented on code in PR #11893: URL: https://github.com/apache/lucene/pull/11893#discussion_r1009431207

## lucene/analysis/common/src/java/org/apache/lucene/analysis/hunspell/WordStorage.java:

```
@@ -179,11 +186,7 @@ void processSuggestibleWords(
       }
 
       if (mightMatch) {
-        int dataLength = in.readVInt();
-        if (forms.ints.length < dataLength) {
-          forms.ints = new int[dataLength];
-        }
-        readForms(forms, in, dataLength);
```

Review Comment: Form id reading is moved into LazyFormReader
[GitHub] [lucene] donnerpeter commented on a diff in pull request #11893: hunspell: allow for faster dictionary iteration during 'suggest' by using more memory (opt-in)
donnerpeter commented on code in PR #11893: URL: https://github.com/apache/lucene/pull/11893#discussion_r1009432252

## lucene/analysis/common/src/java/org/apache/lucene/analysis/hunspell/WordStorage.java:

```
@@ -54,7 +55,8 @@ class WordStorage {
   private static final int COLLISION_MASK = 0x40;
   private static final int SUGGESTIBLE_MASK = 0x20;
   private static final int MAX_STORED_LENGTH = SUGGESTIBLE_MASK - 1;
-
+  private final int maxEntryLength;
```

Review Comment: maxEntryLength is used to allow passing Integer.MAX_VALUE as the maximal interesting length (to iterate the whole dictionary) without allocating an array of that size
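The clamping described in the comment above can be sketched as follows (class and method names here are illustrative, not the actual `WordStorage` code):

```java
// Hypothetical sketch: a caller may pass Integer.MAX_VALUE to mean
// "no upper bound", so any length-indexed buffer must be sized by the
// longest entry actually stored, not by the requested bound.
public class LengthClamp {
  static int effectiveMax(int requestedMax, int maxEntryLength) {
    return Math.min(requestedMax, maxEntryLength);
  }

  static int[] lengthBuffer(int requestedMax, int maxEntryLength) {
    // safe even when requestedMax == Integer.MAX_VALUE
    return new int[effectiveMax(requestedMax, maxEntryLength) + 1];
  }
}
```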
[GitHub] [lucene] donnerpeter commented on a diff in pull request #11893: hunspell: allow for faster dictionary iteration during 'suggest' by using more memory (opt-in)
donnerpeter commented on code in PR #11893: URL: https://github.com/apache/lucene/pull/11893#discussion_r1009433258

## lucene/analysis/common/src/java/org/apache/lucene/analysis/hunspell/WordStorage.java:

```
@@ -54,7 +55,8 @@ class WordStorage {
   private static final int COLLISION_MASK = 0x40;
   private static final int SUGGESTIBLE_MASK = 0x20;
   private static final int MAX_STORED_LENGTH = SUGGESTIBLE_MASK - 1;
-
+  private final int maxEntryLength;
+  private final boolean hasCustomMorphData;
```

Review Comment: hasCustomMorphData is used to skip morphological data ids (if present) during the data deserialization, so that the cached flattened data is 2x smaller. That data isn't used during iteration anyway.
[GitHub] [lucene] donnerpeter commented on pull request #11893: hunspell: allow for faster dictionary iteration during 'suggest' by using more memory (opt-in)
donnerpeter commented on PR #11893: URL: https://github.com/apache/lucene/pull/11893#issuecomment-1297113922 With the cache, about 2x memory is used (~850MB for ~190 dictionaries). The caching gives me about 1.5x speedup for en/ru/de.
[GitHub] [lucene] donnerpeter commented on a diff in pull request #11893: hunspell: allow for faster dictionary iteration during 'suggest' by using more memory (opt-in)
donnerpeter commented on code in PR #11893: URL: https://github.com/apache/lucene/pull/11893#discussion_r1009435738

## lucene/analysis/common/src/test/org/apache/lucene/analysis/hunspell/TestPerformance.java:

```
@@ -86,7 +86,7 @@ public void de() throws Exception {
 
   @Test
   public void de_suggest() throws Exception {
-    checkSuggestionPerformance("de", 100);
+    checkSuggestionPerformance("de", 150);
```

Review Comment: We can now suggest for more words in reasonable time, yay!
[GitHub] [lucene] benwtrent commented on issue #10665: Benchmark KNN search with ann-benchmarks [LUCENE-9625]
benwtrent commented on issue #10665: URL: https://github.com/apache/lucene/issues/10665#issuecomment-1297146958

I opened a PR for ann-benchmarks: https://github.com/erikbern/ann-benchmarks/pull/315

I tested PyLucene locally, comparing it to @msokolov's "batch" methodology (writing to disk and spinning up a Java process). PyLucene provided faster numbers in both batch and iterative modes on my M1 Mac. This indicates that the overhead experienced is acceptable for now.

There are improvements we can make to PyLucene (native numpy support is a big one). But it seems prudent to get measurements recorded in ann-benchmarks and iteratively improve our Python over time :).

Also, it's for Lucene 9.1. I am going to see about getting a newer PyLucene version built hitting the latest release. We can update our benchmark implementation to hit the new version when it's available.

@msokolov @jtibshirani IDK if y'all want to review the ann-benchmarks implementation or not. Let me know what you think.
[GitHub] [lucene] jtibshirani commented on issue #10665: Benchmark KNN search with ann-benchmarks [LUCENE-9625]
jtibshirani commented on issue #10665: URL: https://github.com/apache/lucene/issues/10665#issuecomment-1297299587

Thanks @benwtrent, it's great to see that PyLucene works well and has low overhead! It feels more solid than what we were doing before. +1 to preparing a new version. As I mentioned on the PR, we made a big correction to the algorithm in Lucene 9.2 (https://issues.apache.org/jira/browse/LUCENE-10527). It would be too bad if results came out and Lucene did much worse than hnswlib because it was missing this fix!

It'd also be great to compare the results against hnswlib as part of the submission. We can double-check that recall is the same for a given set of parameters. This would give confidence we've interpreted all the parameters right in the ann-benchmarks runner. I'm happy to help with this since I have hnswlib set up locally.
[GitHub] [lucene] vigyasharma commented on issue #11676: Can TimeLimitingBulkScorer exponentially grow the window size? [LUCENE-10640]
vigyasharma commented on issue #11676: URL: https://github.com/apache/lucene/issues/11676#issuecomment-1297303476

Sorry, I meant it'll add a dependence on `nanoTime()`. I thought we used "wall time" to refer to both `currentTimeMillis` and `nanoTime`.

If nanoTime is acceptable, I can use something that wraps the time calls and overrides them for testing, like `QueryProfilerTimer`. It will average out the time over all previous calls instead of reporting the last call time. But maybe that's okay, given there is anyway noise in such measurements.
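A minimal sketch of such a wrapper (hypothetical, not the actual `QueryProfilerTimer`): the clock call is an overridable seam for tests, and the reported value is the average over all recorded intervals rather than the last one.

```java
// Hypothetical time wrapper: override nanoTime() in tests to inject a fake
// clock; averageNanos() smooths measurement noise across all intervals.
public class AveragingTimer {
  private long totalNanos;
  private long count;
  private long startNanos;

  long nanoTime() { // overridable seam for deterministic tests
    return System.nanoTime();
  }

  void start() {
    startNanos = nanoTime();
  }

  void stop() {
    totalNanos += nanoTime() - startNanos;
    count++;
  }

  long averageNanos() {
    return count == 0 ? 0 : totalNanos / count;
  }
}
```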
[GitHub] [lucene] reta commented on pull request #11875: Usability improvements for timeout support in IndexSearcher
reta commented on PR #11875: URL: https://github.com/apache/lucene/pull/11875#issuecomment-1297316017

> Looks good to me. I'll wait for a few days before merging, in case people have comments/concerns with the public visibility for `TimeLimitingBulkScorer`

Thanks a lot @vigyasharma !
[GitHub] [lucene] rmuir commented on issue #11676: Can TimeLimitingBulkScorer exponentially grow the window size? [LUCENE-10640]
rmuir commented on issue #11676: URL: https://github.com/apache/lucene/issues/11676#issuecomment-1297328617

> Sorry, I meant it'll add dependence on `nanoTime()`. I thought we use wallTime to refer to both `currentTimeInMillis` and `nanoTime`.

nanoTime (at least on linux) uses the monotonic clock.
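The exponential-window idea under discussion can be sketched like this (constants and names are hypothetical, not the actual `TimeLimitingBulkScorer` code): the deadline is compared against `System.nanoTime()` once per window, and the window doubles up to a cap, so the clock is read only logarithmically often while the cap bounds how far scoring can overshoot the deadline.

```java
// Hypothetical sketch: score documents in geometrically growing windows,
// checking the monotonic clock once per window instead of once per doc.
public class GrowingWindow {
  static final int INITIAL_INTERVAL = 100;
  static final int MAX_INTERVAL = 1 << 20; // cap bounds deadline overshoot

  static int nextInterval(int current) {
    // doubling keeps the number of clock reads logarithmic in docs scored
    return Math.min(current * 2, MAX_INTERVAL);
  }

  public static void main(String[] args) {
    long deadlineNanos = System.nanoTime() + 10_000_000L; // 10 ms budget
    int interval = INITIAL_INTERVAL;
    long scored = 0;
    while (System.nanoTime() < deadlineNanos) {
      scored += interval; // stand-in for scoring `interval` docs
      interval = nextInterval(interval);
    }
    System.out.println("scored ~" + scored + " docs before timeout");
  }
}
```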
[GitHub] [lucene] rmuir commented on issue #11887: TestDocumentsWriterStallControl takes over a minute with -Ptests.seed=B83F4990EF501F47
rmuir commented on issue #11887: URL: https://github.com/apache/lucene/issues/11887#issuecomment-1297341214

I don't think a profiler is helpful because the test is not doing anything except sleeping on `Object.wait`. I used `jstack` while the test was hung:

```
"TEST-TestDocumentsWriterStallControl.testRandom-seed#[B83F4990EF501F47]" #20 prio=5 os_prio=0 cpu=118.76ms elapsed=100.21s tid=0x7fff9405cc50 nid=0x37b3 in Object.wait() [0x7fffb5ffd000]
   java.lang.Thread.State: WAITING (on object monitor)
	at java.lang.Object.wait(java.base@17.0.5/Native Method)
	- waiting on
	at java.lang.Thread.join(java.base@17.0.5/Thread.java:1304)
	- locked <0xfcbdeee0> (a org.apache.lucene.index.TestDocumentsWriterStallControl$1)
	at java.lang.Thread.join(java.base@17.0.5/Thread.java:1372)
	at org.apache.lucene.index.TestDocumentsWriterStallControl.join(TestDocumentsWriterStallControl.java:313)
	at org.apache.lucene.index.TestDocumentsWriterStallControl.testRandom(TestDocumentsWriterStallControl.java:89)
```
[GitHub] [lucene] rmuir commented on issue #11887: TestDocumentsWriterStallControl takes over a minute with -Ptests.seed=B83F4990EF501F47
rmuir commented on issue #11887: URL: https://github.com/apache/lucene/issues/11887#issuecomment-1297345527

It runs much faster with this patch:

```
--- a/lucene/core/src/test/org/apache/lucene/index/TestDocumentsWriterStallControl.java
+++ b/lucene/core/src/test/org/apache/lucene/index/TestDocumentsWriterStallControl.java
@@ -62,7 +62,7 @@ public class TestDocumentsWriterStallControl extends LuceneTestCase {
 
         @Override
         public void run() {
-          int iters = atLeast(1000);
+          int iters = atLeast(100);
           for (int j = 0; j < iters; j++) {
             ctrl.updateStalled(random().nextInt(stallProbability) == 0);
             if (random().nextInt(5) == 0) { // thread 0 only updates
```
[GitHub] [lucene] dweiss commented on a diff in pull request #11893: hunspell: allow for faster dictionary iteration during 'suggest' by using more memory (opt-in)
dweiss commented on code in PR #11893: URL: https://github.com/apache/lucene/pull/11893#discussion_r1009622594

## lucene/analysis/common/src/java/org/apache/lucene/analysis/hunspell/Hunspell.java:

```
@@ -647,8 +671,23 @@ Root findStem(
     if (!hasGoodSuggestions && dictionary.maxNGramSuggestions > 0) {
       List<String> generated =
-          new GeneratingSuggester(suggestionSpeller)
-              .suggest(dictionary.toLowerCase(word), wordCase, suggestions);
+          new GeneratingSuggester(suggestionSpeller) {
+            @Override
+            void processSuggestibleWords(
+                int minLength, int maxLength, BiConsumer<CharsRef, Supplier<IntsRef>> processor) {
+              if (cacheSuggestibleEntries) {
+                SuggestibleEntryCache cache = suggestibleCache;
+                if (cache == null) {
+                  // a benign race:
+                  // https://shipilev.net/blog/2016/close-encounters-of-jmm-kind/#wishful-benign-is-resilient
```

Review Comment: yeah, it's fine... although unless it's a crucial loop I'd just mark that field volatile so that advanced knowledge of jmm doesn't have to be explained to whoever would like to look at that code later on :)

## lucene/analysis/common/src/java/org/apache/lucene/analysis/hunspell/Hunspell.java:

```
@@ -72,10 +77,29 @@ public Hunspell(Dictionary dictionary) {
    * or suggestion generation by throwing an exception
    */
   public Hunspell(Dictionary dictionary, TimeoutPolicy policy, Runnable checkCanceled) {
+    this(dictionary, policy, checkCanceled, new Stemmer(dictionary), false);
+  }
+
+  private Hunspell(
+      Dictionary dictionary,
+      TimeoutPolicy policy,
+      Runnable checkCanceled,
+      Stemmer stemmer,
+      boolean cacheSuggestibleEntries) {
     this.dictionary = dictionary;
     this.policy = policy;
     this.checkCanceled = checkCanceled;
-    stemmer = new Stemmer(dictionary);
+    this.stemmer = stemmer;
+    this.cacheSuggestibleEntries = cacheSuggestibleEntries;
+  }
+
+  /**
+   * Returns a copy of this Hunspell instance with better suggestion performance but using more
```

Review Comment: This looks fine to me, although perhaps you'd consider an API similar to that used in codecs, where you pass a tradeoff enum hint (MEMORY, SPEED) and get an instance that tries to fulfill these hints? The drawback is that it'd be backward-incompatible, unless you make it a new (static) factory method?
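The volatile alternative suggested above can be sketched as follows (class and method names are illustrative, not the actual Hunspell fields): a racy lazy build of an immutable cache stays correct, and `volatile` makes the publication obvious without requiring JMM folklore.

```java
import java.util.function.Supplier;

// Illustrative sketch: lazily build an immutable cache; the field is
// volatile so later readers see a fully constructed object without any
// benign-race reasoning. Duplicate builds are possible but harmless.
public class CachingHolder<T> {
  private final Supplier<T> builder; // must be idempotent and side-effect free
  private volatile T cache;

  public CachingHolder(Supplier<T> builder) {
    this.builder = builder;
  }

  public T get() {
    T c = cache;
    if (c == null) {
      c = builder.get(); // two threads may both build; results are equivalent
      cache = c;         // volatile write publishes the immutable object
    }
    return c;
  }
}
```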
[GitHub] [lucene] ovalhub commented on issue #10665: Benchmark KNN search with ann-benchmarks [LUCENE-9625]
ovalhub commented on issue #10665: URL: https://github.com/apache/lucene/issues/10665#issuecomment-1297370722

On Mon, 31 Oct 2022, Benjamin Trent wrote:

> I tested PyLucene locally, comparing it to @msokolov's "batch" methodology
> (writing to disk and spinning up a Java process). PyLucene provided faster
> numbers in both Batch and iterative on my m1 mac. This indicates that the
> overhead experienced is acceptable for now.
>
> There are improvements we can make to PyLucene (native numpy support is a
> big one). But, it seems prudent to get measurements recorded in
> ann-benchmarks and iteratively improve our python over time :).

What do you mean with "native numpy"? Are you under the impression that PyLucene is a Python implementation of Lucene? It's not. It's an auto-generated CPython wrapper around an auto-generated C++ wrapper that calls Java Lucene via JNI and that embeds a Java VM into a Python VM. PyLucene _is_ Java Lucene, unchanged, embedded in a Python VM.

Andi..
[GitHub] [lucene] rmuir commented on issue #11887: TestDocumentsWriterStallControl takes over a minute with -Ptests.seed=B83F4990EF501F47
rmuir commented on issue #11887: URL: https://github.com/apache/lucene/issues/11887#issuecomment-1297392725

The condition where this test takes minutes isn't that rare; I ran the test 10 times and hit the slow condition in 3 out of 10 executions:

* 151s
* 158s
* 32s

With the patch, I ran the test 50 times and it was always reasonable. Will make a PR
[GitHub] [lucene] rmuir opened a new pull request, #11894: Tone down TestDocumentsWriterStallControl.testRandom, so it does not take minutes
rmuir opened a new pull request, #11894: URL: https://github.com/apache/lucene/pull/11894

The current test has ~minute runtimes approximately 30% of the time.

Closes #11887
[GitHub] [lucene] benwtrent commented on issue #10665: Benchmark KNN search with ann-benchmarks [LUCENE-9625]
benwtrent commented on issue #10665: URL: https://github.com/apache/lucene/issues/10665#issuecomment-1297430223

@ovalhub `numpy` collections are already native. To use them, I have to pull them into Python collections and then cast them to be native again. Example:

```
X = X.tolist()
doc.add(KnnVectorField("knn", JArray('float')(X), fieldType))
```

The `.tolist()` is weird to me, as `X` is already a natively backed float array, just a numpy object.

Admittedly, I may be misunderstanding how the communication between Python -> `native` -> Java is done. But it SEEMS that calling `tolist` for numpy unnecessarily pulls an already native collection into a generic Python list.
[GitHub] [lucene] benwtrent commented on issue #10665: Benchmark KNN search with ann-benchmarks [LUCENE-9625]
benwtrent commented on issue #10665: URL: https://github.com/apache/lucene/issues/10665#issuecomment-1297440415

> It'd also be great to compare the results against hnswlib as part of the submission. We can double-check that recall is the same for a given set of parameters. This would give confidence we've interpreted all the parameters right in the ann-benchmarks runner. I'm happy to help with this since I have hnswlib set up locally.

Ah, for sure. That would be useful. We would have to wait for a new version of PyLucene to be built for this though, as right now we still have the bug in: https://issues.apache.org/jira/browse/LUCENE-10527
[GitHub] [lucene] ovalhub commented on issue #10665: Benchmark KNN search with ann-benchmarks [LUCENE-9625]
ovalhub commented on issue #10665: URL: https://github.com/apache/lucene/issues/10665#issuecomment-1297463120

On Mon, 31 Oct 2022, Benjamin Trent wrote:

> @ovalhub `numpy` collections are already native. To use them, I have to
> pull them into python collections and then cast them to be native again.
>
> Example:
> ```
> X = X.tolist()
> doc.add(KnnVectorField("knn", JArray('float')(X), fieldType))
> ```
> The `.toList()` is weird to me as `X` is already a natively backed float
> array, just a numpy object.
>
> Admittedly, I may be misunderstanding how the communication between Python
> -> `native` -> Java is done. But it SEEMS that calling `tolist` for numpy
> unnecessarily pulls an already native collection into a generic python
> list.

When you make a JArray with `JArray('float')(X)`, the JArray code in JCC's JArray.h will iterate 'X' "as a sequence", calling PySequence_GetItem(sequence, i), and fill a Java float array with the values returned. If native numpy collections are exposed to Python as sequences (ie, ordered things that can be accessed with int indexes), then you should not have to call .tolist() on them before using them with PyLucene/JCC.

It could also be that numpy's .tolist() just creates a wrapper object around the native array to expose it as a Python sequence? (I don't know numpy).
[GitHub] [lucene] donnerpeter commented on a diff in pull request #11893: hunspell: allow for faster dictionary iteration during 'suggest' by using more memory (opt-in)
donnerpeter commented on code in PR #11893: URL: https://github.com/apache/lucene/pull/11893#discussion_r1009734992

## lucene/analysis/common/src/java/org/apache/lucene/analysis/hunspell/Hunspell.java:

```
@@ -72,10 +77,29 @@ public Hunspell(Dictionary dictionary) {
    * or suggestion generation by throwing an exception
    */
   public Hunspell(Dictionary dictionary, TimeoutPolicy policy, Runnable checkCanceled) {
+    this(dictionary, policy, checkCanceled, new Stemmer(dictionary), false);
+  }
+
+  private Hunspell(
+      Dictionary dictionary,
+      TimeoutPolicy policy,
+      Runnable checkCanceled,
+      Stemmer stemmer,
+      boolean cacheSuggestibleEntries) {
     this.dictionary = dictionary;
     this.policy = policy;
     this.checkCanceled = checkCanceled;
-    stemmer = new Stemmer(dictionary);
+    this.stemmer = stemmer;
+    this.cacheSuggestibleEntries = cacheSuggestibleEntries;
+  }
+
+  /**
+   * Returns a copy of this Hunspell instance with better suggestion performance but using more
```

Review Comment: I've considered this, but then again, there can be different time-memory tradeoffs in different places (and I'm planning another one soon), so I'd name them more specifically.
[GitHub] [lucene] benwtrent commented on a diff in pull request #11860: GITHUB-11830 Better optimize storage for vector connections
benwtrent commented on code in PR #11860: URL: https://github.com/apache/lucene/pull/11860#discussion_r1009762545 ## lucene/core/src/java/org/apache/lucene/codecs/lucene95/Lucene95HnswVectorsReader.java: ## @@ -0,0 +1,505 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */ + +package org.apache.lucene.codecs.lucene95; + +import static org.apache.lucene.search.DocIdSetIterator.NO_MORE_DOCS; + +import java.io.IOException; +import java.util.Arrays; +import java.util.HashMap; +import java.util.Map; +import org.apache.lucene.codecs.CodecUtil; +import org.apache.lucene.codecs.KnnVectorsReader; +import org.apache.lucene.index.*; +import org.apache.lucene.search.ScoreDoc; +import org.apache.lucene.search.TopDocs; +import org.apache.lucene.search.TotalHits; +import org.apache.lucene.store.ChecksumIndexInput; +import org.apache.lucene.store.DataInput; +import org.apache.lucene.store.IndexInput; +import org.apache.lucene.util.Bits; +import org.apache.lucene.util.IOUtils; +import org.apache.lucene.util.RamUsageEstimator; +import org.apache.lucene.util.hnsw.HnswGraph; +import org.apache.lucene.util.hnsw.HnswGraphSearcher; +import org.apache.lucene.util.hnsw.NeighborQueue; +import org.apache.lucene.util.packed.DirectMonotonicReader; +import org.apache.lucene.util.packed.PackedInts; + +/** + * Reads vectors from the index segments along with index data structures supporting KNN search. 
+ * + * @lucene.experimental + */ +public final class Lucene95HnswVectorsReader extends KnnVectorsReader { + + private final FieldInfos fieldInfos; + private final Map fields = new HashMap<>(); + private final IndexInput vectorData; + private final IndexInput vectorIndex; + + Lucene95HnswVectorsReader(SegmentReadState state) throws IOException { +this.fieldInfos = state.fieldInfos; +int versionMeta = readMetadata(state); +boolean success = false; +try { + vectorData = + openDataInput( + state, + versionMeta, + Lucene95HnswVectorsFormat.VECTOR_DATA_EXTENSION, + Lucene95HnswVectorsFormat.VECTOR_DATA_CODEC_NAME); + vectorIndex = + openDataInput( + state, + versionMeta, + Lucene95HnswVectorsFormat.VECTOR_INDEX_EXTENSION, + Lucene95HnswVectorsFormat.VECTOR_INDEX_CODEC_NAME); + success = true; +} finally { + if (success == false) { +IOUtils.closeWhileHandlingException(this); + } +} + } + + private int readMetadata(SegmentReadState state) throws IOException { +String metaFileName = +IndexFileNames.segmentFileName( +state.segmentInfo.name, state.segmentSuffix, Lucene95HnswVectorsFormat.META_EXTENSION); +int versionMeta = -1; +try (ChecksumIndexInput meta = state.directory.openChecksumInput(metaFileName, state.context)) { + Throwable priorE = null; + try { +versionMeta = +CodecUtil.checkIndexHeader( +meta, +Lucene95HnswVectorsFormat.META_CODEC_NAME, +Lucene95HnswVectorsFormat.VERSION_START, +Lucene95HnswVectorsFormat.VERSION_CURRENT, +state.segmentInfo.getId(), +state.segmentSuffix); +readFields(meta, state.fieldInfos); + } catch (Throwable exception) { +priorE = exception; + } finally { +CodecUtil.checkFooter(meta, priorE); + } +} +return versionMeta; + } + + private static IndexInput openDataInput( + SegmentReadState state, int versionMeta, String fileExtension, String codecName) + throws IOException { +String fileName = +IndexFileNames.segmentFileName(state.segmentInfo.name, state.segmentSuffix, fileExtension); +IndexInput in = state.directory.openInput(fileName, 
state.context); +boolean success = false; +try { + int versionVectorData = + CodecUtil.checkIndexHeader( + in, + codecName, + Lucene95HnswVectorsFormat.VERSION_START, + Lucene95HnswVectorsFormat.VERSION_CURRENT, + state.segmentInfo.getId(), + state.segmentSuffix); + if (versionMeta != versionVe
[GitHub] [lucene] jtibshirani commented on a diff in pull request #11860: GITHUB-11830 Better optimize storage for vector connections
jtibshirani commented on code in PR #11860: URL: https://github.com/apache/lucene/pull/11860#discussion_r1009765612 ## lucene/backward-codecs/src/java/org/apache/lucene/backward_codecs/lucene94/package-info.java: ## @@ -0,0 +1,422 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +/** + * Lucene 9.3 file format. Review Comment: Just noticed, this should say Lucene 9.4 file format (probably just a typo from the existing file). ## lucene/core/src/java/org/apache/lucene/codecs/lucene95/Lucene95HnswVectorsReader.java: ## @@ -0,0 +1,520 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. 
You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.lucene.codecs.lucene95; + +import static org.apache.lucene.search.DocIdSetIterator.NO_MORE_DOCS; + +import java.io.IOException; +import java.util.Arrays; +import java.util.HashMap; +import java.util.Map; +import org.apache.lucene.codecs.CodecUtil; +import org.apache.lucene.codecs.KnnVectorsReader; +import org.apache.lucene.index.*; +import org.apache.lucene.search.ScoreDoc; +import org.apache.lucene.search.TopDocs; +import org.apache.lucene.search.TotalHits; +import org.apache.lucene.store.ChecksumIndexInput; +import org.apache.lucene.store.DataInput; +import org.apache.lucene.store.IndexInput; +import org.apache.lucene.util.Bits; +import org.apache.lucene.util.IOUtils; +import org.apache.lucene.util.RamUsageEstimator; +import org.apache.lucene.util.hnsw.HnswGraph; +import org.apache.lucene.util.hnsw.HnswGraphSearcher; +import org.apache.lucene.util.hnsw.NeighborQueue; +import org.apache.lucene.util.packed.DirectMonotonicReader; +import org.apache.lucene.util.packed.PackedInts; + +/** + * Reads vectors from the index segments along with index data structures supporting KNN search. 
+ * + * @lucene.experimental + */ +public final class Lucene95HnswVectorsReader extends KnnVectorsReader { + + private final FieldInfos fieldInfos; + private final Map fields = new HashMap<>(); + private final IndexInput vectorData; + private final IndexInput vectorIndex; + + Lucene95HnswVectorsReader(SegmentReadState state) throws IOException { +this.fieldInfos = state.fieldInfos; +int versionMeta = readMetadata(state); +boolean success = false; +try { + vectorData = + openDataInput( + state, + versionMeta, + Lucene95HnswVectorsFormat.VECTOR_DATA_EXTENSION, + Lucene95HnswVectorsFormat.VECTOR_DATA_CODEC_NAME); + vectorIndex = + openDataInput( + state, + versionMeta, + Lucene95HnswVectorsFormat.VECTOR_INDEX_EXTENSION, + Lucene95HnswVectorsFormat.VECTOR_INDEX_CODEC_NAME); + success = true; +} finally { + if (success == false) { +IOUtils.closeWhileHandlingException(this); + } +} + } + + private int readMetadata(SegmentReadState state) throws IOException { +String metaFileName = +IndexFileNames.segmentFileName( +state.segmentInfo.name, state.segmentSuffix, Lucene95HnswVectorsFormat.META_EXTENSION); +int versionMeta = -1; +try (ChecksumIndexInput meta = state.directory.openChecksumInput(metaFileName, state.context)) { + Throwable priorE = null; + try { +versionMeta = +CodecUtil.checkIndexHeader( +meta, +Lucene95HnswVectorsFormat.META_CODEC_NAME, +Lucene95HnswVectorsFormat.VERSION_START,
[GitHub] [lucene] jtibshirani commented on pull request #239: LUCENE-10040: Handle deletions in nearest vector search
jtibshirani commented on PR #239: URL: https://github.com/apache/lucene/pull/239#issuecomment-1297711772 @harishankar-gopalan sorry for the slow response! Your overall understanding is right. In Lucene, deletions are handled by marking a document as deleted using a 'tombstone'. The index structures are not actually updated (this includes the HNSW graph). In response to your questions... 1. Yes, when there are deletions, we make sure to expand the search space to retrieve the top `k`. We use a strategy similar to what many vector search engines use for kNN with filtering. During the HNSW search, we make sure to exclude deleted nodes from the final candidate set, but deleted nodes are still used for traversing the graph. 2. No, there is no work in that direction. This is because Lucene segments are never updated after they are first written (except under rare circumstances). Lucene's immutable segment model is core to its design, and it's the reason we use 'tombstones' instead of modifying the graph in place. For segment merges, we combine the vectors across all segments and build a new graph from scratch. We make sure to skip over deleted documents during this merge, so they have no effect on the time it takes to build the graph. This merge process is quite expensive and we're brainstorming ways of making it faster (https://github.com/apache/lucene/issues/11354). -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
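The behavior described in point 1 can be sketched as a filtered best-first search over a proximity graph: deleted nodes still serve as stepping stones for traversal, but never enter the result set. This is a simplified single-layer sketch with assumed inputs, not Lucene's actual HnswGraphSearcher:

```python
import heapq

def knn_search(graph, score, entry, k, deleted):
    """Best-first search over a single-layer proximity graph.

    Deleted nodes are still traversed (they keep the graph connected)
    but are excluded from the results, mirroring Lucene's tombstones.
    """
    visited = {entry}
    candidates = [(-score(entry), entry)]  # max-heap via negated scores
    results = []  # min-heap of (score, node), capped at size k
    while candidates:
        neg, node = heapq.heappop(candidates)
        if len(results) == k and -neg < results[0][0]:
            break  # no remaining candidate can improve the top-k
        if node not in deleted:
            heapq.heappush(results, (-neg, node))
            if len(results) > k:
                heapq.heappop(results)
        for nb in graph[node]:
            if nb not in visited:
                visited.add(nb)
                heapq.heappush(candidates, (-score(nb), nb))
    return sorted((n for _, n in results), key=score, reverse=True)

# Node 3 (the best match) is reachable only through deleted node 1,
# so the search must pass through the tombstone to find it.
graph = {0: [1], 1: [0, 2, 3], 2: [1], 3: [1]}
values = {0: 0.1, 1: 0.9, 2: 0.5, 3: 1.0}
assert knn_search(graph, values.get, entry=0, k=2, deleted={1}) == [3, 2]
```

The example shows why deleted nodes must remain traversable: pruning them from the walk entirely could disconnect parts of the graph and silently lose good matches.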
[GitHub] [lucene] jpountz commented on pull request #11875: Usability improvements for timeout support in IndexSearcher
jpountz commented on PR #11875: URL: https://github.com/apache/lucene/pull/11875#issuecomment-1298076313 Sorry for the lag; I'm on vacation. The problem with "this class may be useful outside of Lucene" to me is that it could apply to any class in Lucene. We did indeed make some classes public on this criterion in the past (e.g. FixedBitSet), but I would like the bar to be high: is there really not a better way? Is the need to make this class public a sign that the functionality is not exposed the right way? For the case described above by Vigya, I can think of two alternatives we might want to consider instead: - Copy the code for this time-limiting bulk scorer instead of using the Lucene class. If there are only a couple of users of Lucene who would benefit from this class being public, maybe it's a better trade-off to let them take full ownership of this code to allow Lucene to keep treating it as an implementation detail? - Figure out different refactors that would help OpenSearch leverage timeout support from IndexSearcher as-is. E.g., if it folded the logic to use live docs to lead iteration into the weight wrapper, would that work? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
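For scale, the first alternative (taking ownership of a copy of the time-limiting logic) is not much code. A rough sketch of the idea behind a time-limiting bulk scorer: score documents in windows, check a monotonic clock between windows, and grow the window so the check stays cheap. The names, default window size, and growth cap are illustrative, not Lucene's actual constants:

```python
import time

class TimeExceededError(Exception):
    pass

def score_with_time_limit(score_window, max_doc, budget_ns,
                          window=100, now_ns=time.monotonic_ns):
    """Score docs [0, max_doc) window by window under a time budget.

    score_window(lo, hi) stands in for BulkScorer.score over a range.
    Uses a monotonic nanosecond clock (never wall time), and doubles
    the window each round so the clock check itself stays cheap.
    """
    deadline = now_ns() + budget_ns
    doc = 0
    while doc < max_doc:
        if now_ns() > deadline:
            raise TimeExceededError(f"time limit hit at doc {doc}")
        hi = min(doc + window, max_doc)
        score_window(doc, hi)
        doc = hi
        window = min(window * 2, 1 << 20)  # exponential growth, capped
    return doc

collected = []
score_with_time_limit(lambda lo, hi: collected.extend(range(lo, hi)),
                      max_doc=250, budget_ns=1_000_000_000)
assert collected == list(range(250))
```

The monotonic clock matters here: a wall clock can jump backwards or forwards (NTP, DST), which would make the deadline comparison unreliable mid-query.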