[PR] [9_10] Mark TermInSetQuery ctors with varargs terms as deprecated [lucene]
slow-J opened a new pull request, #12864: URL: https://github.com/apache/lucene/pull/12864 Issue: https://github.com/apache/lucene/issues/12243. The deprecated items here are being removed in the `main` branch through PR: https://github.com/apache/lucene/pull/12837. For methods calling these constructors, new overloads taking a collection parameter are also added, and the previous varargs overloads are deprecated: SortedSetDocValuesField#newSlowSetQuery, SortedDocValuesField#newSlowSetQuery, KeywordField#newSetQuery -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
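The deprecation pattern being applied here can be sketched in a few lines of self-contained Java. Note that `SetQueryFactory` and its signatures below are illustrative stand-ins, not the actual Lucene API: the old varargs overload stays for compatibility, gains `@Deprecated`, and delegates to a new Collection-taking overload.

```java
import java.util.Arrays;
import java.util.Collection;

// Illustrative sketch of the deprecation pattern described above;
// class and method names are hypothetical, not Lucene's actual API.
class SetQueryFactory {

  /** New preferred overload: callers pass any Collection of terms. */
  static String newSetQuery(String field, Collection<String> terms) {
    return field + ":" + terms;
  }

  /** Old varargs overload, kept for compatibility and delegating to the new one. */
  @Deprecated
  static String newSetQuery(String field, String... terms) {
    return newSetQuery(field, Arrays.asList(terms));
  }
}
```

Callers migrate by switching from `newSetQuery("field", "a", "b")` to `newSetQuery("field", List.of("a", "b"))`; behavior is unchanged in the meantime.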
[PR] Fix intermittently failing TestParallelLeafReader [lucene]
ChrisHegarty opened a new pull request, #12865: URL: https://github.com/apache/lucene/pull/12865 This commit fixes the intermittently failing TestParallelLeafReader. The ParallelLeafReader requires the document order to be consistent across indexes - each document contains the union of the fields of all documents with the same document number. The test asserts this. But with MockRandomMergePolicy now potentially reversing the doc ID order while merging, the test indexes no longer satisfy this assumption and the assertions can fail. The solution is simply to ensure that no merging actually happens in these tiny test indexes. Similar root cause as #12850.
[PR] Prevent extra similarity computation for single-level graphs [lucene]
kaivalnp opened a new pull request, #12866: URL: https://github.com/apache/lucene/pull/12866 ### Description [`#findBestEntryPoint`](https://github.com/apache/lucene/blob/4bc7850465dfac9dc0638d9ee782007883869ffe/lucene/core/src/java/org/apache/lucene/util/hnsw/HnswGraphSearcher.java#L151) is used to determine the entry point for the last level of HNSW search. It finds the single best-scoring node from [all upper levels](https://github.com/apache/lucene/blob/4bc7850465dfac9dc0638d9ee782007883869ffe/lucene/core/src/java/org/apache/lucene/util/hnsw/HnswGraphSearcher.java#L159) - but performs an [unnecessary computation](https://github.com/apache/lucene/blob/4bc7850465dfac9dc0638d9ee782007883869ffe/lucene/core/src/java/org/apache/lucene/util/hnsw/HnswGraphSearcher.java#L157) (along with [recording one visited node](https://github.com/apache/lucene/blob/4bc7850465dfac9dc0638d9ee782007883869ffe/lucene/core/src/java/org/apache/lucene/util/hnsw/HnswGraphSearcher.java#L154)) when the graph has just 1 level (in which case the entry node is simply the overall graph's entry node). Also added a test to demonstrate this (it fails without the changes in this PR) -- we would visit `graph.size() + 1` nodes when `topK` is high (it should be a maximum of `graph.size()`).
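The shortcut can be illustrated with a toy sketch (the real logic lives in `HnswGraphSearcher#findBestEntryPoint`; everything below, including the `visited` counter, is a simplified stand-in, not the PR's code): when the graph has only level 0, the global entry node is returned directly, with no similarity computation and no visited-node bookkeeping.

```java
// Illustrative sketch of the single-level shortcut described above; the real
// code is HnswGraphSearcher#findBestEntryPoint, these names are made up.
class EntryPointSketch {
  static int visited = 0; // stand-in for the collector's visited count

  /** Returns the entry point for the level-0 search of a toy "graph". */
  static int findBestEntryPoint(int numLevels, int globalEntryNode) {
    if (numLevels == 1) {
      // Single-level graph: there are no upper levels to descend through,
      // so no similarity is computed and no visit is recorded.
      return globalEntryNode;
    }
    // Multi-level case: the greedy descent scores candidates per level,
    // incrementing the visited count for each similarity computation.
    visited++; // stand-in for the per-node scoring work
    return globalEntryNode;
  }
}
```

With this shape, a search over a single-level graph visits at most `graph.size()` nodes rather than `graph.size() + 1`.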
Re: [PR] Fix intermittently failing TestParallelLeafReader [lucene]
ChrisHegarty commented on PR #12865: URL: https://github.com/apache/lucene/pull/12865#issuecomment-1836024282 The test was previously seen to fail about once in every couple of hundred runs, with:

```
org.junit.ComparisonFailure: expected: but was:
    at __randomizedtesting.SeedInfo.seed([D07DE7201C057238:E432507488980B27]:0)
    at org.junit.Assert.assertEquals(Assert.java:117)
    at org.junit.Assert.assertEquals(Assert.java:146)
    at org.apache.lucene.index.TestParallelLeafReader.queryTest(TestParallelLeafReader.java:264)
    at org.apache.lucene.index.TestParallelLeafReader.testQueries(TestParallelLeafReader.java:49)
    ...
```

After the change, the test has not failed in several hundreds of thousands of runs.
Re: [PR] Add ParentJoin KNN support [lucene]
benwtrent commented on PR #12434: URL: https://github.com/apache/lucene/pull/12434#issuecomment-1836025921 @david-sitsky sorry for the confusion, it was renamed `DiversifyingChildren*KnnVectorQuery`
Re: [PR] Prevent extra similarity computation for single-level graphs [lucene]
benwtrent commented on code in PR #12866: URL: https://github.com/apache/lucene/pull/12866#discussion_r1412053821

```
## lucene/core/src/java/org/apache/lucene/util/hnsw/HnswGraphSearcher.java:
@@ -100,19 +100,10 @@ private static void search(
       HnswGraphSearcher graphSearcher,
       Bits acceptOrds)
       throws IOException {
-    int initialEp = graph.entryNode();
-    if (initialEp == -1) {
-      return;
+    int ep = graphSearcher.findBestEntryPoint(scorer, graph, knnCollector);
```

Review Comment: I think you should still check `collector.earlyTerminated()` after finding the best entry point, as we increment the visited count after each check. Meaning we could have visited too many by one, and there is no point in continuing.
Re: [PR] Introduce growInRange to reduce array overallocation [lucene]
stefanvodita commented on PR #12844: URL: https://github.com/apache/lucene/pull/12844#issuecomment-1836064137 I did 5 benchmark runs for 4 configurations. To avoid making this comment way too large, I'll just report the averages across the 5 runs per configuration. It looks like there are some genuine differences between the baseline and candidate, but I'm not sure the differences between different candidate configurations are significant. How do we trade off between the extra milliseconds and the memory savings?

```
baseline
recall  latency  nDoc  fanout  maxConn  beamWidth  visited  index ms
0.720   0.16     1     0       64       250        100      2502     1.00  post-filter
0.542   0.23     10    0       64       250        100      43624    1.00  post-filter
0.512   0.28     20    0       64       250        100      111019   1.00  post-filter

candidate with INITIAL_CAPACITY == 10
recall  latency  nDoc  fanout  maxConn  beamWidth  visited  index ms
0.720   0.16     1     0       64       250        100      2498     1.00  post-filter
0.542   0.23     10    0       64       250        100      45063    1.00  post-filter
0.512   0.28     20    0       64       250        100      116683   1.00  post-filter

candidate with INITIAL_CAPACITY == 100
recall  latency  nDoc  fanout  maxConn  beamWidth  visited  index ms
0.720   0.16     1     0       64       250        100      2528     1.00  post-filter
0.542   0.23     10    0       64       250        100      46055    1.00  post-filter
0.512   0.28     20    0       64       250        100      115407   1.00  post-filter

candidate with INITIAL_CAPACITY == 1000
recall  latency  nDoc  fanout  maxConn  beamWidth  visited  index ms
0.720   0.16     1     0       64       250        100      2518     1.00  post-filter
0.542   0.23     10    0       64       250        100      46617    1.00  post-filter
0.512   0.29     20    0       64       250        100      118359   1.00  post-filter
```
Re: [PR] Introduce growInRange to reduce array overallocation [lucene]
benwtrent commented on PR #12844: URL: https://github.com/apache/lucene/pull/12844#issuecomment-1836068375 @stefanvodita

> How do we trade off between the extra milliseconds and the memory savings?

It would be good to know the actual memory savings. I don't know how to measure the trade-off without knowing what we are trading off for.
[PR] Reconcile changelog 9.9.0 section [lucene]
ChrisHegarty opened a new pull request, #12867: URL: https://github.com/apache/lucene/pull/12867 Reconcile the changelog between branch_9_9 and main. This change just reorders a number of entries in _main_ to match that of branch_9_9, as identified by Mike's script, #12860. There are still a couple of minor inconsistencies between the changelog entries, but since they require a little more than just reordering, I'll separate them out.
Re: [PR] Add simple tool to diff entries in lucene's CHANGES.txt that should be identical [lucene]
ChrisHegarty commented on code in PR #12860: URL: https://github.com/apache/lucene/pull/12860#discussion_r1412073640

```
## dev-tools/scripts/diff_lucene_changes.py:
@@ -0,0 +1,79 @@
+#!/usr/bin/env python3
+# -*- coding: utf-8 -*-
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements. See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License. You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+import os
+import re
+import subprocess
+import sys
+import tempfile
+import urllib.request
+
+'''
+A simple tool to see diffs between main's version of CHANGES.txt entries for
+a given release vs the stable branch's version. It's best to keep these 1)
+identical and 2) matching what changes were actually backported to be honest
+to users and avoid future annoying conflicts on backport.
+'''
+
+# e.g. python3 -u diff_lucene_changes.py branch_9_9 main 9.9.0
+#
+
+def get_changes_url(branch_name):
+  if os.path.isdir(branch_name):
+    url = f'file://{branch_name}/lucene/CHANGES.txt'
+  else:
+    url = f'https://raw.githubusercontent.com/apache/lucene/{branch_name}/lucene/CHANGES.txt'
+  print(f'NOTE: resolving {branch_name} --> {url}')
+  return url
+
+def extract_release_section(changes_txt, release_name):
+  open('/x/tmp/foo.txt', 'wb').write(changes_txt)
```

Review Comment: Did you mean just `/tmp` here? I had to modify this to run on my Mac.
Re: [PR] Add static function in TaskExecutor to retrieve the results for a collection of Future [lucene]
shubhamvishu commented on PR #12798: URL: https://github.com/apache/lucene/pull/12798#issuecomment-1836093934 Exactly, @javanna. Do you think it would make sense to have a new `FutureUtil` class and add this function there?
Re: [PR] Introduce growInRange to reduce array overallocation [lucene]
dungba88 commented on code in PR #12844: URL: https://github.com/apache/lucene/pull/12844#discussion_r1412105704

```
## lucene/core/src/java/org/apache/lucene/util/hnsw/NeighborArray.java:
@@ -32,17 +32,20 @@
  * @lucene.internal
  */
 public class NeighborArray {
+  private static final int INITIAL_CAPACITY = 10;
   private final boolean scoresDescOrder;
+  private final int maxSize;
   private int size;
   float[] score;
   int[] node;
   private int sortedNodeSize;
   public final ReadWriteLock rwlock = new ReentrantReadWriteLock(true);

   public NeighborArray(int maxSize, boolean descOrder) {
-    node = new int[maxSize];
-    score = new float[maxSize];
+    node = new int[INITIAL_CAPACITY];
```

Review Comment: I am new to this class, so just curious: what happens if INITIAL_CAPACITY > maxSize? I thought maxSize was the maximum size we would ever need. But looking at the previous code, it seems the array could still be expanded even after being fully pre-allocated with maxSize, via the line `node = ArrayUtil.grow(node);`. Was it initially grow-on-demand, then changed to full preallocation, without that line being cleaned up? If maxSize is indeed the maximum size we will ever need, maybe `Math.min(INITIAL_CAPACITY, maxSize)` would be better.
Re: [PR] Prevent extra similarity computation for single-level graphs [lucene]
kaivalnp commented on code in PR #12866: URL: https://github.com/apache/lucene/pull/12866#discussion_r1412139289

```
## lucene/core/src/java/org/apache/lucene/util/hnsw/HnswGraphSearcher.java:
@@ -100,19 +100,10 @@ private static void search(
       HnswGraphSearcher graphSearcher,
       Bits acceptOrds)
       throws IOException {
-    int initialEp = graph.entryNode();
-    if (initialEp == -1) {
-      return;
+    int ep = graphSearcher.findBestEntryPoint(scorer, graph, knnCollector);
```

Review Comment: Makes sense! Added now..
Re: [PR] Reuse BitSet when there are deleted documents in the index instead of creating new BitSet [lucene]
Pulkitg64 commented on PR #12857: URL: https://github.com/apache/lucene/pull/12857#issuecomment-1836188173 Thanks @shubhamvishu for taking a look.

> I went through the change but I didn't understand how are we not reusing the bitset in the current approach. We do wrap the BitSetIterator with a FilteredDocIdSetIterator when there are deleted docs right which would eventually use the bitset to advance the inner iterator(See [this](https://github.com/apache/lucene/blob/main/lucene/core/src/java/org/apache/lucene/search/FilteredDocIdSetIterator.java#L72-L87)).

Sorry! I think I should have used a different title for this PR. The part of the current approach I am trying to optimize is the case where the iterator is a `BitSetIterator` instance and `liveDocs` is not null. In the current approach we create a new BitSet that takes live docs into consideration, but creating that bitset is a linear-time process, because we need to iterate over all matched docs. This BitSet creation is not required: we can wrap both the matched-docs bitset and the live-docs bitset under a single Bits instance, which can later be used directly during approximate search. So instead of creating a new BitSet, we compute at runtime whether a document is valid for searching, which saves us the time to create the new BitSet.
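The "wrap instead of copy" idea can be sketched with a minimal stand-in for Lucene's `Bits` interface (the interface below mirrors `org.apache.lucene.util.Bits` but is redefined here so the sketch is self-contained; the PR itself operates on the real Lucene types):

```java
// Minimal sketch of the "wrap instead of copy" idea above. Bits here is a
// stand-in for org.apache.lucene.util.Bits; class names are hypothetical.
interface Bits {
  boolean get(int index);
  int length();
}

class LiveMatchesBits implements Bits {
  private final Bits matched;  // docs matched by the filter
  private final Bits liveDocs; // docs that are not deleted

  LiveMatchesBits(Bits matched, Bits liveDocs) {
    this.matched = matched;
    this.liveDocs = liveDocs;
  }

  @Override
  public boolean get(int doc) {
    // O(1) per lookup, no O(maxDoc) upfront copy of the matched set.
    return matched.get(doc) && liveDocs.get(doc);
  }

  @Override
  public int length() {
    return matched.length();
  }

  /** Convenience factory for the sketch: a Bits view over a boolean[]. */
  static Bits of(boolean[] bits) {
    return new Bits() {
      @Override public boolean get(int i) { return bits[i]; }
      @Override public int length() { return bits.length; }
    };
  }
}
```

Each `get(doc)` costs two bit tests at search time; the O(maxDoc) upfront pass that built the combined bitset disappears.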
Re: [PR] Reconcile changelog 9.9.0 section [lucene]
ChrisHegarty merged PR #12867: URL: https://github.com/apache/lucene/pull/12867
Re: [PR] Introduce growInRange to reduce array overallocation [lucene]
dungba88 commented on code in PR #12844: URL: https://github.com/apache/lucene/pull/12844#discussion_r1412155078

```
## lucene/core/src/java/org/apache/lucene/util/ArrayUtil.java:
@@ -330,6 +330,29 @@ public static int[] growExact(int[] array, int newLength) {
     return copy;
   }

+  /**
+   * Returns an array whose size is at least {@code minLength}, generally over-allocating
+   * exponentially, but never allocating more than {@code maxLength} elements.
+   */
+  public static int[] growInRange(int[] array, int minLength, int maxLength) {
+    if (array.length >= minLength) {
+      return array;
+    }
+    if (minLength > maxLength) {
+      throw new IllegalArgumentException(
+          "requested minimum array length "
+              + minLength
+              + " is larger than requested maximum array length "
+              + maxLength);
+    }
```

Review Comment: I think it would be strange that we allow minLength to be > maxLength; that could introduce latent bugs.
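For reference, here is a self-contained sketch of the semantics under discussion, with the validation performed before the early return as the reviewer suggests (the plain doubling growth below is a simplification; the real method would use `ArrayUtil.oversize`):

```java
import java.util.Arrays;

// Sketch of the growInRange semantics discussed above; the real code targets
// ArrayUtil and uses ArrayUtil.oversize rather than plain doubling.
class GrowSketch {
  static int[] growInRange(int[] array, int minLength, int maxLength) {
    // Validate first, so an inconsistent min/max pair always fails fast,
    // even when the array is already large enough.
    if (minLength > maxLength) {
      throw new IllegalArgumentException(
          "requested minimum array length " + minLength
              + " is larger than requested maximum array length " + maxLength);
    }
    if (array.length >= minLength) {
      return array; // already big enough
    }
    // Grow exponentially, but never past maxLength.
    int newLength = Math.min(Math.max(array.length * 2, minLength), maxLength);
    return Arrays.copyOf(array, newLength);
  }
}
```

This caps over-allocation: a request near the hard limit gets exactly `maxLength` elements instead of the next power-of-two-ish oversize.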
Re: [PR] Introduce growInRange to reduce array overallocation [lucene]
dungba88 commented on code in PR #12844: URL: https://github.com/apache/lucene/pull/12844#discussion_r1412167183

```
## lucene/core/src/java/org/apache/lucene/util/hnsw/NeighborArray.java:
@@ -32,17 +32,20 @@
  * @lucene.internal
  */
 public class NeighborArray {
+  private static final int INITIAL_CAPACITY = 10;
   private final boolean scoresDescOrder;
+  private final int maxSize;
   private int size;
   float[] score;
   int[] node;
   private int sortedNodeSize;
   public final ReadWriteLock rwlock = new ReentrantReadWriteLock(true);

   public NeighborArray(int maxSize, boolean descOrder) {
-    node = new int[maxSize];
-    score = new float[maxSize];
+    node = new int[INITIAL_CAPACITY];
```

Review Comment: This seems to come from the `M` parameter, which is the maximum number of connections (neighbors) a node can have: M on upper levels and M*2 on level 0.
Re: [PR] Introduce growInRange to reduce array overallocation [lucene]
benwtrent commented on code in PR #12844: URL: https://github.com/apache/lucene/pull/12844#discussion_r1412176786

```
## lucene/core/src/java/org/apache/lucene/util/hnsw/NeighborArray.java:
@@ -32,17 +32,20 @@
  * @lucene.internal
  */
 public class NeighborArray {
+  private static final int INITIAL_CAPACITY = 10;
   private final boolean scoresDescOrder;
+  private final int maxSize;
   private int size;
   float[] score;
   int[] node;
   private int sortedNodeSize;
   public final ReadWriteLock rwlock = new ReentrantReadWriteLock(true);

   public NeighborArray(int maxSize, boolean descOrder) {
-    node = new int[maxSize];
-    score = new float[maxSize];
+    node = new int[INITIAL_CAPACITY];
```

Review Comment: We should never ever allow `INITIAL_CAPACITY` to be used to create these arrays when it is larger than `maxSize`. This would be a serious bug IMO.
Re: [PR] Introduce growInRange to reduce array overallocation [lucene]
msokolov commented on code in PR #12844: URL: https://github.com/apache/lucene/pull/12844#discussion_r1412182912

```
## lucene/core/src/java/org/apache/lucene/util/hnsw/NeighborArray.java:
@@ -32,17 +32,20 @@
  * @lucene.internal
  */
 public class NeighborArray {
+  private static final int INITIAL_CAPACITY = 10;
   private final boolean scoresDescOrder;
+  private final int maxSize;
   private int size;
   float[] score;
   int[] node;
   private int sortedNodeSize;
   public final ReadWriteLock rwlock = new ReentrantReadWriteLock(true);

   public NeighborArray(int maxSize, boolean descOrder) {
-    node = new int[maxSize];
-    score = new float[maxSize];
+    node = new int[INITIAL_CAPACITY];
```

Review Comment: I think it's some leftover stuff from before, yes. Although it's also possible we have a bug in our maxSize initialization? I don't think we do though.
Re: [PR] Introduce growInRange to reduce array overallocation [lucene]
msokolov commented on code in PR #12844: URL: https://github.com/apache/lucene/pull/12844#discussion_r1412186499

```
## lucene/core/src/java/org/apache/lucene/util/hnsw/NeighborArray.java:
@@ -32,17 +32,20 @@
  * @lucene.internal
  */
 public class NeighborArray {
+  private static final int INITIAL_CAPACITY = 10;
   private final boolean scoresDescOrder;
+  private final int maxSize;
   private int size;
   float[] score;
   int[] node;
   private int sortedNodeSize;
   public final ReadWriteLock rwlock = new ReentrantReadWriteLock(true);

   public NeighborArray(int maxSize, boolean descOrder) {
-    node = new int[maxSize];
-    score = new float[maxSize];
+    node = new int[INITIAL_CAPACITY];
```

Review Comment: I really don't understand why `candidate with INITIAL_CAPACITY == 1000` would be slower. Is it GC-related? Have you been able to look at JFR at all?
Re: [I] Jvm Crashes occassionaly with Lucene 8.10.0, JDK 11.0.15+10 [lucene]
msokolov commented on issue #12863: URL: https://github.com/apache/lucene/issues/12863#issuecomment-1836250307 If the JVM crashes, it's generally considered a JVM bug. I'll note there is a more recent JDK 11 release - they seem to be up to 11.0.20. Have you considered upgrading? These point releases generally include bug fixes, although I can't speak to whether this particular issue has been addressed.
Re: [PR] Introduce growInRange to reduce array overallocation [lucene]
dungba88 commented on code in PR #12844: URL: https://github.com/apache/lucene/pull/12844#discussion_r1412215432

```
## lucene/core/src/java/org/apache/lucene/util/hnsw/NeighborArray.java:
@@ -32,17 +32,20 @@
  * @lucene.internal
  */
 public class NeighborArray {
+  private static final int INITIAL_CAPACITY = 10;
   private final boolean scoresDescOrder;
+  private final int maxSize;
   private int size;
   float[] score;
   int[] node;
   private int sortedNodeSize;
   public final ReadWriteLock rwlock = new ReentrantReadWriteLock(true);

   public NeighborArray(int maxSize, boolean descOrder) {
-    node = new int[maxSize];
-    score = new float[maxSize];
+    node = new int[INITIAL_CAPACITY];
```

Review Comment: Could it be due to the over-allocation, i.e. every node now allocating a 1000-element array whether it needs it or not? I think that's all the more reason we should use `Math.min(INITIAL_CAPACITY, maxSize)`.
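The clamp being suggested is a small constructor change, sketched here with a stripped-down stand-in for `NeighborArray` (fields reduced to the two arrays; this is not the PR's code):

```java
// Sketch of the suggested clamp: never allocate more than maxSize up front,
// and never let a large INITIAL_CAPACITY overshoot a small maxSize.
class NeighborArraySketch {
  private static final int INITIAL_CAPACITY = 10;

  final int[] node;
  final float[] score;

  NeighborArraySketch(int maxSize) {
    // Start small, but respect maxSize as a hard cap on what we ever need.
    int initial = Math.min(INITIAL_CAPACITY, maxSize);
    node = new int[initial];
    score = new float[initial];
  }
}
```

With the clamp, a graph configured with a tiny `maxConn` no longer pays for ten slots per neighbor list it can never fill.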
Re: [PR] Add simple tool to diff entries in lucene's CHANGES.txt that should be identical [lucene]
mikemccand commented on code in PR #12860: URL: https://github.com/apache/lucene/pull/12860#discussion_r1412235383

```
## dev-tools/scripts/diff_lucene_changes.py:
@@ -0,0 +1,79 @@
+#!/usr/bin/env python3
+# -*- coding: utf-8 -*-
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements. See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License. You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+import os
+import re
+import subprocess
+import sys
+import tempfile
+import urllib.request
+
+'''
+A simple tool to see diffs between main's version of CHANGES.txt entries for
+a given release vs the stable branch's version. It's best to keep these 1)
+identical and 2) matching what changes were actually backported to be honest
+to users and avoid future annoying conflicts on backport.
+'''
+
+# e.g. python3 -u diff_lucene_changes.py branch_9_9 main 9.9.0
+#
+
+def get_changes_url(branch_name):
+  if os.path.isdir(branch_name):
+    url = f'file://{branch_name}/lucene/CHANGES.txt'
+  else:
+    url = f'https://raw.githubusercontent.com/apache/lucene/{branch_name}/lucene/CHANGES.txt'
+  print(f'NOTE: resolving {branch_name} --> {url}')
+  return url
+
+def extract_release_section(changes_txt, release_name):
+  open('/x/tmp/foo.txt', 'wb').write(changes_txt)
```

Review Comment: Er, woops, this was not supposed to be committed -- it was just there for temporary debugging -- I'll remove! Good catch, thanks @ChrisHegarty ;)
Re: [I] Jvm Crashes occasionally with Lucene 8.10.0, JDK 11.0.15+10 [lucene]
rmuir commented on issue #12863: URL: https://github.com/apache/lucene/issues/12863#issuecomment-1836306192

I don't think it's a JVM bug. This is what happens when you try to read from a closed IndexReader that is backed by mmap: because we've unmapped the byte buffer when `close()` was called -> SIGSEGV. So it is a bug in the application (accessing a closed IndexReader), but it causes the JVM crash. It is a tale as old as time; you can find numerous threads on it elsewhere, so I'm not going to repeat it all again. We make a best effort to detect such abuses and prevent this, but it's not possible to do safely without using @uschindler's Panama-based provider and a modern JVM. So either:
* fix your code not to access closed IndexReaders
* upgrade to recent Lucene and Java and use the Panama provider. You will get an AlreadyClosedException or similar instead.
* call `setUseUnmap(false)` on your MMapDirectory. The downside is that much disk and address space will now be tied to the Java garbage collector...
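The first remedy -- never touching a closed reader -- is typically enforced with reference counting. Below is a minimal, self-contained sketch of the incRef/decRef discipline that Lucene's IndexReader uses internally; the class and method bodies here are simplified stand-ins written for illustration, not Lucene's actual implementation:

```java
import java.util.concurrent.atomic.AtomicInteger;

// Simplified stand-in for Lucene's reference-counted reader lifecycle.
class RefCountedReader {
    private final AtomicInteger refCount = new AtomicInteger(1);

    /** Returns true if a reference was acquired; false if already closed. */
    boolean tryIncRef() {
        for (;;) {
            int count = refCount.get();
            if (count <= 0) return false;  // closed: must NOT touch the mmap
            if (refCount.compareAndSet(count, count + 1)) return true;
        }
    }

    void decRef() {
        int count = refCount.decrementAndGet();
        if (count == 0) {
            // last reference released: safe to unmap / free resources here
        } else if (count < 0) {
            throw new IllegalStateException("too many decRef calls");
        }
    }

    String search(String query) {
        if (!tryIncRef()) {
            // clean failure instead of dereferencing an unmapped page (SIGSEGV)
            throw new IllegalStateException("this reader is already closed");
        }
        try {
            return "results for " + query;  // would read the mapped pages here
        } finally {
            decRef();
        }
    }

    void close() {
        decRef();  // release the creating thread's reference
    }
}
```

With this discipline a searcher racing against `close()` gets a clean exception rather than a JVM crash, which is essentially what the Panama provider guarantees at the memory-segment level.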
Re: [I] Jvm Crashes occasionally with Lucene 8.10.0, JDK 11.0.15+10 [lucene]
rmuir closed issue #12863: Jvm Crashes occasionally with Lucene 8.10.0, JDK 11.0.15+10 URL: https://github.com/apache/lucene/issues/12863
Re: [PR] Prevent extra similarity computation for single-level graphs [lucene]
kaivalnp commented on PR #12866: URL: https://github.com/apache/lucene/pull/12866#issuecomment-1836321336

Thanks @benwtrent :)

> But a CHANGES entry for Lucene 9.10

I did not see a section for "9.10" in the [current CHANGES.txt](https://github.com/apache/lucene/blob/b231e5b213200f4049dcc7364cc1452d11cd3f2c/lucene/CHANGES.txt) -- any idea when this will be added?
Re: [PR] Prevent extra similarity computation for single-level graphs [lucene]
benwtrent commented on PR #12866: URL: https://github.com/apache/lucene/pull/12866#issuecomment-1836322275 @kaivalnp you can add it :)
Re: [I] Jvm Crashes occasionally with Lucene 8.10.0, JDK 11.0.15+10 [lucene]
uschindler commented on issue #12863: URL: https://github.com/apache/lucene/issues/12863#issuecomment-1836326667 Thanks @rmuir, I can confirm the above pattern happens when you access an IndexReader/IndexSearcher from another thread after it was closed. To see AlreadyClosedExceptions you need to upgrade to at least Lucene ~9.5 (better: use the latest 9.9) and use Java 19 or later. This configuration will give you an AlreadyClosedException instead of crashing your JVM.
Re: [I] Reproducible error in TestLucene90HnswVectorsFormat.testIndexedValueNotAliased [lucene]
benwtrent commented on issue #12840: URL: https://github.com/apache/lucene/issues/12840#issuecomment-1836360899 fixed via: https://github.com/apache/lucene/pull/12848
Re: [I] Reproducible error in TestLucene90HnswVectorsFormat.testIndexedValueNotAliased [lucene]
benwtrent closed issue #12840: Reproducible error in TestLucene90HnswVectorsFormat.testIndexedValueNotAliased URL: https://github.com/apache/lucene/issues/12840
Re: [PR] Prevent extra similarity computation for single-level graphs [lucene]
kaivalnp commented on PR #12866: URL: https://github.com/apache/lucene/pull/12866#issuecomment-1836394910 Replicated [this commit](https://github.com/apache/lucene/commit/94b879a5) (which added the 9.9.0 entry) for 9.10.0. Please let me know if I missed something.
[PR] Try using Murmurhash 3 for bloom filters [lucene]
shubhamvishu opened a new pull request, #12868: URL: https://github.com/apache/lucene/pull/12868

### Description

We are currently using MurmurHash 2 ([MurmurHash64.java](https://github.com/apache/lucene/blob/main/lucene/codecs/src/java/org/apache/lucene/codecs/bloom/MurmurHash64.java)) in the bloom filter implementation in Lucene. MurmurHash 3 (the latest available in the [MurmurHash](https://en.wikipedia.org/wiki/MurmurHash) family of hash functions) provides better performance, a better avalanche effect, etc., and offers a 128-bit variant which MurmurHash 2 doesn't. This PR aims to use MurmurHash 3 with bloom filters and see if that helps.

Note: We are already using MurmurHash 3 in the [BytesRefHash](https://github.com/apache/lucene/blob/main/lucene/core/src/java/org/apache/lucene/util/BytesRefHash.java#L412-L414) implementation (#)

Next steps: Run `luceneutil` benchmarks and share the results
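For intuition about why the hash quality matters here, the sketch below shows a toy bloom filter driven by MurmurHash3's well-known 64-bit finalizer (`fmix64`, the avalanche step of the real algorithm) combined with Kirsch-Mitzenmacher double hashing. The `ToyBloomFilter` class is an illustration written for this digest, not Lucene's `FuzzySet`:

```java
import java.util.BitSet;

class ToyBloomFilter {
    private final BitSet bits;
    private final int numBits;
    private final int numHashes;

    ToyBloomFilter(int numBits, int numHashes) {
        this.bits = new BitSet(numBits);
        this.numBits = numBits;
        this.numHashes = numHashes;
    }

    // MurmurHash3's 64-bit finalizer (fmix64): strong avalanche on a long key.
    static long fmix64(long x) {
        x ^= x >>> 33;
        x *= 0xff51afd7ed558ccdL;
        x ^= x >>> 33;
        x *= 0xc4ceb9fe1a85ec53L;
        x ^= x >>> 33;
        return x;
    }

    // Kirsch-Mitzenmacher: derive k probe positions from the two halves
    // of a single 64-bit hash, instead of computing k independent hashes.
    private int probe(long hash, int i) {
        int h1 = (int) hash;
        int h2 = (int) (hash >>> 32);
        return Math.floorMod(h1 + i * h2, numBits);
    }

    void add(long key) {
        long h = fmix64(key);
        for (int i = 0; i < numHashes; i++) bits.set(probe(h, i));
    }

    boolean mightContain(long key) {
        long h = fmix64(key);
        for (int i = 0; i < numHashes; i++) {
            if (!bits.get(probe(h, i))) return false; // definitely absent
        }
        return true; // possibly present (false positives allowed)
    }
}
```

A weak avalanche (as in older hashes) makes the k probe positions correlated and inflates the false-positive rate, which is the practical motivation for preferring the MurmurHash 3 mixing constants.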
Re: [I] Make NeighborArray fixed size [lucene]
msokolov closed issue #11783: Make NeighborArray fixed size URL: https://github.com/apache/lucene/issues/11783
Re: [I] Make NeighborArray fixed size [lucene]
msokolov commented on issue #11783: URL: https://github.com/apache/lucene/issues/11783#issuecomment-1836436845 resolving this since I think we did it and are now going back to a more dynamic allocation strategy LOL
Re: [PR] Improve Javadoc [lucene]
mikemccand commented on PR #12508: URL: https://github.com/apache/lucene/pull/12508#issuecomment-1836443822 Woops, thank you for your attention to detail @lukas-vlcek! And sorry for the crazy long time to respond. I'll merge this.
Re: [PR] Improve Javadoc [lucene]
mikemccand merged PR #12508: URL: https://github.com/apache/lucene/pull/12508
Re: [PR] Changed byte to int for prefix_length_key [lucene]
msokolov merged PR #12507: URL: https://github.com/apache/lucene/pull/12507
Re: [PR] Try using Murmurhash 3 for bloom filters [lucene]
shubhamvishu commented on PR #12868: URL: https://github.com/apache/lucene/pull/12868#issuecomment-1836499432

Below are the `luceneutil` benchmark results for `wikimediumall`. Looks all flat and good to me.

```
Task                           QPS baseline  StdDev   QPS my_modified_version  StdDev   Pct diff            p-value
LowIntervalsOrdered            6.08   (5.0%)          6.00   (6.3%)            -1.3% ( -12% -  10%)  0.463
TermDTSort                     97.64  (4.6%)          96.36  (5.6%)            -1.3% ( -11% -   9%)  0.417
OrHighHigh                     28.51  (5.7%)          28.28  (5.7%)            -0.8% ( -11% -  11%)  0.651
HighTermMonthSort              2287.85 (2.9%)         2270.76 (3.3%)           -0.7% (  -6% -   5%)  0.447
MedTermDayTaxoFacets           14.70  (4.6%)          14.60  (4.0%)            -0.7% (  -8% -   8%)  0.620
BrowseDayOfYearTaxoFacets      3.94   (28.1%)         3.91   (28.0%)           -0.7% ( -44% -  77%)  0.941
Respell                        54.80  (1.8%)          54.45  (2.2%)            -0.6% (  -4% -   3%)  0.307
BrowseDateTaxoFacets           3.92   (27.5%)         3.89   (27.4%)           -0.6% ( -43% -  74%)  0.942
BrowseMonthSSDVFacets          4.45   (10.1%)         4.43   (9.7%)            -0.6% ( -18% -  21%)  0.845
MedSpanNear                    91.41  (2.6%)          90.86  (3.7%)            -0.6% (  -6% -   5%)  0.554
MedIntervalsOrdered            6.61   (3.7%)          6.57   (4.4%)            -0.5% (  -8% -   7%)  0.671
HighIntervalsOrdered           6.73   (3.8%)          6.69   (4.2%)            -0.5% (  -8% -   7%)  0.691
HighTermTitleSort              121.44 (4.0%)          120.97 (4.6%)            -0.4% (  -8% -   8%)  0.778
HighSpanNear                   5.31   (2.6%)          5.29   (3.2%)            -0.4% (  -6% -   5%)  0.701
LowSpanNear                    50.68  (2.6%)          50.51  (3.3%)            -0.3% (  -6% -   5%)  0.731
OrHighMed                      46.96  (3.3%)          46.83  (3.2%)            -0.3% (  -6% -   6%)  0.788
OrHighNotLow                   235.93 (5.2%)          235.59 (6.5%)            -0.1% ( -11% -  12%)  0.937
HighSloppyPhrase               4.70   (4.7%)          4.70   (5.6%)            -0.1% (  -9% -  10%)  0.931
BrowseMonthTaxoFacets          4.15   (35.3%)         4.15   (35.3%)           -0.1% ( -52% - 108%)  0.991
Fuzzy2                         54.22  (1.1%)          54.15  (1.3%)            -0.1% (  -2% -   2%)  0.745
OrNotHighLow                   517.03 (1.6%)          516.41 (1.6%)            -0.1% (  -3% -   3%)  0.816
Fuzzy1                         38.59  (1.2%)          38.56  (1.3%)            -0.1% (  -2% -   2%)  0.808
BrowseRandomLabelSSDVFacets    2.77   (7.5%)          2.77   (7.5%)            -0.0% ( -14% -  16%)  0.986
AndHighHigh                    29.06  (3.2%)          29.06  (3.1%)            -0.0% (  -6% -   6%)  0.991
AndHighMed                     55.89  (2.9%)          55.91  (2.7%)             0.0% (  -5% -   5%)  0.970
OrHighLow                      319.86 (2.2%)          320.00 (2.0%)             0.0% (  -4% -   4%)  0.945
IntNRQ                         21.94  (2.5%)          21.96  (3.4%)             0.1% (  -5% -   6%)  0.905
LowPhrase                      136.01 (4.1%)          136.20 (3.8%)             0.1% (  -7% -   8%)  0.908
OrNotHighMed                   208.82 (3.6%)          209.20 (3.9%)             0.2% (  -7% -   8%)  0.879
OrNotHighHigh                  241.45 (3.8%)          241.96 (4.9%)             0.2% (  -8% -   9%)  0.880
OrHighNotHigh                  191.17 (4.6%)          191.65 (5.7%)             0.3% (  -9% -  11%)  0.878
AndHighLow                     300.40 (3.0%)          301.26 (2.7%)             0.3% (  -5% -   6%)  0.754
BrowseDateSSDVFacets           0.90   (12.4%)         0.91   (10.9%)            0.3% ( -20% -  26%)  0.933
BrowseRandomLabelTaxoFacets    3.34   (24.3%)         3.36   (24.6%)            0.4% ( -39% -  65%)  0.963
OrHighNotMed                   215.89 (5.5%)          216.69 (6.3%)             0.4% ( -10% -  12%)  0.843
Wildcard                       61.64  (2.0%)          61.91  (2.2%)             0.4% (  -3% -   4%)  0.501
HighTermTitleBDVSort           5.43   (4.3%)          5.46   (4.5%)             0.5% (  -8% -   9%)  0.729
PKLookup                       136.14 (1.9%)          136.97 (1.9%)             0.6% (  -3% -   4%)  0.310
HighTermDayOfYearSort          195.65 (4.1%)          196.91 (4.6%)             0.6% (  -7% -   9%)  0.641
MedTerm                        455.25 (5.3%)          458.30 (5.8%)             0.7% (  -9% -  12%)  0.702
LowSloppyPhrase                30.11  (2.0%)          30.36  (1.8%)             0.8% (  -2% -   4%)  0.171
HighTerm                       367.19 (6.1%)          370.21 (6.3%)             0.8% ( -10% -  14%)  0.6
```
Re: [PR] Reuse BitSet when there are deleted documents in the index instead of creating new BitSet [lucene]
kaivalnp commented on code in PR #12857: URL: https://github.com/apache/lucene/pull/12857#discussion_r1412393818

## lucene/core/src/java/org/apache/lucene/search/AbstractKnnVectorQuery.java:

```
@@ -118,13 +118,38 @@ private TopDocs getLeafResults(LeafReaderContext ctx, Weight filterWeight) throw
       return NO_RESULTS;
     }
-    BitSet acceptDocs = createBitSet(scorer.iterator(), liveDocs, maxDoc);
-    int cost = acceptDocs.cardinality();
+    DocIdSetIterator iterator = scorer.iterator();
+    BitSet bitSet =
+        iterator instanceof BitSetIterator
+            ? ((BitSetIterator) iterator).getBitSet()
+            : BitSet.of(iterator, maxDoc);
+    Bits acceptDocs =
+        new Bits() {
+          @Override
+          public boolean get(int index) {
+            return liveDocs != null ? liveDocs.get(index) & bitSet.get(index) : bitSet.get(index);
+          }
+
+          @Override
+          public int length() {
+            return bitSet.cardinality();
+          }
+        };
+
+    FilteredDocIdSetIterator filteredDocIdSetIterator =
+        new FilteredDocIdSetIterator(new BitSetIterator(bitSet, maxDoc)) {
+          @Override
+          protected boolean match(int doc) {
+            return liveDocs == null || liveDocs.get(doc);
+          }
+        };
+
+    int cost = acceptDocs.length();
```

Review Comment (on `length()`): `length()` is more like the [maximum doc](https://github.com/apache/lucene/blob/0e96b9cd8c1267acee02684bcac5be89d60b3bf4/lucene/core/src/java/org/apache/lucene/util/Bits.java#L73) you can request, this should be `bitSet.length()`?

Review Comment (on `get(int index)`): nit: Can we make this cleaner by `return (liveDocs == null || liveDocs.get(index)) && bitSet.get(index)`?

Review Comment (on the `Bits` wrapper): Perhaps we can wrap the `BitSet` in a new `Bits` only for the case we're trying to optimize (when `iterator instanceof BitSetIterator`) -- not changing the `BitSet.of` flow like @shubhamvishu also mentioned?

Review Comment (on `cost`): The `cost` here determines what limit to set for `#approximateSearch`. If we use `acceptDocs.length()`, this will be equal to `maxDoc` (and always complete graph search without falling back to exact search, even when we want to..) Perhaps this should be `acceptDocs.cardinality()`?
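The reviewer's two corrections -- the null-safe `&&` form of `get` and having `length()` report the addressable range rather than a popcount -- combine into a wrapper like the sketch below. `Bits` is re-declared minimally here so the snippet stands alone; in Lucene it is `org.apache.lucene.util.Bits`, and `IntersectionBits` is a name invented for this illustration:

```java
import java.util.BitSet;

// Minimal stand-in for org.apache.lucene.util.Bits.
interface Bits {
    boolean get(int index);
    int length(); // one past the highest addressable index, NOT a popcount
}

final class IntersectionBits implements Bits {
    private final Bits liveDocs;   // may be null, meaning all docs are live
    private final BitSet filter;   // the pre-computed filter bitset
    private final int maxDoc;

    IntersectionBits(Bits liveDocs, BitSet filter, int maxDoc) {
        this.liveDocs = liveDocs;
        this.filter = filter;
        this.maxDoc = maxDoc;
    }

    @Override
    public boolean get(int index) {
        // Null-safe intersection, as suggested in the review.
        return (liveDocs == null || liveDocs.get(index)) && filter.get(index);
    }

    @Override
    public int length() {
        return maxDoc; // addressable range, not filter.cardinality()
    }
}
```

Keeping `length()` and cardinality separate matters downstream: the exact-vs-approximate search decision wants the cardinality (how many docs match), while `length()` only bounds which indexes may be queried.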
Re: [PR] Initial impl of MMapDirectory for Java 22 [lucene]
uschindler commented on PR #12706: URL: https://github.com/apache/lucene/pull/12706#issuecomment-1836540146

I tested the new code, while at the same time commenting out the Java 21 fallback code that checks for "closed" in the `IllegalStateException` message here: https://github.com/apache/lucene/blob/5c9152917614ecee0eef7612ab71c24c9d92529c/lucene/core/src/java21/org/apache/lucene/store/MemorySegmentIndexInput.java#L116-L121

With JDK 22 build 23 the test fails because it rethrows the `IllegalStateException`, since the `isAlive()` check didn't work. With build 26, beasting the test worked for hours:

```
$ ./gradlew :lucene:core:beast --tests TestMMapDirectory.testAceWithThreads -Ptests.multiplier=500 -Ptests.dups=100
```

(Build 23, or earlier Java versions like 21, failed on the first or second build.)
Re: [PR] Prevent extra similarity computation for single-level graphs [lucene]
benwtrent merged PR #12866: URL: https://github.com/apache/lucene/pull/12866
[PR] Simplifying text area stream in Luke- ticket 12809 [lucene]
pratikshelarkar opened a new pull request, #12869: URL: https://github.com/apache/lucene/pull/12869 Hi, this simplifies the text area stream in Luke (ticket 12809). This is my first contribution to Lucene. Can you please review my code and advise? I'll try my best to add this enhancement. Thanks,
Re: [I] Simplifying TextAreaPrintStream in Luke [lucene]
pratikshelarkar commented on issue #12809: URL: https://github.com/apache/lucene/issues/12809#issuecomment-1836594465 Hi, Can you please review my PR for this ticket?
Re: [PR] Introduce growInRange to reduce array overallocation [lucene]
stefanvodita commented on PR #12844: URL: https://github.com/apache/lucene/pull/12844#issuecomment-1836599463

I did some memory profiling and it doesn't look promising. Let's take initial capacity 100 as an example. Total allocation is 3.41GB compared to 1.29GB on the baseline, and peak heap usage is 946MiB compared to 547MiB on baseline. It's worse for smaller initial capacities, and larger capacities devolve into the baseline behavior. I might try a few runs on a different data-set, but maybe these neighbor arrays tend to be mostly used in the general case.

Baseline: [baseline.jfr.zip](https://github.com/apache/lucene/files/13530935/baseline.jfr.zip) (screenshot: https://github.com/apache/lucene/assets/41467371/a2a09c9c-ef91-4227-a4f8-655404b2cb21)

Candidate: [candidate-100.jfr.zip](https://github.com/apache/lucene/files/13530936/candidate-100.jfr.zip) (screenshot: https://github.com/apache/lucene/assets/41467371/80c2141a-fc6f-4624-9f7a-fb264ad36ef5)
Re: [PR] Introduce growInRange to reduce array overallocation [lucene]
stefanvodita commented on code in PR #12844: URL: https://github.com/apache/lucene/pull/12844#discussion_r1412454853

## lucene/core/src/java/org/apache/lucene/util/hnsw/NeighborArray.java:

```
@@ -32,17 +32,20 @@
  * @lucene.internal
  */
 public class NeighborArray {
+  private static final int INITIAL_CAPACITY = 10;
   private final boolean scoresDescOrder;
+  private final int maxSize;
   private int size;
   float[] score;
   int[] node;
   private int sortedNodeSize;
   public final ReadWriteLock rwlock = new ReentrantReadWriteLock(true);

   public NeighborArray(int maxSize, boolean descOrder) {
-    node = new int[maxSize];
-    score = new float[maxSize];
+    node = new int[INITIAL_CAPACITY];
```

Review Comment:
> what happen if INITIAL_CAPACITY > maxSize

Great point. I fixed this and tested with `INITIAL_CAPACITY == 1000` again. In the profiler, it looks the same as running the baseline. I assume nodes tend not to have 1000 neighbors.
Re: [PR] Reuse BitSet when there are deleted documents in the index instead of creating new BitSet [lucene]
shubhamvishu commented on PR #12857: URL: https://github.com/apache/lucene/pull/12857#issuecomment-1836604882

@kaivalnp We could use `acceptDocs.cardinality()` when it's a `BitSetIterator` to get the upper bound, which might include some deletes, but that would still sometimes change the decision of whether to go for exact search or not: we don't know how many of those docs are live, only the number of deletes in the segment (we don't know the intersection of the two). One thing that might be tried is to come up with some heuristic that adds a penalty to the cost based on the number of deletes in the segment (i.e. `ctx.reader().numDeletedDocs()/ctx.reader().maxDoc()`). For example, if there are 10% deletes we could decrease the cost by 10% or maybe 5%. This might help in cases where we miss falling back to exact search. Though this would need some thorough benchmarking to see what works best.

On a separate note, I'm wondering if there is some use case where we don't need to know this cost upfront and could directly go for approximate search. Currently, this optimization only kicks in when the iterator is a `BitSetIterator`, but if it's possible to skip this cost step or get the cost by some other heuristic/approximation, then we could make it completely lazily evaluated using `DISI#advance(docid)` for those use cases. @msokolov @benwtrent Maybe you could share your thoughts on this?
Re: [PR] Introduce growInRange to reduce array overallocation [lucene]
benwtrent commented on code in PR #12844: URL: https://github.com/apache/lucene/pull/12844#discussion_r1412475261

## lucene/core/src/java/org/apache/lucene/util/hnsw/NeighborArray.java:

```
@@ -32,17 +32,20 @@
  * @lucene.internal
  */
 public class NeighborArray {
+  private static final int INITIAL_CAPACITY = 10;
   private final boolean scoresDescOrder;
+  private final int maxSize;
   private int size;
   float[] score;
   int[] node;
   private int sortedNodeSize;
   public final ReadWriteLock rwlock = new ReentrantReadWriteLock(true);

   public NeighborArray(int maxSize, boolean descOrder) {
-    node = new int[maxSize];
-    score = new float[maxSize];
+    node = new int[INITIAL_CAPACITY];
```

Review Comment: @stefanvodita with a maxConn of `64`, nodes will at most have `128` neighbors (maybe `129`, as when we exceed the accounted size, we will then remove one before storing).
Re: [PR] Introduce growInRange to reduce array overallocation [lucene]
stefanvodita commented on code in PR #12844: URL: https://github.com/apache/lucene/pull/12844#discussion_r1412502016

## lucene/core/src/java/org/apache/lucene/util/hnsw/NeighborArray.java:

```
@@ -32,17 +32,20 @@
  * @lucene.internal
  */
 public class NeighborArray {
+  private static final int INITIAL_CAPACITY = 10;
   private final boolean scoresDescOrder;
+  private final int maxSize;
   private int size;
   float[] score;
   int[] node;
   private int sortedNodeSize;
   public final ReadWriteLock rwlock = new ReentrantReadWriteLock(true);

   public NeighborArray(int maxSize, boolean descOrder) {
-    node = new int[maxSize];
-    score = new float[maxSize];
+    node = new int[INITIAL_CAPACITY];
```

Review Comment: Oh, I see. Is there some recommended value or normal range for maxConn?
Re: [PR] Introduce growInRange to reduce array overallocation [lucene]
zhaih commented on code in PR #12844: URL: https://github.com/apache/lucene/pull/12844#discussion_r1412533096

## lucene/core/src/java/org/apache/lucene/util/hnsw/NeighborArray.java:

```
@@ -32,17 +32,20 @@
  * @lucene.internal
  */
 public class NeighborArray {
+  private static final int INITIAL_CAPACITY = 10;
   private final boolean scoresDescOrder;
+  private final int maxSize;
   private int size;
   float[] score;
   int[] node;
   private int sortedNodeSize;
   public final ReadWriteLock rwlock = new ReentrantReadWriteLock(true);

   public NeighborArray(int maxSize, boolean descOrder) {
-    node = new int[maxSize];
-    score = new float[maxSize];
+    node = new int[INITIAL_CAPACITY];
```

Review Comment: I think it's just due to consuming more memory -> more GC cycles needed -> higher latency. So I rechecked the code: when we insert a node, we will first collect `beamWidth` candidates, and then try to diversely add those candidates to the `NeighborArray`. So I think:
* in the case that `beamWidth > maxSize`, we can just init this with `maxSize` and be done, because in a larger graph it's likely that the first fill will directly fill the `NeighborArray` to full, and there's no point in resizing it with any init size.
* in the case that `beamWidth < maxSize`, we can just init this with `beamWidth`, such that the initial fill will likely leave the array in a nearly full state?
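That reasoning boils down to picking `min(beamWidth, maxSize)` as the starting capacity and capping growth at `maxSize`. The sketch below is an illustration of that policy only -- `GrowthPolicy` and its 1.5x growth factor are assumptions made for this example, not the actual patch:

```java
import java.util.Arrays;

class GrowthPolicy {
    /** Starting capacity: a near-full first fill without exceeding maxSize. */
    static int initialCapacity(int beamWidth, int maxSize) {
        return Math.min(beamWidth, maxSize);
    }

    /** Grow to at least minLength, but never past maxSize (the neighbor cap). */
    static int[] growInRange(int[] array, int minLength, int maxSize) {
        if (minLength > maxSize) {
            throw new IllegalArgumentException(
                "minLength " + minLength + " exceeds maxSize " + maxSize);
        }
        if (array.length >= minLength) {
            return array; // already big enough, no copy needed
        }
        // Grow by ~1.5x (an illustrative factor), clamped into [minLength, maxSize].
        int newLength = Math.min(maxSize, Math.max(minLength, array.length + (array.length >> 1)));
        return Arrays.copyOf(array, newLength);
    }
}
```

The clamp at `maxSize` is what avoids the over-allocation that the profiling above flagged, while the lower bound keeps the resize count small when the first fill is large.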
Re: [PR] Reuse BitSet when there are deleted documents in the index instead of creating new BitSet [lucene]
benwtrent commented on PR #12857: URL: https://github.com/apache/lucene/pull/12857#issuecomment-1836748924

Is our goal memory usage or speed? We could use `FixedBitSet#intersectionCount` and avoid having to create a new bit set that is the intersection.

I am honestly not sure the implementation here is any faster than just creating the bit set upfront and checking it. During search, you now have to check two bitsets instead of one. If the filter happens to match fewer docs than the number of docs visited in a typical search, your implementation here seems like it would be slower.
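The trade-off raised above can be made concrete with a small stand-alone sketch. It uses `java.util.BitSet` as a stand-in for Lucene's `FixedBitSet` (an assumption for illustration only): per-visit checks against two bitsets versus precomputing the intersection once and checking a single bitset per visit.

```java
import java.util.BitSet;

public class FilterIntersectionSketch {

    /** The two-check approach: a doc matches only if it is live (not
     *  deleted) AND accepted by the filter. Costs two lookups per visit. */
    static boolean matchesTwoChecks(int doc, BitSet liveDocs, BitSet filter) {
        return liveDocs.get(doc) && filter.get(doc);
    }

    /** The precompute approach: build the intersection once upfront;
     *  each visited doc then costs a single lookup. */
    static BitSet intersect(BitSet liveDocs, BitSet filter) {
        BitSet result = (BitSet) filter.clone();
        result.and(liveDocs); // and() is destructive, hence the clone
        return result;
    }

    public static void main(String[] args) {
        BitSet live = new BitSet();
        live.set(0, 8);   // docs 0..7 exist
        live.clear(3);    // doc 3 is deleted

        BitSet filter = new BitSet();
        filter.set(2, 5); // filter accepts docs 2, 3, 4

        BitSet both = intersect(live, filter);
        System.out.println(both.cardinality());                 // 2 (docs 2 and 4)
        System.out.println(matchesTwoChecks(3, live, filter));  // false: deleted
        System.out.println(matchesTwoChecks(4, live, filter));  // true
    }
}
```

Which approach wins depends on how many docs a search visits relative to the filter size, which is exactly why the comment asks for benchmarks: the upfront intersection pays a fixed cost proportional to the segment size, while the two-check approach pays per visited doc.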
Re: [PR] Reuse BitSet when there are deleted documents in the index instead of creating new BitSet [lucene]
benwtrent commented on PR #12857: URL: https://github.com/apache/lucene/pull/12857#issuecomment-1836750284

Broad feedback: any "optimizations" without benchmarking aren't optimizations, they are just guesses. I am curious to see if this helps CPU usage in any way. I could see it helping memory usage.
Re: [PR] Add ParentJoin KNN support [lucene]
david-sitsky commented on PR #12434: URL: https://github.com/apache/lucene/pull/12434#issuecomment-1836848114

> @david-sitsky sorry for the confusion, it was renamed `DiversifyingChildren*KnnVectorQuery`

Ah.. no worries, thanks. We should update the changelog https://lucene.apache.org/core/9_8_0/changes/Changes.html#v9.8.0.new_features since it is still referring to the old class names.
Re: [PR] Try using Murmurhash 3 for bloom filters [lucene]
shubhamvishu commented on PR #12868: URL: https://github.com/apache/lucene/pull/12868#issuecomment-1837045741

@mikemccand rightly pointed out that `luceneutil` doesn't use the bloom filter postings format by default, and we should enable it for the `id` field and rerun the benchmarks to see the impact. Will rerun and share the results here.
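For context on why an `id` (primary-key) field is the interesting case for this benchmark: a Bloom filter rejects most lookups for absent keys without consulting the main terms dictionary. The following is a toy sketch of that mechanism using Java's built-in `hashCode` with Kirsch-Mitzenmacher double hashing; it is not Lucene's `BloomFilteringPostingsFormat`, nor the MurmurHash3 variant the PR benchmarks.

```java
import java.util.BitSet;

public class BloomSketch {
    private final BitSet bits;
    private final int numBits;
    private final int numHashes;

    BloomSketch(int numBits, int numHashes) {
        this.bits = new BitSet(numBits);
        this.numBits = numBits;
        this.numHashes = numHashes;
    }

    // Kirsch-Mitzenmacher: derive k hash functions from two base hashes,
    // index_i = (h1 + i*h2) mod numBits. Hash quality matters in practice,
    // which is exactly what swapping in MurmurHash3 is meant to improve.
    private int index(String key, int i) {
        int h1 = key.hashCode();
        int h2 = Integer.rotateLeft(h1, 16) ^ 0x9E3779B9; // cheap second hash
        return Math.floorMod(h1 + i * h2, numBits);
    }

    void add(String key) {
        for (int i = 0; i < numHashes; i++) {
            bits.set(index(key, i));
        }
    }

    /** false => key is definitely absent; true => key may be present. */
    boolean mightContain(String key) {
        for (int i = 0; i < numHashes; i++) {
            if (!bits.get(index(key, i))) {
                return false; // one unset bit is enough to rule the key out
            }
        }
        return true;
    }

    public static void main(String[] args) {
        BloomSketch bloom = new BloomSketch(1 << 16, 3);
        bloom.add("doc-42");
        System.out.println(bloom.mightContain("doc-42")); // true: no false negatives
    }
}
```

Because primary-key lookups in an indexing pipeline are dominated by misses across segments, the filter's false-positive rate (and hence its hash quality) directly drives how often the expensive terms-dictionary lookup can be skipped, which is what the benchmark rerun should show.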