Re: [PR] Remove halt() call in TestSimpleServer (part of TestStressNRTReplication [lucene]
dweiss merged PR #13177: URL: https://github.com/apache/lucene/pull/13177

-- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] Support disabling IndexSearcher.maxClauseCount with a value of -1 [lucene]
jpountz commented on PR #13178: URL: https://github.com/apache/lucene/pull/13178#issuecomment-1993797886

I'm curious what problem you are trying to address. It looks like you're trying to avoid the overhead of checking the number of clauses, but intuitively this wouldn't help much, as we have other operations that run in time linear in the number of clauses anyway, such as creating the query in the first place, rewriting it, and creating a weight.
Re: [PR] Make BP work on indexes that have blocks. [lucene]
jpountz merged PR #13125: URL: https://github.com/apache/lucene/pull/13125
[I] Improve Lucene's I/O concurrency [lucene]
jpountz opened a new issue, #13179: URL: https://github.com/apache/lucene/issues/13179

### Description

Currently, Lucene's I/O concurrency is bound by its search concurrency. If `IndexSearcher` runs on N threads, then Lucene will never perform more than N I/Os concurrently. Unless you significantly overprovision your search thread pool (which is bad for other reasons), Lucene will bottleneck on I/O latency without even maxing out the IOPS of the host.

I don't think that Lucene should fully embrace asynchrony in its APIs, or query evaluation would become overly complicated. But I still expect that we have a lot of room for improvement to allow each search thread to perform multiple I/Os concurrently under the hood when needed. Some examples:

- When running a query on two terms, e.g. `apache OR lucene`, could the I/O lookups in the `tim` file (terms dictionary) be performed concurrently for both terms?
- When running a query on two terms, once start offsets in the `doc` file (postings) have been resolved, could we start loading the first bytes of these postings lists from disk concurrently?
- When fetching the top N=100 stored documents that match a query, could we load bytes from the `fdt` file (stored fields) for all these documents concurrently?

This would require API changes in our `Directory` APIs, and in some low-level `IndexReader` APIs (`TermsEnum`, `StoredFieldsReader`?).
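[Editor's sketch] One way to picture the kind of API change the issue above discusses is a prefetch-style hint: a search thread announces several upcoming reads before blocking on any of them, so an implementation can overlap the I/Os. All names here (`PrefetchingInput`, `prefetch`, `InMemoryInput`) are hypothetical illustrations, not Lucene's actual `Directory`/`IndexInput` API:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical read API where callers can hint at upcoming reads so that an
// implementation may issue them concurrently under the hood.
interface PrefetchingInput {
  // Hint: bytes in [offset, offset + length) will be read soon; an async
  // implementation could start the I/O now. Default: no-op.
  default void prefetch(long offset, long length) {}

  byte readByte(long offset);
}

// Trivial in-memory implementation that records the hints it receives,
// standing in for a real file-backed input.
class InMemoryInput implements PrefetchingInput {
  private final byte[] data;
  final List<long[]> prefetched = new ArrayList<>(); // recorded {offset, length} hints

  InMemoryInput(byte[] data) {
    this.data = data;
  }

  @Override
  public void prefetch(long offset, long length) {
    prefetched.add(new long[] {offset, length});
  }

  @Override
  public byte readByte(long offset) {
    return data[(int) offset];
  }
}

class PrefetchDemo {
  // Mirrors the two-term example from the issue: resolve both start offsets,
  // hint both, then read. The hints let an asynchronous implementation overlap
  // the two I/Os instead of serializing them behind a single search thread.
  static int readTwoTerms(InMemoryInput in, long off1, long off2) {
    in.prefetch(off1, 1);
    in.prefetch(off2, 1);
    return in.readByte(off1) + in.readByte(off2);
  }
}
```

The key property is that `prefetch` is only a hint, so query evaluation stays synchronous while the I/O layer gains the freedom to batch.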
Re: [PR] gh-13147: use dense bit-encoding for frequent terms [lucene]
jpountz commented on PR #13153: URL: https://github.com/apache/lucene/pull/13153#issuecomment-1994160216

Have you seen interesting performance numbers with this change?
Re: [PR] Add new parallel merge task executor for parallel actions within a single merge action [lucene]
benwtrent commented on code in PR #13124: URL: https://github.com/apache/lucene/pull/13124#discussion_r1523142360

lucene/core/src/java/org/apache/lucene/codecs/lucene99/Lucene99HnswVectorsFormat.java:

```diff
@@ -152,7 +154,25 @@ public Lucene99HnswVectorsFormat() {
    * @param beamWidth the size of the queue maintained during graph construction.
    */
   public Lucene99HnswVectorsFormat(int maxConn, int beamWidth) {
-    this(maxConn, beamWidth, DEFAULT_NUM_MERGE_WORKER, null);
+    super("Lucene99HnswVectorsFormat");
+    if (maxConn <= 0 || maxConn > MAXIMUM_MAX_CONN) {
+      throw new IllegalArgumentException(
+          "maxConn must be positive and less than or equal to "
+              + MAXIMUM_MAX_CONN
+              + "; maxConn="
+              + maxConn);
+    }
+    if (beamWidth <= 0 || beamWidth > MAXIMUM_BEAM_WIDTH) {
+      throw new IllegalArgumentException(
+          "beamWidth must be positive and less than or equal to "
+              + MAXIMUM_BEAM_WIDTH
+              + "; beamWidth="
+              + beamWidth);
+    }
+    this.maxConn = maxConn;
+    this.beamWidth = beamWidth;
+    this.mergeExec = null;
+    this.numMergeWorkers = 1;
```

Review Comment:

> Oh I see, you haven't yet put in the change for auto figure out number of workers

I am wanting to do that separately. It will require its own benchmarking, etc., and this change is good even without choosing an automatic split for the hnsw graph merge.
Re: [PR] Introduce IORunnable to fix failure in TestIndexWriterOnDiskFull.testAddIndexOnDiskFull [lucene]
easyice commented on PR #13172: URL: https://github.com/apache/lucene/pull/13172#issuecomment-1994345997

Thanks @dweiss, I found a seed `837FF885325AC743` that can reproduce this failure on the main branch, but it's not stable: it fails roughly once in 10 runs. With this patch, I ran it 100 times and every run succeeded.

`-Dtests.seed=837FF885325AC743 -Dtests.nightly=true`
Re: [PR] Add new parallel merge task executor for parallel actions within a single merge action [lucene]
benwtrent commented on PR #13124: URL: https://github.com/apache/lucene/pull/13124#issuecomment-1994372130

@zhaih @jpountz I am going to create a separate issue around making HNSW worker slicing automatic. It will require a bunch of its own benchmarking and work, and honestly seems orthogonal to this particular change, as HNSW merging still works fine when statically setting the number of workers & using the CMS thread pool.
Re: [PR] Add new parallel merge task executor for parallel actions within a single merge action [lucene]
jpountz commented on PR #13124: URL: https://github.com/apache/lucene/pull/13124#issuecomment-1994373820

Thanks for sharing performance numbers @benwtrent, very interesting. Also double-checking that you saw my above comment: https://github.com/apache/lucene/pull/13124#pullrequestreview-1930418114. I'd have more confidence about the way merge concurrency is exposed if we took advantage of it in more than one place. If there's some difficulty with it, I'm happy with looking into it in a follow-up.
Re: [PR] Add new parallel merge task executor for parallel actions within a single merge action [lucene]
benwtrent commented on PR #13124: URL: https://github.com/apache/lucene/pull/13124#issuecomment-1994399018

@jpountz would you prefer something like the original patch from @dweiss? I can submit the merging actions independently to the intra-merge executor. Anything more (like figuring out how to make postings merge truly parallel) will be quite a lift for me.
Re: [PR] Add new parallel merge task executor for parallel actions within a single merge action [lucene]
jpountz commented on PR #13124: URL: https://github.com/apache/lucene/pull/13124#issuecomment-1994458100

Yes exactly, something very simple, mostly to exercise intra-merge concurrency with more than just vectors.
[PR] Add new VectorScorer interface to vector value iterators [lucene]
benwtrent opened a new pull request, #13181: URL: https://github.com/apache/lucene/pull/13181

With quantized vectors, and with current vectors, we separate out "scoring" vs. "iteration", requiring the user to always iterate the raw vectors and provide their own similarity function. While this is flexible, it creates friction when:

- Just iterating and scoring, especially since the field already has a similarity function stored... Why can't we just know which one to use and use it?
- Iterating and scoring quantized vectors. By default, it would be good to be able to iterate and score quantized vectors (e.g. without going through the HNSW graph).

I see two options for providing this:

- A new top-level method on the LeafReader (getVectorScorer or something).
- Extend the vector value iterators to be able to return a scorer given some vector value (what this PR demonstrates).

This is a POC; not all interfaces are supplied. I am opening what I have for discussion. Folks who might be interested: @jpountz @msokolov @mccullocht
Re: [I] Regarding the frequency used for scoring sloppy phrase queries. [lucene]
jpountz commented on issue #13152: URL: https://github.com/apache/lucene/issues/13152#issuecomment-1994488165

I agree that the penalty feels too high. We have challenges with queries like this one because there is no good theoretical basis for the right way to score such queries (at least to my knowledge), which forces us to make guesses. The one that is implemented at the moment is certainly not great, but at least it's parameter-free. I would imagine that something like `k / (k + matchLength)` would work better for you, but then it requires us to expose a configuration option, which isn't great either.
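[Editor's sketch] To make the trade-off concrete: Lucene's sloppy-phrase frequency contribution has roughly the shape `1 / (1 + matchLength)`, while the `k / (k + matchLength)` form suggested in the comment above saturates more gently for a tunable `k`. This is an illustration of the two curves, not Lucene's actual scoring code:

```java
class SloppyFreq {
  // Roughly the shape of the current parameter-free penalty: the frequency
  // contribution decays as 1/(1 + matchLength).
  static double current(int matchLength) {
    return 1.0 / (1.0 + matchLength);
  }

  // The saturation-style alternative suggested in the discussion; k controls
  // how quickly spread-out matches lose weight. Larger k = gentler penalty.
  static double saturated(double k, int matchLength) {
    return k / (k + matchLength);
  }
}
```

For example, with `k = 5` a match of length 5 keeps half its weight, whereas the `1 / (1 + matchLength)` shape already cuts it to about one sixth — which is exactly why exposing `k` trades the "too high" penalty for a configuration knob.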
Re: [PR] Use group-varint encode the positions [lucene]
jpountz commented on PR #12842: URL: https://github.com/apache/lucene/pull/12842#issuecomment-1994545075

It looks like `writeGroupVInt` has room for improvement. Can we improve it by making it look a bit more like the read logic?
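[Editor's sketch] For readers unfamiliar with the encoding under discussion: group-varint packs four integers behind a single tag byte whose 2-bit fields give each value's byte length, so a decoder can branch once per group instead of once per byte. This is a minimal self-contained illustration of the scheme, not Lucene's actual `writeGroupVInt`/read implementation:

```java
import java.io.ByteArrayOutputStream;

class GroupVarint {
  // Encode exactly four non-negative ints: one tag byte holding 2 bits per
  // value (byteLength - 1), then each value little-endian in that many bytes.
  static byte[] encode(int[] values) {
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    int tag = 0;
    int[] lens = new int[4];
    for (int i = 0; i < 4; i++) {
      lens[i] = numBytes(values[i]);
      tag |= (lens[i] - 1) << (i * 2);
    }
    out.write(tag);
    for (int i = 0; i < 4; i++) {
      for (int b = 0; b < lens[i]; b++) {
        out.write((values[i] >>> (8 * b)) & 0xFF);
      }
    }
    return out.toByteArray();
  }

  // Decode: read the tag once, then each value's length is known up front,
  // which is what makes the read side fast and branch-predictable.
  static int[] decode(byte[] bytes) {
    int tag = bytes[0] & 0xFF;
    int[] values = new int[4];
    int pos = 1;
    for (int i = 0; i < 4; i++) {
      int len = ((tag >>> (i * 2)) & 0x3) + 1;
      int v = 0;
      for (int b = 0; b < len; b++) {
        v |= (bytes[pos++] & 0xFF) << (8 * b);
      }
      values[i] = v;
    }
    return values;
  }

  private static int numBytes(int v) {
    if ((v >>> 8) == 0) return 1;
    if ((v >>> 16) == 0) return 2;
    if ((v >>> 24) == 0) return 3;
    return 4;
  }
}
```

The comment's suggestion amounts to giving the write side the same structure: compute all four lengths and the tag first, then emit bytes in one pass.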
Re: [PR] Made DocIdsWriter use DISI when reading documents with an IntersectVisitor [lucene]
jpountz commented on PR #13149: URL: https://github.com/apache/lucene/pull/13149#issuecomment-1994666062

Would we be subject to the same issue if/when 3+ different implementations of `DocIdSetIterator` get used in `IntersectVisitor#visit`?
[I] Making vector comparisons pluggable [lucene]
benwtrent opened a new issue, #13182: URL: https://github.com/apache/lucene/issues/13182

### Description

Opening an issue to continue discussion originating here: https://github.com/apache/lucene/pull/13076#issuecomment-1930363479

Making vector similarities pluggable via SPI will enable users to provide their own specialized similarities without the additional burden of Lucene core having to provide BWC for all the various similarity functions (e.g. hamming, jaccard, cosine).

It is probably best that the plug-and-play aspect is placed on `FieldInfo`, though this would be a bit of work as `FieldInfo` isn't currently pluggable. Attaching it directly to a particular vector format would place undue burden on users, requiring a new format for any field that desires a separate similarity.

While I am not 100% sure how to add it to `FieldInfo`, I do want to try and figure out the API for such a change. When used within a particular vector format, the following scenarios would be useful:

- Indexing: comparing on-heap vectors accessible via ordinals
- Merging: comparing off-heap vectors, possibly reading them directly on-heap via ordinals
- During search: an on-heap user-provided vector being compared with off-heap vectors via ordinals (potentially reading them on-heap)

Some of the optimizations discussed here: https://github.com/apache/lucene/pull/12703 show some significant gains in being able to simply have a memory segment and an ordinal offset. While we are not there yet in Lucene, it indicates that we shouldn't force the API into reading off-heap vectors into `float[]` or `byte[]` arrays.

I was thinking of something *like* this. 100%, this is not finalized, just wanting to start the discussion.
```java
public abstract class VectorSimilarityInterface implements NamedSPILoader.NamedSPI {

  private static final class Holder {
    private static final NamedSPILoader<VectorSimilarityInterface> LOADER =
        new NamedSPILoader<>(VectorSimilarityInterface.class);

    private Holder() {}

    static NamedSPILoader<VectorSimilarityInterface> getLoader() {
      if (LOADER == null) {
        throw new IllegalStateException(
            "You tried to lookup a VectorSimilarityInterface name before all formats could be initialized. "
                + "This likely happens if you call VectorSimilarityInterface#forName from a VectorSimilarityInterface's ctor.");
      }
      return LOADER;
    }
  }

  public static VectorSimilarityInterface forName(String name) {
    return VectorSimilarityInterface.Holder.getLoader().lookup(name);
  }

  private final String name;

  protected VectorSimilarityInterface(String name) {
    NamedSPILoader.checkServiceName(name);
    this.name = name;
  }

  @Override
  public String getName() {
    return name;
  }

  // Comparing an "on heap" query with vectorValues that may or may not be on-heap.
  // Maybe we don't need this and the `byte[]` version, as we could hide the "on-heap query"
  // in an "IdentityRandomAccessVectorValues" which only returns the query vector...
  public abstract VectorScorer getVectorScorer(RandomAccessVectorValues vectorValues, float[] target)
      throws Exception;

  public abstract VectorComparator getFloatVectorComparator(RandomAccessVectorValues vectorValues)
      throws Exception;

  public abstract VectorScorer getVectorScorer(RandomAccessVectorValues vectorValues, byte[] target)
      throws Exception;

  public abstract VectorComparator getByteVectorComparator(RandomAccessVectorValues vectorValues)
      throws Exception;

  static interface VectorScorer extends Closeable {
    float score(int targetOrd);
  }

  static interface VectorComparator {
    float compare(int vectorOrd1, int vectorOrd2);
  }
}
```

It looks like the SPI injection could occur in `FieldInfosFormat#read` & `FieldInfosFormat#write` (though a new one would have to be built, `Lucene911FieldInfosFormat` or something).
This would also include a new codec, as the field format will change. I am not 100% sold on how this API looks myself. I don't think `RandomAccessVectorValues` is 100% the correct API, as it either exposes too much (e.g. `ordToDoc`) or too little (for off-heap, we don't get access to the MemorySegment nor the files...).
Re: [PR] Add new parallel merge task executor for parallel actions within a single merge action [lucene]
benwtrent commented on PR #13124: URL: https://github.com/apache/lucene/pull/13124#issuecomment-1994921965

> Yes exactly, something very simple, mostly to exercise intra-merge concurrency with more than just vectors.

The latest commit adds `TaskExecutor` actions to merge, to allow different merging field types to be merged in parallel. I needed to use `TaskExecutor` as it abstracts out the exception handling (all these methods throw `IOException`). I still need to benchmark it with Lucene-util.
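[Editor's sketch] The shape of that change can be illustrated generically: each field type's merge becomes an `IOException`-throwing action submitted to the intra-merge executor, and the checked exception is tunneled through the `Future` and rethrown on the calling thread. This is not Lucene's `TaskExecutor`; `MergeTasks`/`IOAction`/`invokeAll` are hypothetical names showing why such an abstraction is convenient:

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Future;

class MergeTasks {
  // A merge action that is allowed to throw IOException, unlike Runnable.
  interface IOAction {
    void run() throws IOException;
  }

  // Run the actions in parallel and rethrow the first IOException on the
  // calling thread, so callers keep ordinary checked-exception handling.
  static void invokeAll(ExecutorService executor, List<IOAction> actions)
      throws IOException, InterruptedException {
    List<Future<?>> futures = actions.stream()
        .map(a -> executor.submit(() -> {
          try {
            a.run();
          } catch (IOException e) {
            throw new UncheckedIOException(e); // tunnel the checked exception
          }
        }))
        .toList();
    for (Future<?> f : futures) {
      try {
        f.get(); // wait for each action; surfaces failures
      } catch (ExecutionException e) {
        if (e.getCause() instanceof UncheckedIOException uio) {
          throw uio.getCause(); // unwrap back to the original IOException
        }
        throw new RuntimeException(e.getCause());
      }
    }
  }
}
```

Submitting, say, the norms merge and the doc-values merge as two `IOAction`s then parallelizes across field types while the caller still sees a plain `throws IOException` signature.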
Re: [PR] Add new parallel merge task executor for parallel actions within a single merge action [lucene]
zhaih commented on code in PR #13124: URL: https://github.com/apache/lucene/pull/13124#discussion_r1523620861

lucene/core/src/java/org/apache/lucene/codecs/lucene99/Lucene99HnswVectorsFormat.java:

```diff
@@ -152,7 +154,25 @@ public Lucene99HnswVectorsFormat() {
    * @param beamWidth the size of the queue maintained during graph construction.
    */
   public Lucene99HnswVectorsFormat(int maxConn, int beamWidth) {
-    this(maxConn, beamWidth, DEFAULT_NUM_MERGE_WORKER, null);
+    super("Lucene99HnswVectorsFormat");
+    if (maxConn <= 0 || maxConn > MAXIMUM_MAX_CONN) {
+      throw new IllegalArgumentException(
+          "maxConn must be positive and less than or equal to "
+              + MAXIMUM_MAX_CONN
+              + "; maxConn="
+              + maxConn);
+    }
+    if (beamWidth <= 0 || beamWidth > MAXIMUM_BEAM_WIDTH) {
+      throw new IllegalArgumentException(
+          "beamWidth must be positive and less than or equal to "
+              + MAXIMUM_BEAM_WIDTH
+              + "; beamWidth="
+              + beamWidth);
+    }
+    this.maxConn = maxConn;
+    this.beamWidth = beamWidth;
+    this.mergeExec = null;
+    this.numMergeWorkers = 1;
```

Review Comment:

Yeah I agree, so for now we'll just fix it at 8?
Re: [PR] Made DocIdsWriter use DISI when reading documents with an IntersectVisitor [lucene]
antonha commented on PR #13149: URL: https://github.com/apache/lucene/pull/13149#issuecomment-1995011184

> Would we be subject to the same issue if/when 3+ different implementations of `DocIdSetIterator` get used in `IntersectVisitor#visit`?

Yes. Your question makes me think that maybe we should not add a new `DocIdSetIterator` for this PR - maybe that was your thought as well?
Re: [PR] Made DocIdsWriter use DISI when reading documents with an IntersectVisitor [lucene]
jpountz commented on PR #13149: URL: https://github.com/apache/lucene/pull/13149#issuecomment-1995156682

Relatedly indeed, I was wondering if the API should expose an `IntsRef` or something like that, so that there is a single virtual call per block of doc IDs anyway (`IntsRef` cannot be extended; it's a single impl).
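[Editor's sketch] A sketch of what that suggestion could look like: the visitor receives a whole block of doc IDs in one call, so the JIT sees a single virtual dispatch per block regardless of how many iterator implementations exist elsewhere. The `visit(IntsRef)` signature is hypothetical, not the current `IntersectVisitor` API, and `IntsRefSketch` is a minimal stand-in for Lucene's final `IntsRef` class to keep the block self-contained:

```java
// Minimal stand-in for Lucene's final IntsRef: a slice of an int[] array.
final class IntsRefSketch {
  final int[] ints;
  final int offset;
  final int length;

  IntsRefSketch(int[] ints, int offset, int length) {
    this.ints = ints;
    this.offset = offset;
    this.length = length;
  }
}

// Hypothetical bulk variant of IntersectVisitor#visit: one virtual call per
// block of doc IDs rather than one per doc ID.
interface BulkVisitor {
  void visit(IntsRefSketch docIDs);
}

class CountingVisitor implements BulkVisitor {
  int count;

  @Override
  public void visit(IntsRefSketch docIDs) {
    // The whole block is handled inside one call, so the per-doc loop body
    // stays monomorphic even if other BulkVisitor implementations exist.
    count += docIDs.length;
  }
}
```

Because `IntsRef` is a single concrete class, the per-doc accesses inside the loop cannot become megamorphic — only the once-per-block `visit` call is virtual.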
Re: [PR] Add new parallel merge task executor for parallel actions within a single merge action [lucene]
benwtrent commented on PR #13124: URL: https://github.com/apache/lucene/pull/13124#issuecomment-1995183835

@jpountz ok, that naive attempt failed, as norms & terms apparently need to be merged in order (one or the other would fail due to missing files...). I am not sure if this is true of other things. But when I adjusted for this (latest commit), I did see a speed improvement. My test was over IndexGeoNames; I did no segment merging and flushed every 12MB to ensure I got many segments. I also set CMS merge threads to 8.

Baseline forcemerge(1): 17793 ms
Candidate forcemerge(1): 13739 ms
Re: [PR] Change BP reordering logic to help support document blocks later on. [lucene]
rishabhmaurya commented on code in PR #13123: URL: https://github.com/apache/lucene/pull/13123#discussion_r1523704731

lucene/misc/src/java/org/apache/lucene/misc/index/BPIndexReorderer.java:

```diff
@@ -341,116 +344,94 @@ protected void compute() {
    */
   private boolean shuffle(
       ForwardIndex forwardIndex,
-      IntsRef left,
-      IntsRef right,
+      IntsRef docIDs,
+      int midPoint,
       int[] leftDocFreqs,
       int[] rightDocFreqs,
-      float[] gains,
+      float[] biases,
       int iter)
       throws IOException {
-    assert left.ints == right.ints;
-    assert left.offset + left.length == right.offset;
-
-    // Computing gains is typically a bottleneck, because each iteration needs to iterate over all
-    // postings to recompute gains, and the total number of postings is usually one order of
+    // Computing biases is typically a bottleneck, because each iteration needs to iterate over
+    // all postings to recompute biases, and the total number of postings is usually one order of
     // magnitude or more larger than the number of docs. So we try to parallelize it.
-    ComputeGainsTask leftGainsTask =
-        new ComputeGainsTask(
-            left.ints,
-            gains,
-            left.offset,
-            left.offset + left.length,
+    new ComputeBiasTask(
+            docIDs.ints,
+            biases,
+            docIDs.offset,
+            docIDs.offset + docIDs.length,
             leftDocFreqs,
             rightDocFreqs,
             threadLocal,
-            depth);
-    ComputeGainsTask rightGainsTask =
-        new ComputeGainsTask(
-            right.ints,
-            gains,
-            right.offset,
-            right.offset + right.length,
-            rightDocFreqs,
-            leftDocFreqs,
-            threadLocal,
-            depth);
-    if (shouldFork(docIDs.length, docIDs.ints.length)) {
-      invokeAll(leftGainsTask, rightGainsTask);
-    } else {
-      leftGainsTask.compute();
-      rightGainsTask.compute();
+            depth)
+        .compute();
+
+    float maxLeftBias = Float.NEGATIVE_INFINITY;
+    for (int i = docIDs.offset; i < midPoint; ++i) {
+      maxLeftBias = Math.max(maxLeftBias, biases[i]);
+    }
+    float minRightBias = Float.POSITIVE_INFINITY;
+    for (int i = midPoint, end = docIDs.offset + docIDs.length; i < end; ++i) {
+      minRightBias = Math.min(minRightBias, biases[i]);
+    }
+    float gain = maxLeftBias - minRightBias;
+    // This uses the simulated annealing proposed by Mackenzie et al in "Tradeoff Options for
+    // Bipartite Graph Partitioning" by comparing the gain of swapping the doc from the left side
+    // that is most attracted to the right and the doc from the right side that is most attracted
+    // to the left against `iter` rather than zero.
+    if (gain <= iter) {
+      return false;
     }

-    class ByDescendingGainSorter extends IntroSorter {
+    new IntroSelector() {
       int pivotDoc;
-      float pivotGain;
+      float pivotBias;

       @Override
       protected void setPivot(int i) {
-        pivotDoc = left.ints[i];
-        pivotGain = gains[i];
+        pivotDoc = docIDs.ints[i];
+        pivotBias = biases[i];
       }

       @Override
       protected int comparePivot(int j) {
-        // Compare in reverse order to get a descending sort
-        int cmp = Float.compare(gains[j], pivotGain);
+        int cmp = Float.compare(pivotBias, biases[j]);
```

Review Comment:

@jpountz let me know your thoughts on it, happy to contribute and improve it further. Thanks
Re: [PR] Made DocIdsWriter use DISI when reading documents with an IntersectVisitor [lucene]
antonha commented on PR #13149: URL: https://github.com/apache/lucene/pull/13149#issuecomment-1996041622

@jpountz I had a quick look at the code, and it seems to me that with this PR there are only two implementations of the DISI used in the `IntersectVisitor#visit` method - which is good in the sense that it would probably be fine to merge it now, but bad in the sense that adding a third would hurt performance. Do we believe that we can merge this PR and then continue with changing the BKD visit API in a later change, or should we try to change the abstraction in this PR?
Re: [PR] Remove halt() call in TestSimpleServer (part of TestStressNRTReplication [lucene]
rmuir commented on PR #13177: URL: https://github.com/apache/lucene/pull/13177#issuecomment-1996150774

Thanks for doing this. I was disappointed that java doesn't allow this:

```
jshell> ProcessHandle.current().destroyForcibly()
|  Exception java.lang.IllegalStateException: destroy of current process not allowed
|        at ProcessHandleImpl.destroyProcess (ProcessHandleImpl.java:369)
|        at ProcessHandleImpl.destroyForcibly (ProcessHandleImpl.java:392)
|        at (#2:1)
```
Re: [PR] Replace Collections.synchronizedSet() with ConcurrentHashMap.newKeySet() [lucene]
github-actions[bot] commented on PR #13142: URL: https://github.com/apache/lucene/pull/13142#issuecomment-1996171442

This PR has not had activity in the past 2 weeks, labeling it as stale. If the PR is waiting for review, notify the d...@lucene.apache.org list. Thank you for your contribution!
Re: [PR] Make Lucene90 postings format to write FST off heap [lucene]
github-actions[bot] commented on PR #12985: URL: https://github.com/apache/lucene/pull/12985#issuecomment-1996171619

This PR has not had activity in the past 2 weeks, labeling it as stale. If the PR is waiting for review, notify the d...@lucene.apache.org list. Thank you for your contribution!