Re: [I] RegExp::toAutomaton no longer minimizes [lucene]
ChrisHegarty commented on issue #13706: URL: https://github.com/apache/lucene/issues/13706#issuecomment-2325951431

💙
Re: [I] TestBoolean2.testRandomQueries fails in CI due to eating up heap space [lucene]
javanna commented on issue #11754: URL: https://github.com/apache/lucene/issues/11754#issuecomment-2326111901

Still relevant. I think the driving factor is the number of segments and how they get searched concurrently (i.e. the number of slices): each slice gets its own collector with its own queue, and that seems to blow up heap space under certain conditions. I am observing the same problem in #13542.
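For readers unfamiliar with the mechanics behind this: IndexSearcher asks the CollectorManager for one collector per slice, so any per-collector buffer (such as a top-k priority queue) is multiplied by the number of slices. Below is a minimal, illustrative sketch of that per-slice creation pattern, using TotalHitCountCollector only to keep the example self-contained; it is not the collector used by TestBoolean2.

```java
import java.io.IOException;
import java.util.Collection;
import org.apache.lucene.search.CollectorManager;
import org.apache.lucene.search.TotalHitCountCollector;

// Illustrative only: counts how many collectors (one per slice) the searcher asks for.
class CountingManager implements CollectorManager<TotalHitCountCollector, Integer> {
  int collectorsCreated;

  @Override
  public TotalHitCountCollector newCollector() {
    // Called once per slice; a top-k collector would allocate its priority queue here,
    // which is why heap usage scales with the number of slices.
    collectorsCreated++;
    return new TotalHitCountCollector();
  }

  @Override
  public Integer reduce(Collection<TotalHitCountCollector> collectors) throws IOException {
    int total = 0;
    for (TotalHitCountCollector c : collectors) {
      total += c.getTotalHits();
    }
    return total;
  }
}
```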
Re: [PR] Add support for intra-segment search concurrency [lucene]
javanna commented on code in PR #13542: URL: https://github.com/apache/lucene/pull/13542#discussion_r1741835108

## lucene/core/src/java/org/apache/lucene/search/TotalHitCountCollectorManager.java: ##
@@ -28,17 +31,77 @@
  */
 public class TotalHitCountCollectorManager implements CollectorManager {
+
+  /**
+   * Internal state shared across the different collectors that this collector manager creates. It
+   * tracks leaves seen as an argument of {@link Collector#getLeafCollector(LeafReaderContext)}
+   * calls, to ensure correctness: if the first partition of a segment early terminates, count has
+   * been already retrieved for the entire segment hence subsequent partitions of the same segment
+   * should also early terminate. If the first partition of a segment computes hit counts,
+   * subsequent partitions of the same segment should do the same, to prevent their counts from
+   * being retrieve from {@link LRUQueryCache} (which returns counts for the entire segment)
+   */
+  private final Map seenContexts = new HashMap<>();

Review Comment:
Scratch that, @original-brownbear suggested a different approach (now included in the PR) that does not require a `synchronized` block. With this, it feels like we may get away without a user option to disable the overhead when it's not needed. The flag becomes odd because the searcher knows whether there are partitions or not and could act accordingly, but the collector manager currently has no way to know which searcher uses it, or to get info from it.
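For illustration only (this is not necessarily the approach that landed in the PR): one common way to share per-segment decisions across the collectors of a manager without a `synchronized` block is a `ConcurrentHashMap` keyed by the leaf context, relying on `putIfAbsent` so the first partition of a segment wins and later partitions observe its decision. The class and method names below are hypothetical.

```java
import java.util.concurrent.ConcurrentHashMap;
import org.apache.lucene.index.LeafReaderContext;

// Hypothetical helper, not the PR's implementation: the first partition of a segment records
// whether it early-terminated; subsequent partitions of the same segment follow that decision.
class SegmentDecisions {
  private final ConcurrentHashMap<LeafReaderContext, Boolean> earlyTerminated =
      new ConcurrentHashMap<>();

  boolean decideOrFollow(LeafReaderContext ctx, boolean decisionIfFirst) {
    // putIfAbsent is atomic: only the first partition stores its decision.
    Boolean previous = earlyTerminated.putIfAbsent(ctx, decisionIfFirst);
    return previous != null ? previous : decisionIfFirst;
  }
}
```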
Re: [PR] Add support for intra-segment search concurrency [lucene]
javanna commented on PR #13542: URL: https://github.com/apache/lucene/pull/13542#issuecomment-2326200273

I have lowered the number of partitions we may end up creating per segment in tests to `5`, as we could otherwise end up with thousands of them. That improves the GC issues with `TestBoolean2`, but one remains, and it has been observed before even without intra-segment concurrency support, see #11754. I tried different things to lower the number of slices generated, but I was not able to eliminate the problem entirely. That makes sense given the problem existed before, and I believe it is a function of the number of collectors performing the search (one per slice, hence a function of the number of slices). This PR will end up increasing the chance that `TestBoolean2#testRandom` fails when extreme slicing is tested.
[I] Operations#sameLanguage has true negatives? [lucene]
jpountz opened a new issue, #13709: URL: https://github.com/apache/lucene/issues/13709

### Description

The following test fails, even though both automata accept the same language and are determinized. Looking at the implementation, it looks like it may assume that both automata are not only determinized but also minimal?

```java
public void testSameLanguage() {
  // automaton that accepts [ab]* with 2 states
  Automaton a = new Automaton();
  int state0 = a.createState();
  int state1 = a.createState();
  a.setAccept(state0, true);
  a.setAccept(state1, true);
  a.addTransition(state0, state1, 'a');
  a.addTransition(state1, state0, 'b');

  // automaton that accepts [ab]* with a single state
  Automaton b = new Automaton();
  state0 = b.createState();
  b.setAccept(state0, true);
  b.addTransition(state0, state0, 'a');
  b.addTransition(state0, state0, 'b');

  assertTrue(Operations.sameLanguage(a, b));
}
```

Here is what these two automata look like: [image: automaton a] [image: automaton b]

### Version and environment details

_No response_
Re: [I] Operations#sameLanguage has true negatives? [lucene]
rmuir commented on issue #13709: URL: https://github.com/apache/lucene/issues/13709#issuecomment-2326646660

They don't look like they accept the same language to me. For example, automaton "b" accepts the string `aa` but automaton "a" does not.
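To make the counterexample concrete, here is a hedged fragment that could be appended to the test above. It assumes the `a` and `b` automata from the issue; both are deterministic, so `Operations.run` applies.

```java
// Automaton "a" alternates between its two states: after reading 'a' the only outgoing
// transition is 'b', so it rejects "aa". Automaton "b" loops on a single state and accepts
// any mix of 'a' and 'b', so the two automata do not accept the same language.
assertTrue(Operations.run(b, "aa"));
assertFalse(Operations.run(a, "aa"));
```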
Re: [I] Operations#sameLanguage has true negatives? [lucene]
rmuir commented on issue #13709: URL: https://github.com/apache/lucene/issues/13709#issuecomment-2326656149

No worries. Somewhat related to this: you may indeed find bugs if you have assertions disabled, due to the smelliness of sameLanguage. Please see https://github.com/apache/lucene/pull/13708
Re: [I] Operations#sameLanguage has true negatives? [lucene]
jpountz commented on issue #13709: URL: https://github.com/apache/lucene/issues/13709#issuecomment-2326652907

Lol, this is embarrassing. Sorry for the noise.
Re: [I] Operations#sameLanguage has true negatives? [lucene]
jpountz closed issue #13709: Operations#sameLanguage has true negatives? URL: https://github.com/apache/lucene/issues/13709
[I] Move anonymous Weight implementation in PointRangeQuery to named class [lucene]
jainankitk opened a new issue, #13710: URL: https://github.com/apache/lucene/issues/13710

### Description

The `Weight` implementation in `PointRangeQuery#createWeight` is an anonymous class, making it difficult to extend or reuse specific logic from the class. Not to mention that the class is more than ~400 lines of code, which makes a good case for a named class anyway, IMO.
[PR] Move anonymous Weight implementation in PointRangeQuery to named class [lucene]
jainankitk opened a new pull request, #13711: URL: https://github.com/apache/lucene/pull/13711

### Description

Moves the anonymous `Weight` implementation in `PointRangeQuery#createWeight` to a named class for better extensibility and reusability.
Re: [PR] Move anonymous Weight implementation in PointRangeQuery to named class [lucene]
jpountz commented on PR #13711: URL: https://github.com/apache/lucene/pull/13711#issuecomment-2326913351

I see from the linked issue that you would like to extend `PointRangeQuery`, but in general we don't like to think of our queries as being extensible. I wonder if you could do what you need through composition rather than extension?
Re: [PR] Speed up advancing within a block. [lucene]
gsmiller commented on PR #13692: URL: https://github.com/apache/lucene/pull/13692#issuecomment-2326975500

Interesting. I'm not quite clear on the difference between the analysis you posted [here](https://github.com/apache/lucene/pull/13692#issuecomment-2324897665) and what got merged? Clearly a tricky balance here, getting something that works well for both dense and sparse advancing cases.
Re: [PR] Move anonymous Weight implementation in PointRangeQuery to named class [lucene]
jainankitk commented on PR #13711: URL: https://github.com/apache/lucene/pull/13711#issuecomment-2327098073

> I see from the linked issue that you would like to extend `PointRangeQuery`, but in general we don't like to think of our queries as being extensible. I wonder if you could do what you need through composition rather than extension?

@jpountz - Thanks for your feedback. While I agree with using composition rather than extension for `PointRangeQuery` itself, the `Weight` implementation within `PointRangeQuery` would be more reusable (via composition instead of extension) if it were a named, non-anonymous class.
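For anyone following along, a hedged sketch of what the composition approach could look like; the class name and wiring are illustrative, not an agreed-upon design. The idea is to wrap the point query in your own `Query`, delegate `createWeight`, and customize behavior around the delegate's `Weight` rather than subclassing `PointRangeQuery`.

```java
import java.io.IOException;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.QueryVisitor;
import org.apache.lucene.search.ScoreMode;
import org.apache.lucene.search.Weight;

// Illustrative wrapper: reuses the delegate's Weight instead of extending PointRangeQuery.
class WrappedRangeQuery extends Query {
  private final Query delegate; // e.g. an IntPoint.newRangeQuery(...) instance

  WrappedRangeQuery(Query delegate) {
    this.delegate = delegate;
  }

  @Override
  public Weight createWeight(IndexSearcher searcher, ScoreMode scoreMode, float boost)
      throws IOException {
    // Custom logic can wrap the returned Weight here (e.g. counting shortcuts, telemetry)
    // without touching PointRangeQuery's internals.
    return delegate.createWeight(searcher, scoreMode, boost);
  }

  @Override
  public void visit(QueryVisitor visitor) {
    delegate.visit(visitor);
  }

  @Override
  public String toString(String field) {
    return "wrapped(" + delegate.toString(field) + ")";
  }

  @Override
  public boolean equals(Object other) {
    return sameClassAs(other) && delegate.equals(((WrappedRangeQuery) other).delegate);
  }

  @Override
  public int hashCode() {
    return classHash() ^ delegate.hashCode();
  }
}
```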
Re: [PR] Add a Better Binary Quantizer (RaBitQ) format for dense vectors [lucene]
benwtrent commented on code in PR #13651: URL: https://github.com/apache/lucene/pull/13651#discussion_r1742505038

## lucene/core/src/java/org/apache/lucene/codecs/lucene912/BinarizedByteVectorValues.java: ##
@@ -0,0 +1,78 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.lucene.codecs.lucene912;
+
+import java.io.IOException;
+import org.apache.lucene.index.ByteVectorValues;
+import org.apache.lucene.search.DocIdSetIterator;
+import org.apache.lucene.search.VectorScorer;
+
+/**
+ * A version of {@link ByteVectorValues}, but additionally retrieving score correction values offset
+ * for binarization quantization scores.
+ *
+ * @lucene.experimental
+ */
+public abstract class BinarizedByteVectorValues extends DocIdSetIterator {
+  public abstract float getDistanceToCentroid() throws IOException;
+
+  /** Returns the cluster ID for the vector in the range [0, 255] */
+  public abstract short clusterId() throws IOException;
+
+  public static byte encodeClusterIdToByte(short clusterId) {
+    byte bClusterId = clusterId <= 127 ? (byte) clusterId : (byte) (clusterId - 256);
+    return bClusterId;
+  }
+
+  public static short decodeClusterIdFromByte(byte bClusterId) {
+    short clusterId = bClusterId >= 0 ? (short) bClusterId : (short) (bClusterId + 256);
+    return clusterId;
+  }
+
+  public abstract float getMagnitude() throws IOException;
+
+  public abstract float getOOQ() throws IOException;
+
+  public abstract float getNormOC() throws IOException;
+
+  public abstract float getODotC() throws IOException;

Review Comment:
I would really like it if we just did `float[]` and the callers (which are scorers) knew how to use it. The downside now is that if something calls `getODotC()` but it wasn't available, this would blow up, right?

@john-wagster
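For concreteness, a hedged sketch of the shape being suggested; the class name, method name and layout below are illustrative, not the PR's API. The point is a single array-valued accessor whose layout is owned by the scorer, so a factor that does not apply to the current similarity can never be requested by accident.

```java
import java.io.IOException;
import org.apache.lucene.search.DocIdSetIterator;

// Illustrative only: one array-valued accessor instead of one getter per corrective factor.
abstract class CorrectedVectorValues extends DocIdSetIterator {
  /**
   * Corrective factors for the current vector. The layout is decided by the scorer, e.g.
   * {distanceToCentroid, magnitude} for Euclidean or {ooq, normOC, oDotC} for MIP.
   */
  abstract float[] getCorrectiveTerms() throws IOException;
}
```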
Re: [PR] move Operations.sameLanguage/subsetOf to AutomatonTestUtil in test-framework [lucene]
dweiss commented on code in PR #13708: URL: https://github.com/apache/lucene/pull/13708#discussion_r1742548065

## lucene/core/src/java/org/apache/lucene/util/automaton/StatePair.java: ##
@@ -35,9 +35,14 @@
  * @lucene.experimental
  */
 public class StatePair {
+  // only mike knows what it does (do not expose)

Review Comment:
lol.
Re: [PR] Add Bulk Scorer For ToParentBlockJoinQuery [lucene]
jpountz commented on code in PR #13697: URL: https://github.com/apache/lucene/pull/13697#discussion_r1742643399

## lucene/join/src/java/org/apache/lucene/search/join/ToParentBlockJoinQuery.java: ##
@@ -440,6 +477,99 @@ private String formatScoreExplanation(int matches, int start, int end, ScoreMode
     }
   }
 
+  private abstract static class BatchAwareLeafCollector extends FilterLeafCollector {
+    public BatchAwareLeafCollector(LeafCollector in) {
+      super(in);
+    }
+
+    public void endBatch() throws IOException {}
+  }
+
+  private static class BlockJoinBulkScorer extends BulkScorer {
+    private final BulkScorer childBulkScorer;
+    private final ScoreMode scoreMode;
+    private final BitSet parents;
+    private final int parentsLength;
+
+    public BlockJoinBulkScorer(BulkScorer childBulkScorer, ScoreMode scoreMode, BitSet parents) {
+      this.childBulkScorer = childBulkScorer;
+      this.scoreMode = scoreMode;
+      this.parents = parents;
+      this.parentsLength = parents.length();
+    }
+
+    @Override
+    public int score(LeafCollector collector, Bits acceptDocs, int min, int max)
+        throws IOException {
+      // Subtract one because max is exclusive w.r.t. score but inclusive w.r.t prevSetBit
+      int lastParent = parents.prevSetBit(Math.min(parentsLength, max) - 1);
+      int prevParent = min == 0 ? -1 : parents.prevSetBit(min - 1);
+      if (lastParent == prevParent) {
+        // No parent docs in this range
+        return max;
+      }
+
+      BatchAwareLeafCollector wrappedCollector = wrapCollector(collector);
+      childBulkScorer.score(wrappedCollector, acceptDocs, prevParent + 1, lastParent + 1);
+      wrappedCollector.endBatch();
+
+      // If we've scored the last parent in the bit set, return NO_MORE_DOCS to indicate we are done
+      // scoring
+      return lastParent + 1 >= parentsLength ? NO_MORE_DOCS : max;
+    }
+
+    @Override
+    public long cost() {
+      return childBulkScorer.cost();
+    }
+
+    // TODO: Need to resolve parent doc IDs in multi-reader space?
+    private BatchAwareLeafCollector wrapCollector(LeafCollector collector) {
+      return new BatchAwareLeafCollector(collector) {
+        private final Score currentParentScore = new Score(scoreMode);

Review Comment:
You need to extend it to override `setMinCompetitiveScore` and delegate to `scorer` when appropriate, if you want dynamic pruning to work.

## lucene/join/src/java/org/apache/lucene/search/join/ToParentBlockJoinQuery.java: ##
@@ -440,6 +477,99 @@ private String formatScoreExplanation(int matches, int start, int end, ScoreMode

Review Comment:
By the way, can you add a test that dynamic pruning works with your new bulk scorer? See e.g. `TestMaxScoreBulkScorer#testBasicsWithTwoDisjunctionClauses` for an example of a similar test.

## lucene/join/src/java/org/apache/lucene/search/join/ToParentBlockJoinQuery.java: ##
@@ -275,6 +286,53 @@ public
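To illustrate the dynamic-pruning point in a hedged way: the `Scorable` handed to the parent-level collector needs to forward `setMinCompetitiveScore` to the underlying child scorer, otherwise the child bulk scorer never learns that it can skip non-competitive documents. The sketch below is not the PR's implementation; whether forwarding is legal depends on the join ScoreMode, and it is written against the Scorable API on current main (older branches may also require implementing docID()).

```java
import java.io.IOException;
import org.apache.lucene.search.Scorable;

// Hypothetical wrapper: exposes a per-parent score while forwarding min-competitive updates
// to the child scorer so block-max style skipping still happens.
class ForwardingParentScore extends Scorable {
  private final Scorable childScorer;
  private float parentScore;

  ForwardingParentScore(Scorable childScorer) {
    this.childScorer = childScorer;
  }

  void updateParentScore(float score) {
    this.parentScore = score;
  }

  @Override
  public float score() {
    return parentScore;
  }

  @Override
  public void setMinCompetitiveScore(float minScore) throws IOException {
    // Delegating is only valid when the join score mode allows it (e.g. Max);
    // shown here as a plain pass-through.
    childScorer.setMinCompetitiveScore(minScore);
  }
}
```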
Re: [PR] Add a Better Binary Quantizer (RaBitQ) format for dense vectors [lucene]
john-wagster commented on code in PR #13651: URL: https://github.com/apache/lucene/pull/13651#discussion_r1742666476

## lucene/core/src/java/org/apache/lucene/codecs/lucene912/BinarizedByteVectorValues.java: ##
@@ -0,0 +1,78 @@

Review Comment:
I was naively following the existing pattern to get things working. This class had getMagnitude and getDistanceToCentroid in it prior to adding getOOQ, getNormOC, and getODotC. For MIP it only needs OOQ, NormOC, and ODotC and essentially ignores the others (they actually overlap with the other values, or are 0f). So I don't think it blows up as is, but it is storing an extra unnecessary float for Euclidean now. I could be missing something, though.

Two thoughts I had and hadn't quite gotten to yet. First, I think I agree with your point that this should deal with arbitrary float (or byte) values. Where we had last landed, I think, was some way to decode those values on both read and write: the Scorer? or Format? (via reflection, to avoid versioning problems?) would tell the Reader and Writer how to serialize and deserialize a set of bytes appropriately based on the similarity function, which would eliminate some overhead there (1 byte for Euclidean). I would buy that we could update these interfaces to deal with an array of floats in the meantime, depending on whether quantization of these corrective factors is a high priority; I was operating under the assumption that we do want to refactor to store a minimum of overhead for each vector (2 bytes/vector in the "Euclidean" case and 3 bytes in the "MIP" case).

Second thought: I think for all of these values we want to swap in 1-byte quantized forms, and without having started that work yet it seems pretty straightforward in my head to do so, which is probably why I'm seeing it as part of this first pass. I was thinking to do both of these things together and refactor this to return a getCorrectiveBytes instead of these individual corrective factors, putting the burden on the caller to decode them.

There is a separate place, specifically for the query side (not indexing), where instead of pulling through a set of corrective float[] factors we are pulling through a QueryFactors object now, which maybe is worth discussing further (but it is a different place in the code). I did this because we could optimize one of the floats to a short, and with the belief that we could / want to optimize the other values as well (as well as greatly improving readability). Although subsequent conversations with the team suggest that we don't care about quantizing these values (at least not as much), because the query quantization that gets serialized is only written to a temporary file. I digress.
Re: [I] TestBoolean2.testRandomQueries fails in CI due to eating up heap space [lucene]
rmuir commented on issue #11754: URL: https://github.com/apache/lucene/issues/11754#issuecomment-2327418150

@javanna if you can reproduce it again, can you try `System.out.println(mulFactor)` at the end of the BeforeClass method? The OOM in the stacktrace comes from creating a priority queue of `1000 * mulFactor`. Then if we look at `@BeforeClass` where the `mulFactor` gets set, it isn't even constant: it is set dynamically, with exponential (!) growth, like this:

```java
do {
  addIndexes();
  ...
  mulFactor *= 2;
} while (docCount < 3000 * NUM_FILLER_DOCS);
```

I think the fact that it grows the index and the PQ size (mulFactor) exponentially may be the reason for the trouble. Half of the time (NUM_FILLER_DOCS=4k) it creates a 12M-document index. In some rare situations (PRE_FILLER_DOCS small or zero) it looks like it may need to call addIndexes() 12 times to get there, so I think `mulFactor` can get as high as `4K`. And I think that leads to asking for millions of results at once (1000 * 4096) and causes the OOM.
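Some back-of-the-envelope numbers for that scenario, hedged: 4096 is the speculated worst-case mulFactor rather than a measured value, and the per-entry byte size is a rough estimate. TopDocs-style collectors typically pre-fill their HitQueue with sentinel ScoreDoc objects, so the allocation happens up front, per collector (i.e. per slice).

```java
public class HeapEstimate {
  public static void main(String[] args) {
    int mulFactor = 4096;                 // speculated worst case from the exponential loop
    long topN = 1000L * mulFactor;        // ~4.1M ScoreDoc slots in one priority queue
    long bytesPerScoreDoc = 32;           // object header + doc/score/shardIndex, rough guess
    long bytesPerCollector = topN * bytesPerScoreDoc;
    // Roughly 125 MB per collector before a single hit is collected; with one collector
    // per slice, multiply again by the number of slices.
    System.out.println(bytesPerCollector / (1024 * 1024) + " MB per collector (approx)");
  }
}
```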
Re: [PR] Move anonymous Weight implementation in PointRangeQuery to named class [lucene]
jpountz commented on PR #13711: URL: https://github.com/apache/lucene/pull/13711#issuecomment-2327425891

Sorry, I don't think we should make Lucene's `Weight` implementations public. I looked up the OpenSearch issue; if I understand correctly, the problem you're trying to solve is that it's wasteful for `PointRangeQuery` to evaluate the whole range when it's only asked for the first 10 doc IDs that match the range query. I agree it's wasteful. We have the same problem on nightly benchmarks and the IntNRQ task.

I wonder if there are better ways to do what you're after, e.g. adding a `TopDocs Weight#topk(int n, int totalHitsThreshold)` API that would default to collecting hits, and that some classes such as `PointRangeQuery` could override. As I'm writing this, I'm not convinced that it's actually a good idea. Using sparse indexing would likely be a better approach, especially if the index can be sorted, as this would produce good iterators that don't have this huge up-front cost of evaluating the query against the entire segment.
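Purely to make the floated idea concrete (and note that jpountz immediately argues against it): the hypothetical hook might look something like the following. Nothing like this exists in Lucene today; the class and method names are invented for illustration.

```java
import java.io.IOException;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.search.Weight;

// Hypothetical API sketch only: a hook that specialized queries (e.g. point ranges backed by
// sparse indexing) could override to answer top-k requests without visiting every match.
abstract class TopKWeight extends Weight {
  protected TopKWeight(Query query) {
    super(query);
  }

  /**
   * Returns the top {@code n} hits, allowing hit counting to stop at {@code totalHitsThreshold}.
   * A default implementation would fall back to normal collection.
   */
  abstract TopDocs topK(int n, int totalHitsThreshold) throws IOException;
}
```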
Re: [PR] Remove CollectorOwner class (#13671) [lucene]
gsmiller commented on code in PR #13702: URL: https://github.com/apache/lucene/pull/13702#discussion_r1742791755

## lucene/facet/src/java/org/apache/lucene/facet/DrillSideways.java: ##
@@ -480,59 +462,61 @@ private void searchSequentially(
     }
 
     Query[] drillDownQueries = query.getDrillDownQueries();
-    DrillSidewaysQuery dsq =
-        new DrillSidewaysQuery(
-            baseQuery,
-            // drillDownCollectorOwner,
-            // Don't pass drill down collector because drill down is collected by IndexSearcher
-            // itself.
-            // TODO: deprecate drillDown collection in DrillSidewaysQuery?
-            null,
-            drillSidewaysCollectorOwners,
-            drillDownQueries,
-            scoreSubDocsAtOnce());
-
-    searcher.search(dsq, drillDownCollectorOwner);
-    // This method doesn't return results as each dimension might have its own result type.
-    // But we call getResult to trigger results reducing, so that users don't have to worry about
-    // it.
-    drillDownCollectorOwner.getResult();
-    if (drillSidewaysCollectorOwners != null) {
-      for (CollectorOwner sidewaysOwner : drillSidewaysCollectorOwners) {
-        sidewaysOwner.getResult();
+    DrillSidewaysQuery dsq =
+        new DrillSidewaysQuery<>(
+            baseQuery, drillSidewaysCollectorManagers, drillDownQueries, scoreSubDocsAtOnce());
+
+    T collectorResult = searcher.search(dsq, drillDownCollectorManager);
+    List drillSidewaysResults = new ArrayList<>(drillDownDims.size());
+    assert drillSidewaysCollectorManagers != null
+        : "Case without drill sideways dimensions is handled above";
+    int numSlices = dsq.managedDrillSidewaysCollectors.size();
+    for (int dim = 0; dim < drillDownDims.size(); dim++) {
+      List collectorsForDim = new ArrayList<>(numSlices);
+      for (int slice = 0; slice < numSlices; slice++) {
+        collectorsForDim.add(dsq.managedDrillSidewaysCollectors.get(slice).get(dim));
       }
+      drillSidewaysResults.add(
+          dim, drillSidewaysCollectorManagers.get(dim).reduce(collectorsForDim));
     }
+    return new Result<>(collectorResult, drillSidewaysResults);
   }
 
-  private void searchConcurrently(
+  private Result searchConcurrently(
       final DrillDownQuery query,
-      final CollectorOwner drillDownCollectorOwner,
-      final List> drillSidewaysCollectorOwners)
+      final CollectorManager drillDownCollectorManager,
+      final List> drillSidewaysCollectorManagers)
       throws IOException {

Review Comment:
Yeah, we could do that, but I think I'd be more inclined to leave it as-is for now and explore the idea of moving the "sideways" functionality into IndexSearcher as a supported pattern there, and fully deprecate `DrillSideways` as its own entry point. If we did that, we could leave out the concurrent version of this algorithm in IndexSearcher initially, and it could be ported later if a valid use-case comes up?
Re: [PR] Remove CollectorOwner class (#13671) [lucene]
gsmiller commented on code in PR #13702: URL: https://github.com/apache/lucene/pull/13702#discussion_r1742821648

## lucene/facet/src/java/org/apache/lucene/facet/DrillSideways.java: ##
@@ -414,62 +399,59 @@ public ConcurrentDrillSidewaysResult search(
 
   /**
    * Search using DrillDownQuery with custom collectors. This method can be used with any {@link
-   * CollectorOwner}s. It doesn't return anything because it is expected that you read results from
-   * provided {@link CollectorOwner}s.
-   *
-   * To read the results, run {@link CollectorOwner#getResult()} for drill down and all drill
-   * sideways dimensions.
-   *
-   * Note: use {@link Collections#unmodifiableList(List)} to wrap {@code
-   * drillSidewaysCollectorOwners} to convince compiler that it is safe to use List here.
-   *
-   * Use {@link MultiCollectorManager} wrapped by {@link CollectorOwner} to collect both hits and
-   * facets for the entire query and/or for drill-sideways dimensions.
+   * CollectorManager}s.
    *
-   * TODO: Class CollectorOwner was created so that we can ignore CollectorManager type C,
-   * because we want each dimensions to be able to use their own types. Alternatively, we can use
-   * typesafe heterogeneous container and provide CollectorManager type for each dimension to this
-   * method? I do like CollectorOwner approach as it seems more intuitive?
+   * Note: Use {@link MultiCollectorManager} to collect both hits and facets for the entire query
+   * and/or for drill-sideways dimensions. You can also use it to wrap different types of {@link
+   * CollectorManager} for drill-sideways dimensions.
    */
-  public void search(
+  public Result search(
       final DrillDownQuery query,
-      CollectorOwner drillDownCollectorOwner,
-      List> drillSidewaysCollectorOwners)
+      CollectorManager drillDownCollectorManager,
+      List> drillSidewaysCollectorManagers)
       throws IOException {
-    if (drillDownCollectorOwner == null) {
+    if (drillDownCollectorManager == null) {
       throw new IllegalArgumentException(
           "This search method requires client to provide drill down collector manager");
     }
-    if (drillSidewaysCollectorOwners == null) {
+    if (drillSidewaysCollectorManagers == null) {

Review Comment:
Thanks! I personally think it might be a little aggressive to deprecate all these methods, especially so close to the planned 10.0 release. I'm worried we're overlooking legitimate needs/use-cases. Let's follow up with a general `DrillSideways` cleanup in 10.0? Whether that's moving everything into IndexSearcher or simplifying `DrillSideways`, I think we've got some options, but I'd prefer to take our time a little more.

To this end, I reverted the latest commit that adds the deprecation on your branch (thanks for keeping it separate!) and am merging/back-porting. If you feel strongly that we should move for deprecation in 9.12, please let me know and we can still consider it (but as a separate commit). Does that sound reasonable?
Re: [PR] [WIP] Multi-Vector support for HNSW search [lucene]
github-actions[bot] commented on PR #13525: URL: https://github.com/apache/lucene/pull/13525#issuecomment-2327674839

This PR has not had activity in the past 2 weeks, labeling it as stale. If the PR is waiting for review, notify the d...@lucene.apache.org list. Thank you for your contribution!
Re: [PR] Add Facets#getBulkSpecificValues method [lucene]
github-actions[bot] commented on PR #12862: URL: https://github.com/apache/lucene/pull/12862#issuecomment-2327675527

This PR has not had activity in the past 2 weeks, labeling it as stale. If the PR is waiting for review, notify the d...@lucene.apache.org list. Thank you for your contribution!
Re: [PR] Remove CollectorOwner class (#13671) [lucene]
gsmiller merged PR #13702: URL: https://github.com/apache/lucene/pull/13702
Re: [PR] Remove CollectorOwner class (#13671) [lucene]
gsmiller commented on PR #13702: URL: https://github.com/apache/lucene/pull/13702#issuecomment-2327689510

Got this merged on main, but I will work on the backport in the morning (I need to work through some conflicts and am running out of time today).