[GitHub] [lucene] stefanvodita commented on pull request #12454: Clean up ordinal map in default SSDV reader state
stefanvodita commented on PR #12454: URL: https://github.com/apache/lucene/pull/12454#issuecomment-1656692746 Thanks Greg! I’ve kept the hash map and did some clean-up. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] stefanvodita commented on a diff in pull request #12454: Clean up ordinal map in default SSDV reader state
stefanvodita commented on code in PR #12454: URL: https://github.com/apache/lucene/pull/12454#discussion_r1278280338 ## lucene/facet/src/java/org/apache/lucene/facet/sortedset/DefaultSortedSetDocValuesReaderState.java: ## @@ -233,13 +239,13 @@ private int createOneFlatFacetDimState(SortedSetDocValues dv, int dimStartOrd) /** Return the memory usage of this object in bytes. Negative values are illegal. */ @Override public long ramBytesUsed() { -synchronized (cachedOrdMaps) { - long bytes = 0; - for (OrdinalMap map : cachedOrdMaps.values()) { -bytes += map.ramBytesUsed(); +synchronized (cachedOrdMap) { Review Comment: We might not need this synchronization. Because the map has one entry at most and we're only writing it once, I don't think we can end up in an invalid state here, but it seemes wiser not to rely on that. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] stefanvodita commented on a diff in pull request #12354: Fix docFreq in score calculation after rewrite of boolean query consisting of blended query and boosted term query
stefanvodita commented on code in PR #12354: URL: https://github.com/apache/lucene/pull/12354#discussion_r1278283127 ## lucene/core/src/java/org/apache/lucene/search/TermQuery.java: ## @@ -264,11 +264,25 @@ public TermStates getTermStates() { /** Returns true iff other is equal to this. */ @Override public boolean equals(Object other) { -return sameClassAs(other) && term.equals(((TermQuery) other).term); +if (!sameClassAs(other)) { Review Comment: Lucene code generally prefers to avoid `!` for negations, so this would be `sameClassAs(other) == false`. ## lucene/core/src/java/org/apache/lucene/search/TermQuery.java: ## @@ -264,11 +264,25 @@ public TermStates getTermStates() { /** Returns true iff other is equal to this. */ @Override public boolean equals(Object other) { -return sameClassAs(other) && term.equals(((TermQuery) other).term); +if (!sameClassAs(other)) { + return false; +} +TermQuery otherTermQuery = (TermQuery) other; +if (!term.equals(otherTermQuery.term)) { + return false; +} +if (perReaderTermState != null && otherTermQuery.perReaderTermState != null) { + return perReaderTermState.docFreq() == otherTermQuery.perReaderTermState.docFreq(); Review Comment: What do you think of implementing `equals` for `TermStates` and delegating this part to it? I see `docFreq` can throw an exception and we could handle that case too. ## lucene/core/src/java/org/apache/lucene/search/TermQuery.java: ## @@ -264,11 +264,25 @@ public TermStates getTermStates() { /** Returns true iff other is equal to this. */ @Override public boolean equals(Object other) { -return sameClassAs(other) && term.equals(((TermQuery) other).term); +if (!sameClassAs(other)) { + return false; +} +TermQuery otherTermQuery = (TermQuery) other; +if (!term.equals(otherTermQuery.term)) { + return false; +} +if (perReaderTermState != null && otherTermQuery.perReaderTermState != null) { + return perReaderTermState.docFreq() == otherTermQuery.perReaderTermState.docFreq(); +} +return perReaderTermState == null && otherTermQuery.perReaderTermState == null; } @Override public int hashCode() { -return classHash() ^ term.hashCode(); +int hash = classHash() ^ term.hashCode(); +if (perReaderTermState != null) { + hash ^= Integer.hashCode(perReaderTermState.docFreq()); Review Comment: Similar to the question about `equals`, what if we implemented `TermStates.hashCode`? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] tang-hi opened a new pull request, #12470: fix error convert from utf32 to utf8
tang-hi opened a new pull request, #12470: URL: https://github.com/apache/lucene/pull/12470 ### Description fix error convert from utf32 to utf8 ISSUE #12458 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] almogtavor commented on issue #12406: Register nested queries (ToParentBlockJoinQuery) to Lucene Monitor
almogtavor commented on issue #12406: URL: https://github.com/apache/lucene/issues/12406#issuecomment-1656712504 @romseygeek @dweiss @uschindler @dsmiley @gsmiller I'd love to get feedback from you on the subject -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] tang-hi closed pull request #12470: fix error convert from utf32 to utf8
tang-hi closed pull request #12470: fix error convert from utf32 to utf8 URL: https://github.com/apache/lucene/pull/12470 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] benwtrent commented on issue #12342: Prevent VectorSimilarity.DOT_PRODUCT from returning negative scores
benwtrent commented on issue #12342: URL: https://github.com/apache/lucene/issues/12342#issuecomment-1656718447 OK, here (unless we have done these incorrectly), is the final Cohere test (IMO). This is mixing English and Japanese embeddings (just in case the cohere model encodes info for language in magnitude). To mix the language embeddings, I appended the embedded rows together, shuffled them and then ran the previous testing procedure. # EN-JA Reversed | Ordered | Random :-:|:-:|:-: ! || -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] benwtrent opened a new issue, #12471: TestLucene60FieldInfosFormat.testRandom test failure
benwtrent opened a new issue, #12471: URL: https://github.com/apache/lucene/issues/12471 ### Description This has failed many times, I haven't yet dug into why yet. ``` org.apache.lucene.backward_codecs.lucene60.TestLucene60FieldInfosFormat > testRandom FAILED java.lang.IllegalArgumentException: bound must be positive at __randomizedtesting.SeedInfo.seed([BE0D2C0D0F74A9A6:CC410902BE141FD5]:0) at java.base/java.util.Random.nextInt(Random.java:322) at randomizedtesting.runner@2.8.1/com.carrotsearch.randomizedtesting.Xoroshiro128PlusRandom.nextInt(Xoroshiro128PlusRandom.java:73) at randomizedtesting.runner@2.8.1/com.carrotsearch.randomizedtesting.AssertingRandom.nextInt(AssertingRandom.java:87) ``` full trace ``` org.apache.lucene.backward_codecs.lucene60.TestLucene60FieldInfosFormat > testRandom FAILED java.lang.IllegalArgumentException: bound must be positive at __randomizedtesting.SeedInfo.seed([BE0D2C0D0F74A9A6:CC410902BE141FD5]:0) at java.base/java.util.Random.nextInt(Random.java:322) at randomizedtesting.runner@2.8.1/com.carrotsearch.randomizedtesting.Xoroshiro128PlusRandom.nextInt(Xoroshiro128PlusRandom.java:73) at randomizedtesting.runner@2.8.1/com.carrotsearch.randomizedtesting.AssertingRandom.nextInt(AssertingRandom.java:87) at org.apache.lucene.test_framework@10.0.0-SNAPSHOT/org.apache.lucene.tests.index.BaseFieldInfoFormatTestCase.randomFieldType(BaseFieldInfoFormatTestCase.java:358) at org.apache.lucene.test_framework@10.0.0-SNAPSHOT/org.apache.lucene.tests.index.BaseFieldInfoFormatTestCase.testRandom(BaseFieldInfoFormatTestCase.java:282) at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:77) at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.base/java.lang.reflect.Method.invoke(Method.java:568) at randomizedtesting.runner@2.8.1/com.carrotsearch.randomizedtesting.RandomizedRunner.invoke(RandomizedRunner.java:1758) at randomizedtesting.runner@2.8.1/com.carrotsearch.randomizedtesting.RandomizedRunner$8.evaluate(RandomizedRunner.java:946) at randomizedtesting.runner@2.8.1/com.carrotsearch.randomizedtesting.RandomizedRunner$9.evaluate(RandomizedRunner.java:982) at randomizedtesting.runner@2.8.1/com.carrotsearch.randomizedtesting.RandomizedRunner$10.evaluate(RandomizedRunner.java:996) at org.apache.lucene.test_framework@10.0.0-SNAPSHOT/org.apache.lucene.tests.util.TestRuleSetupTeardownChained$1.evaluate(TestRuleSetupTeardownChained.java:48) at org.apache.lucene.test_framework@10.0.0-SNAPSHOT/org.apache.lucene.tests.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:43) at org.apache.lucene.test_framework@10.0.0-SNAPSHOT/org.apache.lucene.tests.util.TestRuleThreadAndTestName$1.evaluate(TestRuleThreadAndTestName.java:45) at org.apache.lucene.test_framework@10.0.0-SNAPSHOT/org.apache.lucene.tests.util.TestRuleIgnoreAfterMaxFailures$1.evaluate(TestRuleIgnoreAfterMaxFailures.java:60) at org.apache.lucene.test_framework@10.0.0-SNAPSHOT/org.apache.lucene.tests.util.TestRuleMarkFailure$1.evaluate(TestRuleMarkFailure.java:44) at junit@4.13.1/org.junit.rules.RunRules.evaluate(RunRules.java:20) at randomizedtesting.runner@2.8.1/com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36) at randomizedtesting.runner@2.8.1/com.carrotsearch.randomizedtesting.ThreadLeakControl$StatementRunner.run(ThreadLeakControl.java:390) at randomizedtesting.runner@2.8.1/com.carrotsearch.randomizedtesting.ThreadLeakControl.forkTimeoutingTask(ThreadLeakControl.java:843) at randomizedtesting.runner@2.8.1/com.carrotsearch.randomizedtesting.ThreadLeakControl$3.evaluate(ThreadLeakControl.java:490) at randomizedtesting.runner@2.8.1/com.carrotsearch.randomizedtesting.RandomizedRunner.runSingleTest(RandomizedRunner.java:955) at randomizedtesting.runner@2.8.1/com.carrotsearch.randomizedtesting.RandomizedRunner$5.evaluate(RandomizedRunner.java:840) at randomizedtesting.runner@2.8.1/com.carrotsearch.randomizedtesting.RandomizedRunner$6.evaluate(RandomizedRunner.java:891) at randomizedtesting.runner@2.8.1/com.carrotsearch.randomizedtesting.RandomizedRunner$7.evaluate(RandomizedRunner.java:902) at org.apache.lucene.test_framework@10.0.0-SNAPSHOT/org.apache.lucene.tests.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:43) at randomizedtesting.runner@2.8.1/com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(Statement
[GitHub] [lucene] benwtrent commented on issue #12471: TestLucene60FieldInfosFormat.testRandom test failure
benwtrent commented on issue #12471: URL: https://github.com/apache/lucene/issues/12471#issuecomment-1656720282 Verified this is caused by: https://github.com/apache/lucene/commit/119635ad808c38d6878c5897bcf16a3b97523d4d -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] tang-hi opened a new pull request, #12472: Fix UTF32toUTF8 will produce invalid transition
tang-hi opened a new pull request, #12472: URL: https://github.com/apache/lucene/pull/12472 FIX ISSUE #12458 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] tang-hi commented on issue #12458: UTF32toUTF8 can create automata that produce/accept invalid unicode
tang-hi commented on issue #12458: URL: https://github.com/apache/lucene/issues/12458#issuecomment-1656725544 I have discovered the bug. ```Java if (endUTF8.numBits(upto) == 5) { // special case -- avoid created unused edges (endUTF8 // doesn't accept certain byte sequences) -- there // are other cases we could optimize too: startCode = 194; } else { startCode = endUTF8.byteAt(upto) & (~MASKS[endUTF8.numBits(upto) - 1]); } ``` In the provided Java code snippet, there is an assumption that the codepoint will start at 0x80 when the number of bits is 6. However, this assumption is incorrect. In reality, when the length of the codepoint is 3 and the first byte is 0xE0, it will start from 0xA0. Similarly, when the length of the codepoint is 4 and the first byte is 0xF0, it will start from 0x90. You can verify this information on the following website: https://www.utf8-chartable.de/unicode-utf8-table.pl Here is the PR to fix that #12472 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] donnerpeter merged pull request #12468: hunspell: check for aff file wellformedness more strictly
donnerpeter merged PR #12468: URL: https://github.com/apache/lucene/pull/12468 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] benwtrent opened a new pull request, #12473: Fix randomly failing field info format tests
benwtrent opened a new pull request, #12473: URL: https://github.com/apache/lucene/pull/12473 The crux of the issue is that we attempt to generate a vector with at least size 1, but the EMPTY KnnVectorsField type has an upper limit of `0` (and thus is not a valid upper limit for random int). Since the empty field type already protects against usage, I figure making its limit `1` to allow random testing is appropriate. closes https://github.com/apache/lucene/issues/12471 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] msokolov commented on issue #12342: Prevent VectorSimilarity.DOT_PRODUCT from returning negative scores
msokolov commented on issue #12342: URL: https://github.com/apache/lucene/issues/12342#issuecomment-1656797446 Q: just want to make sure I understand what the transformation is that you are testing here. Is it that you are taking non-unit vectors and making them into unit vectors by dividing by the max length in the corpus and then adding another dimension to force the vector to be unit length? And then comparing this to max-inner-product on vectors of varying length? If so, what is the meaning of the length of the vectors - where does it come from? Do the training processes generate non-unit vectors? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] msokolov commented on a diff in pull request #12415: Optimize disjunction counts.
msokolov commented on code in PR #12415: URL: https://github.com/apache/lucene/pull/12415#discussion_r1278351078 ## lucene/core/src/java/org/apache/lucene/search/CheckedIntConsumer.java: ## @@ -0,0 +1,31 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package org.apache.lucene.search; + +import java.util.function.IntConsumer; + +/** Like {@link IntConsumer}, but may throw checked exceptions. */ +@FunctionalInterface +public interface CheckedIntConsumer { Review Comment: How about `s/throws IOException//g` and `s/IOException/IOExceptionUnchecked/g` ? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] msokolov commented on issue #12463: Learned sorting algorithm for Lucene
msokolov commented on issue #12463: URL: https://github.com/apache/lucene/issues/12463#issuecomment-1656810622 https://github.com/anikristo/LearnedSort/ is GPLv3 licensed -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] jpountz commented on pull request #12473: Fix randomly failing field info format tests
jpountz commented on PR #12473: URL: https://github.com/apache/lucene/pull/12473#issuecomment-1656838152 With this change, would it be possible that the field gets added to field infos permanently even though vectors may never be actually indexed? Even if it not possible, I like the fact that having the limit at zero prevents it for codecs that don't support vectors the same way that it does for codecs that support vectors. I would prefer fixing the test to only add vectors if the codec limit is above 0. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] jpountz merged pull request #12457: Improve MaxScoreBulkScorer partitioning logic.
jpountz merged PR #12457: URL: https://github.com/apache/lucene/pull/12457 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] jpountz commented on a diff in pull request #12415: Optimize disjunction counts.
jpountz commented on code in PR #12415: URL: https://github.com/apache/lucene/pull/12415#discussion_r1278368477 ## lucene/core/src/java/org/apache/lucene/search/CheckedIntConsumer.java: ## @@ -0,0 +1,31 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package org.apache.lucene.search; + +import java.util.function.IntConsumer; + +/** Like {@link IntConsumer}, but may throw checked exceptions. */ +@FunctionalInterface +public interface CheckedIntConsumer { Review Comment: I would like `LeafCollector#collect` to be a valid method reference that implements this functional interface, and I don't want to change the signature of `LeafCollector#collect`. If we remove the exception here, it would force the default implementation of `LeafCollector#collect(DocIdStream)` to change from this: ```java default void collect(DocIdStream stream) throws IOException { stream.forEach(this::collect); } ``` to that ```java default void collect(DocIdStream stream) throws IOException { stream.forEach(doc -> { try { collect(doc); } catch (IOException e) { throw new UncheckedIOException(e); } }); } ``` which I like less than introducing this functional interface. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] stefanvodita opened a new issue, #12474: Should more Facets implementations be concurrent?
stefanvodita opened a new issue, #12474: URL: https://github.com/apache/lucene/issues/12474 ### Description Faceting can happen in parallel for each segment, but most faceting implementations don't take advantage of this. I'm wondering if more faceting implementations should be concurrent. Maybe we can decide to run in sequential mode or concurrent mode dynamically based on the number of docs we facet over. For an example of concurrent facets, see`ConcurrentSortedSetDocValuesFacetCounts`. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] benwtrent commented on pull request #12434: Add ParentJoin KNN support
benwtrent commented on PR #12434: URL: https://github.com/apache/lucene/pull/12434#issuecomment-1656874923 Thanks for digging in @msokolov! > I'd like to have a clearer sense of the problem you're solving. This PR solves a similar, but different problem to: https://github.com/apache/lucene/issues/12313 Text embedding models have a token limit, so when processing larger text inputs, they need to be divided into passages. These passages share a common parent document. When users search for the "top-k" documents, they expect the initial parent document as the result, not just individual passages. A "multi-value" vector only partially solves this. The user will need to know the nearest passages across those documents to use in retrieval augmented generation. Multi-value fields cannot solve this as metadata needs to be associated with each vector to tie them to an originating passage, or better yet, some field containing the passage text itself. `join` seemed like a natural place to tackle this. - We get the top-k parent documents (e.g. the users larger chunk of text) - And can still get the nearest passages from that deduplicated set of parent documents Not to mention the nice flexibility we get (filtering on passage metadata, filtering on parent documents, hybrid scoring on the parent or child level, etc.) > What confuses me is, I would have expected something like `ToParentBlockJoinQuery(parentBitSet, KnnVectorQuery())` to more or less work already. The main issue is that it won't return the correct number of parent documents when the user requests the top-k parents based on their children vectors. If there are multiple children per parent, this approach may return fewer than k parent documents. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] searchivarius commented on issue #12342: Prevent VectorSimilarity.DOT_PRODUCT from returning negative scores
searchivarius commented on issue #12342: URL: https://github.com/apache/lucene/issues/12342#issuecomment-1656883408 @msokolov the transformation also preserves the inner product up to a query-specific constant. >what is the meaning of the length of the vectors - where does it come from? Do the training processes generate non-unit vectors Yes, the training process often uses the inner/doct product and not the cosine similarity, so some information is being encoded by the vector length/magnitude. Typically, however, the variance isn't that huge. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] msokolov commented on a diff in pull request #12415: Optimize disjunction counts.
msokolov commented on code in PR #12415: URL: https://github.com/apache/lucene/pull/12415#discussion_r1278450910 ## lucene/core/src/java/org/apache/lucene/search/CheckedIntConsumer.java: ## @@ -0,0 +1,31 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package org.apache.lucene.search; + +import java.util.function.IntConsumer; + +/** Like {@link IntConsumer}, but may throw checked exceptions. */ +@FunctionalInterface +public interface CheckedIntConsumer { Review Comment: My suggestion was global search and replace throughout Lucene, only half-serious -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] msokolov commented on pull request #12434: Add ParentJoin KNN support
msokolov commented on PR #12434: URL: https://github.com/apache/lucene/pull/12434#issuecomment-1656953197 > The main issue is that it won't return the correct number of parent documents when the user requests the top-k parents based on their children vectors. If there are multiple children per parent, this approach may return fewer than k parent documents. Thanks I see now. So this is kind of similar in spirit to the existing problem where we want the top K documents (by vector distance) satisfying some constraints and we have to choose whether to find some higher number of nearest docs (K') in the hopes that at least K of them will satisfy the constraints (post-filtering), or whether to apply the filters while searching, guaranteeing top K. I just want to note that both approaches have merit; it's a tradeoff depending on how restrictive the filters are, but for not-very-restrictive filters, post-filtering can outperform. In this case I guess there is a similar tradeoff relating to how many child documents there typically are. If it's a small number (say c children per parent), it may be better to use KNN search with K' = c * K. It would be interesting to compare these two approaches to see if we can provide some guidance or even some kind of api that chooses? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] jpountz commented on pull request #12434: Add ParentJoin KNN support
jpountz commented on PR #12434: URL: https://github.com/apache/lucene/pull/12434#issuecomment-1657050056 I agree that there is similarity in that in both cases it boils down to whether or not you can accept having less than `k` hits. However the degradation is brutal with filtering as you either need to evaluate the filter across the entire segment to load it into a bitset (not great for both runtime (if the filter cardinality is high) and memory usage) or linearly scan all filter matches (not great either). Here the degradation is much more graceful as you only pay some overhead for vectors that get collected. For filtering, I could see a case for requesting k'>k vectors and then do post filtering. For this case I think I would always want to use this feature, potentially combined with the `visitLimit` option to protect against worst-case conditions like a million child docs per parent that would make collisions frequent. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org