[GitHub] [lucene] stefanvodita commented on pull request #12454: Clean up ordinal map in default SSDV reader state

2023-07-29 Thread via GitHub


stefanvodita commented on PR #12454:
URL: https://github.com/apache/lucene/pull/12454#issuecomment-1656692746

   Thanks Greg! I’ve kept the hash map and did some clean-up.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] stefanvodita commented on a diff in pull request #12454: Clean up ordinal map in default SSDV reader state

2023-07-29 Thread via GitHub


stefanvodita commented on code in PR #12454:
URL: https://github.com/apache/lucene/pull/12454#discussion_r1278280338


##
lucene/facet/src/java/org/apache/lucene/facet/sortedset/DefaultSortedSetDocValuesReaderState.java:
##
@@ -233,13 +239,13 @@ private int createOneFlatFacetDimState(SortedSetDocValues 
dv, int dimStartOrd)
   /** Return the memory usage of this object in bytes. Negative values are 
illegal. */
   @Override
   public long ramBytesUsed() {
-synchronized (cachedOrdMaps) {
-  long bytes = 0;
-  for (OrdinalMap map : cachedOrdMaps.values()) {
-bytes += map.ramBytesUsed();
+synchronized (cachedOrdMap) {

Review Comment:
   We might not need this synchronization. Because the map has one entry at 
most and we're only writing it once, I don't think we can end up in an invalid 
state here, but it seemes wiser not to rely on that.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] stefanvodita commented on a diff in pull request #12354: Fix docFreq in score calculation after rewrite of boolean query consisting of blended query and boosted term query

2023-07-29 Thread via GitHub


stefanvodita commented on code in PR #12354:
URL: https://github.com/apache/lucene/pull/12354#discussion_r1278283127


##
lucene/core/src/java/org/apache/lucene/search/TermQuery.java:
##
@@ -264,11 +264,25 @@ public TermStates getTermStates() {
   /** Returns true iff other is equal to this. */
   @Override
   public boolean equals(Object other) {
-return sameClassAs(other) && term.equals(((TermQuery) other).term);
+if (!sameClassAs(other)) {

Review Comment:
   Lucene code generally prefers to avoid `!` for negations, so this would be 
`sameClassAs(other) == false`.



##
lucene/core/src/java/org/apache/lucene/search/TermQuery.java:
##
@@ -264,11 +264,25 @@ public TermStates getTermStates() {
   /** Returns true iff other is equal to this. */
   @Override
   public boolean equals(Object other) {
-return sameClassAs(other) && term.equals(((TermQuery) other).term);
+if (!sameClassAs(other)) {
+  return false;
+}
+TermQuery otherTermQuery = (TermQuery) other;
+if (!term.equals(otherTermQuery.term)) {
+  return false;
+}
+if (perReaderTermState != null && otherTermQuery.perReaderTermState != 
null) {
+  return perReaderTermState.docFreq() == 
otherTermQuery.perReaderTermState.docFreq();

Review Comment:
   What do you think of implementing `equals` for `TermStates` and delegating 
this part to it? I see `docFreq` can throw an exception and we could handle 
that case too.



##
lucene/core/src/java/org/apache/lucene/search/TermQuery.java:
##
@@ -264,11 +264,25 @@ public TermStates getTermStates() {
   /** Returns true iff other is equal to this. */
   @Override
   public boolean equals(Object other) {
-return sameClassAs(other) && term.equals(((TermQuery) other).term);
+if (!sameClassAs(other)) {
+  return false;
+}
+TermQuery otherTermQuery = (TermQuery) other;
+if (!term.equals(otherTermQuery.term)) {
+  return false;
+}
+if (perReaderTermState != null && otherTermQuery.perReaderTermState != 
null) {
+  return perReaderTermState.docFreq() == 
otherTermQuery.perReaderTermState.docFreq();
+}
+return perReaderTermState == null && otherTermQuery.perReaderTermState == 
null;
   }
 
   @Override
   public int hashCode() {
-return classHash() ^ term.hashCode();
+int hash = classHash() ^ term.hashCode();
+if (perReaderTermState != null) {
+  hash ^= Integer.hashCode(perReaderTermState.docFreq());

Review Comment:
   Similar to the question about `equals`, what if we implemented 
`TermStates.hashCode`?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] tang-hi opened a new pull request, #12470: fix error convert from utf32 to utf8

2023-07-29 Thread via GitHub


tang-hi opened a new pull request, #12470:
URL: https://github.com/apache/lucene/pull/12470

   ### Description
   
   
   fix error convert from utf32 to utf8
   ISSUE #12458 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] almogtavor commented on issue #12406: Register nested queries (ToParentBlockJoinQuery) to Lucene Monitor

2023-07-29 Thread via GitHub


almogtavor commented on issue #12406:
URL: https://github.com/apache/lucene/issues/12406#issuecomment-1656712504

   @romseygeek @dweiss @uschindler @dsmiley @gsmiller I'd love to get feedback 
from you on the subject


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] tang-hi closed pull request #12470: fix error convert from utf32 to utf8

2023-07-29 Thread via GitHub


tang-hi closed pull request #12470: fix error convert from utf32 to utf8
URL: https://github.com/apache/lucene/pull/12470


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] benwtrent commented on issue #12342: Prevent VectorSimilarity.DOT_PRODUCT from returning negative scores

2023-07-29 Thread via GitHub


benwtrent commented on issue #12342:
URL: https://github.com/apache/lucene/issues/12342#issuecomment-1656718447

   OK, here (unless we have done these incorrectly), is the final Cohere test 
(IMO). This is mixing English and Japanese embeddings (just in case the cohere 
model encodes info for language in magnitude). To mix the language embeddings, 
I appended the embedded rows together, shuffled them and then ran the previous 
testing procedure.
   
   # EN-JA
Reversed | Ordered | Random
   
:-:|:-:|:-:
   
!![image](https://github.com/apache/lucene/assets/4357155/68e80b3e-de5e-4744-a8a3-5fdf25c21b44)
   
|![image](https://github.com/apache/lucene/assets/4357155/de604446-9635-4a5b-9993-a4bd95f86ae4)|![image](https://github.com/apache/lucene/assets/4357155/2fc51b8b-f7ef-4594-a889-8dc1a4238389)


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] benwtrent opened a new issue, #12471: TestLucene60FieldInfosFormat.testRandom test failure

2023-07-29 Thread via GitHub


benwtrent opened a new issue, #12471:
URL: https://github.com/apache/lucene/issues/12471

   ### Description
   
   This has failed many times, I haven't yet dug into why yet.
   
   ```
   org.apache.lucene.backward_codecs.lucene60.TestLucene60FieldInfosFormat > 
testRandom FAILED
   java.lang.IllegalArgumentException: bound must be positive
   at 
__randomizedtesting.SeedInfo.seed([BE0D2C0D0F74A9A6:CC410902BE141FD5]:0)
   at java.base/java.util.Random.nextInt(Random.java:322)
   at 
randomizedtesting.runner@2.8.1/com.carrotsearch.randomizedtesting.Xoroshiro128PlusRandom.nextInt(Xoroshiro128PlusRandom.java:73)
   at 
randomizedtesting.runner@2.8.1/com.carrotsearch.randomizedtesting.AssertingRandom.nextInt(AssertingRandom.java:87)
   ```
   
full trace 
   
   ```
   org.apache.lucene.backward_codecs.lucene60.TestLucene60FieldInfosFormat > 
testRandom FAILED
   java.lang.IllegalArgumentException: bound must be positive
   at 
__randomizedtesting.SeedInfo.seed([BE0D2C0D0F74A9A6:CC410902BE141FD5]:0)
   at java.base/java.util.Random.nextInt(Random.java:322)
   at 
randomizedtesting.runner@2.8.1/com.carrotsearch.randomizedtesting.Xoroshiro128PlusRandom.nextInt(Xoroshiro128PlusRandom.java:73)
   at 
randomizedtesting.runner@2.8.1/com.carrotsearch.randomizedtesting.AssertingRandom.nextInt(AssertingRandom.java:87)
   at 
org.apache.lucene.test_framework@10.0.0-SNAPSHOT/org.apache.lucene.tests.index.BaseFieldInfoFormatTestCase.randomFieldType(BaseFieldInfoFormatTestCase.java:358)
   at 
org.apache.lucene.test_framework@10.0.0-SNAPSHOT/org.apache.lucene.tests.index.BaseFieldInfoFormatTestCase.testRandom(BaseFieldInfoFormatTestCase.java:282)
   at 
java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
   at 
java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:77)
   at 
java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
   at java.base/java.lang.reflect.Method.invoke(Method.java:568)
   at 
randomizedtesting.runner@2.8.1/com.carrotsearch.randomizedtesting.RandomizedRunner.invoke(RandomizedRunner.java:1758)
   at 
randomizedtesting.runner@2.8.1/com.carrotsearch.randomizedtesting.RandomizedRunner$8.evaluate(RandomizedRunner.java:946)
   at 
randomizedtesting.runner@2.8.1/com.carrotsearch.randomizedtesting.RandomizedRunner$9.evaluate(RandomizedRunner.java:982)
   at 
randomizedtesting.runner@2.8.1/com.carrotsearch.randomizedtesting.RandomizedRunner$10.evaluate(RandomizedRunner.java:996)
   at 
org.apache.lucene.test_framework@10.0.0-SNAPSHOT/org.apache.lucene.tests.util.TestRuleSetupTeardownChained$1.evaluate(TestRuleSetupTeardownChained.java:48)
   at 
org.apache.lucene.test_framework@10.0.0-SNAPSHOT/org.apache.lucene.tests.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:43)
   at 
org.apache.lucene.test_framework@10.0.0-SNAPSHOT/org.apache.lucene.tests.util.TestRuleThreadAndTestName$1.evaluate(TestRuleThreadAndTestName.java:45)
   at 
org.apache.lucene.test_framework@10.0.0-SNAPSHOT/org.apache.lucene.tests.util.TestRuleIgnoreAfterMaxFailures$1.evaluate(TestRuleIgnoreAfterMaxFailures.java:60)
   at 
org.apache.lucene.test_framework@10.0.0-SNAPSHOT/org.apache.lucene.tests.util.TestRuleMarkFailure$1.evaluate(TestRuleMarkFailure.java:44)
   at junit@4.13.1/org.junit.rules.RunRules.evaluate(RunRules.java:20)
   at 
randomizedtesting.runner@2.8.1/com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
   at 
randomizedtesting.runner@2.8.1/com.carrotsearch.randomizedtesting.ThreadLeakControl$StatementRunner.run(ThreadLeakControl.java:390)
   at 
randomizedtesting.runner@2.8.1/com.carrotsearch.randomizedtesting.ThreadLeakControl.forkTimeoutingTask(ThreadLeakControl.java:843)
   at 
randomizedtesting.runner@2.8.1/com.carrotsearch.randomizedtesting.ThreadLeakControl$3.evaluate(ThreadLeakControl.java:490)
   at 
randomizedtesting.runner@2.8.1/com.carrotsearch.randomizedtesting.RandomizedRunner.runSingleTest(RandomizedRunner.java:955)
   at 
randomizedtesting.runner@2.8.1/com.carrotsearch.randomizedtesting.RandomizedRunner$5.evaluate(RandomizedRunner.java:840)
   at 
randomizedtesting.runner@2.8.1/com.carrotsearch.randomizedtesting.RandomizedRunner$6.evaluate(RandomizedRunner.java:891)
   at 
randomizedtesting.runner@2.8.1/com.carrotsearch.randomizedtesting.RandomizedRunner$7.evaluate(RandomizedRunner.java:902)
   at 
org.apache.lucene.test_framework@10.0.0-SNAPSHOT/org.apache.lucene.tests.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:43)
   at 
randomizedtesting.runner@2.8.1/com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(Statement

[GitHub] [lucene] benwtrent commented on issue #12471: TestLucene60FieldInfosFormat.testRandom test failure

2023-07-29 Thread via GitHub


benwtrent commented on issue #12471:
URL: https://github.com/apache/lucene/issues/12471#issuecomment-1656720282

   Verified this is caused by: 
https://github.com/apache/lucene/commit/119635ad808c38d6878c5897bcf16a3b97523d4d


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] tang-hi opened a new pull request, #12472: Fix UTF32toUTF8 will produce invalid transition

2023-07-29 Thread via GitHub


tang-hi opened a new pull request, #12472:
URL: https://github.com/apache/lucene/pull/12472

   FIX ISSUE #12458 
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] tang-hi commented on issue #12458: UTF32toUTF8 can create automata that produce/accept invalid unicode

2023-07-29 Thread via GitHub


tang-hi commented on issue #12458:
URL: https://github.com/apache/lucene/issues/12458#issuecomment-1656725544

I have discovered the bug. 
   ```Java
   if (endUTF8.numBits(upto) == 5) {
   // special case -- avoid created unused edges (endUTF8
   // doesn't accept certain byte sequences) -- there
   // are other cases we could optimize too:
   startCode = 194;
 } else {
   startCode = endUTF8.byteAt(upto) & (~MASKS[endUTF8.numBits(upto) - 
1]);
}
   ```
   In the provided Java code snippet, there is an assumption that the codepoint 
will start at 0x80 when the number of bits is 6. However, this assumption is 
incorrect. In reality, when the length of the codepoint is 3 and the first byte 
is 0xE0, it will start from 0xA0. Similarly, when the length of the codepoint 
is 4 and the first byte is 0xF0, it will start from 0x90. You can verify this 
information on the following website: 
https://www.utf8-chartable.de/unicode-utf8-table.pl
   Here is the PR to fix that #12472 
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] donnerpeter merged pull request #12468: hunspell: check for aff file wellformedness more strictly

2023-07-29 Thread via GitHub


donnerpeter merged PR #12468:
URL: https://github.com/apache/lucene/pull/12468


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] benwtrent opened a new pull request, #12473: Fix randomly failing field info format tests

2023-07-29 Thread via GitHub


benwtrent opened a new pull request, #12473:
URL: https://github.com/apache/lucene/pull/12473

   The crux of the issue is that we attempt to generate a vector with at least 
size 1, but the EMPTY KnnVectorsField type has an upper limit of `0` (and thus 
is not a valid upper limit for random int). Since the empty field type already 
protects against usage, I figure making its limit `1` to allow random testing 
is appropriate.
   
   closes https://github.com/apache/lucene/issues/12471


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] msokolov commented on issue #12342: Prevent VectorSimilarity.DOT_PRODUCT from returning negative scores

2023-07-29 Thread via GitHub


msokolov commented on issue #12342:
URL: https://github.com/apache/lucene/issues/12342#issuecomment-1656797446

   Q: just want to make sure I understand what the transformation is that you 
are testing here. Is it that you are taking non-unit vectors and making them 
into unit vectors by dividing by the max length in the corpus and then adding 
another dimension to force the vector to be unit length? And then comparing 
this to max-inner-product on vectors of varying length? If so, what is the 
meaning of the length of the vectors - where does it come from? Do the training 
processes generate non-unit vectors?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] msokolov commented on a diff in pull request #12415: Optimize disjunction counts.

2023-07-29 Thread via GitHub


msokolov commented on code in PR #12415:
URL: https://github.com/apache/lucene/pull/12415#discussion_r1278351078


##
lucene/core/src/java/org/apache/lucene/search/CheckedIntConsumer.java:
##
@@ -0,0 +1,31 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.lucene.search;
+
+import java.util.function.IntConsumer;
+
+/** Like {@link IntConsumer}, but may throw checked exceptions. */
+@FunctionalInterface
+public interface CheckedIntConsumer {

Review Comment:
   How about `s/throws IOException//g` and 
`s/IOException/IOExceptionUnchecked/g` ?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] msokolov commented on issue #12463: Learned sorting algorithm for Lucene

2023-07-29 Thread via GitHub


msokolov commented on issue #12463:
URL: https://github.com/apache/lucene/issues/12463#issuecomment-1656810622

   https://github.com/anikristo/LearnedSort/ is GPLv3 licensed


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] jpountz commented on pull request #12473: Fix randomly failing field info format tests

2023-07-29 Thread via GitHub


jpountz commented on PR #12473:
URL: https://github.com/apache/lucene/pull/12473#issuecomment-1656838152

   With this change, would it be possible that the field gets added to field 
infos permanently even though vectors may never be actually indexed? Even if it 
not possible, I like the fact that having the limit at zero prevents it for 
codecs that don't support vectors the same way that it does for codecs that 
support vectors. I would prefer fixing the test to only add vectors if the 
codec limit is above 0.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] jpountz merged pull request #12457: Improve MaxScoreBulkScorer partitioning logic.

2023-07-29 Thread via GitHub


jpountz merged PR #12457:
URL: https://github.com/apache/lucene/pull/12457


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] jpountz commented on a diff in pull request #12415: Optimize disjunction counts.

2023-07-29 Thread via GitHub


jpountz commented on code in PR #12415:
URL: https://github.com/apache/lucene/pull/12415#discussion_r1278368477


##
lucene/core/src/java/org/apache/lucene/search/CheckedIntConsumer.java:
##
@@ -0,0 +1,31 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.lucene.search;
+
+import java.util.function.IntConsumer;
+
+/** Like {@link IntConsumer}, but may throw checked exceptions. */
+@FunctionalInterface
+public interface CheckedIntConsumer {

Review Comment:
   I would like `LeafCollector#collect` to be a valid method reference that 
implements this functional interface, and I don't want to change the signature 
of `LeafCollector#collect`. If we remove the exception here, it would force the 
default implementation of `LeafCollector#collect(DocIdStream)` to change from 
this:
   
   ```java
 default void collect(DocIdStream stream) throws IOException {
   stream.forEach(this::collect);
 }
   ```
   
   to that
   
   ```java
 default void collect(DocIdStream stream) throws IOException {
   stream.forEach(doc -> {
 try {
   collect(doc);
 } catch (IOException e) {
   throw new UncheckedIOException(e);
 }
   });
 }
   ```
   
   which I like less than introducing this functional interface.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] stefanvodita opened a new issue, #12474: Should more Facets implementations be concurrent?

2023-07-29 Thread via GitHub


stefanvodita opened a new issue, #12474:
URL: https://github.com/apache/lucene/issues/12474

   ### Description
   
   Faceting can happen in parallel for each segment, but most faceting 
implementations don't take advantage of this.
   
   I'm wondering if more faceting implementations should be concurrent. Maybe 
we can decide to run in sequential mode or concurrent mode dynamically based on 
the number of docs we facet over.
   
   For an example of concurrent facets, 
see`ConcurrentSortedSetDocValuesFacetCounts`.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] benwtrent commented on pull request #12434: Add ParentJoin KNN support

2023-07-29 Thread via GitHub


benwtrent commented on PR #12434:
URL: https://github.com/apache/lucene/pull/12434#issuecomment-1656874923

   Thanks for digging in @msokolov!
   
   > I'd like to have a clearer sense of the problem you're solving.
   
   This PR solves a similar, but different problem to: 
https://github.com/apache/lucene/issues/12313 
   
   Text embedding models have a token limit, so when processing larger text 
inputs, they need to be divided into passages. These passages share a common 
parent document. When users search for the "top-k" documents, they expect the 
initial parent document as the result, not just individual passages.
   
   A "multi-value" vector only partially solves this. The user will need to 
know the nearest passages across those documents to use in retrieval augmented 
generation. Multi-value fields cannot solve this as metadata needs to be 
associated with each vector to tie them to an originating passage, or better 
yet, some field containing the passage text itself. `join` seemed like a 
natural place to tackle this.
   
- We get the top-k parent documents (e.g. the users larger chunk of text)
- And can still get the nearest passages from that deduplicated set of 
parent documents
   
   Not to mention the nice flexibility we get (filtering on passage metadata, 
filtering on parent documents, hybrid scoring on the parent or child level, 
etc.)
   
   > What confuses me is, I would have expected something like 
`ToParentBlockJoinQuery(parentBitSet, KnnVectorQuery())` to more or less work 
already.
   
   The main issue is that it won't return the correct number of parent 
documents when the user requests the top-k parents based on their children 
vectors. If there are multiple children per parent, this approach may return 
fewer than k parent documents.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] searchivarius commented on issue #12342: Prevent VectorSimilarity.DOT_PRODUCT from returning negative scores

2023-07-29 Thread via GitHub


searchivarius commented on issue #12342:
URL: https://github.com/apache/lucene/issues/12342#issuecomment-1656883408

   @msokolov the transformation also preserves the inner product up to a 
query-specific constant.
   
   >what is the meaning of the length of the vectors - where does it come from? 
Do the training processes generate non-unit vectors
   
   Yes, the training process often uses the inner/doct product and not the 
cosine similarity, so some information is being encoded by the vector 
length/magnitude. Typically, however, the variance isn't that huge. 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] msokolov commented on a diff in pull request #12415: Optimize disjunction counts.

2023-07-29 Thread via GitHub


msokolov commented on code in PR #12415:
URL: https://github.com/apache/lucene/pull/12415#discussion_r1278450910


##
lucene/core/src/java/org/apache/lucene/search/CheckedIntConsumer.java:
##
@@ -0,0 +1,31 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.lucene.search;
+
+import java.util.function.IntConsumer;
+
+/** Like {@link IntConsumer}, but may throw checked exceptions. */
+@FunctionalInterface
+public interface CheckedIntConsumer {

Review Comment:
   My suggestion was global search and replace throughout Lucene, only 
half-serious



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] msokolov commented on pull request #12434: Add ParentJoin KNN support

2023-07-29 Thread via GitHub


msokolov commented on PR #12434:
URL: https://github.com/apache/lucene/pull/12434#issuecomment-1656953197

   > The main issue is that it won't return the correct number of parent 
documents when the user requests the top-k parents based on their children 
vectors. If there are multiple children per parent, this approach may return 
fewer than k parent documents.
   
   Thanks I see now. So this is kind of similar in spirit to the existing 
problem where we want the top K documents (by vector distance) satisfying some 
constraints and we have to choose whether to find some higher number of nearest 
docs (K') in the hopes that at least K of them will satisfy the constraints 
(post-filtering), or whether to apply the filters while searching, guaranteeing 
top K. I just want to note that both approaches have merit; it's a tradeoff 
depending on how restrictive the filters are, but for not-very-restrictive 
filters, post-filtering can outperform. In this case I guess there is a similar 
tradeoff relating to how many child documents there typically are. If it's a 
small number (say c children per parent), it may be better to use KNN search 
with K' = c * K.  It would be interesting to compare these two approaches to 
see if we can provide some guidance or even some kind of api that chooses?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] jpountz commented on pull request #12434: Add ParentJoin KNN support

2023-07-29 Thread via GitHub


jpountz commented on PR #12434:
URL: https://github.com/apache/lucene/pull/12434#issuecomment-1657050056

   I agree that there is similarity in that in both cases it boils down to 
whether or not you can accept having less than `k` hits. However the 
degradation is brutal with filtering as you either need to evaluate the filter 
across the entire segment to load it into a bitset (not great for both runtime 
(if the filter cardinality is high) and memory usage) or linearly scan all 
filter matches (not great either). Here the degradation is much more graceful 
as you only pay some overhead for vectors that get collected. For filtering, I 
could see a case for requesting k'>k vectors and then do post filtering. For 
this case I think I would always want to use this feature, potentially combined 
with the `visitLimit` option to protect against worst-case conditions like a 
million child docs per parent that would make collisions frequent.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org