Re: [PR] TaskExecutor should not fork unnecessarily [lucene]

2024-06-11 Thread via GitHub


javanna commented on PR #13472:
URL: https://github.com/apache/lucene/pull/13472#issuecomment-2159970909

   > Lucene should just make full use of the provided executor and that's that 
shouldn't it?
   
   Yes, I think so, but perhaps Lucene needs to provide general guidelines to 
users around what executor is suited and how it should be configured, what 
factors to take into account etc. 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Optimize Japanese UserDictionary. [lucene]

2024-06-11 Thread via GitHub


bruno-roustant merged PR #13431:
URL: https://github.com/apache/lucene/pull/13431


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [I] Null exception occured when click on Luke desktop browser button [lucene]

2024-06-11 Thread via GitHub


slow-J commented on issue #13345:
URL: https://github.com/apache/lucene/issues/13345#issuecomment-2160422363

   Hi @wuth, please raise a pull request with your fix and it can get reviewed 
and released.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[I] Strange ConcurrentMergeScheduler behavior with intra-merge threads [lucene]

2024-06-11 Thread via GitHub


benwtrent opened a new issue, #13478:
URL: https://github.com/apache/lucene/issues/13478

   ### Description
   
   It was noticed that the CMS intra-merge behavior was not fully tested. In an 
effort to do this, a change to override when the intra-merge scheduler is used 
has been drafted. https://github.com/apache/lucene/pull/13475
   
   
   This PR has exposed many existing assertions that may all be weird testing 
failures. But a couple are more worrisome. 
   
   In particular: 
   
   ```
   ./gradlew test --tests TestCodecHoldsOpenFiles.test 
-Dtests.seed=AA3C2FCDA0773874 -Dtests.locale=ar-PS 
-Dtests.timezone=Asia/Magadan -Dtests.asserts=true -Dtests.file.encoding=UTF-8
   ```
   
   Will periodically fail with an NPE on the `sort`:
   
   ```
   org.apache.lucene.index.MergePolicy$MergeException: 
java.lang.NullPointerException: Cannot invoke 
"org.apache.lucene.search.Sort.getSort()" because the return value of 
"org.apache.lucene.index.LeafMetaData.getSort()" is null
   at __randomizedtesting.SeedInfo.seed([AA3C2FCDA0773874]:0)
   at 
app//org.apache.lucene.index.ConcurrentMergeScheduler.handleMergeException(ConcurrentMergeScheduler.java:764)
   at 
app//org.apache.lucene.index.ConcurrentMergeScheduler$MergeThread.run(ConcurrentMergeScheduler.java:756)
   
   Caused by:
   java.lang.NullPointerException: Cannot invoke 
"org.apache.lucene.search.Sort.getSort()" because the return value of 
"org.apache.lucene.index.LeafMetaData.getSort()" is null
   at 
org.apache.lucene.index.SortingCodecReader.assertCreatedOnlyOnce(SortingCodecReader.java:739)
   at 
org.apache.lucene.index.SortingCodecReader.getOrCreate(SortingCodecReader.java:721)
   at 
org.apache.lucene.index.SortingCodecReader.getOrCreateDV(SortingCodecReader.java:759)
   at 
org.apache.lucene.index.SortingCodecReader$7.getNumeric(SortingCodecReader.java:571)
   at 
org.apache.lucene.codecs.DocValuesConsumer$1.getNumeric(DocValuesConsumer.java:201)
   at 
org.apache.lucene.codecs.lucene90.Lucene90DocValuesConsumer$1.getSortedNumeric(Lucene90DocValuesConsumer.java:138)
   at 
org.apache.lucene.codecs.lucene90.Lucene90DocValuesConsumer.writeValues(Lucene90DocValuesConsumer.java:188)
   at 
org.apache.lucene.codecs.lucene90.Lucene90DocValuesConsumer.addNumericField(Lucene90DocValuesConsumer.java:133)
   at 
org.apache.lucene.tests.codecs.asserting.AssertingDocValuesFormat$AssertingDocValuesConsumer.addNumericField(AssertingDocValuesFormat.java:87)
   at 
org.apache.lucene.codecs.DocValuesConsumer.mergeNumericField(DocValuesConsumer.java:183)
   at 
org.apache.lucene.codecs.DocValuesConsumer.merge(DocValuesConsumer.java:142)
   at 
org.apache.lucene.codecs.perfield.PerFieldDocValuesFormat$FieldsWriter.merge(PerFieldDocValuesFormat.java:151)
   at 
org.apache.lucene.index.SegmentMerger.mergeDocValues(SegmentMerger.java:205)
   at 
org.apache.lucene.index.SegmentMerger.mergeWithLogging(SegmentMerger.java:325)
   at 
org.apache.lucene.index.SegmentMerger.lambda$merge$1(SegmentMerger.java:155)
   at 
org.apache.lucene.search.TaskExecutor$TaskGroup.lambda$createTask$0(TaskExecutor.java:117)
   at 
java.base/java.util.concurrent.FutureTask.run(FutureTask.java:317)
   at 
org.apache.lucene.index.ConcurrentMergeScheduler$CachedExecutor.lambda$execute$0(ConcurrentMergeScheduler.java:982)
   at 
java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144)
   at 
java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642)
   at java.base/java.lang.Thread.run(Thread.java:1583)
   ```
   
   ### Version and environment details
   
   _No response_


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [I] Strange ConcurrentMergeScheduler behavior with intra-merge threads [lucene]

2024-06-11 Thread via GitHub


benwtrent commented on issue #13478:
URL: https://github.com/apache/lucene/issues/13478#issuecomment-2160620126

   OK, the NPE in sort, I did some manual debugging via good ole 
`System.out.println`. This only happens in the assertion if the cache check is 
greater than 1, which does seem to happen with intra-merge concurrency. 
   
   However, if I undo all the concurrency and printout the assertion check 
lines (that pass in without concurrency), I still see that `sort: null`. This 
tells me that this assertion has always had an issue with `sort == null` and 
not double checking it.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[PR] Adjust assertion check to not throw an NPE [lucene]

2024-06-11 Thread via GitHub


benwtrent opened a new pull request, #13479:
URL: https://github.com/apache/lucene/pull/13479

   It is possible when checking this assertion, that we could throw an NPE as 
`metaData.getSort()` can be `null`.
   
   Let's actually allow the assertions do their checks instead of throwing a 
scary NPE here.
   
   related: https://github.com/apache/lucene/issues/13478


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Adjust assertion check to not throw an NPE [lucene]

2024-06-11 Thread via GitHub


benwtrent merged PR #13479:
URL: https://github.com/apache/lucene/pull/13479


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [I] Strange ConcurrentMergeScheduler behavior with intra-merge threads [lucene]

2024-06-11 Thread via GitHub


benwtrent commented on issue #13478:
URL: https://github.com/apache/lucene/issues/13478#issuecomment-2160815181

   I think I know the issue with the parallel merging. This only happens when 
we use a SortingCodecReader. 
   
   The key issue is here: 
https://github.com/apache/lucene/commit/17c285d61743da0c06735e06235b20bd5aac4e14
   
   This adjusted the caching as to where:
   
   ```
   private  T getOrCreateNorms(String field, IOSupplier supplier) throws 
IOException {
   return getOrCreate(field, true, supplier);
 }
   
   private synchronized  T getOrCreate(String field, boolean norms, 
IOSupplier supplier)
 throws IOException {
   if ((field.equals(cachedField) && cacheIsNorms == norms) == false) {
 assert assertCreatedOnlyOnce(field, norms);
 cachedObject = supplier.get();
 cachedField = field;
 cacheIsNorms = norms;
   }
   assert cachedObject != null;
   return (T) cachedObject;
 }
   
 private  T getOrCreateDV(String field, IOSupplier supplier) throws 
IOException {
   return getOrCreate(field, false, supplier);
 }
   ```
   
   This will cause a weird race condition as when merging norms will call 
`getOrCreateNorms` and in parallel, we could be calling `getOrCreateDV`, either 
of which will overwrite the other, and then potentially double cache. 
   
   Parallel merging breaks these assumptions and could cause issues. 
   
   @iverase @jpountz I propose we remove intra-merging parallelism from norms, 
terms, and doc values and do a 9.11.1 release.
   
   We can incrementally add those back in the future.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [I] Strange ConcurrentMergeScheduler behavior with intra-merge threads [lucene]

2024-06-11 Thread via GitHub


benwtrent commented on issue #13478:
URL: https://github.com/apache/lucene/issues/13478#issuecomment-2160939426

   >Parallel merging breaks these assumptions and could cause issues.
   
   Well, the assumptions are that its only accessed once. But now in parallel 
merging, it could be re-cached any number of times as the norms fields vs. the 
dv fields keep kicking eachother out of cache. 
   
   Seems like a bad idea regardless and I would like us to spend time making 
sure this is OK. 
   
   The proposal for a 9.11.1 is out of caution here.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] TaskExecutor should not fork unnecessarily [lucene]

2024-06-11 Thread via GitHub


msokolov commented on PR #13472:
URL: https://github.com/apache/lucene/pull/13472#issuecomment-2160948229

   Would it make sense to provide a reference implementation factory method 
that creates a properly-configured threadpool, maybe using all available cores 
with whatever appropriate policies? 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Expose FlatVectorsFormat [lucene]

2024-06-11 Thread via GitHub


benwtrent commented on code in PR #13469:
URL: https://github.com/apache/lucene/pull/13469#discussion_r1635050674


##
lucene/core/src/java/org/apache/lucene/codecs/lucene99/Lucene99FlatVectorsReader.java:
##
@@ -217,6 +220,18 @@ public ByteVectorValues getByteVectorValues(String field) 
throws IOException {
 vectorData);
   }
 
+  @Override
+  public void search(String field, float[] target, KnnCollector knnCollector, 
Bits acceptDocs) throws IOException {

Review Comment:
   @msokolov I suppose that is ok for folks that understand the depths of codec 
formats. But we are seriously complicating somebody's life that wants to use 
kNN and scalar quantized flat codec. Now they cannot use the knn query, but 
instead have to use some specialized query they come up with themselves. 
   
   I think we at least should have a "flat vector query"? or something? 
Something that always does exact search so that typical users don't have to do 
this iterating & casting stuff?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Mark COSINE VectorSimilarity function as deprecated [lucene]

2024-06-11 Thread via GitHub


Pulkitg64 commented on PR #13473:
URL: https://github.com/apache/lucene/pull/13473#issuecomment-2161002119

   Thanks @benwtrent for all the feedback and help. I will try to raise a 
followup PR to remove COSINE function if no one else beats me to it. 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Fix global score update bug in MultiLeafKnnCollector [lucene]

2024-06-11 Thread via GitHub


gsmiller commented on PR #13463:
URL: https://github.com/apache/lucene/pull/13463#issuecomment-2161023588

   @benwtrent ah, you're right. I only had a single segment. I played with 
making the write buffer really small but couldn't get more than one segment 
with that 100d enwiki dataset. I ran with cohere data along with a 12MB write 
buffer to try to reproduce your results. I'm probably doing something wrong 
still, but I at least confirmed I had more than one segment in my index (ended 
up producing 16 in my run). I'll post the results I got with that dataset here, 
but I'm not sure I trust them at this point given the low recall being reported 
(I suspect I just have something wrong with my setup):
   
   ```
   BASELINE
   recall  latency nDocfanout  maxConn beamWidth   visited index ms
   0.385   13.05   100 0   16  100 22696   255473  1.00
post-filter
   
   CANDIDATE
   recall  latency nDocfanout  maxConn beamWidth   visited index ms
   0.383   13.60   100 0   16  100 23645   249901  1.00
post-filter
   ```


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Fix global score update bug in MultiLeafKnnCollector [lucene]

2024-06-11 Thread via GitHub


benwtrent commented on PR #13463:
URL: https://github.com/apache/lucene/pull/13463#issuecomment-2161031174

   @gsmiller I have ran into those weird recall numbers in scenarios before:
   
- My vector data was corrupted and thus created many `0` valued vectors
- My dimensions were incorrectly configured and I was using 100 or 768 
dimensions when I should have been using 1024 or 384.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Expose FlatVectorsFormat [lucene]

2024-06-11 Thread via GitHub


navneet1v commented on code in PR #13469:
URL: https://github.com/apache/lucene/pull/13469#discussion_r1635173335


##
lucene/core/src/java/org/apache/lucene/codecs/lucene99/Lucene99FlatVectorsReader.java:
##
@@ -217,6 +220,18 @@ public ByteVectorValues getByteVectorValues(String field) 
throws IOException {
 vectorData);
   }
 
+  @Override
+  public void search(String field, float[] target, KnnCollector knnCollector, 
Bits acceptDocs) throws IOException {

Review Comment:
   > @msokolov I suppose that is ok for folks that understand the depths of 
codec formats. But we are seriously complicating somebody's life that wants to 
use kNN and scalar quantized flat codec. Now they cannot use the knn query, but 
instead have to use some specialized query they come up with themselves.
   
   +1 on this comment. 
   
   
   
   > I think we at least should have a "flat vector query"? or something? 
Something that always does exact search so that typical users don't have to do 
this iterating & casting stuff?
   
   Rather than having a new query can we just reuse the old query where for a 
field we get the right KNNVectorsReader and hit the search interface of the 
reader that does the exact search or HNSW based search based on whatever 
implementation the KNNVectorsReader have used.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Expose FlatVectorsFormat [lucene]

2024-06-11 Thread via GitHub


msokolov commented on code in PR #13469:
URL: https://github.com/apache/lucene/pull/13469#discussion_r1635215676


##
lucene/core/src/java/org/apache/lucene/codecs/lucene99/Lucene99FlatVectorsReader.java:
##
@@ -217,6 +220,18 @@ public ByteVectorValues getByteVectorValues(String field) 
throws IOException {
 vectorData);
   }
 
+  @Override
+  public void search(String field, float[] target, KnnCollector knnCollector, 
Bits acceptDocs) throws IOException {

Review Comment:
   I hear you all, this seems a little weird. One thing is ... you cannot get 
access to this stuff at all unless you are willing to create a custom Codec. So 
by definition we have weeded out people who don't want to mess around with 
Codecs. Another thing is ... we don't really want to create a trap for people 
who are trying to find top N vector-distance documents matching some 
constraint.  They might think by blindly following the API trail/trap that they 
should create a BitSet representing their constraint and pass that in to a 
Knn*VectorQuery? When instead they would be better off executing their 
constraint as the primary Query and using the VectorValues API to do the 
scoring. I'm struggling to see the use case for this brute-force query; it just 
seems like a trap to me



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Mark COSINE VectorSimilarity function as deprecated [lucene]

2024-06-11 Thread via GitHub


benwtrent merged PR #13473:
URL: https://github.com/apache/lucene/pull/13473


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Expose FlatVectorsFormat [lucene]

2024-06-11 Thread via GitHub


msokolov commented on code in PR #13469:
URL: https://github.com/apache/lucene/pull/13469#discussion_r1635387325


##
lucene/core/src/java/org/apache/lucene/codecs/lucene99/Lucene99FlatVectorsReader.java:
##
@@ -217,6 +220,18 @@ public ByteVectorValues getByteVectorValues(String field) 
throws IOException {
 vectorData);
   }
 
+  @Override
+  public void search(String field, float[] target, KnnCollector knnCollector, 
Bits acceptDocs) throws IOException {

Review Comment:
   Maybe we can make a KnnVectorScoreQuery that wraps another Query, supplying 
the vector score. Actually I think you can already do this with 
FunctionScoreQuery + a DoubleValuesSource wrapping a KnnVectorScorer? Perhaps 
we should provide that as a single Query?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Fix global score update bug in MultiLeafKnnCollector [lucene]

2024-06-11 Thread via GitHub


mayya-sharipova commented on code in PR #13463:
URL: https://github.com/apache/lucene/pull/13463#discussion_r1635365499


##
lucene/core/src/java/org/apache/lucene/util/hnsw/BlockingFloatHeap.java:
##
@@ -72,12 +72,13 @@ public float offer(float value) {
* Values must be sorted in ascending order.
*
* @param values a set of values to insert, must be sorted in ascending order
+   * @param len number of values from the {@code values} array to insert
* @return the new 'top' element in the queue.
*/
-  public float offer(float[] values) {
+  public float offer(float[] values, int len) {
 lock.lock();
 try {
-  for (int i = values.length - 1; i >= 0; i--) {
+  for (int i = len - 1; i >= 0; i--) {

Review Comment:
   As alternative, we don't need to break, and always offer all values.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Fix global score update bug in MultiLeafKnnCollector [lucene]

2024-06-11 Thread via GitHub


benwtrent commented on PR #13463:
URL: https://github.com/apache/lucene/pull/13463#issuecomment-2161527974

   @mayya-sharipova are you concerned at all that the performance gains we 
thought we had seem to disappear with this bug fix?
   
   Could you retest to verify? What I am seeing locally is almost no 
performance gains when it comes to number of vectors visited now.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[I] Scalar quantization extreme edge case of uniform vector values [lucene]

2024-06-11 Thread via GitHub


benwtrent opened a new issue, #13480:
URL: https://github.com/apache/lucene/issues/13480

   ### Description
   
   When quantizing vectors that have a uniform value, the quantiles can get 
really weird. Meaning both min and max quantiles are actually equivalent. 
Additionally, the scoring can be unexpected.
   
   Here is a test indicating this weirdness:
   
   ```
 public void testWeirdEdgeCase() throws IOException {
   float[] uniformVector = new float[128];
   for (int i = 0; i < uniformVector.length; i++) {
 uniformVector[i] = 1;
   }
   float[] query = new float[128];
   for (int i = 0; i < query.length; i++) {
 query[i] = 2;
   }
   ScalarQuantizer scalarQuantizer = 
ScalarQuantizer.fromVectors(fromFloats(new float[][] {uniformVector}),
 0.99f, 1, (byte) 7);
   byte[] quantized = new byte[128];
   float offset = scalarQuantizer.quantize(query, quantized, 
VectorSimilarityFunction.EUCLIDEAN);
   byte[] quantizedVector = new byte[128];
   float offsetVector = scalarQuantizer.quantize(uniformVector, 
quantizedVector, VectorSimilarityFunction.EUCLIDEAN);
   ScalarQuantizedVectorSimilarity quantizedSimilarity =
 ScalarQuantizedVectorSimilarity.fromVectorSimilarity(
   VectorSimilarityFunction.EUCLIDEAN, 
scalarQuantizer.getConstantMultiplier(), scalarQuantizer.getBits());
   float score1 = quantizedSimilarity.score(quantized, offset, 
quantizedVector, offsetVector);
   
   for (int i = 0; i < query.length; i++) {
 query[i] = 5;
   }
   offset = scalarQuantizer.quantize(query, quantized, 
VectorSimilarityFunction.EUCLIDEAN);
   float score2 = quantizedSimilarity.score(quantized, offset, 
quantizedVector, offsetVector);
   assertNotEquals(score1, score2);
 }
   ```
   
   ### Version and environment details
   
   _No response_


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] WIP - Add minimum number of segments to TieredMergePolicy [lucene]

2024-06-11 Thread via GitHub


jpountz commented on code in PR #13430:
URL: https://github.com/apache/lucene/pull/13430#discussion_r1635481301


##
lucene/core/src/java/org/apache/lucene/index/TieredMergePolicy.java:
##
@@ -522,21 +550,28 @@ private MergeSpecification doFindMerges(
 final List candidate = new ArrayList<>();

Review Comment:
   Sorry for the lag, I think I understand what you mean now. We may need to 
decrement the maximum number of segments we allow on the highest tier for every 
segment that is too large that we exclude from the set of segments eligible for 
merging?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Fix global score update bug in MultiLeafKnnCollector [lucene]

2024-06-11 Thread via GitHub


gsmiller commented on PR #13463:
URL: https://github.com/apache/lucene/pull/13463#issuecomment-2161723327

   @mayya-sharipova:


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Fix global score update bug in MultiLeafKnnCollector [lucene]

2024-06-11 Thread via GitHub


gsmiller commented on code in PR #13463:
URL: https://github.com/apache/lucene/pull/13463#discussion_r1635558572


##
lucene/core/src/java/org/apache/lucene/util/hnsw/BlockingFloatHeap.java:
##
@@ -72,12 +72,13 @@ public float offer(float value) {
* Values must be sorted in ascending order.
*
* @param values a set of values to insert, must be sorted in ascending order
+   * @param len number of values from the {@code values} array to insert
* @return the new 'top' element in the queue.
*/
-  public float offer(float[] values) {
+  public float offer(float[] values, int len) {
 lock.lock();
 try {
-  for (int i = values.length - 1; i >= 0; i--) {
+  for (int i = len - 1; i >= 0; i--) {

Review Comment:
   Yeah, I considered that as well and I don't really have a strong opinion 
either way. Offering everything without short-circuiting is probably a slightly 
cleaner/simpler solution so maybe that's a better way to go unless performance 
testing for some reason shows otherwise (but I find it had to imagine we'd see 
a big difference). That solution removes the need for the scratch array, which 
is nice.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Fix global score update bug in MultiLeafKnnCollector [lucene]

2024-06-11 Thread via GitHub


gsmiller commented on PR #13463:
URL: https://github.com/apache/lucene/pull/13463#issuecomment-2161735059

   @benwtrent ++ to understanding the performance regression before pushing. I 
haven't made any more progress there personally. Agreed with waiting to merge 
until we understand what's going on there.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



Re: [PR] Use SPI instead of Enum for VectorSimilarityFunctions [lucene]

2024-06-11 Thread via GitHub


github-actions[bot] commented on PR #13401:
URL: https://github.com/apache/lucene/pull/13401#issuecomment-2161833420

   This PR has not had activity in the past 2 weeks, labeling it as stale. If 
the PR is waiting for review, notify the d...@lucene.apache.org list. Thank you 
for your contribution!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[PR] Update document by docID [lucene]

2024-06-11 Thread via GitHub


vsop-479 opened a new pull request, #13481:
URL: https://github.com/apache/lucene/pull/13481

   ### Description
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org