[GitHub] [lucene] javanna merged pull request #12335: Don't generate stacktrace for TimeExceededException

2023-05-30 Thread via GitHub


javanna merged PR #12335:
URL: https://github.com/apache/lucene/pull/12335
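
For context, the usual Java idiom for an exception that skips stack-trace generation is Throwable's protected four-argument constructor with `writableStackTrace=false`. A minimal sketch of that idiom follows; it is illustrative only, not necessarily the exact change merged in this PR:

```java
// Illustrative sketch: a cheap control-flow exception that never fills in
// a stack trace. The four-argument super constructor is
// (message, cause, enableSuppression, writableStackTrace); passing false
// for writableStackTrace skips the expensive fillInStackTrace call.
public final class TimeExceededException extends RuntimeException {

  public TimeExceededException() {
    super("Query execution time exceeded", null, false, false);
  }
}
```

Since such exceptions are thrown on a hot path purely for control flow, skipping stack-trace capture avoids the dominant cost of constructing them.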


[GitHub] [lucene] javanna commented on pull request #12335: Don't generate stacktrace for TimeExceededException

2023-05-30 Thread via GitHub


javanna commented on PR #12335:
URL: https://github.com/apache/lucene/pull/12335#issuecomment-1567997528

   Thanks @jimczi!


[GitHub] [lucene] romseygeek commented on issue #12318: Async Usage of Lucene Monitor through a Reactive Programming based application

2023-05-30 Thread via GitHub


romseygeek commented on issue #12318:
URL: https://github.com/apache/lucene/issues/12318#issuecomment-1568018194

   Hi @almogtavor! I think you're probably best off writing your own Matcher implementation here, and possibly extending Monitor as well, given that `match` is an inherently synchronous method. The only I/O operations happening here are in the internal searches, and the default implementation uses a ByteBuffersDirectory in any case. There are some synchronization points in the QueryIndex for when new queries are registered with the Monitor.
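
   Since `match` is synchronous, one minimal sketch of an async adapter is to offload the call onto an executor and expose a `CompletableFuture` that a reactive pipeline can consume. The `AsyncMonitor` wrapper below is a hypothetical illustration, not part of the Monitor API:

```java
import java.io.IOException;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CompletionException;
import java.util.concurrent.Executor;
import org.apache.lucene.document.Document;
import org.apache.lucene.monitor.MatchingQueries;
import org.apache.lucene.monitor.Monitor;
import org.apache.lucene.monitor.QueryMatch;

// Hypothetical wrapper: runs the synchronous Monitor#match on an executor
// and exposes the result as a CompletableFuture.
public final class AsyncMonitor {
  private final Monitor monitor;
  private final Executor executor;

  public AsyncMonitor(Monitor monitor, Executor executor) {
    this.monitor = monitor;
    this.executor = executor;
  }

  public CompletableFuture<MatchingQueries<QueryMatch>> matchAsync(Document doc) {
    return CompletableFuture.supplyAsync(
        () -> {
          try {
            // Synchronous match against the registered queries.
            return monitor.match(doc, QueryMatch.SIMPLE_MATCHER);
          } catch (IOException e) {
            throw new CompletionException(e);
          }
        },
        executor);
  }
}
```

   A reactive framework can then adapt the future (e.g., Reactor's `Mono.fromFuture`); the executor choice matters, since the registration-time synchronization points mentioned above can briefly block matching threads.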


[GitHub] [lucene] joshdevins commented on pull request #12314: Multi-value support for KnnVectorField

2023-05-30 Thread via GitHub


joshdevins commented on PR #12314:
URL: https://github.com/apache/lucene/pull/12314#issuecomment-1568187706

   > SUM = the similarity score between the query and each vector is computed, 
all scores are summed to get the final score
   > SUM = every time we find a nearest neighbor vector to be added to the 
topK, if the document is already there, its score is updated summing the old 
and new score
   
   Just a note on the aggregation functions `max` and `sum`. Most commonly it seems that `max` is used, as it is length-independent. When using `sum`, the longer the original text of a document field, and thus the more passages it has, the higher the `sum` over all matching passages will be, since all passages will "match". I'm not sure if it will matter in the end, but my suggestion would be that if `sum` is used, one could optionally apply a radius/similarity threshold to limit the advantage of longer texts, and/or sum over only a limited top-k of a document's passages (see the sketch below).
   
   @alessandrobenedetti Do you have any good references/papers on approaches to re-aggregating passages into documents for SERPs? It seems that this line of work was largely abandoned a couple of years ago, with most approaches settling on `max` passage.
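
   A minimal sketch of the length bias and the thresholded variant discussed above; this is illustrative only, not Lucene code:

```java
// Aggregating per-passage similarity scores into one document score.
// With an unthresholded SUM, many weak passage matches on a long document
// can outscore a single strong match on a short one; MAX is immune to
// this length bias. minScore acts as the radius/similarity threshold.
static double aggregate(double[] passageScores, boolean useMax, double minScore) {
  double max = Double.NEGATIVE_INFINITY;
  double sum = 0.0;
  for (double s : passageScores) {
    max = Math.max(max, s);
    if (s >= minScore) { // minScore = 0 keeps every non-negative score
      sum += s;
    }
  }
  return useMax ? max : sum;
}
```

   For example, `aggregate(new double[] {0.4, 0.4, 0.4, 0.4}, false, 0.0)` yields 1.6 and outranks a short document whose single passage scores 0.9, whereas `max` ranks the 0.9 document first; raising `minScore` to 0.5 makes `sum` agree with `max` in this case.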


[GitHub] [lucene] alessandrobenedetti commented on pull request #12314: Multi-value support for KnnVectorField

2023-05-30 Thread via GitHub


alessandrobenedetti commented on PR #12314:
URL: https://github.com/apache/lucene/pull/12314#issuecomment-1568397130

   > > SUM = the similarity score between the query and each vector is 
computed, all scores are summed to get the final score
   > > SUM = every time we find a nearest neighbor vector to be added to the 
topK, if the document is already there, its score is updated summing the old 
and new score
   > 
   > Just a note on the aggregation functions `max` and `sum`. Most commonly it 
seems that `max` is used as it is length independent. When using `sum`, the 
longer the original text of a document field, and thus the more passages it 
will have, the higher the `sum` of all matching passages will be since all 
passages will "match", thus biasing scoring towards documents with longer text. 
I'm not sure if it will matter in the end, but my suggestion would be that if 
`sum` is used, one could optionally use a radius/similarity threshold to limit 
the advantage of longer texts, and/or allow using just a limited top-k passages 
of a document for `sum`.
   > 
   > @alessandrobenedetti Do you have any good references/papers on approaches 
to re-aggregating passages into documents for SERPs? It seems that the art was 
abandoned a couple years ago with most approaches settling on `max` passage 
(which I see is the only method implemented for now).
   
   Hi @joshdevins,
   The dual strategy (MAX/SUM) was implemented in an old commit; I removed it to keep this initial pull request smaller and cleaner.
   Some of the feedback was to introduce strategies later on, and I agree with that.
   
   I didn't have the time to dive deeper into the aggregation strategies, so I don't have references yet.
   My main focus was to reach a working prototype and then iterate on the various components to make them better/ideal.
   
   Your observation regarding `SUM` is correct.
   In my naive first implementation, only the closest vectors encountered during an approximate search are considered in the SUM.
   You can take a look at the commits before the simplification if you are curious, but I believe it would be better to address this discussion when we reintroduce strategies in a separate future PR.


[GitHub] [lucene] alessandrobenedetti merged pull request #12246: Set word2vec getSynonyms method thread-safe

2023-05-30 Thread via GitHub


alessandrobenedetti merged PR #12246:
URL: https://github.com/apache/lucene/pull/12246
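
For context, a generic sketch of one common way to make a lookup method safe for concurrent callers: serialize access to non-thread-safe internals behind a lock. The `SynonymProvider` class below is hypothetical and not necessarily the approach merged in this PR:

```java
import java.util.List;

// Hypothetical illustration: getSynonyms drives shared mutable state
// (e.g., a reused reader or scratch buffers), so concurrent calls must
// be serialized. Coarse synchronization is the simplest correct option.
class SynonymProvider {
  private final Object lock = new Object();

  List<String> getSynonyms(String term) {
    synchronized (lock) { // one caller at a time through the unsafe internals
      return lookupUnsafe(term);
    }
  }

  private List<String> lookupUnsafe(String term) {
    // placeholder for the underlying, non-thread-safe lookup
    return List.of(term);
  }
}
```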


[GitHub] [lucene] gsmiller commented on a diff in pull request #12334: Fix searchafter query high latency when after value is out of range for segment

2023-05-30 Thread via GitHub


gsmiller commented on code in PR #12334:
URL: https://github.com/apache/lucene/pull/12334#discussion_r1210741210


##
lucene/core/src/java/org/apache/lucene/search/comparators/NumericComparator.java:
##

@@ -204,13 +200,21 @@ private void updateCompetitiveIterator() throws IOException {
         return;
       }
       if (reverse == false) {
-        encodeBottom(maxValueAsBytes);
+        if (queueFull) { // bottom is available only when queue is full
+          maxValueAsBytes = maxValueAsBytes == null ? new byte[bytesCount] : maxValueAsBytes;

Review Comment:
   Apologies in advance if I'm misunderstanding, but as the code is currently written, we also don't know whether we'll ever _need_ these arrays; if the queue never fills, we could have allocated one of them unnecessarily. I think we still have enough information up front, though, to eagerly allocate these like we do today? Is it just a question of being eager vs. lazy with these?
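
   A schematic of the eager-vs-lazy trade-off under discussion; the names mirror the diff above, but the class is a simplified illustration, not the actual NumericComparator internals:

```java
// Simplified illustration of the allocation trade-off, not Lucene code.
class BoundBuffer {
  private final int bytesCount;
  private byte[] maxValueAsBytes; // lazy variant: starts null

  BoundBuffer(int bytesCount) {
    this.bytesCount = bytesCount;
    // Eager alternative: maxValueAsBytes = new byte[bytesCount];
    // simple and allocation happens once, but the buffer is wasted
    // if the queue never fills.
  }

  void onQueueFull() {
    // Lazy variant: allocate on first use, at the cost of a null
    // check each time the competitive bound is updated.
    if (maxValueAsBytes == null) {
      maxValueAsBytes = new byte[bytesCount];
    }
    // ... encode the current bottom value into maxValueAsBytes ...
  }
}
```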


