[GitHub] [lucene] vsop-479 commented on pull request #12528: Early terminate visit BKD leaf when current value greater than upper point in sorted dim.

2023-09-25 Thread via GitHub


vsop-479 commented on PR #12528:
URL: https://github.com/apache/lucene/pull/12528#issuecomment-1733041087

   @iverase Does it make sense to you if MatchState defined in other class, 
such as BKDReader or IntersectVisitor, and only leave the sorted dimension in 
IntersectVisitor' visit method as a parameter?
   I just implemented visitWithState in PointRangeQuery, and there is some 
speed up. I think it is same to other queries, such as 
LatLonPointDistanceQuery, which use a similar IntersectVisitor.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] sylph-eu commented on issue #11507: Increase the number of dims for KNN vectors to 2048 [LUCENE-10471]

2023-09-25 Thread via GitHub


sylph-eu commented on issue #11507:
URL: https://github.com/apache/lucene/issues/11507#issuecomment-1733118806

   Last comment is already a couple of months old, so please let me clarify the 
status of this initiative. If there's a chance it's going to be merged? If 
there's any blocker or action item that prevents one from being merged?
   
   The context of my inquiry is that Lucene-based solutions (e.g. OpenSearch) 
are commonly deployed within enterprises, which makes them good candidates to 
experiment with vector search and commercial LLM-offerings, without deploying 
and maintaining specialized technologies. Max dimensionality of 1024, however, 
puts certain restrictions (similar thoughts are here 
https://arxiv.org/abs/2308.14963).


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] gf2121 opened a new pull request, #12586: Remove over-counting of deleted terms

2023-09-25 Thread via GitHub


gf2121 opened a new pull request, #12586:
URL: https://github.com/apache/lucene/pull/12586

   `BufferedUpdates` used to count deleted terms without deduplication to 
respect `IndexWriterConfig.setMaxBufferedDeleteTerms`. As 
`IndexWriterConfig.setMaxBufferedDeleteTerms` is removed since 
[LUCENE-7868](https://issues.apache.org/jira/browse/LUCENE-7868), the 
overcounting can be avoided now.
   
   Previous talk: https://github.com/apache/lucene/pull/12573/files#r1332924157


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] gf2121 commented on a diff in pull request #12586: Remove over-counting of deleted terms

2023-09-25 Thread via GitHub


gf2121 commented on code in PR #12586:
URL: https://github.com/apache/lucene/pull/12586#discussion_r1335560326


##
lucene/core/src/java/org/apache/lucene/index/BufferedUpdates.java:
##
@@ -284,6 +276,13 @@  void 
forEachOrdered(DeletedTermConsumer consumer) throw
 public long ramBytesUsed() {
   return bytesUsed.get();
 }
+
+@Override
+public String toString() {

Review Comment:
   `DeletedTerms#toString` could be called when `VERBOSE_DELETES` enabled.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] uschindler commented on issue #11507: Increase the number of dims for KNN vectors to 2048 [LUCENE-10471]

2023-09-25 Thread via GitHub


uschindler commented on issue #11507:
URL: https://github.com/apache/lucene/issues/11507#issuecomment-1733223795

   Hi,
   actually this issue is already resolved, although the DEFAULT did not change 
(and won't change due to performance risks), see here: 
https://github.com/apache/lucene/pull/12436 - this PR allows users of Lucene to 
raise the limit (at least for HNSW codec) on codec level.
   
   To implement (on your own risk), create your own ` KnnVectorsFormat` and let 
it return a different number fog `getMaxDimensions()`. Then construct your own 
codec from it and index your data.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] uschindler commented on issue #11507: Increase the number of dims for KNN vectors to 2048 [LUCENE-10471]

2023-09-25 Thread via GitHub


uschindler commented on issue #11507:
URL: https://github.com/apache/lucene/issues/11507#issuecomment-1733230392

   @mayya-sharipova: Should we close this issue or are there any plans to also 
change the default maximum? I don't think so.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] iverase merged pull request #12581: Allow reading / writing binary stored fields as DataInput

2023-09-25 Thread via GitHub


iverase merged PR #12581:
URL: https://github.com/apache/lucene/pull/12581


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] iverase closed issue #12556: Allow reading binary stored values as DataInput

2023-09-25 Thread via GitHub


iverase closed issue #12556: Allow reading binary stored values as DataInput
URL: https://github.com/apache/lucene/issues/12556


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] iverase commented on pull request #12528: Early terminate visit BKD leaf when current value greater than upper point in sorted dim.

2023-09-25 Thread via GitHub


iverase commented on PR #12528:
URL: https://github.com/apache/lucene/pull/12528#issuecomment-1733441807

   My recommendation is to use the following method as you are just trying to 
flag if the visit method needs to keep processing points. 
   ```
 /** Similar to {@link IntersectVisitor#visit(int, byte[])} but data is 
visited in 
  * increasing order on the {@sortedDim}, and in the case of ties, in 
increasing docID order. 
  * Implementers can stop processing points on the leaf by returning false 
when for example the 
  * sorted dimension value is too high to be matched by the query. 
  * 
  * @return true if the visitor should continue visiting points on this 
leaf, otherwise false. 
  * */
   
default boolean visitWithSortedDim(int docID, byte[] packedValue, int 
sortedDim) throws IOException { 
  visit(docID, packedValue); 
   return true; 
  }
   ```
   
   I would remove the "inverse" case as the inverse visitors are implementation 
details and IMHO should not be part of the API. 
   
   >  just implemented visitWithState in PointRangeQuery, and there is some 
speed up
   
   We should run the performance test in the [lucene benchmarks](lucene 
benchmarks) to check if there are slowdowns due to the extra check for each 
visited point. 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] gf2121 opened a new pull request, #12587: Use radix sort to speed up the sorting of terms in TermInSetQuery

2023-09-25 Thread via GitHub


gf2121 opened a new pull request, #12587:
URL: https://github.com/apache/lucene/pull/12587

   ### Description
   
   Sort terms in TermInSetQuery with radix sort. This helps TermInSetQueries 
with a number of terms.
   
   ### Benchmark
   
   I made a simple benchmark on sorting `BytesRef[]` with random bytes to 
verify the improvements.
   
   https://bytedance.feishu.cn/sheets/G5dwsdvZ7hOxXftyfDkcvUkYnqB";
 data-importRangeRawData-range="'Sheet1'!A2:D12">
   
     | timsort ( took nanos ) | radixsort ( took nanos ) | took diff
   -- | -- | -- | --
   10 terms (16 bytes per term) | 1292 | 1083 | -16.18%
   100 terms (16 bytes per term) | 17959 | 11750 | -34.57%
   1000 terms (16 bytes per term) | 387916 | 50375 | -87.01%
   1 terms (16 bytes per term) | 5407208 | 1062500 | -80.35%
   10 terms (16 bytes per term) | 65577084 | 5404958 | -91.76%
   10 terms (256 bytes per term) | 3500 | 1750 | -50.00%
   100 terms (256 bytes per term) | 18000 | 11708 | -34.96%
   1000 terms (256 bytes per term) | 410959 | 52417 | -87.25%
   1 terms (256 bytes per term) | 5325666 | 1299125 | -75.61%
   10 terms (256 bytes per term) | 71316500 | 11346584 | -84.09%
   
   
   
   
   
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] jpountz commented on pull request #12382: Run top-level conjunctions of term queries with a specialized BulkScorer.

2023-09-25 Thread via GitHub


jpountz commented on PR #12382:
URL: https://github.com/apache/lucene/pull/12382#issuecomment-1733490475

   FWIW I ran the benchmark from https://tantivy-search.github.io/bench/ and 
also observed a speedup on conjunctions, so I think that the speedup is indeed 
real.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] jpountz merged pull request #12382: Run top-level conjunctions of term queries with a specialized BulkScorer.

2023-09-25 Thread via GitHub


jpountz merged PR #12382:
URL: https://github.com/apache/lucene/pull/12382


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] javanna opened a new pull request, #12588: Shared executor for LuceneTestCase#newSearcher callers

2023-09-25 Thread via GitHub


javanna opened a new pull request, #12588:
URL: https://github.com/apache/lucene/pull/12588

   Until now, LuceneTestCase#newSearcher randomly associates the returned 
IndexSearcher instance with an executor that is ad-hoc created, which gets shut 
down when the index reader is closed.
   
   This has made us catch a couple of cases where we were not properly closing 
readers in tests. Most recently, we have been seeing test failures (OOM - 
unable to create thread) due to too many executor instances created as part of 
the same test. This is to be attributed to creating too many searcher instance, 
each one getting its separate executor, which all get shutdown at the end of 
the entire suite. The main offender for this is QueryUtils which creates a new 
searcher for each leaf reader, and the top-level reader gets closed in the 
AfterClass, hence all the executors will stay around for the entire duration of 
the test suite that relies on QueryUtils.
   
   This commit eagerly creates an executor in an additional before class method 
for LuceneTestCase, and associates that with each searcher that is supposed to 
get a non null executor.
   
   Note that the executor is shutdown in the after class to ensure that no 
threads leak in tests.
   
   This has the additional advantage that it removes the need to close the 
executor as part of an index reader close listener, which also requires the 
reader to have an associated reader cache helper.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] rmuir commented on pull request #12588: Shared executor for LuceneTestCase#newSearcher callers

2023-09-25 Thread via GitHub


rmuir commented on PR #12588:
URL: https://github.com/apache/lucene/pull/12588#issuecomment-1733714024

   this is nice, much cleaner. I think i added the original cache-helper-hack. 
Just one question: can we reduce the number of threads used? I have 2 cores. 
shouldn't 2 be enough to occasionally find bugs rather than 8?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] javanna commented on pull request #12588: Shared executor for LuceneTestCase#newSearcher callers

2023-09-25 Thread via GitHub


javanna commented on PR #12588:
URL: https://github.com/apache/lucene/pull/12588#issuecomment-1733731499

   Thanks @rmuir for looking, I lowered the number of threads


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] jpountz opened a new pull request, #12589: Sometimes intersect the essential clause and the best non-essential clause.

2023-09-25 Thread via GitHub


jpountz opened a new pull request, #12589:
URL: https://github.com/apache/lucene/pull/12589

   The idea behind MAXSCORE is to run disjunctions as `+(essentialClause1 ... 
essentialClauseM) nonEssentialClause1 ... nonEssentialClauseN`, moving more and 
more clauses from the essential list to the non-essential list as the minimum 
competitive score increases. For instance, a query such as `the book of life` 
which I found in the Tantivy benchmark ends up running as `+book the of life` 
after some time, ie. with one required clause and other clauses optional. This 
is because matching `the`, `of` and `life` alone is not good enough for 
yielding a match.
   
   Here some statistics in that case:
- min competitive score: 3.4781857
- max_window_score(book): 2.8796153
- max_window_score(life): 2.037863
- max_window_score(the): 0.103848875
- max_window_score(of): 0.19427927
   
   Actually if you look at these statistics, we could do better, because a 
match may only be competitive if it matches both `book` and `life`, so this 
query could actually execute as `+book +life the of`, which may help evaluate 
fewer documents compared to `+book the of life`. Especially if you enable 
recursive graph bisection.
   
   This is what this PR tries to achieve: in the event when there is a single 
essential clause and matching all clauses but the best non-essential clause 
cannot produce a competitive match, then the scorer will only evaluate 
documents that match the intersection of the essential clause and the best 
non-essential clause.
   
   It's worth noting that this optimization would kick in very frequently on 
2-clauses disjunctions.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] jpountz commented on pull request #12589: Sometimes intersect the essential clause and the best non-essential clause.

2023-09-25 Thread via GitHub


jpountz commented on PR #12589:
URL: https://github.com/apache/lucene/pull/12589#issuecomment-1733875246

   Opening as a draft as I still need to figure out how to test this 
optimization.
   
   I tested on wikibigall where this yielded a good speedup. I would expect an 
even better speedup if recursive graph bisection is enabled.
   
   ```
   TaskQPS baseline  StdDevQPS 
my_modified_version  StdDevPct diff p-value
CountOrHighHigh   59.20 (15.6%)   57.96 
(14.8%)   -2.1% ( -28% -   33%) 0.671
 CountOrHighMed   91.87 (15.6%)   90.04 
(14.7%)   -2.0% ( -27% -   33%) 0.684
MedTerm  451.03  (9.1%)  446.68  
(8.1%)   -1.0% ( -16% -   17%) 0.730
LowTerm  751.75  (8.3%)  744.56  
(7.1%)   -1.0% ( -15% -   15%) 0.703
  MedPhrase   40.42  (7.8%)   40.05  
(7.9%)   -0.9% ( -15% -   16%) 0.717
   HighTerm  339.69 (10.6%)  337.52  
(9.5%)   -0.6% ( -18% -   21%) 0.844
 HighPhrase   35.00  (6.4%)   34.82  
(6.7%)   -0.5% ( -12% -   13%) 0.809
AndHighHigh   63.82  (4.2%)   63.55  
(2.9%)   -0.4% (  -7% -7%) 0.721
 AndHighMed  182.62  (3.6%)  181.94  
(2.4%)   -0.4% (  -6% -5%) 0.705
  LowPhrase   41.82  (4.8%)   41.69  
(5.6%)   -0.3% ( -10% -   10%) 0.848
  HighTermDayOfYearSort  234.81  (1.5%)  234.33  
(1.8%)   -0.2% (  -3% -3%) 0.697
 AndHighLow  823.51  (3.2%)  821.97  
(2.6%)   -0.2% (  -5% -5%) 0.846
  OrHighLow  630.34  (2.9%)  630.52  
(3.9%)0.0% (  -6% -6%) 0.979
Respell   66.16  (1.4%)   66.19  
(1.7%)0.0% (  -2% -3%) 0.938
 Fuzzy1  128.80  (1.4%)  128.97  
(1.5%)0.1% (  -2% -3%) 0.777
CountAndHighMed  123.28  (2.8%)  123.60  
(3.6%)0.3% (  -5% -6%) 0.803
 Fuzzy2  114.67  (1.3%)  114.97  
(1.5%)0.3% (  -2% -3%) 0.567
CountPhrase3.38  (9.0%)3.39  
(9.3%)0.3% ( -16% -   20%) 0.918
   CountAndHighHigh   40.81  (2.8%)   40.96  
(3.8%)0.4% (  -6% -7%) 0.742
   PKLookup  224.07  (2.3%)  225.12  
(2.7%)0.5% (  -4% -5%) 0.566
   Wildcard  144.23  (2.4%)  145.22  
(3.1%)0.7% (  -4% -6%) 0.451
  HighTermMonthSort 5090.41  (2.8%) 5142.25  
(4.2%)1.0% (  -5% -8%) 0.384
Prefix3  241.24  (4.3%)  244.33  
(4.4%)1.3% (  -7% -   10%) 0.368
  CountTerm17070.19  (5.1%)17289.73  
(4.9%)1.3% (  -8% -   11%) 0.429
 IntNRQ   88.47 (11.5%)   89.76 
(13.8%)1.5% ( -21% -   30%) 0.725
  OrHighMed  191.58  (3.5%)  206.42  
(3.7%)7.7% (   0% -   15%) 0.000
 OrHighHigh   60.88  (4.4%)   68.91  
(4.4%)   13.2% (   4% -   23%) 0.000
   ```


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] javanna merged pull request #12588: Shared executor for LuceneTestCase#newSearcher callers

2023-09-25 Thread via GitHub


javanna merged PR #12588:
URL: https://github.com/apache/lucene/pull/12588


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] kaivalnp opened a new pull request, #12590: Allow implementers of AbstractKnnVectorQuery to access final topK results

2023-09-25 Thread via GitHub


kaivalnp opened a new pull request, #12590:
URL: https://github.com/apache/lucene/pull/12590

   ### Context
   
   Vector search is performed in 
[`AbstractKnnVectorQuery`](https://github.com/kaivalnp/lucene/blob/main/lucene/core/src/java/org/apache/lucene/search/AbstractKnnVectorQuery.java),
 where individual HNSW searches are delegated to sub-classes via 
[`#approximateSearch`](https://github.com/kaivalnp/lucene/blob/main/lucene/core/src/java/org/apache/lucene/search/AbstractKnnVectorQuery.java#L174).
 This is useful to implement custom functionality (like say pro-rating `k` 
across segments for performance, or requesting some additional results over `k` 
for higher recall with Exact KNN)
   
   While the class in itself is `package-private`, we extend the corresponding 
sub-classes for 
[byte](https://github.com/kaivalnp/lucene/blob/main/lucene/core/src/java/org/apache/lucene/search/KnnByteVectorQuery.java)
 and 
[float](https://github.com/kaivalnp/lucene/blob/main/lucene/core/src/java/org/apache/lucene/search/KnnFloatVectorQuery.java)
 vectors (which are `public`) for implementing any custom functionality like 
above
   
   ### Issue
   
   After searching across all segments, we retain the index-level `topk` 
results 
[here](https://github.com/kaivalnp/lucene/blob/main/lucene/core/src/java/org/apache/lucene/search/AbstractKnnVectorQuery.java#L88),
 and immediately call 
[`#createRewrittenQuery`](https://github.com/kaivalnp/lucene/blob/main/lucene/core/src/java/org/apache/lucene/search/AbstractKnnVectorQuery.java#L219)
 to rewrite them as a 
[`DocAndScoreQuery`](https://github.com/kaivalnp/lucene/blob/main/lucene/core/src/java/org/apache/lucene/search/AbstractKnnVectorQuery.java#L297)
   
   So the implementing classes do not have access to the final `topK` results 
anywhere, which may be useful to read / modify like a post-process step (for 
example metric emission, or counting / only keeping results above a threshold)
   
   ### Proposal
   
   Can we make 
[`#createRewrittenQuery`](https://github.com/kaivalnp/lucene/blob/main/lucene/core/src/java/org/apache/lucene/search/AbstractKnnVectorQuery.java#L219)
 `protected` to allow sub-classes to override it (and ultimately access the 
`topK` results)?
   They can simply delegate to the original function at the end, or even 
implement a custom `Query` if required
   
   I don't how common of a use-case this is, and wanted to get some opinions
   
   Closes #12575


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] benwtrent commented on pull request #12590: Allow implementers of AbstractKnnVectorQuery to access final topK results

2023-09-25 Thread via GitHub


benwtrent commented on PR #12590:
URL: https://github.com/apache/lucene/pull/12590#issuecomment-1734167969

   The new KnnCollector abstraction doesn't already address these needs?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] kaivalnp commented on pull request #12590: Allow implementers of AbstractKnnVectorQuery to access final topK results

2023-09-25 Thread via GitHub


kaivalnp commented on PR #12590:
URL: https://github.com/apache/lucene/pull/12590#issuecomment-1734230703

   Thanks for the quick response @benwtrent!
   
   As far as I understand (please let me know if I'm missing something), the 
new 
[`KnnCollector`](https://github.com/apache/lucene/blob/main/lucene/core/src/java/org/apache/lucene/search/KnnCollector.java#L26)
 interface and corresponding 
[`#searchNearestVectors`](https://github.com/apache/lucene/blob/main/lucene/core/src/java/org/apache/lucene/index/LeafReader.java#L303-L330)
 API still collects per-leaf results right?
   
   Irrespective of whether we pass a custom implementation for this interface 
in 
[`#approximateSearch`](https://github.com/apache/lucene/blob/main/lucene/core/src/java/org/apache/lucene/search/KnnFloatVectorQuery.java#L81-L82),
 we will not have access to the final results after [merging across all 
leaves](https://github.com/apache/lucene/blob/main/lucene/core/src/java/org/apache/lucene/search/AbstractKnnVectorQuery.java#L88)
 - that is, no `KnnCollector` object holds results *across all segments*?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] benwtrent commented on pull request #12590: Allow implementers of AbstractKnnVectorQuery to access final topK results

2023-09-25 Thread via GitHub


benwtrent commented on PR #12590:
URL: https://github.com/apache/lucene/pull/12590#issuecomment-1734248507

   @kaivalnp I guess you could have a collector that spans all the segments 
(created once at the query level).
   
   I am not really against this change, I am just wondering if there is a 
better existing option for what you are trying to achieve.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] kaivalnp commented on pull request #12590: Allow implementers of AbstractKnnVectorQuery to access final topK results

2023-09-25 Thread via GitHub


kaivalnp commented on PR #12590:
URL: https://github.com/apache/lucene/pull/12590#issuecomment-1734321355

   > you could have a collector that spans all the segments (created once at 
the query level).
   
   Interesting, so a single `KnnCollector` passed to all 
[`searchNearestVectors`](https://github.com/apache/lucene/blob/main/lucene/core/src/java/org/apache/lucene/search/KnnFloatVectorQuery.java#L82)
 calls?
   
   However, `AbstractKnnVectorQuery` now [executes HNSW searches in 
parallel](https://github.com/apache/lucene/blob/main/lucene/core/src/java/org/apache/lucene/search/AbstractKnnVectorQuery.java#L104),
 and the `KnnCollector` is passed all the way to the low-level 
[`HnswGraphSearcher#searchLevel`](https://github.com/apache/lucene/blob/main/lucene/core/src/java/org/apache/lucene/util/hnsw/HnswGraphSearcher.java#L189-L202),
 so we'll need some synchronization in collecting individual docs across 
threads?
   
   I'm not sure how much overhead this will cause, but seems like a lot?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] benwtrent commented on pull request #12590: Allow implementers of AbstractKnnVectorQuery to access final topK results

2023-09-25 Thread via GitHub


benwtrent commented on PR #12590:
URL: https://github.com/apache/lucene/pull/12590#issuecomment-1734371468

   @kaivalnp depends on what you need to do.
   
   You can easily get around all this without any expensive locking.
   
   The collector has a "topDocs" method that could call some higher level 
collector.
   
   But that aside, you might as well just override the "rewrite" method if you 
need this fined grained control over doing specific things with top-k.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] kaivalnp commented on pull request #12590: Allow implementers of AbstractKnnVectorQuery to access final topK results

2023-09-25 Thread via GitHub


kaivalnp commented on PR #12590:
URL: https://github.com/apache/lucene/pull/12590#issuecomment-1734432239

   > You can easily get around all this without any expensive locking.
   > The collector has a "topDocs" method that could call some higher level 
collector.
   
   Nice idea! So basically we pass a subclass of `TopKnnCollector` 
[here](https://github.com/apache/lucene/blob/main/lucene/core/src/java/org/apache/lucene/search/KnnFloatVectorQuery.java#L82),
 where the "topDocs" function is overloaded to pass the final results of each 
segment to a high-level collector when called?
   
   I wonder how this would be any different from passing the final `TopDocs` at 
the end of 
[`#approximateSearch`](https://github.com/apache/lucene/blob/main/lucene/core/src/java/org/apache/lucene/search/KnnFloatVectorQuery.java#L79)
 directly to the high-level collector?
   
   In any case, even if we pass the `TopDocs` from each segment to a high-level 
collector, it would have to combine the individual `TopDocs` results across all 
segments to figure out the index-level `topK` independently of 
`AbstractKnnVectorQuery` right? (duplicating [this piece of 
work](https://github.com/apache/lucene/blob/main/lucene/core/src/java/org/apache/lucene/search/AbstractKnnVectorQuery.java#L88)
 in both places)
   
   > you might as well just override the "rewrite" method
   
   Overriding `rewrite` (first calling `super.rewrite` and looking at the 
rewritten `Query`) may not work because the 
[`DocAndScoreQuery`](https://github.com/apache/lucene/blob/main/lucene/core/src/java/org/apache/lucene/search/AbstractKnnVectorQuery.java#L297)
 class is `package-private` so we cannot read the internal 
[`docs`](https://github.com/apache/lucene/blob/main/lucene/core/src/java/org/apache/lucene/search/AbstractKnnVectorQuery.java#L299)
 without reflection? Moreover, they are sorted in ascending order of `docid` 
already (and the ordering along scores is lost - so this may not be as useful)
   
   We can always copy the [entire `rewrite` 
function](https://github.com/apache/lucene/blob/main/lucene/core/src/java/org/apache/lucene/search/AbstractKnnVectorQuery.java#L64-L93)
 and the 
[`DocAndScoreQuery`](https://github.com/apache/lucene/blob/main/lucene/core/src/java/org/apache/lucene/search/AbstractKnnVectorQuery.java#L297-L452)
 class into all implementers of `AbstractKnnVectorQuery` to get access to the 
`topK`, but this seems a bit overkill?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] vsop-479 commented on pull request #12528: Early terminate visit BKD leaf when current value greater than upper point in sorted dim.

2023-09-25 Thread via GitHub


vsop-479 commented on PR #12528:
URL: https://github.com/apache/lucene/pull/12528#issuecomment-1734701754

   @iverase Thanks for your recommendation, it may makes code more clear.
   I will try to implement it, and run the performance test.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org