david-sitsky commented on issue #13611:
URL: https://github.com/apache/lucene/issues/13611#issuecomment-3349681498

   @benwtrent - my experience so far, across all my queries, is that to get the 
true "top 5" results with kNN, I need to set k to a very significant percentage 
of the total docs in the index.  Maybe this is expected/acceptable behaviour 
given the nature of the algorithm, but it seemed a little unexpected.  I fully 
appreciate that for gigantic indexes kNN will be very beneficial, given that 
FloatVectorSimilarityQuery will presumably be a lot slower and perhaps more 
memory hungry (although that is not clear to me - can you clarify?).
   
   As another data point for @msokolov: I have an index containing 4,875 child 
docs with vector fields.  Running the query I posted earlier using 
DiversifyingChildrenFloatKnnVectorQuery, I have to set k to 1,812 or more to 
get the right top 5 results:
   ```
   Name: Re: air travel exp store-item-id: 2501 score: 0.906724
   Name: Re: air travel exp store-item-id: 2511 score: 0.90642875
   Name: RE: hi store-item-id: 1803 score: 0.9058552
   Name: air travel exp store-item-id: 2499 score: 0.9054135
   Name: Re: hi store-item-id: 1802 score: 0.90454066
   ```
   If k is set to 1,811, I get the following:
   ```
   Name: Re: air travel exp store-item-id: 2501 score: 0.906724
   Name: RE: hi store-item-id: 1803 score: 0.9058552
   Name: Re: hi store-item-id: 1802 score: 0.90454066
   Name: Re: store-item-id: 2350 score: 0.9039425
   Name: Re: hi store-item-id: 1795 score: 0.9038706
   ```
   I then tried, on the same index, using KnnFloatVectorQuery directly to match 
on the child docs with the same query vector, manually joining each hit to its 
associated parent doc when printing the results (to output the name and 
store-item-id fields).  Curiously, I had to set k to at least 2,822 to get the 
right top 5 results.  The results exactly match those from 
DiversifyingChildrenFloatKnnVectorQuery; the only difference is the value of k 
required to get the right top 5 results.
   
   That to me seems somewhat unexpected?  Why would using KnnFloatVectorQuery 
require a k of 2,822 to get the right top-5 results, but 
DiversifyingChildrenFloatKnnVectorQuery required a k of 1,812?
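
   For anyone wanting to reproduce this kind of comparison, one way to pin down 
the "right" top 5 independently of any kNN query is an exact brute-force 
ranking, and then to measure recall against it.  Below is a minimal sketch in 
plain Java, not Lucene code: the class name RecallCheck, its helper methods and 
the toy vectors are all hypothetical, and it assumes cosine similarity and that 
every vector fits in memory.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper, not part of Lucene: exact top-n baseline plus recall.
public class RecallCheck {

    /** Cosine similarity between two equal-length vectors. */
    public static double cosine(float[] a, float[] b) {
        double dot = 0, na = 0, nb = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            na += a[i] * a[i];
            nb += b[i] * b[i];
        }
        return dot / (Math.sqrt(na) * Math.sqrt(nb));
    }

    /** Exact top-n ids by brute force: score every vector, sort descending. */
    public static List<Integer> exactTopN(float[][] vectors, float[] query, int n) {
        List<Integer> ids = new ArrayList<>();
        for (int i = 0; i < vectors.length; i++) {
            ids.add(i);
        }
        ids.sort(Comparator.comparingDouble((Integer i) -> -cosine(vectors[i], query)));
        return new ArrayList<>(ids.subList(0, Math.min(n, ids.size())));
    }

    /** Fraction of the exact top-n that also appears in the approximate results. */
    public static double recall(List<Integer> exact, List<Integer> approx) {
        Set<Integer> found = new HashSet<>(approx);
        long hits = exact.stream().filter(found::contains).count();
        return exact.isEmpty() ? 1.0 : (double) hits / exact.size();
    }

    public static void main(String[] args) {
        // Toy data standing in for the child-doc vectors.
        float[][] vectors = { {1, 0}, {0.9f, 0.1f}, {0, 1}, {0.7f, 0.7f}, {-1, 0} };
        float[] query = {1, 0};
        List<Integer> exact = exactTopN(vectors, query, 3);
        // Pretend these ids came back from an approximate kNN search.
        List<Integer> approx = Arrays.asList(0, 3, 2);
        System.out.println("exact top 3 = " + exact);
        System.out.println("recall@3 = " + recall(exact, approx));
    }
}
```

   Running the same recall calculation for increasing values of k, with the ids 
returned by each query type as the approximate list, would show where each one 
first reaches a recall of 1.0 for the top 5.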
   
   Apologies for all my questions and if this is all expected.  I am just 
trying to understand the best way to use kNN.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

