david-sitsky commented on issue #13611: URL: https://github.com/apache/lucene/issues/13611#issuecomment-3349681498
@benwtrent - in my experience, for every query I have tried so far, getting the true "top 5" results with kNN requires setting k to a very significant percentage of the total docs in the index. Maybe this is expected/acceptable behaviour given the nature of the algorithm, but it seemed a little unexpected. I fully appreciate that for gigantic indexes this will be very beneficial, given FloatVectorSimilarityQuery will presumably be a lot slower and perhaps more memory hungry (although that is not clear to me - can you clarify?).

As another data point for @msokolov, I have an index (there are 4,875 child docs with vector fields) where, running the query I posted earlier using DiversifyingChildrenFloatKnnVectorQuery, I have to set k to 1,812 or more to get the right top 5 results:

```
Name: Re: air travel exp store-item-id: 2501 score: 0.906724
Name: Re: air travel exp store-item-id: 2511 score: 0.90642875
Name: RE: hi store-item-id: 1803 score: 0.9058552
Name: air travel exp store-item-id: 2499 score: 0.9054135
Name: Re: hi store-item-id: 1802 score: 0.90454066
```

If k is set to 1811, I get the following:

```
Name: Re: air travel exp store-item-id: 2501 score: 0.906724
Name: RE: hi store-item-id: 1803 score: 0.9058552
Name: Re: hi store-item-id: 1802 score: 0.90454066
Name: Re: store-item-id: 2350 score: 0.9039425
Name: Re: hi store-item-id: 1795 score: 0.9038706
```

I then tried, on the same index, using KnnFloatVectorQuery directly to match on the child docs with the same query vector, manually joining each hit to its parent doc when printing the results (to output the name and store-item-id fields). Curiously, I had to set k to at least 2,822 to get the right top 5 results. The results exactly match those from DiversifyingChildrenFloatKnnVectorQuery; the only difference is the value of k required to get the right top-5 results. That to me seems somewhat unexpected?
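For anyone wanting to reproduce this kind of measurement, here is a minimal, library-agnostic sketch of how the "smallest k" could be found. It does not call Lucene. `search` is a hypothetical callback standing in for whatever runs the kNN query with a given k and returns doc ids best-first, and `exact_top_n` uses a brute-force dot product as ground truth (an assumption; substitute the similarity function the vector field actually uses). The scan over k is linear because nothing guarantees that an approximate search improves monotonically as k grows.

```python
import heapq
from typing import Callable, List, Optional, Sequence


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    """Dot product; assumed here as the similarity function."""
    return sum(x * y for x, y in zip(a, b))


def exact_top_n(query: Sequence[float],
                vectors: Sequence[Sequence[float]],
                n: int) -> List[int]:
    """Brute-force ground truth: ids of the n best-scoring vectors, best first."""
    return heapq.nlargest(n, range(len(vectors)),
                          key=lambda i: dot(query, vectors[i]))


def smallest_k(search: Callable[[int], List[int]],
               truth: List[int],
               max_k: int) -> Optional[int]:
    """Smallest k for which the first len(truth) hits of search(k) equal truth.

    `search` is a hypothetical callback: it takes k and returns doc ids
    ordered best-first. Returns None if no k up to max_k reproduces truth.
    """
    n = len(truth)
    for k in range(n, max_k + 1):
        if search(k)[:n] == truth:
            return k
    return None
```

With a real index, `search` would wrap the construction of the kNN query for the given k and the call to the searcher; that part is deliberately left out here rather than guessing at exact signatures.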
Why would using KnnFloatVectorQuery require a k of 2,822 to get the right top-5 results, but DiversifyingChildrenFloatKnnVectorQuery required a k of 1,812?

Apologies for all my questions and if this is all expected. I am just trying to understand the best way to use kNN.

-- 
This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.
