benwtrent opened a new issue, #15839:
URL: https://github.com/apache/lucene/issues/15839
### Description
Beyond the general performance improvements that could be made to HNSW &
block join queries, I think there are ways for us to improve the vector story
as a whole.
A major one is scoring ALL child docs of a nearest parent doc at once.
When we score candidates, we would score together the candidates that are
children of the same parent, and ALSO score ALL the other children of that
common parent.
This would be complicated, and a POC would likely be required to prove whether
it's useful, but it would allow us to:
- Bulk score all matching children in a block (super fast, locality on
disk, etc.)
- Bulk collect all the children within a parent, keeping the child-to-parent
translation simple
- Caveat: it MAY increase the scoring count (e.g. now for a single node in the
graph, we might score 10s or 100s of vectors :/), so maybe it's something that
only kicks in once we get further into the graph...
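
To make the bulk-scoring idea concrete, here is a rough sketch in plain Java.
It is not Lucene code: `BlockBulkScorer`, `BlockResult`, `scoreBlock` and
`dotProduct` are hypothetical names, vectors are addressed directly by doc id
(real readers use ordinals), and every child is assumed to have a vector and
to match the filter. The only thing it relies on is the block join layout,
where children are indexed contiguously right before their parent, so one
parent bitset lookup gives the whole range to score.

```java
import java.util.BitSet;

/** Hypothetical sketch (not Lucene API): score every child in a candidate's parent block. */
final class BlockBulkScorer {

  /** Outcome of scoring one block. */
  static final class BlockResult {
    final int parentDoc;
    final int bestChild;
    final float bestScore;
    final int scoredCount;

    BlockResult(int parentDoc, int bestChild, float bestScore, int scoredCount) {
      this.parentDoc = parentDoc;
      this.bestChild = bestChild;
      this.bestScore = bestScore;
      this.scoredCount = scoredCount;
    }
  }

  /** Placeholder similarity; a real scorer would come from the vector format. */
  static float dotProduct(float[] a, float[] b) {
    float sum = 0f;
    for (int i = 0; i < a.length; i++) {
      sum += a[i] * b[i];
    }
    return sum;
  }

  /** Scores all children that share a parent with candidateChild, in doc id order. */
  static BlockResult scoreBlock(
      int candidateChild, BitSet parents, float[][] vectors, float[] query) {
    if (parents.get(candidateChild)) {
      throw new IllegalArgumentException("candidate is a parent doc: " + candidateChild);
    }
    int parent = parents.nextSetBit(candidateChild);
    if (parent < 0) {
      throw new IllegalArgumentException("no parent after doc " + candidateChild);
    }
    // Children sit contiguously between the previous parent and this one.
    int firstChild = parents.previousSetBit(candidateChild) + 1;
    int bestChild = -1;
    float bestScore = Float.NEGATIVE_INFINITY;
    int scored = 0;
    for (int doc = firstChild; doc < parent; doc++) {
      float score = dotProduct(query, vectors[doc]);
      scored++;
      if (score > bestScore) {
        bestScore = score;
        bestChild = doc;
      }
    }
    return new BlockResult(parent, bestChild, bestScore, scored);
  }
}
```

The `scoredCount` field is there because of the caveat above: a POC would
want to compare how many vectors get scored per graph node with and without
this approach.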
It's logical to assume that children of the same parent are all near each
other in the graph.
This will be a pretty large departure from the current API design. The
KnnCollector would need to:
- Provide the scoring logic (but not the score methodology)
- Keep track of "visited" nodes
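
A rough illustration of those two responsibilities, in plain Java. This does
not implement or mirror the real `KnnCollector` interface;
`BlockAwareCollectorSketch` and its methods are made-up names. The assumption
is that the collector decides which docs get scored and when (the scoring
logic) and owns the visited set, while the similarity itself (the score
methodology) is injected as a function from doc id to score.

```java
import java.util.BitSet;
import java.util.HashMap;
import java.util.Map;
import java.util.function.IntToDoubleFunction;

/** Hypothetical sketch (not the real KnnCollector): owns visited state and drives block scoring. */
final class BlockAwareCollectorSketch {
  private final BitSet parents;
  // The score methodology is supplied from outside; the collector only decides what to score.
  private final IntToDoubleFunction similarity;
  private final BitSet visited = new BitSet();
  private final Map<Integer, Double> bestByParent = new HashMap<>();
  private int scoredCount;

  BlockAwareCollectorSketch(BitSet parents, IntToDoubleFunction similarity) {
    this.parents = parents;
    this.similarity = similarity;
  }

  /** Returns true if the candidate's block was scored, false if the candidate was already visited. */
  boolean collectCandidate(int childDoc) {
    if (visited.get(childDoc)) {
      return false;
    }
    int parent = parents.nextSetBit(childDoc);
    if (parent < 0 || parent == childDoc) {
      throw new IllegalArgumentException("not a child doc: " + childDoc);
    }
    int firstChild = parents.previousSetBit(childDoc) + 1;
    double best = Double.NEGATIVE_INFINITY;
    for (int doc = firstChild; doc < parent; doc++) {
      // Mark siblings visited so the graph search never re-scores this block.
      visited.set(doc);
      scoredCount++;
      best = Math.max(best, similarity.applyAsDouble(doc));
    }
    bestByParent.put(parent, best);
    return true;
  }

  /** Total number of vectors scored so far. */
  int scoredCount() {
    return scoredCount;
  }

  /** Best child score recorded for a parent, or null if its block was never scored. */
  Double bestScore(int parentDoc) {
    return bestByParent.get(parentDoc);
  }
}
```

Because the whole block is marked visited in one go, a later graph hop that
lands on a sibling costs a single bitset lookup instead of another score.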
This would also enable some neat extras, like the ability to return the
average, min, and max score for a parent, and more than just the single
nearest vector (e.g. could return the top 5 or whatever).
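
Once all children of a parent are scored together, those aggregates fall out
almost for free. A small plain-Java sketch, assuming the per-child scores for
one parent are already available as parallel arrays; `ParentScoreSummary` and
`summarize` are hypothetical names, not existing Lucene classes:

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch: per-parent aggregates over the scores of all its children. */
final class ParentScoreSummary {
  final double min;
  final double max;
  final double average;
  final List<Integer> topChildren; // child doc ids, best score first

  private ParentScoreSummary(double min, double max, double average, List<Integer> topChildren) {
    this.min = min;
    this.max = max;
    this.average = average;
    this.topChildren = topChildren;
  }

  /** childDocs[i] was scored as scores[i]; keeps the topN best children. */
  static ParentScoreSummary summarize(int[] childDocs, double[] scores, int topN) {
    if (childDocs.length == 0 || childDocs.length != scores.length) {
      throw new IllegalArgumentException("need one score per child and at least one child");
    }
    double min = Double.POSITIVE_INFINITY;
    double max = Double.NEGATIVE_INFINITY;
    double sum = 0d;
    List<Integer> order = new ArrayList<>();
    for (int i = 0; i < scores.length; i++) {
      min = Math.min(min, scores[i]);
      max = Math.max(max, scores[i]);
      sum += scores[i];
      order.add(i);
    }
    // Sort child positions by descending score; blocks are small, so a full sort is fine.
    order.sort((a, b) -> Double.compare(scores[b], scores[a]));
    List<Integer> top = new ArrayList<>();
    for (int i = 0; i < Math.min(topN, order.size()); i++) {
      top.add(childDocs[order.get(i)]);
    }
    return new ParentScoreSummary(min, max, sum / scores.length, top);
  }
}
```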
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]