amoll75 commented on issue #16263: URL: https://github.com/apache/lucene/issues/16263#issuecomment-4718522727
Thanks for the feedback. I agree that the embedding model is ultimately responsible for how information is encoded and that Lucene itself should not introduce an arbitrary semantic bias.

However, what we are observing in practice is that, for many real-world embedding models, nearest-neighbor retrieval tends to favor very short text segments. In our corpus, these short segments often contain less useful information than slightly longer passages expressing the same concept. The effect is not caused by HNSW itself, but it manifests at retrieval time regardless of the exact cause inside the embedding model. From an application perspective, the result is that highly informative passages are often ranked below short fragments with very similar embeddings.

Your suggestion of encoding length information directly into the vector magnitude and using MIPS is interesting. The challenge is that it requires re-embedding or at least reprocessing the entire corpus and committing to a specific length-biasing strategy at indexing time. What motivated my proposal was the possibility of applying such corrections at retrieval time instead. This would allow experimentation with different normalization functions without rebuilding embeddings and would remain compatible with existing indexes.

More generally, I think this could be useful beyond passage-length normalization. Many applications have document-level signals available at indexing time that are difficult or impossible to encode directly into the embedding itself. Examples include document recency, authority, popularity, quality scores, or passage length. Such signals can often be represented as a simple numeric factor and combined with the vector similarity score.

Perhaps this is better viewed not as a Lucene bug, but as a feature request for optional score-adjustment mechanisms that can incorporate document-level statistics during vector retrieval.
This would enable users to efficiently combine semantic similarity with additional ranking signals without requiring custom post-processing over large candidate sets.

I'd be interested to hear whether others have observed similar ranking behavior with modern embedding models and large passage collections, and whether a generic mechanism for score normalization or score boosting based on stored document attributes would be considered useful.
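
To make the retrieval-time adjustment concrete, here is a minimal sketch in plain Python. It is not Lucene API: `length_factor`, `rescore`, the `pivot` and `strength` parameters, and the toy vectors are all hypothetical names and values chosen for illustration. It assumes unit-length vectors with non-negative similarity scores and a per-document token count stored alongside each vector.

```python
import math


def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def normalize(v):
    """Scale a vector to unit length so the dot product equals cosine similarity."""
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]


def length_factor(num_tokens, pivot=64, strength=0.2):
    """Hypothetical length penalty in [1 - strength, 1.0].

    Passages with at least `pivot` tokens keep their full score;
    shorter ones are damped on a logarithmic scale.
    """
    fullness = min(1.0, math.log1p(num_tokens) / math.log1p(pivot))
    return 1.0 - strength * (1.0 - fullness)


def rescore(query, candidates, factor_fn=length_factor):
    """Multiply each similarity by a per-document factor, then sort descending.

    `candidates` is a list of (doc_id, unit_vector, num_tokens) tuples.
    """
    scored = [(doc_id, dot(query, vec) * factor_fn(n)) for doc_id, vec, n in candidates]
    return sorted(scored, key=lambda item: item[1], reverse=True)


# Toy data (assumed): a short fragment slightly closer to the query
# than a longer passage expressing the same concept.
query = normalize([1.0, 0.0])
candidates = [
    ("short", normalize([0.99, 0.10]), 5),
    ("long", normalize([0.98, 0.20]), 80),
]

raw = rescore(query, candidates, factor_fn=lambda n: 1.0)  # plain cosine ranking
adjusted = rescore(query, candidates)                      # length-adjusted ranking

print([doc_id for doc_id, _ in raw])       # ['short', 'long']
print([doc_id for doc_id, _ in adjusted])  # ['long', 'short']
```

For a multiplicative factor like this one, the adjusted score equals the inner product of the query with the document vector scaled by the same factor, so in this simple form it produces the same scores as the magnitude-plus-MIPS encoding. The difference is that the factor can be changed per query without touching the stored vectors.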
