benwtrent commented on PR #12582: URL: https://github.com/apache/lucene/pull/12582#issuecomment-1755980866
To address some of @jpountz's worries around adversarial cases, I tested one. Using Cohere-Wiki, I created 100 clusters via KMeans and indexed the documents sorted by their respective cluster labels. This ended up creating 11 total segments. Two of the segments needed to be fully requantized. The other 9 just needed their offsets recalculated.

HNSW Float32 Recall@10: `0.840`
Quantized Recall@10: `0.787`
Recall@10|15 (did I get the true top 10 when gathering 15): `0.848`

So, we can achieve similar recall without having to requantize all vectors, even in adversarial cases. Additionally, in extreme cases, we will requantize the segment and potentially recalculate the quantiles.
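For readers unfamiliar with the Recall@10|15 notation, here is a minimal sketch of how such a number could be computed. It assumes the metric is the fraction of the exact top 10 neighbors that appear among the first 15 approximate results, averaged over queries. The function names and the sample ids are hypothetical and are not part of Lucene or its benchmarking tools.

```python
def recall_at_k(true_neighbors, retrieved, k, gather=None):
    """Fraction of the exact top-k ids found among the first `gather` retrieved ids.

    With gather == k this is plain Recall@k; with gather > k it is the
    "Recall@k|gather" variant (assumed definition, see lead-in).
    """
    if gather is None:
        gather = k
    truth = set(true_neighbors[:k])
    if not truth:
        raise ValueError("need at least one true neighbor")
    found = truth.intersection(retrieved[:gather])
    return len(found) / len(truth)


def mean_recall(all_true, all_retrieved, k, gather=None):
    """Average recall_at_k over a list of queries."""
    scores = [
        recall_at_k(t, r, k, gather) for t, r in zip(all_true, all_retrieved)
    ]
    return sum(scores) / len(scores)


# Hypothetical single query: ids 1..10 are the exact top 10.
exact = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
approx = [1, 2, 3, 4, 5, 6, 7, 11, 12, 13, 8, 9, 14, 15, 16]

r10 = recall_at_k(exact, approx, k=10)                # first 10 results only
r10_15 = recall_at_k(exact, approx, k=10, gather=15)  # first 15 results
```

In this made-up query, gathering 15 candidates recovers two true neighbors that the first 10 results missed, which is the kind of effect the Recall@10|15 figure above measures.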