msokolov commented on issue #12627: URL: https://github.com/apache/lucene/issues/12627#issuecomment-1815705050
Yes, this is a promising avenue to explore! One note of caution: we should avoid drawing strong inferences from a single dataset. I'm especially wary of GloVe because I've noticed it seems to have poor numerical properties, and we definitely should not be testing with random vectors. Ideally we would try several datasets, but if I had to pick one I'd recommend the minilm (384-dim) vectors we computed from Wikipedia, or some internal Amazon dataset. I also know the Elastic folks have been testing with a Cohere dataset. If you have an Apache login, you can download the minilm data with sftp <username>@home.apache.org; cd /home/sokolov/public_html. You can also regenerate it using infer_vectors.py in luceneutil, but that takes a little while.