benwtrent commented on PR #14078: URL: https://github.com/apache/lucene/pull/14078#issuecomment-2713454213
Hey @lpld

> May I also ask about the selection of datasets being used for the benchmarks? How do you choose them?

I haven't tested with SIFT, though be sure to use euclidean distance when testing it. I would imagine that so few dimensions might not perform super well. There is just too much information loss.

The datasets I have been using are ones built with modern transformer-based models. Lucene Util has tooling for downloading and using Cohere multilingual (max-inner product, 768 dims).

Specifically, for this data format, we tested with the following datasets and models:

- https://huggingface.co/Snowflake/snowflake-arctic-embed-l with dbpedia
- https://huggingface.co/intfloat/e5-small with dbpedia, hotpotqa, quora, fiqa
- https://huggingface.co/thenlper/gte-base with hotpotqa, fiqa, dbpedia
- And of course, Cohere v3 multilingual https://huggingface.co/datasets/Cohere/wikipedia-2023-11-embed-multilingual-v3
- https://huggingface.co/datasets/Cohere/wikipedia-22-12-simple-embeddings
- GIST-1M (sibling dataset to SIFT), with euclidean and max-inner product.

If you are testing for a product, I would use the model that you are planning to use.
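As a rough illustration of why the similarity function needs to match the dataset (euclidean for SIFT, max-inner product for the Cohere embeddings), here is a minimal Python sketch. The vectors, document names, and helper functions are invented for this example and are not part of Lucene or Lucene Util; the point is only that the same query can rank the same documents differently under the two measures.

```python
import math

def euclidean_distance(a, b):
    # Smaller is better: straight-line distance between two vectors.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def inner_product(a, b):
    # Larger is better: rewards both alignment and vector magnitude.
    return sum(x * y for x, y in zip(a, b))

def rank_by_euclidean(query, docs):
    # Sort document names by ascending distance to the query.
    return sorted(docs, key=lambda name: euclidean_distance(query, docs[name]))

def rank_by_inner_product(query, docs):
    # Sort document names by descending inner product with the query.
    return sorted(docs, key=lambda name: inner_product(query, docs[name]), reverse=True)

# Hypothetical toy data: "near" is close to the query, "long" points the
# same way but has a much larger magnitude.
query = [1.0, 0.0]
docs = {"near": [0.9, 0.1], "long": [3.0, 0.0]}

print(rank_by_euclidean(query, docs))
print(rank_by_inner_product(query, docs))
```

With this toy data the euclidean ranking puts `near` first, while the inner-product ranking puts `long` first, so recall numbers measured with the wrong function for a dataset would not be comparable.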