benwtrent commented on PR #14078:
URL: https://github.com/apache/lucene/pull/14078#issuecomment-2713454213

   Hey @lpld 
   
   > May I also ask about the selection of datasets being used for the 
benchmarks? How do you choose them?
   
   I haven't tested with SIFT. If you do, be sure to use euclidean distance. 
I would expect vectors with so few dimensions to not perform very well here. 
There is just too much information loss.
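   
   As a rough illustration of why the distance function has to match the 
dataset, here is a toy sketch in plain Python (not Lucene code; the vectors and 
helper names are made up for the example). On unnormalized vectors, euclidean 
distance and max-inner product can rank the same documents differently:

```python
import math

def euclidean_distance(a, b):
    # Smaller is better: straight-line distance between the two vectors.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def inner_product(a, b):
    # Larger is better: rewards both alignment and vector magnitude.
    return sum(x * y for x, y in zip(a, b))

def rank_by_euclidean(query, docs):
    return sorted(docs, key=lambda name: euclidean_distance(query, docs[name]))

def rank_by_inner_product(query, docs):
    return sorted(docs, key=lambda name: inner_product(query, docs[name]), reverse=True)

# Hypothetical 2-dim vectors, chosen only to make the difference visible.
query = [1.0, 0.0]
docs = {
    "near_small": [1.0, 0.1],
    "far_large": [5.0, 5.0],
    "opposite": [-1.0, 0.0],
}

euclid_order = rank_by_euclidean(query, docs)
mip_order = rank_by_inner_product(query, docs)
print(euclid_order)
print(mip_order)
```

   The large-magnitude vector wins under max-inner product but comes last 
under euclidean distance, so ground truth computed with the wrong function 
would make recall numbers meaningless.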
   
   
   
   The datasets I have been using are ones built with modern 
transformer-based models. luceneutil has tooling for downloading and using the 
Cohere multilingual dataset (max-inner product, 768 dims). 
   
   
   Specifically, for this data format, we tested with the following 
datasets and models:
   
    - https://huggingface.co/Snowflake/snowflake-arctic-embed-l with DBpedia
    - https://huggingface.co/intfloat/e5-small with DBpedia, HotpotQA, Quora, 
FiQA
    - https://huggingface.co/thenlper/gte-base with HotpotQA, FiQA, DBpedia
    - And of course, Cohere v3 multilingual: 
https://huggingface.co/datasets/Cohere/wikipedia-2023-11-embed-multilingual-v3
    - https://huggingface.co/datasets/Cohere/wikipedia-22-12-simple-embeddings
    - GIST-1M (sibling dataset to SIFT), with euclidean and max-inner product.
   
   If you are testing for a product, I would benchmark with the model that 
you are planning to use.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

