benwtrent commented on PR #12191:
URL: https://github.com/apache/lucene/pull/12191#issuecomment-1482823550

   @rmuir 
   
   I have some text embedding & search numbers.
   
   Here are some initial numbers based on the BEIR benchmark. BEIR is a
collection of datasets covering a range of tasks, from question answering to
domain-specific scientific retrieval.
   
   The following numbers are with zero additional configuration. Tuning for
relevance would make it no longer zero-shot, and once you start tuning, dense
retrieval can get much better than BM25 because you can transfer-learn your
model. So, to keep the comparison fair, this is purely zero-shot, out-of-domain
performance.
   
   `text-embedding-ada-002`, which encodes to `1536` dims, has an average NDCG@10
of `53.4`.
   
   BM25 has an NDCG@10 of `41.6` over BEIR. 
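   For context, NDCG@10 is the normalized discounted cumulative gain over the
top 10 hits (BEIR's primary metric). The standard definition, with `rel_i` the
graded relevance judgment of the hit at rank `i`:

   ```latex
   % DCG@10 rewards relevant hits more the higher they rank;
   % NDCG@10 normalizes by the DCG of an ideal (perfectly sorted) ranking,
   % so scores are comparable across queries and fall in [0, 1].
   \mathrm{DCG@10}  = \sum_{i=1}^{10} \frac{2^{rel_i} - 1}{\log_2(i + 1)}
   \qquad
   \mathrm{NDCG@10} = \frac{\mathrm{DCG@10}}{\mathrm{IDCG@10}}
   ```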
   
   I would imagine combining the two would significantly boost NDCG@10 over
`text-embedding-ada-002` alone, as you get the best of both worlds.
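   As a rough sketch of what "combining the two" could look like with Lucene's
existing APIs (field names, the `k` value, and where the query embedding comes
from are all hypothetical here), one simple option is summing BM25 and vector
scores via `SHOULD` clauses in a `BooleanQuery`:

   ```java
   import org.apache.lucene.index.Term;
   import org.apache.lucene.search.BooleanClause;
   import org.apache.lucene.search.BooleanQuery;
   import org.apache.lucene.search.KnnFloatVectorQuery;
   import org.apache.lucene.search.Query;
   import org.apache.lucene.search.TermQuery;

   public class HybridQueryExample {
     // Score of a matching doc is the sum of its BM25 score from the
     // lexical clause and its similarity score from the kNN clause.
     static Query hybrid(String term, float[] queryEmbedding) {
       // Lexical clause: a real system would analyze/parse the full
       // user query rather than use a single raw term.
       Query lexical = new TermQuery(new Term("body", term));
       // Dense clause: retrieve the 100 nearest neighbors by embedding.
       Query dense = new KnnFloatVectorQuery("embedding", queryEmbedding, 100);
       return new BooleanQuery.Builder()
           .add(lexical, BooleanClause.Occur.SHOULD)
           .add(dense, BooleanClause.Occur.SHOULD)
           .build();
     }
   }
   ```

   In practice the two score scales differ (BM25 is unbounded, vector
similarity is not), so some normalization or rank fusion would likely be
needed on top of this.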
   
   I can try and find additional numbers. 
   
   Do you have a performance threshold in mind that would be adequate?
   
   I agree that we shouldn't chase every new model that is released, but the
retrieval and re-ranking performance of these LLMs is getting really good.
   
   As for performance issues, this is why I am only suggesting the increase for
byte-encoded vectors: a 2048-dim byte vector takes 2048 bytes, half the 4096
bytes of a 1024-dim float vector, so their size & performance at 2048 dims are
just as reasonable as float vectors are at 1024.
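   To make the size math concrete, here is a minimal sketch of indexing a
byte-encoded vector with the existing API (the field name and similarity
function are hypothetical, and the exact constructor may differ slightly by
Lucene 9.x version):

   ```java
   import org.apache.lucene.document.Document;
   import org.apache.lucene.document.KnnByteVectorField;
   import org.apache.lucene.index.VectorSimilarityFunction;

   public class ByteVectorSizeExample {
     public static void main(String[] args) {
       // A byte vector stores 1 byte per dimension:
       //   2048 dims * 1 byte  = 2048 bytes per vector
       // versus a float vector at 4 bytes per dimension:
       //   1024 dims * 4 bytes = 4096 bytes per vector
       byte[] embedding = new byte[2048]; // quantized model output goes here
       Document doc = new Document();
       doc.add(new KnnByteVectorField(
           "embedding", embedding, VectorSimilarityFunction.DOT_PRODUCT));
       // ... add doc to an IndexWriter as usual
     }
   }
   ```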

