benwtrent commented on PR #12191: URL: https://github.com/apache/lucene/pull/12191#issuecomment-1482823550
@rmuir I have some text embedding & search numbers. Here are some initial results on the BEIR benchmark, a collection of datasets ranging from question answering to domain-specific scientific retrieval. The following numbers are with zero additional configuration: tuning for relevance makes it not zero-shot, and once you start tuning, dense retrieval can get much, much better than BM25 since you can transfer-learn your model. So, to make the comparison fair, this is just zero-shot, out-of-domain performance.

- `text-embedding-ada-002`, which encodes to `1536` dims, averages NDCG@10 of `53.4` over BEIR.
- BM25 averages NDCG@10 of `41.6` over BEIR.

I would imagine combining the two would significantly boost NDCG@10 over `text-embedding-ada-002` alone, as you get the best of both worlds.

I can try to find additional numbers. Do you have an idea of a performance threshold that is adequate? I agree that we shouldn't chase every new model that is released, but the retrieval and re-ranking performance of these LLMs is getting really good.

As for performance issues, this is why I am only suggesting the increase for byte-encoded vectors: their size & performance trade-offs are just as reasonable at 2048 dims as float vectors' are at 1024.
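To make the "best of both worlds" point concrete, here is a minimal sketch of one standard way to fuse the two rankings, reciprocal rank fusion (RRF). This is purely illustrative and not part of this PR; the `embedding` field name and the use of `KnnFloatVectorQuery` (present in recent Lucene versions) are assumptions:

```java
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.KnnFloatVectorQuery;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TopDocs;

/** Illustrative sketch: fuse a BM25 ranking and a kNN ranking with RRF. */
final class HybridSearchSketch {

  /** RRF: fused(d) = sum over rankings of 1 / (k + rank(d)), ranks starting at 1. */
  static Map<Integer, Double> reciprocalRankFusion(int k, TopDocs... rankings) {
    Map<Integer, Double> fused = new HashMap<>();
    for (TopDocs ranking : rankings) {
      ScoreDoc[] hits = ranking.scoreDocs;
      for (int rank = 0; rank < hits.length; rank++) {
        // Rank-based, so BM25 and vector-similarity scores need no calibration.
        fused.merge(hits[rank].doc, 1.0 / (k + rank + 1), Double::sum);
      }
    }
    return fused;
  }

  static Map<Integer, Double> hybrid(
      IndexSearcher searcher, Query bm25Query, float[] queryVector, int topN)
      throws IOException {
    TopDocs lexical = searcher.search(bm25Query, topN); // BM25 ranking
    TopDocs dense =
        searcher.search(new KnnFloatVectorQuery("embedding", queryVector, topN), topN);
    return reciprocalRankFusion(60, lexical, dense); // k=60 is the conventional RRF constant
  }
}
```

Because RRF operates on ranks rather than raw scores, it sidesteps the fact that BM25 scores and vector similarities live on different scales.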
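On the byte-encoded side, the arithmetic behind the trade-off: at one byte per dimension, a 2048-dim byte vector is 2048 bytes per document, half the 4096 bytes of a 1024-dim float vector. A hypothetical indexing snippet, assuming the limit were raised (`KnnByteVectorField` is the byte-vector field in recent Lucene versions; the field name and similarity function are illustrative):

```java
import org.apache.lucene.document.Document;
import org.apache.lucene.document.KnnByteVectorField;
import org.apache.lucene.index.VectorSimilarityFunction;

final class ByteVectorDocSketch {
  /** Builds a document carrying a hypothetical 2048-dim byte embedding. */
  static Document withByteEmbedding(byte[] embedding) {
    // 2048 bytes per doc, vs. 4 * 1024 = 4096 bytes for a 1024-dim float vector.
    Document doc = new Document();
    doc.add(new KnnByteVectorField("embedding", embedding, VectorSimilarityFunction.DOT_PRODUCT));
    return doc;
  }
}
```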