veqtor commented on PR #874: URL: https://github.com/apache/lucene/pull/874#issuecomment-1517620551

> willing to take actions that go against science because vendors have told them it is right

If, as you say, an entire document, regardless of its length, content, and so on, can be represented by a vector of 768 floats, why is it that GPT-4, which internally represents each token with a vector of more than 8192 dimensions, still recalls information about entities inaccurately? Do you see the flaw in your reasoning here?

If the real issue is the use of HNSW, which isn't suitable for this, not that high-dimensionality embeddings lack value, then the solution isn't to withhold the feature but to switch to a technology better suited to the kind of applications people use Lucene for: search over large amounts of data.

If you need this functionality today, there is little reason to use anything other than FAISS. HNSW works OK, but only for up to roughly 500 embeddings; beyond that it becomes too slow. With FAISS you can hierarchically partition the vector space, and all calculations are done efficiently, as sketched below. If bringing in FAISS is too drastic, then its implementation should be studied and integrated instead.

Fast, efficient vector functionality is a must; if Lucene doesn't support it, then Lucene, and anything built on top of it, is doomed.
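As a minimal sketch of the "hierarchical partitioning" point above, here is how a FAISS inverted-file index (`IndexIVFFlat`) coarsely partitions the vector space so that queries only scan a few cells instead of the whole corpus. This is only an illustration of the idea, not a proposal for how Lucene should implement it; the dimensionality, corpus size, `nlist`, and `nprobe` values are arbitrary choices for the example.

```python
# Sketch: FAISS coarse partitioning via an inverted-file (IVF) index.
# All sizes and parameters below are illustrative only.
import numpy as np
import faiss

d = 768        # embedding dimensionality (e.g. a 768-float document vector)
nlist = 1024   # number of cells the vector space is partitioned into

# Random stand-in corpus and query vectors, purely for illustration.
xb = np.random.random((100_000, d)).astype("float32")
xq = np.random.random((10, d)).astype("float32")

# The coarse quantizer assigns each vector to one of the nlist cells.
quantizer = faiss.IndexFlatL2(d)
index = faiss.IndexIVFFlat(quantizer, d, nlist, faiss.METRIC_L2)

index.train(xb)  # learn the cell centroids (k-means)
index.add(xb)    # assign corpus vectors to cells

# At query time only nprobe cells are scanned, not the whole corpus.
index.nprobe = 16
distances, ids = index.search(xq, 10)
print(ids.shape)  # (10, 10): top-10 neighbour ids per query
```

Raising `nprobe` scans more cells, trading query speed for recall; that knob is what makes the partitioned approach tunable for large corpora.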
> willing to take actions that go against science because vendors have told them it is right If, as you say, an entire document, regardless of it's lenght, content and so on, can be represented by a vector of 768 floats, why is it then that GPT-4, which internally represents each token with a vector of more than 8192, still inaccurately recalls information about entities? Do you see the flaw in your reasoning here? If the real issue is with the use of HNSW, which isn't suitable for this, not that highe-dimensionality embeddings have value, then the solution isn't to not provide the feature, but to switch technologies to something more suitable for the type of applications that people use Lucene for: Search over large amounts of data. If you need this functionality then you have no reason to use anything else than FAISS. HNSW works ok, but only if you use it for max 500 or so embeddings, then it becomes too slow. Using FAISS you can hierarchically partition the vector space and all calculations are done efficiently. If bringing in FAISS is too drastical, then it's implementation should be studied and integrated instead. Fast efficient vector functionality is a must, if lucene doesn't support this then it and anything that builds off of it is doomed. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org --------------------------------------------------------------------- To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org