jimczi commented on PR #12191: URL: https://github.com/apache/lucene/pull/12191#issuecomment-1485046957
I think that's fair @rmuir, I share some of these concerns. Although I don't think it's accurate to say that "HNSW doesn't scale at all (time, memory space) and there seems to be no plan to look into alternative". The memory constraints for HNSW alone are not gigantic: for each vector we need `M * 4 * 2` bytes, where `M` is the number of links per vector (we default to 16). That's 128 bytes per vector, which can easily be reduced, at the cost of some relevancy, by lowering `M`. Regarding the number of dimensions, the fact that we need to compute the similarity on the full vectors is not inherent to HNSW; a coarse quantizer would also require computing the similarity over the entire vector space to reach good recall. The nice property of HNSW is that it reduces the number of comparisons at indexing and query time to a minimum, at any scale.

So regarding alternatives, I think one direction could be product quantization, which would reduce the number of dimensions and provide a similarity function on the reduced space. Something like https://hal.inria.fr/inria-00514462v2/document could be "easily" implemented on top of the HNSW indexing and querying to reduce the memory needed for high-dimension use cases (see the rough sketch further down). Any quantization that reduces the number of dimensions would do, in fact. We should have options here, and the hard limit of 1024 could stay as the point at which quantization should be preferred.

Another direction would be to look at IVF, a coarse quantizer that fits well in the inverted-list design. The main advantage here would be to get rid of the graph and its associated memory. But again, the number of dimensions would still be a constraint, since it requires computing the final score on the full vectors. Product quantization would still be needed to reduce the constraints, at the cost of relevancy.

There's also the problem of merges. A merge is single-threaded, so we're limited when it comes to building bigger graphs. There are ways to mitigate this: we can increase the index buffer size to create bigger graphs, and we can limit the size of segments to ensure that we don't rebuild graphs past a certain size. The fact that we build a graph per segment can have benefits, though: with concurrent segment search, it should be possible to run a single request in parallel on each graph/segment. We could also investigate using multiple threads when merging a single segment; the other solutions that implement HNSW use multiple threads to index into a single graph.

1024 dimensions is already quite a lot. There are benchmarks with 768 dimensions and more in ann-benchmarks, so it would be interesting to run those on a machine with restricted RAM. With the right setup, I'd assume that Lucene would not crash, but the indexing would be really slow.

> How long does it take for me to index 50 million documents?

That's a good question; I am also interested in understanding the memory requirements associated with it. With 1024 dimensions and `M` set to 16, the entire graph would take 6GB and the vectors 190GB! Assuming that we have enough RAM, it would take almost 14 hours if we can achieve 1000 docs/s.
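To make the product-quantization direction a bit more concrete, here is a minimal sketch of PQ encoding plus asymmetric distance computation (ADC), in the spirit of the Jégou et al. paper linked above. This is not an existing Lucene API: the class, the 256-centroid choice and the pre-trained codebooks are all assumptions of mine (the codebooks would normally be trained offline with k-means on a sample of the vectors).

```java
// Hypothetical sketch, not a Lucene API: product quantization with pre-trained codebooks.
// Each vector is split into numSubspaces sub-vectors; each sub-vector is replaced by the id
// of its nearest centroid, so a 1024-dim float vector (4096 bytes) becomes numSubspaces bytes.
final class ProductQuantizer {
  private final int numSubspaces;      // number of sub-vectors per vector
  private final int subDim;            // dimensions per sub-vector
  private final float[][][] codebooks; // [subspace][centroid (up to 256)][subDim], trained offline

  ProductQuantizer(int dims, int numSubspaces, float[][][] codebooks) {
    this.numSubspaces = numSubspaces;
    this.subDim = dims / numSubspaces;
    this.codebooks = codebooks;
  }

  /** Encode a full-dimension vector into numSubspaces byte codes. */
  byte[] encode(float[] vector) {
    byte[] codes = new byte[numSubspaces];
    for (int s = 0; s < numSubspaces; s++) {
      int best = 0;
      float bestDist = Float.MAX_VALUE;
      for (int c = 0; c < codebooks[s].length; c++) {
        float d = l2(vector, s * subDim, codebooks[s][c]);
        if (d < bestDist) {
          bestDist = d;
          best = c;
        }
      }
      codes[s] = (byte) best;
    }
    return codes;
  }

  /**
   * Per-query lookup table for asymmetric distances: the (un-encoded) query against each
   * centroid, so scoring an encoded vector costs only numSubspaces table lookups.
   */
  float[][] buildLookupTable(float[] query) {
    float[][] table = new float[numSubspaces][];
    for (int s = 0; s < numSubspaces; s++) {
      table[s] = new float[codebooks[s].length];
      for (int c = 0; c < codebooks[s].length; c++) {
        table[s][c] = l2(query, s * subDim, codebooks[s][c]);
      }
    }
    return table;
  }

  /** Approximate distance between the query (via its lookup table) and an encoded vector. */
  float adcDistance(float[][] table, byte[] codes) {
    float sum = 0;
    for (int s = 0; s < numSubspaces; s++) {
      sum += table[s][codes[s] & 0xFF];
    }
    return sum;
  }

  private float l2(float[] v, int offset, float[] centroid) {
    float sum = 0;
    for (int i = 0; i < centroid.length; i++) {
      float diff = v[offset + i] - centroid[i];
      sum += diff * diff;
    }
    return sum;
  }
}
```

With 1024 dimensions split into, say, 64 sub-spaces, each vector shrinks from 4KB of floats to 64 bytes of codes, at the cost of some recall; the HNSW graph could be traversed using ADC distances, optionally re-ranking the top candidates with the full-precision vectors.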
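For completeness, the raw arithmetic behind those numbers (my own back-of-the-envelope math, nothing Lucene-specific):

```java
// Back-of-the-envelope estimate for 50M vectors, 1024 dims, M = 16, at 1000 docs/s.
public class VectorMemoryEstimate {
  public static void main(String[] args) {
    long numDocs = 50_000_000L;
    int dims = 1024;
    int maxConn = 16; // M, links per vector

    long graphBytes = numDocs * maxConn * 4 * 2;            // ~6.4 GB of HNSW links
    long vectorBytes = numDocs * (long) dims * Float.BYTES; // ~205 GB (~190 GiB) of raw floats
    double hours = numDocs / 1000.0 / 3600.0;               // ~13.9 hours at 1000 docs/s

    System.out.printf("graph: %.1f GB, vectors: %.1f GB, time at 1000 docs/s: %.1f h%n",
        graphBytes / 1e9, vectorBytes / 1e9, hours);
  }
}
```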
The current solution can index faster than 1000 docs/s, but I don't think it was ever tested with constrained memory. I think we need test benchmarks to check our assumptions; there's nothing in Lucene's docs (or ES/Solr's) that warns about these constraints, and we don't know how the solution degrades. Today we store these vectors on file and access them through MMAP. That's a differentiator compared to other solutions that require them to be fully in memory. Considering the (random) access pattern of HNSW, I am not sure that it's a real advantage, but on the other hand the number of comparisons for indexing and querying is reduced to a minimum, so a good SSD could mitigate this? We need tests.
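To make the "we need tests" point concrete, here is roughly the kind of standalone benchmark I have in mind. It's only a sketch: the doc count, dimensions, path, thread count and random vectors are placeholders, it just uses the stock `KnnFloatVectorField`/`KnnFloatVectorQuery` APIs, and the interesting part would be running it on a box with deliberately restricted RAM (and a good SSD) to see how indexing throughput and query latency degrade.

```java
import java.nio.file.Path;
import java.util.Random;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.KnnFloatVectorField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.VectorSimilarityFunction;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.KnnFloatVectorQuery;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.FSDirectory;

public class VectorIndexingBench {
  public static void main(String[] args) throws Exception {
    int numDocs = 1_000_000; // scale up towards 50M for the real run
    int dims = 1024;
    Random random = new Random(42);

    try (FSDirectory dir = FSDirectory.open(Path.of("/tmp/vector-bench"))) {
      IndexWriterConfig config = new IndexWriterConfig();
      config.setRAMBufferSizeMB(1024); // bigger index buffer -> bigger per-segment graphs, fewer merges

      long start = System.nanoTime();
      try (IndexWriter writer = new IndexWriter(dir, config)) {
        for (int i = 0; i < numDocs; i++) {
          float[] vector = new float[dims];
          for (int j = 0; j < dims; j++) {
            vector[j] = random.nextFloat();
          }
          Document doc = new Document();
          doc.add(new KnnFloatVectorField("vector", vector, VectorSimilarityFunction.EUCLIDEAN));
          writer.addDocument(doc);
        }
        writer.commit();
      }
      double docsPerSec = numDocs / ((System.nanoTime() - start) / 1e9);
      System.out.printf("indexed %d docs at %.0f docs/s%n", numDocs, docsPerSec);

      // Query side: pass an executor so segments (and their graphs) can be searched concurrently.
      ExecutorService executor = Executors.newFixedThreadPool(4);
      try (DirectoryReader reader = DirectoryReader.open(dir)) {
        IndexSearcher searcher = new IndexSearcher(reader, executor);
        float[] query = new float[dims];
        for (int j = 0; j < dims; j++) {
          query[j] = random.nextFloat();
        }
        TopDocs topDocs = searcher.search(new KnnFloatVectorQuery("vector", query, 10), 10);
        System.out.println("top hits: " + topDocs.totalHits);
      } finally {
        executor.shutdown();
      }
    }
  }
}
```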