jimczi commented on PR #12191:
URL: https://github.com/apache/lucene/pull/12191#issuecomment-1485046957

   I think that's fair @rmuir. I share some of these concerns too, although I 
don't think it's accurate to say that "HNSW doesn't scale at all (time, 
memory space) and there seems to be no plan to look into alternative". 
   The memory constraints for HNSW alone are not gigantic. For each vector we 
need M * 4 * 2 bytes, where M is the number of links per vector (we default to 
16). That's 128 bytes per vector, which can easily be reduced, at the cost of 
some relevancy, by lowering M.
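   As a back-of-the-envelope sketch (my own arithmetic, not Lucene internals), 
that per-vector link overhead works out like this:

```java
// Rough estimate of per-vector HNSW link memory, mirroring the M * 4 * 2
// formula above (M links, 4 bytes per link, times 2).
class HnswMemory {
  static long linkBytesPerVector(int m) {
    return (long) m * 4 * 2; // 128 bytes when m = 16 (the default)
  }
}
```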
   
   Regarding the number of dimensions, the fact that we need to compute the 
similarity over the full vectors is not inherent to HNSW. A coarse quantizer 
would also require computing the similarity over the full-dimensional vectors 
to reach good recall. The nice property of HNSW is that it reduces the number 
of comparisons at indexing and query time to a minimum at any scale.
   
   So regarding alternatives, I think one direction could be to look at 
product quantization, which would reduce the number of dimensions and provide 
a similarity function on the reduced space. Something like 
https://hal.inria.fr/inria-00514462v2/document could be "easily" implemented on 
top of the HNSW indexing and querying to reduce the memory needed for 
high-dimension use cases. Any quantization that reduces the number of 
dimensions would do, in fact. We should have options here, and the hard limit 
of 1024 could remain as the threshold at which quantization should be 
preferred. 
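   To make the idea concrete, here is a hypothetical sketch of product 
quantization with asymmetric distance computation, in the spirit of the paper 
linked above. Nothing here exists in Lucene; the codebooks are assumed to be 
trained elsewhere (e.g. with k-means), and the class and method names are made 
up for illustration.

```java
// Hypothetical product-quantization sketch: each vector is split into
// subvectors, each subvector is replaced by the id of its nearest centroid,
// and query time uses per-subspace lookup tables (asymmetric distance
// computation). Assumes at most 256 centroids per subspace (byte codes).
class ProductQuantizer {
  final float[][][] codebooks; // [subspace][centroid][subDim], trained elsewhere
  final int numSubspaces;
  final int subDim;

  ProductQuantizer(float[][][] codebooks) {
    this.codebooks = codebooks;
    this.numSubspaces = codebooks.length;
    this.subDim = codebooks[0][0].length;
  }

  /** Encode a full vector into one centroid id per subspace. */
  byte[] encode(float[] vector) {
    byte[] codes = new byte[numSubspaces];
    for (int s = 0; s < numSubspaces; s++) {
      int best = 0;
      float bestDist = Float.MAX_VALUE;
      for (int c = 0; c < codebooks[s].length; c++) {
        float d = l2(vector, s * subDim, codebooks[s][c]);
        if (d < bestDist) {
          bestDist = d;
          best = c;
        }
      }
      codes[s] = (byte) best;
    }
    return codes;
  }

  /** Precompute query-to-centroid distances for each subspace. */
  float[][] distanceTables(float[] query) {
    float[][] tables = new float[numSubspaces][];
    for (int s = 0; s < numSubspaces; s++) {
      tables[s] = new float[codebooks[s].length];
      for (int c = 0; c < codebooks[s].length; c++) {
        tables[s][c] = l2(query, s * subDim, codebooks[s][c]);
      }
    }
    return tables;
  }

  /** Approximate distance to an encoded vector via table lookups only. */
  float adcDistance(float[][] tables, byte[] codes) {
    float sum = 0;
    for (int s = 0; s < numSubspaces; s++) {
      sum += tables[s][codes[s] & 0xFF];
    }
    return sum;
  }

  private float l2(float[] v, int offset, float[] centroid) {
    float sum = 0;
    for (int i = 0; i < centroid.length; i++) {
      float diff = v[offset + i] - centroid[i];
      sum += diff * diff;
    }
    return sum;
  }
}
```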
   
   Another direction would be to look at IVF, a coarse quantizer that fits 
well into the inverted-list design. The main advantage here would be to get 
rid of the graph and its associated memory. But again, the number of 
dimensions would still be a constraint, since computing the final score 
requires the full-dimensional vectors. Product quantization would still be 
needed to reduce the constraints, at the cost of relevancy. 
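   For comparison, here is a minimal IVF sketch (again hypothetical code, not 
Lucene's): vectors are bucketed under their nearest coarse centroid at index 
time, and only the nProbe closest lists are scanned, with full-dimension 
distances, at query time.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Hypothetical IVF sketch: a coarse codebook partitions the vectors into
// inverted lists; a query only scans the nProbe closest lists.
class IvfIndex {
  final float[][] centroids;               // coarse codebook, trained elsewhere
  final List<List<float[]>> invertedLists; // one list of vectors per centroid

  IvfIndex(float[][] centroids) {
    this.centroids = centroids;
    this.invertedLists = new ArrayList<>();
    for (int i = 0; i < centroids.length; i++) {
      invertedLists.add(new ArrayList<>());
    }
  }

  void add(float[] vector) {
    invertedLists.get(nearestCentroid(vector)).add(vector);
  }

  /** Scan the nProbe closest lists; returns the best (smallest) distance
   *  found, as a stand-in for a real top-k search. */
  float search(float[] query, int nProbe) {
    Integer[] order = new Integer[centroids.length];
    for (int i = 0; i < order.length; i++) order[i] = i;
    Arrays.sort(order, Comparator.comparingDouble(i -> l2(query, centroids[i])));
    float best = Float.MAX_VALUE;
    for (int p = 0; p < Math.min(nProbe, order.length); p++) {
      for (float[] candidate : invertedLists.get(order[p])) {
        best = Math.min(best, l2(query, candidate)); // full-dimension distance
      }
    }
    return best;
  }

  private int nearestCentroid(float[] vector) {
    int best = 0;
    float bestDist = Float.MAX_VALUE;
    for (int c = 0; c < centroids.length; c++) {
      float d = l2(vector, centroids[c]);
      if (d < bestDist) { bestDist = d; best = c; }
    }
    return best;
  }

  private static float l2(float[] a, float[] b) {
    float sum = 0;
    for (int i = 0; i < a.length; i++) {
      float diff = a[i] - b[i];
      sum += diff * diff;
    }
    return sum;
  }
}
```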
   
   There's also the problem of merges. A merge is single-threaded, so we're 
limited when it comes to building bigger graphs. 
   There are ways to mitigate this issue: we can increase the index buffer 
size to create bigger graphs, and we can limit the size of segments to ensure 
that we don't rebuild graphs past a certain size. The fact that we build a 
graph per segment can have benefits though. With concurrent segment search, it 
should be possible to run a single request in parallel on each graph/segment 
(see the sketch below). We could also investigate using multiple threads when 
merging a single segment. Other solutions that implement HNSW use multiple 
threads to index into a single graph. 
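   A minimal sketch of the concurrent-segment-search idea, assuming the 
executor-based IndexSearcher constructor and KnnFloatVectorQuery; the field 
name "vector", the pool size and k are arbitrary here:

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.KnnFloatVectorQuery;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.Directory;

class ParallelKnnSearch {
  static TopDocs search(Directory directory, float[] queryVector) throws Exception {
    ExecutorService executor = Executors.newFixedThreadPool(4);
    try (DirectoryReader reader = DirectoryReader.open(directory)) {
      // With an executor, the searcher can visit segments (and thus their
      // per-segment HNSW graphs) concurrently for a single request.
      IndexSearcher searcher = new IndexSearcher(reader, executor);
      return searcher.search(new KnnFloatVectorQuery("vector", queryVector, 10), 10);
    } finally {
      executor.shutdown();
    }
  }
}
```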
   
    Today we store these vectors in a file and access them through mmap. 
That's a differentiator compared to other solutions that require them to be 
fully in memory. Considering the random access pattern of HNSW, I am not sure 
it's a real advantage, but on the other hand the number of comparisons for 
indexing and querying is reduced to a minimum. 1024 dimensions is already 
quite a lot; there are benchmarks with 768 dimensions and more in 
ann-benchmarks, so it would be interesting to run those on a machine with 
restricted RAM. With the right setup, I'd assume that Lucene would not crash 
but that indexing would be really slow. 
   
   > How long does it take for me to index 50 million documents?
   
   That's a good question. I am also interested in understanding the memory 
requirements associated with it.
   With 1024 dimensions and M set to 16, the entire graph would take 6GB and 
the vectors 190GB!
   Assuming that we have enough RAM, it would take almost 14 hours if we can 
achieve 1000 docs/s. The current solution can index faster than that, but I 
don't think it was ever tested with constrained memory. 
   I think we need a test benchmark to check our assumptions. There's nothing 
in Lucene's docs (or ES/Solr's) that warns against these constraints, and we 
don't know how the solution degrades.
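   A quick back-of-the-envelope check of those numbers (my own arithmetic, not 
a measurement):

```java
// Estimate for 50M docs, 1024 dims, M = 16, float32 vectors.
public class VectorIndexEstimate {
  public static void main(String[] args) {
    long numDocs = 50_000_000L;
    int dims = 1024;
    int m = 16;
    long graphBytes = numDocs * m * 4 * 2;    // links: ~6.4e9 bytes  ~ 6 GiB
    long vectorBytes = numDocs * dims * 4L;   // vectors: ~2.0e11 bytes ~ 190 GiB
    double hours = numDocs / 1000.0 / 3600.0; // ~ 13.9 hours at 1000 docs/s
    System.out.printf("graph=%.1f GiB, vectors=%.1f GiB, time=%.1f h%n",
        graphBytes / Math.pow(1024, 3), vectorBytes / Math.pow(1024, 3), hours);
  }
}
```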
   As noted above, the random access pattern of HNSW makes me unsure whether 
the mmap approach is a real advantage under constrained memory, but since the 
number of comparisons for indexing and querying is reduced to a minimum, a 
good SSD could mitigate this? We need tests.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

