benwtrent commented on PR #14173:
URL: https://github.com/apache/lucene/pull/14173#issuecomment-2634182175

   > Java limits the size of arrays (and lists) to 'int max' and does not allow 
'long' array indices. These will need to be changed to use a different data 
structure.
   
   Yeah, I don't like that we have this right now for `int` values at all. I 
don't immediately know how to solve this, but it's pretty dang expensive as it 
is now.
   
   > We use bitsets on vector ordinals for multiple use cases, and almost all 
bitsets work with integers. 
   
   Correct. Filtering, though, is always on `doc` ids, so we are OK there.
   
   I understand the complexity of keeping track of visitation to prevent 
re-visiting a node. However, I would argue that any graph-based or modern 
vector index needs to keep track of the vectors it has visited, both to avoid 
recalculating scores and to avoid exploring a previously explored path again.
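   
   To illustrate the visited-tracking point, here is a minimal sketch of a visited set addressed by `long` ordinals, split into lazily allocated pages so that no single array exceeds the `int` index limit. `PagedVisitedSet`, its method names, and the page size are assumptions made for this example; this is not existing Lucene code.

```java
// Hypothetical sketch, not Lucene code: a visited set indexed by long ordinals.
// Bits are stored in lazily allocated pages of 64-bit words, so each backing
// array stays well below Java's int-sized array limit.
final class PagedVisitedSet {
  private static final int PAGE_SHIFT = 16;               // 2^16 words per page
  private static final int PAGE_SIZE = 1 << PAGE_SHIFT;   // = 4M bits per page
  private static final int PAGE_MASK = PAGE_SIZE - 1;

  private final long numBits;
  private final long[][] pages;

  PagedVisitedSet(long numBits) {
    if (numBits < 0) {
      throw new IllegalArgumentException("numBits must be >= 0");
    }
    long numWords = (numBits >>> 6) + ((numBits & 63) != 0 ? 1 : 0);
    long numPages = (numWords >>> PAGE_SHIFT) + ((numWords & PAGE_MASK) != 0 ? 1 : 0);
    if (numPages > Integer.MAX_VALUE - 8) {
      throw new IllegalArgumentException("too many ordinals for this page size");
    }
    this.numBits = numBits;
    this.pages = new long[(int) numPages][];
  }

  /** Marks the ordinal as visited. Returns true only on the first visit. */
  boolean visit(long ordinal) {
    checkRange(ordinal);
    long word = ordinal >>> 6;
    int page = (int) (word >>> PAGE_SHIFT);
    int offset = (int) (word & PAGE_MASK);
    long[] p = pages[page];
    if (p == null) {
      p = pages[page] = new long[PAGE_SIZE];   // allocate the page on first touch
    }
    long mask = 1L << (int) (ordinal & 63);
    boolean firstVisit = (p[offset] & mask) == 0;
    p[offset] |= mask;
    return firstVisit;
  }

  boolean isVisited(long ordinal) {
    checkRange(ordinal);
    long word = ordinal >>> 6;
    long[] p = pages[(int) (word >>> PAGE_SHIFT)];
    return p != null && (p[(int) (word & PAGE_MASK)] & (1L << (int) (ordinal & 63))) != 0;
  }

  private void checkRange(long ordinal) {
    if (ordinal < 0 || ordinal >= numBits) {
      throw new IndexOutOfBoundsException("ordinal out of range: " + ordinal);
    }
  }
}
```

   Lazy paging keeps the cost proportional to the regions of the graph actually touched, but it does not by itself address the per-search cost concern raised above.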
   
   > For example, we could no longer assume that maxOrdinal is the graph size. 
   
   We can easily write out the graph size and indicate that the graph size is 
the count of vectors (that is, the number of non-deleted vectors). True, it 
gets complex with deleted docs. Likely we would need to iterate and count 
during merges, though I would expect that to be a minor cost in addition to 
all the other work we currently do during merging.
   
   I don't think this is a particular problem.
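   
   As a rough sketch of the "iterate and count during merges" idea: walk every ordinal, resolve it to its doc id, and count the ones whose doc is still live. `GraphSizeCounter`, `ordinalCount`, `ordToDoc`, and `liveDocs` are hypothetical names for this example and do not correspond to Lucene's merge APIs.

```java
import java.util.function.IntPredicate;
import java.util.function.LongToIntFunction;

// Hypothetical sketch, not Lucene code: derive the graph size by counting
// the vectors whose owning document has not been deleted.
final class GraphSizeCounter {

  /**
   * @param ordinalCount number of vector ordinals before deletes are applied (assumed input)
   * @param ordToDoc     maps a vector ordinal to its int doc id (assumed input)
   * @param liveDocs     returns true if the doc id is not deleted (assumed input)
   */
  static long countLiveVectors(long ordinalCount, LongToIntFunction ordToDoc, IntPredicate liveDocs) {
    long live = 0;
    for (long ord = 0; ord < ordinalCount; ord++) {
      if (liveDocs.test(ordToDoc.applyAsInt(ord))) {
        live++;
      }
    }
    return live;
  }
}
```

   This is a single linear pass over the ordinals, which is why it should be small next to the rest of the merge work.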
   
   > Given the volume of changes with "long" graph nodeIds, I wonder if we 
should do it when we add a new ANN implementation (like DiskANN maybe?).
   
   I don't understand how DiskANN would solve any of the previously expressed 
problems.
   
    - DiskANN still requires the graph to be available when doing insertion and 
querying.
    - We still need to keep track of filtering on docs (so bit set filtering on 
doc ids, which is OK).
    - During graph exploration, we would need to keep track of vectors already 
visited.
    - We would still need to resolve vector ordinal -> doc_id.
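
   The filtering and ordinal -> doc_id points can be sketched together: even with `long` ordinals, the filter stays an ordinary `int`-indexed bit set because it is keyed by doc id, not by ordinal. `DocFilteredOrdinals` and `ordToDoc` are hypothetical names for this example; `java.util.BitSet` stands in for whatever bit set the index actually uses.

```java
import java.util.BitSet;
import java.util.function.LongToIntFunction;

// Hypothetical sketch, not Lucene code: accept or reject a vector ordinal
// by resolving it to a doc id and checking an int-indexed filter.
final class DocFilteredOrdinals {
  private final LongToIntFunction ordToDoc;  // vector ordinal -> doc id (assumed mapping)
  private final BitSet acceptDocs;           // filter keyed by doc id

  DocFilteredOrdinals(LongToIntFunction ordToDoc, BitSet acceptDocs) {
    this.ordToDoc = ordToDoc;
    this.acceptDocs = acceptDocs;
  }

  /** Returns true if the document owning this vector passes the filter. */
  boolean accepts(long ordinal) {
    return acceptDocs.get(ordToDoc.applyAsInt(ordinal));
  }
}
```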


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


