benwtrent commented on PR #14173: URL: https://github.com/apache/lucene/pull/14173#issuecomment-2634182175
> Java limits the size of arrays (and lists) to 'int max' and does not allow 'long' array indices. These will need to be changed to use a different data structure.

Yeah, I don't like that we have this right now for `int` values at all. I don't immediately know how to solve this, but it's pretty dang expensive as it is now.

> We use bitsets on vector ordinals for multiple use cases, and almost all bitsets work with integers.

Correct, though filtering is always on `doc` ids, so we are OK there. I understand the complexity of keeping track of visitation to prevent re-visiting a node. However, I would argue that any graph-based or modern vector index needs to keep track of the vectors it has visited, both to avoid recalculating scores and to avoid exploring a previously explored path again.

> For example, we could no longer assume that maxOrdinal is the graph size.

We can easily write out the graph size and indicate that the graph size is the count of vectors (that is, the number of non-deleted vectors). True, it gets complex with deleted docs. Likely we would need to iterate and count during merges, though I would expect that to be a minor cost on top of all the other work we currently do during merging. I don't think this is a particular problem.

> Given the volume of changes with "long" graph nodeIds, I wonder if we should do it when we add a new ANN implementation (like DiskANN maybe?).

I don't understand how DiskANN would solve any of the previously expressed problems.

- DiskANN still requires the graph to be available when doing insertion and querying.
- We still need to keep track of filtering on docs (so bit set filtering on doc ids, which is OK).
- During graph exploration, we would need to keep track of vectors already visited.
- We would still need to resolve vector ordinal -> doc_id.

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For queries about this service, please contact Infrastructure at: us...@infra.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org