vigyasharma commented on PR #14173: URL: https://github.com/apache/lucene/pull/14173#issuecomment-2629167044
> I think this PR is still doing globally unique ordinals for vectors? So, vectors 1, 2, 3 go to document 1 and ordinals 4, 5 go to doc 2? If so, I think we should "bite the bullet" and make vector ordinals long values. That is correct. I did play around with "long" nodeIDs in an alternate implementation but ran into a whole bunch of challenges: 1. In a lot of places, we hold all nodes of the graph (or at a level) in a single array, and use vector ordinals as array index addresses. Java limits the size of arrays (and lists) to 'int max' and does not allow 'long' array indices. These will need to be changed to use a different data structure. 2. We use bitsets on vector ordinals for multiple use cases, and almost all bitsets work with integers. There is one `LongBitSet` but it doesn't integrate well with BitSetIterators etc. At one point, I had this idea of representing vectors using `ordinals` and `subOrdinals`, where ordinals correspond to documents, i.e. one ordinal per document, and subOrdinals to the multiple vector values per document. For example, for a multi-vector with 3 values, the (ord, subOrd) values would be (ord, 0), (ord, 1), (ord, 2). For single valued vectors, we will only have subOrdinal=0 for each vector. I would then pack the ordinal and subOrdinal values into a single `long` graph nodeId. Ordinal took the lower 32 bits, and subOrdinal took 32 MSB. Since we use VLong to write them, we don't need extra storage for single valued vectors where subOrdinal bits were all zero. This was also neat because I could score `>INT_MAX` vectors in the graph, and use bit masks to find the baseOrdinal (first vector of the doc) from any graph nodeId. However, in addition to the "long" nodeId challenges above, this approach broke latent assumptions like "vector ordinals are densely packed". For example, we could no longer assume that maxOrdinal is the graph size. Hence I pivoted back to "int ordinals" and started maintaining `ordToDoc` maps with many-one mapping. Given the volume of changes with "long" graph nodeIds, I wonder if we should do it when we add a new ANN implementation (like DiskANN maybe?). -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org --------------------------------------------------------------------- To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org