jpountz commented on PR #11860: URL: https://github.com/apache/lucene/pull/11860#issuecomment-1305835975
I guess that encoding each block with a different number of bits per value would mostly help if node IDs are somewhat clustered so that the set of neighbors to a given node would be close to each other. Maybe this is something that could happen, e.g. for an e-commerce catalog sorted by category or something along these lines that would be somewhat correlated with the vectors. But in the general case, neighbors could be anywhere in the [0, maxDoc) range so I don't think we'd gain much compared to using the same number of bits per value across the whole range. Another question to me is whether we should try to delta-encode these lists or store absolute node IDs. By using absolute node IDs, we need `n * ⌈lg(m)⌉` bits to encode a list of neighbors. Something like Elias-Fano which is a very compact way to encode sorted lists of integers requires `2 * n + n * ⌈lg(m/n)⌉` bits. Assuming that neighbors are mostly random in the [0, maxDoc), 16 neighbors per node and 2^24 nodes in total, this would reduce storage requirements by ~10% which is not negligible but not much either. I guess that it could be much better than that if node IDs were clustered though. We used to leverage the `PackedInts` utility class for all bit-packing related stuff in the past, but it tried to do too much at the same time, which prevented it from being good at everything. For instance, you don't want to do bit packing the same way if you want to do random access vs. if you want to decode sequences of packed integers, so we now have 3 utility classes for bit packing: - `PackedInts` to manage in-memory packed integers that can be accessed randomly. We use this e.g. at merge time to compute doc ID maps once deletes are applied. - `Direct(Monotonic)Reader`/`Direct(Monotonic)Writer` are the preferred classes to have disk-based packed integers that can be accessed randomly. We use them mainly for doc values. - Custom bulk decoding like `Lucene90PostingsFormat`'s `ForUtil` to bulk-decode sequences of packed integers. Given my understanding of the access pattern, I can think of two main options that could make sense: Option 1: - Single number of bits per value across all nodes. - Bulk decoding logic like `Lucene90PostingsFormat`'s `ForUtil`, maybe specialized for different numbers of bits per value, e.g. 20, 24, 28 and 32. Option 2: - Delta-encode node IDs and store deltas using vints or pfor delta. - Use a `DirectMonotonicReader` to store a mapping between node IDs and the start offset of neighbors for a given node, like `BINARY` doc values. In my mind the choice depends on how likely we think that nodes could be clustered in a way that nodes that are close to each other would have similar doc IDs. If this is a case that is likely and that we want to optimize for, option 2 would likely compress significantly better. Otherwise option 1. I think both options would be relatively easy to implement, I'm happy to help. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org --------------------------------------------------------------------- To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org