[GitHub] [lucene] jpountz commented on pull request #11860: GITHUB-11830 Better optimize storage for vector connections

GitBox Mon, 07 Nov 2022 08:07:03 -0800


jpountz commented on PR #11860:
URL: https://github.com/apache/lucene/pull/11860#issuecomment-1305835975


   I guess that encoding each block with a different number of bits per value 
would mostly help if node IDs are somewhat clustered so that the set of 
neighbors to a given node would be close to each other. Maybe this is something 
that could happen, e.g. for an e-commerce catalog sorted by category or 
something along these lines that would be somewhat correlated with the vectors. 
But in the general case, neighbors could be anywhere in the [0, maxDoc) range 
so I don't think we'd gain much compared to using the same number of bits per 
value across the whole range.
   
   Another question to me is whether we should try to delta-encode these lists 
or store absolute node IDs. By using absolute node IDs, we need `n * ⌈lg(m)⌉` 
bits to encode a list of neighbors. Something like Elias-Fano which is a very 
compact way to encode sorted lists of integers requires `2 * n + n * ⌈lg(m/n)⌉` 
bits. Assuming that neighbors are mostly random in the [0, maxDoc), 16 
neighbors per node and 2^24 nodes in total, this would reduce storage 
requirements by ~10% which is not negligible but not much either. I guess that 
it could be much better than that if node IDs were clustered though.
   
   We used to leverage the `PackedInts` utility class for all bit-packing 
related stuff in the past, but it tried to do too much at the same time, which 
prevented it from being good at everything. For instance, you don't want to do 
bit packing the same way if you want to do random access vs. if you want to 
decode sequences of packed integers, so we now have 3 utility classes for bit 
packing:
    - `PackedInts` to manage in-memory packed integers that can be accessed 
randomly. We use this e.g. at merge time to compute doc ID maps once deletes 
are applied.
    - `Direct(Monotonic)Reader`/`Direct(Monotonic)Writer` are the preferred 
classes to have disk-based packed integers that can be accessed randomly. We 
use them mainly for doc values.
    - Custom bulk decoding like `Lucene90PostingsFormat`'s `ForUtil` to 
bulk-decode sequences of packed integers.
   
   Given my understanding of the access pattern, I can think of two main 
options that could make sense:
   
   Option 1:
    - Single number of bits per value across all nodes.
    - Bulk decoding logic like `Lucene90PostingsFormat`'s `ForUtil`, maybe 
specialized for different numbers of bits per value, e.g. 20, 24, 28 and 32.
   
   Option 2:
    - Delta-encode node IDs and store deltas using vints or pfor delta.
    - Use a `DirectMonotonicReader` to store a mapping between node IDs and the 
start offset of neighbors for a given node, like `BINARY` doc values.
   
   In my mind the choice depends on how likely we think that nodes could be 
clustered in a way that nodes that are close to each other would have similar 
doc IDs. If this is a case that is likely and that we want to optimize for, 
option 2 would likely compress significantly better. Otherwise option 1. I 
think both options would be relatively easy to implement, I'm happy to help.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org

[GitHub] [lucene] jpountz commented on pull request #11860: GITHUB-11830 Better optimize storage for vector connections

Reply via email to