benwtrent commented on PR #11860:
URL: https://github.com/apache/lucene/pull/11860#issuecomment-1314379458

   I am digging back into this.
   
 - Vectors that are similar do not have similar doc_ids. To ensure that they 
do, some clustering would have to occur, which implies a different graph 
structure, as the vectors would have to be known ahead of time. 
       - In fact, having similar doc_ids for similar nodes might fix the issue 
we have of requiring ALL of `.vex` to be loaded in memory, as we could 
jump to clusters of doc_ids to read the most relevant neighbors... While a 
fruitful endeavor, this is complicated and out of scope for now.
    - We have seen graphs with as many as `Integer.MAX_VALUE` vectors per 
segment. However, these are exceptions.
   
   So:
    - The neighbors for a given vector have effectively random doc_ids from 
`[0, MAX_DOC)`
    - We read ALL neighbors at once when exploring
 - We must be able to skip quickly to a specific level and a specific node 
at that level
   
   This leads me down the road of considering @jpountz's `Option 1`: a unified 
bits-per-value (bpv), but with specialized bulk encoding/decoding, similar to 
ForUtil.
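
   To make the appeal of `Option 1` concrete, here is a rough sketch (hypothetical names, not the actual ForUtil code) of why a unified bpv makes the "skip quickly to a specific node" requirement trivial:

```java
// Sketch only: illustrates why a single, unified bpv allows O(1) seeks into
// the neighbor lists. Names are hypothetical, not actual Lucene/ForUtil APIs.
public class FixedBpvSketch {
  // With one bpv for the whole graph, every node's neighbor block has the same
  // size, so its file offset is pure arithmetic -- no per-node lookups needed.
  static long neighborBlockOffset(long levelBaseOffset, int node, int maxConn, int bpv) {
    long blockBytes = ((long) maxConn * bpv + 7) / 8; // bytes per packed neighbor block
    return levelBaseOffset + (long) node * blockBytes;
  }

  public static void main(String[] args) {
    // e.g. M=16 neighbors packed at 24 bpv -> 48 bytes per block
    System.out.println(neighborBlockOffset(0, 10, 16, 24)); // 480
  }
}
```

   With varying bpv, by contrast, block sizes differ per node, so seeking requires extra per-node metadata or a scan.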
   
   ------------------
   @rmuir's suggestion is that we use varying bpv. To justify this, the bpv 
required by different nodes' neighbor sets would have to differ significantly. 
I am not sure we can rely on that without some clustering mechanism guaranteeing 
the doc_id distributions. Let me try to explain.
   
   Let's take two disconnected nodes, `a` and `b`, and assume that each neighbor 
has a random doc ID from `[0, MAX_DOC)`. The number of neighbors is determined 
by `M`; let's use the default value of `16`. Label the sets of `M` 
neighbors for `a` and `b` as `a_neighbors` and `b_neighbors`, respectively. 
For simplicity, assume `a_neighbors` contains `MAX_DOC - 1` and thus has 
the worst-case bpv requirement.
   
   This turns into a probabilistic problem. Given the total 
population of documents `[0, MAX_DOC)`, what is the probability that every 
neighbor in `b_neighbors` requires at least 4 fewer bits than `MAX_DOC - 1`? (I 
choose 4 bits as it fits nicely into our ForUtil optimizations; we will snap 
to standard bpv in increments of 4.)
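
   For what it's worth, snapping a required bpv up to the next multiple of 4, as assumed above, is simple integer rounding (illustrative helper, not Lucene code):

```java
public class BpvSnap {
  // Round the required bits-per-value up to the next multiple of 4,
  // e.g. 21 -> 24, 16 -> 16. Illustrative only, not Lucene code.
  static int snapToBpv(int requiredBits) {
    return ((requiredBits + 3) / 4) * 4;
  }

  public static void main(String[] args) {
    System.out.println(snapToBpv(21)); // 24
    System.out.println(snapToBpv(16)); // 16
  }
}
```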
   
   This can be expressed as independent random events. A single document 
requires at least 4 fewer bits with probability 1/16 (the chance of getting a 
doc in `[0, (MAX_DOC - 1)/16)`, since each bit saved halves the range). This has 
to happen independently 16 times (`M` times), which gives `Math.pow(1.0/16, 16.0)` 
as our probability of success: about `5.4e-20`, or expressed as a 
percentage, a `5.4e-18`% chance of it happening. TECHNICALLY, the 
probability is even worse, as we are sampling without replacement, meaning the 
probability drops below 1/16 as the desired set of docs to choose from gets 
smaller after every success. 
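
   As a quick sanity check on the magnitude (note that saving at least 4 bits corresponds to landing in the lowest 1/16th of the doc-id range):

```java
public class BpvProbability {
  public static void main(String[] args) {
    int m = 16;                      // default HNSW max connections (M)
    double pPerNeighbor = 1.0 / 16;  // chance one random doc_id needs >= 4 fewer bits
    double pAllNeighbors = Math.pow(pPerNeighbor, m); // all M neighbors at once
    System.out.println(pAllNeighbors); // ~5.4e-20, i.e. effectively never
  }
}
```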
   
   I don't think this justifies storing varying bpv, at least until we can 
ensure that the collection of neighbors has some distribution across the 
doc_ids other than completely random.

