mccullocht opened a new issue, #15789:
URL: https://github.com/apache/lucene/issues/15789
### Description
I've been thinking about this for a while and have prototyped parts of it;
I'm curious if others think this is worth the effort and complexity.
OSQ produces an approximate distance between a query vector and a quantized
doc vector in the index. Extend this to include a statistical confidence lower
and upper bound on the true distance.
There are a couple of possible applications that I can think of:
1. "Wavelet"-style distance scoring.
2. Statistical bounding for re-scoring.
There are probably other applications wherever there's an ordered list of
scored vectors.
## Proposed Mechanism
WARNING: I used Gemini to explore the math around this idea. I have verified
this works in practice on some sets of L2 normalized vectors but I'm not sure
how well this generalizes.
The main observation is that the dot product of a fixed unit vector and a
random unit vector is a random variable with mean 0 and standard deviation
$\frac{1}{\sqrt{D}}$, where $D$ is the number of dimensions. Summing the $D$
per-dimension products scales the standard deviation of the sum by $\sqrt{D}$,
but because the vectors are normalized each per-dimension term contributes
roughly $\frac{1}{\sqrt{D}} \cdot \frac{1}{\sqrt{D}}$, so the total standard
deviation ends up scaling as:
$$\sigma \approx \sqrt{D} \cdot \left(\frac{1}{\sqrt{D}} \cdot
\frac{1}{\sqrt{D}}\right) = \frac{1}{\sqrt{D}}$$
We can combine this with a $Z$ score to adjust the confidence interval on
the bound.
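As a toy illustration of how the interval width falls with dimensionality (the class and method names here are mine, not a proposed API):

```java
// Toy illustration: the half-width of a confidence interval on the dot
// product of a fixed unit vector with a random unit vector shrinks as
// 1/sqrt(D). Names are illustrative, not a proposed Lucene API.
public final class DotProductBound {
  /** Approximate standard deviation of the dot product: 1/sqrt(D). */
  static double sigma(int dims) {
    return 1.0 / Math.sqrt(dims);
  }

  /** Confidence half-width at the given Z score, e.g. Z = 1.96 for 95%. */
  static double halfWidth(double z, int dims) {
    return z * sigma(dims);
  }
}
```

At $D = 1024$ a 95% interval on the dot product is only about $\pm 0.06$ wide, which is what makes the bound tight enough to be useful.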
Quantization error needs to be incorporated into the bound to reflect our
level of confidence based on loss of precision. As a starting point we will use
the L2 norm of the quantization residual:
$$\lVert \epsilon \rVert = \lVert v - \bar{v} \rVert$$
Where $v$ is the original vector and $\bar{v}$ is the de-quantized vector.
This $\lVert \epsilon \rVert$ value can be stored as a single float for each
vector. It can then be used to compute a hard bound on the error by
multiplying it with the L2 norm of the other vector: for two vectors $q$ and
$d$ the hard bound is $\lVert q \rVert \cdot \lVert \epsilon_d \rVert +
\lVert d \rVert \cdot \lVert \epsilon_q \rVert$. Combined with the standard
deviation and $Z$ score we get:
$$Bound = Z \cdot \sigma = Z \cdot \frac{\lVert q \rVert \cdot \lVert
\epsilon_d \rVert + \lVert d \rVert \cdot \lVert \epsilon_q \rVert}{\sqrt{D}}$$
This formulation works for dot-product distance; for L2 we need to multiply
the value by 2 to account for our dot-product-derived formulation of L2
distance, $\lVert q \rVert^2 + \lVert d \rVert^2 - 2 \cdot \langle q,d
\rangle$.
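Putting the pieces together, the bound computation might be sketched like this (class and parameter names are illustrative, not a proposed API):

```java
// Sketch of the combined statistical error bound:
//   Z * (||q|| * ||eps_d|| + ||d|| * ||eps_q||) / sqrt(D)
// for dot-product similarity, doubled for L2 distance since the dot
// product enters the L2 expansion with a factor of 2.
public final class QuantizationErrorBound {
  static double bound(double z, int dims, double qNorm, double dNorm,
                      double qErrNorm, double dErrNorm, boolean l2) {
    double b = z * (qNorm * dErrNorm + dNorm * qErrNorm) / Math.sqrt(dims);
    return l2 ? 2.0 * b : b;
  }
}
```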
If we have a very high-precision query (8+ bits), the query's $\lVert
\epsilon \rVert$ value will be very small and we can likely omit it without
any significant loss of precision.
As noted, we will have to store $\lVert \epsilon \rVert$ as a scalar on each
vector; for angular distance we will also need to store the L2 norm (e.g.
$\lVert q \rVert$), which is computed but not stored today. This will make
each vector 4-8 bytes larger in storage.
## Wavelet-style distance scoring
Extend `RandomVectorScorer` to add a new method:
```java
public interface RandomVectorScorer {
  /**
   * Returns the score between the query and the provided node. If the score
   * would be less than minSimilarity the implementation may exit early with a
   * negative sentinel score.
   *
   * @param node the node to score
   * @param minSimilarity the minimum similarity to consider
   * @return the computed score, or a negative sentinel score if the score
   *     would be less than minSimilarity
   */
  float score(int node, float minSimilarity);
}
```
When collecting results we may use this method and pass the `minSimilarity`
from the result set as the bound -- for instance, we might use this in HNSW
traversal since any candidate with lower similarity than the worst result would
never be popped.
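A collector using this method might look roughly like the sketch below; the `Scorer` interface and `topK` helper are stand-ins I made up for illustration, not Lucene classes:

```java
import java.util.Comparator;
import java.util.PriorityQueue;

// Hypothetical usage sketch: pass the worst score in the current top-k as
// minSimilarity and skip any candidate that comes back with the sentinel.
public final class BoundedCollector {
  interface Scorer {
    float score(int node, float minSimilarity);
  }

  /** Returns the node ids of the k best-scoring candidates, best first. */
  static int[] topK(Scorer scorer, int numNodes, int k) {
    // Min-heap on score: peek() is the worst score currently kept.
    PriorityQueue<float[]> heap =
        new PriorityQueue<>(Comparator.comparingDouble((float[] e) -> e[0]));
    for (int node = 0; node < numNodes; node++) {
      float min = heap.size() < k ? Float.NEGATIVE_INFINITY : heap.peek()[0];
      float score = scorer.score(node, min);
      if (score < 0) {
        continue; // negative sentinel: bounded below minSimilarity
      }
      heap.offer(new float[] {score, node});
      if (heap.size() > k) {
        heap.poll();
      }
    }
    int[] result = new int[heap.size()];
    for (int i = result.length - 1; i >= 0; i--) {
      result[i] = (int) heap.poll()[1];
    }
    return result;
  }
}
```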
An OSQ-based scorer can implement this method by:
1. Quantizing a primary query vector the way we do today, then quantizing a
second _residual_ vector from the difference between the original query and
the de-quantized primary vector.
2. Computing the score of the two primary vectors along with an upper-bound
score. If the upper-bound score is less than `minSimilarity` we can exit
early with the sentinel.
3. Adjusting the primary dot product by adding the dot product of the query
residual and the doc vector, computing a score from it, and returning that.
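A minimal skeleton of that flow is below; the precomputed arrays are stand-ins for the quantized dot products and error bounds a real implementation would compute on the fly, and all names are mine:

```java
// Hypothetical skeleton of the two-stage OSQ score with early exit.
public final class TwoStageScorer {
  final float[] primaryDots;  // coarse query/doc dot products (steps 1-2)
  final float[] residualDots; // query-residual corrections (step 3)
  final float[] bounds;       // per-document statistical error bounds

  TwoStageScorer(float[] primaryDots, float[] residualDots, float[] bounds) {
    this.primaryDots = primaryDots;
    this.residualDots = residualDots;
    this.bounds = bounds;
  }

  float score(int node, float minSimilarity) {
    float primary = primaryDots[node];
    if (primary + bounds[node] < minSimilarity) {
      return -1f; // upper bound cannot make the cut: exit early
    }
    return primary + residualDots[node]; // refine with the residual
  }
}
```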
I've tried this in exhaustive scoring on test data sets: with $Z = 1.96$
(95% confidence), the residual adjustment in step 3 triggers about 0.5% of
the time over a set of 2M vectors with minimal loss in recall (<0.5%).
Assuming 1-bit doc vectors, I don't think this would have a large impact on
HNSW traversal: there we are dominated by memory latency, and the fallback
should trigger much more often since the scored set correlates with the
highest scores. Exhaustive retrieval will likely do better, as would any
partition-based search strategy like IVF or SPANN.
## Statistical bounding for re-scoring
If the user wants 100 results and I have bounds for each result, I can take
the 100th-largest lower bound (i.e. the minimum lower bound among my 100
results) and include for re-scoring any vector whose _upper bound_ is greater
than that value. This would make re-scoring more precise, since any vector
that could produce a score good enough to make the top 100 under more precise
scoring would be included.
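Assuming we already have per-candidate lower and upper bounds, the selection rule can be sketched as follows (names are illustrative):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Sketch: keep every candidate whose upper bound reaches the k-th largest
// lower bound, so no vector that could make the true top k is dropped
// before exact re-scoring.
public final class RescoreCandidates {
  static List<Integer> select(float[] lower, float[] upper, int k) {
    float[] sorted = lower.clone();
    Arrays.sort(sorted);
    float threshold = sorted[sorted.length - k]; // k-th largest lower bound
    List<Integer> keep = new ArrayList<>();
    for (int i = 0; i < upper.length; i++) {
      if (upper[i] >= threshold) {
        keep.add(i);
      }
    }
    return keep;
  }
}
```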
In Lucene this is likely to be very annoying to implement, since I'd want
heap items to be a 16-byte struct but the heap implementation only accepts
8-byte items (encoded as `long`).