[ https://issues.apache.org/jira/browse/LUCENE-10577?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17566498#comment-17566498 ]
Michael Sokolov commented on LUCENE-10577:
------------------------------------------

I'm also not sure about Codec parameter vs FieldInfo, but it's clearly a lower-profile change to add to the Codec, and we could always extend it to the Field later?

I think you meant "we allow {*}signed{*} byte values"? Thanks for raising this - I had completely forgotten about the scoring angle. To keep the byte dot-product scores positive we can divide by dimension * 2^14 (the maximum product of two signed bytes). Scores might end up quite small, but at least it will be safe and shouldn't lose any information.

Regarding Euclidean - consider that Euclidean distance only differs from dot product when the vectors have different lengths (Euclidean norms). If they are all the same length, you might as well use dot product, since it leads to the same ordering (although the values differ). On the other hand, if the lengths do differ, quantization into a byte necessarily loses more information: if you scale by a large constant so the largest values fit into a byte, the precision of small values scaled by that same constant is greatly reduced. I felt this made Euclidean a bad fit and prevented it, but we could certainly implement Euclidean distance over bytes - maybe somebody smarter will find a use for it.

Also, I haven't yet done anything special about the similarity computation in KnnVectorQuery, which is used when falling back to exact KNN. There was no test failure because the codec converts to float on demand, and that is what was happening there, so it is merely suboptimal in that case. Worse, those float dot products can be large and negative, potentially leading to negative scores. To address this we may want to refactor/move the exact KNN computation into the vector reader, i.e. {{KnnVectorsReader.exactSearch(String field, float[] target, int k)}}.
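As a concrete illustration of the scoring idea above (a sketch only, not the patch attached to this issue): the raw signed-byte dot product is bounded in magnitude by dimension * 2^14, so dividing by that constant keeps the value in [-1, 1]; the final shift into [0, 1] mirrors the (1 + dot) / 2 mapping used for the float dot-product similarity and is an assumption of the sketch, not part of the proposal. The class and method names are invented for the example.

{code:java}
/**
 * Minimal sketch (not Lucene code) of a positive, bounded score for
 * signed-byte vectors, following the normalization suggested above.
 */
class ByteDotProductScorer {

  /** Raw dot product of two signed-byte vectors of equal length. */
  static int dotProduct(byte[] a, byte[] b) {
    int dot = 0;
    for (int i = 0; i < a.length; i++) {
      dot += a[i] * b[i]; // each term is in [-2^14, 2^14]
    }
    return dot;
  }

  /** Score in [0, 1], never negative. */
  static float score(byte[] a, byte[] b) {
    // |dot| <= dimension * 2^14, so this is bounded in [-1, 1]
    float normalized = dotProduct(a, b) / (float) (a.length * (1 << 14));
    // assumption: shift into [0, 1] the same way the float dot-product score is shifted
    return (1 + normalized) / 2;
  }
}
{code}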
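And a rough sketch of what the exact-KNN fallback could look like if it were pulled into the reader as suggested: brute-force score every indexed vector against the target and keep the best k. The {{float[][]}} input and the {{ScoredDoc}} type are hypothetical stand-ins for the example, not the real {{KnnVectorsReader}}/{{KnnVectorQuery}} APIs.

{code:java}
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

/** Hypothetical brute-force KNN fallback; not the Lucene implementation. */
class ExactKnnSketch {

  record ScoredDoc(int docId, float score) {}

  /** Scores every candidate and returns the k best, highest score first. */
  static List<ScoredDoc> exactSearch(float[][] vectors, float[] target, int k) {
    // min-heap by score: the worst of the current top-k sits on top
    PriorityQueue<ScoredDoc> topK =
        new PriorityQueue<ScoredDoc>(Comparator.comparingDouble(ScoredDoc::score));
    for (int doc = 0; doc < vectors.length; doc++) {
      topK.offer(new ScoredDoc(doc, dotProductScore(vectors[doc], target)));
      if (topK.size() > k) {
        topK.poll(); // evict the current worst hit
      }
    }
    List<ScoredDoc> hits = new ArrayList<>(topK);
    hits.sort(Comparator.comparingDouble(ScoredDoc::score).reversed());
    return hits;
  }

  /** Bounded, non-negative dot-product score over unit-length float vectors. */
  static float dotProductScore(float[] a, float[] b) {
    float dot = 0;
    for (int i = 0; i < a.length; i++) {
      dot += a[i] * b[i];
    }
    return (1 + dot) / 2; // same shift as in the previous sketch
  }
}
{code}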
> Quantize vector values
> ----------------------
>
>                 Key: LUCENE-10577
>                 URL: https://issues.apache.org/jira/browse/LUCENE-10577
>             Project: Lucene - Core
>          Issue Type: Improvement
>          Components: core/codecs
>            Reporter: Michael Sokolov
>            Priority: Major
>          Time Spent: 2h 10m
>  Remaining Estimate: 0h
>
> The {{KnnVectorField}} api handles vectors with 4-byte floating point values. These fields can be used (via {{KnnVectorsReader}}) in two main ways:
> 1. The {{VectorValues}} iterator enables retrieving values
> 2. Approximate nearest-neighbor search
>
> The main point of this addition was to provide the search capability, and to support that it is not really necessary to store vectors in full precision. Perhaps users may also be willing to retrieve values in lower precision for whatever purpose those serve, if they are able to store more samples. We know that 8 bits is enough to provide a very near approximation to the same recall/performance tradeoff that is achieved with the full-precision vectors. I'd like to explore how we could enable 4:1 compression of these fields by reducing their precision.
>
> A few ways I can imagine this would be done:
> 1. Provide a parallel byte-oriented API. This would allow users to provide their data in reduced-precision format and give control over the quantization to them. It would have a major impact on the Lucene API surface though, essentially requiring us to duplicate all of the vector APIs.
> 2. Automatically quantize the stored vector data when we can. This would require no or perhaps very limited change to the existing API to enable the feature.
>
> I've been exploring (2), and what I find is that we can achieve very good recall results using dot-product similarity scoring by simple linear scaling + quantization of the vector values, so long as we choose the scale that minimizes the quantization error. Dot-product is amenable to this treatment since vectors are required to be unit-length when used with that similarity function.
>
> Even still there is variability in the ideal scale over different data sets. A good choice seems to be max(abs(min-value), abs(max-value)), but of course this assumes that the data set doesn't have a few outlier data points. A theoretical range can be obtained by 1/sqrt(dimension), but this is only useful when the samples are normally distributed. We could in theory determine the ideal scale when flushing a segment and manage this quantization per-segment, but then numerical error could creep in when merging.
>
> I'll post a patch/PR with an experimental setup I've been using for evaluation purposes. It is pretty self-contained and simple, but has some drawbacks that need to be addressed:
> 1. No automated mechanism for determining quantization scale (it's a constant that I have been playing with)
> 2. Converts from byte/float when computing dot-product instead of directly computing on byte values
>
> I'd like to get people's feedback on the approach and whether in general we should think about doing this compression under the hood, or expose a byte-oriented API. Whatever we do I think a 4:1 compression ratio is pretty compelling and we should pursue something.
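For readers following along, here is a toy illustration of the linear scaling + quantization described in the quoted description above - not the experimental patch referred to there. The scale is chosen as max(abs(min-value), abs(max-value)) over the sample, and each component is mapped onto the signed-byte range; all names are made up for the example.

{code:java}
import java.util.Arrays;

/** Toy linear-scaling quantizer; assumes non-zero, roughly unit-length input vectors. */
class LinearByteQuantizer {

  /** Scale suggested above: the largest absolute component value in the sample. */
  static float chooseScale(float[][] vectors) {
    float max = 0;
    for (float[] v : vectors) {
      for (float x : v) {
        max = Math.max(max, Math.abs(x));
      }
    }
    return max;
  }

  /** Quantize one vector: v[i]/scale is in [-1, 1]; map it onto [-127, 127]. */
  static byte[] quantize(float[] v, float scale) {
    byte[] q = new byte[v.length];
    for (int i = 0; i < v.length; i++) {
      int scaled = Math.round(v[i] / scale * 127f);
      // clamp in case a component lies beyond the chosen scale (the outlier case above)
      q[i] = (byte) Math.max(-127, Math.min(127, scaled));
    }
    return q;
  }

  public static void main(String[] args) {
    float[][] data = {{0.6f, -0.8f}, {0.28f, 0.96f}}; // unit-length toy vectors
    float scale = chooseScale(data);
    System.out.println(Arrays.toString(quantize(data[0], scale)));
  }
}
{code}

Clamping guards against the outlier case mentioned in the description, at the cost of saturating those components.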