Michael Sokolov commented on LUCENE-9004:
-----------------------------------------

I agree with the general idea, Trey, of getting the vector scoring support in 
place, and the data structures to support that, as the first step. The 
{{VectorDocValues}} I posted is, I think, pretty good for forward-only 
iteration over vectors. It doesn't do much apart from recording the dimension 
and providing methods for retrieving a float[] per document/field. The 
{{BinaryDocValues}} format may spend some extra bytes recording the 
per-document value lengths, or it may (I haven't checked) optimize for the 
case where all lengths are equal; at any rate it certainly could, without the 
need for a separate class.
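
For concreteness, here is roughly the shape of that accessor (a sketch with 
hypothetical names; the class on my branch differs in details, and the byte 
order is an assumption):

{code:java}
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import org.apache.lucene.index.BinaryDocValues;
import org.apache.lucene.util.BytesRef;

/** Sketch: forward-only per-document vector access over BinaryDocValues. */
class VectorDocValuesSketch {
  private final BinaryDocValues delegate;
  private final int dimension;

  VectorDocValuesSketch(BinaryDocValues delegate, int dimension) {
    this.delegate = delegate;
    this.dimension = dimension;
  }

  /** Advances (forward only) to docId and decodes its vector, or null if absent. */
  float[] vector(int docId) throws IOException {
    if (delegate.advanceExact(docId) == false) {
      return null;
    }
    BytesRef bytes = delegate.binaryValue(); // dimension * Float.BYTES bytes
    float[] vector = new float[dimension];
    ByteBuffer.wrap(bytes.bytes, bytes.offset, bytes.length)
        .order(ByteOrder.BIG_ENDIAN) // byte order assumed here
        .asFloatBuffer()
        .get(vector);
    return vector;
  }
}
{code}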

I think it is important though to have some idea how we are going to use these 
data structures in algorithms since that will inform the data design. For 
example, while coding this up I learned about the importance of supporting 
random access, something our DocValues implementations are not optimized for. 
To Adrien's point about a specialized Format, I think that would enable more 
optimal random access. The DocValues API basically requires creating a new 
DocValues whenever docids go backwards, so it's not ideal for this use case, 
although for scoring only it's probably fine, since scoring will be done in 
docid order. I didn't try to bite that all off at once, drawing the line at 
making no codec-level changes for this POC.
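
To make that constraint concrete, this is roughly the dance the current API 
forces on out-of-order access (a sketch, not the POC code):

{code:java}
import java.io.IOException;
import org.apache.lucene.index.BinaryDocValues;
import org.apache.lucene.index.DocValues;
import org.apache.lucene.index.LeafReader;
import org.apache.lucene.util.BytesRef;

/** Sketch: "random access" over a forward-only DocValues iterator. */
class RandomAccessVectorsSketch {
  private final LeafReader reader;
  private final String field;
  private BinaryDocValues values;

  RandomAccessVectorsSketch(LeafReader reader, String field) throws IOException {
    this.reader = reader;
    this.field = field;
    this.values = DocValues.getBinary(reader, field);
  }

  BytesRef get(int docId) throws IOException {
    if (values.docID() >= docId) {
      // docids went backwards (or repeated): the only option today is a fresh iterator
      values = DocValues.getBinary(reader, field);
    }
    return values.advanceExact(docId) ? values.binaryValue() : null;
  }
}
{code}

A specialized Format could instead store an offset per ordinal and seek 
directly, which is the kind of improvement I'd expect it to enable.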

Re: supporting other numeric types in vectors - I think this probably adds 
needless complication at this point; the API can be evolved later. We should 
follow the example of Points, which I think only encodes float? That precision 
/ size tradeoff seems right for storage, and at this stage we should favor 
flexibility over bit-packing optimizations.

I do see that the Query implementation can be very challenging! I had not 
fully appreciated the problem honestly, so thanks for the foresight, Adrien. 
We can limit the problem by definition though, framing the Query as 
{{KnnGraphQuery(float[] target, int topK)}} where {{topK}} is not necessarily 
the same as the number of results requested. This topK is a filtering 
criterion, defining a radius of inclusion for the query, and must be supplied 
to the query's constructor. With that definition, we are free to retrieve all 
results up front, rank them by docid, and iterate over them. I'm not sure what 
else we can do that's sensible? We could expand the search over the graph to 
include an ever-widening circle (a bigger topK), but that expansion would not 
be related to docid order at all.
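
A minimal sketch of that strategy, assuming the graph traversal (elided here) 
has already produced the topK docids; scores would be kept in a parallel 
array:

{code:java}
import java.util.Arrays;
import org.apache.lucene.search.DocIdSetIterator;

class KnnScorerSketch {
  /** Wraps the eagerly collected hits in a docid-ordered iterator. */
  static DocIdSetIterator inDocIdOrder(int[] topKDocs) {
    Arrays.sort(topKDocs); // rank the pre-fetched hits by docid
    return new DocIdSetIterator() {
      private int index = -1;

      @Override
      public int docID() {
        if (index < 0) {
          return -1;
        }
        return index < topKDocs.length ? topKDocs[index] : NO_MORE_DOCS;
      }

      @Override
      public int nextDoc() {
        index++;
        return docID();
      }

      @Override
      public int advance(int target) {
        // find the first pre-fetched docid at or beyond target
        int from = Math.min(index + 1, topKDocs.length);
        int found = Arrays.binarySearch(topKDocs, from, topKDocs.length, target);
        index = found >= 0 ? found : -found - 1;
        return docID();
      }

      @Override
      public long cost() {
        return topKDocs.length;
      }
    };
  }
}
{code}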

Here's a strategy for getting to something committable:
 # Get the {{VectorDocValues}} tightened up and usable for ranking. We should 
be able to commit this while the remainder is cooking. Even if we ultimately 
replace it within the context of nearest-vector search, I think it can have 
some value on its own for other use cases, and we can continue to use it here 
for a while.
 # Finish testing graph segment-merge for a sorted index and in the face of 
deletions.
 # Get the HNSW implementation in better shape. Today it does not incorporate 
the "hierarchical" part of the algorithm, and I haven't done any testing with 
real data, just synthetic vectors. I think we *do* want to do some more 
experimentation here before locking down an encoding format. One of the 
oddities of this algorithm is that you run multiple traversals across the 
graph, starting from "random" entry points. In the hierarchical version, these 
random points are chosen from a top-level "tier" of nodes that have 
long-distance links, and then there is a tree-like hierarchy of levels in the 
graph that enables more efficient traversal. I think this tiering needs to be 
explored in order to inform the storage format (the first sketch after this 
list shows the flavor of the layered descent). Until we understand that 
better, we can continue to use the DocValues storage for both vectors and 
graph links.
 # Implement a {{KnnGraphQuery}} as described above.
 # Return to the storage layer and lock down an efficient format. I think 
there could be some improvement from using a memory-mapped FloatBuffer to read 
floats directly from disk and avoid copying them into an external float[] (see 
the second sketch after this list), and as I said we want to better support 
random access and avoid copying the entire contents of a DocValues field into 
RAM when we merge. It seems a new Format may be needed in order to enable 
that.
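
To give a flavor of the "hierarchical" part mentioned in step 3, here is a toy 
sketch of the upper-layer greedy descent that replaces random entry points in 
the hierarchical version of the algorithm. The {{LayeredGraph}} interface is a 
hypothetical stand-in, not an existing or proposed API:

{code:java}
import java.util.List;

/** Hypothetical per-layer adjacency view; a stand-in for whatever we store. */
interface LayeredGraph {
  int numLayers();
  List<Integer> neighbors(int layer, int node);
  float distance(int node, float[] query);
}

class HnswDescentSketch {
  /**
   * Greedy search on each upper layer; the closest node found on one layer
   * becomes the entry point for the layer below, until we reach layer 0,
   * where the full search runs.
   */
  static int findEntryPoint(float[] query, LayeredGraph graph, int topLayerEntry) {
    int current = topLayerEntry;
    for (int layer = graph.numLayers() - 1; layer >= 1; layer--) {
      boolean improved = true;
      while (improved) {
        improved = false;
        for (int candidate : graph.neighbors(layer, current)) {
          if (graph.distance(candidate, query) < graph.distance(current, query)) {
            current = candidate; // move greedily toward the query
            improved = true;
          }
        }
      }
    }
    return current;
  }
}
{code}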
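
And for step 5, a sketch of the direct-read idea: map the raw vector data and 
view it as floats, so scoring reads values in place instead of copying each 
vector into a fresh float[]. The flat file layout (dimension-length vectors 
concatenated in docid order) is an assumption for illustration; a real codec 
would read through the Directory/IndexInput abstractions:

{code:java}
import java.io.IOException;
import java.nio.ByteOrder;
import java.nio.FloatBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

class MappedVectorsSketch {
  private final FloatBuffer floats;
  private final int dimension;

  MappedVectorsSketch(Path path, int dimension) throws IOException {
    try (FileChannel channel = FileChannel.open(path, StandardOpenOption.READ)) {
      // the mapping stays valid after the channel is closed
      this.floats = channel
          .map(FileChannel.MapMode.READ_ONLY, 0, channel.size())
          .order(ByteOrder.BIG_ENDIAN) // byte order is an assumption here
          .asFloatBuffer();
    }
    this.dimension = dimension;
  }

  /** Dot product against the ordinal-th stored vector, read in place. */
  float dot(int ordinal, float[] query) {
    int base = ordinal * dimension;
    float sum = 0;
    for (int i = 0; i < dimension; i++) {
      sum += floats.get(base + i) * query[i];
    }
    return sum;
  }
}
{code}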

I won't work on this for a little while since I have some heavy day-job 
responsibilities next week, then I will be at ApacheCon EU, then two weeks 
holiday, but if someone else wants to pick up the VectorDocValues or make 
something better, I'll be happy :) In any case I'll return to this eventually 
and push ahead.

> Approximate nearest vector search
> ---------------------------------
>
>                 Key: LUCENE-9004
>                 URL: https://issues.apache.org/jira/browse/LUCENE-9004
>             Project: Lucene - Core
>          Issue Type: New Feature
>            Reporter: Michael Sokolov
>            Priority: Major
>
> "Semantic" search based on machine-learned vector "embeddings" representing 
> terms, queries and documents is becoming a must-have feature for a modern 
> search engine. SOLR-12890 is exploring various approaches to this, including 
> providing vector-based scoring functions. This is a spinoff issue from that.
> The idea here is to explore approximate nearest-neighbor search. Researchers 
> have found an approach based on navigating a graph that partially encodes the 
> nearest neighbor relation at multiple scales can provide accuracy > 95% (as 
> compared to exact nearest neighbor calculations) at a reasonable cost. This 
> issue will explore implementing HNSW (hierarchical navigable small-world) 
> graphs for the purpose of approximate nearest vector search (often referred 
> to as KNN or k-nearest-neighbor search).
> At a high level, the algorithm works like this. First, assume you have a 
> graph that has a partial encoding of the nearest neighbor relation, with some 
> short and some long-distance links. If this graph is built in the right way 
> (has the hierarchical navigable small world property), then you can 
> efficiently traverse it to find nearest neighbors (approximately) in log N 
> time where N is the number of nodes in the graph. I believe this idea was 
> pioneered in [1]. The great insight in that paper is that if you use the 
> graph search algorithm to find the K nearest neighbors of a new document 
> while indexing, and then link those neighbors (undirectedly, i.e. both ways) to 
> the new document, then the graph that emerges will have the desired 
> properties.
> The implementation I propose for Lucene is as follows. We need two new data 
> structures to encode the vectors and the graph. We can encode vectors using a 
> light wrapper around {{BinaryDocValues}} (we also want to encode the vector 
> dimension and have efficient conversion from bytes to floats). For the graph 
> we can use {{SortedNumericDocValues}} where the values we encode are the 
> docids of the related documents. Encoding the interdocument relations using 
> docids directly will make it relatively fast to traverse the graph since we 
> won't need to resolve an id-field indirection. This choice limits us 
> to building a graph-per-segment since it would be impractical to maintain a 
> global graph for the whole index in the face of segment merges. However 
> graph-per-segment is very natural at search time - we can traverse each 
> segment's graph independently and merge results as we do today for term-based 
> search.
> At index time, however, merging graphs is somewhat challenging. While 
> indexing we build a graph incrementally, performing searches to construct 
> links among neighbors. When merging segments we must construct a new graph 
> containing elements of all the merged segments. Ideally we would somehow 
> preserve the work done when building the initial graphs, but at least as a 
> start I'd propose we construct a new graph from scratch when merging. The 
> process is going to be limited, at least initially, to graphs that can fit 
> in RAM since we require random access to the entire graph while constructing 
> it: in order to add links bidirectionally we must continually update existing 
> documents.
> I think we want to express this API to users as a single 
> {{KnnGraphField}} abstraction that joins the vectors and the graph together 
> in one field type. Mostly it just looks like a vector-valued field, but has 
> this graph attached to it.
> I'll push a branch with my POC and would love to hear comments. It has many 
> nocommits, basic design is not really set, there is no Query implementation 
> and no integration with IndexSearcher, but it does work by some measure using 
> a standalone test class. I've tested with uniform random vectors and on my 
> laptop indexed 10K documents in around 10 seconds and searched them at 95% 
> recall (compared with exact nearest-neighbor baseline) at around 250 QPS. I 
> haven't made any attempt to use multithreaded search for this, but it is 
> amenable to per-segment concurrency.
> [1] 
> https://www.semanticscholar.org/paper/Efficient-and-robust-approximate-nearest-neighbor-Malkov-Yashunin/699a2e3b653c69aff5cf7a9923793b974f8ca164


