[ https://issues.apache.org/jira/browse/LUCENE-9004?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16949437#comment-16949437 ]
Michael Sokolov commented on LUCENE-9004:
-----------------------------------------

I agree with the general idea, Trey, of getting the vector scoring support in place, along with the data structures to support it, as the first step. The VectorDocValues I posted is pretty good, I think, for forward-only iteration over vectors. It doesn't do much apart from recording the dimension and providing methods for retrieving float[] per document/field. The BinaryDocValues format may spend some extra bytes recording the per-document value lengths, or it may (?) optimize for the case where all lengths are equal - at any rate it certainly could, without the need for a separate class.

I think it is important, though, to have some idea how we are going to use these data structures in algorithms, since that will inform the data design. For example, while coding this up I learned about the importance of supporting random access, something our DocValues implementations are not optimized for. To Adrien's point about a specialized Format, I think that would enable more optimal random access. The DocValues API basically requires creating a new DocValues instance whenever docids go backwards, so it's not ideal for this use case, although for scoring only it's probably fine, since scoring will be done in docid order. I didn't try to bite all that off at once, drawing the line at making no codec-level changes for this POC.

Re: supporting other numeric types in vectors - I think this probably adds needless complication at this point; the API can be evolved later. We should follow the example of Points, which I think only encodes float? The precision/size tradeoff seems right for storage, and at this stage we should favor flexibility over bit-packing optimizations.

I do see that the Query implementation can be very challenging! I had not fully appreciated the problem, honestly, so thanks for the foresight, Adrien.
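To make the forward-only access pattern concrete, here is a minimal sketch of decoding a per-document float[] from the raw bytes a BinaryDocValues field hands back. It assumes a fixed-length little-endian float32 encoding of a known dimension; the class and method names are illustrative, not the POC's actual API.

```java
// Hypothetical sketch: decode a fixed-dimension float[] from the bytes of a
// BinaryDocValues entry, assuming little-endian float32 encoding. Names are
// illustrative; this is not the VectorDocValues API from the posted patch.
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

public final class VectorDecoder {
  private final int dimension;

  public VectorDecoder(int dimension) {
    this.dimension = dimension;
  }

  /** Decode {@code dimension} floats from the given byte slice. */
  public float[] decode(byte[] bytes, int offset, int length) {
    if (length != dimension * Float.BYTES) {
      throw new IllegalArgumentException(
          "expected " + (dimension * Float.BYTES) + " bytes, got " + length);
    }
    float[] vector = new float[dimension];
    ByteBuffer.wrap(bytes, offset, length)
        .order(ByteOrder.LITTLE_ENDIAN)
        .asFloatBuffer()
        .get(vector); // bulk copy bytes -> floats
    return vector;
  }
}
```

With a fixed per-document length like this, the format could indeed skip storing per-value lengths, which is the optimization speculated about above.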
We can limit the problem by definition, though, framing the Query as {{KnnGraphQuery(float[] target, int topK)}}, where {{topK}} is not necessarily the same as the number of results requested. This topK is a filtering criterion, defining a radius of inclusion for the query, and is required to be supplied to the constructor of the query. With that definition, we are free to retrieve all results up front, rank them by docid, and iterate over them. I'm not sure what else we can do that's sensible? We could expand backwards over the graph, to include an ever-widening circle (bigger topK), but the expansion will not be related to docid order at all.

Here's a strategy for getting to something committable:
# Get the {{VectorDocValues}} tightened up and usable for ranking. We should be able to commit this while the remainder is cooking. Even if we ultimately replace it within the context of nearest-vector search, I think it can have some value on its own for other use cases, and we can continue to use it here for a while.
# Finish testing graph segment-merge for a sorted index and in the face of deletions.
# Get the HNSW implementation in better shape. Today it does not incorporate the "hierarchical" part of the algorithm, and I haven't done any testing with real data, just synthetic vectors. I think we *do* want to do some more experimentation here before locking down an encoding format. One of the oddities of this algorithm is that you run multiple traversals across the graph, starting from "random" entry points. In the hierarchical version, these random points are chosen from a top-level "tier" of nodes that have long-distance links, and then there is a tree-like hierarchy of levels in the graph that enables more efficient traversal. I think this tiering needs to be explored in order to inform the storage format. Until we understand that better, we can continue to use the DocValues storage for both vectors and graph links.
# Implement a KnnGraphQuery as described above.
# Return to the storage layer and lock down an efficient format. I think there could be some improvement from using a memory-mapped FloatBuffer to directly read floats from disk and avoid copying into an external float[], and as I said we want to better support random access and avoid copying the entire contents of a DocValues field into RAM when we merge. It seems a new Format may be needed in order to enable that.

I won't work on this for a little while since I have some heavy day-job responsibilities next week, then I will be at ApacheCon EU, then two weeks holiday, but if someone else wants to pick up the VectorDocValues or make something better, I'll be happy :) In any case I'll return to this eventually and push ahead.

> Approximate nearest vector search
> ---------------------------------
>
> Key: LUCENE-9004
> URL: https://issues.apache.org/jira/browse/LUCENE-9004
> Project: Lucene - Core
> Issue Type: New Feature
> Reporter: Michael Sokolov
> Priority: Major
>
> "Semantic" search based on machine-learned vector "embeddings" representing
> terms, queries and documents is becoming a must-have feature for a modern
> search engine. SOLR-12890 is exploring various approaches to this, including
> providing vector-based scoring functions. This is a spinoff issue from that.
> The idea here is to explore approximate nearest-neighbor search. Researchers
> have found that an approach based on navigating a graph that partially
> encodes the nearest neighbor relation at multiple scales can provide
> accuracy > 95% (as compared to exact nearest neighbor calculations) at a
> reasonable cost. This issue will explore implementing HNSW (hierarchical
> navigable small-world) graphs for the purpose of approximate nearest vector
> search (often referred to as KNN or k-nearest-neighbor search).
> At a high level, the way this algorithm works is this.
> First assume you have a graph that has a partial encoding of the nearest
> neighbor relation, with some short and some long-distance links. If this
> graph is built in the right way (has the hierarchical navigable small world
> property), then you can efficiently traverse it to find nearest neighbors
> (approximately) in log N time, where N is the number of nodes in the graph.
> I believe this idea was pioneered in [1]. The great insight in that paper is
> that if you use the graph search algorithm to find the K nearest neighbors
> of a new document while indexing, and then link those neighbors
> (undirectedly, i.e. both ways) to the new document, then the graph that
> emerges will have the desired properties.
> The implementation I propose for Lucene is as follows. We need two new data
> structures to encode the vectors and the graph. We can encode vectors using
> a light wrapper around {{BinaryDocValues}} (we also want to encode the
> vector dimension and have efficient conversion from bytes to floats). For
> the graph we can use {{SortedNumericDocValues}}, where the values we encode
> are the docids of the related documents. Encoding the interdocument
> relations using docids directly will make it relatively fast to traverse the
> graph, since we won't need to look up through an id-field indirection. This
> choice limits us to building a graph per segment, since it would be
> impractical to maintain a global graph for the whole index in the face of
> segment merges. However, graph-per-segment is very natural at search time -
> we can traverse each segment's graph independently and merge results as we
> do today for term-based search.
> At index time, however, merging graphs is somewhat challenging. While
> indexing we build a graph incrementally, performing searches to construct
> links among neighbors. When merging segments we must construct a new graph
> containing elements of all the merged segments.
> Ideally we would somehow preserve the work done when building the initial
> graphs, but at least as a start I'd propose we construct a new graph from
> scratch when merging. The process is going to be limited, at least
> initially, to graphs that can fit in RAM, since we require random access to
> the entire graph while constructing it: in order to add links
> bidirectionally we must continually update existing documents.
> I think we want to express this API to users as a single {{KnnGraphField}}
> abstraction that joins together the vectors and the graph in a single field
> type. Mostly it just looks like a vector-valued field, but has this graph
> attached to it.
> I'll push a branch with my POC and would love to hear comments. It has many
> nocommits, the basic design is not really set, there is no Query
> implementation and no integration with IndexSearcher, but it does work by
> some measure using a standalone test class. I've tested with uniform random
> vectors and on my laptop indexed 10K documents in around 10 seconds and
> searched them at 95% recall (compared with an exact nearest-neighbor
> baseline) at around 250 QPS. I haven't made any attempt to use multithreaded
> search for this, but it is amenable to per-segment concurrency.
> [1]
> https://www.semanticscholar.org/paper/Efficient-and-robust-approximate-nearest-neighbor-Malkov-Yashunin/699a2e3b653c69aff5cf7a9923793b974f8ca164
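As a rough illustration of the traversal the description sketches, here is a tiny in-memory stand-in: a greedy descent that, starting from an entry node, repeatedly moves to whichever neighbor is closest to the query until no neighbor improves. This is only the core step of small-world search (the full algorithm keeps a candidate queue and multiple entry points); docids are plain ints and the adjacency lists live in a Map, standing in for the proposed SortedNumericDocValues encoding.

```java
// Illustrative sketch, not Lucene code: greedy nearest-neighbor descent over an
// in-memory graph. Node ids stand in for docids; adjacency lists stand in for
// the docid values that would be stored in SortedNumericDocValues.
import java.util.List;
import java.util.Map;

public final class GreedySearch {
  static float squaredDistance(float[] a, float[] b) {
    float sum = 0;
    for (int i = 0; i < a.length; i++) {
      float d = a[i] - b[i];
      sum += d * d;
    }
    return sum;
  }

  /** Follow the best neighbor at each step until a local minimum is reached. */
  static int search(Map<Integer, List<Integer>> neighbors,
                    Map<Integer, float[]> vectors,
                    float[] query, int entry) {
    int current = entry;
    float best = squaredDistance(vectors.get(current), query);
    boolean improved = true;
    while (improved) {
      improved = false;
      for (int n : neighbors.getOrDefault(current, List.of())) {
        float d = squaredDistance(vectors.get(n), query);
        if (d < best) { // move to the closer neighbor
          best = d;
          current = n;
          improved = true;
        }
      }
    }
    return current;
  }
}
```

The hierarchical tiers in HNSW exist precisely to give this descent good entry points with long-distance links, so it converges in roughly log N hops rather than wandering through short-range neighbors.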