[ https://issues.apache.org/jira/browse/LUCENE-9136?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17052614#comment-17052614 ]
Julie Tibshirani commented on LUCENE-9136:
------------------------------------------

Hello [~tomoko]! My explanation before was way too brief; I'm still getting used to the joint JIRA/GitHub set-up :) I'll give more context on the suggested direction.

The draft adds a new format, VectorsFormat, which simply delegates to DocValuesFormat and PostingsFormat under the hood:
 * The original vectors are stored as BinaryDocValues.
 * The vectors are also clustered through k-means, and the cluster information is stored in postings format. In particular, each cluster centroid is encoded to a BytesRef to represent a term, and each document belonging to that centroid is added to the postings list for the term.

Given a query vector, we first iterate through all the centroid terms to find a small number of closest centroids. We then take the disjunction of all those postings enums to obtain a DocIdSetIterator of candidate nearest neighbors. To produce the score for each candidate, we load its vector from BinaryDocValues and compute the distance to the query vector. I liked that this approach didn't introduce major new data structures and could re-use the existing formats. A rough sketch of the indexing and query flow follows below.
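For concreteness, here is a toy sketch of that flow in plain Java, with ordinary collections standing in for the postings lists and BinaryDocValues. Every class and method name in it is invented for illustration and is not taken from the draft PR:

{code:java}
import java.nio.ByteBuffer;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/**
 * Toy sketch of the indexing and query flow described above, using plain Java
 * collections in place of Lucene's postings lists and BinaryDocValues.
 * Illustration only; none of these names come from the draft PR.
 */
class IvfFlatSketch {
  private final float[][] centroids;                                    // output of k-means (training not shown)
  private final Map<Integer, List<Integer>> postings = new HashMap<>(); // centroid ordinal -> doc ids
  private final Map<Integer, float[]> docVectors = new HashMap<>();     // stand-in for BinaryDocValues

  IvfFlatSketch(float[][] centroids) {
    this.centroids = centroids;
  }

  /** Index time: append the doc id to the postings list of its nearest centroid. */
  void addDocument(int docId, float[] vector) {
    int nearest = nearestCentroid(vector);
    postings.computeIfAbsent(nearest, k -> new ArrayList<>()).add(docId);
    docVectors.put(docId, vector); // the full vector is also stored, as with BinaryDocValues
  }

  /** A centroid's "term": its components packed into bytes, as a BytesRef would hold them. */
  static byte[] encodeCentroidAsTerm(float[] centroid) {
    ByteBuffer buf = ByteBuffer.allocate(centroid.length * Float.BYTES);
    for (float f : centroid) {
      buf.putFloat(f);
    }
    return buf.array();
  }

  /** Query time: union the postings of the nProbe closest centroids, then rescore exactly. */
  List<Integer> search(float[] query, int nProbe, int topK) {
    // 1. Iterate over all centroid terms and keep the nProbe closest to the query.
    Integer[] ords = new Integer[centroids.length];
    for (int i = 0; i < ords.length; i++) {
      ords[i] = i;
    }
    Arrays.sort(ords, Comparator.comparingDouble(o -> squaredL2(query, centroids[o])));

    // 2. The disjunction of their postings lists yields the candidate doc ids.
    List<Integer> candidates = new ArrayList<>();
    for (int i = 0; i < Math.min(nProbe, ords.length); i++) {
      candidates.addAll(postings.getOrDefault(ords[i], List.of()));
    }

    // 3. Score each candidate from its stored vector and keep the best topK.
    candidates.sort(Comparator.comparingDouble(doc -> squaredL2(query, docVectors.get(doc))));
    return candidates.subList(0, Math.min(topK, candidates.size()));
  }

  private int nearestCentroid(float[] v) {
    int best = 0;
    for (int i = 1; i < centroids.length; i++) {
      if (squaredL2(v, centroids[i]) < squaredL2(v, centroids[best])) {
        best = i;
      }
    }
    return best;
  }

  private static double squaredL2(float[] a, float[] b) {
    double sum = 0;
    for (int i = 0; i < a.length; i++) {
      double d = a[i] - b[i];
      sum += d * d;
    }
    return sum;
  }
}
{code}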
To respond to your point, one difference between this approach and HNSW is that it re-uses the existing formats without modifications to their APIs or implementations. In particular, it doesn't require random access for doc values; they are only accessed through forward iteration. So to keep the code as simple as possible, I stuck with BinaryDocValues and didn't create a new way to store the vector values. However, the PR does introduce a new top-level VectorsFormat, as I thought this gave nice flexibility while prototyping.

There are two main hacks in the draft that would need addressing:
 * It's fairly fragile to re-use formats explicitly, since we write to the same files as normal doc values and postings – I think there would be a conflict if there were both a vector field and a doc values field with the same name.
 * To write the postings lists, we compute the map from centroid to documents in memory, then expose it through a hacky Fields implementation called ClusterBackedFields and pass it to the postings writer. It would be better to avoid this hack and not compute the cluster information using a map.

Even apart from code-level concerns, I don't think the draft PR would be ready to integrate immediately. There are some areas where further work is needed to determine whether coarse quantization (IVFFlat) is the right approach:
 * It would be good to run tests to understand how it scales to larger sets of documents, say in the 5M - 100M range. We would probably want to scale the number of centroids with the number of documents; a common heuristic is to set num centroids = sqrt(dataset size). Looking at the [FAISS experiments|https://github.com/facebookresearch/faiss/wiki/Indexing-1M-vectors], it can be helpful to use an even higher number of centroids.
 ** Do we still obtain good recall and QPS for these larger dataset sizes?
 ** Can we still afford to run k-means at index time, given a larger number of centroids? With 10,000 centroids, for example, each time we index a document we'll be computing the distance between that document and 10,000 other vectors. This is a big concern, and I think we would need strategies to address it (a back-of-the-envelope illustration is below).
 * It's great that coarse quantization is relatively simple and could be implemented with existing data structures. But would we expect a much bigger speed-up and better scaling from a graph approach like HNSW? I think this still requires more analysis.
 * More thinking is required as to how to handle deleted documents (as discussed in LUCENE-9004).
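To make that index-time cost concern concrete, here is a back-of-the-envelope sketch. The document count, dimensionality, and the simple cost model are assumptions for illustration, not measurements of the draft:

{code:java}
/** Back-of-the-envelope index-time cost of centroid assignment; all numbers are illustrative. */
class CentroidAssignmentCost {
  public static void main(String[] args) {
    long numDocs = 100_000_000L;                 // upper end of the 5M - 100M range discussed above
    int dims = 128;                              // an assumed embedding dimensionality
    int numCentroids = (int) Math.sqrt(numDocs); // the sqrt(dataset size) heuristic -> 10,000

    // Brute-force assignment compares each new document against every centroid,
    // i.e. roughly numCentroids * dims floating-point operations per indexed document.
    long opsPerDoc = (long) numCentroids * dims;
    double opsTotal = (double) opsPerDoc * numDocs;
    System.out.printf("centroids=%d, ~%d float ops per doc, ~%.2e float ops for the whole index%n",
        numCentroids, opsPerDoc, opsTotal);
  }
}
{code}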
> Introduce IVFFlat to Lucene for ANN similarity search
> -----------------------------------------------------
>
>                 Key: LUCENE-9136
>                 URL: https://issues.apache.org/jira/browse/LUCENE-9136
>             Project: Lucene - Core
>          Issue Type: New Feature
>            Reporter: Xin-Chun Zhang
>            Priority: Major
>         Attachments: 1581409981369-9dea4099-4e41-4431-8f45-a3bb8cac46c0.png, image-2020-02-16-15-05-02-451.png
>
>          Time Spent: 50m
>  Remaining Estimate: 0h
>
> Representation learning (RL) has been an established discipline in the machine learning space for decades, but it has drawn tremendous attention lately with the emergence of deep learning. The central problem of RL is to determine an optimal representation of the input data. By embedding the data into a high-dimensional vector, a vector retrieval (VR) method can then be applied to search for relevant items.
>
> With the rapid development of RL over the past few years, the technique has been used extensively in industry, from online advertising to computer vision and speech recognition. There exist many open-source implementations of VR algorithms, such as Facebook's FAISS and Microsoft's SPTAG, providing various choices for potential users. However, the aforementioned implementations are all written in C++ with no plan to support a Java interface, making them hard to integrate into Java projects or to use for those who are not familiar with C/C++ [https://github.com/facebookresearch/faiss/issues/105].
>
> The algorithms for vector retrieval can be roughly classified into four categories:
> # Tree-based algorithms, such as KD-tree;
> # Hashing methods, such as LSH (Locality-Sensitive Hashing);
> # Product quantization based algorithms, such as IVFFlat;
> # Graph-based algorithms, such as HNSW, SSG, NSG;
> where IVFFlat and HNSW are the most popular among all the VR algorithms. IVFFlat is better for high-precision applications such as face recognition, while HNSW performs better in general scenarios including recommendation and personalized advertisement. *The recall ratio of IVFFlat can be gradually increased by adjusting the query parameter (nprobe), while it is hard for HNSW to improve its accuracy.* In theory, IVFFlat could achieve a 100% recall ratio.
>
> Recently, the implementation of HNSW (Hierarchical Navigable Small World, LUCENE-9004) for Lucene has made great progress. That issue draws the attention of those who are interested in Lucene or hope to use HNSW with Solr/Lucene. As an alternative for solving ANN similarity search problems, IVFFlat is also very popular, with many users and supporters. Compared with HNSW, IVFFlat has a smaller index size but requires k-means clustering, while HNSW is faster at query time (no training required) but requires extra storage for saving the graphs [Indexing 1M vectors|https://github.com/facebookresearch/faiss/wiki/Indexing-1M-vectors]. Another advantage is that IVFFlat can be faster and more accurate when GPU parallel computing is enabled (currently not supported in Java). Both algorithms have their merits and demerits. Since HNSW is now under development, it may be better to provide both implementations (HNSW and IVFFlat) for potential users who face very different scenarios and want more choices.
>
> The latest branch is [*lucene-9136-ann-ivfflat*|https://github.com/irvingzhang/lucene-solr/commits/jira/lucene-9136-ann-ivfflat]