[ https://issues.apache.org/jira/browse/LUCENE-9136?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17052614#comment-17052614 ]

Julie Tibshirani commented on LUCENE-9136:
------------------------------------------

Hello [~tomoko]! My explanation before was way too brief; I'm still getting 
used to the joint JIRA/GitHub set-up :) I'll give more context on the 
suggested direction.

The draft adds a new format, VectorsFormat, which simply delegates to 
DocValuesFormat and PostingsFormat under the hood:
 * The original vectors are stored as BinaryDocValues.
 * The vectors are also clustered using k-means, and the cluster information 
is stored in the postings format. In particular, each cluster centroid is 
encoded as a BytesRef to represent a term, and each document belonging to that 
centroid is added to the postings list for that term (a rough sketch of this 
write path follows below).
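
Here is a minimal sketch of that clustering step. It is illustrative only: the 
k-means computation itself is omitted, and the term encoding and in-memory 
grouping shown here are assumptions made for the example rather than the PR's 
actual code.

{code:java}
import java.nio.ByteBuffer;
import java.util.ArrayList;
import java.util.List;
import java.util.SortedMap;
import java.util.TreeMap;

import org.apache.lucene.util.BytesRef;

/**
 * Sketch of the write path: assign each document's vector to its nearest
 * k-means centroid and group doc ids under a term derived from that centroid.
 * The centroids are assumed to have been computed already.
 */
class ClusterWriteSketch {

  /** centroid term -> doc ids whose vectors fall in that cluster */
  static SortedMap<BytesRef, List<Integer>> assign(float[][] vectorsByDoc, float[][] centroids) {
    SortedMap<BytesRef, List<Integer>> clusters = new TreeMap<>();
    for (int docId = 0; docId < vectorsByDoc.length; docId++) {
      float[] centroid = centroids[nearest(vectorsByDoc[docId], centroids)];
      BytesRef term = encode(centroid); // centroid serialized as an opaque term
      clusters.computeIfAbsent(term, t -> new ArrayList<>()).add(docId);
      // the raw vector itself is stored separately as BinaryDocValues
    }
    // this map would be exposed through a Fields implementation and handed to
    // the delegate PostingsFormat, one term per centroid
    return clusters;
  }

  /** index of the closest centroid by squared Euclidean distance */
  static int nearest(float[] vector, float[][] centroids) {
    int best = 0;
    double bestDist = Double.MAX_VALUE;
    for (int i = 0; i < centroids.length; i++) {
      double dist = 0;
      for (int j = 0; j < vector.length; j++) {
        double diff = vector[j] - centroids[i][j];
        dist += diff * diff;
      }
      if (dist < bestDist) {
        bestDist = dist;
        best = i;
      }
    }
    return best;
  }

  /** naive big-endian float encoding; the real term encoding may differ */
  static BytesRef encode(float[] centroid) {
    ByteBuffer buffer = ByteBuffer.allocate(centroid.length * Float.BYTES);
    for (float value : centroid) {
      buffer.putFloat(value);
    }
    return new BytesRef(buffer.array());
  }
}
{code}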

Given a query vector, we first iterate through all the centroid terms to find a 
small number of closest centroids. We then take the disjunction of all those 
postings enums to obtain a DocIdSetIterator of candidate nearest neighbors. To 
produce the score for each candidate, we load its vector from BinaryDocValues 
and compute the distance to the query vector.
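
A corresponding sketch of the query path over a single segment follows. Again, 
this is only illustrative: the decode and distance helpers mirror the 
hypothetical encoding above, candidates are simply collected into a sorted set 
rather than wrapped in a true disjunction iterator, and the PR's real scoring 
logic may differ.

{code:java}
import java.io.IOException;
import java.nio.ByteBuffer;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.TreeSet;

import org.apache.lucene.index.BinaryDocValues;
import org.apache.lucene.index.DocValues;
import org.apache.lucene.index.LeafReader;
import org.apache.lucene.index.PostingsEnum;
import org.apache.lucene.index.Terms;
import org.apache.lucene.index.TermsEnum;
import org.apache.lucene.search.DocIdSetIterator;
import org.apache.lucene.util.BytesRef;

/** Sketch of the query path: probe the nprobe closest centroids, then score the candidates. */
class ClusterQuerySketch {

  static void search(LeafReader reader, String field, float[] queryVector, int nprobe)
      throws IOException {
    Terms terms = reader.terms(field);
    if (terms == null) {
      return;
    }

    // 1. Scan all centroid terms and keep the nprobe closest to the query vector.
    TermsEnum termsEnum = terms.iterator();
    List<ScoredTerm> closest = new ArrayList<>();
    for (BytesRef term = termsEnum.next(); term != null; term = termsEnum.next()) {
      closest.add(new ScoredTerm(distance(queryVector, decode(term)), BytesRef.deepCopyOf(term)));
    }
    closest.sort(Comparator.comparingDouble(st -> st.distance));
    closest = closest.subList(0, Math.min(nprobe, closest.size()));

    // 2. Union the postings of the selected centroids: these are the candidate neighbors.
    TreeSet<Integer> candidates = new TreeSet<>();
    for (ScoredTerm st : closest) {
      if (termsEnum.seekExact(st.term)) {
        PostingsEnum postings = termsEnum.postings(null, PostingsEnum.NONE);
        for (int doc = postings.nextDoc(); doc != DocIdSetIterator.NO_MORE_DOCS; doc = postings.nextDoc()) {
          candidates.add(doc);
        }
      }
    }

    // 3. Score candidates with the exact distance. Doc values are read only through
    //    forward iteration, since candidates are visited in increasing doc id order.
    BinaryDocValues vectors = DocValues.getBinary(reader, field);
    for (int doc : candidates) {
      if (vectors.advanceExact(doc)) {
        float[] vector = decode(vectors.binaryValue());
        System.out.println("doc=" + doc + " distance=" + distance(queryVector, vector));
      }
    }
  }

  private static final class ScoredTerm {
    final double distance;
    final BytesRef term;

    ScoredTerm(double distance, BytesRef term) {
      this.distance = distance;
      this.term = term;
    }
  }

  /** inverse of the write-side encoding above */
  static float[] decode(BytesRef bytes) {
    ByteBuffer buffer = ByteBuffer.wrap(bytes.bytes, bytes.offset, bytes.length);
    float[] vector = new float[bytes.length / Float.BYTES];
    for (int i = 0; i < vector.length; i++) {
      vector[i] = buffer.getFloat();
    }
    return vector;
  }

  /** squared Euclidean distance */
  static double distance(float[] a, float[] b) {
    double sum = 0;
    for (int i = 0; i < a.length; i++) {
      double diff = a[i] - b[i];
      sum += diff * diff;
    }
    return sum;
  }
}
{code}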

I liked that this approach didn't introduce major new data structures and could 
re-use the existing formats. To respond to your point, one difference between 
this approach and HNSW is that it’s able to re-use the formats without 
modifications to their APIs or implementations. In particular, it doesn’t 
require random access to doc values; they are only accessed through forward 
iteration. So to keep the code as simple as possible, I stuck with 
BinaryDocValues and didn’t create a new way to store the vector values. 
However, the PR does introduce a new top-level VectorsFormat as I thought this 
gave nice flexibility while prototyping.

There are two main hacks in the draft that would need addressing:
 * It's fairly fragile to re-use formats explicitly since we write to the same 
files as normal doc values and postings – I think there would be a conflict if 
there were both a vector field and a doc values field with the same name.
 * To write the postings list, we compute the map from centroid to documents in 
memory. We then expose it through a hacky Fields implementation called 
ClusterBackedFields and pass it to the postings writer. It would be better to 
avoid this hack and not to compute cluster information using a map.

Even apart from code-level concerns, I don't think the draft PR would be ready 
to integrate immediately. There are some areas where I think further work is 
needed to determine if coarse quantization (IVFFlat) is the right approach:
 * It would be good to run tests to understand how it scales to larger sets of 
documents, say in the 5M - 100M range. We would probably want to scale the 
number of centroids with the number of documents – a common heuristic is to set 
num centroids = sqrt(dataset size). Looking at the [FAISS 
experiments|https://github.com/facebookresearch/faiss/wiki/Indexing-1M-vectors],
 it can be helpful to use an even higher number of centroids.
 ** Do we still obtain good recall and QPS for these larger dataset sizes?
 ** Can we still afford to run k-means at index time, given a larger number of 
centroids? With 10,000 centroids, for example, each time we index a document 
we'll be computing the distance between its vector and all 10,000 centroids. 
This is a big concern and I think we would need strategies to address it (see 
the back-of-the-envelope estimate after this list).
 * It’s great that coarse quantization is relatively simple and could be 
implemented with existing data structures. But would we expect a much bigger 
speed-up and better scaling with a graph approach like HNSW? I think this still 
requires more analysis.
 * More thinking is required as to how to handle deleted documents (as 
discussed in LUCENE-9004).
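
To make the index-time cost concern above concrete, here is a 
back-of-the-envelope estimate. The dataset size and vector dimension are 
arbitrary example values, and the sqrt heuristic is just the common rule of 
thumb mentioned earlier, not something the PR prescribes.

{code:java}
/** Back-of-the-envelope cost of assigning each indexed document to its nearest centroid. */
class IndexingCostEstimate {
  public static void main(String[] args) {
    long datasetSize = 100_000_000L;                   // e.g. 100M documents
    int dimension = 128;                               // e.g. 128-dimensional vectors
    long numCentroids = (long) Math.sqrt(datasetSize); // sqrt heuristic: 10,000 centroids
    long multiplyAddsPerDoc = numCentroids * dimension;   // ~1.28M multiply-adds per document
    long totalDistances = datasetSize * numCentroids;     // ~1e12 distance computations overall
    System.out.println(numCentroids + " centroids, "
        + multiplyAddsPerDoc + " multiply-adds per doc, "
        + totalDistances + " distance computations in total");
  }
}
{code}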

> Introduce IVFFlat to Lucene for ANN similarity search
> -----------------------------------------------------
>
>                 Key: LUCENE-9136
>                 URL: https://issues.apache.org/jira/browse/LUCENE-9136
>             Project: Lucene - Core
>          Issue Type: New Feature
>            Reporter: Xin-Chun Zhang
>            Priority: Major
>         Attachments: 1581409981369-9dea4099-4e41-4431-8f45-a3bb8cac46c0.png, 
> image-2020-02-16-15-05-02-451.png
>
>          Time Spent: 50m
>  Remaining Estimate: 0h
>
> Representation learning (RL) has been an established discipline in the 
> machine learning space for decades, but it has drawn tremendous attention 
> lately with the emergence of deep learning. The central problem of RL is to 
> determine an optimal representation of the input data. Once the data is 
> embedded into high-dimensional vectors, vector retrieval (VR) methods can be 
> applied to search for relevant items.
> With the rapid development of RL over the past few years, the technique has 
> been used extensively in industry, from online advertising to computer vision 
> and speech recognition. There exist many open source implementations of VR 
> algorithms, such as Facebook's FAISS and Microsoft's SPTAG, providing various 
> choices for potential users. However, the aforementioned implementations are 
> all written in C++ with no plan to support a Java interface, making them hard 
> to integrate into Java projects or for those who are not familiar with C/C++ 
> [https://github.com/facebookresearch/faiss/issues/105]. 
> The algorithms for vector retrieval can be roughly classified into four 
> categories:
>  # Tree-based algorithms, such as KD-tree;
>  # Hashing methods, such as LSH (Locality-Sensitive Hashing);
>  # Quantization-based algorithms, such as IVFFlat;
>  # Graph-based algorithms, such as HNSW, SSG, NSG;
> where IVFFlat and HNSW are the most popular among all the VR algorithms.
> IVFFlat is better for high-precision applications such as face recognition, 
> while HNSW performs better in general scenarios including recommendation and 
> personalized advertisement. *The recall ratio of IVFFlat can be gradually 
> increased by adjusting the query parameter (nprobe), while it is hard for 
> HNSW to improve its accuracy*. In theory, IVFFlat could achieve a 100% recall 
> ratio. 
> Recently, the implementation of HNSW (Hierarchical Navigable Small World, 
> LUCENE-9004) for Lucene has made great progress. That issue has drawn the 
> attention of those who are interested in Lucene or hope to use HNSW with 
> Solr/Lucene. As an alternative for solving ANN similarity search problems, 
> IVFFlat is also very popular with many users and supporters. Compared with 
> HNSW, IVFFlat has a smaller index size but requires k-means clustering, while 
> HNSW is faster in query (no training required) but requires extra storage for 
> saving graphs 
> [Indexing 1M vectors|https://github.com/facebookresearch/faiss/wiki/Indexing-1M-vectors].
> Another advantage is that IVFFlat can be faster and more accurate when GPU 
> parallel computing is enabled (currently not supported in Java). Both 
> algorithms have their merits and demerits. Since HNSW is now under 
> development, it may be better to provide both implementations (HNSW and 
> IVFFlat) for potential users who face very different scenarios and want more 
> choices.
> The latest branch is 
> [*lucene-9136-ann-ivfflat*|https://github.com/irvingzhang/lucene-solr/commits/jira/lucene-9136-ann-ivfflat].


