[
https://issues.apache.org/jira/browse/LUCENE-9136?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Xin-Chun Zhang updated LUCENE-9136:
-----------------------------------
Attachment: (was: image-2020-02-16-14-36-54-478.png)
> Introduce IVFFlat to Lucene for ANN similarity search
> -----------------------------------------------------
>
> Key: LUCENE-9136
> URL: https://issues.apache.org/jira/browse/LUCENE-9136
> Project: Lucene - Core
> Issue Type: New Feature
> Reporter: Xin-Chun Zhang
> Priority: Major
> Attachments: 1581409981369-9dea4099-4e41-4431-8f45-a3bb8cac46c0.png,
> image-2020-02-16-15-05-02-451.png
>
>
> Representation learning (RL) has been an established discipline in the
> machine learning space for decades but it draws tremendous attention lately
> with the emergence of deep learning. The central problem of RL is to
> determine an optimal representation of the input data. By embedding the data
> into a high dimensional vector, the vector retrieval (VR) method is then
> applied to search the relevant items.
> With the rapid development of RL over the past few years, the technique has
> been used extensively in industry from online advertising to computer vision
> and speech recognition. There exist many open source implementations of VR
> algorithms, such as Facebook's FAISS and Microsoft's SPTAG, providing various
> choices for potential users. However, the aforementioned implementations are
> all written in C++, and no plan for supporting Java interface, making it hard
> to be integrated in Java projects or those who are not familier with C/C++
> [[https://github.com/facebookresearch/faiss/issues/105]].
> The algorithms for vector retrieval can be roughly classified into four
> categories,
> # Tree-base algorithms, such as KD-tree;
> # Hashing methods, such as LSH (Local Sensitive Hashing);
> # Product quantization based algorithms, such as IVFFlat;
> # Graph-base algorithms, such as HNSW, SSG, NSG;
> where IVFFlat and HNSW are the most popular ones among all the VR algorithms.
> IVFFlat is better for high-precision applications such as face recognition,
> while HNSW performs better in general scenarios including recommendation and
> personalized advertisement. *The recall ratio of IVFFlat could be gradually
> increased by adjusting the query parameter (nprobe), while it's hard for HNSW
> to improve its accuracy*. In theory, IVFFlat could achieve 100% recall ratio.
> Recently, the implementation of HNSW (Hierarchical Navigable Small World,
> LUCENE-9004) for Lucene, has made great progress. The issue draws attention
> of those who are interested in Lucene or hope to use HNSW with Solr/Lucene.
> As an alternative for solving ANN similarity search problems, IVFFlat is also
> very popular with many users and supporters. Compared with HNSW, IVFFlat has
> smaller index size but requires k-means clustering, while HNSW is faster in
> query (no training required) but requires extra storage for saving graphs
> [indexing 1M
> vectors|[https://github.com/facebookresearch/faiss/wiki/Indexing-1M-vectors]].
> Another advantage is that IVFFlat can be faster and more accurate when
> enables GPU parallel computing (current not support in Java). Both algorithms
> have their merits and demerits. Since HNSW is now under development, it may
> be better to provide both implementations (HNSW && IVFFlat) for potential
> users who are faced with very different scenarios and want to more choices.
--
This message was sent by Atlassian Jira
(v8.3.4#803005)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]