[jira] [Commented] (LUCENE-9322) Discussing a unified vectors format API

Varun Thacker (Jira) Wed, 17 Jun 2020 11:27:07 -0700


    [ 
https://issues.apache.org/jira/browse/LUCENE-9322?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17138707#comment-17138707
 ]


Varun Thacker commented on LUCENE-9322:
---------------------------------------

JDK used for these runs:

 
{code:java}
openjdk version "1.8.0_242"
OpenJDK Runtime Environment (AdoptOpenJDK)(build 1.8.0_242-b08)
OpenJDK 64-Bit Server VM (AdoptOpenJDK)(build 25.242-b08, mixed mode)
{code}
 

 

This is my first time trying out JMH. I compared the encoding approach we 
used in VectorField with the encoding approach taken by DenseVectorField (in 
SOLR-14397).

 

The VectorField encoding is much faster than Base64 encoding: about 123 ops/s 
versus about 35 ops/s in the runs below, roughly 3.5x the throughput. Each op 
encodes one 512-dimension vector 10,000 times.

 

 
{code:java}
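@Benchmark
public void testVectorFieldEncoding() {
    // Build a 512-dimension sample vector.
    float[] vector = new float[512];
    for (int i=0; i<512; i++) {
        vector[i] = i + i/1000f;
    }

    // Encode the vector 10,000 times per benchmark invocation.
    for (int i=0; i<10_000; i++) {
        ByteBuffer buffer = ByteBuffer.allocate(Float.BYTES * vector.length);
        // Bulk-copy the floats through a FloatBuffer view of the byte buffer.
        buffer.asFloatBuffer().put(vector);
        buffer.array();
    }
}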
{code}
 

JMH output

 
{code:java}
Result: 123.116 ±(99.9%) 2.671 ops/s [Average]
  Statistics: (min, avg, max) = (95.557, 123.116, 143.097), stdev = 11.310
  Confidence interval (99.9%): [120.445, 125.787]




# Run complete. Total time: 00:08:07


Benchmark                                 Mode  Samples    Score  Score error  Units
o.e.MyBenchmark.testVectorFieldEncoding  thrpt      200  123.116        2.671  ops/s
{code}
 

 

 
{code:java}
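@Benchmark
public void testBase64Encoding() {
    // Build the same 512-dimension sample vector.
    float[] vector = new float[512];
    for (int i=0; i<512; i++) {
        vector[i] = i + i/1000f;
    }

    // Encode the vector 10,000 times per benchmark invocation.
    for (int i=0; i<10_000; i++) {
        ByteBuffer buffer = ByteBuffer.allocate(Float.BYTES * vector.length);
        // Write the floats one at a time.
        for (float value : vector) {
            buffer.putFloat(value);
        }
        // Reset the position to 0 so the encoder reads the whole buffer.
        buffer.rewind();

        // Base64-encode the raw bytes.
        java.util.Base64.getEncoder().encode(buffer).array();
    }
}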
{code}
 

JMH output
{code:java}
Result: 35.069 ±(99.9%) 0.745 ops/s [Average]
  Statistics: (min, avg, max) = (25.792, 35.069, 41.335), stdev = 3.154
  Confidence interval (99.9%): [34.324, 35.814]




# Run complete. Total time: 00:08:06


Benchmark                            Mode  Samples   Score  Score error  Units
o.e.MyBenchmark.testBase64Encoding  thrpt      200  35.069        0.745  ops/s
{code}
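
The throughput numbers above say nothing about output size or correctness, so here is a minimal standalone sketch (no JMH) that encodes the same 512-dimension vector both ways, decodes it back, and reports the encoded sizes. The class name VectorEncodingSketch and its helper methods are hypothetical names made up for this illustration; they are not taken from VectorField or DenseVectorField. The Base64 path here encodes a byte[] rather than a ByteBuffer, which is an assumption made to keep the sketch short.

```java
import java.nio.ByteBuffer;
import java.util.Arrays;
import java.util.Base64;

// Hypothetical illustration class; not part of Lucene or Solr.
public class VectorEncodingSketch {

    // Raw encoding: bulk-copy the floats through a FloatBuffer view
    // (ByteBuffer is big-endian by default).
    static byte[] encodeRaw(float[] vector) {
        ByteBuffer buffer = ByteBuffer.allocate(Float.BYTES * vector.length);
        buffer.asFloatBuffer().put(vector);
        return buffer.array();
    }

    // Inverse of encodeRaw.
    static float[] decodeRaw(byte[] bytes) {
        float[] vector = new float[bytes.length / Float.BYTES];
        ByteBuffer.wrap(bytes).asFloatBuffer().get(vector);
        return vector;
    }

    // Base64 encoding applied on top of the raw bytes.
    static byte[] encodeBase64(float[] vector) {
        return Base64.getEncoder().encode(encodeRaw(vector));
    }

    // Inverse of encodeBase64.
    static float[] decodeBase64(byte[] encoded) {
        return decodeRaw(Base64.getDecoder().decode(encoded));
    }

    // Same sample data as the benchmarks above.
    static float[] sampleVector() {
        float[] vector = new float[512];
        for (int i = 0; i < vector.length; i++) {
            vector[i] = i + i / 1000f;
        }
        return vector;
    }

    public static void main(String[] args) {
        float[] vector = sampleVector();
        byte[] raw = encodeRaw(vector);
        byte[] base64 = encodeBase64(vector);

        System.out.println("raw bytes:    " + raw.length);
        System.out.println("base64 bytes: " + base64.length);
        System.out.println("raw round-trip ok:    "
                + Arrays.equals(vector, decodeRaw(raw)));
        System.out.println("base64 round-trip ok: "
                + Arrays.equals(vector, decodeBase64(base64)));
    }
}
```

For a 512-float vector the raw form is 2048 bytes and the padded Base64 form is 2732 bytes, so Base64 costs about a third more space in addition to the extra encoding pass.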
 

> Discussing a unified vectors format API
> ---------------------------------------
>
>                 Key: LUCENE-9322
>                 URL: https://issues.apache.org/jira/browse/LUCENE-9322
>             Project: Lucene - Core
>          Issue Type: New Feature
>            Reporter: Julie Tibshirani
>            Priority: Major
>
> Two different approximate nearest neighbor approaches are currently being 
> developed, one based on HNSW ([#LUCENE-9004]) and another based on coarse 
> quantization ([#LUCENE-9136]). Each prototype proposes to add a new format to 
> handle vectors. In LUCENE-9136 we discussed the possibility of a unified API 
> that could support both approaches. The two ANN strategies give different 
> trade-offs in terms of speed, memory, and complexity, and it’s likely that 
> we’ll want to support both. Vector search is also an active research area, 
> and it would be great to be able to prototype and incorporate new approaches 
> without introducing more formats.
> To me it seems like a good time to begin discussing a unified API. The 
> prototype for coarse quantization 
> ([https://github.com/apache/lucene-solr/pull/1314]) could be ready to commit 
> soon (this depends on everyone's feedback of course). The approach is simple 
> and shows solid search performance, as seen 
> [here|https://github.com/apache/lucene-solr/pull/1314#issuecomment-608645326].
>  I think this API discussion is an important step in moving that 
> implementation forward.
> The goals of the API would be
> # Support for storing and retrieving individual float vectors.
> # Support for approximate nearest neighbor search -- given a query vector, 
> return the indexed vectors that are closest to it.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org
