[ https://issues.apache.org/jira/browse/LUCENE-9322?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17161700#comment-17161700 ]
Alex Klibisz commented on LUCENE-9322:
--------------------------------------

Very briefly, I just remembered another thing you might consider if you are considering storing both dense vectors and sparse vectors. There are two optimizations for sparse vectors at the storage level:
# Very obvious, just store the "present/true/positive" indices instead of the full vector.
# Maybe less obvious, if you store the indices in sorted order, you can compute intersections more efficiently, which is useful for some similarity functions. For example `int size_of_intersection((0,1,2,3),(2,3,4)) = 2` can be computed with only an int counter and no other intermediate data structures. Whereas, `int size_of_intersection((0,2,1,3),(2,3,4)) = 2` requires converting one of the arrays to a hashset, which adds up at scale.

The sorted intersection algo is pretty obvious but here it is in case you need it: [https://github.com/alexklibisz/elastiknn/blob/74815f2613653e2c266bf7eb56b020943dd80b9a/core/src/main/java/com/klibisz/elastiknn/utils/ArrayUtils.java#L10-L36]

- Ak

> Discussing a unified vectors format API
> ---------------------------------------
>
>                 Key: LUCENE-9322
>                 URL: https://issues.apache.org/jira/browse/LUCENE-9322
>             Project: Lucene - Core
>          Issue Type: New Feature
>            Reporter: Julie Tibshirani
>            Priority: Major
>
> Two different approximate nearest neighbor approaches are currently being
> developed, one based on HNSW (LUCENE-9004) and another based on coarse
> quantization ([#LUCENE-9136]). Each prototype proposes to add a new format to
> handle vectors. In LUCENE-9136 we discussed the possibility of a unified API
> that could support both approaches. The two ANN strategies give different
> trade-offs in terms of speed, memory, and complexity, and it’s likely that
> we’ll want to support both. Vector search is also an active research area,
> and it would be great to be able to prototype and incorporate new approaches
> without introducing more formats.
> To me it seems like a good time to begin discussing a unified API. The
> prototype for coarse quantization
> ([https://github.com/apache/lucene-solr/pull/1314]) could be ready to commit
> soon (this depends on everyone's feedback of course). The approach is simple
> and shows solid search performance, as seen
> [here|https://github.com/apache/lucene-solr/pull/1314#issuecomment-608645326].
> I think this API discussion is an important step in moving that
> implementation forward.
> The goals of the API would be
> # Support for storing and retrieving individual float vectors.
> # Support for approximate nearest neighbor search -- given a query vector,
> return the indexed vectors that are closest to it.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org
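
For reference, the sorted-index intersection described in the comment above can be sketched roughly as follows. This is a minimal illustration written for this thread, not a copy of the linked ArrayUtils code; the class name `SparseIntersection` and both method names are hypothetical. It assumes each array holds the "present" indices of one sparse vector with no duplicates, and that the inputs to the first method are sorted ascending.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical helper for illustration only; not part of Lucene or elastiknn.
final class SparseIntersection {

    private SparseIntersection() {}

    /**
     * Counts the values common to two arrays.
     * Assumes both arrays are sorted ascending and contain no duplicates.
     * Uses only two cursors and a counter, no intermediate data structures.
     */
    static int sortedIntersectionCount(int[] xs, int[] ys) {
        int count = 0;
        int i = 0;
        int j = 0;
        while (i < xs.length && j < ys.length) {
            if (xs[i] < ys[j]) {
                i++;            // xs[i] cannot appear later in ys
            } else if (xs[i] > ys[j]) {
                j++;            // ys[j] cannot appear later in xs
            } else {
                count++;        // match: advance both cursors
                i++;
                j++;
            }
        }
        return count;
    }

    /**
     * Same count when the inputs are not sorted. One array has to be copied
     * into a hash set first, which is the extra allocation the sorted layout avoids.
     * Assumes neither array contains duplicates.
     */
    static int unsortedIntersectionCount(int[] xs, int[] ys) {
        Set<Integer> seen = new HashSet<>();
        for (int x : xs) {
            seen.add(x);
        }
        int count = 0;
        for (int y : ys) {
            if (seen.contains(y)) {
                count++;
            }
        }
        return count;
    }
}
```

With the inputs from the comment, `sortedIntersectionCount(new int[]{0, 1, 2, 3}, new int[]{2, 3, 4})` and `unsortedIntersectionCount(new int[]{0, 2, 1, 3}, new int[]{2, 3, 4})` both return 2. The sorted version runs in time linear in the combined length of the two arrays and allocates nothing.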