[ https://issues.apache.org/jira/browse/SOLR-14397?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17088265#comment-17088265 ]
Trey Grainger commented on SOLR-14397:
--------------------------------------

Documenting some implementation notes for some non-obvious decisions I've made along the way. Mostly notes for me to revisit, but also possibly some insight for anyone reviewing (feedback welcome).

# It feels like a cleaner document API design may be for the field input format to switch to an array instead of a delimited string. So, for example, change from this on documents:
{code:java}
{"vectors":"1.0,5.0,0.0,0.0,0.0,4.0,4.0,3.0|0.0,5.0,0.0,0.0,0.0,3.0,5.0,4.0"},
{"id":"6", "vectors":"0.0,5.0,4.0,0.0,4.0,1.0,3.0,3.0"}
{code}
to this:
{code:java}
{"id":"6", "vectors":[[1.0,5.0,0.0,0.0,0.0,4.0,4.0,3.0],[0.0,5.0,0.0,0.0,0.0,3.0,5.0,4.0]]},
{"id":"6", "vectors":[0.0,5.0,4.0,0.0,4.0,1.0,3.0,3.0]}
{code}
I didn't do this because the document parser doesn't understand multi-dimensional arrays (arrays of arrays), and because it wants to treat an array as a multi-valued field of numbers (adding each number as a new field on the document) instead of understanding the full array as the value. If the parser were field-aware and could delegate the parsing, this would be much easier (and would make it possible to extend support for fields with interesting kinds of objects in the future), but for now I side-stepped all of this and am just passing in a delimited string instead, because it works well. Additionally, I experimented with supporting multiple vectors per field by making the field {{multiValued=true}}, but this creates multiple fields, and it wasn't immediately obvious to me that you could easily recombine those fields into one field later, which is needed to encode the vectors in order and to use them efficiently. As such, I arrived at the current "pass in the whole list of vectors as a delimited string" approach you currently see (a rough sketch of parsing that format follows this list). Happy to explore this input format structure further if others feel it is important; I just didn't want to get bogged down on those rabbit trails to enable syntactic sugar before getting the main functionality working.
# The {{useDocValuesAsStored}} option on fields is quite unintuitive IMHO, and sometimes acts more as a suggestion (not always respected as the user would expect) than a reliable, user-specified choice. Specifically, it assumes that {{docValues}} and {{stored}} values are always equal and interchangeable if both are set (which isn't the case), and it chooses one or the other based upon performance optimizations instead of the user-specified choice. It already has some hacks added to handle point field (numeric) types, which violated this principle of {{stored=docValues}} values, with a comment to revisit if other types also need to work around it. Problematically, since {{stored}} fields make a call to {{toExternal}} to translate the stored value to a readable value, but {{docValues}} make no such call, Solr choosing to return {{docValues}} instead of {{stored}} values (when the user set {{useDocValuesAsStored=false}}) means that the displayed version in the results is going to be incorrect in cases where a different internal encoding is used for the docValues (such as the binary compression for this DenseVector field). I made a change to the {{SolrDocumentFetcher}} to overcome this that I think better respects the {{useDocValuesAsStored}} option by treating it as a requirement instead of a suggestion, but the approach here warrants a separate conversation on another JIRA, so I'll file one soon.
# Just noticed there is an {{org.apache.solr.common.util.Base64}} class that implements Base64 encoding. I've been using the Java {{java.util.Base64}} class (see the sketch below). The Solr version was implemented in 2009, and the Java version didn't arrive until Java 8, so maybe the Solr version is just old? If so, it would be worthwhile to validate there are no differences that favor the Solr version (encoding or performance) and consider replacing the Solr version with the Java version. Otherwise, if anyone knows a good reason to prefer the Solr version instead, I'm happy to switch over.
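For anyone reading along, here is a minimal, self-contained sketch (not the code in the patch; the class and method names are made up for illustration) of how the delimited input format from note 1 could be parsed and packed into a byte[], and of the {{java.util.Base64}} usage mentioned in note 3:
{code:java}
import java.nio.ByteBuffer;
import java.util.Base64;

public class DenseVectorInputSketch {

  /** Parse a pipe-delimited list of comma-delimited vectors, e.g. "1.0,5.0|0.0,3.0". */
  static float[][] parseVectors(String fieldValue) {
    String[] rawVectors = fieldValue.split("\\|");
    float[][] vectors = new float[rawVectors.length][];
    for (int i = 0; i < rawVectors.length; i++) {
      String[] dims = rawVectors[i].split(",");
      vectors[i] = new float[dims.length];
      for (int j = 0; j < dims.length; j++) {
        vectors[i][j] = Float.parseFloat(dims[j].trim());
      }
    }
    return vectors;
  }

  /** Pack all vectors into one byte[] (4 bytes per dimension). */
  static byte[] pack(float[][] vectors) {
    int dims = vectors[0].length;
    ByteBuffer buffer = ByteBuffer.allocate(vectors.length * dims * Float.BYTES);
    for (float[] vector : vectors) {
      for (float value : vector) {
        buffer.putFloat(value);
      }
    }
    return buffer.array();
  }

  public static void main(String[] args) {
    float[][] vectors = parseVectors("1.0,5.0,0.0,4.0|0.0,5.0,3.0,4.0");
    byte[] packed = pack(vectors);

    // java.util.Base64 (JDK 8+) round trip, as discussed in note 3 above.
    String base64 = Base64.getEncoder().encodeToString(packed);
    byte[] roundTripped = Base64.getDecoder().decode(base64);
    System.out.println(base64 + " -> " + roundTripped.length + " bytes");
  }
}
{code}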
> Vector Search in Solr
> ---------------------
>
>                 Key: SOLR-14397
>                 URL: https://issues.apache.org/jira/browse/SOLR-14397
>             Project: Solr
>          Issue Type: Improvement
>      Security Level: Public (Default Security Level. Issues are Public)
>            Reporter: Trey Grainger
>            Priority: Major
>          Time Spent: 10m
>  Remaining Estimate: 0h
>
> Search engines have traditionally relied upon token-based matching (typically keywords) on an inverted index, plus relevance ranking based upon keyword occurrence statistics. This can be viewed as a “sparse vector” match (where each term is a one-hot encoded dimension in the vector), since only a few keywords out of all possible keywords are considered in each query. With the introduction of deep-learning-based transformers over the last few years, however, the state of the art in relevance has moved to ranking models based upon dense vectors that encode a latent, semantic understanding of both language constructs and the underlying domain upon which the model was trained. These dense vectors are also referred to as “embeddings”. An example of this kind of embedding would be taking the phrase “chief executive officer of the tech company” and converting it to [0.03, 1.7, 9.12, 0, 0.3]. Other similar phrases should encode to vectors with very similar numbers, so we may expect a query like “CEO of a technology org” to generate a vector like [0.1, 1.9, 8.9, 0.1, 0.4]. When performing a cosine similarity calculation between these vectors, we would expect a number closer to 1.0, whereas a very unrelated text blurb would generate a much smaller cosine similarity.
>
> This is a proposal for how we should implement these vector search capabilities in Solr.
>
> h1. Search Process Overview:
> In order to implement dense vector search, the following process is typically followed:
> h2. Offline:
> An encoder is built. An encoder can take in text (a query, a sentence, a paragraph, a document, etc.) and return a dense vector representing that document in a rich semantic space. The semantic space is learned from training on textual data (usually, though other sources work, too), typically from the domain of the search engine.
> h2. Document Ingestion:
> When documents are processed, they are passed to the encoder, and the dense vector(s) returned are stored as fields on the document. There could be one or more vectors per document, as the granularity of the vectors could be per document, per field, per paragraph, per sentence, or even per phrase or per term.
> h2. Query Time:
> *Encoding:* The query is translated to a dense vector by passing it to the encoder.
> *Quantization:* The query is quantized. Quantization is the process of taking a vector with many values and turning it into “terms” in a vector space that approximates the full vector space of the dense vectors.
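To make the quantization step above concrete, here is a toy random-hyperplane (LSH-style) sketch of hashing a dense vector into an indexable term. It is purely illustrative (the class and term format are hypothetical) and not an approach the proposal commits to; Phase 2 below discusses pluggable quantization/ANN options.
{code:java}
import java.util.Random;

/** Toy random-hyperplane quantizer: each hyperplane contributes one bit, and the bits form an indexable "term". */
public class ToyQuantizer {
  private final float[][] hyperplanes;

  ToyQuantizer(int numHyperplanes, int dimensions, long seed) {
    Random random = new Random(seed); // same seed at index and query time so terms line up
    hyperplanes = new float[numHyperplanes][dimensions];
    for (float[] hyperplane : hyperplanes) {
      for (int d = 0; d < dimensions; d++) {
        hyperplane[d] = (float) random.nextGaussian();
      }
    }
  }

  /** Quantize a dense vector into a single hash "term" like "lsh_10110010". */
  String quantize(float[] vector) {
    StringBuilder bits = new StringBuilder("lsh_");
    for (float[] hyperplane : hyperplanes) {
      double dot = 0.0;
      for (int d = 0; d < vector.length; d++) {
        dot += hyperplane[d] * vector[d];
      }
      bits.append(dot >= 0 ? '1' : '0');
    }
    return bits.toString();
  }

  public static void main(String[] args) {
    ToyQuantizer quantizer = new ToyQuantizer(8, 5, 42L);
    // Similar vectors (from the example phrases above) should usually hash to the same term.
    System.out.println(quantizer.quantize(new float[] {0.03f, 1.7f, 9.12f, 0f, 0.3f}));
    System.out.println(quantizer.quantize(new float[] {0.1f, 1.9f, 8.9f, 0.1f, 0.4f}));
  }
}
{code}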
> *ANN Matching:* A query on the quantized vector tokens is executed as an ANN (approximate nearest neighbor) search. This allows finding most of the best matching documents (typically up to 95%) with a traditional and efficient lookup against the inverted index.
> _(optional)_ *ANN Ranking:* Ranking may be performed based upon the matched quantized tokens to get a rough, initial ranking of documents based upon the similarity of the query and document vectors. This allows the next step (re-ranking) to be performed on a smaller subset of documents.
> *Re-Ranking:* Once the initial matching (and optionally ANN ranking) is performed, a similarity calculation (cosine, dot-product, or any number of other calculations) is typically performed between the full (non-quantized) dense vectors for the query and those in the document. This re-ranking will typically be on the top-N results for performance reasons.
> *Return Results:* As with any search, the final step is typically to return the results in relevance-ranked order. In this case, that would be sorted by the re-ranking similarity score (i.e. “cosine descending”).
> ------
> *Variant:* For small document sets, it may be preferable to rank all documents and skip steps 2, 3, and 4. This is because ANN Matching typically reduces recall (current state of the art is around 95% recall), so it can be beneficial to rank all documents if performance is not a concern. In this case, step 5 is performed on the full doc set and would obviously just be considered “ranking” instead of “re-ranking”.
>
> h1. Proposed Implementation in Solr:
> h2. Phase 1: Storage of Dense Vectors & Scoring on Vector Similarity
> *
> h3. Dense Vector Field:
> We will add a new dense vector field type in Solr. This field type would be a compressed encoding of a dense vector into a BinaryDocValues field. There are other ways to do it, but this is almost certain to be the most efficient. Ideally this field is multi-valued. If it is single-valued then we are either limited to only document-level vectors, or otherwise we have to create many vector fields (i.e. per paragraph or term) and search across them, which will never be practical or scale well. BinaryDocValues does not natively support multiple values, but since the Solr field type will need to be responsible for encoding the numeric vectors as a byte[], it should also be able to encode multiple vectors at the same time.
> *
> h3. Vector Similarity Value Sources:
> Implement function queries (value sources) that take in a query vector, a (multi-valued) vector field name, and a multi-value selector ("max", "min", "avg", "first"), and return the computed calculation. This function could then be used to score documents via the “func” query parser. ANN queries could be implemented using separate “fq” params for filtering, or as part of the “q” query parameter to support optional ANN Ranking, and then the existing Solr re-ranking functionality could be used for re-ranking the top N results using the “func” query parser and these new cosine or dot product value sources as the re-rank query. Existing Solr logic for performing two-phase iteration will ensure that these vector functions will not be excessively computed for documents that do not already match the “cheaper” ANN queries.
> *Example*:
> {noformat}
> q={!func}cosine("1.55,3.53,2.3,0.7,3.44,2.33", vectors_field, first){noformat}
> ^ simple, no-ANN. The third param is optional and is the function (min|max|avg|first|last). I think this should default to "max" (or whatever “most related” is for each similarity function; it's "max" for cosine and dot product).
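For reference, a rough sketch in plain Java (not the actual value-source code; the class and method names are hypothetical) of what a cosine(queryVector, vectors_field, selector) call would compute per document for the "first" and "max" selectors:
{code:java}
/** Illustrative per-document cosine scoring over one or more stored vectors. */
public class CosineSelectorSketch {

  static double cosine(float[] queryVector, float[] docVector) {
    double dot = 0.0, queryNorm = 0.0, docNorm = 0.0;
    for (int i = 0; i < queryVector.length; i++) {
      dot += queryVector[i] * docVector[i];
      queryNorm += queryVector[i] * queryVector[i];
      docNorm += docVector[i] * docVector[i];
    }
    return dot / (Math.sqrt(queryNorm) * Math.sqrt(docNorm));
  }

  /** Mimics cosine(queryVector, vectors_field, selector) for selector = "first" or "max". */
  static double score(float[] queryVector, float[][] docVectors, String selector) {
    if ("first".equals(selector)) {
      return cosine(queryVector, docVectors[0]); // only the first stored vector is needed
    }
    // "max" = "most related" for cosine and dot product (used as the default in this sketch)
    double best = Double.NEGATIVE_INFINITY;
    for (float[] docVector : docVectors) {
      best = Math.max(best, cosine(queryVector, docVector));
    }
    return best;
  }
}
{code}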
> *Example:*
> {noformat}
> q={!func}cosine("1.55,3.53,2.3,0.7,3.44,2.33", vectors_field, first)&fq=[ANN_QUERY]{noformat}
> ^ ensures the function is not computed unless the ANN query matches, due to the two-phase iterator
> *Example:*
> {noformat}
> q=[ANN_QUERY]&rq={!rerank reRankQuery=$rqq reRankDocs=10000 reRankWeight=1000000}&rqq={!func}cosine("1.55,3.53,2.3,0.7,3.44,2.33", vectors_field){noformat}
> ^ Supports an initial ranking based upon the ANN QUERY, followed by a re-ranking based upon the cosine function.
> The key implementation challenges are expected to be:
> * Ensuring decent query (speed) performance
> * Determining an efficient vector/byte encoding that has decent disk-space / query-time decoding trade-offs and makes it possible to decode only the ‘first’ vector per doc in cases where that is all that’s requested by the distance function
> h2. Phase 2: Pluggable ANN Search
> * Implement pluggable quantization/ANN search. There are multiple different algorithms to do quantization for efficient ANN search. The simplest is something like LSH, with more advanced implementations being worked on in Lucene right now in LUCENE-9004 (HNSW) and LUCENE-9136 (IVFFLAT), and with a prototyped LSH implementation (Solr Vector Search plugin) and a random projections approach (Hangry) linked to in SOLR-12890.
> * Given that the transformers/encoders are a rapidly evolving area of data science and NLP research, it feels unreasonable to assume that Solr will be able to keep up to date on the latest model integrations. Given that, if we build support into Solr for encoder models, we need to make it very pluggable.
> * Phase 1 fully enables quantization to be done OUTSIDE of Solr, and thus for external systems to accomplish quantization by creating quantized terms externally, indexing them to Solr, and then sending in queries to Solr with the quantized terms. I’m not convinced that keeping quantization outside of Lucene/Solr isn’t a more manageable model long term, but if certain models do get integrated natively into Lucene (as is currently being worked on) then we should obviously look to leverage them.
> h2. Important Considerations:
> * Multi-valued dense vectors are a key differentiating feature here. I’m considering this a key requirement and designing accordingly. This will allow for multiple embeddings per document (i.e. word embeddings, sentence embeddings, paragraph embeddings, etc.) to be modeled. I like the way [~ehatcher] designed the Payload Function Query and think we should model the multi-valued handling similarly.
> * As best as possible, it would be nice to support multiple (pluggable) fields/value sources that can be converted to dense vectors and compared. At a minimum we need an efficient dense vector field, but it may be beneficial to support a sparse vector field where only meaningful values need to be encoded, in order to save space. Additionally, there may be some utility in other field types or arbitrary value sources that can produce vector-like output to be leveraged in the function queries being proposed.
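As one illustration of the “decode only the ‘first’ vector per doc” challenge listed above, here is a hypothetical length-prefixed layout for packing multiple vectors into a single BinaryDocValues byte[]; the actual field encoding and compression may well differ.
{code:java}
import java.nio.ByteBuffer;

/**
 * Hypothetical layout: [vectorCount:int][dims:int][vector0 floats][vector1 floats]...
 * A fixed-size header plus fixed-width vectors makes it cheap to decode only one vector.
 */
public class MultiVectorLayoutSketch {

  static byte[] encode(float[][] vectors) {
    int dims = vectors[0].length;
    ByteBuffer buffer = ByteBuffer.allocate(2 * Integer.BYTES + vectors.length * dims * Float.BYTES);
    buffer.putInt(vectors.length).putInt(dims);
    for (float[] vector : vectors) {
      for (float value : vector) {
        buffer.putFloat(value);
      }
    }
    return buffer.array();
  }

  /** Decode only the i-th vector without materializing the rest (i = 0 for the "first" selector). */
  static float[] decodeVector(byte[] encoded, int i) {
    ByteBuffer buffer = ByteBuffer.wrap(encoded);
    buffer.getInt(); // vectorCount (unused here)
    int dims = buffer.getInt();
    buffer.position(2 * Integer.BYTES + i * dims * Float.BYTES);
    float[] vector = new float[dims];
    for (int d = 0; d < dims; d++) {
      vector[d] = buffer.getFloat();
    }
    return vector;
  }
}
{code}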
> h1. Note on Prior Approaches
> * There are a handful of ways to perform vector search today with Solr. One is through Streaming Expressions, and alternatively there are a few plugins which have implemented this functionality for Solr. One of these was contributed in SOLR-12890, and I've outlined examples of most of these approaches in the comments on SOLR-12890.
> * The Streaming Expressions approach works well out of the box, but it doesn't integrate nicely with traditional keyword search use cases, so we need a solution that integrates with real-time queries and enables the full suite of Solr's query-time features.
> * The most advanced Solr plugin out there seems to be this one: https://github.com/moshebla/solr-vector-scoring (forked the version in SOLR-12890 and then added LSH functionality for ANN).
> * I encountered several challenges when working with the implementation in SOLR-12890, which informed the above proposed design:
> 1. The plugin encodes dense vectors into payloads attached to terms representing vector positions. This is pretty inefficient and is way too slow at significant scale. I believe we instead need the underlying field type to be a binary doc values field with efficient compression and decoding, for optimal performance when scanning through all the matching documents and dimensions.
> 2. The plugin only supports one vector per field, making it challenging to support word/phrase vectors, sentence vectors, paragraph vectors, etc. It's not obvious how to modify the current approach to support multiple vectors per field.
> 3. The plugin uses a query parser instead of just a function query, which prevents re-use of the vector similarity function. Using a function query instead will enable the vector similarity scores to be easily leveraged for relevance scoring, sorting, returned as fields, or used within other function queries. Given that there are two goals, 1) filtering on ANN and 2) scoring on vector similarity (usually via re-ranking), using a function query for the scoring piece will be more flexible across scoring use cases (combining with other functions, returning, sorting, etc.).
> 4. The LSH quantization/ANN implementation is a cool idea, but it is hard-coded in the implementation. I recommend making it a separate filter and making the implementation more pluggable (though for now this can be externalized and passed in to Solr). We may be able to pull the LSH work in during phase 2.
> For cleaner discussion and tracking, I'm moving this revised proposal to this new JIRA, and we can use SOLR-12890 for continued discussion (if any) of the previously contributed plugin implementation.