[
https://issues.apache.org/jira/browse/SOLR-12890?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16948916#comment-16948916
]
Joel Bernstein edited comment on SOLR-12890 at 11/19/19 6:53 PM:
-----------------------------------------------------------------
h1. Rough survey of some available approaches:
h2. 1) Vector Scoring using Streaming Expressions (works now):
*Docs:* (paste each into "Documents" pane in Solr Admin UI as type:"json")
{code:java}
curl -X POST -H "Content-Type: application/json" \
http://localhost:8983/solr/food_collection/update?commit=true --data-binary '
[
{"id": "1", "name_s":"donut","vector_fs":[5.0,0.0,1.0,5.0,0.0,4.0,5.0,1.0]},
{"id": "2", "name_s":"apple juice","vector_fs":[1.0,5.0,0.0,0.0,0.0,4.0,4.0,3.0]},
{"id": "3", "name_s":"cappuccino","vector_fs":[0.0,5.0,3.0,0.0,4.0,1.0,2.0,3.0]},
{"id": "4", "name_s":"cheese pizza","vector_fs":[5.0,0.0,4.0,4.0,0.0,1.0,5.0,2.0]},
{"id": "5", "name_s":"green tea","vector_fs":[0.0,5.0,0.0,0.0,2.0,1.0,1.0,5.0]},
{"id": "6", "name_s":"latte","vector_fs":[0.0,5.0,4.0,0.0,4.0,1.0,3.0,3.0]},
{"id": "7", "name_s":"soda","vector_fs":[0.0,5.0,0.0,0.0,3.0,5.0,5.0,0.0]},
{"id": "8", "name_s":"cheese bread sticks","vector_fs":[5.0,0.0,4.0,5.0,0.0,1.0,4.0,2.0]},
{"id": "9", "name_s":"water","vector_fs":[0.0,5.0,0.0,0.0,0.0,0.0,0.0,5.0]},
{"id": "10", "name_s":"cinnamon bread sticks","vector_fs":[5.0,0.0,1.0,5.0,0.0,3.0,4.0,2.0]}
]'
{code}
*Streaming Expression:*
{code:java}
sort(
  select(
    search(food_collection,
           q="*:*",
           fl="id,vector_fs",
           sort="id asc",
           rows=3),
    cosineSimilarity(vector_fs, array(5.1,0.0,1.0,5.0,0.0,4.0,5.0,1.0)) as sim,
    id),
  by="sim desc")
{code}
*Response:*
{code:java}
{
"result-set": {
"docs": [
{ "sim": 0.99996111, "id": "1" },
{ "sim": 0.98590279, "id": "10" },
{ "sim": 0.55566643, "id": "2" },
{ "EOF": true, "RESPONSE_TIME": 10 }
]
}
}{code}
*Benefits*:
1) Works now out of the box
*Drawbacks*:
1) Have to switch searches to using Streaming Expressions, which may not be
practical in some use cases.
2) Solr doesn't have multi-dimensional point field support yet (SOLR-11077),
so you can only store one vector per field per document.
3) Requires traversing all vectors and scoring them. Needs some sort of KNN
option (possibly this could be done with another inner streaming expression
using a hash function?)
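For context, the per-document math behind the cosineSimilarity evaluator used above is just a normalized dot product. A minimal plain-Java sketch (illustrative only, not Solr's actual implementation):
{code:java}
// Illustrative cosine similarity, equivalent in spirit to what the
// cosineSimilarity stream evaluator computes per document (not Solr's code).
public final class CosineSimilarityExample {
  public static double cosine(double[] a, double[] b) {
    double dot = 0.0, normA = 0.0, normB = 0.0;
    for (int i = 0; i < a.length; i++) {
      dot   += a[i] * b[i];
      normA += a[i] * a[i];
      normB += b[i] * b[i];
    }
    return dot / (Math.sqrt(normA) * Math.sqrt(normB));
  }

  public static void main(String[] args) {
    double[] donut = {5.0, 0.0, 1.0, 5.0, 0.0, 4.0, 5.0, 1.0};
    double[] query = {5.1, 0.0, 1.0, 5.0, 0.0, 4.0, 5.0, 1.0};
    System.out.println(cosine(donut, query)); // ~0.99996, matching "sim" for id 1 above
  }
}
{code}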
h2. 2) Available Solr Vector Search Plugin (works now):
[https://github.com/saaay71/solr-vector-scoring]
Note: I recently reached out to Ali (the author of this plugin) and asked him
to add an ASL 2.0 license, which he has now done, so we can pull in this code
as needed.
*Docs:*
{code:java}
curl -X POST -H "Content-Type: application/json" \
http://localhost:8983/solr/{your-collection-name}/update?commit=true \
--data-binary '
[
{"name":"example 0", "vector":"0|1.55 1|3.53 2|2.3 3|0.7 4|3.44 5|2.33 "},
{"name":"example 1", "vector":"0|3.54 1|0.4 2|4.16 3|4.88 4|4.28 5|4.25 "},
{"name":"example 2", "vector":"0|1.11 1|0.6 2|1.47 3|1.99 4|2.91 5|1.01 "},
{"name":"example 3", "vector":"0|0.06 1|4.73 2|0.29 3|1.27 4|0.69 5|3.9 "},
{"name":"example 4", "vector":"0|4.01 1|3.69 2|2 3|4.36 4|1.09 5|0.1 "},
{"name":"example 5", "vector":"0|0.64 1|3.95 2|1.03 3|1.65 4|0.99 5|0.09 "}
]'
{code}
*Request:*
{code:java}
http://localhost:8983/solr/{your-collection-name}/query?fl=name,score,vector&q={!vp f=vector vector="0.1,4.75,0.3,1.2,0.7,4.0"}
{code}
*Response:*
{code:java}
{
"responseHeader":{
"status":0,
"QTime":1,
"params":{
"q":"{!myqp f=vector vector=\"0.1,4.75,0.3,1.2,0.7,4.0\"}",
"fl":"name,score,vector"}},
"response":{"numFound":6,"start":0,"maxScore":0.99984086,"docs":[
{
"name":["example 3"],
"vector":["0|0.06 1|4.73 2|0.29 3|1.27 4|0.69 5|3.9 "],
"score":0.99984086},
{
"name":["example 0"],
"vector":["0|1.55 1|3.53 2|2.3 3|0.7 4|3.44 5|2.33 "],
"score":0.7693964},
{
"name":["example 5"],
"vector":["0|0.64 1|3.95 2|1.03 3|1.65 4|0.99 5|0.09 "],
"score":0.76322395},
{
"name":["example 4"],
"vector":["0|4.01 1|3.69 2|2 3|4.36 4|1.09 5|0.1 "],
"score":0.5328145},
{
"name":["example 1"],
"vector":["0|3.54 1|0.4 2|4.16 3|4.88 4|4.28 5|4.25 "],
"score":0.48513117},
{
"name":["example 2"],
"vector":["0|1.11 1|0.6 2|1.47 3|1.99 4|2.91 5|1.01 "],
"score":0.44909418}]
}}
{code}
*Benefits:*
1) Works now (when you install the plugin)
2) Can be used in regular searches (via the Search Handler) without having to
switch to streaming expressions
*Drawbacks*:
1) Slow implementation. It uses payloads to store the values within the
vector, and traversing those is very expensive (see the payload-traversal
sketch after this list). If we were going to follow this approach, at a
minimum we should switch from using payloads to overriding term frequencies
for a speedup.
2) Only supports one vector per field per document (I think... haven't tried
with a multi-valued text field, but it looks like the payload scoring logic
expects only a single value per dimension. Might be possible to modify this...)
3) Requires traversing all vectors and scoring them. Needs some sort of KNN
option to not be slow at scale.
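To make drawback #1 concrete, here is a rough sketch of what payload-based vector traversal looks like at the Lucene level (illustrative only; the field name and term layout are assumptions, not the plugin's exact code). Every dimension of every candidate document costs a position advance plus a payload decode, which is why this gets expensive:
{code:java}
import org.apache.lucene.analysis.payloads.PayloadHelper;
import org.apache.lucene.index.LeafReader;
import org.apache.lucene.index.PostingsEnum;
import org.apache.lucene.index.Terms;
import org.apache.lucene.index.TermsEnum;
import org.apache.lucene.search.DocIdSetIterator;
import org.apache.lucene.util.BytesRef;

public final class PayloadVectorScan {
  /**
   * Walks every term of a delimited-payload "vector" field and decodes each
   * per-position payload as a float. Real scoring would accumulate a dot
   * product per (doc, dimension); the point here is the per-value overhead.
   */
  static void scan(LeafReader reader) throws Exception {
    Terms terms = reader.terms("vector");              // hypothetical field name
    if (terms == null) return;
    TermsEnum termsEnum = terms.iterator();
    while (termsEnum.next() != null) {                 // one term per dimension index
      PostingsEnum postings = termsEnum.postings(null, PostingsEnum.PAYLOADS);
      int doc;
      while ((doc = postings.nextDoc()) != DocIdSetIterator.NO_MORE_DOCS) {
        int freq = postings.freq();
        for (int i = 0; i < freq; i++) {
          postings.nextPosition();                     // must advance to read the payload
          BytesRef payload = postings.getPayload();
          if (payload == null) continue;
          float value = PayloadHelper.decodeFloat(payload.bytes, payload.offset);
          // ... accumulate value into this doc's similarity score ...
        }
      }
    }
  }
}
{code}
Either of the fixes suggested above (overriding term frequencies, or docvalues) avoids this per-position payload decode for every vector value.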
h2. 3) Available Solr Vector Search Plugin with LSH Hashing (works now)
[https://github.com/moshebla/solr-vector-scoring]
[~moshebla], who created this JIRA issue, forked option #2 above and added an
LSH implementation so that a KNN filter can be applied before the vector
scoring, greatly improving the speed of the vector search / scoring.
*Docs:*
{code:java}
curl -X POST -H "Content-Type: application/json" \
"http://localhost:8983/solr/{your-collection-name}/update?update.chain=LSH&commit=true" \
--data-binary '
[
{"id":"1", "vector":"1.55,3.53,2.3,0.7,3.44,2.33"},
{"id":"2", "vector":"3.54,0.4,4.16,4.88,4.28,4.25"}
]'
{code}
*Request:*
{code:java}
http://localhost:8983/solr/{your-collection-name}/query?q={!vp f=vector vector="1.55,3.53,2.3,0.7,3.44,2.33" lsh="true" reRankDocs="5"}&fl=id,score,vector,_vector_,_lsh_hash_
{code}
*Response:*
{code:java}
{
"responseHeader":{
"status":0,
"QTime":8,
"params":{
"q":"{!vp f=vector vector=\"1.55,3.53,2.3,0.7,3.44,2.33\" lsh=\"true\"
reRankDocs=\"5\"}",
"fl":"id, score, vector, _vector_, _lsh_hash_",
"wt":"xml"}},
"response":{"numFound":1,"start":0,"maxScore":36.65736,"docs":[
{
"id": "1",
"vector":"1.55,3.53,2.3,0.7,3.44,2.33",
"_vector_":"/z/GZmZAYeuFQBMzMz8zMzNAXCj2QBUeuA==",
"_lsh_hash_":["0_8",
"1_35",
"2_7",
"3_10",
"4_2",
"5_35",
"6_16",
"7_30",
"8_27",
"9_12",
"10_7",
"11_32",
"12_48",
"13_36",
"14_10",
"15_7",
"16_42",
"17_5",
"18_3",
"19_2",
"20_1",
"21_0",
"22_24",
"23_18",
"24_42",
"25_31",
"26_35",
"27_8",
"28_1",
"29_24",
"30_47",
"31_14",
"32_22",
"33_39",
"34_0",
"35_34",
"36_34",
"37_39",
"38_27",
"39_27",
"40_45",
"41_10",
"42_21",
"43_34",
"44_41",
"45_9",
"46_31",
"47_0",
"48_4",
"49_43"],
"score":36.65736}
]
}
}
{code}
*Benefits:*
1) Works now (when you install the plugin)
2) Can be used in regular searches (via the Search Handler) without having to
switch to streaming expressions
3) Provides a KNN implementation that prevents every document's vector from
having to be scored
*Drawbacks:*
1) Slow. Even though the KNN filter avoids unnecessarily scoring all vectors,
using payloads is still inefficient here. If we were going to follow this
approach, at a minimum we should switch from using payloads to overriding term
frequencies for a speedup.
2) Only supports one vector per field per document (same as the original it was
forked from)
3) Doesn't currently have an ASL 2.0 license on it for reuse. [~moshebla] -
since Ali has added a license to the original repo now, can you please pull
that into your repo so that your changes are also covered under ASL 2.0? Thanks!
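For readers unfamiliar with LSH: the _lsh_hash_ tokens in the response above are essentially "hash-index_bucket" pairs, so candidate documents can be pre-selected with an ordinary term query before exact vector scoring. A generic random-projection sketch of the idea (illustrative only; the plugin reportedly uses a superbit LSH implementation per the issue description, and the class and parameter names here are made up):
{code:java}
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

/** Illustrative random-projection LSH; not the plugin's actual (superbit) code. */
public final class LshSketch {
  private final double[][] projections;   // one random direction per hash function
  private final double binWidth;

  public LshSketch(int numHashes, int dims, double binWidth, long seed) {
    Random rnd = new Random(seed);
    this.projections = new double[numHashes][dims];
    this.binWidth = binWidth;
    for (double[] p : projections) {
      for (int d = 0; d < dims; d++) p[d] = rnd.nextGaussian();
    }
  }

  /** Emits tokens shaped like "hashIndex_bucket", similar to the _lsh_hash_ values above. */
  public List<String> hash(double[] vector) {
    List<String> tokens = new ArrayList<>();
    for (int i = 0; i < projections.length; i++) {
      double dot = 0.0;
      for (int d = 0; d < vector.length; d++) dot += projections[i][d] * vector[d];
      long bucket = (long) Math.floor(dot / binWidth);
      tokens.add(i + "_" + bucket);
    }
    return tokens;
  }
}
{code}
Roughly speaking, documents sharing bucket tokens with the query vector form the candidate set (bounded by reRankDocs), and only those candidates get the exact vector score.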
h2. 4) Port over the Elasticsearch implementation
Elasticsearch recently implemented sparse and dense vector fields, though they
chose to release the feature under their proprietary Elastic license instead of
an open source license. HOWEVER, when they were originally implementing this
feature they were intending for it to be open source, and only later decided to
change the license to be proprietary, so most of the feature was built under
the ASL 2.0 license before they restricted it. This means that we can port over
any part of their implementation that existed prior to this commit under the
ASL 2.0 license:
[https://github.com/elastic/elasticsearch/commit/952ddf247a2df8ade64ae067c1904436fd7a2ba8]
*Benefits*:
1) Encodes vectors into BinaryDocValues, so certainly more efficient than the
payload-based Solr plugin approaches above (see the rough encode/decode sketch
after the drawbacks below)
*Drawbacks:*
1) Doesn't work with Solr yet, so would have to do more work to port it over.
2) Can't copy over future improvements, since the feature is no longer open
source (only the early versions of it were)
3) Only supports one vector per field per document currently (though this
shouldn't be too hard to change)
4) Doesn't appear to provide a quantized representation for KNN filtering
prior to scoring, so likely slow at scale.
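As a rough illustration of the BinaryDocValues approach (not Elastic's actual wire format; the field name and layout here are assumptions), packing a dense float vector into a binary docvalues field and decoding it at scoring time looks roughly like this:
{code:java}
import java.nio.ByteBuffer;
import org.apache.lucene.document.BinaryDocValuesField;
import org.apache.lucene.document.Document;
import org.apache.lucene.util.BytesRef;

public final class DenseVectorCodec {
  /** Pack a float vector into bytes for a BinaryDocValuesField (4 bytes per dimension). */
  public static BytesRef encode(float[] vector) {
    ByteBuffer buf = ByteBuffer.allocate(vector.length * Float.BYTES);
    for (float v : vector) {
      buf.putFloat(v);
    }
    return new BytesRef(buf.array());
  }

  /** Decode at scoring time; one contiguous read per document, no payload walking. */
  public static float[] decode(BytesRef bytes) {
    ByteBuffer buf = ByteBuffer.wrap(bytes.bytes, bytes.offset, bytes.length);
    float[] vector = new float[bytes.length / Float.BYTES];
    for (int i = 0; i < vector.length; i++) {
      vector[i] = buf.getFloat();
    }
    return vector;
  }

  public static Document index(float[] vector) {
    Document doc = new Document();
    doc.add(new BinaryDocValuesField("vector_bdv", encode(vector)));  // hypothetical field
    return doc;
  }
}
{code}
Compared to the payload approach, the whole vector comes back in one contiguous per-document read, which is where the efficiency claim above comes from.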
h2. 5) Port over Open Distro for Stretchysearch implementation
[https://github.com/opendistro-for-elasticsearch/k-NN]
Open Distro for "Stretchysearch" (not using the original project name because
Elastic is suing them for trademark infringement for doing so) created their
own ASL 2.0 plugin for their Open Distro distribution which implements vector
search with KNN support. On the surface, this looks like it may be an enhanced
version of the Elastic implementation in #4 above, but it appears to still be a
work in progress, so I'm not sure if it is production ready yet (haven't tried
it).
*Benefits:*
1) Provides a quantized representation of the vectors for KNN, so should still
be fast at scale with lots of docs
2) They appear to be making codec-level changes, which implies this approach
may end up being far more efficient than any of the others mentioned above
over time.
*Drawbacks:*
1) Developed against "Stretchysearch" code base, so will take extra effort to
port to Solr.
2) Looks like a work in progress, so likely not yet ready for production use or
porting.
3) Only supports one vector per field per document currently.
h2. 6) Others please contribute ideas!
There are lots of ways to approach this problem, and we have some really smart
people in this community. All ideas are welcome - possibly we should even
implement a few different approaches.
> Vector Search in Solr (Umbrella Issue)
> --------------------------------------
>
> Key: SOLR-12890
> URL: https://issues.apache.org/jira/browse/SOLR-12890
> Project: Solr
> Issue Type: New Feature
> Reporter: mosh
> Priority: Major
>
> We have recently come across a need to index documents containing vectors
> using Solr, and have even worked on a small POC. We used a URP to calculate
> the LSH (we chose to use the superbit algorithm, but the code is designed in
> a way that the algorithm picked can be easily changed), and stored the vector
> in either sparse or dense form, in a binary field.
> Perhaps an addition of an LSH URP, in conjunction with a query parser that
> uses the same properties to calculate the LSH (or maybe ktree, or some other
> algorithm altogether), should be considered as a Solr feature?