Re: Boosting query results

Walter Underwood Thu, 07 Jul 2016 15:48:07 -0700

I think it works to join against the other collection to get scores. But I’m 
not sure. I think that was suggested for a fairly static collection of 
documents with rapidly changing scoring inputs.
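
A minimal sketch of how such a scoring join request could be assembled, in Python. It reuses the field and collection names from the test data quoted below (B1_s, B1_f, B1_name_ss, scores). The helper name score_join_params and the request parameter name boostq are made up for illustration. The inner query goes in its own parameter and is referenced with v=$boostq, so the same string can also be run directly against the scores collection for comparison. This is untested against Solr 5.5.1, and whether the joined scores come out as hoped is exactly the open question here.

```python
from urllib.parse import urlencode


def score_join_params(term, from_index="scores", from_field="id",
                      to_field="B1_name_ss", score_mode="max"):
    """Build request parameters for a scoring join (hypothetical helper)."""
    # Escape backslashes and quotes so the term can sit inside a quoted phrase.
    safe_term = term.replace("\\", "\\\\").replace('"', '\\"')
    # Inner query, evaluated on the "from" collection: require the boost term
    # and add the stored boost value (B1_f) to the score as a function query.
    inner = f'+B1_s:"{safe_term}" +_val_:"B1_f"'
    # Outer join in local-params syntax; v=$boostq dereferences the inner query.
    join = (f"{{!join from={from_field} to={to_field} "
            f"fromIndex={from_index} score={score_mode} v=$boostq}}")
    return {"q": join, "boostq": inner, "fl": "*,score", "debugQuery": "true"}


params = score_join_params("Boost3")
url = "http://localhost:8983/solr/generic/select?" + urlencode(params)
print(url)
```

Comparing the debugQuery explain output of this request with a plain query for the boostq string on the scores collection should show which joined document each score comes from.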


Personally, I would try a straight popularity boost to see if it got you 80% of 
the way there.
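
A rough sketch of what a straight popularity boost might look like as request parameters, in Python. It assumes the edismax parser and a numeric per-document field, here called popularity_f; that field is hypothetical and does not exist in the sample documents. The helper names are also made up for illustration.

```python
from urllib.parse import urlencode


def popularity_boost_params(user_query, popularity_field="popularity_f"):
    """Build edismax parameters with a multiplicative popularity boost."""
    return {
        "q": user_query,
        "defType": "edismax",
        # edismax multiplies the relevance score by the "boost" function.
        # Solr's log() is base 10, so a popularity of 0 gives log(10) = 1,
        # which leaves the score unchanged; larger values grow slowly.
        "boost": f"log(sum({popularity_field},10))",
        "fl": "*,score",
    }


def select_url(base, params):
    """Append URL-encoded parameters to a collection's /select handler."""
    return f"{base}/select?{urlencode(params)}"


print(select_url("http://localhost:8983/solr/generic",
                 popularity_boost_params("trailer hitch")))
```

The log damping is a design choice: it keeps one very popular product from swamping text relevance. It does not guard against negative popularity values.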

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)


> On Jul 7, 2016, at 2:46 PM, Mark T. Trembley <mark.tremb...@etrailer.com> 
> wrote:
> 
> Yes, the spam issue is something I'm aware of. I plan on having some sanity 
> checks in place to make sure that the boosts are in line with expectations 
> either at query time or while indexing the scores into Solr.
> 
> I just read through that document along with some of the more recent posts 
> about signals, and it appears that I'm going down the same path as 
> Lucidworks. I'm storing the aggregated search term and product id in an 
> alternate index.  It seems that the piece that I'm missing is getting the 
> boost per document. In the following post, it appears to me that Fusion is 
> applying a boost to the main query by obtaining the scores from a set number 
> of documents from the aggregate collection. I'm going to assume that part of 
> its query processing pipeline is to run a query on the aggregation 
> collection to obtain the scores from that query and return them for use on 
> the main query.
> 
> https://lucidworks.com/blog/2015/09/01/better-search-fusion-signals/
> 
> I think I could possibly hack something together on my side that mimics what 
> I think is happening in Fusion, but with my tinkering, it seems to me that 
> using a !join query (with scoring) like I've been trying could handle the job 
> if I could only understand how the query executes on the joined collection 
> and how I can pass a calculated score back to the main query for use in 
> calculating a final score on the main collection.
> 
> 
> On 7/7/2016 1:34 PM, Walter Underwood wrote:
>> If it is running in an environment protected from spammers, you might want 
>> to start with the work that LucidWorks did on click scoring.
>> 
>> https://lucidworks.com/blog/2015/03/23/mixed-signals-using-lucidworks-fusions-signals-api/
>> 
>> Of course, there are no environments free of spammers. I’ve seen them in 
>> enterprise search, too. But they are easier to deal with there. Call them up 
>> and tell them they need to stop immediately or their pages disappear from 
>> the search engine.
>> 
>> wunder
>> Walter Underwood
>> wun...@wunderwood.org
>> http://observer.wunderwood.org/  (my blog)
>> 
>> 
>>> On Jul 7, 2016, at 11:29 AM, Walter Underwood <wun...@wunderwood.org> wrote:
>>> 
>>> You understand that you are making your site extremely easy to spam, right? 
>>> This is how Microsoft became the top hit for “evil empire” on Google.
>>> 
>>> wunder
>>> Walter Underwood
>>> wun...@wunderwood.org
>>> http://observer.wunderwood.org/  (my blog)
>>> 
>>> 
>>>> On Jul 7, 2016, at 11:25 AM, Mark T. Trembley <mark.tremb...@etrailer.com> 
>>>> wrote:
>>>> 
>>>> I've found that it is definitely complicated!
>>>> 
>>>> Essentially what I am attempting to do is boost products based on the 
>>>> number of times that particular product has been selected via historical 
>>>> searches using the same search term or phrase.
>>>> 
>>>> 
>>>> On 7/7/2016 11:55 AM, Walter Underwood wrote:
>>>>> That is a very complicated design. What are you trying to achieve? Maybe 
>>>>> there is a different approach that is simpler.
>>>>> 
>>>>> wunder
>>>>> Walter Underwood
>>>>> wun...@wunderwood.org
>>>>> http://observer.wunderwood.org/  (my blog)
>>>>> 
>>>>> 
>>>>>> On Jul 7, 2016, at 9:26 AM, Mark T. Trembley 
>>>>>> <mark.tremb...@etrailer.com> wrote:
>>>>>> 
>>>>>> That works with static boosts based on documents matching the query 
>>>>>> "Boost2". I want to apply a different boost to documents based on the 
>>>>>> value assigned to Boost2 within the document.
>>>>>> 
>>>>>> From my sample documents, when running a query with "Boost2," I want 
>>>>>> Document2 boosted by 20.0 and Document6 boosted by 15.0:
>>>>>> 
>>>>>> {
>>>>>>  "id" : "Document2_Boost2",
>>>>>>  "B1_s" : "Boost2",
>>>>>>  "B1_f" : 20
>>>>>> }
>>>>>> {
>>>>>>  "id" : "Document6_Boost2",
>>>>>>  "B1_s" : "Boost2",
>>>>>>  "B1_f" : 15
>>>>>> }
>>>>>> 
>>>>>> 
>>>>>> On 7/7/2016 10:21 AM, Walter Underwood wrote:
>>>>>>> This looks like a job for “bq”, the boost query parameter. I used this 
>>>>>>> to boost textbooks which were used at the student’s school. bq does not 
>>>>>>> force documents to be included in the result set. It does affect the 
>>>>>>> ranking of the included documents.
>>>>>>> 
>>>>>>> bq=B1_ss:Boost2 will boost documents that match that. You can use 
>>>>>>> weights, like bq=B1_ss:Boost2^10
>>>>>>> 
>>>>>>> Here is the relationship between fq, q, and bq:
>>>>>>> 
>>>>>>> fq: selection, does not affect ranking
>>>>>>> q: selection and ranking
>>>>>>> bq: does not affect selection, affects ranking
>>>>>>> 
>>>>>>> wunder
>>>>>>> Walter Underwood
>>>>>>> wun...@wunderwood.org
>>>>>>> http://observer.wunderwood.org/  (my blog)
>>>>>>> 
>>>>>>> 
>>>>>>>> On Jul 7, 2016, at 7:30 AM, Mark T. Trembley 
>>>>>>>> <mark.tremb...@etrailer.com> wrote:
>>>>>>>> 
>>>>>>>> I have a question about the best way to rank my results based on a 
>>>>>>>> score field that can have different values per document and where each 
>>>>>>>> document can have different scores based on which term is queried.
>>>>>>>> 
>>>>>>>> Essentially what I'm wanting to have happen is provide a list of terms 
>>>>>>>> that when matched via a query it returns a corresponding score to help 
>>>>>>>> boost the original document. So if I had a document with a 
>>>>>>>> multi-valued field named B1_ss with terms [Boost1|10], [Boost2|20], 
>>>>>>>> [Boost3|100] and my search query is "Boost2", I want that document's 
>>>>>>>> result to be boosted by 20. Also note that "Boost2" can boost 
>>>>>>>> different documents at different levels. The query to select the 
>>>>>>>> actual documents will select against other fields in the document and 
>>>>>>>> could possibly return documents with any combination of B1 terms.
>>>>>>>> 
>>>>>>>> I'm still trying to figure out how best to model this in my index, 
>>>>>>>> either as child documents, or in another collection, or if it would 
>>>>>>>> make more sense to figure out how to make it work via payloads or by 
>>>>>>>> boosting the terms at index time.
>>>>>>>> 
>>>>>>>> I'm running Solr 5.5.1 in cloud mode. Each server has a complete 
>>>>>>>> replica of all collections.
>>>>>>>> 
>>>>>>>> The document structure I've been toying with the most is to put the 
>>>>>>>> boosts into a separate index and join them using !join syntax and 
>>>>>>>> returning the scores, but I've not had any luck getting quality 
>>>>>>>> results from those tests. The extra "scores" index is structured like 
>>>>>>>> this (I'll add the json for my test collections at the end of the 
>>>>>>>> email):
>>>>>>>> id:Document1_Boost1
>>>>>>>> B1_s:Boost1
>>>>>>>> B1_f:10
>>>>>>>> id:Document1_Boost3
>>>>>>>> B1_s:Boost3
>>>>>>>> B1_f:100
>>>>>>>> Using this structure, I get close, but the scores are not what I'm 
>>>>>>>> expecting. If I use the following query, the explain says it's using 
>>>>>>>> the score from Document6_Boost2 even though my query is specifying 
>>>>>>>> B1_s:Boost3
>>>>>>>> http://localhost:8983/solr/generic/select?q={!join from=id 
>>>>>>>> to=B1_name_ss fromIndex=scores 
>>>>>>>> score=max}B1_s:Boost3{!func}B1_f&fl=*,score&debugQuery=true
>>>>>>>> 
>>>>>>>> <lst name="explain">
>>>>>>>> <str name="Document6">
>>>>>>>> *3.379996* = Score based on join value Document6_Boost2
>>>>>>>> </str>
>>>>>>>> <str name="Document1">
>>>>>>>> *2.2533307* = Score based on join value Document1_Boost1
>>>>>>>> </str>
>>>>>>>> <str name="Document7">
>>>>>>>> *0.24786638* = Score based on join value Document7_Boost333
>>>>>>>> </str>
>>>>>>>> <str name="Document3">
>>>>>>>> *0.0* = Score based on join value Document3_NoBoost
>>>>>>>> </str>
>>>>>>>> </lst>
>>>>>>>> 
>>>>>>>> My guess is that it's now doing an all document query on the "scores" 
>>>>>>>> collection to return the scores in addition to the B1_s query I've 
>>>>>>>> passed in. I can't figure out where it's getting those scores from as 
>>>>>>>> a simple query against the "scores" collection returns scores like I'd 
>>>>>>>> expect to see them based on a similar query:
>>>>>>>> http://192.168.1.194:8983/solr/scores/select?q=B1_s:Boost3 AND 
>>>>>>>> _val_:B1_f&fl=score,*&debugQuery=true
>>>>>>>> 
>>>>>>>> <lst name="explain">
>>>>>>>> <str name="Document1_Boost3">
>>>>>>>> *46.834885* = sum of:
>>>>>>>>   1.7682717 = weight(B1_s:Boost3 in 1) [ClassicSimilarity], result of:
>>>>>>>>     1.7682717 = score(doc=1,freq=1.0), product of:
>>>>>>>>       0.8926926 = queryWeight, product of:
>>>>>>>>         1.9808292 = idf(docFreq=2, maxDocs=8)
>>>>>>>>         0.45066613 = queryNorm
>>>>>>>>       1.9808292 = fieldWeight in 1, product of:
>>>>>>>>         1.0 = tf(freq=1.0), with freq of:
>>>>>>>>           1.0 = termFreq=1.0
>>>>>>>>         1.9808292 = idf(docFreq=2, maxDocs=8)
>>>>>>>>         1.0 = fieldNorm(doc=1)
>>>>>>>>   45.066612 = FunctionQuery(float(B1_f)), product of:
>>>>>>>>     100.0 = float(B1_f)=100.0
>>>>>>>>     1.0 = boost
>>>>>>>>     0.45066613 = queryNorm
>>>>>>>> </str>
>>>>>>>> <str name="Document6_Boost3">
>>>>>>>> *15.288256* = sum of:
>>>>>>>>   1.7682717 = weight(B1_s:Boost3 in 5) [ClassicSimilarity], result of:
>>>>>>>>     1.7682717 = score(doc=5,freq=1.0), product of:
>>>>>>>>       0.8926926 = queryWeight, product of:
>>>>>>>>         1.9808292 = idf(docFreq=2, maxDocs=8)
>>>>>>>>         0.45066613 = queryNorm
>>>>>>>>       1.9808292 = fieldWeight in 5, product of:
>>>>>>>>         1.0 = tf(freq=1.0), with freq of:
>>>>>>>>           1.0 = termFreq=1.0
>>>>>>>>         1.9808292 = idf(docFreq=2, maxDocs=8)
>>>>>>>>         1.0 = fieldNorm(doc=5)
>>>>>>>>   13.519984 = FunctionQuery(float(B1_f)), product of:
>>>>>>>>     30.0 = float(B1_f)=30.0
>>>>>>>>     1.0 = boost
>>>>>>>>     0.45066613 = queryNorm
>>>>>>>> </str>
>>>>>>>> </lst>
>>>>>>>> 
>>>>>>>> I feel like I'm getting close to what I need, but it's just not clear 
>>>>>>>> to me what I'm missing at this point.
>>>>>>>> 
>>>>>>>> The other option I've been toying with is using payloads, but actually 
>>>>>>>> utilizing the payloads as part of the scoring process is beyond me at 
>>>>>>>> this time.
>>>>>>>> 
>>>>>>>> Any thoughts or hints on the best way to boost the relevancy of these 
>>>>>>>> scores would be appreciated.
>>>>>>>> Thanks
>>>>>>>> Mark
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> GENERIC:
>>>>>>>> {
>>>>>>>>   "id" : "Document1",
>>>>>>>>   "B1_ss" : ["Boost1|10","Boost3|100"],
>>>>>>>>   "title_s" : "Title1",
>>>>>>>>   "otherstuff_ss" : ["stuff1","suggestion"],
>>>>>>>>   "B1_name_ss" : ["Document1_Boost1","Document1_Boost3"]
>>>>>>>> },
>>>>>>>> {
>>>>>>>>   "id" : "Document2",
>>>>>>>>   "B1_ss" : ["Boost2|20"],
>>>>>>>>   "name_s" : "Product2",
>>>>>>>>   "title_s" : "Title2",
>>>>>>>>   "otherstuff_ss" : ["stuff2","recommendation"],
>>>>>>>>   "B1_name_ss" : ["Document2_Boost1"]
>>>>>>>> },
>>>>>>>> {
>>>>>>>>   "id" : "Document3",
>>>>>>>>   "name_s" : "Product3",
>>>>>>>>   "B1_ss" : ["NoBoost"],
>>>>>>>>   "title_s" : "Title3",
>>>>>>>>   "otherstuff_ss" : ["stuff3","new","suggestion"],
>>>>>>>>   "B1_name_ss" : ["Document3_NoBoost"]
>>>>>>>> },
>>>>>>>> {
>>>>>>>>   "id" : "Document4",
>>>>>>>>   "name_s" : "Product4",
>>>>>>>>   "title_s" : "Title4",
>>>>>>>>   "otherstuff_ss" : ["stuff4","old","suggestion"]
>>>>>>>> },
>>>>>>>> {
>>>>>>>>   "id" : "Document5",
>>>>>>>>   "name_s" : "Product5",
>>>>>>>>   "title_s" : "Title5",
>>>>>>>>   "otherstuff_ss" : ["stuff5","recommendation"]
>>>>>>>> },
>>>>>>>> {
>>>>>>>>   "id" : "Document6",
>>>>>>>>   "name_s" : "Product6",
>>>>>>>>   "B1_ss" : ["Boost2|15","Boost3|30"],
>>>>>>>>   "title_s" : "Title6",
>>>>>>>>   "B1_name_ss" : ["Document6_Boost2","Document6_Boost3"]
>>>>>>>> },
>>>>>>>> {
>>>>>>>>   "id" : "Document7",
>>>>>>>>   "name_s" : "Product7",
>>>>>>>>   "B1_ss" : ["NoBoost","Boost333|1.1"],
>>>>>>>>   "title_s" : "Title7",
>>>>>>>>   "B1_name_ss" : ["Document7_NoBoost","Document7_Boost333"]
>>>>>>>> }
>>>>>>>> 
>>>>>>>> SCORES:
>>>>>>>> {
>>>>>>>>   "id" : "Document1_Boost1",
>>>>>>>>   "B1_s" : "Boost1",
>>>>>>>>   "B1_f" : 10
>>>>>>>> },
>>>>>>>> {
>>>>>>>>   "id" : "Document1_Boost3",
>>>>>>>>   "B1_s" : "Boost3",
>>>>>>>>   "B1_f" : 100
>>>>>>>> },
>>>>>>>> {
>>>>>>>>   "id" : "Document2_Boost2",
>>>>>>>>   "B1_s" : "Boost2",
>>>>>>>>   "B1_f" : 20
>>>>>>>> },
>>>>>>>> {
>>>>>>>>   "id" : "Document3_NoBoost",
>>>>>>>>   "B1_s" : "NoBoost"
>>>>>>>> },
>>>>>>>> {
>>>>>>>>   "id" : "Document6_Boost2",
>>>>>>>>   "B1_s" : "Boost2",
>>>>>>>>   "B1_f" : 15
>>>>>>>> },
>>>>>>>> {
>>>>>>>>   "id" : "Document6_Boost3",
>>>>>>>>   "B1_s" : "Boost3",
>>>>>>>>   "B1_f" : 30
>>>>>>>> },
>>>>>>>> {
>>>>>>>>   "id" : "Document7_NoBoost",
>>>>>>>>   "B1_s" : "NoBoost"
>>>>>>>> },
>>>>>>>> {
>>>>>>>>   "id" : "Document7_Boost333",
>>>>>>>>   "B1_s" : "Boost333",
>>>>>>>>   "B1_f" : 1.1
>>>>>>>> }
>>>>>>>> 
>> 
>
