Re: Combined Dismax and Block Join Scoring on nested documents

Alexandre Rafalovitch Mon, 21 Nov 2016 14:00:07 -0800

You could do:
*) LinkedIn
*) Wiki
*) Write it up, give it to me and I'll stick it as a guest post on my
blog (with attribution of your choice)
*) Write it up, give it to Lucidworks and they may (not sure about
rules) stick it on their blog


Regards,
    Alex.
----
http://www.solr-start.com/ - Resources for Solr users, new and experienced


On 22 November 2016 at 02:36, Mike Allen
<mike.al...@thecommercepartnership.com> wrote:
> Sure thing Alex. I don't actually do any personal blogging, but if there's a
> suitable place you'd suggest (the Solr Wiki perhaps), I'd be more than happy
> to write something up. What goes around comes around!
>
> -----Original Message-----
> From: Alexandre Rafalovitch [mailto:arafa...@gmail.com]
> Sent: 21 November 2016 13:01
> To: solr-user
> Subject: Re: Combined Dismax and Block Join Scoring on nested documents
>
> A blog article about what you learned would be very welcome. These edge cases 
> are something other people could certainly learn from.
> Share the knowledge forward etc.
>
> Regards,
>    Alex.
> ----
> http://www.solr-start.com/ - Resources for Solr users, new and experienced
>
>
> On 21 November 2016 at 23:57, Mike Allen 
> <mike.al...@thecommercepartnership.com> wrote:
>> Hi Mikhail,
>>
>> Thanks for your advice, it went a long way towards helping me get the right
>> documents in the first place, especially parameterising the block join with
>> an explicit v, as otherwise it was a nightmare of parser errors. Not to
>> mention I'm still figuring out the nuances of where I need whitespace and
>> where I don't! However, I spent part of the weekend fiddling around with
>> spaces and +'s and I believe I've got it working as I'd hoped.
>>
>> Again, many thanks,
>>
>> Mike
>>
>> -----Original Message-----
>> From: Mikhail Khludnev [mailto:m...@apache.org]
>> Sent: 18 November 2016 12:58
>> To: solr-user
>> Subject: Re: Combined Dismax and Block Join Scoring on nested
>> documents
>>
>> Hello Mike,
>> Structured queries in Solr are rather cumbersome.
>> Start from:
>> q=+{!dismax v="skirt" qf="name"} +{!parent which=content_type:product
>> score=min v=$childq}&childq=+in_stock:true^=0 {!func}list_price_gbp&...
>>
>> Besides "explain", there is a parsed query entry in the debug output that is
>> more useful for troubleshooting.
>> Please also make sure that + is properly encoded as %2B so that it survives
>> the HTTP request.
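
A minimal sketch of building that request so the + signs survive URL encoding. The host and the core name "products" are assumptions for illustration, and the child query is referenced as $childq (Solr's parameter dereferencing); only the query pieces themselves come from the message above.

```python
from urllib.parse import urlencode, parse_qs

# Hypothetical Solr endpoint; adjust host and core name to your installation.
SOLR_SELECT = "http://localhost:8983/solr/products/select"

params = {
    # Two mandatory (+) clauses: the dismax match on the parent, and the
    # block join that lifts the minimum child score onto the parent.
    "q": '+{!dismax v="skirt" qf="name"} '
         "+{!parent which=content_type:product score=min v=$childq}",
    # Child query: in_stock is required but scores a constant 0 (^=0),
    # so only the price function contributes to the child score.
    "childq": "+in_stock:true^=0 {!func}list_price_gbp",
    "sort": "score asc",
    "debugQuery": "on",
}

# urlencode percent-encodes a literal '+' as %2B and writes spaces as '+'.
url = SOLR_SELECT + "?" + urlencode(params)
print(url)

# Decoding the query string gives back the original parameter values.
decoded = parse_qs(url.split("?", 1)[1])
```

Letting a library do the encoding avoids having to decide by hand which + is an operator and which is an encoded space.
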
>>
>> On Fri, Nov 18, 2016 at 2:14 PM, Mike Allen < 
>> mike.al...@thecommercepartnership.com> wrote:
>>
>>> Apologies if I'm doing something incredibly stupid as I'm new to Solr.
>>> I am having an issue with scoring child documents in a block join
>>> query when including a dismax query. I'm actually a little unclear on
>>> whether or not that's a complete oxymoron, combining dismax and block join.
>>>
>>>
>>>
>>> Problem statement: I have a set of product documents, which contain the
>>> product names and descriptions. Each product contains nested variant
>>> documents (see below for an abridged example), which hold the boolean
>>> stock status (in_stock) and the variant price (list_price_gbp). I want to
>>> do a Dismax query of, say, "skirt" on the product name (name) and sort the
>>> resulting product documents by the minimum price (list_price_gbp) of
>>> their child variant documents. Note that, although the abridged
>>> document doesn't show them, there are a number of other arbitrary
>>> fields which may be used as filter queries on the child documents,
>>> for example size or colour, which will in effect change the "active"
>>> minimum price of a product. Hence, denormalizing, or flattening, the
>>> documents is not really an option I want to pursue.
>>>
>>>
>>>
>>> An abridged example document returned by the Solr Admin Query console
>>> which I am querying:
>>>
>>>
>>>
>>> <doc>
>>>   <str name="id">12345</str>
>>>   <str name="content_type">product</str>
>>>   <str name="name">black flared skirt</str>
>>>   <float name="min_list_price_gbp">40.0</float>
>>>   <result name="doc" numFound="2" start="0">
>>>     <doc>
>>>       <str name="skuid">12345abcd</str>
>>>       <str name="productid">12345</str>
>>>       <str name="content_type">variant</str>
>>>       <float name="list_price_gbp">65.0</float>
>>>       <bool name="in_stock">true</bool>
>>>     </doc>
>>>     <doc>
>>>       <str name="skuid">12345fghi</str>
>>>       <str name="productid">12345</str>
>>>       <str name="content_type">variant</str>
>>>       <float name="list_price_gbp">40.0</float>
>>>       <bool name="in_stock">true</bool>
>>>     </doc>
>>>   </result>
>>> </doc>
>>>
>>>
>>>
>>> So I am familiar with the block join score mode; setting aside the
>>> dismax aspect for now, this query, using the Function Query
>>> {!func}list_price_gbp, with score ascending, returns documents
>>> ordered correctly, with a £2.00 (cheapest) product first:
>>>
>>>
>>>
>>> q={!parent which=content_type:product score=min}+(in_stock:(true)){!func}list_price_gbp
>>> &doc.q={!terms f="productid" v=$row.id}&doc.rows=1000&doc.fl=score,*
>>> &doc.fq=(in_stock:(true))&start=0&rows=103
>>> &fl=score,*,doc:[subquery]&sort=score asc&debugQuery=on&wt=xml
>>>
>>>
>>>
>>> The "explain" for this is:
>>>
>>>
>>>
>>> 2.0000184 = Score based on 1 child docs in range from 26752 to 26752, best match:
>>>   2.0000184 = sum of:
>>>     1.8374416E-5 = weight(in_stock:T in 26752) [], result of:
>>>       1.8374416E-5 = score(doc=26752,freq=1.0 = termFreq=1.0), product of:
>>>         1.8374416E-5 = idf(docFreq=27211, docCount=27211)
>>>         1.0 = tfNorm, computed from:
>>>           1.0 = termFreq=1.0
>>>           1.2 = parameter k1
>>>           0.0 = parameter b (norms omitted for field)
>>>     2.0 = FunctionQuery(float(list_price_gbp)), product of:
>>>       2.0 = float(list_price_gbp)=2.0
>>>       1.0 = boost
>>>       1.0 = queryNorm
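
The numbers in this explain can be reproduced by hand, which helps show where the tiny extra fraction comes from. The sketch below assumes the standard BM25 formulas used by Lucene's BM25Similarity in this generation of Solr; the function name is illustrative only.

```python
import math

def bm25_idf(doc_freq: int, doc_count: int) -> float:
    """BM25 inverse document frequency: ln(1 + (N - n + 0.5) / (n + 0.5))."""
    return math.log(1 + (doc_count - doc_freq + 0.5) / (doc_freq + 0.5))

k1 = 1.2
freq = 1.0

# Every variant has in_stock:T here (docFreq == docCount), so the idf is tiny.
idf = bm25_idf(doc_freq=27211, doc_count=27211)

# Norms are omitted for the field, so b = 0 and length normalisation drops out.
tf_norm = freq * (k1 + 1) / (freq + k1)

in_stock_score = idf * tf_norm   # the tiny fraction added to every result
price_score = 2.0                # value of {!func}list_price_gbp for this child
child_score = in_stock_score + price_score

print(f"idf={idf:.7e} tfNorm={tf_norm} total={child_score:.7f}")
```

Because the in_stock clause matches every child in the explain, its contribution is the same constant for all results, so it does not change the ordering by price.
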
>>>
>>>
>>>
>>> Even though this is doing what I want, I have a slight niggle that
>>> the overall score is not just the result of the Function Query.
>>> However, as all results get the same tiny fraction added, it doesn't matter.
>>>
>>>
>>>
>>> However, when I prepend my dismax query:
>>>
>>>
>>>
>>> q={!dismax v="skirt" qf="name"}+{!parent which=content_type:product score=min}+(in_stock:(true)){!func}list_price_gbp
>>> &doc.q={!terms f="productid" v=$row.id}&doc.rows=1000&doc.fl=score,*
>>> &doc.fq=(in_stock:(true))&start=0&rows=103
>>> &fl=score,*,doc:[subquery]&sort=score asc&debugQuery=on&wt=xml
>>>
>>>
>>>
>>> The scoring is only dependent on the dismax scoring, where the "explain"
>>> for this is:
>>>
>>>
>>>
>>> 2.7600822 = sum of:
>>>   2.7600822 = weight(name:skirt in 13406) [], result of:
>>>     2.7600822 = score(doc=13406,freq=1.0 = termFreq=1.0), product of:
>>>       3.5851278 = idf(docFreq=103, docCount=3731)
>>>       0.76987 = tfNorm, computed from:
>>>         1.0 = termFreq=1.0
>>>         1.2 = parameter k1
>>>         0.75 = parameter b
>>>         4.108818 = avgFieldLength
>>>         7.111111 = fieldLength
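
For comparison, here is a rough sketch of how the tfNorm in this explain depends on field length, again assuming the standard BM25 term-frequency normalisation. The idf is copied from the explain rather than recomputed, and the function name is illustrative.

```python
def bm25_tf_norm(freq, field_length, avg_field_length, k1=1.2, b=0.75):
    """BM25 term-frequency factor with document length normalisation."""
    length_factor = 1 - b + b * field_length / avg_field_length
    return freq * (k1 + 1) / (freq + k1 * length_factor)

IDF_SKIRT = 3.5851278  # idf(docFreq=103, docCount=3731), taken from the explain

# A longer-than-average name field (7.111111 vs 4.108818) is penalised.
long_name = bm25_tf_norm(1.0, field_length=7.111111, avg_field_length=4.108818)

# A name of length 4.0, as in the later explains in this message, scores higher.
short_name = bm25_tf_norm(1.0, field_length=4.0, avg_field_length=4.108818)

print(f"long:  tfNorm={long_name:.5f} score={IDF_SKIRT * long_name:.5f}")
print(f"short: tfNorm={short_name:.5f} score={IDF_SKIRT * short_name:.5f}")
```

This is purely a text relevance score on the parent's name field, which is why sorting by it in ascending order lists the weakest text matches first and ignores price entirely.
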
>>>
>>>
>>>
>>> So in actual fact, with score ascending, it is ordering the results
>>> by least matching first and the nested document list_price_gbp is
>>> irrelevant. I strongly suspect I am being totally dumb and that this
>>> is expected behaviour for an obvious reason that escapes me, apart
>>> from perhaps it's because the two scoring methods are just plainly
>>> incompatible.
>>>
>>>
>>>
>>> I have additionally tried just doing a lucene query instead:
>>>
>>>
>>>
>>> q=+name:skirt +{!parent which=content_type:product score=min} (in_stock:(true)){!func}list_price_gbp
>>> &doc.q={!terms f="productid" v=$row.id}&doc.rows=1000&doc.fl=score,*
>>> &doc.fq=(in_stock:(true))&start=0&rows=103
>>> &fl=score,*,doc:[subquery]&sort=score asc&debugQuery=on&wt=xml
>>>
>>>
>>>
>>> The "explain" of this indicates it's scoring products, for which
>>> list_price_gbp simply does not exist, as the Function Query always
>>> returns zero.
>>>
>>>
>>>
>>> 4.6243963 = sum of:
>>>   3.624396 = weight(name:skirt in 18113) [], result of:
>>>     3.624396 = score(doc=18113,freq=1.0 = termFreq=1.0), product of:
>>>       3.5851278 = idf(docFreq=103, docCount=3731)
>>>       1.0109531 = tfNorm, computed from:
>>>         1.0 = termFreq=1.0
>>>         1.2 = parameter k1
>>>         0.75 = parameter b
>>>         4.108818 = avgFieldLength
>>>         4.0 = fieldLength
>>>   1.0 = {!cache=false}ConstantScore(BitDocIdSetFilterWrapper(QueryBitSetProducer(content_type:product))), product of:
>>>     1.0 = boost
>>>     1.0 = queryNorm
>>>   0.0 = FunctionQuery(float(list_price_gbp)), product of:
>>>     0.0 = float(list_price_gbp)=0.0
>>>     1.0 = boost
>>>     1.0 = queryNorm
>>>
>>>
>>>
>>> Indeed, if I change the Function Query field to a product scoped
>>> field, min_list_price_gbp, like so:
>>>
>>>
>>>
>>> q=+name:skirt +{!parent which=content_type:product score=min}+(in_stock:(true)){!func}min_list_price_gbp
>>> &doc.q={!terms f="productid" v=$row.id}&doc.rows=1000&doc.fl=score,*
>>> &doc.fq=(in_stock:(true))&start=0&rows=103
>>> &fl=score,*,doc:[subquery]&sort=score asc&debugQuery=on&wt=xml
>>>
>>>
>>>
>>> then the "explain" certainly does show the Function Query evaluating:
>>>
>>>
>>>
>>> 18.624397 = sum of:
>>>   3.624396 = weight(name:skirt in 17890) [], result of:
>>>     3.624396 = score(doc=17890,freq=1.0 = termFreq=1.0), product of:
>>>       3.5851278 = idf(docFreq=103, docCount=3731)
>>>       1.0109531 = tfNorm, computed from:
>>>         1.0 = termFreq=1.0
>>>         1.2 = parameter k1
>>>         0.75 = parameter b
>>>         4.108818 = avgFieldLength
>>>         4.0 = fieldLength
>>>   1.0 = {!cache=false}ConstantScore(BitDocIdSetFilterWrapper(QueryBitSetProducer(content_type:product))), product of:
>>>     1.0 = boost
>>>     1.0 = queryNorm
>>>   14.0 = FunctionQuery(float(min_list_price_gbp)), product of:
>>>     14.0 = float(min_list_price_gbp)=14.0
>>>     1.0 = boost
>>>     1.0 = queryNorm
>>>
>>>
>>>
>>> My grasp of the syntax is pretty flakey, so I would be immensely
>>> grateful if someone could point out if I'm just doing something
>>> incredibly dumb. In my head, I see what I am trying to do as
>>>
>>>
>>>
>>> (some dismax or lucene query on parent document [e.g. "skirt"])
>>>   => (get a subset of these parent docs based on a block join)
>>>     => (where the children match a bunch of arbitrary filter queries [e.g. "colour:red"])
>>>       => (then subquery the child docs that match the same filter queries [e.g. "colour:red"])
>>>         => (then score this subset of child documents)
>>>           => (and order by that score)
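
The intended pipeline above can be sketched outside Solr as a toy in-memory model. This is not how Solr executes a block join; it only illustrates the desired semantics. The class and function names are hypothetical, and every product other than 12345 is invented for illustration.

```python
from dataclasses import dataclass, field

@dataclass
class Variant:
    skuid: str
    list_price_gbp: float
    in_stock: bool

@dataclass
class Product:
    id: str
    name: str
    variants: list = field(default_factory=list)

def cheapest_first(products, term, child_filter):
    """Rank products whose name contains `term` by the minimum price of
    the children that pass `child_filter` (the role of score=min)."""
    ranked = []
    for product in products:
        if term not in product.name.split():
            continue  # parent-level text match failed
        children = [v for v in product.variants if child_filter(v)]
        if not children:
            continue  # no matching child, so the parent is dropped
        ranked.append((min(v.list_price_gbp for v in children), product.id))
    return sorted(ranked)  # equivalent of sort=score asc

products = [
    Product("12345", "black flared skirt", [
        Variant("12345abcd", 65.0, True),
        Variant("12345fghi", 40.0, True),
    ]),
    Product("67890", "red pencil skirt", [     # invented example data
        Variant("67890abcd", 30.0, False),     # cheapest, but out of stock
        Variant("67890efgh", 55.0, True),
    ]),
    Product("24680", "denim jacket", [         # invented example data
        Variant("24680abcd", 20.0, True),
    ]),
]

result = cheapest_first(products, "skirt", lambda v: v.in_stock)
print(result)  # [(40.0, '12345'), (55.0, '67890')]
```

Note how the out-of-stock 30.0 variant does not count towards the second product's price: the child filter changes the "active" minimum, which is why a precomputed parent field such as min_list_price_gbp is not enough.
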
>>>
>>>
>>>
>>>
>>> Is this actually possible? I've been googling about this for a day or
>>> so and can't quite find anything definitive. I may try to dive into
>>> the Solr source code, but I'm a C# guy, not Java, and I don't yet have
>>> a debuggable environment set up as I haven't needed one, so that could
>>> prove pretty painful.
>>>
>>>
>>>
>>> Any help would be appreciated, even if it is just "can't be done", as
>>> at least I could stop chasing my tail.
>>>
>>>
>>>
>>> Mike
>>
>>
>> --
>> Sincerely yours
>> Mikhail Khludnev
>>
>
