RE: Combined Dismax and Block Join Scoring on nested documents

Mike Allen Mon, 21 Nov 2016 04:58:18 -0800

Hi Mikhail,

Thanks for your advice, it went a long way towards helping me get the right 
documents in the first place, especially parameterising the block join with an 
explicit v, as otherwise it was a nightmare of parser errors. Not to mention 
I'm still figuring out the nuances of where I need whitespace and where I 
don't! However, I spent part of the weekend fiddling around with spaces and 
+'s and I believe I've got it working as I'd hoped.


Again, many thanks,

Mike

-----Original Message-----
From: Mikhail Khludnev [mailto:m...@apache.org] 
Sent: 18 November 2016 12:58
To: solr-user
Subject: Re: Combined Dismax and Block Join Scoring on nested documents

Hello Mike,
Structured queries in Solr are quite cumbersome.
Start from:
q=+{!dismax v="skirt" qf="name"} +{!parent which=content_type:product score=min 
v=$childq}&childq=+in_stock:true^=0 {!func}list_price_gbp&...

Besides "explain", there is a parsed query entry in the debug output that is more 
useful for troubleshooting purposes.
Please also make sure that + is properly encoded as %2B so that it survives the 
HTTP request; an unencoded + is decoded as a space.
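
The encoding point can be checked without a running Solr. The following is a minimal sketch using only the Python standard library. It assumes the child query is referenced as $childq (Solr's parameter dereferencing syntax), and the sort and debugQuery parameters are carried over from the original question purely for illustration.

```python
from urllib.parse import urlencode, parse_qs

# Query parameters modelled on the suggestion above. The "$childq" reference
# and the choice of extra parameters are assumptions made for illustration.
params = {
    "q": '+{!dismax v="skirt" qf="name"} '
         '+{!parent which=content_type:product score=min v=$childq}',
    "childq": "+in_stock:true^=0 {!func}list_price_gbp",
    "sort": "score asc",
    "debugQuery": "on",
}

# urlencode percent-encodes a literal "+" as %2B and encodes a space as "+".
query_string = urlencode(params)

# Decoding the encoded string recovers the original "+" operators.
decoded = parse_qs(query_string)

# If the raw text is sent unencoded, form decoding turns every "+" into a
# space, so the "must" operators silently disappear.
mangled = parse_qs("q=+name:skirt +in_stock:true")
```

Printing query_string shows the form that is safe to paste after /select? in a URL. Comparing decoded["q"] with mangled["q"] shows why a query that looks right in the browser can still parse differently on the server.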

On Fri, Nov 18, 2016 at 2:14 PM, Mike Allen < 
mike.al...@thecommercepartnership.com> wrote:

> Apologies if I'm doing something incredibly stupid as I'm new to Solr. 
> I am having an issue with scoring child documents in a block join 
> query when including a dismax query. I'm actually a little unclear on 
> whether or not that's a complete oxymoron, combining dismax and block join.
>
>
>
> Problem statement: Given a set of Product documents - which contain 
> the product names and descriptions - which contain nested variant 
> documents (see below for abridged example) - which contain the boolean 
> stock status (in_stock) and the variant prices (list_price_gbp) - I want to do a 
> Dismax query of, say, "skirt" on the product name (name) and sort the 
> resulting product documents by the minimum price (list_price_gbp) of 
> their child variant documents. Note that, although the abridged 
> document doesn't show them, there are a number of other arbitrary 
> fields which may be used as filter queries on the child documents, for 
> example size or colour, which will in effect change the "active" 
> minimum price of a product. Hence, denormalizing, or flattening, the 
> documents is not really an option I want to pursue.
>
>
>
> An abridged example document returned by the Solr Admin Query console 
> which I am querying:
>
>
>
> <doc>
>   <str name="id">12345</str>
>   <str name="content_type">product</str>
>   <str name="name">black flared skirt</str>
>   <float name="min_list_price_gbp">40.0</float>
>   <result name="doc" numFound="2" start="0">
>     <doc>
>       <str name="skuid">12345abcd</str>
>       <str name="productid">12345</str>
>       <str name="content_type">variant</str>
>       <float name="list_price_gbp">65.0</float>
>       <bool name="in_stock">true</bool>
>     </doc>
>     <doc>
>       <str name="skuid">12345fghi</str>
>       <str name="productid">12345</str>
>       <str name="content_type">variant</str>
>       <float name="list_price_gbp">40.0</float>
>       <bool name="in_stock">true</bool>
>     </doc>
>   </result>
> </doc>
>
>
>
> So I am familiar with the block join score mode; setting aside the 
> dismax aspect for now, this query, using the Function Query 
> {!func}list_price_gbp, with score ascending, returns documents ordered 
> correctly, with a £2.00 (cheapest) product first:
>
>
>
> q={!parent which=content_type:product score=min}+(in_stock:(true)){!func}list_price_gbp
> &doc.q={!terms f="productid" v=$row.id}&doc.rows=1000&doc.fl=score,*&doc.fq=(in_stock:(true))
> &start=0&rows=103&fl=score,*,doc:[subquery]&sort=score asc&debugQuery=on&wt=xml
>
>
>
> The "explain" for this is:
>
>
>
> 2.0000184 = Score based on 1 child docs in range from 26752 to 26752, best match:
>
>   2.0000184 = sum of:
>
>     1.8374416E-5 = weight(in_stock:T in 26752) [], result of:
>
>       1.8374416E-5 = score(doc=26752,freq=1.0 = termFreq=1.0
>
> ), product of:
>
>         1.8374416E-5 = idf(docFreq=27211, docCount=27211)
>
>         1.0 = tfNorm, computed from:
>
>           1.0 = termFreq=1.0
>
>           1.2 = parameter k1
>
>           0.0 = parameter b (norms omitted for field)
>
>     2.0 = FunctionQuery(float(list_price_gbp)), product of:
>
>       2.0 = float(list_price_gbp)=2.0
>
>       1.0 = boost
>
>       1.0 = queryNorm
>
>
>
> Even though this is doing what I want, I have a slight niggle that the 
> overall score is not just the result of the Function Query. However, 
> as all results get the same tiny fraction added, it doesn't matter.
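
The tiny fraction is the BM25 contribution of the in_stock:T clause. A minimal sketch reproducing it from the numbers in the explain output, assuming Lucene's BM25 idf formula ln(1 + (docCount - docFreq + 0.5) / (docFreq + 0.5)); the function and variable names are made up for this illustration.

```python
import math

def bm25_idf(doc_freq: int, doc_count: int) -> float:
    """Lucene BM25 idf: ln(1 + (docCount - docFreq + 0.5) / (docFreq + 0.5))."""
    return math.log(1 + (doc_count - doc_freq + 0.5) / (doc_freq + 0.5))

# Values taken from the "explain" output above.
# docFreq == docCount: every document with an in_stock value has the term T.
idf = bm25_idf(doc_freq=27211, doc_count=27211)
tf_norm = 1.0   # as reported by explain (norms omitted for the field)
price = 2.0     # FunctionQuery(float(list_price_gbp))

# The child score is the sum of the term clause and the function clause.
child_score = idf * tf_norm + price
```

Because the term occurs in every document that has the field, its idf is close to zero and nearly identical for all children, which is consistent with the ordering being unaffected. A constant-score boost on that clause, such as in_stock:true^=0, removes the contribution altogether.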
>
>
>
> However, when I prepend my dismax query:
>
>
>
> q={!dismax v="skirt" qf="name"}+{!parent which=content_type:product score=min}+(in_stock:(true)){!func}list_price_gbp
> &doc.q={!terms f="productid" v=$row.id}&doc.rows=1000&doc.fl=score,*&doc.fq=(in_stock:(true))
> &start=0&rows=103&fl=score,*,doc:[subquery]&sort=score asc&debugQuery=on&wt=xml
>
>
>
> The scoring is only dependent on the dismax scoring, where the "explain" 
> for this is:
>
>
>
> 2.7600822 = sum of:
>
>   2.7600822 = weight(name:skirt in 13406) [], result of:
>
>     2.7600822 = score(doc=13406,freq=1.0 = termFreq=1.0
>
> ), product of:
>
>       3.5851278 = idf(docFreq=103, docCount=3731)
>
>       0.76987 = tfNorm, computed from:
>
>         1.0 = termFreq=1.0
>
>         1.2 = parameter k1
>
>         0.75 = parameter b
>
>         4.108818 = avgFieldLength
>
>         7.111111 = fieldLength
>
>
>
> So in actual fact, with score ascending, it is ordering the results by 
> least matching first and the nested document list_price_gbp is 
> irrelevant. I strongly suspect I am being totally dumb and that this 
> is expected behaviour for an obvious reason that escapes me, other 
> than perhaps that the two scoring methods are just plainly 
> incompatible.
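
The explain output above contains only the name:skirt term weight, so sorting by score ascending sorts by that weight alone. A minimal sketch reproducing the number, assuming Lucene's BM25 term-frequency normalisation as printed under tfNorm; the idf is taken as reported rather than recomputed, and the function name is made up for this illustration.

```python
def bm25_tf_norm(freq, k1, b, field_length, avg_field_length):
    """BM25 term-frequency normalisation with document length scaling."""
    length_factor = 1 - b + b * field_length / avg_field_length
    return freq * (k1 + 1) / (freq + k1 * length_factor)

# Inputs copied from the "explain" output above.
tf_norm = bm25_tf_norm(freq=1.0, k1=1.2, b=0.75,
                       field_length=7.111111, avg_field_length=4.108818)

idf = 3.5851278   # idf(docFreq=103, docCount=3731), taken as reported

# The whole document score is this single term weight; no price term appears.
score = idf * tf_norm
```

With a single matching term, the only thing that varies between products is the length of the name field, so longer names score lower and come first under an ascending sort.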
>
>
>
> I have additionally tried just doing a lucene query instead:
>
>
>
> q=+name:skirt +{!parent which=content_type:product score=min} (in_stock:(true)){!func}list_price_gbp
> &doc.q={!terms f="productid" v=$row.id}&doc.rows=1000&doc.fl=score,*&doc.fq=(in_stock:(true))
> &start=0&rows=103&fl=score,*,doc:[subquery]&sort=score asc&debugQuery=on&wt=xml
>
>
>
> The "explain" of this indicates it's scoring products, for which 
> list_price_gbp simply does not exist, as the Function Query always 
> returns zero.
>
>
>
> 4.6243963 = sum of:
>
>   3.624396 = weight(name:skirt in 18113) [], result of:
>
>     3.624396 = score(doc=18113,freq=1.0 = termFreq=1.0
>
> ), product of:
>
>       3.5851278 = idf(docFreq=103, docCount=3731)
>
>       1.0109531 = tfNorm, computed from:
>
>         1.0 = termFreq=1.0
>
>         1.2 = parameter k1
>
>         0.75 = parameter b
>
>         4.108818 = avgFieldLength
>
>         4.0 = fieldLength
>
>   1.0 = {!cache=false}ConstantScore(BitDocIdSetFilterWrapper(QueryBitSetProducer(content_type:product))), product of:
>
>     1.0 = boost
>
>     1.0 = queryNorm
>
>   0.0 = FunctionQuery(float(list_price_gbp)), product of:
>
>     0.0 = float(list_price_gbp)=0.0
>
>     1.0 = boost
>
>     1.0 = queryNorm
>
>
>
> Indeed, if I change the Function Query field to a product scoped 
> field, min_list_price_gbp, like so:
>
>
>
> q=+name:skirt +{!parent which=content_type:product score=min}+(in_stock:(true)){!func}min_list_price_gbp
> &doc.q={!terms f="productid" v=$row.id}&doc.rows=1000&doc.fl=score,*&doc.fq=(in_stock:(true))
> &start=0&rows=103&fl=score,*,doc:[subquery]&sort=score asc&debugQuery=on&wt=xml
>
>
>
> then the "explain" certainly does show the Function Query evaluating to a non-zero value:
>
>
>
> 18.624397 = sum of:
>
>   3.624396 = weight(name:skirt in 17890) [], result of:
>
>     3.624396 = score(doc=17890,freq=1.0 = termFreq=1.0
>
> ), product of:
>
>       3.5851278 = idf(docFreq=103, docCount=3731)
>
>       1.0109531 = tfNorm, computed from:
>
>         1.0 = termFreq=1.0
>
>         1.2 = parameter k1
>
>         0.75 = parameter b
>
>         4.108818 = avgFieldLength
>
>         4.0 = fieldLength
>
>   1.0 = {!cache=false}ConstantScore(BitDocIdSetFilterWrapper(QueryBitSetProducer(content_type:product))), product of:
>
>     1.0 = boost
>
>     1.0 = queryNorm
>
>   14.0 = FunctionQuery(float(min_list_price_gbp)), product of:
>
>     14.0 = float(min_list_price_gbp)=14.0
>
>     1.0 = boost
>
>     1.0 = queryNorm
>
>
>
> My grasp of the syntax is pretty flakey, so I would be immensely 
> grateful if someone could point out if I'm just doing something 
> incredibly dumb. In my head, I see what I am trying to do as
>
>
>
> (some dismax or lucene query on parent document [e.g. "skirt"])
>   => (get a subset of these parent docs based on a block join)
>     => (where the children match a bunch of arbitrary filter queries [e.g. "colour:red"])
>       => (then subquery the child docs that match the same filter queries [e.g. "colour:red"])
>         => (then score this subset of child documents)
>           => (and order by that score)
>
>
>
>
> Is this actually possible? I've been googling about this for a day or 
> so and can't quite find anything definitive. I may try to dive into 
> the Solr source code, but I'm a C# guy, not Java, and I haven't yet 
> needed to set up a debuggable environment, so that could prove 
> pretty painful.
>
>
>
> Any help would be appreciated, even if it is just "can't be done", as 
> at least I could stop chasing my tail.
>
>
>
> Mike


--
Sincerely yours
Mikhail Khludnev
