RE: Why do documents without the search query term rank highest

Scotten Stuart Tue, 01 Dec 2015 14:22:59 -0800

WOW!

Thanks Chris - I have read your feedback but I will need to go through it a 
couple more times to get my head around it :) - thanks for taking the time to 
help - much appreciated!




Thanks
Stuart
PMP, Business Technical Analyst | CRS Consultant | Corporate IT Digital | 
McDonald's Corporation
2111 McDonald's Drive | Oak Brook, IL 60523 USA
Office: +1 630.623.5950 | Cell: 301.633.3298 | stuart.scot...@us.mcd.com




-----Original Message-----
From: Chris Hostetter [mailto:hossman_luc...@fucit.org]
Sent: Tuesday, December 01, 2015 3:52 PM
To: solr-user@lucene.apache.org
Subject: RE: Why do documents without the search query term rank highest


: Again, my confusion is why the document 'Home' appears ahead of the
: document 'Big Mac' in the ranking when the query term 'big' only appears
: once in 'Home' but several times in 'Big Mac'?

The key to understanding how documents are scored is in the query structure and 
the "explain" output.

By default the explain output is a simple string using newlines & whitespace 
indenting for formatting -- something that got lost when you pasted it into 
email -- but i've tried to reformat it below based on educated guesses and lots 
of experience. (FWIW: adding debug.explain.structured=true will use the 
xml/json/whatever response format for structure instead of newlines + indenting)
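
As a rough illustration of the parameter mentioned above, here is a minimal Python sketch that builds a request URL asking for structured explain output. The host, port, and core name ("localhost:8983", "collection1") and the helper name explain_url are placeholders, not details from this thread; debugQuery, wt, and debug.explain.structured are the only Solr parameters used.

```python
from urllib.parse import urlencode


def explain_url(base_url, query, structured=True):
    """Build a Solr select URL that asks for score explanations.

    base_url is a hypothetical endpoint such as
    http://localhost:8983/solr/collection1/select
    """
    params = {
        "q": query,
        "debugQuery": "true",  # include the explain section in the response
        "wt": "json",          # any response writer works; json is an example
    }
    if structured:
        # Return the explanation as nested response data instead of one
        # whitespace-indented string.
        params["debug.explain.structured"] = "true"
    return base_url + "?" + urlencode(params)


if __name__ == "__main__":
    print(explain_url("http://localhost:8983/solr/collection1/select", "big"))
```

With the structured form, the nesting survives copy and paste into email because it no longer depends on indentation.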

<str name="http://www-a4.staging.mcdonalds.com/us/en/home.html">

0.027089478 = (MATCH) product of:
.0.18962634 = (MATCH) sum of:
..0.18962634 = (MATCH) weight(keywords:big in 78) [DefaultSimilarity]
...0.18962634 = score(doc=78,freq=1.0 = termFreq=1.0 ), product of:
....0.3345638 = queryWeight, product of:
.....5.18205 = idf(docFreq=3, maxDocs=262)
.....0.06456205 = queryNorm
....0.56678677 = fieldWeight in 78, product of:
.....1.0 = tf(freq=1.0), with freq of: 1.0 = termFreq=1.0
.....5.18205 = idf(docFreq=3, maxDocs=262)
.....0.109375 = fieldNorm(doc=78)
.0.14285715 = coord(1/7)

So what the above tells us, is that the top scoring document (home.html) 
matched a single clause of the query which was "keywords:big".  The *term* 
"keywords:big" appeared 1 time (freq=1.0) in this document, and is in a total 
of 3 documents (docFreq).

(note that *term* is key here -- the number of times the *word* big appears in 
all fields doesn't matter for score calculations, just that it appears in the 
"keywords" field for a total of 3 documents, and this is one of them)

There were "penalties" to the score for this document based on the "fieldNorm" 
of the keywords field (which comes from index time document & field boosts, as 
well as field length at index time) and because it only matched 1/7 of the 
clauses of the query.
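
To make the arithmetic concrete, the following Python sketch recomputes the home.html score from the raw inputs in the explain output (freq, docFreq, maxDocs, queryNorm, fieldNorm). It assumes the classic TF-IDF formulas that DefaultSimilarity uses, tf = sqrt(freq) and idf = 1 + ln(maxDocs / (docFreq + 1)); the function names are made up for illustration.

```python
import math


def tf(freq):
    # Term frequency factor: square root of the raw frequency.
    return math.sqrt(freq)


def idf(doc_freq, max_docs):
    # Inverse document frequency: rarer terms get a larger value.
    return 1.0 + math.log(max_docs / (doc_freq + 1))


def clause_score(freq, doc_freq, max_docs, query_norm, field_norm):
    """Score of one matching clause: queryWeight * fieldWeight."""
    i = idf(doc_freq, max_docs)
    query_weight = i * query_norm
    field_weight = tf(freq) * i * field_norm
    return query_weight * field_weight


if __name__ == "__main__":
    # Values copied from the explain output for home.html (keywords:big).
    raw = clause_score(freq=1.0, doc_freq=3, max_docs=262,
                       query_norm=0.06456205, field_norm=0.109375)
    coord = 1 / 7  # only 1 of the 7 query clauses matched
    print("idf          =", idf(3, 262))
    print("clause score =", raw)
    print("final score  =", raw * coord)
```

The results should agree with the explain output (5.18205, 0.18962634, 0.027089478) up to floating-point rounding.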

Now let's compare with the second match....

<str name="http://www-a4.staging.mcdonalds.com/us/en/our_story/replacement-to-new-search/BigMac.html">

0.0075755017 = (MATCH) product of:
.0.026514255 = (MATCH) sum of:
..0.0146626085 = (MATCH) weight(description:big in 104) [DefaultSimilarity]
...0.0146626085 = score(doc=104,freq=3.0 = termFreq=3.0 ), product of:
....0.3345638 = queryWeight, product of:
.....5.18205 = idf(docFreq=3, maxDocs=262)
.....0.06456205 = queryNorm
....0.043826047 = fieldWeight in 104, product of:
.....1.7320508 = tf(freq=3.0), with freq of: 3.0 = termFreq=3.0
.....5.18205 = idf(docFreq=3, maxDocs=262)
.....0.0048828125 = fieldNorm(doc=104)
..0.011851646 = (MATCH) weight(title:big in 104) [DefaultSimilarity]
...0.011851646 = score(doc=104,freq=1.0 = termFreq=1.0 ), product of:
....0.3345638 = queryWeight, product of:
.....5.18205 = idf(docFreq=3, maxDocs=262)
.....0.06456205 = queryNorm
....0.035424173 = fieldWeight in 104, product of:
.....1.0 = tf(freq=1.0), with freq of: 1.0 = termFreq=1.0
.....5.18205 = idf(docFreq=3, maxDocs=262)
.....0.0068359375 = fieldNorm(doc=104)
.0.2857143 = coord(2/7)

In this case, the document matches two clauses of the query -- 
"description:big" and "title:big".  The term description:big is matched 3 times 
(termFreq) in this document, and evidently exists in only 3 documents in the 
index (docFreq) but the fieldNorm is penalizing the overall scores.  Likewise 
the term title:big is matched 1 time, and exists in only 3 documents in your 
index -- the fieldNorm is slightly higher (probably due to the shorter length 
of the title).  The overall score of the second doc is penalized for only 
matching 2 of the 7 clauses.
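
A side-by-side sketch in Python may help show why home.html still wins. It plugs the factors from the two explain outputs above into the same sum-then-coord structure; the helper name doc_score is hypothetical, and every number is copied from the explain output rather than computed from the index.

```python
def doc_score(clauses, matched, total):
    """Sum the clause scores, then apply coord(matched/total).

    Each clause is (query_weight, tf, idf, field_norm), so a clause
    contributes query_weight * (tf * idf * field_norm).
    """
    raw = sum(qw * tf * idf * norm for qw, tf, idf, norm in clauses)
    return raw * matched / total


QUERY_WEIGHT = 0.3345638  # same for every clause in this query
IDF = 5.18205             # docFreq=3, maxDocs=262 for all three terms

home = doc_score([(QUERY_WEIGHT, 1.0, IDF, 0.109375)], matched=1, total=7)
big_mac = doc_score(
    [
        (QUERY_WEIGHT, 1.7320508, IDF, 0.0048828125),  # description:big
        (QUERY_WEIGHT, 1.0, IDF, 0.0068359375),        # title:big
    ],
    matched=2,
    total=7,
)

if __name__ == "__main__":
    print("home.html   :", home)
    print("BigMac.html :", big_mac)
    # The keywords fieldNorm for home.html is 22.4 times the description
    # fieldNorm for BigMac.html.
    print("fieldNorm ratio:", 0.109375 / 0.0048828125)
```

BigMac.html gains roughly 1.7x from tf on the description clause and 2x from coord, but those gains are small next to a fieldNorm that is more than twenty times lower.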

Based on what i'm seeing here, the biggest surprise i have is the fieldNorm 
values you are getting -- they don't make sense given the lengths of the fields 
you showed us in the output unless some index time document (or field) boosts 
are getting applied -- perhaps intended to "promote" the "home.html" page in 
your search results?  My guess is that some setting in your CMS is doing this?  
maybe based on "page depth" or something like that?

Based on your configs, I'm guessing you're running Solr 4.2 -- So I tried 
loading up copies of those 2 documents using the config+schema you provided, 
and here are the score explanations i got...

**NOTE** Things like the docFreqs (and therefore queryWeight & fieldWeight) 
are NOT going to be comparable because my index *only* had those two 
documents ... the key here is to compare the fieldNorms below with the 
fieldNorms for the same documents in your explain output...


http://www-a4.staging.mcdonalds.com/us/en/home.html
0.004108005 = (MATCH) product of:
.0.028756034 = (MATCH) sum of:
..0.028756034 = (MATCH) weight(keywords:big in 0) [DefaultSimilarity],
...0.028756034 = score(doc=0,freq=1.0 = termFreq=1.0), product of:
....0.2629123 = queryWeight, product of:
.....1.0 = idf(docFreq=1, maxDocs=2)
.....0.2629123 = queryNorm
....0.109375 = fieldWeight in 0, product of:
.....1.0 = tf(freq=1.0), with freq of: 1.0 = termFreq=1.0
.....1.0 = idf(docFreq=1, maxDocs=2)
.....0.109375 = fieldNorm(doc=0)
.0.14285715 = coord(1/7)


http://www-a4.staging.mcdonalds.com/us/en/our_story/replacement-to-new-search/BigMac.html
0.07352274 = (MATCH) product of:
.0.25732958 = (MATCH) sum of:
..0.14230545 = (MATCH) weight(description:big in 1) [DefaultSimilarity],
...0.14230545 = score(doc=1,freq=3.0 = termFreq=3.0), product of:
....0.2629123 = queryWeight, product of:
.....1.0 = idf(docFreq=1, maxDocs=2)
.....0.2629123 = queryNorm
....0.54126585 = fieldWeight in 1, product of:
.....1.7320508 = tf(freq=3.0), with freq of: 3.0 = termFreq=3.0
.....1.0 = idf(docFreq=1, maxDocs=2)
.....0.3125 = fieldNorm(doc=1)
..0.115024135 = (MATCH) weight(title:big in 1) [DefaultSimilarity],
...0.115024135 = score(doc=1,freq=1.0 = termFreq=1.0), product of:
....0.2629123 = queryWeight, product of:
.....1.0 = idf(docFreq=1, maxDocs=2)
.....0.2629123 = queryNorm
....0.4375 = fieldWeight in 1, product of:
.....1.0 = tf(freq=1.0), with freq of: 1.0 = termFreq=1.0
.....1.0 = idf(docFreq=1, maxDocs=2)
.....0.4375 = fieldNorm(doc=1)
.0.2857143 = coord(2/7)


...the fieldNorm for "home.html" is the same, but the fieldNorm(s) for 
BigMac.html are much higher.  The only explanation I have is that your CMS is 
sending fractional "boost" values at index time for some documents (again -- i 
speculate it might be based on how "deep" the page is in your site, in an 
attempt to "promote" higher level pages).
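
One way to put a rough number on that hypothesis: in DefaultSimilarity the stored fieldNorm is the index-time boost multiplied by 1/sqrt(number of terms in the field), so dividing a fieldNorm from the original index by the fieldNorm of the same field in the unboosted test index gives an estimate of the boost. The Python sketch below does only that division on the values shown above. It assumes both indexes analyzed the fields into the same number of terms, and because norms are stored in a single low-precision byte, the result is an approximation rather than the exact boost the CMS sent.

```python
# fieldNorm values from the original index (first pair of explain outputs).
original = {
    "home.html keywords": 0.109375,
    "BigMac.html description": 0.0048828125,
    "BigMac.html title": 0.0068359375,
}

# fieldNorm values from the two-document test index (no boosts applied).
unboosted = {
    "home.html keywords": 0.109375,
    "BigMac.html description": 0.3125,
    "BigMac.html title": 0.4375,
}


def implied_boosts(boosted, baseline):
    """Estimate the index-time boost per field as boosted / baseline."""
    return {name: boosted[name] / baseline[name] for name in boosted}


if __name__ == "__main__":
    for name, boost in implied_boosts(original, unboosted).items():
        print(f"{name}: implied boost ~ {boost}")
```

Both BigMac.html fields come out at the same ratio (0.015625, i.e. 1/64) while home.html comes out at 1.0, which is consistent with a single document-level boost being applied to BigMac.html and none to home.html.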



-Hoss
http://www.lucidworks.com/

