WOW! Thanks Chris - I have read your feedback but I will need to go through it a couple more times to get my head around it :) - thanks for taking the time to help - much appreciated!
Thanks

Stuart
PMP, Business Technical Analyst | CRS Consultant | Corporate IT Digital | McDonald's Corporation
2111 McDonald's Drive | Oak Brook, IL 60523 USA
Office: +1 630.623.5950 | Cell: 301.633.3298 | stuart.scot...@us.mcd.com

-----Original Message-----
From: Chris Hostetter [mailto:hossman_luc...@fucit.org]
Sent: Tuesday, December 01, 2015 3:52 PM
To: solr-user@lucene.apache.org
Subject: RE: Why do documents without the search query term rank highest

: Again, my confusion is why the document 'Home' appears ahead of the
: document 'Big Mac' in the ranking when the query term 'big' only appears
: once in 'Home' but several times in 'Big Mac'?

The key to understanding how documents are scored is in the query structure and the "explain" output.

By default the explain output is a simple string using newlines & whitespace indenting for formatting -- something that got lost when you pasted it into email -- but i've tried to reformat it below based on educated guesses and lots of experience.

(FWIW: adding debug.explain.structured=true will use the xml/json/whatever response format for structure instead of newlines + indenting)

<str name="http://www-a4.staging.mcdonalds.com/us/en/home.html">
0.027089478 = (MATCH) product of:
.0.18962634 = (MATCH) sum of:
..0.18962634 = (MATCH) weight(keywords:big in 78) [DefaultSimilarity]
...0.18962634 = score(doc=78,freq=1.0 = termFreq=1.0 ), product of:
....0.3345638 = queryWeight, product of:
.....5.18205 = idf(docFreq=3, maxDocs=262)
.....0.06456205 = queryNorm
....0.56678677 = fieldWeight in 78, product of:
.....1.0 = tf(freq=1.0), with freq of:
......1.0 = termFreq=1.0
.....5.18205 = idf(docFreq=3, maxDocs=262)
.....0.109375 = fieldNorm(doc=78)
.0.14285715 = coord(1/7)

So what the above tells us is that the top scoring document (home.html) matched a single clause of the query, which was "keywords:big". The *term* "keywords:big" appeared 1 time (freq=1.0) in this document, and is in a total of 3 documents (docFreq).
(note that *term* is key here -- the number of times the *word* big appears in all fields doesn't matter for score calculations, just that it appears in the "keywords" field for a total of 3 documents, and this is one of them)

There were "penalties" to the score for this document based on the "fieldNorm" of the keywords field (which comes from index time document & field boosts, as well as field length at index time) and because it only matched 1/7 of the clauses of the query.

Now let's compare with the second match....

<str name="http://www-a4.staging.mcdonalds.com/us/en/our_story/replacement-to-new-search/BigMac.html">
0.0075755017 = (MATCH) product of:
.0.026514255 = (MATCH) sum of:
..0.0146626085 = (MATCH) weight(description:big in 104) [DefaultSimilarity]
...0.0146626085 = score(doc=104,freq=3.0 = termFreq=3.0 ), product of:
....0.3345638 = queryWeight, product of:
.....5.18205 = idf(docFreq=3, maxDocs=262)
.....0.06456205 = queryNorm
....0.043826047 = fieldWeight in 104, product of:
.....1.7320508 = tf(freq=3.0), with freq of:
......3.0 = termFreq=3.0
.....5.18205 = idf(docFreq=3, maxDocs=262)
.....0.0048828125 = fieldNorm(doc=104)
..0.011851646 = (MATCH) weight(title:big in 104) [DefaultSimilarity]
...0.011851646 = score(doc=104,freq=1.0 = termFreq=1.0 ), product of:
....0.3345638 = queryWeight, product of:
.....5.18205 = idf(docFreq=3, maxDocs=262)
.....0.06456205 = queryNorm
....0.035424173 = fieldWeight in 104, product of:
.....1.0 = tf(freq=1.0), with freq of:
......1.0 = termFreq=1.0
.....5.18205 = idf(docFreq=3, maxDocs=262)
.....0.0068359375 = fieldNorm(doc=104)
.0.2857143 = coord(2/7)

In this case, the document matches two clauses of the query -- "description:big" and "title:big". The term description:big is matched 3 times (termFreq) in this document, and evidently exists in only 3 documents in the index (docFreq) but the fieldNorm is penalizing the overall scores.
Likewise the term title:big is matched 1 time, and exists in only 3 documents in your index -- the fieldNorm is slightly higher (probably due to the shorter length of the title). The overall score of the second doc is penalized for only matching 2 of the 7 clauses.

Based on what i'm seeing here, the biggest surprise i have is the fieldNorm values you are getting -- they don't make sense given the lengths of the fields you showed us in the output unless some index time document (or field) boosts are getting applied -- perhaps intended to "promote" the "home.html" page in your search results? My guess is that some setting in your CMS is doing this? maybe based on "page depth" or something like that?

Based on your configs, I'm guessing you're running Solr 4.2 -- So I tried loading up copies of those 2 documents using the config+schema you provided, and here are the score explanations i got...

**NOTE** Things like the docFreqs (and therefore queryWeight & fieldWeight) are NOT going to be comparable because my index *only* had those two documents ... the key here is to compare the fieldNorms below with the fieldNorms from the same documents in your query...
http://www-a4.staging.mcdonalds.com/us/en/home.html
0.004108005 = (MATCH) product of:
.0.028756034 = (MATCH) sum of:
..0.028756034 = (MATCH) weight(keywords:big in 0) [DefaultSimilarity],
...0.028756034 = score(doc=0,freq=1.0 = termFreq=1.0), product of:
....0.2629123 = queryWeight, product of:
.....1.0 = idf(docFreq=1, maxDocs=2)
.....0.2629123 = queryNorm
....0.109375 = fieldWeight in 0, product of:
.....1.0 = tf(freq=1.0), with freq of:
......1.0 = termFreq=1.0
.....1.0 = idf(docFreq=1, maxDocs=2)
.....0.109375 = fieldNorm(doc=0)
.0.14285715 = coord(1/7)

http://www-a4.staging.mcdonalds.com/us/en/our_story/replacement-to-new-search/BigMac.html
0.07352274 = (MATCH) product of:
.0.25732958 = (MATCH) sum of:
..0.14230545 = (MATCH) weight(description:big in 1) [DefaultSimilarity],
...0.14230545 = score(doc=1,freq=3.0 = termFreq=3.0), product of:
....0.2629123 = queryWeight, product of:
.....1.0 = idf(docFreq=1, maxDocs=2)
.....0.2629123 = queryNorm
....0.54126585 = fieldWeight in 1, product of:
.....1.7320508 = tf(freq=3.0), with freq of:
......3.0 = termFreq=3.0
.....1.0 = idf(docFreq=1, maxDocs=2)
.....0.3125 = fieldNorm(doc=1)
..0.115024135 = (MATCH) weight(title:big in 1) [DefaultSimilarity],
...0.115024135 = score(doc=1,freq=1.0 = termFreq=1.0), product of:
....0.2629123 = queryWeight, product of:
.....1.0 = idf(docFreq=1, maxDocs=2)
.....0.2629123 = queryNorm
....0.4375 = fieldWeight in 1, product of:
.....1.0 = tf(freq=1.0), with freq of:
......1.0 = termFreq=1.0
.....1.0 = idf(docFreq=1, maxDocs=2)
.....0.4375 = fieldNorm(doc=1)
.0.2857143 = coord(2/7)

...the fieldNorm for "home.html" is the same, but the fieldNorm(s) for BigMac.html are much higher.
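To put a rough number on that difference, the fieldNorms from the two indexes can be divided field by field. The sketch below is only an illustration: the values are copied from the explain outputs in this thread, the names `field_norms` and `implied_boost` are made up for the example, and it assumes the two-document test index applied no boosts of its own. Because fieldNorm is stored in a single lossy byte, the ratio is an estimate rather than the exact boost that was sent.

```python
# fieldNorm values copied from the explain outputs in this thread.
# "original" = the index that produced the surprising ranking,
# "test"     = the two-document index (assumed to carry no boosts).
field_norms = {
    "home.html keywords":      {"original": 0.109375,     "test": 0.109375},
    "BigMac.html description": {"original": 0.0048828125, "test": 0.3125},
    "BigMac.html title":       {"original": 0.0068359375, "test": 0.4375},
}

def implied_boost(original, test):
    """Ratio of the two fieldNorms; field length cancels out,
    leaving (approximately) the index-time boost that was applied."""
    return original / test

for name, norms in field_norms.items():
    print(f"{name}: implied boost ~ {implied_boost(**norms):.6f}")
```

Both BigMac.html fields come out at the same ratio (0.015625, i.e. 1/64) while home.html comes out at 1.0, which is what one would expect if a single document-level factor had been applied to BigMac.html at index time.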
The only explanation I have is that your CMS is sending fractional "boost" values at index time for some documents (again -- i speculate it might be based on how "deep" the page is in your site, in an attempt to "promote" higher level pages)

-Hoss
http://www.lucidworks.com/
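For anyone following along, the explain trees in this thread all follow the same arithmetic: each matching clause scores queryWeight * fieldWeight, where queryWeight = idf * queryNorm and fieldWeight = tf * idf * fieldNorm with tf = sqrt(freq), and the sum over clauses is multiplied by coord. The sketch below just redoes that arithmetic with the numbers from the first two explain outputs; the function names `clause_score` and `doc_score` are hypothetical helpers for this example, not Solr or Lucene APIs.

```python
import math

def clause_score(freq, idf, query_norm, field_norm):
    """Score of one matching clause, following the explain tree layout."""
    query_weight = idf * query_norm                     # queryWeight
    field_weight = math.sqrt(freq) * idf * field_norm   # tf * idf * fieldNorm
    return query_weight * field_weight

def doc_score(clauses, matched, total):
    """Sum the clause scores, then apply coord(matched/total)."""
    return sum(clause_score(**c) for c in clauses) * matched / total

# Values copied from the explain output of the original index.
IDF = 5.18205
QUERY_NORM = 0.06456205

home = doc_score(
    [dict(freq=1.0, idf=IDF, query_norm=QUERY_NORM, field_norm=0.109375)],
    matched=1, total=7,
)
big_mac = doc_score(
    [
        dict(freq=3.0, idf=IDF, query_norm=QUERY_NORM, field_norm=0.0048828125),
        dict(freq=1.0, idf=IDF, query_norm=QUERY_NORM, field_norm=0.0068359375),
    ],
    matched=2, total=7,
)
print(home, big_mac)
```

The results should land close to the 0.027089478 and 0.0075755017 reported above; small differences in the last digits are expected because the explain output prints rounded single-precision values. Swapping in the larger fieldNorms from the test index (0.3125 and 0.4375) is enough to flip the ranking, which shows how much of the result hinges on that one factor.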