Hi everybody,

I am migrating from Solr 6.5.1 to Solr 8.6.1 and have run into a couple of
issues I need your help with. Search result ranking changes significantly
between Solr 6 and Solr 8, and I need to fix this before using Solr 8 in
our live environment. I have noticed two differences that could explain
part of the ranking change.

1. omitNorms is not working as expected in Solr 8 with
BM25SimilarityFactory.
2. With LegacyBM25SimilarityFactory, the boost value applied to the 'qf'
field is not what I expect when using edismax.

I tried the Solr examples with the following configuration and can
replicate the difference on Solr 8.6.1.

*Schema being used:*
<field name="manu" type="text_general" indexed="true" stored="true"
*omitNorms="true"*/>

*Solr query:*
http://localhost:8983/solr/solr/select?q=*manu:Samsung*
&debug=true&wt=json&indent=on

*Solr 6 debug output (Note, 0.0 = parameter b (norms omitted for field))*
 "SP2514N":"
2.6390574 = weight(manu:samsung in 1) [SchemaSimilarity], result of:
  2.6390574 = score(doc=1,freq=1.0 = termFreq=1.0
), product of:
    2.6390574 = idf, computed as log(1 + (docCount - docFreq + 0.5) /
(docFreq + 0.5)) from:
      1.0 = docFreq
      20.0 = docCount
    1.0 = tfNorm, computed as (freq * (k1 + 1)) / (freq + k1) from:
      1.0 = termFreq=1.0
      1.2 = parameter k1
      *0.0 = parameter b (norms omitted for field)*
"}

*Solr 8 debug output*
"SP2514N":"
1.5827883 = weight(manu:samsung in 1) [SchemaSimilarity], result of:
  1.5827883 = score(freq=1.0), computed as boost * idf * tf from:
    2.6390574 = idf, computed as log(1 + (N - n + 0.5) / (n + 0.5)) from:
      1 = n, number of documents containing term
      20 = N, total number of documents with field
    0.59975517 = tf, computed as freq / (freq + k1 * (1 - b + b * dl /
avgdl)) from:
      1.0 = freq, occurrences of term within document
      1.2 = k1, term saturation parameter
      *0.75 = b, length normalization parameter*
      *1.0 = dl, length of field*
      *2.45 = avgdl, average length of field*
"}
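
To double-check the numbers, here is a small sketch that recomputes both outputs from the formulas printed in the explain text. The class and method names are made up for illustration; the inputs (n = 1, N = 20, k1 = 1.2, b = 0.75, dl = 1.0, avgdl = 2.45) are copied from the debug output, and avgdl may be rounded there, so I only expect agreement to a few decimal places.

```java
// Recomputes the two explain outputs by hand. Names are illustrative only;
// all inputs are copied from the debug output, nothing is read from Solr.
class Bm25ExplainCheck {
    // idf as printed by both versions: log(1 + (N - n + 0.5) / (n + 0.5))
    static double idf(long n, long N) {
        return Math.log(1.0 + (N - n + 0.5) / (n + 0.5));
    }

    // Solr 6 tfNorm as printed when norms are omitted (b = 0):
    // (freq * (k1 + 1)) / (freq + k1)
    static double tfNormSolr6NoNorms(double freq, double k1) {
        return freq * (k1 + 1.0) / (freq + k1);
    }

    // Solr 8 tf as printed: freq / (freq + k1 * (1 - b + b * dl / avgdl))
    static double tfSolr8(double freq, double k1, double b, double dl, double avgdl) {
        return freq / (freq + k1 * (1.0 - b + b * dl / avgdl));
    }

    public static void main(String[] args) {
        double idf = idf(1, 20);
        double solr6 = idf * tfNormSolr6NoNorms(1.0, 1.2);
        double solr8 = idf * tfSolr8(1.0, 1.2, 0.75, 1.0, 2.45);
        System.out.println("idf         = " + idf);
        System.out.println("Solr 6 score = " + solr6);
        System.out.println("Solr 8 score = " + solr8);
    }
}
```

Both results should land close to the 2.6390574 and 1.5827883 shown in the debug output, so the whole gap comes from the tf term.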

As you can see above, Solr 6 does not apply length normalization to this
field (b = 0.0), which is correct because norms are omitted, while Solr 8
does apply it (b = 0.75). I see the same behavior with
LegacyBM25SimilarityFactory. Secondly, LegacyBM25SimilarityFactory behaves
differently with the *'qf' boost* value for fields when using the edismax
parser, which I also use.

Request handler with Edismax:
<requestHandler name="/search" class="solr.SearchHandler">
<lst name="defaults">
<str name="echoParams">explicit</str>
<str name="wt">json</str>
<str name="indent">off</str>
<int name="rows">10</int>
<str name="defType">edismax</str>
<str name="qf">manu</str>
<str name="mm">100%</str>
<str name="lowercaseOperators">false</str>
</lst>
</requestHandler>

Debug output:
"SP2514N":"
3.4821343 = weight(manu:samsung in 1) [LegacyBM25Similarity], result of:
  3.4821343 = score(freq=1.0), computed as boost * idf * tf from:
    *2.2 = boost*
    2.6390574 = idf, computed as log(1 + (N - n + 0.5) / (n + 0.5)) from:
      1 = n, number of documents containing term
      20 = N, total number of documents with field
    0.59975517 = tf, computed as freq / (freq + k1 * (1 - b + b * dl /
avgdl)) from:
      1.0 = freq, occurrences of term within document
      1.2 = k1, term saturation parameter
      0.75 = b, length normalization parameter
      1.0 = dl, length of field
      2.45 = avgdl, average length of field
"}

On checking the Solr source code, this value of 2.2 = boost matches
1 + k1 (with k1 = 1.2), as per the code below.

return bm25Similarity.scorer(boost * (1 + bm25Similarity.getK1()),
collectionStats, termStats);
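
As a rough check of that arithmetic, the sketch below multiplies out the values from the LegacyBM25Similarity debug output. The class and method names are hypothetical, and the query-time boost of 1.0 is my assumption, since 'qf' is set to manu without any ^boost.

```java
// Hand check of the LegacyBM25Similarity explain output above.
// Names are illustrative; inputs are copied from the debug output, and the
// query-time boost of 1.0 is an assumption (qf has no ^boost on manu).
class LegacyBoostCheck {
    // Boost as passed to the scorer in the quoted line: boost * (1 + k1)
    static double effectiveBoost(double queryBoost, double k1) {
        return queryBoost * (1.0 + k1);
    }

    // tf as printed: freq / (freq + k1 * (1 - b + b * dl / avgdl))
    static double tf(double freq, double k1, double b, double dl, double avgdl) {
        return freq / (freq + k1 * (1.0 - b + b * dl / avgdl));
    }

    // score as printed: boost * idf * tf
    static double score(double queryBoost, double idf, double freq,
                        double k1, double b, double dl, double avgdl) {
        return effectiveBoost(queryBoost, k1) * idf * tf(freq, k1, b, dl, avgdl);
    }

    public static void main(String[] args) {
        double idf = 2.6390574; // copied from the debug output
        System.out.println("boost          = " + effectiveBoost(1.0, 1.2));
        System.out.println("score, b=0.75  = " + score(1.0, idf, 1.0, 1.2, 0.75, 1.0, 2.45));
        // Same product with the length term switched off (b = 0)
        System.out.println("score, b=0     = " + score(1.0, idf, 1.0, 1.2, 0.0, 1.0, 2.45));
    }
}
```

With b = 0.75 the product should come out near the 3.4821343 shown above. With b = 0 the (1 + k1) factor cancels against freq + k1 for freq = 1, and the product equals the idf, which is the Solr 6 score. So at least numerically, the remaining difference from Solr 6 again comes down to b.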

LegacyBM25Similarity is supposed to keep the same scoring as the Solr 6
BM25Similarity, but since it does not behave as I expected, I cannot use it
to test the scoring changes. Kindly help me resolve the above two issues. I
may have misconfigured something, but I have read the Solr 7 and Solr 8
migration notes and cannot see where I am going wrong. Kindly advise.
