On 6/22/2018 9:29 AM, Prathyusha Kondeti wrote:
> when I search for java using below query
>
> curl
> http://localhost:8983/solr/test/select?fl=score,id&q=(java)&wt=json&sort=score
>  desc
>
> I am expecting the content with *Id :2* should come first as it contains
> more matches related to java.But solr is giving inconsistent results.
>
> Please suggest why I am not able to get desired results.

Solr relies on Lucene for score calculations.

Years of effort have gone into tuning the Lucene code that calculates
scores.  It is almost certain that the score is working as designed, but
the design does not fit your expectations.

Lucene's score calculation (which defaults to the BM25 similarity in
Solr 6.x and later) takes term frequency (TF) into account, but that is
not the whole story.  Another part of the calculation is inverse
document frequency (IDF).  BM25 is more complicated than just those two
factors, but they are large influences on the final score.

One effect of the calculation is that the score is reduced when the
document is large -- a term showing up in a short document probably
means that it's more relevant there.  (In BM25 this comes from
normalizing by field length, alongside TF and IDF.)
The actual calculation is certainly a lot more complex than what I'm
going to describe, but the simple idea below illustrates what is
probably happening:

For the doc with id 1, there are two terms, and the search for java
matches one of them - it's half of the document, which makes it pretty
important for that document.  For the doc with id 2, the search term
appears three times, but there are nine terms total, so the term only
contributes a third of that document.  For id 3, the importance is also
about one third.  This means that id 1 probably outscores both id 2 and
id 3 for a search term of "java".
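
As a rough sketch of that simplified idea (this is not what Lucene
actually computes), the share of each document taken up by the search
term can be worked out like this.  The counts for id 1 and id 2 come
from the description above; id 3 is left out because its exact counts
are not stated.  The names docs and term_share are made up for the
illustration:

```python
# Simplified illustration only: Lucene's real scoring is more involved.
# Counts for id 1 and id 2 are taken from the description above.
docs = {
    "1": {"matches": 1, "total_terms": 2},
    "2": {"matches": 3, "total_terms": 9},
}


def term_share(matches, total_terms):
    """Fraction of the document's terms that match the search term."""
    return matches / total_terms


shares = {doc_id: term_share(**counts) for doc_id, counts in docs.items()}

# Highest share first.
ranked = sorted(shares, key=shares.get, reverse=True)

for doc_id in ranked:
    print(f"id {doc_id}: share = {shares[doc_id]:.3f}")
```

By this crude measure id 1 (one half) comes out ahead of id 2 (one
third), even though id 2 has more matches in absolute terms.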

Here's a detailed article about TF and IDF.  Older versions of Solr
(before 6.x) used this kind of calculation:

https://en.wikipedia.org/wiki/Tf%E2%80%93idf

Here's an article about BM25, default in 6.0 and later.  This relevance
calculation does work a lot like TF-IDF, but aims to produce even better
ranking with a more complex mathematical model:

https://en.wikipedia.org/wiki/Okapi_BM25
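
For a feel of how the pieces interact, here is a minimal sketch of the
single-term BM25 formula as given in that article, using the usual
defaults k1=1.2 and b=0.75.  Lucene's implementation differs in some
details, so treat the numbers as illustrative only.  The term counts for
id 1 (1 match in 2 terms) and id 2 (3 matches in 9 terms) are from the
example above; the three-document index and the average lengths tried
are assumptions, and bm25_term_score is a made-up name:

```python
import math


def bm25_term_score(tf, doc_len, avg_doc_len, n_docs, docs_with_term,
                    k1=1.2, b=0.75):
    """Textbook BM25 score for one term in one document."""
    idf = math.log(1 + (n_docs - docs_with_term + 0.5) / (docs_with_term + 0.5))
    # Length normalization: longer-than-average documents are penalized.
    norm = 1 - b + b * (doc_len / avg_doc_len)
    return idf * (tf * (k1 + 1)) / (tf + k1 * norm)


def score_pair(avg_doc_len):
    """Scores for id 1 and id 2, assuming 3 docs that all contain the term."""
    id1 = bm25_term_score(tf=1, doc_len=2, avg_doc_len=avg_doc_len,
                          n_docs=3, docs_with_term=3)
    id2 = bm25_term_score(tf=3, doc_len=9, avg_doc_len=avg_doc_len,
                          n_docs=3, docs_with_term=3)
    return id1, id2


# Two hypothetical average field lengths, to show the sensitivity.
for avg in (3.0, 9.0):
    id1, id2 = score_pair(avg)
    print(f"avg length {avg}: id 1 = {id1:.4f}, id 2 = {id2:.4f}")
```

With the shorter assumed average length id 1 comes out ahead, and with
the longer one id 2 does, so the order depends on the other documents in
the index as well as on the two being compared.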

Thanks,
Shawn
