On 6/22/2018 9:29 AM, Prathyusha Kondeti wrote:
> when I search for java using below query
>
> curl
> http://localhost:8983/solr/test/select?fl=score,id&q=(java)&wt=json&sort=score
> desc
>
> I am expecting the content with *Id :2* should come first as it contains
> more matches related to java.But solr is giving inconsistent results.
>
> Please suggest why I am not able to get desired results.

Solr relies on Lucene for score calculations. Years of effort have gone into tuning the Lucene code that calculates scores. It is almost certain that the score is working as designed, but the design does not fit your expectations.

Lucene's score calculation (which defaults to the BM25 similarity in Solr 6.x and later) takes term frequency (TF) into account, but that is not the whole story. Another part of the calculation is inverse document frequency (IDF), which lowers the weight of terms that appear in many documents. BM25 is more complicated than just those two factors, but they are large influences on the final score.

The calculation also takes the length of the document into account: a match in a large document scores lower, because the term showing up in a short document probably means that it's more relevant there.

The actual calculation is certainly a lot more complex than what I'm going to describe, but the simple idea below illustrates what is probably happening:

For the doc with id 1, there are two terms, and the search for java matches one of them. That is half of the document, which makes the term pretty important for that document. For the doc with id 2, the search term appears three times, but there are nine terms total, so the term makes up only a third of that document. For id 3, the importance is also about one third. This means that id 1 probably outscores both id 2 and id 3 for a search term of "java".

Here's a detailed article about TF and IDF. Older versions of Solr (before 6.x) used this kind of calculation:

https://en.wikipedia.org/wiki/Tf%E2%80%93idf

Here's an article about BM25, the default in 6.0 and later. This relevance calculation works a lot like TF-IDF, but aims to produce even better ranking with a more complex mathematical model:

https://en.wikipedia.org/wiki/Okapi_BM25

Thanks,
Shawn
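
P.S. The simple idea described above can be sketched in a few lines of Python. This is only an illustration of the simplified reasoning, not Lucene's actual code. The term counts for ids 1 and 2 are the ones mentioned above; id 3 is left out because its exact counts are not stated. The function names are made up for this example. The second function uses the square-root TF and length normalization of classic TF-IDF and ignores IDF, which is the same for every document when the query has a single term. BM25 additionally depends on the average document length in the index and on its tuning parameters, so the real scores will differ from these numbers.

```python
import math

# (occurrences of "java", total terms), as described in the message above.
# Doc 3 is omitted because its exact counts are not given.
docs = {"1": (1, 2), "2": (3, 9)}


def term_share(freq, length):
    """Fraction of the document made up of the search term (the simple idea)."""
    return freq / length


def classic_tf_norm(freq, length):
    """sqrt(freq) / sqrt(length): TF and length norm in the classic TF-IDF style.

    IDF is left out because it is identical for every document
    when the query contains only one term.
    """
    return math.sqrt(freq) / math.sqrt(length)


# Rank the documents by the simplified measure, highest first.
ranked = sorted(docs, key=lambda d: term_share(*docs[d]), reverse=True)

for doc_id in ranked:
    freq, length = docs[doc_id]
    print(doc_id,
          round(term_share(freq, length), 3),
          round(classic_tf_norm(freq, length), 3))
# prints:
# 1 0.5 0.707
# 2 0.333 0.577
```

To see how Solr actually arrived at each score, adding debugQuery=true to the request returns a per-document explanation of the calculation.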