Length normalization in the Similarity class will generally favor
shorter fields. For example, with DefaultSimilarity, the length norm
stored for a two-term field is 0.625; for a three-term field it is 0.5.
The score is multiplied by the norm.
I say "will generally favor" because the length norm value, which is
calculated as
(float)(1.0 / Math.sqrt(numTerms))
is stored in the index as a single byte (instead of four bytes), thus
losing precision. Fields of different lengths can therefore end up with
the same stored norm: a four-term field also gets 0.5, the same as a
three-term field. This works fine for searching larger documents such
as web pages or news articles, but it causes problems when you are
searching only on short fields such as product names or article titles.
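
To make the precision loss concrete, here is a small standalone sketch.
It does not use Lucene; storedNorm() is a simplified model I am assuming
for the one-byte encoding (keep the float's sign, exponent and top two
mantissa bits, ignoring the range clamping the real encoder does). The
class and method names are made up for illustration.

```java
// Simplified model of the one-byte norm storage (an assumption, not Lucene's code).
class NormPrecisionDemo {

    // DefaultSimilarity's length norm before it is squeezed into a byte.
    static float rawNorm(int numTerms) {
        return (float) (1.0 / Math.sqrt(numTerms));
    }

    // Keep the sign, the exponent and the top 2 mantissa bits; drop the other 21.
    // This truncates (rounds toward zero), which is where the precision goes.
    static float storedNorm(float norm) {
        return Float.intBitsToFloat(Float.floatToIntBits(norm) & 0xFFE00000);
    }

    public static void main(String[] args) {
        for (int n = 1; n <= 10; n++) {
            float raw = rawNorm(n);
            System.out.println(n + " terms: raw=" + raw + " stored=" + storedNorm(raw));
        }
    }
}
```

Under this model, two terms store 0.625, three and four terms both store
0.5, and eight, nine and ten terms all store 0.3125, so those fields
cannot be told apart by length at scoring time.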
To solve this, we wrote our own Similarity class which extends
DefaultSimilarity and maps numTerms 1-10 to precalculated values between
1.5f and 0.3125f. For numTerms >10, we use the standard formula above.
If anyone else is interested in this, I can post the code as a patch in
Jira.
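
Until then, here is a rough sketch of the idea, not the actual class.
Only the two endpoints (1.5f and 0.3125f) are the real values; the
eight in between are an assumption, chosen as the consecutive values
that survive the one-byte storage unchanged. The class name is
hypothetical, and the method is written standalone so it runs without
Lucene. In a real Similarity this logic would go in an override of
DefaultSimilarity.lengthNorm(String fieldName, int numTerms).

```java
// Sketch only: a table-driven length norm for short fields.
class ShortFieldNorms {

    // Hypothetical table, index = numTerms - 1. Every entry is distinct and
    // needs only two mantissa bits, so a one-byte encoding should not merge them.
    private static final float[] NORMS = {
        1.5f, 1.25f, 1.0f, 0.875f, 0.75f,
        0.625f, 0.5f, 0.4375f, 0.375f, 0.3125f
    };

    static float lengthNorm(int numTerms) {
        if (numTerms >= 1 && numTerms <= NORMS.length) {
            return NORMS[numTerms - 1];   // short field: use the precalculated value
        }
        // Longer fields fall back to DefaultSimilarity's formula, 1/sqrt(numTerms).
        return (float) (1.0 / Math.sqrt(numTerms));
    }

    public static void main(String[] args) {
        for (int n = 1; n <= 12; n++) {
            System.out.println(n + " terms -> " + lengthNorm(n));
        }
    }
}
```

Because norms are computed at index time, a change like this only takes
effect for documents indexed (or reindexed) with the new Similarity.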
-Sean
Simon Hu wrote:
Hi
I have a text field named prodname in the Solr index. Let's say there are 3
documents in the index, and here are the values of the prodname field:
Doc1: cordless drill
Doc2: cordless drill battery
Doc3: cordless drill charger
Searching for prodname:"cordless drill" will hit all three documents. So
how can I make Doc1 score higher than the other two?
BTW, I am using Solr 1.2.
thanks!
-Simon