Hi,
We are using Solr for our movie title search.

As this is a "title search", it should be treated differently from a normal
document search.
Hence, we use a modified version of TFIDFSimilarity with the following
changes:
-  TF and IDF are disabled; both always return 1.
-  Norms are disabled by setting omitNorms to true for all the fields.

There are 6 fields with different analyzers and we make use of different
weights in edismax's qf & pf parameters to match tokens & boost phrases.

But movies can have aliases and therefore multiple titles, so we made the
fields multivalued.

Now, consider the following four documents:
1>  "Beauty and the Beast"
2>  "The Real Beauty and the Beast"
3>  "Beauty and the Beast", "La bella y la bestia"
4>  "Beauty and the Beast"

Note: Document 3 has two titles in it.

So, for the query "Beauty and the Beast" with the above configuration, all
the documents receive the same score. But documents 1, 3 and 4 should get
the same score, and document 2 should score lower than the others.
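
To illustrate why the scores tie, here is a toy model in Python. It is only a
sketch, not Lucene's scoring code: it assumes a single field, whitespace
tokenization with lowercasing, no stopword removal, and a score that simply
counts matched query terms (standing in for TF = IDF = 1 with norms omitted).
The names `tokens` and `flat_score` are made up for this example.

```python
# Toy model: each matching query term contributes exactly 1 (TF = IDF = 1)
# and no length normalization is applied (omitNorms = true).
docs = {
    1: ["Beauty and the Beast"],
    2: ["The Real Beauty and the Beast"],
    3: ["Beauty and the Beast", "La bella y la bestia"],
    4: ["Beauty and the Beast"],
}

def tokens(values):
    # All values of a multivalued field end up in one token stream.
    return [t for v in values for t in v.lower().split()]

def flat_score(query, values):
    # Count distinct query terms that occur anywhere in the field.
    field_terms = set(tokens(values))
    return sum(1 for term in set(query.lower().split()) if term in field_terms)

scores = {doc_id: flat_score("Beauty and the Beast", values)
          for doc_id, values in docs.items()}
print(scores)  # every document gets the same score
```

With nothing that depends on field length, the extra words in document 2
cannot lower its score.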

To solve this, we followed what is suggested in the following thread:
http://lucene.472066.n3.nabble.com/Influencing-scores-on-values-in-multiValue-fields-td1791651.html

Now the fields used for boosting have norms enabled, while the fields used
for matching still have norms disabled. This is to make sure that exact and
near-exact matches are rewarded.

But, for the same query, we get the following results.
query: "Beauty & the Beast"
Search Results:
1>  "Beauty and the Beast"
4>  "Beauty and the Beast"
2>  "The Real Beauty and the Beast"
3>  "Beauty and the Beast", "La bella y la bestia"

Clearly, the changes have solved only part of the problem: document 3
should be ranked/scored higher than document 2.

This happens because Lucene considers the total field length across all the
values in a multivalued field for normalization.
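
The arithmetic can be sketched as follows. This assumes the classic
1/sqrt(numTerms) length norm of Lucene's default TF-IDF similarity, whitespace
tokenization, and no stopword removal; it ignores the lossy encoding of stored
norms. The function names are made up for the example, and `per_value_norm`
only shows the result we would like, not something Lucene provides.

```python
import math

docs = {
    1: ["Beauty and the Beast"],
    2: ["The Real Beauty and the Beast"],
    3: ["Beauty and the Beast", "La bella y la bestia"],
    4: ["Beauty and the Beast"],
}

def field_norm(values):
    # Observed behaviour: the length is the token count summed over ALL values.
    num_terms = sum(len(v.split()) for v in values)
    return 1.0 / math.sqrt(num_terms)

def per_value_norm(values):
    # Desired behaviour (hypothetical): normalize each value on its own
    # and keep the best one.
    return max(1.0 / math.sqrt(len(v.split())) for v in values)

norms = {d: field_norm(v) for d, v in docs.items()}
wanted = {d: per_value_norm(v) for d, v in docs.items()}

# Docs 1 and 4 have 4 tokens, doc 2 has 5, doc 3 has 4 + 5 = 9.
ranking = sorted(docs, key=lambda d: (-norms[d], d))
print(ranking)  # [1, 4, 2, 3]
```

Under these assumptions document 3 gets a norm of 1/3, below document 2's
1/sqrt(5), which matches the order we see. With a per-value norm, document 3
would tie with documents 1 and 4 at 0.5.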

How do we handle this scenario and make sure that, in multivalued fields,
normalization is based on the matching value rather than on all values
combined?


-- 
Regards,
Sravan
