You can use a separate field for title aliases. That is what I did for Netflix 
search.
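
As a rough sketch of why a separate alias field helps with the norm problem
in the quoted message below: this assumes the classic Lucene length norm of
1/sqrt(number of terms), plain whitespace tokenization with no stopword
removal, and it ignores the lossy encoding of stored norms. The field names
`title` and `title_alias` are made up for illustration.

```python
import math

def length_norm(values):
    """Classic-style length norm over ALL values of one field: 1/sqrt(total terms)."""
    num_terms = sum(len(v.split()) for v in values)
    return 1.0 / math.sqrt(num_terms) if num_terms else 0.0

# Documents 2 and 3 from the quoted question.
doc2_titles = ["The Real Beauty and the Beast"]
doc3_titles = ["Beauty and the Beast", "La bella y la bestia"]

# One multivalued field: document 3 is penalized for the length of its alias.
multivalued = {
    "doc2": length_norm(doc2_titles),  # 6 terms
    "doc3": length_norm(doc3_titles),  # 4 + 5 = 9 terms
}

# Separate fields (hypothetical names): the primary title keeps its own norm.
doc3_split = {
    "title": length_norm(["Beauty and the Beast"]),        # 4 terms
    "title_alias": length_norm(["La bella y la bestia"]),  # 5 terms
}

print(multivalued)
print(doc3_split)
```

Under those assumptions, document 3 gets a lower norm than document 2 in the
single multivalued field, but its primary title alone gets a higher one. A
movie with several aliases would still have a multivalued alias field, but
the primary title is no longer affected by it.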

Why disable idf? Disabling tf for titles can be a good idea. For example, the 
movie “New York, New York” is not twice as much about New York as some other 
film whose title mentions it once.
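
A small sketch of that point, assuming the classic tf formula sqrt(freq) and
a crude lowercase, punctuation-stripping tokenizer. Both are assumptions for
illustration, not your actual analyzer chain, and the second title is just an
example I picked.

```python
import math
import re

def term_freq(term, title):
    """Count occurrences of a term using a crude word tokenizer."""
    tokens = re.findall(r"\w+", title.lower())
    return tokens.count(term.lower())

def classic_tf(freq):
    """Classic TF-IDF style term-frequency factor: sqrt(freq)."""
    return math.sqrt(freq)

def flat_tf(freq):
    """tf disabled: any match counts as 1, no matter how often it repeats."""
    return 1.0 if freq > 0 else 0.0

titles = ["New York, New York", "Escape from New York"]
for title in titles:
    freq = term_freq("york", title)
    print(title, freq, classic_tf(freq), flat_tf(freq))
```

With sqrt(freq), the repeated title gets a higher tf factor for "york". With
tf flattened to 1, both titles are treated the same for that term.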

Also, consider using a popularity score as a boost.
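
One way that might look with edismax is a multiplicative `boost` function
over a numeric field. This is only a sketch: the field names `popularity`,
`title`, and `title_alias`, the weights, and the choice of log(popularity + 10)
are all assumptions that you would tune against your own data.

```python
from urllib.parse import urlencode

def build_title_query(user_query, popularity_field="popularity"):
    """Build URL-encoded edismax parameters with a popularity boost.

    All field names and weights here are hypothetical placeholders.
    """
    params = {
        "defType": "edismax",
        "q": user_query,
        "qf": "title^10 title_alias^8",
        "pf": "title^20 title_alias^15",
        # Multiplicative boost. Solr's log() is base 10, so a popularity
        # of 0 gives log(10) = 1, which leaves the score unchanged.
        "boost": f"log(sum({popularity_field},10))",
    }
    return urlencode(params)

print(build_title_query("Beauty and the Beast"))
```

Adding a constant inside the log keeps the boost at 1 or above for
non-negative popularity values, so unpopular titles are not pushed to zero.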

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)

> On Jan 31, 2018, at 4:38 AM, Sravan Kumar <sra...@caavo.com> wrote:
> 
> Hi,
> We are using Solr for our movie title search.
> 
> 
> As it is "title search", this should be treated differently from normal
> document search.
> Hence, we use a modified version of TFIDFSimilarity with the following
> changes.
> -  disabled TF & IDF, so both always evaluate to 1.
> -  disabled norms by specifying omitNorms as true for all the fields.
> 
> There are 6 fields with different analyzers and we make use of different
> weights in edismax's qf & pf parameters to match tokens & boost phrases.
> 
> But, movies could have aliases and have multiple titles. So, we made the
> fields multivalued.
> 
> Now, consider the following four documents
> 1>  "Beauty and the Beast"
> 2>  "The Real Beauty and the Beast"
> 3>  "Beauty and the Beast", "La bella y la bestia"
> 4>  "Beauty and the Beast"
> 
> Note: Document 3 has two titles in it.
> 
> So, for a query "Beauty and the Beast" and with the above configuration, all
> the documents receive the same score. But documents 1, 3, and 4 should get
> the same score, and document 2 a lower one.
> 
> To solve this, we followed what is suggested in the following thread:
> http://lucene.472066.n3.nabble.com/Influencing-scores-on-values-in-multiValue-fields-td1791651.html
> 
> Now, the fields used for boosting have norms enabled, and the fields used
> for matching have norms disabled. This is to make sure that exact & near
> exact matches are rewarded.
> 
> But, for the same query, we get the following results.
> query: "Beauty & the Beast"
> Search Results:
> 1>  "Beauty and the Beast"
> 4>  "Beauty and the Beast"
> 2>  "The Real Beauty and the Beast"
> 3>  "Beauty and the Beast", "La bella y la bestia"
> 
> Clearly, the changes have solved only a part of the problem. The document 3
> should be ranked/scored higher than document 2.
> 
> This is because Lucene considers the total field length across all the
> values in a multivalued field for normalization.
> 
> How do we handle this scenario and make sure that in multivalued fields the
> normalization is taken care of?
> 
> 
> -- 
> Regards,
> Sravan
