Re: Title Search scoring issues with multivalued field & norm

2018-02-04 Thread Sravan Kumar
Using edismax with a different field for each title will affect the final scores if the tie parameter is non-zero. Can we create a separate document for each title? Uniqueness would then be per title rather than per movie_id. In this manner, even while using edismax, the other titles won't affect the scor
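
To make the tie concern concrete, here is a minimal sketch of how a disjunction-max query combines per-field scores: the best field score plus tie times the sum of the remaining ones. The function name and the per-field scores are hypothetical, chosen only for illustration; this is not Solr code.

```python
def dismax_score(field_scores, tie=0.0):
    """Combine per-field scores as a disjunction-max query does:
    the best score plus `tie` times the sum of all the other scores."""
    if not field_scores:
        return 0.0
    best = max(field_scores)
    return best + tie * (sum(field_scores) - best)


# Hypothetical scores for one movie whose titles live in three separate fields.
scores = [2.0, 0.5, 0.25]

print(dismax_score(scores, tie=0.0))  # only the best-matching title counts
print(dismax_score(scores, tie=0.1))  # the other titles now add to the score
```

With tie set to 0 only the best-matching title contributes; any non-zero tie lets the other titles move the final score, which is the effect described above.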

Re: Title Search scoring issues with multivalued field & norm

2018-01-31 Thread Sravan Kumar
@Walter: Perhaps you are right not to consider stemming. Fuzzy search will cover those cases instead, along with the misspellings. In the case of symbols, we want the titles matching the symbols ranked higher than the others. Perhaps we can use this field only for boosting. Certain movies have around 4-6

Re: Title Search scoring issues with multivalued field & norm

2018-01-31 Thread Walter Underwood
I was the first search engineer at Netflix and moved their search from a home-grown engine to Solr. It worked very well with a single title field and aliases. I think your schema is too complicated for movie search. Stemming is not useful. It doesn’t help search and it can hurt. You don’t want

Re: Title Search scoring issues with multivalued field & norm

2018-01-31 Thread Sravan Kumar
@Tim Casey: Yeah... TFIDFSimilarity weighs towards shorter documents. This is done through the fieldNorm component in the class. The issue is when the field is multivalued. Consider a field that has two strings, each of 4 tokens. The fieldNorm from the Lucene TFIDFSimilarity class considers the total su
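
A rough sketch of the arithmetic behind this, assuming the classic length norm of 1/sqrt(number of terms) and assuming the norm of a multivalued field is computed over the token count of all its values together. Index-time boosts and the lossy encoding Lucene applies when storing norms are ignored, and the helper name is hypothetical.

```python
import math


def length_norm(num_terms):
    """Classic TF-IDF length norm: fields with fewer terms get a larger norm."""
    return 1.0 / math.sqrt(num_terms)


# One title of 4 tokens stored in its own single-valued field.
single_value_norm = length_norm(4)

# Two titles of 4 tokens each stored in one multivalued field:
# the norm sees 8 tokens in total, not 4.
multi_value_norm = length_norm(4 + 4)

print(single_value_norm)
print(multi_value_norm)
```

Under these assumptions a 4-token title that matches exactly is still scored as if it were part of an 8-token field, so movies with more aliases are penalized.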

Re: Title Search scoring issues with multivalued field & norm

2018-01-31 Thread Tim Casey
With short documents, TFIDFSimilarity will weight towards the shorter ones. Put another way, if your documents are 5-10 terms long, the 5-term documents are going to win. You might think about having a per-token, or per-token-pair, weight. I would be surprised if there was not something similar out t

Re: Title Search scoring issues with multivalued field & norm

2018-01-31 Thread Sravan Kumar
@Walter: We have 6 fields declared in schema.xml for title, each with a different type of analyzer: one without processing symbols, another stemmed, another removing symbols, etc. So, if we have separate fields for each alias, it will be that many times the number of final fields declared in schema

Re: Title Search scoring issues with multivalued field & norm

2018-01-31 Thread Erick Erickson
Or use a boost for the phrase, something like "beauty and the beast"^5

On Wed, Jan 31, 2018 at 8:43 AM, Walter Underwood wrote:
> You can use a separate field for title aliases. That is what I did for
> Netflix search.
>
> Why disable idf? Disabling tf for titles can be a good idea, for example

Re: Title Search scoring issues with multivalued field & norm

2018-01-31 Thread Walter Underwood
You can use a separate field for title aliases. That is what I did for Netflix search. Why disable idf? Disabling tf for titles can be a good idea, for example the movie “New York, New York” is not twice as much about New York as some other film that just lists it once. Also, consider using a
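
The "New York, New York" point can be illustrated with a small sketch, assuming the classic tf factor of sqrt(term frequency) and comparing it with a binary tf that only records whether the term occurs. Both helper names and the tokenized titles are hypothetical.

```python
import math


def classic_tf(freq):
    """Classic TF-IDF term-frequency factor: grows with repeated terms."""
    return math.sqrt(freq)


def binary_tf(freq):
    """tf disabled: a term either occurs in the field or it does not."""
    return 1.0 if freq > 0 else 0.0


# Hypothetical tokenized titles.
repeated = ["new", "york", "new", "york"]
other = ["escape", "from", "new", "york"]

for title in (repeated, other):
    freq = title.count("york")
    print(title, classic_tf(freq), binary_tf(freq))
```

With the classic factor the repeated title scores higher for "york" than the title that mentions it once; with tf disabled both get the same term weight.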