We have the following two fields for our movie title search:
- title without symbols
a custom analyser with WordDelimiterFilterFactory, SynonymFilterFactory and
other filters, so that only alphanumeric characters are retained.
- title with word bigrams
a custom analyser with solr.ShingleFilterFactory that generates word bigram
tokens with '_' as the separator.
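
To make the expected output of the bigram field concrete, here is a minimal
Python sketch of what we mean by word bigrams joined with '_'. It is not the
Solr analyser itself: the function name word_bigrams is made up for
illustration, and it assumes plain whitespace tokenisation, lowercasing, and
that unigrams are not emitted alongside the bigrams.

```python
def word_bigrams(text, separator="_"):
    """Join each pair of adjacent words with the separator.

    Illustrative only: assumes whitespace tokenisation and lowercasing,
    and emits bigrams without the single-word tokens.
    """
    words = text.lower().split()
    return [separator.join(pair) for pair in zip(words, words[1:])]


print(word_bigrams("Pursuit of Happyness"))
```

A single-word input yields no bigrams at all in this sketch, which is relevant
to the first issue described in this mail.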

A custom similarity class is used to set both the tf and idf values to 1.

The edismax query parser is used to perform all searches. Phrase boosting
(pf) is also used.

There are a couple of issues while searching:
1>  The bigram field doesn't generate bigrams if the whitespace in the query
is not escaped.
- For example, if the query is "pursuit of happyness", then bigrams are
not generated. This seems to be because the edismax query parser splits the
query on whitespace before passing each piece to the analyser (correct me
if I am wrong).
But in the case of "pursuit\ of\ happyness", bigrams are generated, because
the string passed to the analyser still contains the whitespace.
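
The behaviour we think we are seeing can be sketched in Python as follows.
This is only a model of our assumption, not of the actual edismax code: the
names word_bigrams and bigrams_seen_by_analyser are made up, and the splitting
rule (split on whitespace unless it is preceded by a backslash) is our guess
at what the parser does.

```python
import re


def word_bigrams(text, separator="_"):
    """Stand-in for the shingle filter: join adjacent words with '_'."""
    words = text.split()
    return [separator.join(pair) for pair in zip(words, words[1:])]


def bigrams_seen_by_analyser(query):
    """Model of the assumed behaviour: the parser splits the query on
    unescaped whitespace and analyses every chunk on its own."""
    # Assumption: whitespace preceded by a backslash does not split the query.
    chunks = re.split(r"(?<!\\)\s+", query.strip())
    bigrams = []
    for chunk in chunks:
        # The analyser receives the chunk with the escapes removed.
        text = chunk.replace("\\ ", " ")
        bigrams.extend(word_bigrams(text))
    return bigrams


print(bigrams_seen_by_analyser("pursuit of happyness"))
print(bigrams_seen_by_analyser(r"pursuit\ of\ happyness"))
```

In this model the unescaped query reaches the shingle step one word at a
time, so there is never a neighbouring word to pair with.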

2>  Fuzzy search doesn't work in whitespace-escaped queries.
Ex: "pursuit~2\ of\ happiness~1"

3> Edismax's phrase boosting doesn't work the way it should in fuzzy queries
where the whitespace is not escaped.

If the query is "pursuit~2 of happiness~1" (without escaping whitespace),
the fuzzy queries (title_name:pursuit~2) and (title_name:happiness~1) are
generated in the parsed query.
But the edismax pf (phrase boost) generates a query like
title_name:"pursuit (2 pursuit2) of happiness (1 happiness1)"
This means that, for phrase boosting, the analyser received the original
query text with the fuzzy operators still in it.
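
The shape of that phrase query is what one would expect if each raw token,
fuzzy operator included, went through a word-delimiter style split. The
Python sketch below only imitates that effect: split_like_word_delimiter is a
made-up name, and the rule it applies (split into letter runs and digit runs,
then also emit the parts joined together) is our assumption about the filter
configuration, not the real WordDelimiterFilterFactory logic.

```python
import re


def split_like_word_delimiter(token):
    """Rough imitation of a word-delimiter split with catenation.

    Assumption: the token is cut into runs of letters and runs of digits,
    and when there is more than one part the joined form is emitted too.
    """
    parts = re.findall(r"[A-Za-z]+|[0-9]+", token)
    if len(parts) > 1:
        parts.append("".join(parts))
    return parts


for token in "pursuit~2 of happiness~1".split():
    print(token, "->", split_like_word_delimiter(token))
```

Under these assumptions "pursuit~2" becomes pursuit, 2 and pursuit2, which
lines up with the (2 pursuit2) group in the parsed phrase query.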


Questions:
1> How should whitespace be handled with filters like
solr.ShingleFilterFactory so that bigrams are generated?
2> If generating bigrams requires the whitespace to be escaped and fuzzy
searches require it not to be, how do we accommodate both in a single Solr
request so that they are scored together?



-- 
Regards,
Sravan
