Greetings all,
I'm having trouble tracking down why a particular query is not
working. A user is trying to do a search for
alternate_form_title_text:"three films by louis malle" specifically to
find the 4 records that contain the phrase "Three films by Louis Malle"
in their alternate_form_title_text field.
However, the search returns 0 records.
The modified searches:
alternate_form_title_text:"three films by louis malle"~1
or
alternate_form_title_text:"three films" AND
alternate_form_title_text:"louis malle"
both return the 4 records. So it seems that the word "by", which is
listed in the stopword filter list, is causing the problem.
The analyzer/filter sequence for indexing the alternate_form_title_text
field is _almost_ exactly the same as the sequence for querying that field.
For indexing, the sequence is:
org.apache.solr.analysis.HTMLStripWhitespaceTokenizerFactory {}
schema.UnicodeNormalizationFilterFactory {composed=false,
remove_modifiers=true, fold=true, version=icu4j, remove_diacritics=true}
schema.CJKFilterFactory {bigrams=false}
org.apache.solr.analysis.StopFilterFactory {words=stopwords.txt,
ignoreCase=true, enablePositionIncrements=true}
org.apache.solr.analysis.WordDelimiterFilterFactory {generateNumberParts=1,
catenateWords=1, generateWordParts=1, catenateAll=0, catenateNumbers=1}
org.apache.solr.analysis.LowerCaseFilterFactory {}
org.apache.solr.analysis.EnglishPorterFilterFactory {protected=protwords.txt}
org.apache.solr.analysis.RemoveDuplicatesTokenFilterFactory {}
For querying, the sequence is:
org.apache.solr.analysis.WhitespaceTokenizerFactory {}
schema.UnicodeNormalizationFilterFactory {composed=false,
remove_modifiers=true, fold=true, version=icu4j, remove_diacritics=true}
schema.CJKFilterFactory {bigrams=false}
org.apache.solr.analysis.SynonymFilterFactory {synonyms=synonyms.txt,
expand=true, ignoreCase=true}
org.apache.solr.analysis.StopFilterFactory {words=stopwords.txt,
ignoreCase=true, enablePositionIncrements=true}
org.apache.solr.analysis.WordDelimiterFilterFactory {generateNumberParts=1,
catenateWords=0, generateWordParts=1, catenateAll=0, catenateNumbers=0}
org.apache.solr.analysis.LowerCaseFilterFactory {}
org.apache.solr.analysis.EnglishPorterFilterFactory {protected=protwords.txt}
org.apache.solr.analysis.RemoveDuplicatesTokenFilterFactory {}
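
To make the "almost" concrete, here is a minimal sketch that diffs the two chains listed above. The factory names and parameters are transcribed from the listings (package prefixes dropped); the helper name chain_differences is hypothetical and exists only for this illustration.

```python
# Analyzer chains transcribed from the two listings above.
# Each entry is (factory name without package prefix, parameters).
UNICODE_PARAMS = {"composed": "false", "remove_modifiers": "true", "fold": "true",
                  "version": "icu4j", "remove_diacritics": "true"}
STOP_PARAMS = {"words": "stopwords.txt", "ignoreCase": "true",
               "enablePositionIncrements": "true"}

INDEX_CHAIN = [
    ("HTMLStripWhitespaceTokenizerFactory", {}),
    ("UnicodeNormalizationFilterFactory", UNICODE_PARAMS),
    ("CJKFilterFactory", {"bigrams": "false"}),
    ("StopFilterFactory", STOP_PARAMS),
    ("WordDelimiterFilterFactory", {"generateNumberParts": 1, "catenateWords": 1,
                                    "generateWordParts": 1, "catenateAll": 0,
                                    "catenateNumbers": 1}),
    ("LowerCaseFilterFactory", {}),
    ("EnglishPorterFilterFactory", {"protected": "protwords.txt"}),
    ("RemoveDuplicatesTokenFilterFactory", {}),
]

QUERY_CHAIN = [
    ("WhitespaceTokenizerFactory", {}),
    ("UnicodeNormalizationFilterFactory", UNICODE_PARAMS),
    ("CJKFilterFactory", {"bigrams": "false"}),
    ("SynonymFilterFactory", {"synonyms": "synonyms.txt", "expand": "true",
                              "ignoreCase": "true"}),
    ("StopFilterFactory", STOP_PARAMS),
    ("WordDelimiterFilterFactory", {"generateNumberParts": 1, "catenateWords": 0,
                                    "generateWordParts": 1, "catenateAll": 0,
                                    "catenateNumbers": 0}),
    ("LowerCaseFilterFactory", {}),
    ("EnglishPorterFilterFactory", {"protected": "protwords.txt"}),
    ("RemoveDuplicatesTokenFilterFactory", {}),
]


def chain_differences(index_chain, query_chain):
    """Return (factories only at index time, factories only at query time,
    per-factory parameter differences as {param: (index value, query value)})."""
    index_params = dict(index_chain)
    query_params = dict(query_chain)
    only_index = [name for name, _ in index_chain if name not in query_params]
    only_query = [name for name, _ in query_chain if name not in index_params]
    param_diffs = {}
    for name, params in index_chain:
        other = query_params.get(name)
        if other is not None and other != params:
            keys = sorted(set(params) | set(other))
            param_diffs[name] = {k: (params.get(k), other.get(k))
                                 for k in keys if params.get(k) != other.get(k)}
    return only_index, only_query, param_diffs


if __name__ == "__main__":
    only_index, only_query, param_diffs = chain_differences(INDEX_CHAIN, QUERY_CHAIN)
    print("index only:", only_index)
    print("query only:", only_query)
    print("parameter differences:", param_diffs)
```

The differences are the tokenizer (HTMLStripWhitespaceTokenizerFactory versus WhitespaceTokenizerFactory), the query-only SynonymFilterFactory, and the catenateWords and catenateNumbers settings of WordDelimiterFilterFactory. The StopFilterFactory configuration is identical on both sides.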
If I run a test through the field analysis admin page, submitting the
string "three films by louis malle" as both the Field value (Index)
and the Field value (Query), the results (shown below) seem to indicate
that the query ought to find the 4 records in question, but it does not,
and I'm at a loss to explain why.
Index Analyzer
term position      1      2      4      5
term text          three  film   loui   mall
term type          word   word   word   word
source start,end   0,5    6,11   15,20  21,26

Query Analyzer
term position      1      2      4      5
term text          three  film   loui   mall
term type          word   word   word   word
source start,end   0,5    6,11   15,20  21,26
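
The positions above (1 2 4 5, with a hole at 3 where "by" was removed) can be reproduced with a toy model. This is a sketch only: it is not Lucene or Solr code, it ignores stemming and the other filters, it assumes "by" is the only stop word that matters here, and the function names are hypothetical. The phrase check is a simplified stand-in for sloppy phrase matching: it measures how far the indexed positions drift from the query positions and compares that drift with the slop.

```python
STOPWORDS = {"by"}  # assumption: the only stop word relevant to this query


def positions_after_stop_filter(tokens, stopwords, keep_gaps=True):
    """Drop stop words and return (token, 1-based position) pairs.

    keep_gaps=True models enablePositionIncrements=true: a removed token
    still consumes a position, leaving a hole. keep_gaps=False models the
    case where the remaining tokens are numbered consecutively.
    """
    result = []
    position = 0
    for token in tokens:
        if token.lower() in stopwords:
            if keep_gaps:
                position += 1  # the removed token still takes up a slot
            continue
        position += 1
        result.append((token, position))
    return result


def phrase_matches(indexed, query, slop=0):
    """Toy phrase check on (term, position) pairs.

    For each query term, compute indexed position minus query position.
    The phrase matches when the spread of those shifts is at most `slop`.
    Assumes each term occurs once in the indexed field.
    """
    index_pos = dict(indexed)
    shifts = []
    for term, query_position in query:
        if term not in index_pos:
            return False
        shifts.append(index_pos[term] - query_position)
    return max(shifts) - min(shifts) <= slop


if __name__ == "__main__":
    tokens = "three films by louis malle".split()
    indexed = positions_after_stop_filter(tokens, STOPWORDS, keep_gaps=True)
    query_with_gap = positions_after_stop_filter(tokens, STOPWORDS, keep_gaps=True)
    query_no_gap = positions_after_stop_filter(tokens, STOPWORDS, keep_gaps=False)

    print(indexed)
    print("gap kept, slop 0:", phrase_matches(indexed, query_with_gap, slop=0))
    print("gap lost, slop 0:", phrase_matches(indexed, query_no_gap, slop=0))
    print("gap lost, slop 1:", phrase_matches(indexed, query_no_gap, slop=1))
```

In this model, identical positions on both sides (what the analysis page reports) match as an exact phrase. If the hole were lost on the query side at search time, the exact phrase would miss and a slop of 1 would match, which is the same pattern as the symptoms described above. That is only a hypothesis to check, not a confirmed diagnosis.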