Hello All,
I'm having an issue with the way the WordDelimiterFilter parses compound words.
My field declaration is simple, looks like this:
<analyzer type="index">
  <tokenizer class="solr.WhitespaceTokenizerFactory"/>
  <filter class="solr.WordDelimiterFilterFactory" preserveOriginal="1"/>
  <filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
When indexing 'fokker-planck' I do get tokens for fokker, planck, and
fokker-planck. But the fokker-planck token is followed by a separate
'planck' token at the next position. The analysis looks like this:
position      1               2
term text     fokker-planck   planck
              fokker
startOffset   0               7
              0
So in the case where fokker-planck is the first token, there should be no second
token; it has already been used if the first was matched. The problem manifests
itself when doing phrase searches:
"Fokker-Planck equations" won't find the exact phrase Fokker-Planck equations,
because it sees the term planck as sitting between Fokker-Planck and equations.
Hope that makes sense! Should I submit this as a bug?
As it stands, it would also return a hit (erroneously, I believe) on the phrase
search "fokker planck". So really all 3 tokens should be returned at position 1
(start offset 0), and there should be no second token, so that phrase searches
are preserved.
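
To make the phrase-search argument concrete, here is a toy Python model of
position-based phrase matching. It is not Lucene's or Solr's actual code, only a
sketch under two assumptions: a phrase matches when its terms sit at consecutive
positions, and the query terms are used exactly as given. The function names and
the position of 'equations' are made up for illustration; the analysis output
above only shows the first two positions.

```python
def build_index(tokens):
    """Map each position to the set of terms stored at that position."""
    index = {}
    for term, position in tokens:
        index.setdefault(position, set()).add(term)
    return index


def phrase_matches(index, phrase_terms):
    """Return True if the terms occur in order at consecutive positions."""
    for start in index:
        if all(term in index.get(start + i, set())
               for i, term in enumerate(phrase_terms)):
            return True
    return False


# Layout as reported by the analysis page ('equations' at 3 is assumed).
observed = build_index([
    ("fokker-planck", 1), ("fokker", 1), ("planck", 2), ("equations", 3),
])

# Layout suggested in this post: all three tokens share the first position.
proposed = build_index([
    ("fokker-planck", 1), ("fokker", 1), ("planck", 1), ("equations", 2),
])

print(phrase_matches(observed, ["fokker-planck", "equations"]))  # False
print(phrase_matches(observed, ["fokker", "planck"]))            # True
print(phrase_matches(proposed, ["fokker-planck", "equations"]))  # True
print(phrase_matches(proposed, ["fokker", "planck"]))            # False
```

Under these assumptions, the observed layout misses the exact phrase and hits on
"fokker planck", while the suggested layout does the opposite.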
Thanks in advance
Steven Fuchs