Hello All, I'm having an issue with the way the WordDelimiterFilter parses compound words. My field declaration is simple, looks like this:

  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.WordDelimiterFilterFactory" preserveOriginal="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>

When indexing 'fokker-planck' I do get tokens for fokker, planck, and fokker-planck. But the fokker-planck token is followed by a 'planck' token at the next position. The analysis looks like this:

  position       1               2
  term text      fokker-planck   planck
                 fokker
  startOffset    0               7
                 0

So when fokker-planck is the first token there should be no second token; it has already been used if the first was matched.

The problem manifests itself when doing phrase searches. "Fokker-Planck equations" won't find the exact phrase, Fokker-Planck equations, because it sees the term planck as sitting between Fokker-Planck and equations. Hope that makes sense!

Should I submit this as a bug? As it stands it would return a true hit (erroneously, I believe) on the phrase search "fokker planck", so really all 3 tokens should be returned at offset 0 and there should be no second token, so that phrase searches are preserved.

Thanks in advance,
Steven Fuchs
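
P.S. To make the position problem concrete, here is a small toy model in Python. It is not Solr or Lucene code, just a sketch of positional phrase matching under some assumptions: `phrase_matches` is a hypothetical helper, 'equations' is assumed to land at the position right after 'planck', the query side is assumed to look for 'fokker-planck' as a single term, and "all 3 tokens at offset 0" is read as "all three tokens share the first position".

```python
# Toy model of positional phrase matching; NOT Solr/Lucene code.
# Each token is a (term, position) pair, as in the analysis output above.

def phrase_matches(tokens, phrase):
    """Return True if the terms of `phrase` occur at consecutive positions."""
    positions = {}
    for term, pos in tokens:
        positions.setdefault(term, set()).add(pos)
    # Try every position of the first term as a possible start of the phrase.
    starts = positions.get(phrase[0], set())
    return any(
        all(start + i in positions.get(term, set())
            for i, term in enumerate(phrase))
        for start in starts
    )

# Positions as described in the post (position of 'equations' is assumed).
current = [("fokker-planck", 1), ("fokker", 1), ("planck", 2), ("equations", 3)]

# Positions as proposed in the post: all three tokens at the first position.
proposed = [("fokker-planck", 1), ("fokker", 1), ("planck", 1), ("equations", 2)]

for name, tokens in (("current", current), ("proposed", proposed)):
    print(name,
          phrase_matches(tokens, ["fokker-planck", "equations"]),
          phrase_matches(tokens, ["fokker", "planck"]))
```

In this model the current layout misses "fokker-planck equations" but matches "fokker planck", while the proposed layout does the opposite, which is the behaviour I would expect.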