issues with WordDelimiterFilter

Steven Fuchs Mon, 19 Dec 2011 19:59:39 -0800

Hello All,
I'm having an issue with the way the WordDelimiterFilter parses compound words. 
My field declaration is simple; the index analyzer looks like this:


      <analyzer type="index">
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.WordDelimiterFilterFactory" preserveOriginal="1"/>
        <filter class="solr.LowerCaseFilterFactory"/>
      </analyzer>

When indexing 'fokker-planck' I do get tokens for all three of fokker, planck, 
and fokker-planck. But the fokker-planck token is then followed by a 'planck' 
token at the next position. The analysis looks like this:


position        1                   2
term text       fokker-planck       planck
                fokker
startOffset     0                   7
                0


So in the case where fokker-planck is the first token there should be no second 
token; it's already been used if the first was matched. The problem manifests 
itself when doing phrase searches...

"Fokker-Planck equations" won't find the exact phrase, Fokker-Planck equations, 
because it sees the term planck as sitting between Fokker-Planck and equations. 
Hope that makes sense! Should I submit this as a bug?

As it stands it would return a true hit (erroneously I believe) on the phrase 
search "fokker planck", so really all 3 tokens should be returned at offset 0 
and there should be no second token so phrase searches are preserved.
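
To make the position problem concrete, here is a minimal Python sketch of 
positional phrase matching. It is only a simplified model, not Lucene's actual 
implementation: phrase_matches is a made-up helper, 'equations' is assumed to 
land at position 3, and the query is assumed to analyze to exactly the terms 
shown.

```python
# Simplified model of positional phrase matching (not Lucene's real code).
# Each indexed token is (term, position); a phrase matches only when its
# terms occur at consecutive positions.

def phrase_matches(index_tokens, phrase_terms):
    # Map every term to the set of positions where it was indexed.
    positions = {}
    for term, pos in index_tokens:
        positions.setdefault(term, set()).add(pos)
    # Try each occurrence of the first term as the start of the phrase.
    starts = positions.get(phrase_terms[0], set())
    return any(
        all(start + i in positions.get(term, set())
            for i, term in enumerate(phrase_terms))
        for start in starts
    )

# Positions from the analysis output above; "equations" at 3 is an assumption.
observed = [("fokker-planck", 1), ("fokker", 1), ("planck", 2), ("equations", 3)]

# Hypothetical layout with all three tokens stacked at the first position.
proposed = [("fokker-planck", 1), ("fokker", 1), ("planck", 1), ("equations", 2)]

print(phrase_matches(observed, ["fokker-planck", "equations"]))  # False
print(phrase_matches(observed, ["fokker", "planck"]))            # True
print(phrase_matches(proposed, ["fokker-planck", "equations"]))  # True
print(phrase_matches(proposed, ["fokker", "planck"]))            # False
```

With the observed positions the hyphenated phrase misses because planck 
occupies position 2; with the stacked layout it matches, and "fokker planck" 
no longer does.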

Thanks in advance
Steven Fuchs
