Re: issues with WordDelimiterFilter

Chris Hostetter Fri, 30 Dec 2011 15:04:44 -0800

: I'm having an issue with the way the WordDelimiterFilter parses compound 
: words. My field declaration is simple, looks like this:
: 
:       <analyzer type="index">
:         <tokenizer class="solr.WhitespaceTokenizerFactory"/>
:         <filter class="solr.WordDelimiterFilterFactory" preserveOriginal="1"/>
:         <filter class="solr.LowerCaseFilterFactory"/>
:       </analyzer>


you haven't said anything about what your query time analyzer looks like 
-- based on your other comments, i'm going to assume it just uses 
whitespaceTokenizer and lower case filter w/o WDF at all -- but if you 
don't have any "query" analyzer declared that means the analyzer above is 
used in both cases (index and query), which is most likely not what you want.

: : When indexing 'fokker-planck' I do get the tokens fokker, 
: : planck, and fokker-planck. But in that case the fokker-planck token 
: : is followed by a 'planck' token. The analysis looks like this.

that is expected - when WDF splits up a token (and keeps the original) it 
puts the first of the split tokens at the same position as the original, 
and each other split token follows in subsequent positions -- positions in 
token streams are simple integer increments, so there is no way to say 
that the split "fokker" and "planck" tokens appear in that sequence *and* 
that they both appear at the same position as the original "fokker-planck".
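
To make the position assignment described above concrete, here is a toy model in Python. It is only a sketch of the behaviour described in this message, not Lucene or Solr code: the function name split_with_original and the rule "split on anything that is not a letter or digit" are assumptions made for illustration, and positions are shown 0-based.

```python
import re


def split_with_original(text):
    """Toy model of WDF with preserveOriginal=1 (hypothetical helper, not Lucene code).

    Returns a list of (term, position) pairs. The original token and the
    first split part share a position; later parts get the next positions.
    """
    tokens = []
    pos = 0
    for word in text.lower().split():  # whitespace tokenizer + lowercase
        parts = [p for p in re.split(r"[^a-z0-9]+", word) if p]
        if len(parts) > 1:
            tokens.append((word, pos))  # preserved original
            for offset, part in enumerate(parts):
                tokens.append((part, pos + offset))  # split parts
            pos += len(parts)
        else:
            tokens.append((word, pos))
            pos += 1
    return tokens


for term, position in split_with_original("Fokker-Planck equations"):
    print(position, term)
# prints:
# 0 fokker-planck
# 0 fokker
# 1 planck
# 2 equations
```

In this model "planck" sits at position 1 and "equations" at position 2, so "equations" is two positions after "fokker-planck" rather than one.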

: So in the case where fokker-planck is the first token there should be no 
: second token, it's already been used if the first was matched. The 

that type of logic (hierarchical sequences of tokens) is just not possible 
with lucene.

: problem manifests itself when doing phrase searches...
: 
: "Fokker-Planck equations" won't find the exact phrase, Fokker-Planck 
: equations, because it sees the term planck as between Fokker-Planck and 
: equations. Hope that makes sense! Should I submit this as a bug?

for phrase queries like this to work when using WDF, it's necessary to 
use some slop in your phrase query (to overcome the position gaps 
introduced by the split out tokens) ... either that, or turn off 
"preserveOriginal" and use a query analyzer that also splits at query time
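
The effect of slop can be sketched with a small Python model of positional phrase matching. This is a simplified assumption, not Lucene's actual scorer: a phrase is taken to match when each query term can be found at a document position such that the shifts (document position minus query position) differ by at most the slop. The helper name phrase_matches and the hand-written index are hypothetical; the positions follow the layout described earlier in this message.

```python
from itertools import product


def phrase_matches(index, phrase_terms, slop=0):
    """Simplified phrase match over a {term: [positions]} index (not Lucene code)."""
    if not phrase_terms:
        return False
    position_lists = [index.get(term, []) for term in phrase_terms]
    # Try every combination of one document position per query term.
    for combo in product(*position_lists):
        shifts = [doc_pos - query_pos for query_pos, doc_pos in enumerate(combo)]
        if max(shifts) - min(shifts) <= slop:
            return True
    return False


# Positions for "fokker-planck equations" indexed with preserveOriginal=1.
index = {"fokker-planck": [0], "fokker": [0], "planck": [1], "equations": [2]}

# Query analyzed without WDF: two terms, the original kept whole.
print(phrase_matches(index, ["fokker-planck", "equations"]))          # False
print(phrase_matches(index, ["fokker-planck", "equations"], slop=1))  # True

# Query that is also split at query time (no preserved original needed).
print(phrase_matches(index, ["fokker", "planck", "equations"]))       # True
```

In Solr query syntax the slop is written after the closing quote, e.g. "fokker-planck equations"~1.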

: As it stands it would return a true hit (erroneously I believe) on the 
: phrase search "fokker planck", so really all 3 tokens should be returned 

Hmmm... if you do *not* want a phrase search for "fokker planck" to match 
documents containing "fokker-planck" then why are you using WDF at all?

: at offset 0 and there should be no second token so phrase searches are 
: preserved.

if all the tokens wound up in the exact same position, then a 
phrase query for "fokker planck" would still match this document (so it 
wouldn't solve your problem) but you would also get matches for things 
like the phrase "planck fokker" -- which is not likely what *anyone* 
would expect.


-Hoss
