Thanks for the reply.

On Dec 30, 2011, at 6:04 PM, Chris Hostetter wrote:

>
> : I'm having an issue with the way the WordDelimiterFilter parses compound
> : words. My field declaration is simple, looks like this:
> :
> : <analyzer type="index">
> :   <tokenizer class="solr.WhitespaceTokenizerFactory"/>
> :   <filter class="solr.WordDelimiterFilterFactory" preserveOriginal="1"/>
> :   <filter class="solr.LowerCaseFilterFactory"/>
> : </analyzer>
>
> you haven't said anything about what your query time analyzer looks like
> -- based on your other comments, i'm going to assume it just uses
> whitespaceTokenizer and lower case filter w/o WDF at all -- but if you
> don't have any "query" analyzer declared that means the analyzer above is
> used in both case, which is most likely not what you want.

Yes, you are correct about my query analyzer being just whitespace and lower case. I had only omitted it for clarity.

> : So in the case where fokker-plank is the first token there should be no
> : second token, its already been used if the first was matched. The
>
> that type of logic (hierarchical sequences of tokens) is just not possible
> with lucene.

OK, so if I understand correctly, this is an issue, but one that can't be worked around...

> : problem manifests itself when doing phrase searches...
> :
> : "Fokker-Plank equations" won't find the exact phrase, Fokker-Plank
> : equations, because its sees the term planck as between Fokker-Plank and
> : equations. Hope that makes sense! Should I submit this as a bug?
>
> for phrase queries like this to work when using WDF, it's necessary to
> use some slop in your phrase query (to overcome the position gaps
> introduced by the split out tokens) ... either that, or turn off
> "preserveOriginal" and use a query analyzer that also splits at query time

It seems like preserveOriginal isn't the best option here, since it introduces an extra term into the indexed version of the text. I don't really want the query to be split, as fokker-planck shouldn't find a lone planck. I may need a separate field that isn't word-delimited and use an OR of the two as my result.

>
> : As it stands it would return a true hit (erroneously I believe) on the
> : phrase search "fokker planck", so really all 3 tokens should be returned
>
> Hmmm... if you do *not* want a phrase search for "fokker planck" to match
> documents containing "fokker-planck" then why are you using WDF at all?

In the case where quotes are used, I do want an exact phrase search done. So I want fokker, planck, fokker planck, or "fokker-planck" to match a document that contains the term fokker-planck, but not "fokker planck".

>
> : at offset 0 and there should be no second token so phrase searches are
> : preserved.
>
> if all the tokens wound up in the exact same position, then a
> phrase query for "fokker planck" would still match this document (so it
> wouldn't solve your problem) but you would also get matches for things
> like the phrase "planck fokker" -- which is not likely what *anyone*
> would expect.
>
>
> -Hoss

Thanks so much for your time. Very helpful!

steve
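
P.S. For the archives, here is a rough sketch of the separate-field idea I mentioned: one field analyzed with WDF for loose matching, plus a copy that is only whitespace-tokenized and lowercased for exact phrase matching. The field and type names (body, body_exact, text_wdf, text_exact) are made up for illustration and are not from my actual schema; the factory classes are the same ones used in my analyzer above. Treat it as an untested starting point rather than a known-good configuration.

```xml
<!-- Hypothetical names: text_wdf, text_exact, body, body_exact -->

<!-- Splits compound words at index time; the query side does not split. -->
<fieldType name="text_wdf" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.WordDelimiterFilterFactory" preserveOriginal="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

<!-- No WDF at all, so "fokker-planck" stays a single token with no extra positions. -->
<fieldType name="text_exact" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

<field name="body" type="text_wdf" indexed="true" stored="true"/>
<field name="body_exact" type="text_exact" indexed="true" stored="false"/>

<!-- Index the same source text into both fields. -->
<copyField source="body" dest="body_exact"/>
```

With something like this, a quoted phrase could be sent to the exact field and bare terms to the WDF field, for example body_exact:"fokker-planck equations" OR body:(fokker planck). Deciding which part of the user's input goes to which field would have to happen in the application or in the request handler setup. If I stayed with a single WDF field instead, Hoss's slop suggestion would look something like body:"fokker-planck equations"~1, where the slop value is a guess that would need checking on the analysis page.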