It looks like this duplicate token is not filtered out because WordDelimiter generates it after StopFilter has already run.
In any case, a search on "david david" against this field does find documents with values like "David's" as well as "David, David, David..."

Michael Della Bitta

------------------------------------------------
Appinions | 18 East 41st St., Suite 1806 | New York, NY 10017
www.appinions.com
Where Influence Isn’t a Game


On Tue, Jul 31, 2012 at 2:07 PM, Michael Della Bitta <michael.della.bi...@appinions.com> wrote:
> Yes, that had occurred to me too, but I wasn't exposed to the original
> query from the developer who was having the trouble, just the text and
> strange analysis. I'll confer with him to make sure there's actually
> something to work on here.
>
> Michael Della Bitta
>
>
> On Tue, Jul 31, 2012 at 1:37 PM, Jack Krupansky <j...@basetechnology.com>
> wrote:
>> I agree that it would make more sense for the catenated word ("johnsons") to
>> be at the same position as the leading word ("johnson").
>>
>> But, what are some example queries that would "fail" given this behavior?
>> "johnson and johnson" would not falsely match since you have position
>> increment enabled for stop word removal (but would falsely match if you used
>> a sloppy phrase query or did not have position increment enabled).
>>
>> -- Jack Krupansky
>>
>> -----Original Message----- From: Michael Della Bitta
>> Sent: Tuesday, July 31, 2012 12:03 PM
>> To: solr-user@lucene.apache.org
>> Subject: Word Delimiter issue
>>
>>
>> Hello all,
>>
>> We're running into a weird issue with Word Delimiter and apostrophes.
>> For a text field that uses the out of the box field definition:
>>
>> <fieldType name="text" class="solr.TextField" positionIncrementGap="100">
>>   <analyzer type="index">
>>     <tokenizer class="solr.WhitespaceTokenizerFactory"/>
>>     <!-- in this example, we will only use synonyms at query time
>>     <filter class="solr.SynonymFilterFactory"
>>             synonyms="index_synonyms.txt" ignoreCase="true" expand="false"/>
>>     -->
>>     <!-- Case insensitive stop word removal.
>>          add enablePositionIncrements=true in both the index and query
>>          analyzers to leave a 'gap' for more accurate phrase queries.
>>     -->
>>     <filter class="solr.StopFilterFactory" ignoreCase="true"
>>             words="stopwords.txt" enablePositionIncrements="true"/>
>>     <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1"
>>             generateNumberParts="1" catenateWords="1" catenateNumbers="1"
>>             catenateAll="0" splitOnCaseChange="1"/>
>>     <filter class="solr.LowerCaseFilterFactory"/>
>>     <filter class="com.jodange.solr.KStemFilterFactory"/>
>>   </analyzer>
>>   <analyzer type="query">
>>     <tokenizer class="solr.WhitespaceTokenizerFactory"/>
>>     <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
>>             ignoreCase="true" expand="true"/>
>>     <filter class="solr.StopFilterFactory" ignoreCase="true"
>>             words="stopwords.txt" enablePositionIncrements="true"/>
>>     <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1"
>>             generateNumberParts="1" catenateWords="0" catenateNumbers="0"
>>             catenateAll="0" splitOnCaseChange="1"/>
>>     <filter class="solr.LowerCaseFilterFactory"/>
>>     <filter class="com.jodange.solr.KStemFilterFactory"/>
>>   </analyzer>
>> </fieldType>
>>
>> (note that com.jodange.solr.KStemFilterFactory is a backport of KStem
>> for Solr 1.4 we hacked together.)
>>
>> The phrase "That is not true!” Ms. Johnson’s jaw dropped." generates
>> two tokens for 'johnson.'
>> Basically it looks like WordDelimiter is
>> splitting on the apostrophe in "Johnson's", emitting the token
>> 'johnson' for the left part, and both the tokens 's' and 'johnsons'
>> for the right part, and later, stemming takes that down to 'johnson'.
>>
>> Which is kind of difficult if you're searching for Johnson and Johnson!
>>
>> Here's an image of the analysis happening:
>>
>> http://imgur.com/BUuNT
>>
>> Two questions:
>>
>> 1. I would have expected the catenated token to show up at the same
>> position as the left hand side token, since they begin with the same
>> letters. Does that not make sense?
>>
>> 2. Does it make sense to filter out apostrophes prior to WordDelimiter
>> to prevent this from happening, or will that cause other issues?
>>
>> Thanks,
>>
>> Michael Della Bitta
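
On question 2 in the original message, one option is to drop the possessive suffix before WordDelimiter ever sees it. The fragment below is a minimal sketch, not a tested fix. It assumes the stock solr.PatternReplaceFilterFactory is available in the Solr version in use, and the regex is an assumption about the input: it covers both the ASCII apostrophe and the typographic one (U+2019) that appears in the sample sentence. It would need to be added to both the index and query analyzers, immediately before the existing WordDelimiterFilterFactory line.

```xml
<!-- Sketch only: strip a trailing possessive ('s or ’s) from each token
     so that WordDelimiter has no apostrophe left to split on.
     The pattern is an assumption; adjust it to the actual input.
     Place this immediately before solr.WordDelimiterFilterFactory. -->
<filter class="solr.PatternReplaceFilterFactory"
        pattern="['\u2019][sS]$" replacement="" replace="all"/>
```

Because of the `$` anchor, a token with trailing punctuation (for example "David's,") would not match, so this may not catch every case. Checking the result on the analysis page before reindexing seems worthwhile.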