Yes, that had occurred to me too, but I never saw the original query
from the developer who was having the trouble, only the text and the
strange analysis output. I'll confer with him to make sure there's
actually something to work on here.
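
On the second question in the original post (filtering out apostrophes
before WordDelimiter), one untested possibility is to put a
PatternReplaceFilterFactory just ahead of WordDelimiterFilterFactory. The
fragment below is only a sketch: the pattern is an assumption, it only
strips a possessive at the very end of a token (so "Johnson’s," with
trailing punctuation would slip through), and it has not been checked
against the KStem backport. The WordDelimiter attributes are copied from
the index analyzer in the post.

```xml
<!-- Sketch only: strip a trailing possessive ('s or ’s) so that
     WordDelimiter never sees the apostrophe. &#8217; is the curly
     apostrophe (U+2019) that appears in the sample text. -->
<filter class="solr.PatternReplaceFilterFactory"
        pattern="['&#8217;][sS]$" replacement="" replace="all"/>
<filter class="solr.WordDelimiterFilterFactory" generateWordParts="1"
        generateNumberParts="1" catenateWords="1" catenateNumbers="1"
        catenateAll="0" splitOnCaseChange="1"/>
```

The same filter would presumably have to go into the query analyzer as
well, so that both sides produce the same tokens, and documents would
need to be reindexed afterwards.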

Michael Della Bitta

------------------------------------------------
Appinions | 18 East 41st St., Suite 1806 | New York, NY 10017
www.appinions.com
Where Influence Isn’t a Game


On Tue, Jul 31, 2012 at 1:37 PM, Jack Krupansky <j...@basetechnology.com> wrote:
> I agree that it would make more sense for the catenated word ("johnsons") to
> be at the same position as the leading word ("johnson").
>
> But, what are some example queries that would "fail" given this behavior?
> "johnson and johnson" would not falsely match since you have position
> increment enabled for stop word removal (but would falsely match if you used
> a sloppy phrase query or did not have position increment enabled).
>
> -- Jack Krupansky
>
> -----Original Message----- From: Michael Della Bitta
> Sent: Tuesday, July 31, 2012 12:03 PM
> To: solr-user@lucene.apache.org
> Subject: Word Delimiter issue
>
>
> Hello all,
>
> We're running into a weird issue with Word Delimiter and apostrophes.
> For a text field that uses the out of the box field definition:
>
> <fieldType name="text" class="solr.TextField" positionIncrementGap="100">
> <analyzer type="index">
> <tokenizer class="solr.WhitespaceTokenizerFactory"/>
> <!-- in this example, we will only use synonyms at query time
>        <filter class="solr.SynonymFilterFactory"
> synonyms="index_synonyms.txt" ignoreCase="true" expand="false"/>
>        -->
> <!-- Case insensitive stop word removal.
>          add enablePositionIncrements=true in both the index and query
>          analyzers to leave a 'gap' for more accurate phrase queries.
>        -->
> <filter class="solr.StopFilterFactory" ignoreCase="true"
> words="stopwords.txt" enablePositionIncrements="true"/>
> <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1"
> generateNumberParts="1" catenateWords="1" catenateNumbers="1"
> catenateAll="0" splitOnCaseChange="1"/>
> <filter class="solr.LowerCaseFilterFactory"/>
> <filter class="com.jodange.solr.KStemFilterFactory"/>
> </analyzer>
> <analyzer type="query">
> <tokenizer class="solr.WhitespaceTokenizerFactory"/>
> <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
> ignoreCase="true" expand="true"/>
> <filter class="solr.StopFilterFactory" ignoreCase="true"
> words="stopwords.txt" enablePositionIncrements="true"/>
> <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1"
> generateNumberParts="1" catenateWords="0" catenateNumbers="0"
> catenateAll="0" splitOnCaseChange="1"/>
> <filter class="solr.LowerCaseFilterFactory"/>
> <filter class="com.jodange.solr.KStemFilterFactory"/>
> </analyzer>
> </fieldType>
>
> (note that com.jodange.solr.KStemFilterFactory is a backport of KStem
> for Solr 1.4 we hacked together.)
>
> The phrase "That is not true!” Ms. Johnson’s jaw dropped." generates
> two tokens for 'johnson.' Basically it looks like WordDelimiter is
> splitting on the apostrophe in "Johnson's", emitting the token
> 'johnson' for the left part, and both the tokens 's' and 'johnsons'
> for the right part, and later, stemming takes that down to 'johnson'.
>
> Which is kind of difficult if you're searching for Johnson and Johnson!
>
> Here's an image of the analysis happening:
>
> http://imgur.com/BUuNT
>
> Two questions:
>
> 1. I would have expected the catenated token to show up at the same
> position as the left hand side token, since they begin with the same
> letters. Does that not make sense?
>
> 2. Does it make sense to filter out apostrophes prior to WordDelimiter
> to prevent this from happening, or will that cause other issues?
>
> Thanks,
>
> Michael Della Bitta
