I agree that it would make more sense for the catenated word ("johnsons") to be at the same position as the leading word ("johnson").

But what are some example queries that would "fail" given this behavior? The phrase query "johnson and johnson" would not falsely match, since you have position increments enabled for stop word removal (it would falsely match if you used a sloppy phrase query or did not have position increments enabled).
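
To make the position argument concrete, here is a rough Python sketch of phrase matching over token positions. It is a simplified model, not Lucene's actual phrase scorer: `phrase_matches`, the `doc_index` layout, and the position numbers 5 and 6 are all made up for illustration. It assumes that after stemming the document holds 'johnson' at two adjacent positions, as described in the original message.

```python
from itertools import product


def phrase_matches(doc_index, query, slop=0):
    """Simplified phrase match (an illustrative model, not Lucene's scorer).

    doc_index: dict mapping term -> set of positions in the document.
    query:     list of (term, query_position) pairs.
    A combination of document positions matches when every term is shifted
    from its query position by the same amount, give or take `slop`.
    """
    if not query:
        return False
    candidates = [sorted(doc_index.get(term, ())) for term, _ in query]
    for chosen in product(*candidates):
        # One indexed position cannot satisfy two query slots.
        if len(set(chosen)) < len(chosen):
            continue
        shifts = [doc_pos - q_pos for doc_pos, (_, q_pos) in zip(chosen, query)]
        if max(shifts) - min(shifts) <= slop:
            return True
    return False


# Assumed index-time result: 'johnson' (left part) at position 5, and the
# stemmed catenation 'johnsons' -> 'johnson' next to 's' at position 6.
doc_index = {"johnson": {5, 6}, "s": {6}}

# "johnson and johnson" with position increments: 'and' leaves a gap.
with_gap = [("johnson", 0), ("johnson", 2)]
# Same query without position increments: the two terms become adjacent.
no_gap = [("johnson", 0), ("johnson", 1)]

exact_with_gap = phrase_matches(doc_index, with_gap)           # no false match
sloppy_with_gap = phrase_matches(doc_index, with_gap, slop=1)  # false match
exact_no_gap = phrase_matches(doc_index, no_gap)               # false match
```

Under these assumptions, only the exact phrase query with the stop word gap preserved avoids matching the single "Johnson’s" in the document.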

-- Jack Krupansky

-----Original Message-----
From: Michael Della Bitta
Sent: Tuesday, July 31, 2012 12:03 PM
To: solr-user@lucene.apache.org
Subject: Word Delimiter issue

Hello all,

We're running into a weird issue with Word Delimiter and apostrophes.
For a text field that uses the out of the box field definition:

<fieldType name="text" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <!-- in this example, we will only use synonyms at query time
    <filter class="solr.SynonymFilterFactory"
            synonyms="index_synonyms.txt" ignoreCase="true" expand="false"/>
    -->
    <!-- Case insensitive stop word removal.
         add enablePositionIncrements=true in both the index and query
         analyzers to leave a 'gap' for more accurate phrase queries.
    -->
    <filter class="solr.StopFilterFactory" ignoreCase="true"
            words="stopwords.txt" enablePositionIncrements="true"/>
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1"
            generateNumberParts="1" catenateWords="1" catenateNumbers="1"
            catenateAll="0" splitOnCaseChange="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="com.jodange.solr.KStemFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
            ignoreCase="true" expand="true"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true"
            words="stopwords.txt" enablePositionIncrements="true"/>
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1"
            generateNumberParts="1" catenateWords="0" catenateNumbers="0"
            catenateAll="0" splitOnCaseChange="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="com.jodange.solr.KStemFilterFactory"/>
  </analyzer>
</fieldType>

(note that com.jodange.solr.KStemFilterFactory is a backport of KStem
for Solr 1.4 we hacked together.)

The phrase "That is not true!” Ms. Johnson’s jaw dropped." generates
two tokens for 'johnson.' Basically it looks like WordDelimiter is
splitting on the apostrophe in "Johnson's", emitting the token
'johnson' for the left part, and both the tokens 's' and 'johnsons'
for the right part, and later, stemming takes that down to 'johnson'.

Which is kind of difficult if you're searching for Johnson and Johnson!

Here's an image of the analysis happening:

http://imgur.com/BUuNT

Two questions:

1. I would have expected the catenated token to show up at the same
position as the left hand side token, since they begin with the same
letters. Does that not make sense?

2. Does it make sense to filter out apostrophes prior to WordDelimiter
to prevent this from happening, or will that cause other issues?

Thanks,

Michael Della Bitta

------------------------------------------------
Appinions | 18 East 41st St., Suite 1806 | New York, NY 10017
www.appinions.com
Where Influence Isn’t a Game
