It looks like this duplicate token isn't filtered out because
WordDelimiter generates it after StopFilter has already run.

In any case, a search on "david david" against this field does find
documents with values like "David's" as well as "David, David,
David..."
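
On question 2 from the original message (stripping apostrophes before
WordDelimiter), here is a rough, untested sketch of what that might look
like in the index analyzer. It assumes solr.PatternReplaceFilterFactory
is available in this Solr version, and the regex is only an illustrative
guess, not something from the stock schema. It only handles a possessive
at the very end of a token, so "David's," with a trailing comma would
slip through.

```xml
<!-- Sketch only: index analyzer with an assumed possessive-stripping
     step placed before WordDelimiter. Untested against Solr 1.4. -->
<analyzer type="index">
  <tokenizer class="solr.WhitespaceTokenizerFactory"/>
  <filter class="solr.StopFilterFactory" ignoreCase="true"
          words="stopwords.txt" enablePositionIncrements="true"/>
  <!-- Assumed addition: drop a trailing 's written with either a
       straight (') or curly (’) apostrophe, so WordDelimiter never
       sees the apostrophe and has nothing to split or catenate. -->
  <filter class="solr.PatternReplaceFilterFactory"
          pattern="['’]s$" replacement="" replace="all"/>
  <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1"
          generateNumberParts="1" catenateWords="1" catenateNumbers="1"
          catenateAll="0" splitOnCaseChange="1"/>
  <filter class="solr.LowerCaseFilterFactory"/>
  <filter class="com.jodange.solr.KStemFilterFactory"/>
</analyzer>
```

The same filter would presumably need to go into the query analyzer too,
so that both sides treat possessives the same way.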

Michael Della Bitta

------------------------------------------------
Appinions | 18 East 41st St., Suite 1806 | New York, NY 10017
www.appinions.com
Where Influence Isn’t a Game


On Tue, Jul 31, 2012 at 2:07 PM, Michael Della Bitta
<michael.della.bi...@appinions.com> wrote:
> Yes, that had occurred to me too, but I wasn't exposed to the original
> query from the developer who was having the trouble, just the text and
> strange analysis. I'll confer with him to make sure there's actually
> something to work on here.
>
> Michael Della Bitta
>
> On Tue, Jul 31, 2012 at 1:37 PM, Jack Krupansky <j...@basetechnology.com> wrote:
>> I agree that it would make more sense for the catenated word ("johnsons") to
>> be at the same position as the leading word ("johnson").
>>
>> But, what are some example queries that would "fail" given this behavior?
>> "johnson and johnson" would not falsely match since you have position
>> increment enabled for stop word removal (but would falsely match if you used
>> a sloppy phrase query or did not have position increment enabled).
>>
>> -- Jack Krupansky
>>
>> -----Original Message----- From: Michael Della Bitta
>> Sent: Tuesday, July 31, 2012 12:03 PM
>> To: solr-user@lucene.apache.org
>> Subject: Word Delimiter issue
>>
>>
>> Hello all,
>>
>> We're running into a weird issue with Word Delimiter and apostrophes.
>> For a text field that uses the out of the box field definition:
>>
>> <fieldType name="text" class="solr.TextField" positionIncrementGap="100">
>>   <analyzer type="index">
>>     <tokenizer class="solr.WhitespaceTokenizerFactory"/>
>>     <!-- in this example, we will only use synonyms at query time
>>     <filter class="solr.SynonymFilterFactory"
>>             synonyms="index_synonyms.txt" ignoreCase="true" expand="false"/>
>>     -->
>>     <!-- Case insensitive stop word removal.
>>          add enablePositionIncrements=true in both the index and query
>>          analyzers to leave a 'gap' for more accurate phrase queries.
>>     -->
>>     <filter class="solr.StopFilterFactory" ignoreCase="true"
>>             words="stopwords.txt" enablePositionIncrements="true"/>
>>     <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1"
>>             generateNumberParts="1" catenateWords="1" catenateNumbers="1"
>>             catenateAll="0" splitOnCaseChange="1"/>
>>     <filter class="solr.LowerCaseFilterFactory"/>
>>     <filter class="com.jodange.solr.KStemFilterFactory"/>
>>   </analyzer>
>>   <analyzer type="query">
>>     <tokenizer class="solr.WhitespaceTokenizerFactory"/>
>>     <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
>>             ignoreCase="true" expand="true"/>
>>     <filter class="solr.StopFilterFactory" ignoreCase="true"
>>             words="stopwords.txt" enablePositionIncrements="true"/>
>>     <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1"
>>             generateNumberParts="1" catenateWords="0" catenateNumbers="0"
>>             catenateAll="0" splitOnCaseChange="1"/>
>>     <filter class="solr.LowerCaseFilterFactory"/>
>>     <filter class="com.jodange.solr.KStemFilterFactory"/>
>>   </analyzer>
>> </fieldType>
>>
>> (note that com.jodange.solr.KStemFilterFactory is a backport of KStem
>> for Solr 1.4 we hacked together.)
>>
>> The phrase "That is not true!” Ms. Johnson’s jaw dropped." generates
>> two tokens for 'johnson.' Basically it looks like WordDelimiter is
>> splitting on the apostrophe in "Johnson's", emitting the token
>> 'johnson' for the left part, and both the tokens 's' and 'johnsons'
>> for the right part, and later, stemming takes that down to 'johnson'.
>>
>> Which is kind of difficult if you're searching for Johnson and Johnson!
>>
>> Here's an image of the analysis:
>>
>> http://imgur.com/BUuNT
>>
>> Two questions:
>>
>> 1. I would have expected the catenated token to show up at the same
>> position as the left hand side token, since they begin with the same
>> letters. Does that not make sense?
>>
>> 2. Does it make sense to filter out apostrophes prior to WordDelimiter
>> to prevent this from happening, or will that cause other issues?
>>
>> Thanks,
>>
>> Michael Della Bitta
