RE: Dismax handler - whitespace and special character behaviour

Demian Katz Tue, 25 Oct 2011 10:31:55 -0700

I just sent an email to the list about DisMax interacting with 
WordDelimiterFilterFactory, and I think our problems are at least partially 
related. I suspect you are seeing an OR where you expect an AND because you 
have autoGeneratePhraseQueries set to false, which changes the way DisMax 
handles the output of WordDelimiterFilterFactory (among other filters). 
Unfortunately, I don't have a solution for you, but you might want to keep 
an eye on my thread in case replies there shed any additional light.
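
As a rough illustration of the difference between the two parsed queries 
quoted below, here is a toy Python model. It is not Solr code: the function 
names and the set-based "document" are made up for the example. It assumes 
that mm is counted over the whitespace-separated chunks of the query, and that 
with autoGeneratePhraseQueries="false" the tokens produced from a single chunk 
are combined as optional alternatives.

```python
import re

def analyze(chunk):
    """Stand-in for the query analyzer: split on non-letters and lowercase.
    (The real chain also stems, e.g. histoire -> histoir.)"""
    return [t.lower() for t in re.split(r"[^A-Za-z]+", chunk) if t]

def matches(query, doc_terms, mm_percent=100):
    """Toy model (an assumption, not Solr's implementation):
    one top-level clause per whitespace-separated chunk; a clause matches
    if ANY of its tokens is in the document; mm applies to top-level clauses."""
    clauses = [analyze(chunk) for chunk in query.split()]
    clauses = [c for c in clauses if c]
    required = len(clauses) * mm_percent // 100
    matched = sum(1 for c in clauses if any(t in doc_terms for t in c))
    return matched >= required

# A document that contains "histoire" but not "france".
doc = {"histoire", "de", "la", "belgique"}

print(matches("histoire france", doc))  # two clauses, both required
print(matches("histoire-france", doc))  # one clause, either token satisfies it
```

Under these assumptions the hyphenated query matches the document even with 
mm=100%, because only one top-level clause is counted, which would be 
consistent with the larger result count reported below.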


- Demian

> -----Original Message-----
> From: Rohk [mailto:khor...@gmail.com]
> Sent: Tuesday, October 25, 2011 10:33 AM
> To: solr-user@lucene.apache.org
> Subject: Dismax handler - whitespace and special character behaviour
> 
> Hello,
> 
> I get strange results when my query contains special characters.
> 
> Here is my request:
> 
> q=histoire-france&start=0&rows=10&sort=score+desc&defType=dismax&qf=any^1.0&mm=100%
> 
> Parsed query:
> 
> <str name="parsedquery_toString">+((any:histoir any:franc)) ()</str>
> 
> I get 17000 results because Solr is doing an OR (it should be an AND).
> 
> I have no problem when I use a whitespace instead of a special character:
> 
> q=histoire france&start=0&rows=10&sort=score+desc&defType=dismax&qf=any^1.0&mm=100%
> 
> <str name="parsedquery_toString">+(((any:histoir) (any:franc))~2) ()</str>
> 
> 2000 results for this query.
> 
> Here is my schema.xml (relevant parts):
> 
> <fieldType name="text" class="solr.TextField"
> positionIncrementGap="100" autoGeneratePhraseQueries="false">
>       <analyzer type="index">
>         <tokenizer class="solr.WhitespaceTokenizerFactory"/>
>         <filter class="solr.WordDelimiterFilterFactory"
> generateWordParts="1" generateNumberParts="1" catenateWords="1"
> catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"
> preserveOriginal="1"/>
>         <filter class="solr.LowerCaseFilterFactory"/>
>         <filter class="solr.CommonGramsFilterFactory"
> words="stopwords_french.txt" ignoreCase="true"/>
>         <filter class="solr.StopFilterFactory" ignoreCase="true"
> words="stopwords_french.txt" enablePositionIncrements="true"/>
>         <filter class="solr.SnowballPorterFilterFactory"
> language="French" protected="protwords.txt"/>
>         <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
>         <filter class="solr.ASCIIFoldingFilterFactory"/>
>       </analyzer>
>       <analyzer type="query">
>         <tokenizer class="solr.WhitespaceTokenizerFactory"/>
>         <!--<filter class="solr.SynonymFilterFactory"
> synonyms="synonyms.txt" ignoreCase="true" expand="true"/>-->
>         <filter class="solr.WordDelimiterFilterFactory"
> generateWordParts="1" generateNumberParts="1" catenateWords="0"
> catenateNumbers="0" catenateAll="0" splitOnCaseChange="1"
> preserveOriginal="0"/>
>         <filter class="solr.LowerCaseFilterFactory"/>
>         <filter class="solr.CommonGramsFilterFactory"
> words="stopwords_french.txt" ignoreCase="true"/>
>         <filter class="solr.StopFilterFactory" ignoreCase="true"
> words="stopwords_french.txt" enablePositionIncrements="true"/>
>         <filter class="solr.SnowballPorterFilterFactory"
> language="French" protected="protwords.txt"/>
>         <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
>         <filter class="solr.ASCIIFoldingFilterFactory"/>
>       </analyzer>
>     </fieldType>
> 
> I tried a PatternTokenizerFactory to tokenize on whitespace and special
> characters, but nothing changed.
> Even with a charFilter (PatternReplaceCharFilterFactory) that replaces
> special characters with whitespace, it doesn't work.
> 
> First line of the analysis via the Solr admin, with verbose output, for
> query = 'histoire-france':
> 
> org.apache.solr.analysis.PatternReplaceCharFilterFactory {replacement= , pattern=([,;./\\'&-]), luceneMatchVersion=LUCENE_32}
> text    histoire france
> 
> The '-' is replaced by ' ', and the result is then tokenized by
> WhitespaceTokenizerFactory.
> However, I still get a different number of results for 'histoire-france'
> and 'histoire france'.
> 
> My current workaround is to replace all special characters with whitespace
> before sending the query to Solr, but that is not satisfying.
> 
> Did I miss something?
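
The client-side workaround described in the quoted message could look roughly 
like the following Python sketch. The character class is copied from the 
PatternReplaceCharFilterFactory pattern in the analysis output, and the request 
parameters mirror the request quoted above; the function names are illustrative 
assumptions, not part of the original poster's code.

```python
import re
from urllib.parse import urlencode

# Same characters as the charFilter pattern ([,;./\\'&-]) in the analysis output.
SPECIAL_CHARS = re.compile(r"[,;./\\'&-]")

def normalize_query(q):
    """Replace each special character with a space, then collapse whitespace runs."""
    return " ".join(SPECIAL_CHARS.sub(" ", q).split())

def build_params(q):
    """Build the URL-encoded dismax request parameters (hypothetical helper)."""
    return urlencode({
        "q": normalize_query(q),
        "start": 0,
        "rows": 10,
        "sort": "score desc",
        "defType": "dismax",
        "qf": "any^1.0",
        "mm": "100%",
    })

print(normalize_query("histoire-france"))
print(build_params("histoire-france"))
```

Because the replacement happens before the query string reaches Solr, the 
parser sees two whitespace-separated terms, which is the case that already 
behaves as expected in the message above.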
