In fact I tried without WordDelimiterFilterFactory (using a PatternTokenizerFactory to tokenize on special chars) and I still have the same problem. Apparently the dismax handler treats 'france-histoire' as a single word even though I tokenize on '-'.
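The dismax parse for that case can be inspected directly with debugQuery (a standard Solr parameter); the qf value below is just the one used elsewhere in this thread:

```
q=france-histoire&defType=dismax&qf=any&mm=100%&debugQuery=on
```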

Demian Katz <demian.k...@villanova.edu> wrote:
I just sent an email to the list about DisMax interacting with WordDelimiterFilterFactory, and I think our problems are at least partially related -- I think the reason you are seeing an OR where you expect an AND is that you have autoGeneratePhraseQueries set to false, which changes the way DisMax handles the output of the WordDelimiterFilterFactory (among others). Unfortunately, I don't have a solution for you... but you might want to keep an eye on my thread in case replies there shed any additional light.
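To illustrate the point about autoGeneratePhraseQueries: with it set to "true", the sub-tokens that WordDelimiterFilterFactory produces from 'histoire-france' would typically be recombined into a phrase query instead of two independent SHOULD clauses. The second line below is illustrative, not taken from the thread:

```
autoGeneratePhraseQueries="false":  +((any:histoir any:franc)) ()
autoGeneratePhraseQueries="true":   +(any:"histoir franc") ()
```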



- Demian



> -----Original Message-----
> From: Rohk [mailto:khor...@gmail.com]
> Sent: Tuesday, October 25, 2011 10:33 AM
> To: solr-user@lucene.apache.org
> Subject: Dismax handler - whitespace and special character behaviour
>
> Hello,
>
> I've got strange results when I have special characters in my query.

>

> Here is my request :
>
> q=histoire-france&start=0&rows=10&sort=score+desc&defType=dismax&qf=any^1.0&mm=100%
>
> Parsed query :
>
> +((any:histoir any:franc)) ()
>
> I've got 17000 results because Solr is doing an OR (should be AND).

>

> I have no problem when I'm using a whitespace instead of a special char :
>
> q=histoire france&start=0&rows=10&sort=score+desc&defType=dismax&qf=any^1.0&mm=100%
>
> +(((any:histoir) (any:franc))~2) ()
>
> 2000 results for this query.
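Side by side, the two parsed queries above differ in where the minimum-should-match clause count (the ~2, from mm=100% over two clauses) lands, which is where the OR-like behaviour comes from:

```
q=histoire-france  ->  +((any:histoir any:franc)) ()         no ~2: the WDF sub-tokens form one clause
q=histoire france  ->  +(((any:histoir) (any:franc))~2) ()   ~2 applies mm across the two user terms
```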

>

> Here is my schema.xml (relevant parts; the opening XML tags were lost in transit, only the wrapped attribute lines survived) :
>
> ... positionIncrementGap="100" autoGeneratePhraseQueries="false">
> ... generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="1" preserveOriginal="1"/>
> ... words="stopwords_french.txt" ignoreCase="true"/>
> ... words="stopwords_french.txt" enablePositionIncrements="true"/>
> ... language="French" protected="protwords.txt"/>
>
> <!-- ... synonyms="synonyms.txt" ignoreCase="true" expand="true"/> -->
> ... generateWordParts="1" generateNumberParts="1" catenateWords="0" catenateNumbers="0" catenateAll="0" splitOnCaseChange="1" preserveOriginal="0"/>
> ... words="stopwords_french.txt" ignoreCase="true"/>
> ... words="stopwords_french.txt" enablePositionIncrements="true"/>
> ... language="French" protected="protwords.txt"/>
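Pieced together from the surviving attributes, the field type presumably looks roughly like the sketch below. The whitespace tokenizer matches the analysis output later in the mail and the stemmed, lowercased tokens in the parsed queries suggest a lowercase filter, but the field type name, the LowerCaseFilterFactory, and the StopFilterFactory class names on the `words=` lines are assumptions:

```xml
<fieldType name="text_fr" class="solr.TextField"
           positionIncrementGap="100" autoGeneratePhraseQueries="false">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1"
            generateNumberParts="1" catenateWords="1" catenateNumbers="1"
            catenateAll="0" splitOnCaseChange="1" preserveOriginal="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.StopFilterFactory" words="stopwords_french.txt" ignoreCase="true"/>
    <filter class="solr.StopFilterFactory" words="stopwords_french.txt" enablePositionIncrements="true"/>
    <filter class="solr.SnowballPorterFilterFactory" language="French" protected="protwords.txt"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <!--<filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>-->
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1"
            generateNumberParts="1" catenateWords="0" catenateNumbers="0"
            catenateAll="0" splitOnCaseChange="1" preserveOriginal="0"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.StopFilterFactory" words="stopwords_french.txt" ignoreCase="true"/>
    <filter class="solr.StopFilterFactory" words="stopwords_french.txt" enablePositionIncrements="true"/>
    <filter class="solr.SnowballPorterFilterFactory" language="French" protected="protwords.txt"/>
  </analyzer>
</fieldType>
```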

> I tried with a PatternTokenizerFactory to tokenize on whitespaces & special chars, but no change...
> Even with a charFilter (PatternReplaceCharFilterFactory) to replace special characters with whitespace, it doesn't work...
>
> First line of analysis via the Solr admin, with verbose output, for the query 'histoire-france' :
>
> org.apache.solr.analysis.PatternReplaceCharFilterFactory {replacement= , pattern=([,;./\\'&-]), luceneMatchVersion=LUCENE_32}
> text histoire france
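For reference, the charFilter implied by that analysis line would presumably be declared like this at the top of the analyzer; the attribute values are taken verbatim from the verbose output (with `&` XML-escaped as `&amp;`):

```xml
<charFilter class="solr.PatternReplaceCharFilterFactory"
            pattern="([,;./\\'&amp;-])" replacement=" "/>
```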

>

> The '-' is replaced by ' ', then tokenized by WhitespaceTokenizerFactory.
> However, I still get a different number of results for 'histoire-france' and 'histoire france'.

> My current workaround is to replace all special chars with whitespaces before sending the query to Solr, but it is not satisfying.
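A sketch of that client-side workaround; the character class is copied from the PatternReplaceCharFilterFactory output above, and the helper name is made up:

```python
import re

# Same character class as the PatternReplaceCharFilterFactory above
SPECIAL_CHARS = re.compile(r"[,;./\\'&-]")

def normalize_query(q: str) -> str:
    """Replace special characters with spaces and collapse runs of
    whitespace before sending the query string to Solr (hypothetical helper)."""
    return re.sub(r"\s+", " ", SPECIAL_CHARS.sub(" ", q)).strip()

# 'histoire-france' becomes the same query string as 'histoire france'
print(normalize_query("histoire-france"))  # histoire france
```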

>

> Did I miss something?

