In fact I tried without WordDelimiterFilterFactory (using a PatternTokenizerFactory to tokenize on special chars) and I still have the same problem. Apparently the dismax handler treats 'france-histoire' as a single word even though I tokenize on '-'.
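The dismax parse for that case can be inspected directly with debugQuery (a standard Solr parameter); the qf value below is just the one used elsewhere in this thread:

```
q=france-histoire&defType=dismax&qf=any&mm=100%&debugQuery=on
```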

Demian Katz <demian.k...@villanova.edu> wrote:
I just sent an email to the list about DisMax interacting with WordDelimiterFilterFactory, and I think our problems are at least partially related -- I think the reason you are seeing an OR where you expect an AND is that you have autoGeneratePhraseQueries set to false, which changes the way DisMax handles the output of the WordDelimiterFilterFactory (among others). Unfortunately, I don't have a solution for you... but you might want to keep an eye on my thread in case replies there shed any additional light.
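To illustrate the point about autoGeneratePhraseQueries: with it set to "true", the sub-tokens that WordDelimiterFilterFactory produces from 'histoire-france' would typically be recombined into a phrase query instead of two independent SHOULD clauses. The second line below is illustrative, not taken from the thread:

```
autoGeneratePhraseQueries="false":  +((any:histoir any:franc)) ()
autoGeneratePhraseQueries="true":   +(any:"histoir franc") ()
```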



- Demian



> -----Original Message-----
> From: Rohk [mailto:khor...@gmail.com]
> Sent: Tuesday, October 25, 2011 10:33 AM
> To: solr-user@lucene.apache.org
> Subject: Dismax handler - whitespace and special character behaviour
>
> Hello,
>
> I've got strange results when I have special characters in my query.

>

> Here is my request :
>
> q=histoire-france&start=0&rows=10&sort=score+desc&defType=dismax&qf=any^1.0&mm=100%
>
> Parsed query :
>
> +((any:histoir any:franc)) ()
>
> I've got 17000 results because Solr is doing an OR (should be AND).

>

> I have no problem when I'm using a whitespace instead of a special char :
>
> q=histoire france&start=0&rows=10&sort=score+desc&defType=dismax&qf=any^1.0&mm=100%
>
> +(((any:histoir) (any:franc))~2) ()
>
> 2000 results for this query.
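Side by side, the two parsed queries above differ in where the minimum-should-match clause count (the ~2, from mm=100% over two clauses) lands, which is where the OR-like behaviour comes from:

```
q=histoire-france  ->  +((any:histoir any:franc)) ()         no ~2: the WDF sub-tokens form one clause
q=histoire france  ->  +(((any:histoir) (any:franc))~2) ()   ~2 applies mm across the two user terms
```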

>

> Here is my schema.xml (relevant parts; the opening XML tags were lost in transit, only the wrapped attribute lines survived) :
>
> ... positionIncrementGap="100" autoGeneratePhraseQueries="false">
> ... generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="1" preserveOriginal="1"/>
> ... words="stopwords_french.txt" ignoreCase="true"/>
> ... words="stopwords_french.txt" enablePositionIncrements="true"/>
> ... language="French" protected="protwords.txt"/>
>
> <!-- ... synonyms="synonyms.txt" ignoreCase="true" expand="true"/> -->
> ... generateWordParts="1" generateNumberParts="1" catenateWords="0" catenateNumbers="0" catenateAll="0" splitOnCaseChange="1" preserveOriginal="0"/>
> ... words="stopwords_french.txt" ignoreCase="true"/>
> ... words="stopwords_french.txt" enablePositionIncrements="true"/>
> ... language="French" protected="protwords.txt"/>
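Pieced together from the surviving attributes, the field type presumably looks roughly like the sketch below. The whitespace tokenizer matches the analysis output later in the mail and the stemmed, lowercased tokens in the parsed queries suggest a lowercase filter, but the field type name, the LowerCaseFilterFactory, and the StopFilterFactory class names on the `words=` lines are assumptions:

```xml
<fieldType name="text_fr" class="solr.TextField"
           positionIncrementGap="100" autoGeneratePhraseQueries="false">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1"
            generateNumberParts="1" catenateWords="1" catenateNumbers="1"
            catenateAll="0" splitOnCaseChange="1" preserveOriginal="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.StopFilterFactory" words="stopwords_french.txt" ignoreCase="true"/>
    <filter class="solr.StopFilterFactory" words="stopwords_french.txt" enablePositionIncrements="true"/>
    <filter class="solr.SnowballPorterFilterFactory" language="French" protected="protwords.txt"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <!--<filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>-->
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1"
            generateNumberParts="1" catenateWords="0" catenateNumbers="0"
            catenateAll="0" splitOnCaseChange="1" preserveOriginal="0"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.StopFilterFactory" words="stopwords_french.txt" ignoreCase="true"/>
    <filter class="solr.StopFilterFactory" words="stopwords_french.txt" enablePositionIncrements="true"/>
    <filter class="solr.SnowballPorterFilterFactory" language="French" protected="protwords.txt"/>
  </analyzer>
</fieldType>
```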

> I tried with a PatternTokenizerFactory to tokenize on whitespaces & special chars, but no change...
> Even with a charFilter (PatternReplaceCharFilterFactory) to replace special characters with whitespace, it doesn't work...
>
> First line of analysis via the Solr admin, with verbose output, for the query 'histoire-france' :
>
> org.apache.solr.analysis.PatternReplaceCharFilterFactory {replacement= , pattern=([,;./\\'&-]), luceneMatchVersion=LUCENE_32}
> text histoire france
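For reference, the charFilter implied by that analysis line would presumably be declared like this at the top of the analyzer; the attribute values are taken verbatim from the verbose output (with `&` XML-escaped as `&amp;`):

```xml
<charFilter class="solr.PatternReplaceCharFilterFactory"
            pattern="([,;./\\'&amp;-])" replacement=" "/>
```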

>

> The '-' is replaced by ' ', then tokenized by WhitespaceTokenizerFactory.
> However, I still get a different number of results for 'histoire-france' and 'histoire france'.

> My current workaround is to replace all special chars with whitespaces before sending the query to Solr, but it is not satisfying.
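A sketch of that client-side workaround; the character class is copied from the PatternReplaceCharFilterFactory output above, and the helper name is made up:

```python
import re

# Same character class as the PatternReplaceCharFilterFactory above
SPECIAL_CHARS = re.compile(r"[,;./\\'&-]")

def normalize_query(q: str) -> str:
    """Replace special characters with spaces and collapse runs of
    whitespace before sending the query string to Solr (hypothetical helper)."""
    return re.sub(r"\s+", " ", SPECIAL_CHARS.sub(" ", q)).strip()

# 'histoire-france' becomes the same query string as 'histoire france'
print(normalize_query("histoire-france"))  # histoire france
```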

>

> Did I miss something?

