Re: Multilingual indexing, search results, edismax and stopwords

Jack Krupansky Sun, 23 Mar 2014 11:39:06 -0700

Setting the default query operator to AND is the preferred approach: q.op=AND.
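As a rough illustration of passing q.op=AND per request, here is a minimal Python sketch that builds a select URL. The host, port and path are placeholders that do not come from this thread, and the function name is hypothetical. It is also worth verifying that the edismax parser in the deployed Solr version honors q.op.

```python
from urllib.parse import urlencode

# Hypothetical endpoint: host, port and path are placeholders, not from the thread.
SOLR_SELECT_URL = "http://localhost:8983/solr/select"


def build_select_url(user_query: str) -> str:
    """Return a select URL that asks for AND as the default operator."""
    params = {
        "q": user_query,
        "defType": "edismax",  # the parser named in the quoted request handler
        "q.op": "AND",         # per-request default operator
    }
    return SOLR_SELECT_URL + "?" + urlencode(params)


print(build_select_url("biological analyses"))
```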

That said, I'm not sure that counting ignored and empty terms towards the mm% makes sense. IOW, if a term transforms to nothing, either because it is a stop word or empty synonym replacement or pure punctuation, I don't think it should count as a term. I think this is worth a Jira.


-- Jack Krupansky

-----Original Message-----
From: kastania44

Sent: Thursday, March 20, 2014 11:00 AM
To: solr-user@lucene.apache.org
Subject: Multilingual indexing, search results, edismax and stopwords

On our multilingual Drupal system we use Apache Solr 3.5.
The problem is a well-known one, discussed on several blogs and sites I have read:
the search results are not the ones we want.
In our code, in hook apachesolr_query_alter, we override the defaultOperator behaviour by setting mm:
$query->replaceParam('mm', '90%');
The requirement is that when I search for: biological analyses, I fetch
only the results that contain both words.
When I search for: biological and chemical analyses, I want to fetch only
the results that contain biological, chemical and analyses. The word "and" is not
indexed because it is a stopword.

If I set mm to 100% and my query contains stopwords, it does not fetch any results.
If I set mm to 100% and my query does not contain stopwords, it fetches the
desired results.
If I set mm to anything between 50% and 99%, it fetches unwanted results, such as
results that contain only one of the searched keywords, or words similar to the
searched keywords, such as analyse (even though I searched for analyses).
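The Solr documentation describes a percentage mm as being rounded down to a whole number of clauses. The sketch below applies that rule, assuming each whitespace-separated word becomes one optional clause. The function name is hypothetical and this is an approximation, not Solr's actual code.

```python
def required_clauses(optional_clauses: int, mm_percent: int) -> int:
    """Clauses that must match for a positive percentage mm, rounded down."""
    # Integer arithmetic avoids floating-point surprises such as 2 * 0.9.
    return (optional_clauses * mm_percent) // 100


# Compare mm=90% and mm=100% for queries of two, three and four clauses.
for n in (2, 3, 4):
    print(n, required_clauses(n, 90), required_clauses(n, 100))
```

Under these assumptions, 90% of the two clauses in "biological analyses" rounds down to one required clause, which would account for results containing only one keyword. Matches on analyse for analyses more likely come from the stemming filter in the field type than from mm.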

If I search using + before the mandatory words it works OK, but it
is not user-friendly to ask the user to type + before each word
except the stopwords.
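One possible workaround is to add the + prefixes in application code before the query is sent, so the user never types them. The following Python sketch shows the idea only. The stopword set and the function name are hypothetical, the real list would have to mirror the language-specific stopwords file, and quoted phrases or other query syntax are not handled.

```python
# Hypothetical stopword list; in practice it would mirror stopwords_en.txt.
STOPWORDS = frozenset({"and", "or", "the", "of"})


def require_all_terms(user_query: str, stopwords=STOPWORDS) -> str:
    """Drop stopwords and prefix every remaining term with '+'."""
    terms = user_query.split()
    kept = [t for t in terms if t.lower() not in stopwords]
    # Leave terms alone if the user already marked them with + or -.
    return " ".join(t if t.startswith(("+", "-")) else "+" + t for t in kept)


print(require_all_terms("biological and chemical analyses"))
```

In the Drupal setup described here, the same transformation would run inside the query alter hook rather than in Python.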

Do I make any sense?

Below are some of our configuration details:

All the indexed fields are of type text_language,
e.g. from our schema.xml:
<field name="label" type="text" indexed="true" stored="true"
       termVectors="true" omitNorms="true"/>
<field name="i18n_label_en" type="text_en" indexed="true" stored="true"
       termVectors="true" omitNorms="true"/>
<field name="i18n_label_fr" type="text_fr" indexed="true" stored="true"
       termVectors="true" omitNorms="true"/>
All the text field types have the same configuration except for the
protected, words and dictionary parameters, which are language-specific,
e.g. from our schema.xml:
<fieldType name="text_en" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <charFilter class="solr.MappingCharFilterFactory"
                mapping="mapping-ISOLatin1Accent_en.txt"/>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true"
            words="stopwords_en.txt" enablePositionIncrements="true"/>
    <filter class="solr.WordDelimiterFilterFactory"
            protected="protwords.txt" generateWordParts="1" generateNumberParts="1"
            catenateWords="1" catenateNumbers="1" catenateAll="1" splitOnCaseChange="1"
            preserveOriginal="1" splitOnNumerics="1" stemEnglishPossessive="1"/>
    <filter class="solr.LengthFilterFactory" min="2" max="100"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.DictionaryCompoundWordTokenFilterFactory"
            dictionary="compoundwords_en.txt" minWordSize="5" minSubwordSize="4"
            maxSubwordSize="15" onlyLongestMatch="true"/>
    <filter class="solr.SnowballPorterFilterFactory" language="English"
            protected="protwords_en.txt"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <charFilter class="solr.MappingCharFilterFactory"
                mapping="mapping-ISOLatin1Accent_en.txt"/>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms_en.txt"
            ignoreCase="true" expand="true"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true"
            words="stopwords_en.txt" enablePositionIncrements="true"/>
    <filter class="solr.WordDelimiterFilterFactory"
            protected="protwords.txt" generateWordParts="1" generateNumberParts="1"
            catenateWords="1" catenateNumbers="1" catenateAll="1" splitOnCaseChange="1"
            preserveOriginal="1" splitOnNumerics="1" stemEnglishPossessive="1"/>
    <filter class="solr.LengthFilterFactory" min="2" max="100"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.DictionaryCompoundWordTokenFilterFactory"
            dictionary="compoundwords_en.txt" minWordSize="5" minSubwordSize="4"
            maxSubwordSize="15" onlyLongestMatch="true"/>
    <filter class="solr.SnowballPorterFilterFactory" language="English"
            protected="protwords_en.txt"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
</fieldType>

<solrQueryParser defaultOperator="AND"/>

solrconfig.xml

 <requestHandler name="pinkPony" class="solr.SearchHandler" default="true">
   <lst name="defaults">
     <str name="defType">edismax</str>
     <str name="echoParams">explicit</str>
     <bool name="omitHeader">true</bool>
     <float name="tie">0.01</float>

     <int name="timeAllowed">${solr.pinkPony.timeAllowed:-1}</int>
     <str name="q.alt">*:*</str>


     <str name="spellcheck">false</str>

     <str name="spellcheck.onlyMorePopular">true</str>
     <str name="spellcheck.extendedResults">false</str>

     <str name="spellcheck.count">1</str>
   </lst>
   <arr name="last-components">
     <str>spellcheck</str>
   </arr>
 </requestHandler>


ANY ideas are appreciated!



--

View this message in context: http://lucene.472066.n3.nabble.com/Multilingual-indexing-search-results-edismax-and-stopwords-tp4125746.html
Sent from the Solr - User mailing list archive at Nabble.com.
