Multilingual indexing, search results, edismax and stopwords

kastania44 Thu, 20 Mar 2014 08:01:07 -0700

On our multilingual Drupal system we use Apache Solr 3.5.
The problem is a well-known one, judging by the blogs and sites I have read:
the search results are not the ones we want.
In our code, in hook_apachesolr_query_alter, we override the mm (minimum
should match) parameter:
$query->replaceParam('mm', '90%');
The requirement is: when I search for "biological analyses", I want to fetch
only the results that contain both words.
When I search for "biological and chemical analyses", I want to fetch only
the results that contain biological, chemical and analyses. The word "and" is
not indexed because it is a stopword.


If I set mm to 100% and my query contains stopwords, it does not fetch any
results.
If I set mm to 100% and my query does not contain stopwords, it fetches the
desired results.
If I set mm to anything between 50% and 99%, it fetches unwanted results,
such as results that contain only one of the searched keywords, or words
similar to the searched keywords, like analyse (even though I searched for
analyses).
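
For reference, the Solr documentation describes a percentage mm as being
rounded down to a whole number of clauses. Below is a minimal sketch of that
arithmetic in Python. The function name min_should_match is hypothetical, and
the sketch covers only a plain positive percentage, not the full mm syntax or
Solr's actual implementation. It may help explain why 90% lets a two-word
query match on a single word:

```python
def min_should_match(optional_clauses: int, percent: int) -> int:
    """Clauses that must match for a plain positive mm percentage.

    Assumption: models only the documented "rounded down" rule.
    """
    if optional_clauses < 0:
        raise ValueError("optional_clauses must not be negative")
    if not 0 <= percent <= 100:
        raise ValueError("percent must be between 0 and 100")
    # Integer division rounds down, e.g. 2 * 90 // 100 == 1.
    return (optional_clauses * percent) // 100


if __name__ == "__main__":
    for clauses in (2, 3, 10):
        required = min_should_match(clauses, 90)
        print(f"{clauses} clauses at 90% -> {required} must match")
```

Under that rule, 90% of two clauses is 1, so "biological analyses" needs only
one of its two words to match, while 100% requires all of them.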

If I search using + before the mandatory words it works fine, but it is not
user-friendly to ask the user to type + before every word except the
stopwords.

Do I make any sense? 

Below are some of our configuration details:

All the indexed fields use a language-specific text type (text_<language>),
e.g. from our schema.xml:
<field name="label" type="text" indexed="true" stored="true"
termVectors="true" omitNorms="true"/>
<field name="i18n_label_en" type="text_en" indexed="true" stored="true"
termVectors="true" omitNorms="true"/>
<field name="i18n_label_fr" type="text_fr" indexed="true" stored="true"
termVectors="true" omitNorms="true"/>
All the text field types have the same configuration, except for the
protected, words and dictionary parameters, which are language-specific,
e.g. from our schema.xml:
<fieldType name="text_en" class="solr.TextField"
positionIncrementGap="100">
      <analyzer type="index">
        <charFilter class="solr.MappingCharFilterFactory"
mapping="mapping-ISOLatin1Accent_en.txt"/>
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        
        
        <filter class="solr.StopFilterFactory" ignoreCase="true"
words="stopwords_en.txt" enablePositionIncrements="true"/>
        <filter class="solr.WordDelimiterFilterFactory"
protected="protwords.txt" generateWordParts="1" generateNumberParts="1"
catenateWords="1" catenateNumbers="1" catenateAll="1" splitOnCaseChange="1"
preserveOriginal="1" splitOnNumerics="1" stemEnglishPossessive="1"/>
        <filter class="solr.LengthFilterFactory" min="2" max="100"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.DictionaryCompoundWordTokenFilterFactory"
dictionary="compoundwords_en.txt" minWordSize="5" minSubwordSize="4"
maxSubwordSize="15" onlyLongestMatch="true"/>
        <filter class="solr.SnowballPorterFilterFactory" language="English"
protected="protwords_en.txt"/>
        <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
      </analyzer>
      <analyzer type="query">
        <charFilter class="solr.MappingCharFilterFactory"
mapping="mapping-ISOLatin1Accent_en.txt"/>
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.SynonymFilterFactory" synonyms="synonyms_en.txt"
ignoreCase="true" expand="true"/>
        <filter class="solr.StopFilterFactory" ignoreCase="true"
words="stopwords_en.txt" enablePositionIncrements="true"/>
        <filter class="solr.WordDelimiterFilterFactory"
protected="protwords.txt" generateWordParts="1" generateNumberParts="1"
catenateWords="1" catenateNumbers="1" catenateAll="1" splitOnCaseChange="1"
preserveOriginal="1" splitOnNumerics="1" stemEnglishPossessive="1"/>
        <filter class="solr.LengthFilterFactory" min="2" max="100"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.DictionaryCompoundWordTokenFilterFactory"
dictionary="compoundwords_en.txt" minWordSize="5" minSubwordSize="4"
maxSubwordSize="15" onlyLongestMatch="true"/>
        <filter class="solr.SnowballPorterFilterFactory" language="English"
protected="protwords_en.txt"/>
        <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
      </analyzer>
    </fieldType>

 <solrQueryParser defaultOperator="AND"/>

From our solrconfig.xml:

  <requestHandler name="pinkPony" class="solr.SearchHandler"
default="true">
    <lst name="defaults">
      <str name="defType">edismax</str>
      <str name="echoParams">explicit</str>
      <bool name="omitHeader">true</bool>
      <float name="tie">0.01</float>
      
      <int name="timeAllowed">${solr.pinkPony.timeAllowed:-1}</int>
      <str name="q.alt">*:*</str>

      
      <str name="spellcheck">false</str>
      
      <str name="spellcheck.onlyMorePopular">true</str>
      <str name="spellcheck.extendedResults">false</str>
      
      <str name="spellcheck.count">1</str>
    </lst>
    <arr name="last-components">
      <str>spellcheck</str>
    </arr>
  </requestHandler>


ANY ideas are appreciated!



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Multilingual-indexing-search-results-edismax-and-stopwords-tp4125746.html
Sent from the Solr - User mailing list archive at Nabble.com.
