Strange suggestions with spell checker

Jens Hoffrichter Mon, 25 Jul 2011 08:46:05 -0700

Hello all,

I'm getting a strange suggestion for a purposely mistyped word in Solr 1.4.1.


I search for the term "snia", and I would expect the term "sina" to be
suggested, as this is a fairly common word in quite a few of the indexed
documents.

Instead, I'm getting "india" as the top suggestion, which occurs only once in
the index and (at least as far as I understand the algorithm) has a greater
Levenshtein distance than "sina".

The configuration for the spellchecker is pretty straightforward, basically
taken directly from the examples:

 <searchComponent name="spellcheck" class="solr.SpellCheckComponent">

    <str name="queryAnalyzerFieldType">textSpell</str>


    <lst name="spellchecker">
      <str name="name">default</str>
      <str name="field">spell</str>
      <str name="buildOnOptimize">true</str>
      <str name="buildOnCommit">true</str>
      <str name="spellcheckIndexDir">./spellchecker1</str>
      <str name="comparatorClass">freq</str>
      <float name="thresholdTokenFrequency">.01</float>
    </lst>
 </searchComponent>

I tried setting comparatorClass there (ranking by frequency would probably
yield better results for me), but only noticed afterwards that it is
available only in Solr 4.

The complete suggestions I get from the standard search component are:

<lst name="spellcheck">
  <lst name="suggestions">
    <lst name="snia">
      <int name="numFound">5</int>
      <int name="startOffset">0</int>
      <int name="endOffset">4</int>
      <int name="origFreq">0</int>
      <arr name="suggestion">
        <lst>
          <str name="word">india</str>
          <int name="freq">1</int>
        </lst>
        <lst>
          <str name="word">sina</str>
          <int name="freq">30</int>
        </lst>
        <lst>
          <str name="word">soa</str>
          <int name="freq">4</int>
        </lst>
        <lst>
          <str name="word">unit</str>
          <int name="freq">3</int>
        </lst>
        <lst>
          <str name="word">sei</str>
          <int name="freq">2</int>
        </lst>
      </arr>
    </lst>
    <bool name="correctlySpelled">false</bool>
  </lst>
</lst>
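
As a rough check of the distance assumption, here is a minimal Python sketch
that recomputes the ranking by hand. It rests on two assumptions that are not
confirmed against the Solr 1.4.1 source: that the score is the plain
Levenshtein distance normalized by the length of the longer word
(1 - distance / max length), and that ties are broken by frequency. The
function names are made up for illustration.

```python
def levenshtein(a: str, b: str) -> int:
    """Plain Levenshtein distance (insert, delete, substitute; no transpositions)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def similarity(a: str, b: str) -> float:
    """ASSUMPTION: score = 1 - distance / length of the longer word."""
    return 1.0 - levenshtein(a, b) / max(len(a), len(b))


# Suggestions and frequencies copied from the response in this mail.
suggestions = {"india": 1, "sina": 30, "soa": 4, "unit": 3, "sei": 2}
query = "snia"

# ASSUMPTION: rank by similarity first, then by frequency as a tie-breaker.
ranked = sorted(suggestions,
                key=lambda w: (similarity(query, w), suggestions[w]),
                reverse=True)

for word in ranked:
    print(word, levenshtein(query, word), round(similarity(query, word), 2))
```

Under those assumptions, "india" and "sina" both come out at an edit distance
of 2 from "snia" (a swap of two adjacent letters counts as two substitutions),
but "india" is the longer word and so gets the higher normalized score. The
remaining words tie and fall into frequency order, which would reproduce the
order of the response.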

Apart from the "india" suggestion, the other ones are okay, though I need to
tune my stopwords for the (German) indexer a bit more.

Is there any explanation for why "india" is ranked above "sina" in the
suggestions? Is there anything I can tweak in the configuration to get the
desired result?

If some information is missing, don't hesitate to ask; I will try to supply
it.

Many thanks in advance,
Jens
