Problems with Spellchecker in 3.1

Bob Sandiford Tue, 26 Apr 2011 09:54:44 -0700

Hi, all.

Sorry for any duplication - seems like what I sent yesterday never made it 
through...



We're having some troubles with the Solr Spellcheck Response.  We're running 
version 3.1.



Overview:  If we search for something really ugly like:



      "kljhklsdjahfkljsdhf book rck"



then when we get back the response, there's a suggestions list for 'rck', but 
no suggestions list for the other two words.  For 'book', that's fine, because 
it is 'spelled correctly' (i.e. we got hits on the word) and there shouldn't be 
any suggestions.  For the ugly thing, though, there aren't any hits.



The problem is that when we're handling the result, we can't tell the 
difference between no suggestions for a 'correctly spelled' term, and no 
suggestions for something that's odd like this.
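
A minimal sketch of how this ambiguity looks from the client side, using only the JDK's DOM parser. It lists the query terms that have no entry under `suggestions`. The class and method names (`SuggestionGaps`, `termsWithEntries`, `termsWithoutEntries`) are hypothetical, and the sample XML is a trimmed-down stand-in for a real response.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

// Hypothetical helper, for illustration only.
class SuggestionGaps {

  /** Names of the per-term <lst> entries directly under <lst name="suggestions">. */
  static Set<String> termsWithEntries(String responseXml) {
    Document doc;
    try {
      doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
          .parse(new ByteArrayInputStream(responseXml.getBytes(StandardCharsets.UTF_8)));
    } catch (Exception e) {
      throw new IllegalArgumentException("could not parse response", e);
    }
    Set<String> terms = new LinkedHashSet<String>();
    NodeList lists = doc.getElementsByTagName("lst");
    for (int i = 0; i < lists.getLength(); i++) {
      Element lst = (Element) lists.item(i);
      if (!"suggestions".equals(lst.getAttribute("name"))) {
        continue;
      }
      // Only direct <lst> children are per-term entries; <bool> and text nodes are skipped.
      for (Node child = lst.getFirstChild(); child != null; child = child.getNextSibling()) {
        if (child instanceof Element && "lst".equals(child.getNodeName())) {
          terms.add(((Element) child).getAttribute("name"));
        }
      }
    }
    return terms;
  }

  /** Query terms that have no entry at all in the suggestions list. */
  static List<String> termsWithoutEntries(List<String> queryTerms, String responseXml) {
    Set<String> present = termsWithEntries(responseXml);
    List<String> missing = new ArrayList<String>();
    for (String term : queryTerms) {
      if (!present.contains(term)) {
        missing.add(term);
      }
    }
    return missing;
  }

  public static void main(String[] args) {
    String xml = "<response><lst name=\"spellcheck\"><lst name=\"suggestions\">"
        + "<lst name=\"rck\"><int name=\"numFound\">5</int></lst>"
        + "<bool name=\"correctlySpelled\">false</bool>"
        + "</lst></lst></response>";
    // Both the garbage term and 'book' land in the same "no entry" bucket.
    System.out.println(termsWithoutEntries(
        Arrays.asList("kljhklsdjahfkljsdhf", "book", "rck"), xml));
  }
}
```

With a response like the one above, the garbage term and 'book' are both reported as having no entry, so nothing in the response tells them apart.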



(Now - this is happening with searches that aren't as obviously garbage - i.e. 
words that are real words, but that just don't show up in the index and have 
no suggestions - this was just to illustrate the point).



Our setup:

We're running multiple shards, which may be part of the issue.  For example, 
'book' might be found in one of the shards, but not another.
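
To illustrate why sharding could matter: whether a term counts as 'found' depends on how per-shard frequencies are combined. The sketch below only illustrates that idea and is not Solr's actual merge logic. `ShardFreqMerge`, its methods, and the frequency numbers are hypothetical.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration; not taken from Solr.
class ShardFreqMerge {

  /** Sums each term's document frequency over all shards. */
  static Map<String, Integer> mergeFrequencies(List<Map<String, Integer>> perShard) {
    Map<String, Integer> total = new LinkedHashMap<String, Integer>();
    for (Map<String, Integer> shard : perShard) {
      for (Map.Entry<String, Integer> e : shard.entrySet()) {
        Integer prev = total.get(e.getKey());
        total.put(e.getKey(), (prev == null ? 0 : prev) + e.getValue());
      }
    }
    return total;
  }

  /** A term is treated as found if at least one shard reported it with a frequency above zero. */
  static boolean foundAnywhere(Map<String, Integer> merged, String term) {
    Integer freq = merged.get(term);
    return freq != null && freq > 0;
  }

  public static void main(String[] args) {
    // Made-up frequencies for two shards.
    Map<String, Integer> shardA = new LinkedHashMap<String, Integer>();
    shardA.put("book", 120);
    shardA.put("rck", 0);
    Map<String, Integer> shardB = new LinkedHashMap<String, Integer>();
    shardB.put("book", 0);
    shardB.put("rck", 0);

    List<Map<String, Integer>> shards = new ArrayList<Map<String, Integer>>();
    shards.add(shardA);
    shards.add(shardB);

    Map<String, Integer> merged = mergeFrequencies(shards);
    System.out.println("book found: " + foundAnywhere(merged, "book"));
    System.out.println("rck found: " + foundAnywhere(merged, "rck"));
  }
}
```

Under this reading, 'book' is found overall even though one shard, taken alone, would call it missing.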



I don't *think* this has anything to do with our schema, since it's really how 
the Search Suggestions are being returned to us.  But, here are some bits and 
pieces:

From schema.xml:



   <!-- Text field for spell checking -->
   <field name="textSpell" type="text" indexed="true" stored="false" multiValued="true" omitNorms="true"/>





From solrconfig.xml:



  <!-- The spell check component can return a list of alternative spelling
       suggestions.  -->
  <searchComponent name="spellcheck" class="solr.SpellCheckComponent">

    <str name="queryAnalyzerFieldType">textSpell</str>

    <lst name="spellchecker">
      <str name="name">default</str>
      <str name="field">textSpell</str>
      <str name="spellcheckIndexDir">./spellchecker</str>
    </lst>

  </searchComponent>



What we'd really like to see is the response coming back with an indication 
that a word wasn't found / had no suggestions.  We've hacked around in the code 
a little bit to do this, but were wondering if anyone has come across this, and 
what approaches you've taken.



We created new classes which extend IndexBasedSpellChecker and 
SpellCheckComponent, as follows (package and imports excluded for (sort of) 
brevity).  The methods are as taken from the overridden classes, with changes 
noted by "SD" type comments...





/**
 * This has a slight modification of Solr's
 * AbstractLuceneSpellChecker.getSuggestions(SpellingOptions).
 * The modification allows correctly spelled words to be returned in the
 * suggestion.  This modification working in tandem with the
 * SirsiDynixSpellCheckComponent allows words with no suggestions to be
 * returned from the spell check component even in a sharded search.
 * Changes are marked with SD in the comments.
 */
public class SirsiDynixIndexBasedSpellChecker extends IndexBasedSpellChecker {

  @Override
  public SpellingResult getSuggestions(SpellingOptions options) throws IOException {
    boolean shardRequest = false;
    SolrParams params = options.customParams;
    if (params != null) {
      shardRequest = "true".equals(params.get(ShardParams.IS_SHARD));
    }
    SpellingResult result = new SpellingResult(options.tokens);
    IndexReader reader = determineReader(options.reader);
    Term term = field != null ? new Term(field, "") : null;
    float theAccuracy = (options.accuracy == Float.MIN_VALUE)
        ? spellChecker.getAccuracy() : options.accuracy;

    int count = Math.max(options.count, AbstractLuceneSpellChecker.DEFAULT_SUGGESTION_COUNT);
    for (Token token : options.tokens) {
      String tokenText = new String(token.buffer(), 0, token.length());
      String[] suggestions = spellChecker.suggestSimilar(tokenText,
          count,
          field != null ? reader : null, //workaround LUCENE-1295
          field,
          options.onlyMorePopular, theAccuracy);
      if (suggestions.length == 1 && suggestions[0].equals(tokenText)) {
        //These are spelled the same, continue on
        List<String> suggList = Arrays.asList(suggestions); //SD added
        result.add(token, suggList);                        //SD added
        continue;
      }

      if (options.extendedResults == true && reader != null && field != null) {
        term = term.createTerm(tokenText);
        result.add(token, reader.docFreq(term));
        int countLimit = Math.min(options.count, suggestions.length);
        if (countLimit > 0) {
          for (int i = 0; i < countLimit; i++) {
            term = term.createTerm(suggestions[i]);
            result.add(token, suggestions[i], reader.docFreq(term));
          }
        } else if (shardRequest) {
          List<String> suggList = Collections.emptyList();
          result.add(token, suggList);
        }
      } else {
        if (suggestions.length > 0) {
          List<String> suggList = Arrays.asList(suggestions);
          if (suggestions.length > options.count) {
            suggList = suggList.subList(0, options.count);
          }
          result.add(token, suggList);
        } else if (shardRequest) {
          List<String> suggList = Collections.emptyList();
          result.add(token, suggList);
        }
      }
    }
    return result;
  }
}







/**
 * This is a slight modification of Solr's
 * SpellCheckComponent.toNamedList(boolean, SpellingResult, String, boolean, boolean).
 * The modification is designed so this class may work in tandem with
 * SirsiDynixIndexBasedSpellChecker to return misspelled words with no suggestions.
 */
public class SirsiDynixSpellCheckComponent extends SpellCheckComponent {

  @Override
  protected NamedList toNamedList(boolean shardRequest, SpellingResult spellingResult,
      String origQuery, boolean extendedResults, boolean collate) {
    NamedList result = new NamedList();
    Map<Token, LinkedHashMap<String, Integer>> suggestions = spellingResult.getSuggestions();
    boolean hasFreqInfo = spellingResult.hasTokenFrequencyInfo();
    boolean isCorrectlySpelled = false;

    int numSuggestions = 0;
    for (LinkedHashMap<String, Integer> theSuggestion : suggestions.values()) {
      if (theSuggestion.size() > 0) {
        numSuggestions++;
      }
    }

    // will be flipped to false if any of the suggestions are not in the index
    // and hasFreqInfo is true
    if (numSuggestions > 0) {
      isCorrectlySpelled = true;
    }

    for (Map.Entry<Token, LinkedHashMap<String, Integer>> entry : suggestions.entrySet()) {
      Token inputToken = entry.getKey();
      Map<String, Integer> theSuggestions = entry.getValue();
      // SD removed "&& (theSuggestions.size()>0 || shardRequest)".  This is to allow
      // misspelled words with no suggestions.  (theSuggestions.size()>=0 is always
      // true, hence the removal.)
      if (theSuggestions != null) {
        // SD added if block: if this is not a shardRequest and the word is not
        // misspelled, don't add it to the list of misspelled words.
        if (theSuggestions.size() > 0 && !shardRequest
            && theSuggestions.containsKey(
                new String(inputToken.buffer(), 0, inputToken.length()))) {
          continue;
        }
        SimpleOrderedMap suggestionList = new SimpleOrderedMap();
        suggestionList.add("numFound", theSuggestions.size());
        suggestionList.add("startOffset", inputToken.startOffset());
        suggestionList.add("endOffset", inputToken.endOffset());

        // Logical structure of normal (non-extended) results:
        // "suggestion":["alt1","alt2"]
        //
        // Logical structure of the extended results:
        // "suggestion":[
        //     {"word":"alt1","freq":7},
        //     {"word":"alt2","freq":4}
        // ]
        if (extendedResults && hasFreqInfo) {
          suggestionList.add("origFreq", spellingResult.getTokenFrequency(inputToken));

          ArrayList<SimpleOrderedMap> sugs = new ArrayList<SimpleOrderedMap>();
          suggestionList.add("suggestion", sugs);
          for (Map.Entry<String, Integer> suggEntry : theSuggestions.entrySet()) {
            SimpleOrderedMap sugEntry = new SimpleOrderedMap();
            sugEntry.add("word", suggEntry.getKey());
            sugEntry.add("freq", suggEntry.getValue());
            sugs.add(sugEntry);
          }
        } else {
          suggestionList.add("suggestion", theSuggestions.keySet());
        }

        if (hasFreqInfo) {
          isCorrectlySpelled = isCorrectlySpelled
              && spellingResult.getTokenFrequency(inputToken) > 0;
        }
        result.add(new String(inputToken.buffer(), 0, inputToken.length()), suggestionList);
      }
    }
    if (hasFreqInfo) {
      result.add("correctlySpelled", isCorrectlySpelled);
    } else if (extendedResults && suggestions.size() == 0) {
      // if the word is misspelled, it's added to suggestions with freqinfo
      result.add("correctlySpelled", true);
    }
    return result;
  }
}
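
A minimal sketch of how a client might then interpret the top-level (non-shard) response. It assumes the patched component behaves as its comments describe: correctly spelled terms are left out of the suggestions list, and terms with no suggestions come back with a `numFound` of 0. `TermClassifier`, `Status`, and the map of term to `numFound` are hypothetical names introduced for illustration.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical client-side helper; assumes the patched response shape described above.
class TermClassifier {

  enum Status { CORRECT, HAS_SUGGESTIONS, NO_SUGGESTIONS }

  /**
   * numFoundByTerm holds the numFound value of every term that has an entry in
   * the spellcheck response; terms without an entry are absent from the map.
   */
  static Status classify(String term, Map<String, Integer> numFoundByTerm) {
    Integer numFound = numFoundByTerm.get(term);
    if (numFound == null) {
      return Status.CORRECT;          // no entry: treated as spelled correctly
    }
    return numFound > 0 ? Status.HAS_SUGGESTIONS : Status.NO_SUGGESTIONS;
  }

  public static void main(String[] args) {
    // Values as they might appear for the example query; the 0 is the new case.
    Map<String, Integer> numFoundByTerm = new LinkedHashMap<String, Integer>();
    numFoundByTerm.put("kljhklsdjahfkljsdhf", 0);
    numFoundByTerm.put("rck", 5);

    for (String term : new String[] {"kljhklsdjahfkljsdhf", "book", "rck"}) {
      System.out.println(term + " -> " + classify(term, numFoundByTerm));
    }
  }
}
```

The point of the zero-suggestion entry is that 'book' and the garbage term no longer fall into the same bucket.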







Here's the xml we're getting back from the search (before applying the modified 
code):



<?xml version="1.0" encoding="UTF-8"?>
<response>

<lst name="responseHeader">
  <int name="status">0</int>
  <int name="QTime">56</int>
  <lst name="params">
    <str name="spellcheck">true</str>
    <str name="facet">true</str>
    <str name="sort">score desc, RELEVANCE_SORT_nsort desc</str>
    <str name="shards.qt">spellcheckedStandard</str>
    <str name="hl.mergeContiguous">true</str>
    <str name="facet.limit">1000</str>
    <str name="hl">true</str>
    <str name="fl"> ELECTRONIC_ACCESS_display ISBN_display TITLE_boost 
FORMAT_display score MEDIA_TYPE_display AUTHOR_boost LOCALURL_display 
UPC_display id DOC_ID_display CHILD_SITE_display DS_EC PRIMARY_AUTHOR_boost 
PRIMARY_TITLE_boost DS_ID TOPIC_display ASSET_NAME_display OCLC_display</str>
    <str name="shards">localhost:8983/solr/SD_ILS/,localhost:8983/solr/SD_ASSET/</str>
    <arr name="facet.field">
      <str>AUTHOR_facet</str>
      <str>FORMAT_facet</str>
      <str>LANGUAGE_facet</str>
      <str>PUBDATE_nfacet</str>
      <str>SUBJECT_facet</str>
      <str>ABCDEF_cfacet</str>
    </arr>
    <str name="qt">spellcheckedStandard</str>
    <arr name="fq">
      <str>ACCESS_LEVEL_nfacet:"0"</str>
      <str>CLEARANCE_nfacet:"0"</str>
      <str>NEED_TO_KNOWS_facet:"@@EMPTY@@"</str>
      <str>CITIZENSHIPS_facet:"@@EMPTY@@"</str>
      <str>RESTRICTIONS_facet:"@@EMPTY@@"</str>
    </arr>
    <str name="facet.mincount">1</str>
    <str name="indent">true</str>
    <str name="hl.fl">*</str>
    <str name="rows">12</str>
    <str name="hl.snippets">5</str>
    <str name="start">0</str>
    <str name="q">TITLE_boost:"kljhklsdjahfkljsdhf book rck"~100^200.0 OR 
PRIMARY_AUTHOR_boost:"kljhklsdjahfkljsdhf book rck"~100^100.0 OR 
DOC_TEXT:"kljhklsdjahfkljsdhf book rck"~100^2 OR 
PRIMARY_TITLE_boost:"kljhklsdjahfkljsdhf book rck"~100^1000.0 OR 
AUTHOR_boost:"kljhklsdjahfkljsdhf book rck"~100^20.0 OR 
textFuzzy:kljhklsdjahfkljsdhf~0.7 AND textFuzzy:book~0.7 AND 
textFuzzy:rck~0.7</str>
  </lst>
</lst>
<result name="response" numFound="0" start="0" maxScore="0.0"/>
<lst name="facet_counts">
  <lst name="facet_queries"/>
  <lst name="facet_fields">
    <lst name="AUTHOR_facet"/>
    <lst name="FORMAT_facet"/>
    <lst name="LANGUAGE_facet"/>
    <lst name="PUBDATE_nfacet"/>
    <lst name="SUBJECT_facet"/>
    <lst name="ABCDEF_cfacet"/>
  </lst>
  <lst name="facet_dates"/>
  <lst name="facet_ranges"/>
</lst>
<lst name="highlighting"/>
<lst name="spellcheck">
  <lst name="suggestions">
    <lst name="rck">
      <int name="numFound">5</int>
      <int name="startOffset">362</int>
      <int name="endOffset">365</int>
      <int name="origFreq">0</int>
      <arr name="suggestion">
        <lst>
          <str name="word">rock</str>
          <int name="freq">24000</int>
        </lst>
        <lst>
          <str name="word">rick</str>
          <int name="freq">6048</int>
        </lst>
        <lst>
          <str name="word">rack</str>
          <int name="freq">84</int>
        </lst>
        <lst>
          <str name="word">reck</str>
          <int name="freq">78</int>
        </lst>
        <lst>
          <str name="word">ruck</str>
          <int name="freq">30</int>
        </lst>
      </arr>
    </lst>
    <bool name="correctlySpelled">false</bool>
  </lst>
</lst>
</response>







Thanks!


Bob Sandiford | Lead Software Engineer | SirsiDynix
P: 800.288.8020 X6943 | bob.sandif...@sirsidynix.com
www.sirsidynix.com
