Re: Differences in output of spell checkers

Grant Ingersoll Thu, 05 Feb 2009 05:24:04 -0800

I added some docs at: http://wiki.apache.org/solr/SpellCheckComponent and http://wiki.apache.org/solr/FileBasedSpellChecker


More inline.


On Feb 5, 2009, at 4:38 AM, Marcus Stratmann wrote:

Hello,
Are you sending in the same query to both? Frequency and word only get printed when extendedResults == true. correctlySpelled only gets printed when there is Index frequency information. For the FileBasedSpellChecker, there is no Frequency information, so it isn't returned.
Yes, I am using this request in both cases:
spellcheck?spellcheck=true&spellcheck.dictionary=title&spellcheck.q=gane&q=gane&spellcheck.extendedResults=true
Concerning FileBasedSpellChecker I wasn't able to find any online documentation, is there any? For the start I was using "trial and error". I'm still wondering which format the input file needs to have.
You write that there is no frequency information for FileBasedSpellChecker. Does that mean that every word in the index has the same "weight" (besides the distance from the word being spell checked)? Then how does spelling work? Every word in the index that is close enough (distance) to the original is considered and the one with the smallest distance is returned?


Correct.  The code in the Lucene SpellChecker is:
      // edit distance
      sugWord.score = sd.getDistance(word,sugWord.string);
      if (sugWord.score < min) {
        continue;
      }

      if (ir != null && field != null) { // use the user index
        sugWord.freq = ir.docFreq(new Term(field,sugWord.string)); // freq in the index
        // don't suggest a word that is not present in the field
        if ((morePopular && goalFreq > sugWord.freq) || sugWord.freq < 1) {
          continue;
        }
      }

See http://www.lucidimagination.com/search/?q=spellcheck+issues#/p:solr for some discussion on this. Also see http://issues.apache.org/jira/browse/LUCENE-1532 and https://issues.apache.org/jira/browse/LUCENE-1417
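
To make the effect of that snippet concrete, here is a minimal, self-contained sketch of the same filtering logic. It is not the Lucene implementation: the class name SuggestSketch, the plain Levenshtein similarity, and the use of a Map in place of an IndexReader are all assumptions made for illustration. Passing null for the frequency map models the file-based case, where only the distance decides.

```java
import java.util.Collection;
import java.util.Map;

public class SuggestSketch {

    // Plain Levenshtein edit distance using two rolling rows.
    static int editDistance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] cur = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j;
        }
        for (int i = 1; i <= a.length(); i++) {
            cur[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                cur[j] = Math.min(Math.min(cur[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = cur;
            cur = tmp;
        }
        return prev[b.length()];
    }

    // Similarity in [0, 1]; 1.0 means identical. Assumed stand-in for sd.getDistance().
    static float similarity(String a, String b) {
        int max = Math.max(a.length(), b.length());
        return max == 0 ? 1.0f : 1.0f - (float) editDistance(a, b) / max;
    }

    // freqs == null models a checker without index frequency information.
    static String suggest(String word, Collection<String> dictionary,
                          Map<String, Integer> freqs, boolean morePopular, float min) {
        int goalFreq = (freqs != null && morePopular) ? freqs.getOrDefault(word, 0) : 0;
        String best = null;
        float bestScore = -1f;
        for (String candidate : dictionary) {
            if (candidate.equals(word)) {
                continue; // never suggest the input itself
            }
            float score = similarity(word, candidate);
            if (score < min) {
                continue; // too far away
            }
            if (freqs != null) { // frequency filter only applies when frequencies exist
                int freq = freqs.getOrDefault(candidate, 0);
                if ((morePopular && goalFreq > freq) || freq < 1) {
                    continue;
                }
            }
            if (score > bestScore) { // the first candidate wins ties
                bestScore = score;
                best = candidate;
            }
        }
        return best;
    }
}
```

With a null frequency map, the morePopular flag changes nothing, because the frequency filter is never entered.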

What effect does spellcheck.onlyMorePopular have when there are no frequencies?

Spellchecking frequency information is retrieved using a passed-in IndexReader. Since the FileBasedSC has no IndexReader, no frequency information is available. Thus, onlyMorePopular has no effect, nor do other factors that use the IndexReader (extendedResults). In the SpellCheckComponent, the relevant code is:

        if (extendedResults && hasFreqInfo) {
          suggestionList.add("origFreq", spellingResult.getTokenFrequency(inputToken));
          for (Map.Entry<String, Integer> suggEntry : theSuggestions.entrySet()) {
            SimpleOrderedMap<Object> suggestionItem = new SimpleOrderedMap<Object>();
            suggestionItem.add("frequency", suggEntry.getValue());
            suggestionItem.add("word", suggEntry.getKey());
            suggestionList.add("suggestion", suggestionItem);
          }
        } else {
          suggestionList.add("suggestion", theSuggestions.keySet());
        }

See http://lucene.apache.org/java/2_4_0/api/contrib-spellchecker/org/apache/lucene/search/spell/SpellChecker.html for more details on how the underlying Lucene SC works.

Sorry if this is answered somewhere in the docs, a link would be enough for me in this case.
The logic for constructing this is all handled in the SpellCheckComponent.toNamedList() method and is completely separated from the individual SpellChecker implementations.
If I understand you correctly, this means that the output is just an "image" of the used data structures? From the developer's view this is very natural, but from the user's view it is annoying to have different output depending on the handler used. Anyway, this is actually no big problem for me, I was just wondering why my parser (used for IndexBasedSpellChecker) didn't work for FileBasedSpellChecker.

It's not an image of the data structure, but a reflection of your choice not to provide frequency information. We can't produce info where there is none. If, however, you want those two items to have the same structure, a patch would be welcome.
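
Until then, a client-side parser can be written to tolerate both shapes. The following is only a sketch: it assumes the response has already been parsed into plain java.util Maps, Collections and Strings (how that happens depends on the response writer and client you use), and the class name SuggestionNormalizer is made up for illustration.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.List;
import java.util.Map;

public class SuggestionNormalizer {

    /**
     * Collects the suggested words from a "suggestion" value, whether it is
     * a map carrying "word" and "frequency" (extended results with frequency
     * info), a collection of plain words, or a single plain word.
     */
    static List<String> words(Object suggestion) {
        List<String> out = new ArrayList<String>();
        if (suggestion instanceof Map) {
            Object word = ((Map<?, ?>) suggestion).get("word");
            if (word != null) {
                out.add(word.toString());
            }
        } else if (suggestion instanceof Collection) {
            for (Object item : (Collection<?>) suggestion) {
                out.addAll(words(item)); // items may themselves be maps or strings
            }
        } else if (suggestion != null) {
            out.add(suggestion.toString());
        }
        return out;
    }
}
```

The frequency is simply ignored here; a parser that needs it would have to treat it as optional.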

BTW, one workaround is to simply create an index from your file and then use the IndexBasedSpellChecker. Each line equals one document. You could even assign weights that way.
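
As a rough sketch of that workaround, the helper below turns the lines of a word list into a Solr XML <add> message with one document per line. The class name WordListToSolrXml and the field name passed in are hypothetical; the field has to exist in your schema, and your schema may require further fields (for example a unique key) that are not shown here.

```java
import java.util.List;

public class WordListToSolrXml {

    // Minimal escaping for XML element text.
    static String escape(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
    }

    // Builds an <add> message with one <doc> per non-empty line.
    static String toAddXml(List<String> lines, String fieldName) {
        StringBuilder sb = new StringBuilder("<add>\n");
        for (String line : lines) {
            String word = line.trim();
            if (word.isEmpty()) {
                continue; // skip blank lines in the word list
            }
            sb.append("  <doc><field name=\"").append(fieldName).append("\">")
              .append(escape(word))
              .append("</field></doc>\n");
        }
        return sb.append("</add>\n").toString();
    }
}
```

The resulting message can be posted to the update handler and the spell checker then pointed at that field.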



Thanks,
Marcus


--------------------------
Grant Ingersoll
http://www.lucidimagination.com/

Search the Lucene ecosystem (mail archives, docs, wiki, JIRA) all in one place:

http://www.lucidimagination.com/search
