I added some docs at: http://wiki.apache.org/solr/SpellCheckComponent and http://wiki.apache.org/solr/FileBasedSpellChecker

More inline.

On Feb 5, 2009, at 4:38 AM, Marcus Stratmann wrote:

Hello,

Are you sending the same query to both? The frequency and word entries are only printed when extendedResults == true. correctlySpelled is only printed when index frequency information is available. The FileBasedSpellChecker has no frequency information, so it isn't returned.

Yes, I am using this request in both cases:
spellcheck?spellcheck=true&spellcheck.dictionary=title&spellcheck.q=gane&q=gane&spellcheck.extendedResults=true

I wasn't able to find any online documentation for FileBasedSpellChecker. Is there any? To get started I used trial and error, and I'm still wondering which format the input file needs to have.

You write that there is no frequency information for FileBasedSpellChecker. Does that mean that every word in the index has the same "weight" (apart from its distance from the word being spell checked)? Then how does spell checking work? Is every word in the index that is close enough (by distance) to the original considered, and the one with the smallest distance returned?

Correct.  The code in the Lucene SpellChecker is:
      // edit distance
      sugWord.score = sd.getDistance(word, sugWord.string);
      if (sugWord.score < min) {
        continue;
      }

      if (ir != null && field != null) { // use the user index
        sugWord.freq = ir.docFreq(new Term(field, sugWord.string)); // freq in the index
        // don't suggest a word that is not present in the field
        if ((morePopular && goalFreq > sugWord.freq) || sugWord.freq < 1) {
          continue;
        }
      }

See http://www.lucidimagination.com/search/?q=spellcheck+issues#/p:solr for some discussion on this. Also see http://issues.apache.org/jira/browse/LUCENE-1532 and https://issues.apache.org/jira/browse/LUCENE-1417
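
To make the filtering concrete, here is a minimal self-contained sketch of the same idea using only java.util. It is not the Lucene source. The class SuggestFilterSketch, the similarity() helper (a stand-in for sd.getDistance()), and the docFreqs map (a stand-in for ir.docFreq()) are all names I made up for illustration. Passing docFreqs == null models the FileBasedSpellChecker case, where there is no IndexReader.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

/** Illustrative only: mimics the filtering step quoted above, not the Lucene source. */
final class SuggestFilterSketch {

    /** Plain Levenshtein edit distance between two strings. */
    static int editDistance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j;
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    /** Similarity in [0, 1], 1.0 meaning identical. Hypothetical stand-in for sd.getDistance(). */
    static float similarity(String a, String b) {
        int maxLen = Math.max(a.length(), b.length());
        return maxLen == 0 ? 1.0f : 1.0f - (float) editDistance(a, b) / maxLen;
    }

    /**
     * Returns the candidates that survive the filter, best similarity first.
     * docFreqs == null means "no index available" (assumption: this models the
     * file-based case); otherwise it maps a word to its document frequency.
     */
    static List<String> suggest(String word, List<String> candidates, float min,
                                Map<String, Integer> docFreqs, boolean morePopular) {
        int goalFreq = (docFreqs != null && morePopular) ? docFreqs.getOrDefault(word, 0) : 0;
        List<String> kept = new ArrayList<>();
        for (String cand : candidates) {
            if (cand.equals(word)) {
                continue; // never suggest the input word itself
            }
            if (similarity(word, cand) < min) {
                continue; // not close enough
            }
            if (docFreqs != null) { // frequency checks only apply when an index exists
                int freq = docFreqs.getOrDefault(cand, 0);
                if ((morePopular && goalFreq > freq) || freq < 1) {
                    continue;
                }
            }
            kept.add(cand);
        }
        // Stable sort: candidates with equal similarity keep their input order.
        kept.sort((x, y) -> Float.compare(similarity(word, y), similarity(word, x)));
        return kept;
    }
}
```

With docFreqs == null the morePopular flag is never consulted, which is the point of the answer above: only the distance decides.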


What effect does spellcheck.onlyMorePopular have when there are no frequencies?

Spellchecking frequency information is retrieved using a passed in IndexReader. Since the FileBasedSC has no IndexReader, no frequency information is available. Thus, onlyMorePopular has no effect, nor do other factors that use the IndexReader (extendedResults). In the SpellCheckComponent, the relevant code is:
        if (extendedResults && hasFreqInfo) {
          suggestionList.add("origFreq", spellingResult.getTokenFrequency(inputToken));
          for (Map.Entry<String, Integer> suggEntry : theSuggestions.entrySet()) {
            SimpleOrderedMap<Object> suggestionItem = new SimpleOrderedMap<Object>();
            suggestionItem.add("frequency", suggEntry.getValue());
            suggestionItem.add("word", suggEntry.getKey());
            suggestionList.add("suggestion", suggestionItem);
          }
        } else {
          suggestionList.add("suggestion", theSuggestions.keySet());
        }


See http://lucene.apache.org/java/2_4_0/api/contrib-spellchecker/org/apache/lucene/search/spell/SpellChecker.html for more details on how the underlying Lucene SpellChecker works.




Sorry if this is answered somewhere in the docs, a link would be enough for me in this case.

The logic for constructing this is all handled in the SpellCheckComponent.toNamedList() method and is completely separated from the individual SpellChecker implementations.

If I understand you correctly, this means that the output is just an "image" of the used data structures? From the developer's view this is very natural, but from the user's view it is annoying to have different output depending on the handler used. Anyway, this is actually no big problem for me, I was just wondering why my parser (used for IndexBasedSpellChecker) didn't work for FileBasedSpellChecker.

It's not an image of the data structure, but a reflection of your choice not to provide frequency information. We can't produce info where there is none. If, however, you want those two items to have the same structure, a patch would be welcome.
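
Until such a patch exists, a client can handle both shapes itself. Below is a rough sketch, assuming the response has already been parsed into plain Java objects: each "suggestion" value is either a Map with "word" and "frequency" keys (the extended form) or a Collection of plain strings (the simple form). The class SuggestionNormalizer and its nested Suggestion type are hypothetical names, and how you get from the XML or JSON response to these objects depends on your client.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.List;
import java.util.Map;

/** Hypothetical client-side helper: folds both "suggestion" shapes into one type. */
final class SuggestionNormalizer {

    /** One suggestion; frequency is null when the spell checker had no frequency info. */
    static final class Suggestion {
        final String word;
        final Integer frequency;

        Suggestion(String word, Integer frequency) {
            this.word = word;
            this.frequency = frequency;
        }
    }

    /**
     * Accepts the parsed values of all "suggestion" entries for one token.
     * Assumption: a Map is the extended form, a Collection is the simple form.
     * Values of any other type are ignored.
     */
    static List<Suggestion> normalize(List<Object> suggestionValues) {
        List<Suggestion> out = new ArrayList<>();
        for (Object value : suggestionValues) {
            if (value instanceof Map) {
                Map<?, ?> item = (Map<?, ?>) value;
                Object freq = item.get("frequency");
                Integer frequency = null;
                if (freq instanceof Number) {
                    frequency = ((Number) freq).intValue();
                }
                out.add(new Suggestion(String.valueOf(item.get("word")), frequency));
            } else if (value instanceof Collection) {
                for (Object word : (Collection<?>) value) {
                    out.add(new Suggestion(String.valueOf(word), null));
                }
            }
        }
        return out;
    }
}
```

A parser written this way should not care whether the dictionary behind it is index based or file based; it just sees a null frequency in the latter case.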

BTW, one workaround is to simply create an index from your file and then use the IndexBasedSpellChecker. Each line equals one document. You could even assign weights that way.
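
One way to try that workaround is to turn the word file into a Solr XML update message, one <doc> per line, and post it to the update handler. The sketch below only builds the XML string. The class WordFileToSolrXml and the field name passed in are assumptions for illustration; the field has to exist in your schema, and if your schema declares a uniqueKey you would need to add that field to each document as well.

```java
import java.util.List;

/** Hypothetical helper: one line of the word file becomes one Solr document. */
final class WordFileToSolrXml {

    /** Escapes the characters that are unsafe in XML text and attribute values. */
    static String escapeXml(String s) {
        return s.replace("&", "&amp;")
                .replace("<", "&lt;")
                .replace(">", "&gt;")
                .replace("\"", "&quot;");
    }

    /** Builds an <add> message; blank lines are skipped and words are trimmed. */
    static String toAddXml(List<String> lines, String fieldName) {
        StringBuilder sb = new StringBuilder("<add>\n");
        for (String line : lines) {
            String word = line.trim();
            if (word.isEmpty()) {
                continue;
            }
            sb.append("  <doc><field name=\"")
              .append(escapeXml(fieldName))
              .append("\">")
              .append(escapeXml(word))
              .append("</field></doc>\n");
        }
        return sb.append("</add>\n").toString();
    }
}
```

The lines can come from java.nio.file.Files.readAllLines() on the dictionary file. After posting the message and committing, point the IndexBasedSpellChecker at that field.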




Thanks,
Marcus

--------------------------
Grant Ingersoll
http://www.lucidimagination.com/

Search the Lucene ecosystem (mail archives, docs, wiki, JIRA) all in one place:
http://www.lucidimagination.com/search









