I added some docs at: http://wiki.apache.org/solr/SpellCheckComponent
and http://wiki.apache.org/solr/FileBasedSpellChecker
More inline.
On Feb 5, 2009, at 4:38 AM, Marcus Stratmann wrote:
Hello,
Are you sending in the same query to both? Frequency and word only
get printed when extendedResults == true. correctlySpelled only
gets printed when there is Index frequency information. For the
FileBasedSpellChecker, there is no Frequency information, so it
isn't returned.
Yes, I am using this request in both cases:
spellcheck?spellcheck=true&spellcheck.dictionary=title&spellcheck.q=gane&q=gane&spellcheck.extendedResults=true
Concerning FileBasedSpellChecker, I wasn't able to find any online
documentation. Is there any? To start, I was using "trial and
error". I'm still wondering which format the input file needs to have.
You write that there is no frequency information for
FileBasedSpellChecker. Does that mean that every word in the index
has the same "weight" (besides the distance from the word being
spell checked)? Then how does spelling work? Every word in the index
that is close enough (distance) to the original is considered and
the one with the smallest distance is returned?
Correct. The code in the Lucene SpellChecker is:
    // edit distance
    sugWord.score = sd.getDistance(word, sugWord.string);
    if (sugWord.score < min) {
      continue;
    }

    if (ir != null && field != null) { // use the user index
      sugWord.freq = ir.docFreq(new Term(field, sugWord.string)); // freq in the index
      // don't suggest a word that is not present in the field
      if ((morePopular && goalFreq > sugWord.freq) || sugWord.freq < 1) {
        continue;
      }
    }
See http://www.lucidimagination.com/search/?q=spellcheck+issues#/p:solr
for some discussion on this. Also see
http://issues.apache.org/jira/browse/LUCENE-1532
and https://issues.apache.org/jira/browse/LUCENE-1417
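
To make the no-frequency case concrete, here is a minimal, self-contained
sketch of ranking by similarity alone. It is not the Lucene code: the class
name, the suggest() method and the 0.5 threshold are made up for
illustration, and the score is assumed to be a normalized Levenshtein
similarity between 0 and 1, where higher means closer (which is why the
snippet above skips candidates whose score is below min).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration only; not part of Lucene or Solr.
public class DistanceOnlySuggester {

    // Plain Levenshtein edit distance (insert, delete, substitute all cost 1).
    static int editDistance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j;
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    // Assumed scoring: 1.0 for identical strings, 0.0 for maximally different ones.
    static float similarity(String a, String b) {
        int maxLen = Math.max(a.length(), b.length());
        if (maxLen == 0) {
            return 1.0f;
        }
        return 1.0f - (float) editDistance(a, b) / maxLen;
    }

    // With no frequency data, the only filter is the similarity threshold
    // and the only ordering is by similarity, best first.
    static List<String> suggest(String word, List<String> dictionary, float min, int count) {
        List<String> candidates = new ArrayList<>();
        for (String entry : dictionary) {
            if (entry.equals(word)) {
                continue; // never suggest the input itself
            }
            if (similarity(word, entry) < min) {
                continue; // too far away
            }
            candidates.add(entry);
        }
        // Stable sort: ties keep their dictionary order.
        candidates.sort((x, y) -> Float.compare(similarity(word, y), similarity(word, x)));
        return candidates.subList(0, Math.min(count, candidates.size()));
    }

    public static void main(String[] args) {
        List<String> dictionary = Arrays.asList("game", "gain", "garden", "gone", "lane");
        System.out.println(suggest("gane", dictionary, 0.5f, 3));
    }
}
```

Every entry that clears the threshold is a candidate, and entries with the
same score are indistinguishable without frequency data, so ties here simply
fall back to dictionary order.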
What effect does spellcheck.onlyMorePopular have when there are no
frequencies?
Spellchecking frequency information is retrieved using a passed in
IndexReader. Since the FileBasedSC has no IndexReader, no frequency
information is available. Thus, onlyMorePopular has no effect, nor do
other factors that use the IndexReader (extendedResults). In the
SpellCheckComponent, the relevant code is:
    if (extendedResults && hasFreqInfo) {
      suggestionList.add("origFreq", spellingResult.getTokenFrequency(inputToken));
      for (Map.Entry<String, Integer> suggEntry : theSuggestions.entrySet()) {
        SimpleOrderedMap<Object> suggestionItem = new SimpleOrderedMap<Object>();
        suggestionItem.add("frequency", suggEntry.getValue());
        suggestionItem.add("word", suggEntry.getKey());
        suggestionList.add("suggestion", suggestionItem);
      }
    } else {
      suggestionList.add("suggestion", theSuggestions.keySet());
    }
See http://lucene.apache.org/java/2_4_0/api/contrib-spellchecker/org/apache/lucene/search/spell/SpellChecker.html
for more details on how the underlying Lucene SC works.
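
On the client side, a parser can cope with both branches of that code by
checking what it finds under "suggestion". The following is only a sketch:
it assumes the response has already been decoded into plain Java
collections (a Map per extended entry, a collection of strings otherwise),
and the class and method names are hypothetical, not SolrJ API.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collection;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical client-side helper; not part of Solr or SolrJ.
public class SuggestionShapes {

    // Collects the suggested words from the values stored under "suggestion".
    // Extended form: one map per suggestion with "frequency" and "word".
    // Simple form: a single collection holding just the words.
    static List<String> words(List<Object> suggestionValues) {
        List<String> out = new ArrayList<>();
        for (Object value : suggestionValues) {
            if (value instanceof Map) {
                Object word = ((Map<?, ?>) value).get("word");
                if (word != null) {
                    out.add(word.toString());
                }
            } else if (value instanceof Collection) {
                for (Object word : (Collection<?>) value) {
                    out.add(String.valueOf(word));
                }
            } else if (value != null) {
                out.add(value.toString()); // a bare word
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Extended shape, as produced when frequency information exists.
        Map<String, Object> item = new LinkedHashMap<>();
        item.put("frequency", 12);
        item.put("word", "game");
        List<Object> extended = new ArrayList<>();
        extended.add(item);

        // Simple shape, as produced when there is no frequency information.
        List<Object> simple = new ArrayList<>();
        simple.add(Arrays.asList("game", "gain"));

        System.out.println(words(extended));
        System.out.println(words(simple));
    }
}
```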
Sorry if this is answered somewhere in the docs, a link would be
enough for me in this case.
The logic for constructing this is all handled in the
SpellCheckComponent.toNamedList() method and is completely
separated from the individual SpellChecker implementations.
If I understand you correctly, this means that the output is just an
"image" of the used data structures? From the developer's view this
is very natural, but from the user's view it is annoying to have
different output depending on the handler used. Anyway, this is
actually no big problem for me, I was just wondering why my parser
(used for IndexBasedSpellChecker) didn't work for
FileBasedSpellChecker.
It's not an image of the data structure, but a reflection of your
choice not to provide frequency information. We can't produce info
where there is none. If, however, you want those two items to have
the same structure, a patch would be welcome.
BTW, one workaround is to simply create an index from your file and
then use the IndexBasedSpellChecker. Each line equals one document.
You could even assign weights that way.
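
One way to get there, sketched under assumptions: turn the word file into a
Solr XML <add> message with one <doc> per line and post that to your update
handler. The field name "word" is an assumption (use whatever field your
schema and IndexBasedSpellChecker config point at, and add any required
fields such as a unique key), and the class and method names below are
hypothetical.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper; not part of Solr.
public class WordsToSolrXml {

    // Minimal escaping for XML element text.
    static String escape(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
    }

    // Builds an <add> message with one <doc> per non-blank line.
    // fieldName is assumed to be a plain identifier that needs no escaping.
    static String toAddXml(List<String> lines, String fieldName) {
        StringBuilder sb = new StringBuilder("<add>\n");
        for (String line : lines) {
            String word = line.trim();
            if (word.isEmpty()) {
                continue; // skip blank lines
            }
            sb.append("  <doc><field name=\"").append(fieldName).append("\">")
              .append(escape(word))
              .append("</field></doc>\n");
        }
        return sb.append("</add>\n").toString();
    }

    public static void main(String[] args) {
        // In practice the lines would come from the dictionary file.
        System.out.print(toAddXml(Arrays.asList("game", "gain", "", "R&D"), "word"));
    }
}
```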
Thanks,
Marcus
--------------------------
Grant Ingersoll
http://www.lucidimagination.com/
Search the Lucene ecosystem (mail archives, docs, wiki, JIRA) all in
one place:
http://www.lucidimagination.com/search