Hapax legomena (terms with a DF of 1) are very often typos. You can
automatically build a stopword file from these. If you want to be picky, you
can include only those terms that are within a small edit distance of words
with a much larger DF.
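
A rough sketch of that idea in Python, in case it helps. It assumes you have
already exported term/docFreq pairs from the index into a dict; the function
names and the min_df / max_distance thresholds are made up for illustration
and would need tuning for a real corpus. It uses plain Levenshtein distance,
so a transposition counts as two edits.

```python
def edit_distance(a: str, b: str) -> int:
    """Plain Levenshtein distance (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(
                prev[j] + 1,               # delete ca
                cur[j - 1] + 1,            # insert cb
                prev[j - 1] + (ca != cb),  # substitute (free if equal)
            ))
        prev = cur
    return prev[-1]


def likely_typos(doc_freq, min_df=50, max_distance=1):
    """Return DF=1 terms that sit close to some term with DF >= min_df.

    doc_freq maps term -> document frequency. Both thresholds are
    arbitrary illustrative defaults, not recommended values.
    """
    common = [t for t, df in doc_freq.items() if df >= min_df]
    typos = []
    for term, df in doc_freq.items():
        if df != 1:
            continue
        for c in common:
            # Cheap length check first; it is a lower bound on the distance.
            if abs(len(term) - len(c)) > max_distance:
                continue
            if edit_distance(term, c) <= max_distance:
                typos.append(term)
                break
    return sorted(typos)


def write_stopwords(path, terms):
    """Write one term per line, the layout a stopword file expects."""
    with open(path, "w", encoding="utf-8") as f:
        for term in terms:
            f.write(term + "\n")
```

With doc_freq = {"solr": 100, "solrr": 1, "zymurgy": 1}, only "solrr" is
flagged; "zymurgy" is rare but not near any common term, so it stays out of
the stopword file.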

----- Original Message -----
| From: "Robert Muir" <rcm...@gmail.com>
| To: solr-user@lucene.apache.org
| Sent: Wednesday, October 10, 2012 5:40:23 PM
| Subject: Re: Using additional dictionary with DirectSolrSpellChecker
| 
| On Wed, Oct 10, 2012 at 9:02 AM, O. Klein <kl...@octoweb.nl> wrote:
| > I don't want to tweak the threshold. For the majority of cases it
| > works fine.
| >
| > It's for cases where a term has low frequency but is spelled
| > correctly.
| >
| > If you lower the threshold you would also get incorrectly spelled
| > terms as suggestions.
| >
| 
| Yeah, there is no real magic here when the corpus contains typos. This
| existing docFreq heuristic was just borrowed from the old index-based
| spellchecker.
| 
| I do wonder if using # of occurrences (totalTermFreq) instead of # of
| documents with the term (docFreq) would improve the heuristic.
| 
| In all cases I think if you want to also integrate a dictionary or
| something, it seems like this could somehow be done with the
| File-based spellchecker?
| 
