Hapax legomena (terms with a DF of 1) are very often typos. You can automatically build a stopword file from these. If you want to be picky, you can include only those words that are within a very small edit distance of words with a much larger DF.
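
A rough sketch of that idea in Python, in case it helps. Everything here is an assumption made for illustration: the function names are made up, the tokenizer is plain whitespace splitting, and the thresholds (`min_common_df`, `max_distance`) are arbitrary. In practice you would pull the terms and their DF from the index itself rather than from raw strings.

```python
from collections import Counter


def document_frequencies(docs):
    """Count, for each term, the number of documents that contain it (DF)."""
    df = Counter()
    for doc in docs:
        # set() so a term repeated inside one document is counted once.
        df.update(set(doc.lower().split()))
    return df


def edit_distance(a, b):
    """Plain Levenshtein distance (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]


def likely_typos(df, min_common_df=3, max_distance=1):
    """Terms with DF == 1 that sit close to a term with a much larger DF."""
    common = [t for t, n in df.items() if n >= min_common_df]
    rare = [t for t, n in df.items() if n == 1]
    return sorted(t for t in rare
                  if any(edit_distance(t, c) <= max_distance for c in common))


def write_stopwords(path, terms):
    """Write one term per line, the usual layout for a stopword file."""
    with open(path, "w", encoding="utf-8") as f:
        for term in terms:
            f.write(term + "\n")
```

Note that correctly spelled rare terms with no near neighbour are left alone, which is the point of the "picky" variant. Comparing every rare term against every common term is quadratic, so on a large index you would want something smarter than this brute-force loop.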

----- Original Message -----
| From: "Robert Muir" <rcm...@gmail.com>
| To: solr-user@lucene.apache.org
| Sent: Wednesday, October 10, 2012 5:40:23 PM
| Subject: Re: Using additional dictionary with DirectSolrSpellChecker
|
| On Wed, Oct 10, 2012 at 9:02 AM, O. Klein <kl...@octoweb.nl> wrote:
| > I don't want to tweak the threshold. For majority of cases it works
| > fine.
| >
| > It's for cases where term has low frequency but is spelled
| > correctly.
| >
| > If you lower the threshold you would also get incorrect spelled
| > terms as
| > suggestions.
| >
|
| Yeah there is no real magic here when the corpus contains typos. this
| existing docFreq heuristic was just borrowed from the old index-based
| spellchecker.
|
| I do wonder if using # of occurrences (totalTermFreq) instead of # of
| documents with the term (docFreq) would improve the heuristic.
|
| In all cases I think if you want to also integrate a dictionary or
| something, it seems like this could somehow be done with the
| File-based spellchecker?
|