Re: File based wordlists for spellchecker

Tomasz Wegrzanowski Tue, 15 Nov 2011 10:30:49 -0800

On 15 November 2011 15:55, Dyer, James <james.d...@ingrambook.com> wrote:
> Writing your own spellchecker to do what you propose might be difficult.  At 
> issue is the fact that both the "index-based" and "file-based" spellcheckers 
> are designed to work off a Lucene index and use the document frequency 
> reported by Lucene to base their decisions.  Both spell checkers build a 
> separate Lucene index on the fly to use as a dictionary just for this purpose.


I'm fine with the spellchecker index; it will be small compared with
everything else.

I don't want every original record to have an extra copyField, since those
would probably be prohibitively huge.

> But maybe you don't need to go down that path.  If your original field is not 
> being stemmed or aggressively analyzed, then you can base your spellchecker on 
> the original field, and there is no need to do a <copyField> for a spell 
> check index.  If you have to do a <copyField> for the dictionary due to 
> stemming, etc in the original, you may be pleasantly surprised that the 
> overhead for the copyField is a lot less than you thought.  Be sure to set it 
> as stored=false,indexed=true and omitNorms=true.  I'd recommend trying this 
> before anything else as it just might work.

My original index is stemmed and very aggressively analyzed, so a copyField
would be necessary.

> If you're worried about the size of the dictionary that gets built on the 
> fly, then I would look into possibly upgrading to Trunk/4.0 and using 
> DirectSolrSpellChecker, which does not build a separate dictionary.  If going 
> to Trunk is out of the question, it might be possible for you to have it 
> store your dictionary to a different disk if disk space is your issue.
>
> If you end up writing your own spellchecker, take a look at 
> org.apache.lucene.search.spell.SpellChecker.  You'll need to write a 
> "suggestSimilar" method that does what you want.  Possibly you can store your 
> terms and frequencies in a key/value hash and use that to order the results.  
> You then would need to write a wrapper for Solr, similar to 
> org.apache.solr.spelling.FileBasedSpellChecker.  Like I mentioned, this would 
> be a lot of work and it would take a lot of thought to make it perform well, 
> etc.
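
To make the key/value idea above concrete, here is a rough sketch in plain
Java. It assumes the terms and frequencies are already in memory, and it is
not Lucene or Solr code: the class FrequencyDictionary, its add method and
the maxDist parameter are all hypothetical names chosen for illustration.
Only the method name suggestSimilar is borrowed from
org.apache.lucene.search.spell.SpellChecker. It scans every term, so it
would only be reasonable for a small wordlist.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical in-memory dictionary; not part of Lucene or Solr. */
class FrequencyDictionary {
    // term -> frequency, e.g. loaded from a "word<TAB>freq" file
    private final Map<String, Integer> freqs = new HashMap<>();

    /** Registers a term; repeated calls for the same term add up. */
    void add(String term, int freq) {
        freqs.merge(term, freq, Integer::sum);
    }

    /**
     * Returns up to numSug terms within maxDist edits of word,
     * closest first, then most frequent first, then alphabetical.
     */
    List<String> suggestSimilar(String word, int numSug, int maxDist) {
        Map<String, Integer> dist = new HashMap<>();
        List<String> candidates = new ArrayList<>();
        for (String term : freqs.keySet()) {
            if (term.equals(word)) continue; // never suggest the input itself
            int d = levenshtein(word, term);
            if (d <= maxDist) {
                candidates.add(term);
                dist.put(term, d);
            }
        }
        candidates.sort((a, b) -> {
            int byDist = Integer.compare(dist.get(a), dist.get(b));
            if (byDist != 0) return byDist;
            int byFreq = Integer.compare(freqs.get(b), freqs.get(a)); // higher frequency first
            if (byFreq != 0) return byFreq;
            return a.compareTo(b);
        });
        return new ArrayList<>(candidates.subList(0, Math.min(numSug, candidates.size())));
    }

    /** Classic two-row dynamic-programming edit distance. */
    static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) prev[j] = j;
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            int[] tmp = prev; prev = curr; curr = tmp; // reuse the two rows
        }
        return prev[b.length()];
    }
}
```

A Solr wrapper along the lines of FileBasedSpellChecker would still be
needed on top of this; the sketch covers only the lookup and ordering.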

Doesn't IndexBasedSpellChecker simply extract (word, freq) pairs from the index,
put them into spellcheckingIndex, and forget about the index altogether?

If so, then I'd only need to override index building, and reuse that.

Am I correct here, or does it actually go back to the original index?
