Re: Fastest way to import a giant word list into Solr/Lucene?

Walter Underwood Fri, 30 Oct 2015 17:03:07 -0700

Dedicated spell-checkers have better algorithms than Solr. They usually handle 
transposed characters as well as inserted, deleted, or substituted characters. 
This enhanced version of Levenshtein distance is called Damerau-Levenshtein, 
and it is too expensive to use in Solr search. Spell correctors can also use 
an edit distance greater than 2, unlike Solr.
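
To make the transposition point concrete, here is a rough Python sketch of the
restricted form of Damerau-Levenshtein (often called optimal string alignment).
It is only an illustration, not what any particular spell-checker ships, and
the function name osa_distance is made up for this example.

```python
def osa_distance(a: str, b: str) -> int:
    """Optimal string alignment distance (restricted Damerau-Levenshtein).

    Counts insertions, deletions, substitutions, and swaps of two
    adjacent characters, each at a cost of 1.
    """
    rows, cols = len(a) + 1, len(b) + 1
    # d[i][j] = distance between the first i chars of a and first j chars of b
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i  # delete all i characters
    for j in range(cols):
        d[0][j] = j  # insert all j characters

    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution (or match)
            )
            # Adjacent transposition: "eh" <-> "he"
            if (i > 1 and j > 1
                    and a[i - 1] == b[j - 2]
                    and a[i - 2] == b[j - 1]):
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[rows - 1][cols - 1]


if __name__ == "__main__":
    # A swapped pair costs 1 here; plain Levenshtein would charge 2.
    print(osa_distance("teh", "the"))
```

The table is O(len(a) * len(b)) per comparison, which is fine for checking one
word against a handful of candidates but adds up quickly if run against every
term in a large list.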


The Peter Norvig corrector also handles words that have been run together. The 
Norvig corrector has been ported to many different programming languages.

The Norvig corrector is an interesting approach. It is well worth reading this 
short article to learn more about spelling correction. 

http://norvig.com/spell-correct.html
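
For a flavor of the approach, here is a minimal sketch in the spirit of that
article: generate every string one edit away from the input and keep the ones
that appear in the word list. The three-word vocabulary and the suggest()
helper are placeholders I made up for illustration. The article also ranks
candidates by word frequency, which this sketch leaves out.

```python
import string


def edits1(word, letters=string.ascii_lowercase):
    """All strings one edit (delete, transpose, replace, insert) from word."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [left + right[1:] for left, right in splits if right]
    transposes = [left + right[1] + right[0] + right[2:]
                  for left, right in splits if len(right) > 1]
    replaces = [left + c + right[1:]
                for left, right in splits if right for c in letters]
    inserts = [left + c + right for left, right in splits for c in letters]
    return set(deletes + transposes + replaces + inserts)


def suggest(word, vocabulary):
    """Known words at distance 0, else distance 1, else distance 2.

    Hypothetical helper: returns an unranked set of candidates.
    """
    if word in vocabulary:
        return {word}
    one_edit = edits1(word) & vocabulary
    if one_edit:
        return one_edit
    # Two edits: apply edits1 again to every one-edit candidate.
    return {e2 for e1 in edits1(word) for e2 in edits1(e1)} & vocabulary


if __name__ == "__main__":
    # Placeholder vocabulary standing in for the industry word list.
    vocabulary = {"solr", "lucene", "index"}
    print(suggest("luecne", vocabulary))
```

Because the lookup is a set intersection, the cost depends on the length of
the query word rather than on the size of the word list, which is why this
style of corrector stays fast even with a very large dictionary.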

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)

> On Oct 30, 2015, at 4:45 PM, Robert Oschler <robert.osch...@gmail.com> wrote:
> 
> Hello Walter and Mikhail,
> 
> Thank you for your answers.  Do those spell checkers have the same or
> better fuzzy matching capability than Solr/Lucene has (Levenshtein, max
> distance 2)?  That's a critical requirement for my application.  I take it
> from your suggestion of these spell checker apps that they can easily be
> extended with a user-defined, supplementary dictionary, yes?
> 
> Thanks.
> 
> On Fri, Oct 30, 2015 at 3:07 PM, Mikhail Khludnev <
> mkhlud...@griddynamics.com> wrote:
> 
>> Perhaps
>> FileBasedSpellChecker
>> https://cwiki.apache.org/confluence/display/solr/Spell+Checking
>> 
>> On Fri, Oct 30, 2015 at 9:37 PM, Robert Oschler <robert.osch...@gmail.com>
>> wrote:
>> 
>>> Hello everyone,
>>> 
>>> I have a gigantic list of industry terms that I want to import into a
>>> Solr/Lucene instance running on an AWS box.  What is the fastest way to
>>> import the list into my Solr/Lucene instance?  I have admin/sudo
>>> privileges on the box.
>>> 
>>> Also, is there a document that shows me how to set up my Solr/Lucene
>>> config file to be optimized for fast searches on single word entries
>>> using fuzzy search?  I intend to use this Solr/Lucene instance to do
>>> spell checking on the big industry word list I mentioned above.  Each
>>> data record will be a single word from the file.  I'll want to take a
>>> single word query and do a fuzzy search on the word against the index
>>> (Levenshtein, max distance 2 as per Solr/Lucene's fuzzy search
>>> feature).  So what parameters will configure Solr/Lucene to be
>>> optimized for such a search?  Also, if a document shows the best
>>> index/read parameters to support single word fuzzy searching then that
>>> would be a big help too.  Note, the contents of the index will change
>>> very infrequently if that affects the optimal parameter mix.
>>> 
>>> 
>>> --
>>> Thanks,
>>> Robert Oschler
>>> Twitter -> http://twitter.com/roschler
>>> http://www.RobotsRule.com/
>>> http://www.Robodance.com/
>>> 
>> 
>> 
>> 
>> --
>> Sincerely yours
>> Mikhail Khludnev
>> Principal Engineer,
>> Grid Dynamics
>> 
>> <http://www.griddynamics.com>
>> <mkhlud...@griddynamics.com>
>> 
> 
> 
> 
> -- 
> Thanks,
> Robert Oschler
> Twitter -> http://twitter.com/roschler
> http://www.RobotsRule.com/
> http://www.Robodance.com/
