Re: Search for misspelled words in corpus

Otis Gospodnetic Sat, 08 Jun 2013 23:49:51 -0700

Hm, I was purposely avoiding mentioning ngrams because just ngramming
all indexed tokens would balloon the index.... My assumption was that
only *some* words are misspelled, in which case it may be better not
to ngram all tokens....
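
A minimal sketch of that alternative, i.e. the index-time word expansion
quoted below, applied only to tokens that are not already in the
dictionary. Everything here is an assumption for illustration: the names
osa_distance and expand_token are hypothetical, the dictionary is a plain
Python set, and "close enough" is taken to mean an optimal string
alignment distance of at most 1 (one insert, delete, substitution, or
adjacent transposition). It is not Solr or Lucene code.

```python
def osa_distance(a, b):
    """Optimal string alignment distance: edits plus adjacent transpositions."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i
    for j in range(cols):
        d[0][j] = j
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution or match
            # Adjacent transposition, e.g. "ht" <-> "th".
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[-1][-1]


def expand_token(token, dictionary, max_distance=1):
    """Return the terms to index for one token.

    Assumed heuristic: a token found in the dictionary is treated as
    correctly spelled and is not expanded; any other token is indexed
    together with every dictionary word within max_distance of it.
    """
    token = token.lower()
    if token in dictionary:
        return [token]
    close = sorted(w for w in dictionary if osa_distance(token, w) <= max_distance)
    return [token] + close
```

With a dictionary of {"fight", "sight", "light"}, the token "figth"
expands to ["figth", "fight"], while "sight" is left alone because it is
already a correctly spelled word.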


Otis
--
Solr & ElasticSearch Support
http://sematext.com/





On Sun, Jun 9, 2013 at 2:30 AM, Jagdish Nomula <jagd...@simplyhired.com> wrote:
> Another theoretical answer to this question is the n-gram approach. You can
> index each word along with its trigrams, then query the index by the string
> as well as its trigrams, with a % match search. You then pass the exhaustive
> result set through a more expensive scoring function such as Smith-Waterman.
>
> Thanks,
>
> Jagdish
>
>
> On Sat, Jun 8, 2013 at 11:03 PM, Shashi Kant <sk...@sloan.mit.edu> wrote:
>
>> n-grams might help, followed by an edit distance metric such as Jaro-Winkler
>> or Smith-Waterman-Gotoh to filter the results further.
>>
>>
>> On Sun, Jun 9, 2013 at 1:59 AM, Otis Gospodnetic <otis.gospodne...@gmail.com> wrote:
>>
>> > Interesting problem.  The first thing that comes to mind is to do
>> > "word expansion" during indexing.  Kind of like synonym expansion, but
>> > maybe a bit more dynamic. If you can have a dictionary of correctly
>> > spelled words, then for each token emitted by the tokenizer you could
>> > look up the dictionary and expand the token to all other words that
>> > are similar/close enough.  This would not be super fast, and you'd
>> > likely have to add some custom heuristic for figuring out what
>> > "similar/close enough" means, but it might work.
>> >
>> > I'd love to hear other ideas...
>> >
>> > Otis
>> > --
>> > Solr & ElasticSearch Support
>> > http://sematext.com/
>> >
>> >
>> >
>> >
>> >
>> > On Wed, Jun 5, 2013 at 9:10 AM, కామేశ్వర రావు భైరవభట్ల
>> > <kamesh...@gmail.com> wrote:
>> > > Hi,
>> > >
>> > > I have a problem where the text corpus we need to search contains
>> > > many misspelled words. The same word could also be misspelled in
>> > > several different ways. The corpus could also have documents with the
>> > > correct spellings. However, the search term that we give in the query
>> > > would always be spelled correctly. Now when we search on a term, we
>> > > would like to get all the documents that contain both correct and
>> > > misspelled forms of the search term.
>> > > We tried fuzzy search, but it doesn't work as per our expectations. It
>> > > returns any close match, not specifically misspelled words. For
>> > > example, if I'm searching for a word like "fight", I would like to
>> > > return the documents that have words like "figth" and "feight", not
>> > > documents with words like "sight" and "light".
>> > > Is there any suggested approach for doing this?
>> > >
>> > > regards,
>> > > Kamesh
>> >
>>
>
>
>
> --
> Jagdish Nomula
> Sr. Manager Search
> Simply Hired, Inc.
> 370 San Aleso Ave., Ste 200
> Sunnyvale, CA 94085
>
> office - 408.400.4700
> cell - 408.431.2916
> email - jagd...@simplyhired.com <yourem...@simplyhired.com>
>
> www.simplyhired.com
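
For comparison, the trigram approach quoted above can be sketched as two
stages: a cheap trigram-overlap filter, then a more expensive alignment
score on the survivors. This is only an illustration under stated
assumptions: the function names, the "$" padding, the 0.4 overlap cutoff,
and the match/mismatch/gap weights are all made up for the example, and
the in-memory dict stands in for a real index.

```python
from collections import defaultdict


def trigrams(word):
    """Return the set of character trigrams of a word, padded with '$'."""
    padded = "$$" + word.lower() + "$$"
    return {padded[i:i + 3] for i in range(len(padded) - 2)}


def build_index(vocabulary):
    """Map each trigram to the set of vocabulary words containing it."""
    index = defaultdict(set)
    for word in vocabulary:
        for gram in trigrams(word):
            index[gram].add(word)
    return index


def candidates(query, index, min_overlap=0.4):
    """Return words sharing at least min_overlap of the query's trigrams."""
    grams = trigrams(query)
    hits = defaultdict(int)
    for gram in grams:
        for word in index.get(gram, ()):
            hits[word] += 1
    return {w for w, n in hits.items() if n / len(grams) >= min_overlap}


def smith_waterman(a, b, match=2, mismatch=-1, gap=-1):
    """Best local alignment score between a and b (linear gap penalty)."""
    prev = [0] * (len(b) + 1)
    best = 0
    for ca in a:
        cur = [0]
        for j, cb in enumerate(b, start=1):
            score = max(0,
                        prev[j - 1] + (match if ca == cb else mismatch),
                        prev[j] + gap,
                        cur[j - 1] + gap)
            cur.append(score)
            best = max(best, score)
        prev = cur
    return best


def search(query, index):
    """Filter by trigram overlap, then rank survivors by alignment score."""
    scored = [(smith_waterman(query, w), w) for w in candidates(query, index)]
    return sorted(scored, key=lambda pair: (-pair[0], pair[1]))
```

For the query "fight" with these illustrative parameters, the trigram
stage lets "figth", "feight", "sight" and "light" all through, and the
alignment score for "sight" (8) is higher than for "figth" (7), so the
scoring function and cutoffs would need tuning before this separates
misspellings from other real words.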
