Re: Search for misspelled words in corpus

Jagdish Nomula Sun, 09 Jun 2013 10:12:36 -0700

Ngrams will definitely increase the index size. But the increase might not
be very large: with a 26-letter alphabet there are at most 26^3 = 17,576
distinct trigrams, and we are only storing a doc list with each ngram.
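
To make the size argument concrete, here is a minimal sketch in plain Python
(not Lucene/Solr code) of a trigram inverted index with a "% match" query.
It assumes whitespace tokenization and lowercase a-z tokens; the names
trigrams, build_index and percent_match and the 0.3 threshold are made up
for illustration.

```python
from collections import defaultdict

def trigrams(word):
    """Return the set of character trigrams of a word (no padding)."""
    return {word[i:i + 3] for i in range(len(word) - 2)}

def build_index(docs):
    """Map each trigram to the set of doc ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.lower().split():
            for gram in trigrams(token):
                index[gram].add(doc_id)
    return index

def percent_match(index, term, min_fraction=0.3):
    """Return doc ids matching at least min_fraction of the term's trigrams."""
    grams = trigrams(term)
    hits = defaultdict(int)
    for gram in grams:
        for doc_id in index.get(gram, ()):
            hits[doc_id] += 1
    return {d for d, n in hits.items() if n / len(grams) >= min_fraction}

docs = {1: "a fight broke out", 2: "the figth was short", 3: "bright light"}
index = build_index(docs)

# The number of keys is bounded by the alphabet, not by the corpus size.
MAX_KEYS = 26 ** 3  # 17576 possible a-z trigrams
loose = percent_match(index, "fight", min_fraction=0.3)
strict = percent_match(index, "fight", min_fraction=0.5)
```

With these three toy documents the loose threshold returns all of them,
including the "light" document, and the stricter one drops the "figth"
document. So the trigram match is only a coarse candidate filter; the
result set still needs the more expensive rescoring mentioned earlier in
the thread.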


Another variation of the above ideas would be to add a pre-processing step,
in which you analyze the input corpus to find the words that are likely to
be misspelled. You can use any of the word-based LSH algorithms to do this
and then index selectively.
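
As a rough sketch of such a pre-processing step, the Python below uses
MinHash with banding, which is one possible LSH scheme and not necessarily
the one to pick. The shingle size, NUM_HASHES, BANDS and all function names
are assumptions for illustration, and the toy vocabulary stands in for the
set of distinct tokens in the corpus.

```python
import hashlib
from collections import defaultdict

NUM_HASHES = 32  # signature length (assumed; tune for your data)
BANDS = 16       # 16 bands of 2 rows each (assumed)

def shingles(word, n=2):
    """Character n-grams of a word padded with boundary markers."""
    padded = f"^{word}$"
    return {padded[i:i + n] for i in range(len(padded) - n + 1)}

def seeded_hash(seed, shingle):
    """Deterministic seeded hash (the built-in hash() is salted per run)."""
    digest = hashlib.md5(f"{seed}:{shingle}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")

def minhash(word):
    """MinHash signature: minimum hash per seed over the word's shingles."""
    grams = shingles(word)
    return tuple(min(seeded_hash(seed, g) for g in grams)
                 for seed in range(NUM_HASHES))

def band_keys(signature):
    """Split a signature into one bucket key per band."""
    rows = NUM_HASHES // BANDS
    return [(b, signature[b * rows:(b + 1) * rows]) for b in range(BANDS)]

def build_buckets(vocabulary):
    """Place every corpus word into one bucket per band."""
    buckets = defaultdict(set)
    for word in vocabulary:
        for key in band_keys(minhash(word)):
            buckets[key].add(word)
    return buckets

def candidates(buckets, dictionary_word):
    """Corpus words sharing at least one bucket with a dictionary word."""
    found = set()
    for key in band_keys(minhash(dictionary_word)):
        found |= buckets.get(key, set())
    return found

vocabulary = {"fight", "figth", "feight", "sight", "light", "search"}
buckets = build_buckets(vocabulary)
suspects = candidates(buckets, "fight") - {"fight"}
```

Which words end up in suspects depends on the hash values and on the band
settings, and words like "sight" can collide as well as "figth". The
candidates would still need a stricter similarity check before you decide
which tokens to index selectively.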

This is a theoretical answer. You would have to cherry pick
solutions/approaches for your use case.

Thanks,




On Sat, Jun 8, 2013 at 11:49 PM, Otis Gospodnetic <
otis.gospodne...@gmail.com> wrote:

> Hm, I was purposely avoiding mentioning ngrams because just ngramming
> all indexed tokens would balloon the index.... My assumption was that
> only *some* words are misspelled, in which case it may be better not
> to ngram all tokens....
>
> Otis
> --
> Solr & ElasticSearch Support
> http://sematext.com/
>
>
>
>
>
> On Sun, Jun 9, 2013 at 2:30 AM, Jagdish Nomula <jagd...@simplyhired.com>
> wrote:
> > Another theoretical answer for this question is the ngrams approach. You
> > can index the word and its trigrams. Query the index by the string as well
> > as its trigrams, with a % match search. You then pass the exhaustive
> > result set through a more expensive scoring such as Smith-Waterman.
> >
> > Thanks,
> >
> > Jagdish
> >
> >
> > On Sat, Jun 8, 2013 at 11:03 PM, Shashi Kant <sk...@sloan.mit.edu> wrote:
> >
> >> n-grams might help, followed by a string similarity metric such as
> >> Jaro-Winkler or Smith-Waterman-Gotoh to filter the results further.
> >>
> >>
> >> On Sun, Jun 9, 2013 at 1:59 AM, Otis Gospodnetic
> >> <otis.gospodne...@gmail.com> wrote:
> >>
> >> > Interesting problem.  The first thing that comes to mind is to do
> >> > "word expansion" during indexing.  Kind of like synonym expansion, but
> >> > maybe a bit more dynamic. If you can have a dictionary of correctly
> >> > spelled words, then for each token emitted by the tokenizer you could
> >> > look up the dictionary and expand the token to all other words that
> >> > are similar/close enough.  This would not be super fast, and you'd
> >> > likely have to add some custom heuristic for figuring out what
> >> > "similar/close enough" means, but it might work.
> >> >
> >> > I'd love to hear other ideas...
> >> >
> >> > Otis
> >> > --
> >> > Solr & ElasticSearch Support
> >> > http://sematext.com/
> >> >
> >> >
> >> >
> >> >
> >> >
> >> > On Wed, Jun 5, 2013 at 9:10 AM, కామేశ్వర రావు భైరవభట్ల
> >> > <kamesh...@gmail.com> wrote:
> >> > > Hi,
> >> > >
> >> > > I have a problem where the text corpus we need to search contains
> >> > > many misspelled words. The same word could also be misspelled in
> >> > > several different ways. The corpus could also have documents with
> >> > > the correct spellings. However, the search term that we give in the
> >> > > query would always be correctly spelled. Now when we search on a
> >> > > term, we would like to get all the documents that contain either
> >> > > the correct form or a misspelled form of the search term.
> >> > > We tried fuzzy search, but it doesn't work as per our expectations.
> >> > > It returns any close match, not specifically misspelled words. For
> >> > > example, if I'm searching for a word like "fight", I would like to
> >> > > return the documents that have words like "figth" and "feight", not
> >> > > documents with words like "sight" and "light".
> >> > > Is there any suggested approach for doing this?
> >> > >
> >> > > regards,
> >> > > Kamesh
> >> >
> >>
> >
> >
> >
> > --
> > ***Jagdish Nomula*
> > Sr. Manager Search
> > Simply Hired, Inc.
> > 370 San Aleso Ave., Ste 200
> > Sunnyvale, CA 94085
> >
> > office - 408.400.4700
> > cell - 408.431.2916
> > email - jagd...@simplyhired.com <yourem...@simplyhired.com>
> >
> > www.simplyhired.com
>



