Hm, I was purposely avoiding mentioning ngrams because ngramming every indexed token would balloon the index. My assumption was that only *some* words are misspelled, in which case it may be better not to ngram all tokens.
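To put rough numbers on the index-size concern, here is a minimal sketch that counts how many terms get written when every token is trigrammed versus only the tokens that are missing from a dictionary of correctly spelled words. The helper names (`trigrams`, `index_terms`), the sample tokens, and the dictionary are all made up for illustration; this is not how Solr or Lucene build their index internally.

```python
def trigrams(token):
    """Return the character trigrams of a token (no padding)."""
    return [token[i:i + 3] for i in range(len(token) - 2)]


def index_terms(tokens, dictionary, selective):
    """Return the terms that would be written to the index.

    selective=False: every token is indexed together with its trigrams.
    selective=True:  only tokens missing from the dictionary (suspected
                     misspellings) get trigrams; known words are indexed as-is.
    """
    terms = []
    for token in tokens:
        terms.append(token)
        if not selective or token not in dictionary:
            terms.extend(trigrams(token))
    return terms


# Hypothetical data, for illustration only.
dictionary = {"the", "fight", "was", "over", "before", "sight"}
tokens = ["the", "figth", "was", "over", "before", "the", "fight"]

all_terms = index_terms(tokens, dictionary, selective=False)
some_terms = index_terms(tokens, dictionary, selective=True)

# tokens, terms when ngramming everything, terms when ngramming selectively
print(len(tokens), len(all_terms), len(some_terms))  # 7 22 10
```

Even on this toy input the term count roughly triples when everything is trigrammed, while the selective variant adds terms only for "figth". How much this matters on a real corpus depends on how many tokens are actually misspelled.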

Otis
--
Solr & ElasticSearch Support
http://sematext.com/


On Sun, Jun 9, 2013 at 2:30 AM, Jagdish Nomula <jagd...@simplyhired.com> wrote:
> Another theoretical answer for this question is the ngrams approach. You can
> index the word and its trigrams. Query the index by the string as well as
> its trigrams, with a % match search. You then pass the exhaustive result set
> through a more expensive scoring such as Smith-Waterman.
>
> Thanks,
>
> Jagdish
>
>
> On Sat, Jun 8, 2013 at 11:03 PM, Shashi Kant <sk...@sloan.mit.edu> wrote:
>
>> n-grams might help, followed by an edit distance metric such as Jaro-Winkler
>> or Smith-Waterman-Gotoh to further filter out.
>>
>>
>> On Sun, Jun 9, 2013 at 1:59 AM, Otis Gospodnetic
>> <otis.gospodne...@gmail.com> wrote:
>>
>> > Interesting problem. The first thing that comes to mind is to do
>> > "word expansion" during indexing. Kind of like synonym expansion, but
>> > maybe a bit more dynamic. If you can have a dictionary of correctly
>> > spelled words, then for each token emitted by the tokenizer you could
>> > look up the dictionary and expand the token to all other words that
>> > are similar/close enough. This would not be super fast, and you'd
>> > likely have to add some custom heuristic for figuring out what
>> > "similar/close enough" means, but it might work.
>> >
>> > I'd love to hear other ideas...
>> >
>> > Otis
>> > --
>> > Solr & ElasticSearch Support
>> > http://sematext.com/
>> >
>> >
>> > On Wed, Jun 5, 2013 at 9:10 AM, కామేశ్వర రావు భైరవభట్ల
>> > <kamesh...@gmail.com> wrote:
>> > > Hi,
>> > >
>> > > I have a problem where the text corpus on which we need to search
>> > > contains many misspelled words. The same word could also be misspelled
>> > > in several different ways. The corpus could also have documents with
>> > > correct spellings. However, the search term that we give in the query
>> > > would always be the correct spelling. Now when we search on a term, we
>> > > would like to get all the documents that contain both correct and
>> > > misspelled forms of the search term.
>> > > We tried fuzzy search, but it doesn't work as per our expectations. It
>> > > returns any close match, not specifically misspelled words. For
>> > > example, if I'm searching for a word like "fight", I would like to
>> > > return the documents that have words like "figth" and "feight", not
>> > > documents with words like "sight" and "light".
>> > > Is there any suggested approach for doing this?
>> > >
>> > > regards,
>> > > Kamesh
>> >
>>
>
>
> --
> *Jagdish Nomula*
> Sr. Manager Search
> Simply Hired, Inc.
> 370 San Aleso Ave., Ste 200
> Sunnyvale, CA 94085
>
> office - 408.400.4700
> cell - 408.431.2916
> email - jagd...@simplyhired.com <yourem...@simplyhired.com>
>
> www.simplyhired.com
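
For the "fight" example in the original question, a plain edit-distance cutoff cannot separate "feight" from "sight", since both are one edit away from the query. One possible way to combine the dictionary idea with an edit-distance filter is sketched below: a corpus token counts as a misspelling of the query only if it is close to the query *and* is not itself a correctly spelled word. The function names, the threshold of 1, and the tiny dictionary are assumptions for illustration; the distance used is optimal string alignment (Levenshtein plus adjacent transpositions), swapped in here as a simpler stand-in for the Jaro-Winkler or Smith-Waterman scorers mentioned in the thread.

```python
def osa_distance(a, b):
    """Optimal string alignment distance: insert, delete, substitute, and
    transpose adjacent characters, each at cost 1."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i
    for j in range(cols):
        d[0][j] = j
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution or match
            )
            # Adjacent transposition, e.g. "ht" <-> "th".
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[-1][-1]


def misspellings_of(query, corpus_tokens, dictionary, max_dist=1):
    """Corpus tokens that are close to the query but are not valid words."""
    return sorted(
        t for t in set(corpus_tokens)
        if t not in dictionary and osa_distance(query, t) <= max_dist
    )


# Hypothetical data, for illustration only.
dictionary = {"fight", "sight", "light", "receive"}
corpus_tokens = ["fight", "figth", "feight", "sight", "light", "recieve"]

# Terms to OR together when the user searches for "fight".
expanded = ["fight"] + misspellings_of("fight", corpus_tokens, dictionary)
print(expanded)  # ['fight', 'feight', 'figth']
```

"sight" and "light" are dropped because they are dictionary words, not because of their distance. The same check could run at index time (mapping "figth" to "fight") instead of at query time; which side is cheaper depends on corpus size and query volume.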