N-grams will definitely increase the index size. But the increase might not be very large: for trigrams over a 26-letter alphabet the dictionary can hold at most 26^3 = 17,576 distinct terms, and we are only storing a document list with each n-gram.
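To make the size argument concrete, here is a rough sketch of a trigram index with a percentage-match lookup. It is only an illustration under my own assumptions: plain Python rather than Solr, whitespace tokenization, lowercase a-z tokens, and hypothetical names (trigrams, build_index, candidates, min_match). The more expensive rescoring pass is left out.

```python
from collections import defaultdict

def trigrams(word):
    """Return the set of character trigrams of a lowercased word."""
    w = word.lower()
    return {w[i:i + 3] for i in range(len(w) - 2)}

def build_index(docs):
    """Map each trigram to the set of doc ids that contain it.

    docs is a dict of doc id -> text; tokens are split on whitespace.
    """
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.split():
            for gram in trigrams(token):
                index[gram].add(doc_id)
    return index

def candidates(index, term, min_match=0.3):
    """Doc ids that contain at least min_match of the term's trigrams."""
    grams = trigrams(term)
    hits = defaultdict(int)
    for gram in grams:
        for doc_id in index.get(gram, ()):
            hits[doc_id] += 1
    return {d for d, n in hits.items() if n / len(grams) >= min_match}

# Hypothetical sample corpus, invented for this sketch.
docs = {
    1: "the figth was long",
    2: "a feight broke out",
    3: "turn on the light",
    4: "nothing relevant here",
}
index = build_index(docs)
print(len(index), "distinct trigrams; upper bound is", 26 ** 3)
print(sorted(candidates(index, "fight")))
```

Note that the lookup for "fight" still pulls in the document containing "light", since the two words share the trigrams "igh" and "ght". The trigram stage only narrows the candidate set; telling misspellings apart from other near matches is the job of the second, more expensive scoring pass.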

Another variation of the above ideas would be to add a pre-processing step in which you analyze the input corpus to find the words that are likely to be misspelled. You can use any of the word-based LSH algorithms to do this and then index selectively.

This is a theoretical answer. You would have to cherry-pick solutions/approaches for your use case.

Thanks,

On Sat, Jun 8, 2013 at 11:49 PM, Otis Gospodnetic <otis.gospodne...@gmail.com> wrote:

> Hm, I was purposely avoiding mentioning ngrams because just ngramming
> all indexed tokens would balloon the index.... My assumption was that
> only *some* words are misspelled, in which case it may be better not
> to ngram all tokens....
>
> Otis
> --
> Solr & ElasticSearch Support
> http://sematext.com/
>
> On Sun, Jun 9, 2013 at 2:30 AM, Jagdish Nomula <jagd...@simplyhired.com> wrote:
> > Another theoretical answer for this question is the ngrams approach.
> > You can index the word and its trigrams. Query the index by the
> > string as well as its trigrams, with a % match search. You then pass
> > the exhaustive result set through a more expensive scoring such as
> > Smith-Waterman.
> >
> > Thanks,
> >
> > Jagdish
> >
> > On Sat, Jun 8, 2013 at 11:03 PM, Shashi Kant <sk...@sloan.mit.edu> wrote:
> >
> >> n-grams might help, followed by an edit distance metric such as
> >> Jaro-Winkler or Smith-Waterman-Gotoh to further filter out.
> >>
> >> On Sun, Jun 9, 2013 at 1:59 AM, Otis Gospodnetic <otis.gospodne...@gmail.com> wrote:
> >>
> >> > Interesting problem. The first thing that comes to mind is to do
> >> > "word expansion" during indexing. Kind of like synonym expansion, but
> >> > maybe a bit more dynamic. If you can have a dictionary of correctly
> >> > spelled words, then for each token emitted by the tokenizer you could
> >> > look up the dictionary and expand the token to all other words that
> >> > are similar/close enough. This would not be super fast, and you'd
> >> > likely have to add some custom heuristic for figuring out what
> >> > "similar/close enough" means, but it might work.
> >> >
> >> > I'd love to hear other ideas...
> >> >
> >> > Otis
> >> > --
> >> > Solr & ElasticSearch Support
> >> > http://sematext.com/
> >> >
> >> > On Wed, Jun 5, 2013 at 9:10 AM, కామేశ్వర రావు భైరవభట్ల
> >> > <kamesh...@gmail.com> wrote:
> >> > > Hi,
> >> > >
> >> > > I have a problem where the text corpus on which we need to search
> >> > > contains many misspelled words. The same word could also be
> >> > > misspelled in several different ways. The corpus could also have
> >> > > documents that have correct spellings. However, the search term
> >> > > that we give in the query would always be the correct spelling.
> >> > > Now when we search on a term, we would like to get all the
> >> > > documents that contain both correct and misspelled forms of the
> >> > > search term.
> >> > > We tried fuzzy search, but it doesn't work as per our expectations.
> >> > > It returns any close match, not specifically misspelled words. For
> >> > > example, if I'm searching for a word like "fight", I would like to
> >> > > return the documents that have words like "figth" and "feight",
> >> > > not documents with words like "sight" and "light".
> >> > > Is there any suggested approach for doing this?
> >> > >
> >> > > regards,
> >> > > Kamesh
> >
> > --
> > *Jagdish Nomula*
> > Sr. Manager Search
> > Simply Hired, Inc.
> > 370 San Aleso Ave., Ste 200
> > Sunnyvale, CA 94085
> >
> > office - 408.400.4700
> > cell - 408.431.2916
> > email - jagd...@simplyhired.com <yourem...@simplyhired.com>
> >
> > www.simplyhired.com

--
*Jagdish Nomula*
Sr. Manager Search
Simply Hired, Inc.
370 San Aleso Ave., Ste 200
Sunnyvale, CA 94085

office - 408.400.4700
cell - 408.431.2916
email - jagd...@simplyhired.com <yourem...@simplyhired.com>

www.simplyhired.com