Re: Search for misspelled words in corpus

2013-06-09 Thread కామేశ్వర రావు భైరవభట్ల
Hi Upayavira, The word I am searching for is "fight". Terms like "figth" and "figh" are misspellings of "fight", so I would like to find them. "sight" is obviously not a misspelling of "fight". Even if it were a typo, I don't really want to match "sight" with "fight". regards, Kamesh On Sun,

Re: Search for misspelled words in corpus

2013-06-09 Thread కామేశ్వర రావు భైరవభట్ల
Thanks everyone for the replies. I too had the same idea of a pre-processing step. So, I first analyzed the corpus using a dictionary and got all the misspelled words and created a separate index with those words in Solr. Now, when I search for a given query word, first I search for the exact match
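The pre-processing step described above might look roughly like the following. This is only a minimal sketch: the in-memory dictionary stands in for the separate Solr index, and the names `build_misspelling_index`, `docs` and `dictionary` are hypothetical, not taken from the thread. It covers only the collection of out-of-dictionary words, not the query side.

```python
import re
from collections import defaultdict


def build_misspelling_index(docs, dictionary):
    """Map each out-of-dictionary token to the ids of the documents containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        # Assumption: tokens are runs of ASCII letters, compared in lowercase.
        for token in re.findall(r"[a-z]+", text.lower()):
            if token not in dictionary:
                index[token].add(doc_id)
    return dict(index)


# Toy corpus and dictionary, made up for illustration.
docs = {
    1: "They figth every day",
    2: "A fair fight",
    3: "He will figh and figth",
}
dictionary = {"they", "every", "day", "a", "fair", "fight", "he", "will", "and"}

misspelled = build_misspelling_index(docs, dictionary)
```

Note that a plain dictionary check also flags rare but valid words (names, jargon), so in practice the resulting list would probably need some review before indexing.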

Re: Search for misspelled words in corpus

2013-06-09 Thread Upayavira
You haven't stated why figh is correct and sight isn't. Is it because the first letter is different? Upayavira On Wed, Jun 5, 2013, at 02:10 PM, కామేశ్వర రావు భైరవభట్ల wrote: > Hi, > > I have a problem where our text corpus on which we need to do search > contains many misspelled words. Same wor

Re: Search for misspelled words in corpus

2013-06-09 Thread Jagdish Nomula
N-grams will definitely increase the index size, but the increase might not be very high: the set of possible trigrams is at most 26^3, and we only store a document list for each n-gram. Another variation of the above ideas would be to add a pre-processing step, wherein you analyze the i
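To make the 26^3 figure concrete, here is a small sketch that assumes tokens consist only of the lowercase letters a-z, so the trigram vocabulary is bounded no matter how large the corpus grows. The helper name `trigrams` and the word list are illustrative only.

```python
from string import ascii_lowercase


def trigrams(word):
    """Return the set of contiguous 3-letter substrings of a word."""
    return {word[i:i + 3] for i in range(len(word) - 2)}


# Upper bound on distinct trigrams over a 26-letter alphabet.
MAX_TRIGRAMS = len(ascii_lowercase) ** 3

words = ["fight", "figth", "figh", "sight"]
distinct = set().union(*(trigrams(w) for w in words))
```

The bound applies to the number of distinct terms, not to the posting lists: each trigram still carries one entry per document it occurs in, which is where most of the extra index size would come from.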

Re: Search for misspelled words in corpus

2013-06-08 Thread Otis Gospodnetic
Hm, I was purposely avoiding mentioning ngrams because just ngramming all indexed tokens would balloon the index. My assumption was that only *some* words are misspelled, in which case it may be better not to ngram all tokens. Otis -- Solr & ElasticSearch Support http://sematext.com/ On

Re: Search for misspelled words in corpus

2013-06-08 Thread Jagdish Nomula
Another theoretical answer to this question is the n-gram approach. You can index the word and its trigrams, then query the index by the string as well as its trigrams, with a % match search. You then pass the exhaustive result set through a more expensive scoring such as Smith-Waterman. Thanks, Jagdish
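The first stage of this idea (trigram lookup with a percentage match) could be sketched as below. This is plain Python rather than Solr, and the function names and the `min_match` threshold of 0.3 are assumptions made for illustration. The more expensive rescoring pass is not shown.

```python
from collections import defaultdict


def trigrams(word):
    """Return the set of contiguous 3-letter substrings of a word."""
    return {word[i:i + 3] for i in range(len(word) - 2)}


def build_trigram_index(words):
    """Inverted index: trigram -> set of words containing it."""
    index = defaultdict(set)
    for word in words:
        for gram in trigrams(word):
            index[gram].add(word)
    return index


def candidates(query, index, min_match=0.3):
    """Return words sharing at least `min_match` of the query's trigrams."""
    query_grams = trigrams(query)
    if not query_grams:  # queries shorter than 3 characters have no trigrams
        return {}
    hits = defaultdict(int)
    for gram in query_grams:
        for word in index.get(gram, ()):
            hits[word] += 1
    scores = {word: count / len(query_grams) for word, count in hits.items()}
    return {word: score for word, score in scores.items() if score >= min_match}


index = build_trigram_index(["fight", "figth", "figh", "sight", "apple"])
result = candidates("fight", index)
```

With this toy word list the filter keeps "figth" and "figh", but "sight" passes as well, since it shares two of the three query trigrams. The coarse stage is permissive, and finer discrimination is left to the second scoring pass.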

Re: Search for misspelled words in corpus

2013-06-08 Thread Shashi Kant
n-grams might help, followed by an edit distance metric such as Jaro-Winkler or Smith-Waterman-Gotoh to filter the candidates further. On Sun, Jun 9, 2013 at 1:59 AM, Otis Gospodnetic wrote: > Interesting problem. The first thing that comes to mind is to do > "word expansion" during indexing. Kind of lik
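As an illustration of the filtering step, here is a hand-written Jaro-Winkler similarity in plain Python. It is a sketch, not a library implementation. The 0.9 threshold and the candidate list are assumptions picked for the example, and the prefix scale of 0.1 with a 4-character prefix cap follows the usual definition of the metric.

```python
def jaro(s1, s2):
    """Jaro similarity in [0, 1]."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0
    # Characters match only if they are equal and at most `window` positions apart.
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1 = [False] * len1
    matched2 = [False] * len2
    matches = 0
    for i, ch in enumerate(s1):
        for j in range(max(0, i - window), min(len2, i + window + 1)):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Transpositions: matched characters that appear in a different order.
    seq1 = [c for c, m in zip(s1, matched1) if m]
    seq2 = [c for c, m in zip(s2, matched2) if m]
    transpositions = sum(a != b for a, b in zip(seq1, seq2)) // 2
    return (matches / len1 + matches / len2
            + (matches - transpositions) / matches) / 3


def jaro_winkler(s1, s2, prefix_scale=0.1):
    """Jaro similarity boosted for a shared prefix of up to 4 characters."""
    score = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1[:4], s2[:4]):
        if a != b:
            break
        prefix += 1
    return score + prefix * prefix_scale * (1 - score)


def filter_candidates(query, words, threshold=0.9):
    """Keep only candidates whose similarity to the query reaches the threshold."""
    return [w for w in words if jaro_winkler(query, w) >= threshold]


kept = filter_candidates("fight", ["figth", "figh", "sight"])
```

Because Jaro-Winkler rewards a shared prefix, candidates that differ in the first letter score lower. With the threshold assumed here, that is enough to keep "figth" and "figh" while dropping "sight".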

Re: Search for misspelled words in corpus

2013-06-08 Thread Otis Gospodnetic
Interesting problem. The first thing that comes to mind is to do "word expansion" during indexing. Kind of like synonym expansion, but maybe a bit more dynamic. If you can have a dictionary of correctly spelled words, then for each token emitted by the tokenizer you could look up the dictionary a
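A rough sketch of index-time word expansion, written in plain Python rather than as a Solr token filter. The similarity lookup uses the standard library's `difflib.get_close_matches`, which is an assumption chosen for brevity, as are the 0.75 cutoff and the name `expand_tokens`. The corrected word is emitted at the same position as the original token, the way a synonym filter would do it.

```python
from difflib import get_close_matches


def expand_tokens(tokens, dictionary, cutoff=0.75):
    """Return (position, term) pairs, adding the closest dictionary word
    at the same position for every token that is not in the dictionary."""
    vocabulary = sorted(dictionary)  # sorted for deterministic tie-breaking
    expanded = []
    for position, token in enumerate(tokens):
        expanded.append((position, token))
        if token not in dictionary:
            for match in get_close_matches(token, vocabulary, n=1, cutoff=cutoff):
                expanded.append((position, match))
    return expanded


# Toy dictionary and token stream, made up for illustration.
dictionary = {"they", "fight", "hard", "sight"}
terms = expand_tokens(["they", "figth", "hard"], dictionary)
```

A search for "fight" would then match the document that contains "figth", at the cost of one extra term per misspelled token instead of n-grams for every token.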