Hi Upayavira,
The word I am searching for is "fight". Terms like "figth" and "figh" are
misspellings of "fight", so I would like to find them. "sight" is obviously
not a misspelling of "fight", and even if it were a typo, I don't really
want "sight" to match "fight".
regards,
Kamesh
Thanks everyone for the replies. I too had the same idea of a
pre-processing step. So I first analyzed the corpus against a dictionary,
collected all the misspelled words, and created a separate index in Solr
with those words. Now, when I search for a given query word, I first
search for the exact match.
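In case the details help anyone else, the dictionary pass was roughly along
these lines (a simplified sketch only; the file names and the crude
regex tokenization are placeholders, not our actual pipeline):

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;
import java.util.stream.Collectors;

public class MisspelledWordCollector {
    public static void main(String[] args) throws IOException {
        // dictionary.txt: one correctly spelled word per line (placeholder file)
        Set<String> dictionary = Files.lines(Paths.get("dictionary.txt"))
                .map(String::toLowerCase)
                .collect(Collectors.toSet());

        // corpus.txt: the raw text that gets indexed into Solr (placeholder file)
        Set<String> misspelled = new HashSet<>();
        Files.lines(Paths.get("corpus.txt"))
                .flatMap(line -> Arrays.stream(line.toLowerCase().split("\\W+")))
                .filter(token -> !token.isEmpty() && !dictionary.contains(token))
                .forEach(misspelled::add);

        // These out-of-dictionary tokens are what go into the separate
        // "misspellings" index.
        misspelled.forEach(System.out::println);
    }
}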
You haven't stated why "figh" counts as a match for "fight" while "sight"
doesn't. Is it because the first letter is different?
Upayavira
On Wed, Jun 5, 2013, at 02:10 PM, కామేశ్వర రావు భైరవభట్ల wrote:
> Hi,
>
> I have a problem where our text corpus on which we need to do search
> contains many misspelled words. Same wor
Ngrams will definitely increase the index, but the increase in size might
not be that high: there are at most 26^3 = 17,576 possible letter trigrams,
so the term dictionary stays small and we are just storing a postings
(document) list for each ngram.
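For concreteness, here is the kind of per-token expansion being discussed
(a toy sketch; inside Solr itself an NGramFilterFactory with minGramSize=3
and maxGramSize=3 in the field's analysis chain would produce the same
trigrams):

import java.util.LinkedHashSet;
import java.util.Set;

public class TrigramDemo {
    // Return the distinct character trigrams of a word, in order of appearance.
    static Set<String> trigrams(String word) {
        Set<String> grams = new LinkedHashSet<>();
        for (int i = 0; i + 3 <= word.length(); i++) {
            grams.add(word.substring(i, i + 3));
        }
        return grams;
    }

    public static void main(String[] args) {
        System.out.println(trigrams("fight")); // [fig, igh, ght]
        System.out.println(trigrams("figth")); // [fig, igt, gth]
        // At most 26^3 = 17,576 distinct all-letter trigrams exist, so the
        // term dictionary stays bounded; only the postings lists grow.
    }
}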
Another variation of the above ideas would be to add a pre-processing step,
wherein you analyze the i
Hm, I was purposely avoiding mentioning ngrams, because just ngramming
all indexed tokens would balloon the index. My assumption was that
only *some* words are misspelled, in which case it may be better not
to ngram all tokens.
Otis
--
Solr & ElasticSearch Support
http://sematext.com/
Another theoretical answer to this question is the ngrams approach. You can
index the word and its trigrams, then query the index by the string as well
as its trigrams with a percentage-match search. You then pass the exhaustive
result set through a more expensive scoring function such as Smith-Waterman.
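To make the percentage-match step concrete, something along these lines
(a toy sketch; the overlap measure and the example words are illustrative
only):

import java.util.HashSet;
import java.util.Set;

public class TrigramMatch {
    static Set<String> trigrams(String word) {
        Set<String> grams = new HashSet<>();
        for (int i = 0; i + 3 <= word.length(); i++) {
            grams.add(word.substring(i, i + 3));
        }
        return grams;
    }

    // Fraction of the query's trigrams that also occur in the candidate.
    static double overlap(String query, String candidate) {
        Set<String> q = trigrams(query);
        if (q.isEmpty()) return 0.0;
        Set<String> shared = new HashSet<>(q);
        shared.retainAll(trigrams(candidate));
        return (double) shared.size() / q.size();
    }

    public static void main(String[] args) {
        System.out.println(overlap("fight", "figth")); // ~0.33, shares only "fig"
        System.out.println(overlap("fight", "sight")); // ~0.67, shares "igh", "ght"
        // Note that raw trigram overlap alone actually favours "sight" here,
        // which is exactly why the more expensive second-stage scoring matters.
    }
}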
Thanks,
Jagdish
N-grams might help, followed by a string-distance metric such as
Jaro-Winkler or Smith-Waterman-Gotoh to filter the candidates further.
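For the rescoring step, Lucene's spellchecker module already ships a few
StringDistance implementations; a rough sketch of using one of them (the
candidate list and any cut-off you apply are up to you):

import org.apache.lucene.search.spell.JaroWinklerDistance;
import org.apache.lucene.search.spell.StringDistance;

public class RescoreCandidates {
    public static void main(String[] args) {
        StringDistance distance = new JaroWinklerDistance();
        String query = "fight";
        String[] candidates = {"figth", "figh", "sight"};
        for (String candidate : candidates) {
            // getDistance() returns a similarity in [0,1]; 1.0 means identical.
            float score = distance.getDistance(query, candidate);
            System.out.printf("%s -> %.3f%n", candidate, score);
        }
        // Because Jaro-Winkler rewards a shared prefix, "figth" and "figh"
        // should come out ahead of "sight", which is the behaviour Kamesh
        // is after; apply whatever threshold suits your data.
    }
}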
On Sun, Jun 9, 2013 at 1:59 AM, Otis Gospodnetic wrote:
> Interesting problem. The first thing that comes to mind is to do
> "word expansion" during indexing. Kind of like synonym expansion, but ...
Interesting problem. The first thing that comes to mind is to do
"word expansion" during indexing. Kind of like synonym expansion, but
maybe a bit more dynamic. If you can have a dictionary of correctly
spelled words, then for each token emitted by the tokenizer you could
look up the dictionary and, when the token isn't a known word, also emit
the closest correctly spelled word, much as a synonym filter would.
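As a very rough illustration of what such an analysis-chain hook could look
like (this is not an existing Solr/Lucene filter, just a sketch assuming you
have already built a misspelling -> correct-spelling map, e.g. from the
dictionary pass Kamesh described):

import java.io.IOException;
import java.util.Map;

import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.PositionIncrementAttribute;

// Emits the corrected spelling as an extra token at the same position as the
// misspelled one, analogous to how synonym expansion works.
public final class MisspellingExpansionFilter extends TokenFilter {
    private final Map<String, String> corrections;
    private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);
    private final PositionIncrementAttribute posIncAtt = addAttribute(PositionIncrementAttribute.class);
    private State savedState;
    private String pendingCorrection;

    public MisspellingExpansionFilter(TokenStream input, Map<String, String> corrections) {
        super(input);
        this.corrections = corrections;
    }

    @Override
    public boolean incrementToken() throws IOException {
        if (pendingCorrection != null) {
            // Emit the correct spelling stacked on top of the original token.
            restoreState(savedState);
            termAtt.setEmpty().append(pendingCorrection);
            posIncAtt.setPositionIncrement(0);
            pendingCorrection = null;
            return true;
        }
        if (!input.incrementToken()) {
            return false;
        }
        String correction = corrections.get(termAtt.toString());
        if (correction != null && !correction.contentEquals(termAtt)) {
            savedState = captureState();
            pendingCorrection = correction;
        }
        return true;
    }

    @Override
    public void reset() throws IOException {
        super.reset();
        savedState = null;
        pendingCorrection = null;
    }
}

Wiring this into Solr would still need a TokenFilterFactory and a schema
entry, and you would have to decide whether to apply the same expansion at
query time; treat the above purely as a starting point.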