We have implemented a component that supports "did you mean" and prefix
suggestions for Chinese. However, our work is based on Solr 1.4 with many
local modifications, so integrating it into current Solr/Lucene would take time.

     Here is our solution. Any advice is welcome.

     1. Offline word and phrase discovery
           We discover new words and new phrases by mining query logs.

     2. Online matching algorithm
           Each word, e.g. 贝多芬, is converted to pinyin (bei duo fen) and
indexed as character n-grams: gram3:bei gram3:eid ...
           To get a "did you mean" result, we convert the query, e.g. 背朵分,
into pinyin n-grams in the same way and run them as a boolean OR query. This
returns many results, with the words whose pinyin is most similar to the
query's ranked at the top.
           We then rerank the top 500 results with a finer-grained algorithm.
We use edit distance to align the query with each result, and we also take
the characters themselves into account. E.g. for the query 十度, the matches
are 十渡 and 是度. Their pinyin is exactly the same, but 十渡 is better than 是度
because 十 occurs in both the query and the match.
           You also need to consider the hotness (popularity) of the different
words/phrases, which can be derived from the query logs.
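
A rough sketch of the online matching step in plain Python, not our actual
Solr code. The lexicon, the popularity counts, and all function names are
made up for illustration, and the query's pinyin is passed in by hand. The
sort order (pinyin edit distance first, then shared characters, then
popularity) is only one possible way to combine the three signals.

```python
from collections import Counter

# Hypothetical toy data: word -> (toneless pinyin, popularity from query logs).
LEXICON = {
    "贝多芬": ("beiduofen", 900),
    "十渡": ("shidu", 120),
    "是度": ("shidu", 3),
    "长沙": ("changsha", 800),
}

def ngrams(s, n=3):
    """Return the character n-grams of s, e.g. 'beiduo' -> bei, eid, idu, duo."""
    return [s[i:i + n] for i in range(len(s) - n + 1)]

def build_index(lexicon):
    """Inverted index: pinyin trigram -> set of words whose pinyin contains it."""
    index = {}
    for word, (pinyin, _) in lexicon.items():
        for gram in set(ngrams(pinyin)):
            index.setdefault(gram, set()).add(word)
    return index

def candidates(query_pinyin, index, limit=500):
    """Boolean OR over the query's trigrams, ranked by number of matching grams."""
    hits = Counter()
    for gram in set(ngrams(query_pinyin)):
        for word in index.get(gram, ()):
            hits[word] += 1
    return [word for word, _ in hits.most_common(limit)]

def edit_distance(a, b):
    """Levenshtein distance computed with a rolling row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def rerank(query, query_pinyin, words, lexicon):
    """Sort by pinyin distance, then shared characters, then popularity."""
    def key(word):
        pinyin, hotness = lexicon[word]
        shared = len(set(query) & set(word))
        return (edit_distance(query_pinyin, pinyin), -shared, -hotness)
    return sorted(words, key=key)

def did_you_mean(query, query_pinyin, lexicon=LEXICON):
    index = build_index(lexicon)
    return rerank(query, query_pinyin, candidates(query_pinyin, index), lexicon)
```

For example, did_you_mean("背多分", "beiduofen") puts 贝多芬 first. In a real
index the trigrams would be tokens in a field such as gram3, and the OR query
would be scored by the search engine rather than by a hit count.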

          Another problem is converting Chinese into pinyin, because some
characters have more than one pinyin reading.
          E.g. in 长沙 and 长大, 长 is read chang in 长沙 but zhang in 长大, so you
should segment the query and the words/phrases first. Word segmentation is a
basic problem in Chinese IR.
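
To illustrate why segmentation comes first, here is a minimal sketch that
uses forward maximum matching against a word dictionary and falls back to a
default reading per character. The two dictionaries and the function name
to_pinyin are hypothetical, and forward maximum matching is just a simple
stand-in for whatever segmenter you actually use.

```python
# Hypothetical toy dictionaries; a real system would load a full lexicon.
WORD_PINYIN = {
    "长沙": ["chang", "sha"],
    "长大": ["zhang", "da"],
}
# Default single-character readings, used when no dictionary word matches.
CHAR_PINYIN = {"长": "chang", "沙": "sha", "大": "da", "我": "wo", "在": "zai"}

def to_pinyin(text, max_len=4):
    """Segment by forward maximum matching, then emit pinyin word by word."""
    out, i = [], 0
    while i < len(text):
        # Try the longest multi-character dictionary word starting at position i.
        for size in range(min(max_len, len(text) - i), 1, -1):
            word = text[i:i + size]
            if word in WORD_PINYIN:
                out.extend(WORD_PINYIN[word])
                i += size
                break
        else:
            # No word matched: use the default reading, or keep the character as-is.
            out.append(CHAR_PINYIN.get(text[i], text[i]))
            i += 1
    return out
```

With these toy dictionaries, to_pinyin("我在长沙长大") gives
wo zai chang sha zhang da, so the same character 长 gets a different reading
depending on the word it belongs to.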


2011/10/21 Floyd Wu <floyd...@gmail.com>

> Does anybody know how to implement this idea in Solr? Please kindly
> point me in a direction.
>
> For example, a user wants to enter the Chinese keyword "貝多芬" (this is
> Beethoven in Chinese)
> but keys in a wrong combination of characters, "背多分" (this is
> pronounced the same as the previous keyword "貝多芬").
>
> The token "貝多芬" actually exists in the Solr index. How can I hit documents
> where "貝多芬" exists when "背多分" is entered?
>
> This is a basic function of commercial search engines, especially in
> Chinese processing. I wonder how to implement it in Solr and where the
> starting point is.
>
> Floyd
>
