Hi Li Li,

Thanks for your detailed explanation. My implementation is basically
similar to yours. I just wanted to know whether there is a better,
more complete solution. I'll keep trying, and if I make any
improvements I'll share them with you and the community.

Any ideas or advice are welcome.

Floyd



2011/10/21 Li Li <fancye...@gmail.com>:
>    We have implemented "did you mean" and prefix suggestion
> for Chinese. But we based our work on Solr 1.4 and made many
> modifications, so it will take time to integrate it into current Solr/Lucene.
>
>     Here is our solution. We'd be glad to get any advice.
>
>     1. offline word and phrase discovery
>           We discover new words and new phrases by mining query logs.
>
>     2. online matching algorithm
>           For each word, e.g., 贝多芬,
>           we convert it to pinyin (bei duo fen), then index it using
> n-grams, which means gram3:bei gram3:eid ...
>           To get a "did you mean" result, we convert the query 背朵分 into
> n-grams. It's a boolean OR query, so there are many results (the words whose
> pinyin is similar to the query will be ranked at the top).
>          Then we rerank the top 500 results with a fine-grained algorithm.
>          We use edit distance to align the query and the result, and we
> also take the characters into consideration. E.g., for the query 十度, the
> matches are 十渡 and 是度. Their pinyin is exactly the same, but 十渡 is
> better than 是度 because 十 occurs in both the query and the match.
>          You also need to consider the hotness (popularity) of different
> words/phrases, which can be learned from query logs.
>
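
The online matching step quoted above could be sketched roughly as follows.
This is only a minimal illustration, not the actual Solr 1.4 implementation:
the PINYIN table, the function names, and the popularity counts are all
made up for the example, and the "shared characters" score simply compares
characters position by position instead of using a real alignment.

```python
from collections import Counter, defaultdict

# Toy character-to-pinyin table (an assumption for this example only;
# a real system needs a full dictionary and polyphone handling).
PINYIN = {
    "贝": "bei", "背": "bei", "多": "duo", "朵": "duo",
    "芬": "fen", "分": "fen", "十": "shi", "是": "shi",
    "度": "du", "渡": "du",
}

def to_pinyin(word):
    """Concatenate the pinyin of each character, e.g. 贝多芬 -> beiduofen."""
    return "".join(PINYIN[ch] for ch in word)

def trigrams(s):
    """Character 3-grams of a string: beiduofen -> bei, eid, idu, ..."""
    return [s[i:i + 3] for i in range(len(s) - 2)]

def build_index(words):
    """Inverted index from pinyin trigram to the words containing it."""
    index = defaultdict(set)
    for w in words:
        for g in trigrams(to_pinyin(w)):
            index[g].add(w)
    return index

def candidates(index, query, limit=500):
    """Boolean OR over the query trigrams, ranked by number of shared grams."""
    hits = Counter()
    for g in set(trigrams(to_pinyin(query))):
        for w in index.get(g, ()):
            hits[w] += 1
    return [w for w, _ in hits.most_common(limit)]

def edit_distance(a, b):
    """Standard Levenshtein distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def rerank(query, cands, popularity):
    """Sort by pinyin edit distance, then shared characters, then popularity."""
    qp = to_pinyin(query)
    def key(w):
        # Simplification: count characters equal at the same position.
        shared = sum(1 for a, b in zip(query, w) if a == b)
        return (edit_distance(qp, to_pinyin(w)), -shared, -popularity.get(w, 0))
    return sorted(cands, key=key)

# Hypothetical vocabulary and query-log counts.
index = build_index(["贝多芬", "十渡", "是度"])
popularity = {"贝多芬": 50, "十渡": 120, "是度": 3}
print(rerank("背朵分", candidates(index, "背朵分"), popularity))
print(rerank("十度", candidates(index, "十度"), popularity))
```

With this simple positional score, candidates that share the same number of
characters with the query are ordered by the popularity counts, so the
weighting between the three signals would need tuning on real query logs.
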
>          Another issue is converting Chinese into pinyin, because some
> characters have more than one pinyin reading.
>         E.g., 长 is chang in 长沙 but zhang in 长大, so you should segment
> the query and words/phrases first. Word segmentation is a basic problem in
> Chinese IR.
>
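
To illustrate the polyphone point above, here is a small sketch that segments
the text first and then looks up pinyin per word. It uses forward maximum
matching as a stand-in segmenter, which is not necessarily what the quoted
system uses. The two lexicons, the default single-character readings, and
the function names are assumptions for the example.

```python
# Hypothetical word-level pinyin lexicon; a real one would be much larger.
WORD_PINYIN = {
    "长沙": ["chang", "sha"],
    "长大": ["zhang", "da"],
}

# Fallback readings for single characters not covered by a known word
# (the default chosen for each character is an assumption).
CHAR_PINYIN = {"长": "chang", "沙": "sha", "大": "da", "在": "zai"}

def segment(text, lexicon, max_len=4):
    """Forward maximum matching: take the longest lexicon word at each position."""
    out, i = [], 0
    while i < len(text):
        for n in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + n]
            # A single character is always accepted as a last resort.
            if n == 1 or piece in lexicon:
                out.append(piece)
                i += n
                break
    return out

def pinyin_of(text):
    """Segment first, then read each word from the word-level lexicon."""
    syllables = []
    for word in segment(text, WORD_PINYIN):
        if word in WORD_PINYIN:
            syllables.extend(WORD_PINYIN[word])
        else:
            syllables.append(CHAR_PINYIN.get(word, word))
    return syllables

print(pinyin_of("长沙"))
print(pinyin_of("长大"))
print(pinyin_of("在长沙长大"))
```

The same character 长 gets a different reading depending on which word the
segmenter places it in, which is why segmentation has to run before the
pinyin conversion.
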
>
> 2011/10/21 Floyd Wu <floyd...@gmail.com>
>
>> Does anybody know how to implement this idea in Solr? Please kindly
>> point me in the right direction.
>>
>> For example, a user wants to enter the Chinese keyword "贝多芬" (this is
>> Beethoven in Chinese)
>> but keys in a wrong combination of characters, "背多分" (this is
>> pronounced the same as the previous keyword "贝多芬").
>>
>> The token "贝多芬" actually exists in the Solr index. How can I hit
>> documents containing "贝多芬" when "背多分" is entered?
>>
>> This is a basic function of commercial search engines, especially in
>> Chinese processing. I wonder how to implement it in Solr and where
>> to start.
>>
>> Floyd
>>
>
