Hi Li Li,

Thanks for your detailed explanation. Basically, my implementation is similar to yours. I just wanted to know whether there is a better, more complete solution. I'll keep trying and see whether I come up with any improvement I can share with you and the community.
Any ideas or advice are welcome.

Floyd

2011/10/21 Li Li <fancye...@gmail.com>:
> We have implemented a component that supports "did you mean" and prefix
> suggestion for Chinese. But we based our work on Solr 1.4 and made many
> modifications, so it would take time to integrate it into current
> Solr/Lucene.
>
> Here is our solution. Glad to hear any advice.
>
> 1. Offline word and phrase discovery.
>    We discover new words and new phrases by mining query logs.
>
> 2. Online matching algorithm.
>    For each word, e.g. 贝多芬, we convert it to pinyin (bei duo fen), then
>    index it using n-grams, i.e. gram3:bei gram3:eid ...
>    To get a "did you mean" result, we convert the query 背朵分 into
>    n-grams. It is a boolean OR query, so there are many results (words
>    whose pinyin is similar to the query's are ranked at the top).
>    Then we rerank the top 500 results with a fine-grained algorithm.
>    We use edit distance to align the query and each result, and we also
>    take the characters into consideration. E.g. for the query 十度, the
>    matches are 十渡 and 是度. Their pinyin is exactly the same, but 十渡 is
>    better than 是度 because 十 occurs in both the query and the match.
>    You also need to consider the hotness (popularity) of different
>    words/phrases, which can be learned from query logs.
>
> Another question is how to convert Chinese into pinyin, because some
> characters have more than one pinyin reading.
> E.g. 长沙 and 长大: 长's pinyin is chang in 长沙 but zhang in 长大, so you
> should segment the query and the words/phrases first. Word segmentation
> is a basic problem in Chinese IR.
>
>
> 2011/10/21 Floyd Wu <floyd...@gmail.com>
>
>> Does anybody know how to implement this idea in Solr? Please kindly
>> point me in a direction.
>>
>> For example, a user wants to enter the Chinese keyword "贝多芬" (this is
>> Beethoven in Chinese) but keys in a wrong combination of characters,
>> "背多分" (which is pronounced the same as the intended keyword "贝多芬").
>>
>> The token "贝多芬" actually exists in the Solr index. How can I hit
>> documents containing "贝多芬" when "背多分" is entered?
>>
>> This is a basic function of commercial search engines, especially in
>> Chinese processing. I wonder how to implement it in Solr and where the
>> starting point is.
>>
>> Floyd
>>
>
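
For reference, the online matching step described in the quoted message (pinyin trigram lookup, then a rerank by edit distance, shared characters, and popularity) could be sketched roughly as below. This is only an illustration in plain Python, not the Solr 1.4 code Li Li mentions. The tiny LEXICON and CHAR_PINYIN tables, the popularity numbers, and the names trigrams, edit_distance, and suggest are all made up for the example. It also assumes one pinyin reading per character, so it skips the segmentation problem entirely.

```python
from collections import defaultdict

# Toy data (hypothetical): word -> (pinyin, popularity taken from query logs).
LEXICON = {
    "贝多芬": ("beiduofen", 100),
    "十渡": ("shidu", 40),
    "是度": ("shidu", 5),
    "长沙": ("changsha", 80),
}
# Hypothetical single-reading pinyin table for the query characters.
CHAR_PINYIN = {"背": "bei", "朵": "duo", "多": "duo", "分": "fen",
               "十": "shi", "度": "du"}


def trigrams(s):
    """Return the character 3-grams of s (s itself if shorter than 3)."""
    return [s[i:i + 3] for i in range(len(s) - 2)] or [s]


def edit_distance(a, b):
    """Levenshtein distance, keeping only one previous row of the table."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # delete ca
                           cur[j - 1] + 1,             # insert cb
                           prev[j - 1] + (ca != cb)))  # substitute
        prev = cur
    return prev[-1]


# Inverted index: trigram -> words whose pinyin contains that trigram.
INDEX = defaultdict(set)
for word, (pinyin, _) in LEXICON.items():
    for gram in trigrams(pinyin):
        INDEX[gram].add(word)


def suggest(query, top_n=500):
    """Return candidate words for query, best first."""
    q_pinyin = "".join(CHAR_PINYIN.get(ch, ch) for ch in query)

    # Step 1: boolean OR over the trigrams, ranked by number of shared grams.
    hits = defaultdict(int)
    for gram in set(trigrams(q_pinyin)):
        for word in INDEX.get(gram, ()):
            hits[word] += 1
    coarse = sorted(hits, key=lambda w: (-hits[w], w))[:top_n]

    # Step 2: rerank by pinyin edit distance, then shared characters,
    # then popularity.
    def rank_key(word):
        pinyin, popularity = LEXICON[word]
        shared = len(set(query) & set(word))
        return (edit_distance(q_pinyin, pinyin), -shared, -popularity)

    return sorted(coarse, key=rank_key)


print(suggest("背朵分"))
```

With this toy data, both 背朵分 and 背多分 map to the pinyin beiduofen, so 贝多芬 comes out first. For the query 十度, 十渡 and 是度 have the same pinyin and each shares one character with the query, so here the popularity count decides between them. A real system would need a proper pinyin converter and a weighted score instead of the simple sort key.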