Hi Floyd,

Typically you'd do this by creating a custom analyzer that

- segments Chinese text into words
- converts from words to pinyin or zhuyin

Your index would have both the actual Hanzi characters, plus (via copyField) this phonetic version. During search, you'd use dismax to query against both fields, with a higher weighting for the Hanzi field.

But segmentation can be error-prone, and requires embedding specialized code that you typically license (for high-quality results) from a commercial vendor.

So my first-cut approach would be to use the current synonym support to map each Hanzi character to all of its possible pronunciations. There are numerous open source datasets that contain this information. Note that there might be performance issues with having such a huge set of synonyms.

Then, by weighting phrase matches sufficiently high (again using dismax), I think you could get reasonable results.
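To make the first option concrete, here's a rough sketch of what the schema side might look like. The field names are made up, and the pinyin filter is a placeholder (stock Solr doesn't ship one), so you'd plug in whatever segmentation/transliteration code you wind up with:

    <!-- schema.xml fragment (sketch only) -->

    <!-- Hanzi field: StandardTokenizer splits CJK text into single-character tokens -->
    <fieldType name="text_cjk" class="solr.TextField">
      <analyzer>
        <tokenizer class="solr.StandardTokenizerFactory"/>
      </analyzer>
    </fieldType>

    <!-- Phonetic field: same tokenization, then a hypothetical filter that
         rewrites each token as its pinyin reading -->
    <fieldType name="text_pinyin" class="solr.TextField">
      <analyzer>
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="com.example.PinyinTokenFilterFactory"/>
      </analyzer>
    </fieldType>

    <field name="title"        type="text_cjk"    indexed="true" stored="true"/>
    <field name="title_pinyin" type="text_pinyin" indexed="true" stored="false"/>

    <copyField source="title" dest="title_pinyin"/>

Queries would then go through dismax with something like qf=title^10 title_pinyin^2 (the boosts are arbitrary), so exact Hanzi matches win over purely phonetic ones.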
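For the synonym-based first cut, you'd keep the same copyField setup but swap the phonetic field's analyzer for the stock SynonymFilterFactory. The synonyms.txt entries and the tone-numbered pinyin labels below are invented examples; you'd generate the real file from one of the open source Hanzi-to-pinyin datasets:

    # synonyms.txt (one rule per character; characters that share a
    # pronunciation end up sharing index terms)
    貝 => bei4
    背 => bei1, bei4
    多 => duo1
    芬 => fen1
    分 => fen1, fen4

    <!-- schema.xml: replacement analyzer for the phonetic field -->
    <fieldType name="text_pinyin" class="solr.TextField">
      <analyzer>
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"/>
      </analyzer>
    </fieldType>

With that in place, "貝多芬" and "背多分" end up sharing the terms bei4 / duo1 / fen1 in the phonetic field, so a dismax request along these lines (with the query text URL-encoded in practice) should pull up the right documents, and exact Hanzi and phrase matches score highest:

    /select?defType=dismax
           &q=背多分
           &qf=title^10 title_pinyin
           &pf=title^20 title_pinyin^5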
-- Ken

On Oct 21, 2011, at 7:33am, Floyd Wu wrote:

> Does anybody know how to implement this idea in Solr? Please kindly
> point me in a direction.
>
> For example, a user enters the keyword "貝多芬" (this is Beethoven in
> Chinese), but keys in a wrong combination of characters, "背多分"
> (which has the same pronunciation as the previous keyword "貝多芬").
>
> The token "貝多芬" actually exists in the Solr index. How can I hit
> documents where "貝多芬" exists when "背多分" is entered?
>
> This is a basic function of commercial search engines, especially for
> Chinese processing. I wonder how to implement it in Solr, and where
> the starting point is.
>
> Floyd

--------------------------
Ken Krugler
+1 530-210-6378
http://bixolabs.com
custom big data solutions & training
Hadoop, Cascading, Mahout & Solr