Hi Floyd,

Typically you'd do this by creating a custom analyzer that

 - segments Chinese text into words
 - converts those words to pinyin or zhuyin

Your index would then contain both the actual Hanzi text and (via copyField) 
this phonetic version.
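
For example (just a sketch, with made-up field and type names; Solr doesn't 
ship a pinyin filter, so com.example.PinyinFilterFactory below is only a 
stand-in for whatever transliteration code you plug in, while the smartcn 
factories come from the analysis-extras contrib):

    <!-- schema.xml -->
    <fieldType name="text_cn" class="solr.TextField">
      <analyzer>
        <tokenizer class="solr.SmartChineseSentenceTokenizerFactory"/>
        <filter class="solr.SmartChineseWordTokenFilterFactory"/>
      </analyzer>
    </fieldType>

    <fieldType name="text_pinyin" class="solr.TextField">
      <analyzer>
        <tokenizer class="solr.SmartChineseSentenceTokenizerFactory"/>
        <filter class="solr.SmartChineseWordTokenFilterFactory"/>
        <!-- hypothetical filter that converts each word token to pinyin -->
        <filter class="com.example.PinyinFilterFactory"/>
      </analyzer>
    </fieldType>

    <field name="content_cn"     type="text_cn"     indexed="true" stored="true"/>
    <field name="content_pinyin" type="text_pinyin" indexed="true" stored="false"/>
    <copyField source="content_cn" dest="content_pinyin"/>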

During search, you'd use dismax to search against both fields, with a higher 
weighting on the Hanzi field.
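
Roughly like this (using the made-up field names from the schema sketch above):

    q=貝多芬
    &defType=dismax
    &qf=content_cn^3 content_pinyin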

But segmentation can be error-prone, and it requires embedding specialized code 
that you typically have to license (for high-quality results) from a commercial 
vendor.

So my first-cut approach would be to use the current synonym support to map 
each Hanzi character to all of its possible pronunciations. There are numerous 
open-source datasets that contain this information. Note that there might be 
performance issues with having such a huge set of synonyms.
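
Something along these lines (the file name, type name and readings are just 
for illustration, and you'd generate the full mapping from one of those 
datasets; this type would back the phonetic copyField target in place of the 
custom analyzer sketched above):

    # hanzi_pinyin.txt: each character maps to its possible reading(s)
    貝 => bei4
    背 => bei4, bei1
    多 => duo1
    芬 => fen1
    分 => fen1, fen4

    <!-- StandardTokenizer emits each Hanzi as its own token; the synonym
         filter then replaces it with its reading(s) -->
    <fieldType name="text_pinyin" class="solr.TextField">
      <analyzer>
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="solr.SynonymFilterFactory" synonyms="hanzi_pinyin.txt"/>
      </analyzer>
    </fieldType>

With that, both 貝多芬 and 背多分 end up as bei4 duo1 fen1 in the phonetic 
field, so the mistyped query still matches, while the untouched Hanzi field 
keeps exact matches scoring higher.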

Then, by weighting phrase matches sufficiently high (again using dismax), I 
think you could get reasonable results.
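
For example (same made-up field names as above, with pf boosting phrase 
matches):

    q=背多分
    &defType=dismax
    &qf=content_cn^3 content_pinyin
    &pf=content_cn^10 content_pinyin^5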

-- Ken
 
On Oct 21, 2011, at 7:33am, Floyd Wu wrote:

> Does anybody know how to implement this idea in Solr? Please kindly
> point me in the right direction.
> 
> For example, a user wants to search for the keyword "貝多芬" (this is
> Beethoven in Chinese), but keys in a wrong combination of characters,
> "背多分" (which is pronounced the same as the intended keyword "貝多芬").
> 
> The token "貝多芬" actually exists in the Solr index. How can I hit
> documents containing "貝多芬" when "背多分" is entered?
> 
> This is a basic function of commercial search engines, especially for
> Chinese processing. I wonder how to implement it in Solr, and where
> the starting point is.
> 
> Floyd

--------------------------
Ken Krugler
+1 530-210-6378
http://bixolabs.com
custom big data solutions & training
Hadoop, Cascading, Mahout & Solr


