Re: Question regarding searching Chinese characters

Alexandre Rafalovitch Fri, 20 Jul 2018 04:42:02 -0700

This is probably the place to start, if you have not read it already:
https://lucene.apache.org/solr/guide/7_4/language-analysis.html


Otherwise, I think your answer would be somewhere around using ICU4J,
IBM's library for dealing with Unicode: http://site.icu-project.org/
(mentioned on the same page above)
Specifically, transformations:
http://userguide.icu-project.org/transforms/general

With that, maybe you can map both scripts into Latin. I did that once
for Thai for a demo:
https://github.com/arafalov/solr-thai-test/blob/master/collection1/conf/schema.xml#L34
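
For the traditional/simplified case specifically, ICU also ships a
"Traditional-Simplified" transform, so another option may be to fold
everything to simplified characters at both index and query time
instead of going through Latin. A minimal, untested sketch of such a
field type follows. It assumes the ICU analysis-extras jars are on the
classpath; the name text_zh_mixed is made up for illustration, and the
choice of tokenizer and bigram filter is only a guess at a reasonable
chain:

```xml
<!-- Hypothetical field type; the name "text_zh_mixed" is illustrative only. -->
<fieldType name="text_zh_mixed" class="solr.TextField" positionIncrementGap="100">
  <!-- A single <analyzer> applies to both indexing and querying,
       so documents and queries are normalized the same way. -->
  <analyzer>
    <tokenizer class="solr.ICUTokenizerFactory"/>
    <!-- Fold traditional characters to their simplified forms. -->
    <filter class="solr.ICUTransformFilterFactory" id="Traditional-Simplified"/>
    <!-- Index overlapping character pairs, a common approach for CJK text. -->
    <filter class="solr.CJKBigramFilterFactory"/>
  </analyzer>
</fieldType>
```

With a chain like this, a phrase that mixes the two forms should end up
as the same tokens as its all-simplified equivalent, but you would need
to check that in the Analysis screen against your own documents.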

The challenge is to figure out all the magic rules for that. You'd
have to dig through the ICU documentation and other web pages. I found
this one for example:
http://avajava.com/tutorials/lessons/what-are-the-system-transliterators-available-with-icu4j.html;jsessionid=BEAB0AF05A588B97B8A2393054D908C0

There is also a 12-part series on Solr and Asian text processing, though
it is a bit old now: http://discovery-grindstone.blogspot.com/

Hope one of these helps.

Regards,
   Alex.


On 20 July 2018 at 03:54, Amanda Shuman <amanda.shu...@gmail.com> wrote:
> Hi all,
>
> We have a problem. Some of our historical documents have mixed together
> simplified and traditional Chinese characters. There seems to be no problem when
> searching either traditional or simplified separately - that is, if a
> particular string/phrase is all in traditional or simplified, it finds it -
> but it does not find the string/phrase if the two different characters (one
> traditional, one simplified) are mixed together in the SAME string/phrase.
>
> Has anyone ever handled this problem before? I know some libraries seem to
> have implemented something that seems to be able to handle this, but I'm
> not sure how they did so!
>
> Amanda
> ------
> Dr. Amanda Shuman
> Post-doc researcher, University of Freiburg, The Maoist Legacy Project
> <http://www.maoistlegacy.uni-freiburg.de/>
> PhD, University of California, Santa Cruz
> http://www.amandashuman.net/
> http://www.prchistoryresources.org/
> Office: +49 (0) 761 203 4925
