This is probably the place to start, if you have not read it already: https://lucene.apache.org/solr/guide/7_4/language-analysis.html
Otherwise, I think your answer lies somewhere around ICU4J, IBM's library for dealing with Unicode (mentioned on the same page above):
http://site.icu-project.org/

Specifically, its transforms:
http://userguide.icu-project.org/transforms/general

With those, maybe you can map both scripts into Latin. I did that once for Thai for a demo:
https://github.com/arafalov/solr-thai-test/blob/master/collection1/conf/schema.xml#L34

The challenge is figuring out all the magic rules for that. You would have to dig through the ICU documentation and other web pages. For example, I found this list of the system transliterators available with ICU4J:
http://avajava.com/tutorials/lessons/what-are-the-system-transliterators-available-with-icu4j.html;jsessionid=BEAB0AF05A588B97B8A2393054D908C0

There is also a 12-part series on Solr and Asian text processing, though it is a bit old now:
http://discovery-grindstone.blogspot.com/

Hope one of these helps.

Regards,
   Alex.

On 20 July 2018 at 03:54, Amanda Shuman <amanda.shu...@gmail.com> wrote:
> Hi all,
>
> We have a problem. Some of our historical documents have mixed together
> simplified and traditional Chinese characters. There seems to be no
> problem when searching either traditional or simplified separately - that
> is, if a particular string/phrase is all in traditional or simplified, it
> finds it - but it does not find the string/phrase if the two different
> characters (one traditional, one simplified) are mixed together in the
> SAME string/phrase.
>
> Has anyone ever handled this problem before? I know some libraries seem to
> have implemented something that seems to be able to handle this, but I'm
> not sure how they did so!
>
> Amanda
> ------
> Dr. Amanda Shuman
> Post-doc researcher, University of Freiburg, The Maoist Legacy Project
> <http://www.maoistlegacy.uni-freiburg.de/>
> PhD, University of California, Santa Cruz
> http://www.amandashuman.net/
> http://www.prchistoryresources.org/
> Office: +49 (0) 761 203 4925
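
To make the transform idea from the reply concrete, here is a minimal sketch of why folding both the indexed text and the query into one common form makes mixed strings searchable. It is only an illustration, not ICU and not Solr: the four-entry mapping table and the names TRAD_TO_SIMP, normalize and matches are all made up for this example. As far as I know, ICU ships a transform with the ID "Traditional-Simplified" that plays the role of this table; treat that as something to verify against the ICU documentation rather than as a tested recipe.

```python
# Toy illustration only: a real deployment would rely on an ICU transform
# applied at both index and query time, not on a hand-made table.

# Hypothetical, deliberately tiny traditional -> simplified mapping.
TRAD_TO_SIMP = str.maketrans({
    "國": "国",
    "學": "学",
    "東": "东",
    "澤": "泽",
})


def normalize(text: str) -> str:
    """Fold the traditional characters listed in the table to simplified."""
    return text.translate(TRAD_TO_SIMP)


def matches(query: str, document: str) -> bool:
    """Substring match after folding both sides into the same script."""
    return normalize(query) in normalize(document)


if __name__ == "__main__":
    # Mixed string: 泽 is simplified, 東 is traditional.
    document = "毛泽東思想"
    print("毛澤東" in document)          # raw match, all-traditional query
    print(matches("毛澤東", document))   # folded match, all-traditional query
    print(matches("毛泽东", document))   # folded match, all-simplified query
```

The point is that the same folding has to run on both sides. In Solr terms that would mean putting the transform in both the index and the query analyzer chains of the field type.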