Looks like we need a charfilter version of the ICU transforms. That could run before the tokenizer.

I’ve never built a charfilter, but it seems like this would be a good first project for someone who wants to contribute.

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/ (my blog)
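A minimal sketch of what that first project might look like: a CharFilter that runs an ICU Transliterator over the character stream before the tokenizer sees it. The class name is hypothetical, it assumes icu4j and lucene-core on the classpath, and it naively buffers the whole input instead of transliterating incrementally with offset correction, which a real contribution would have to handle (the factory for schema use is also omitted).

  import java.io.IOException;
  import java.io.Reader;

  import com.ibm.icu.text.Transliterator;
  import org.apache.lucene.analysis.CharFilter;

  // Hypothetical sketch: applies an ICU transform (e.g. "Traditional-Simplified")
  // to the entire input before tokenization.
  public class ICUTransformCharFilter extends CharFilter {

    private final StringBuilder transformed = new StringBuilder();
    private int pos = 0;

    public ICUTransformCharFilter(Reader input, String transliteratorId) throws IOException {
      super(input);
      // Naive: slurp the whole underlying reader into memory.
      StringBuilder raw = new StringBuilder();
      char[] buf = new char[1024];
      int n;
      while ((n = input.read(buf)) != -1) {
        raw.append(buf, 0, n);
      }
      Transliterator t = Transliterator.getInstance(transliteratorId);
      transformed.append(t.transliterate(raw.toString()));
    }

    @Override
    public int read(char[] cbuf, int off, int len) {
      if (pos >= transformed.length()) return -1;
      int n = Math.min(len, transformed.length() - pos);
      transformed.getChars(pos, pos + n, cbuf, off);
      pos += n;
      return n;
    }

    @Override
    protected int correct(int currentOff) {
      // Identity mapping is only safe for length-preserving transforms such as
      // the 1:1 Han mappings; anything else needs a real offset map.
      return currentOff;
    }
  }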
> On Jul 20, 2018, at 8:24 AM, Tomoko Uchida <tomoko.uchida.1...@gmail.com> wrote:
>
> Exactly. More concretely, the starting point is to replace your analyzer
>
> <analyzer class="org.apache.lucene.analysis.cn.smart.SmartChineseAnalyzer"/>
>
> with
>
> <analyzer>
>   <tokenizer class="solr.HMMChineseTokenizerFactory"/>
>   <filter class="solr.ICUTransformFilterFactory" id="Traditional-Simplified"/>
> </analyzer>
>
> and see if the results are as expected. Then look into other filters if your requirements are not met.
>
> Just a reminder: HMMChineseTokenizerFactory does not handle traditional characters, as I noted in a previous post, so ICUTransformFilterFactory is an incomplete workaround.
>
> On Sat, Jul 21, 2018 at 0:05, Walter Underwood <wun...@wunderwood.org> wrote:
>
>> I expect that this is the line that does the transformation:
>>
>> <filter class="solr.ICUTransformFilterFactory" id="Traditional-Simplified"/>
>>
>> This mapping is a standard feature of ICU. More info on ICU transforms is in this doc, though there is not much detail on this particular transform:
>>
>> http://userguide.icu-project.org/transforms/general
>>
>> wunder
>> Walter Underwood
>> wun...@wunderwood.org
>> http://observer.wunderwood.org/ (my blog)
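For anyone who wants to see that mapping outside Solr: the filter is a thin wrapper around an ICU4J Transliterator, which can be exercised directly (a minimal sketch, assuming icu4j is on the classpath):

  import com.ibm.icu.text.Transliterator;

  public class TraditionalSimplifiedDemo {
    public static void main(String[] args) {
      // Same transform id that ICUTransformFilterFactory takes.
      Transliterator t = Transliterator.getInstance("Traditional-Simplified");
      System.out.println(t.transliterate("舊小說")); // expected: 旧小说
    }
  }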
>>> On Jul 20, 2018, at 7:43 AM, Susheel Kumar <susheel2...@gmail.com> wrote:
>>>
>>> I think so. I used it exactly as on GitHub:
>>>
>>> <fieldType name="text_cjk" class="solr.TextField"
>>>     positionIncrementGap="10000" autoGeneratePhraseQueries="false">
>>>   <analyzer>
>>>     <tokenizer class="solr.ICUTokenizerFactory"/>
>>>     <filter class="solr.CJKWidthFilterFactory"/>
>>>     <filter class="edu.stanford.lucene.analysis.CJKFoldingFilterFactory"/>
>>>     <filter class="solr.ICUTransformFilterFactory" id="Traditional-Simplified"/>
>>>     <filter class="solr.ICUTransformFilterFactory" id="Katakana-Hiragana"/>
>>>     <filter class="solr.ICUFoldingFilterFactory"/>
>>>     <filter class="solr.CJKBigramFilterFactory" han="true" hiragana="true"
>>>         katakana="true" hangul="true" outputUnigrams="true"/>
>>>   </analyzer>
>>> </fieldType>
>>>
>>> On Fri, Jul 20, 2018 at 10:12 AM, Amanda Shuman <amanda.shu...@gmail.com> wrote:
>>>
>>>> Thanks! That does indeed look promising... This can be added on top of Smart Chinese, right? Or is it an alternative?
>>>>
>>>> ------
>>>> Dr. Amanda Shuman
>>>> Post-doc researcher, University of Freiburg, The Maoist Legacy Project
>>>> <http://www.maoistlegacy.uni-freiburg.de/>
>>>> PhD, University of California, Santa Cruz
>>>> http://www.amandashuman.net/
>>>> http://www.prchistoryresources.org/
>>>> Office: +49 (0) 761 203 4925
>>>>
>>>> On Fri, Jul 20, 2018 at 3:11 PM, Susheel Kumar <susheel2...@gmail.com> wrote:
>>>>
>>>>> I think CJKFoldingFilter will work for you. I put 舊小說 in the index and then each of A, B, C, or D in the query, and they all seem to match; CJKFF is transforming 舊 to 旧.
>>>>>
>>>>> On Fri, Jul 20, 2018 at 9:08 AM, Susheel Kumar <susheel2...@gmail.com> wrote:
>>>>>
>>>>>> I lack Chinese language knowledge, but if you want I can do a quick test for you in the Analysis tab if you give me what to put in the index and query windows...
>>>>>>
>>>>>> On Fri, Jul 20, 2018 at 8:59 AM, Susheel Kumar <susheel2...@gmail.com> wrote:
>>>>>>
>>>>>>> Have you tried CJKFoldingFilter? https://github.com/sul-dlss/CJKFoldingFilter
>>>>>>> I am not sure whether it covers your use case, but I am using this filter and so far have had no issues.
>>>>>>>
>>>>>>> Thnx
>>>>>>>
>>>>>>> On Fri, Jul 20, 2018 at 8:44 AM, Amanda Shuman <amanda.shu...@gmail.com> wrote:
>>>>>>>
>>>>>>>> Thanks, Alex - I have seen a few of those links but never considered transliteration! We use Lucene's Smart Chinese analyzer. The issue is basically what is laid out in the old blogspot post, namely this point:
>>>>>>>>
>>>>>>>> "Why approach CJK resource discovery differently?
>>>>>>>>
>>>>>>>> 2. Search results must be as script agnostic as possible.
>>>>>>>>
>>>>>>>> There is more than one way to write each word. "Simplified" characters were emphasized for printed materials in mainland China starting in the 1950s; "Traditional" characters were used in printed materials prior to the 1950s, and are still used in Taiwan, Hong Kong and Macau today. Since the characters are distinct, it's as if Chinese materials are written in two scripts.
>>>>>>>> Another way to think about it: every written Chinese word has at least two completely different spellings. And it can be mix-n-match: a word can be written with one traditional and one simplified character.
>>>>>>>> Example: Given a user query 舊小說 (traditional for old fiction), the results should include matches for 舊小說 (traditional) and 旧小说 (simplified characters for old fiction)."
>>>>>>>>
>>>>>>>> So, using the example provided above, we are dealing with materials produced in the 1950s-1970s that do even weirder things like:
>>>>>>>>
>>>>>>>> A. 舊小說
>>>>>>>>
>>>>>>>> can also be
>>>>>>>>
>>>>>>>> B. 旧小说 (all simplified)
>>>>>>>> or
>>>>>>>> C. 旧小說 (first character simplified, last character traditional)
>>>>>>>> or
>>>>>>>> D. 舊小说 (first character traditional, last character simplified)
>>>>>>>>
>>>>>>>> Thankfully the middle character was never simplified in recent times.
>>>>>>>>
>>>>>>>> From a historical standpoint, the mixed nature of the characters in the same word/phrase is because not all simplified characters were adopted at the same time by everyone uniformly (good times...).
>>>>>>>>
>>>>>>>> The problem seems to be that Solr can easily handle A or B above, but NOT C or D, using the Smart Chinese analyzer. I'm not really sure how to change that at this point... maybe I should figure out how to contact the creators of the analyzer and ask them?
>>>>>>>>
>>>>>>>> Amanda
>>>>>>>>
>>>>>>>> ------
>>>>>>>> Dr. Amanda Shuman
>>>>>>>> Post-doc researcher, University of Freiburg, The Maoist Legacy Project
>>>>>>>> <http://www.maoistlegacy.uni-freiburg.de/>
>>>>>>>> PhD, University of California, Santa Cruz
>>>>>>>> http://www.amandashuman.net/
>>>>>>>> http://www.prchistoryresources.org/
>>>>>>>> Office: +49 (0) 761 203 4925
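To make Amanda's C and D cases concrete, here is a minimal sketch of the Traditional-Simplified token-filter approach run over all four spellings. It uses KeywordTokenizer purely to keep the example short (a real field would use HMMChineseTokenizerFactory or ICUTokenizerFactory) and assumes icu4j and lucene-analyzers-icu on the classpath:

  import java.io.StringReader;

  import com.ibm.icu.text.Transliterator;
  import org.apache.lucene.analysis.TokenStream;
  import org.apache.lucene.analysis.core.KeywordTokenizer;
  import org.apache.lucene.analysis.icu.ICUTransformFilter;
  import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

  public class MixedScriptDemo {
    public static void main(String[] args) throws Exception {
      Transliterator trad2simp = Transliterator.getInstance("Traditional-Simplified");
      // Amanda's four spellings of "old fiction".
      String[] variants = { "舊小說", "旧小说", "旧小說", "舊小说" };
      for (String v : variants) {
        KeywordTokenizer tok = new KeywordTokenizer();
        tok.setReader(new StringReader(v));
        TokenStream ts = new ICUTransformFilter(tok, trad2simp);
        CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
        ts.reset();
        while (ts.incrementToken()) {
          // All four variants should print the same simplified form: 旧小说
          System.out.println(v + " -> " + term);
        }
        ts.end();
        ts.close();
      }
    }
  }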
>>>>>>>> On Fri, Jul 20, 2018 at 1:40 PM, Alexandre Rafalovitch <arafa...@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> This is probably your start, if not read already:
>>>>>>>>> https://lucene.apache.org/solr/guide/7_4/language-analysis.html
>>>>>>>>>
>>>>>>>>> Otherwise, I think your answer would be somewhere around using ICU4J, IBM's library for dealing with Unicode: http://site.icu-project.org/ (mentioned on the same page above). Specifically, transformations: http://userguide.icu-project.org/transforms/general
>>>>>>>>>
>>>>>>>>> With that, maybe you map both alphabets into Latin. I did that once for Thai for a demo:
>>>>>>>>> https://github.com/arafalov/solr-thai-test/blob/master/collection1/conf/schema.xml#L34
>>>>>>>>>
>>>>>>>>> The challenge is to figure out all the magic rules for that. You'd have to dig through the ICU documentation and other web pages. I found this one, for example:
>>>>>>>>> http://avajava.com/tutorials/lessons/what-are-the-system-transliterators-available-with-icu4j.html
>>>>>>>>>
>>>>>>>>> There is also a 12-part series on Solr and Asian text processing, though it is a bit old now: http://discovery-grindstone.blogspot.com/
>>>>>>>>>
>>>>>>>>> Hope one of these things helps.
>>>>>>>>>
>>>>>>>>> Regards,
>>>>>>>>>    Alex.
>>>>>>>>>
>>>>>>>>> On 20 July 2018 at 03:54, Amanda Shuman <amanda.shu...@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>>> Hi all,
>>>>>>>>>>
>>>>>>>>>> We have a problem. Some of our historical documents mix simplified and traditional Chinese characters. There seems to be no problem when searching either traditional or simplified separately - that is, if a particular string/phrase is all in traditional or all in simplified, it finds it - but it does not find the string/phrase if the two different characters (one traditional, one simplified) are mixed together in the SAME string/phrase.
>>>>>>>>>>
>>>>>>>>>> Has anyone ever handled this problem before? I know some libraries seem to have implemented something that seems to be able to handle this, but I'm not sure how they did so!
>>>>>>>>>>
>>>>>>>>>> Amanda
>>>>>>>>>> ------
>>>>>>>>>> Dr. Amanda Shuman
>>>>>>>>>> Post-doc researcher, University of Freiburg, The Maoist Legacy Project
>>>>>>>>>> <http://www.maoistlegacy.uni-freiburg.de/>
>>>>>>>>>> PhD, University of California, Santa Cruz
>>>>>>>>>> http://www.amandashuman.net/
>>>>>>>>>> http://www.prchistoryresources.org/
>>>>>>>>>> Office: +49 (0) 761 203 4925
>
> --
> Tomoko Uchida
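A footnote on Alexandre's map-both-scripts-to-Latin idea: ICU also ships a Han-Latin transform that renders both traditional and simplified characters as Pinyin, so mixed-script strings collapse to a single Latin spelling. A minimal sketch (assuming icu4j; note that Pinyin homophones collapse too, which may or may not be acceptable for discovery):

  import com.ibm.icu.text.Transliterator;

  public class HanLatinDemo {
    public static void main(String[] args) {
      Transliterator t = Transliterator.getInstance("Han-Latin");
      // Both spellings should come out as the same Pinyin, e.g. "jiù xiǎo shuō".
      System.out.println(t.transliterate("舊小說"));
      System.out.println(t.transliterate("旧小说"));
    }
  }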