Would ICUNormalizer2CharFilterFactory do? Or at least serve as a template of what needs to be done.
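For context, ICUNormalizer2CharFilterFactory is an existing charfilter in the ICU analysis module, and it plugs in ahead of the tokenizer like this. A minimal sketch (the `name`/`mode` values shown are just the documented defaults, and the tokenizer choice is illustrative):

```xml
<analyzer>
  <!-- charFilters run before the tokenizer, so a transform applied here
       could normalize characters the tokenizer would otherwise mis-split -->
  <charFilter class="solr.ICUNormalizer2CharFilterFactory"
              name="nfkc_cf" mode="compose"/>
  <tokenizer class="solr.HMMChineseTokenizerFactory"/>
</analyzer>
```

A charfilter version of the ICU transforms would presumably occupy the same `<charFilter .../>` slot.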
Regards,
Alex.

On 20 July 2018 at 12:40, Walter Underwood <wun...@wunderwood.org> wrote:
> Looks like we need a charfilter version of the ICU transforms. That
> could run before the tokenizer.
>
> I've never built a charfilter, but it seems like this would be a good
> first project for someone who wants to contribute.
>
> wunder
> Walter Underwood
> wun...@wunderwood.org
> http://observer.wunderwood.org/ (my blog)
>
>> On Jul 20, 2018, at 8:24 AM, Tomoko Uchida
>> <tomoko.uchida.1...@gmail.com> wrote:
>>
>> Exactly. More concretely, the starting point is replacing your analyzer
>>
>> <analyzer class="org.apache.lucene.analysis.cn.smart.SmartChineseAnalyzer"/>
>>
>> with
>>
>> <analyzer>
>>   <tokenizer class="solr.HMMChineseTokenizerFactory"/>
>>   <filter class="solr.ICUTransformFilterFactory" id="Traditional-Simplified"/>
>> </analyzer>
>>
>> and seeing if the results are as expected. Then research other filters
>> if your requirements are not met.
>>
>> Just a reminder: HMMChineseTokenizerFactory does not handle traditional
>> characters, as I noted in a previous post, so ICUTransformFilterFactory
>> is an incomplete workaround.
>>
>> On Sat, Jul 21, 2018 at 0:05, Walter Underwood <wun...@wunderwood.org> wrote:
>>
>>> I expect that this is the line that does the transformation:
>>>
>>> <filter class="solr.ICUTransformFilterFactory" id="Traditional-Simplified"/>
>>>
>>> This mapping is a standard feature of ICU. More info on ICU transforms
>>> is in this doc, though not much detail on this particular transform.
>>>
>>> http://userguide.icu-project.org/transforms/general
>>>
>>> wunder
>>> Walter Underwood
>>> wun...@wunderwood.org
>>> http://observer.wunderwood.org/ (my blog)
>>>
>>>> On Jul 20, 2018, at 7:43 AM, Susheel Kumar <susheel2...@gmail.com> wrote:
>>>>
>>>> I think so.
>>>> I used the exact one from github:
>>>>
>>>> <fieldType name="text_cjk" class="solr.TextField"
>>>>     positionIncrementGap="10000" autoGeneratePhraseQueries="false">
>>>>   <analyzer>
>>>>     <tokenizer class="solr.ICUTokenizerFactory"/>
>>>>     <filter class="solr.CJKWidthFilterFactory"/>
>>>>     <filter class="edu.stanford.lucene.analysis.CJKFoldingFilterFactory"/>
>>>>     <filter class="solr.ICUTransformFilterFactory" id="Traditional-Simplified"/>
>>>>     <filter class="solr.ICUTransformFilterFactory" id="Katakana-Hiragana"/>
>>>>     <filter class="solr.ICUFoldingFilterFactory"/>
>>>>     <filter class="solr.CJKBigramFilterFactory" han="true" hiragana="true"
>>>>         katakana="true" hangul="true" outputUnigrams="true"/>
>>>>   </analyzer>
>>>> </fieldType>
>>>>
>>>> On Fri, Jul 20, 2018 at 10:12 AM, Amanda Shuman
>>>> <amanda.shu...@gmail.com> wrote:
>>>>
>>>>> Thanks! That does indeed look promising... This can be added on top
>>>>> of Smart Chinese, right? Or is it an alternative?
>>>>>
>>>>> ------
>>>>> Dr. Amanda Shuman
>>>>> Post-doc researcher, University of Freiburg, The Maoist Legacy Project
>>>>> <http://www.maoistlegacy.uni-freiburg.de/>
>>>>> PhD, University of California, Santa Cruz
>>>>> http://www.amandashuman.net/
>>>>> http://www.prchistoryresources.org/
>>>>> Office: +49 (0) 761 203 4925
>>>>>
>>>>> On Fri, Jul 20, 2018 at 3:11 PM, Susheel Kumar
>>>>> <susheel2...@gmail.com> wrote:
>>>>>
>>>>>> I think CJKFoldingFilter will work for you. I put 舊小說 in the index
>>>>>> and then each of A, B, C or D in the query, and they seem to be
>>>>>> matching; CJKFoldingFilter is transforming the 舊 to 旧.
>>>>>>
>>>>>> On Fri, Jul 20, 2018 at 9:08 AM, Susheel Kumar
>>>>>> <susheel2...@gmail.com> wrote:
>>>>>>
>>>>>>> I lack Chinese language knowledge, but if you want, I can do a
>>>>>>> quick test for you in the Analysis tab if you give me what to put
>>>>>>> in the index and query windows...
>>>>>>> On Fri, Jul 20, 2018 at 8:59 AM, Susheel Kumar
>>>>>>> <susheel2...@gmail.com> wrote:
>>>>>>>
>>>>>>>> Have you tried CJKFoldingFilter?
>>>>>>>> https://github.com/sul-dlss/CJKFoldingFilter
>>>>>>>> I am not sure whether it covers your use case, but I am using this
>>>>>>>> filter and have had no issues so far.
>>>>>>>>
>>>>>>>> Thnx
>>>>>>>>
>>>>>>>> On Fri, Jul 20, 2018 at 8:44 AM, Amanda Shuman
>>>>>>>> <amanda.shu...@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> Thanks, Alex - I have seen a few of those links but never
>>>>>>>>> considered transliteration! We use Lucene's Smart Chinese
>>>>>>>>> analyzer. The issue is basically what is laid out in the old
>>>>>>>>> blogspot post, namely this point:
>>>>>>>>>
>>>>>>>>> "Why approach CJK resource discovery differently?
>>>>>>>>>
>>>>>>>>> 2. Search results must be as script agnostic as possible.
>>>>>>>>>
>>>>>>>>> There is more than one way to write each word. "Simplified"
>>>>>>>>> characters were emphasized for printed materials in mainland
>>>>>>>>> China starting in the 1950s; "Traditional" characters were used
>>>>>>>>> in printed materials prior to the 1950s, and are still used in
>>>>>>>>> Taiwan, Hong Kong and Macau today.
>>>>>>>>> Since the characters are distinct, it's as if Chinese materials
>>>>>>>>> are written in two scripts.
>>>>>>>>> Another way to think about it: every written Chinese word has at
>>>>>>>>> least two completely different spellings. And it can be
>>>>>>>>> mix-n-match: a word can be written with one traditional and one
>>>>>>>>> simplified character.
>>>>>>>>> Example: Given a user query 舊小說 (traditional for old fiction),
>>>>>>>>> the results should include matches for 舊小說 (traditional) and
>>>>>>>>> 旧小说 (simplified characters for old fiction)"
>>>>>>>>>
>>>>>>>>> So, using the example provided above, we are dealing with
>>>>>>>>> materials produced in the 1950s-1970s that do even weirder
>>>>>>>>> things like:
>>>>>>>>>
>>>>>>>>> A. 舊小說
>>>>>>>>>
>>>>>>>>> can also be
>>>>>>>>>
>>>>>>>>> B. 旧小说 (all simplified)
>>>>>>>>> or
>>>>>>>>> C. 旧小說 (first character simplified, last character traditional)
>>>>>>>>> or
>>>>>>>>> D. 舊小说 (first character traditional, last character simplified)
>>>>>>>>>
>>>>>>>>> Thankfully the middle character was never simplified in recent
>>>>>>>>> times.
>>>>>>>>>
>>>>>>>>> From a historical standpoint, the mixed nature of the characters
>>>>>>>>> in the same word/phrase is because not all simplified characters
>>>>>>>>> were adopted at the same time by everyone uniformly (good
>>>>>>>>> times...).
>>>>>>>>>
>>>>>>>>> The problem seems to be that Solr can easily handle A or B above,
>>>>>>>>> but NOT C or D using the Smart Chinese analyzer. I'm not really
>>>>>>>>> sure how to change that at this point... maybe I should figure
>>>>>>>>> out how to contact the creators of the analyzer and ask them?
>>>>>>>>>
>>>>>>>>> Amanda
>>>>>>>>>
>>>>>>>>> ------
>>>>>>>>> Dr. Amanda Shuman
>>>>>>>>> Post-doc researcher, University of Freiburg, The Maoist Legacy
>>>>>>>>> Project <http://www.maoistlegacy.uni-freiburg.de/>
>>>>>>>>> PhD, University of California, Santa Cruz
>>>>>>>>> http://www.amandashuman.net/
>>>>>>>>> http://www.prchistoryresources.org/
>>>>>>>>> Office: +49 (0) 761 203 4925
>>>>>>>>>
>>>>>>>>> On Fri, Jul 20, 2018 at 1:40 PM, Alexandre Rafalovitch
>>>>>>>>> <arafa...@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>>> This is probably your start, if not read already:
>>>>>>>>>> https://lucene.apache.org/solr/guide/7_4/language-analysis.html
>>>>>>>>>>
>>>>>>>>>> Otherwise, I think your answer would be somewhere around using
>>>>>>>>>> ICU4J, IBM's library for dealing with Unicode:
>>>>>>>>>> http://site.icu-project.org/ (mentioned on the same page above)
>>>>>>>>>> Specifically, transformations:
>>>>>>>>>> http://userguide.icu-project.org/transforms/general
>>>>>>>>>>
>>>>>>>>>> With that, maybe you map both alphabets into Latin. I did that
>>>>>>>>>> once for Thai for a demo:
>>>>>>>>>> https://github.com/arafalov/solr-thai-test/blob/master/collection1/conf/schema.xml#L34
>>>>>>>>>>
>>>>>>>>>> The challenge is to figure out all the magic rules for that.
>>>>>>>>>> You'd have to dig through the ICU documentation and other web
>>>>>>>>>> pages. I found this one, for example:
>>>>>>>>>> http://avajava.com/tutorials/lessons/what-are-the-system-transliterators-available-with-icu4j.html
>>>>>>>>>>
>>>>>>>>>> There is also a 12-part series on Solr and Asian text
>>>>>>>>>> processing, though it is a bit old now:
>>>>>>>>>> http://discovery-grindstone.blogspot.com/
>>>>>>>>>>
>>>>>>>>>> Hope one of these things helps.
>>>>>>>>>>
>>>>>>>>>> Regards,
>>>>>>>>>> Alex.
>>>>>>>>>> On 20 July 2018 at 03:54, Amanda Shuman
>>>>>>>>>> <amanda.shu...@gmail.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> Hi all,
>>>>>>>>>>>
>>>>>>>>>>> We have a problem. Some of our historical documents mix
>>>>>>>>>>> together simplified and traditional Chinese characters. There
>>>>>>>>>>> seems to be no problem when searching either traditional or
>>>>>>>>>>> simplified separately - that is, if a particular string/phrase
>>>>>>>>>>> is all in traditional or all in simplified, it finds it - but
>>>>>>>>>>> it does not find the string/phrase if the two different
>>>>>>>>>>> characters (one traditional, one simplified) are mixed
>>>>>>>>>>> together in the SAME string/phrase.
>>>>>>>>>>>
>>>>>>>>>>> Has anyone ever handled this problem before? I know some
>>>>>>>>>>> libraries seem to have implemented something that can handle
>>>>>>>>>>> this, but I'm not sure how they did so!
>>>>>>>>>>>
>>>>>>>>>>> Amanda
>>>>>>>>>>> ------
>>>>>>>>>>> Dr. Amanda Shuman
>>>>>>>>>>> Post-doc researcher, University of Freiburg, The Maoist Legacy
>>>>>>>>>>> Project <http://www.maoistlegacy.uni-freiburg.de/>
>>>>>>>>>>> PhD, University of California, Santa Cruz
>>>>>>>>>>> http://www.amandashuman.net/
>>>>>>>>>>> http://www.prchistoryresources.org/
>>>>>>>>>>> Office: +49 (0) 761 203 4925
>>
>> --
>> Tomoko Uchida
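As an aside for readers of this thread: the reason the Traditional-Simplified transform fixes the mixed-script cases A-D above is that it is applied per character, at both index and query time, so every variant collapses to the same indexed term. A minimal sketch (the two-entry mapping below is only a toy stand-in for ICU's full Traditional-Simplified transliterator, limited to the characters in the thread's example):

```python
# Toy stand-in for ICU's Traditional-Simplified transform: map each
# traditional character to its simplified form, pass everything else through.
TRAD_TO_SIMP = {"舊": "旧", "說": "说"}  # 小 has no simplified variant


def normalize(text: str) -> str:
    """Fold traditional characters to simplified, character by character."""
    return "".join(TRAD_TO_SIMP.get(ch, ch) for ch in text)


# The four variants of "old fiction" from the thread (A, B, C, D):
variants = ["舊小說", "旧小说", "旧小說", "舊小说"]

# All four collapse to the single form 旧小说, so applying the same
# normalization at index time and query time lets any variant match any other.
print({normalize(v) for v in variants})
```

This is exactly the role ICUTransformFilterFactory with id="Traditional-Simplified" plays in the analyzer chains quoted above, with ICU supplying the complete character mapping.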