Looks like we need a charfilter version of the ICU transforms. That could run before the tokenizer.

I’ve never built a charfilter, but it seems like this would be a good first project for someone who wants to contribute.

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/ (my blog)
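A minimal sketch of what that first project might look like: a CharFilter that runs an ICU Transliterator over the character stream before the tokenizer sees it. The class name is hypothetical, it assumes icu4j and lucene-core on the classpath, and it naively buffers the whole input instead of transliterating incrementally with offset correction, which a real contribution would have to handle (the factory for schema use is also omitted).

  import java.io.IOException;
  import java.io.Reader;

  import com.ibm.icu.text.Transliterator;
  import org.apache.lucene.analysis.CharFilter;

  // Hypothetical sketch: applies an ICU transform (e.g. "Traditional-Simplified")
  // to the entire input before tokenization.
  public class ICUTransformCharFilter extends CharFilter {

    private final StringBuilder transformed = new StringBuilder();
    private int pos = 0;

    public ICUTransformCharFilter(Reader input, String transliteratorId) throws IOException {
      super(input);
      // Naive: slurp the whole underlying reader into memory.
      StringBuilder raw = new StringBuilder();
      char[] buf = new char[1024];
      int n;
      while ((n = input.read(buf)) != -1) {
        raw.append(buf, 0, n);
      }
      Transliterator t = Transliterator.getInstance(transliteratorId);
      transformed.append(t.transliterate(raw.toString()));
    }

    @Override
    public int read(char[] cbuf, int off, int len) {
      if (pos >= transformed.length()) return -1;
      int n = Math.min(len, transformed.length() - pos);
      transformed.getChars(pos, pos + n, cbuf, off);
      pos += n;
      return n;
    }

    @Override
    protected int correct(int currentOff) {
      // Identity mapping is only safe for length-preserving transforms such as
      // the 1:1 Han mappings; anything else needs a real offset map.
      return currentOff;
    }
  }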
> On Jul 20, 2018, at 8:24 AM, Tomoko Uchida <tomoko.uchida.1...@gmail.com> wrote:
>
> Exactly. More concretely, the starting point is to replace your analyzer
>
> <analyzer class="org.apache.lucene.analysis.cn.smart.SmartChineseAnalyzer"/>
>
> with
>
> <analyzer>
>   <tokenizer class="solr.HMMChineseTokenizerFactory"/>
>   <filter class="solr.ICUTransformFilterFactory" id="Traditional-Simplified"/>
> </analyzer>
>
> and see if the results are as expected. Then look into other filters if your requirements are not met.
>
> Just a reminder: HMMChineseTokenizerFactory does not handle traditional characters, as I noted in a previous post, so ICUTransformFilterFactory is an incomplete workaround.
>
> On Sat, Jul 21, 2018 at 0:05, Walter Underwood <wun...@wunderwood.org> wrote:
>
>> I expect that this is the line that does the transformation:
>>
>> <filter class="solr.ICUTransformFilterFactory" id="Traditional-Simplified"/>
>>
>> This mapping is a standard feature of ICU. More info on ICU transforms is in this doc, though there is not much detail on this particular transform:
>>
>> http://userguide.icu-project.org/transforms/general
>>
>> wunder
>> Walter Underwood
>> wun...@wunderwood.org
>> http://observer.wunderwood.org/ (my blog)
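For anyone who wants to see that mapping outside Solr: the filter is a thin wrapper around an ICU4J Transliterator, which can be exercised directly (a minimal sketch, assuming icu4j is on the classpath):

  import com.ibm.icu.text.Transliterator;

  public class TraditionalSimplifiedDemo {
    public static void main(String[] args) {
      // Same transform id that ICUTransformFilterFactory takes.
      Transliterator t = Transliterator.getInstance("Traditional-Simplified");
      System.out.println(t.transliterate("舊小說")); // expected: 旧小说
    }
  }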
>>> On Jul 20, 2018, at 7:43 AM, Susheel Kumar <susheel2...@gmail.com> wrote:
>>>
>>> I think so. I used it exactly as on GitHub:
>>>
>>> <fieldType name="text_cjk" class="solr.TextField"
>>>     positionIncrementGap="10000" autoGeneratePhraseQueries="false">
>>>   <analyzer>
>>>     <tokenizer class="solr.ICUTokenizerFactory"/>
>>>     <filter class="solr.CJKWidthFilterFactory"/>
>>>     <filter class="edu.stanford.lucene.analysis.CJKFoldingFilterFactory"/>
>>>     <filter class="solr.ICUTransformFilterFactory" id="Traditional-Simplified"/>
>>>     <filter class="solr.ICUTransformFilterFactory" id="Katakana-Hiragana"/>
>>>     <filter class="solr.ICUFoldingFilterFactory"/>
>>>     <filter class="solr.CJKBigramFilterFactory" han="true" hiragana="true"
>>>         katakana="true" hangul="true" outputUnigrams="true"/>
>>>   </analyzer>
>>> </fieldType>
>>>
>>> On Fri, Jul 20, 2018 at 10:12 AM, Amanda Shuman <amanda.shu...@gmail.com> wrote:
>>>
>>>> Thanks! That does indeed look promising... This can be added on top of Smart Chinese, right? Or is it an alternative?
>>>>
>>>> ------
>>>> Dr. Amanda Shuman
>>>> Post-doc researcher, University of Freiburg, The Maoist Legacy Project
>>>> <http://www.maoistlegacy.uni-freiburg.de/>
>>>> PhD, University of California, Santa Cruz
>>>> http://www.amandashuman.net/
>>>> http://www.prchistoryresources.org/
>>>> Office: +49 (0) 761 203 4925
>>>>
>>>> On Fri, Jul 20, 2018 at 3:11 PM, Susheel Kumar <susheel2...@gmail.com> wrote:
>>>>
>>>>> I think CJKFoldingFilter will work for you. I put 舊小說 in the index and then each of A, B, C, or D in the query, and they all seem to match; CJKFF is transforming 舊 to 旧.
>>>>>
>>>>> On Fri, Jul 20, 2018 at 9:08 AM, Susheel Kumar <susheel2...@gmail.com> wrote:
>>>>>
>>>>>> I lack Chinese language knowledge, but if you want I can do a quick test for you in the Analysis tab if you give me what to put in the index and query windows...
>>>>>>
>>>>>> On Fri, Jul 20, 2018 at 8:59 AM, Susheel Kumar <susheel2...@gmail.com> wrote:
>>>>>>
>>>>>>> Have you tried CJKFoldingFilter? https://github.com/sul-dlss/CJKFoldingFilter
>>>>>>> I am not sure whether it covers your use case, but I am using this filter and so far have had no issues.
>>>>>>>
>>>>>>> Thnx
>>>>>>>
>>>>>>> On Fri, Jul 20, 2018 at 8:44 AM, Amanda Shuman <amanda.shu...@gmail.com> wrote:
>>>>>>>
>>>>>>>> Thanks, Alex - I have seen a few of those links but never considered transliteration! We use Lucene's Smart Chinese analyzer. The issue is basically what is laid out in the old blogspot post, namely this point:
>>>>>>>>
>>>>>>>> "Why approach CJK resource discovery differently?
>>>>>>>>
>>>>>>>> 2. Search results must be as script agnostic as possible.
>>>>>>>>
>>>>>>>> There is more than one way to write each word. "Simplified" characters were emphasized for printed materials in mainland China starting in the 1950s; "Traditional" characters were used in printed materials prior to the 1950s, and are still used in Taiwan, Hong Kong and Macau today. Since the characters are distinct, it's as if Chinese materials are written in two scripts.
>>>>>>>> Another way to think about it: every written Chinese word has at least two completely different spellings. And it can be mix-n-match: a word can be written with one traditional and one simplified character.
>>>>>>>> Example: Given a user query 舊小說 (traditional for old fiction), the results should include matches for 舊小說 (traditional) and 旧小说 (simplified characters for old fiction)."
>>>>>>>>
>>>>>>>> So, using the example provided above, we are dealing with materials produced in the 1950s-1970s that do even weirder things like:
>>>>>>>>
>>>>>>>> A. 舊小說
>>>>>>>>
>>>>>>>> can also be
>>>>>>>>
>>>>>>>> B. 旧小说 (all simplified)
>>>>>>>> or
>>>>>>>> C. 旧小說 (first character simplified, last character traditional)
>>>>>>>> or
>>>>>>>> D. 舊小说 (first character traditional, last character simplified)
>>>>>>>>
>>>>>>>> Thankfully the middle character was never simplified in recent times.
>>>>>>>>
>>>>>>>> From a historical standpoint, the mixed nature of the characters in the same word/phrase is because not all simplified characters were adopted at the same time by everyone uniformly (good times...).
>>>>>>>>
>>>>>>>> The problem seems to be that Solr can easily handle A or B above, but NOT C or D, using the Smart Chinese analyzer. I'm not really sure how to change that at this point... maybe I should figure out how to contact the creators of the analyzer and ask them?
>>>>>>>>
>>>>>>>> Amanda
>>>>>>>>
>>>>>>>> ------
>>>>>>>> Dr. Amanda Shuman
>>>>>>>> Post-doc researcher, University of Freiburg, The Maoist Legacy Project
>>>>>>>> <http://www.maoistlegacy.uni-freiburg.de/>
>>>>>>>> PhD, University of California, Santa Cruz
>>>>>>>> http://www.amandashuman.net/
>>>>>>>> http://www.prchistoryresources.org/
>>>>>>>> Office: +49 (0) 761 203 4925
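To make Amanda's C and D cases concrete, here is a minimal sketch of the Traditional-Simplified token-filter approach run over all four spellings. It uses KeywordTokenizer purely to keep the example short (a real field would use HMMChineseTokenizerFactory or ICUTokenizerFactory) and assumes icu4j and lucene-analyzers-icu on the classpath:

  import java.io.StringReader;

  import com.ibm.icu.text.Transliterator;
  import org.apache.lucene.analysis.TokenStream;
  import org.apache.lucene.analysis.core.KeywordTokenizer;
  import org.apache.lucene.analysis.icu.ICUTransformFilter;
  import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

  public class MixedScriptDemo {
    public static void main(String[] args) throws Exception {
      Transliterator trad2simp = Transliterator.getInstance("Traditional-Simplified");
      // Amanda's four spellings of "old fiction".
      String[] variants = { "舊小說", "旧小说", "旧小說", "舊小说" };
      for (String v : variants) {
        KeywordTokenizer tok = new KeywordTokenizer();
        tok.setReader(new StringReader(v));
        TokenStream ts = new ICUTransformFilter(tok, trad2simp);
        CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
        ts.reset();
        while (ts.incrementToken()) {
          // All four variants should print the same simplified form: 旧小说
          System.out.println(v + " -> " + term);
        }
        ts.end();
        ts.close();
      }
    }
  }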
>>>>>>>> On Fri, Jul 20, 2018 at 1:40 PM, Alexandre Rafalovitch <arafa...@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> This is probably your start, if not read already:
>>>>>>>>> https://lucene.apache.org/solr/guide/7_4/language-analysis.html
>>>>>>>>>
>>>>>>>>> Otherwise, I think your answer would be somewhere around using ICU4J, IBM's library for dealing with Unicode: http://site.icu-project.org/ (mentioned on the same page above). Specifically, transformations: http://userguide.icu-project.org/transforms/general
>>>>>>>>>
>>>>>>>>> With that, maybe you map both alphabets into Latin. I did that once for Thai for a demo:
>>>>>>>>> https://github.com/arafalov/solr-thai-test/blob/master/collection1/conf/schema.xml#L34
>>>>>>>>>
>>>>>>>>> The challenge is to figure out all the magic rules for that. You'd have to dig through the ICU documentation and other web pages. I found this one, for example:
>>>>>>>>> http://avajava.com/tutorials/lessons/what-are-the-system-transliterators-available-with-icu4j.html
>>>>>>>>>
>>>>>>>>> There is also a 12-part series on Solr and Asian text processing, though it is a bit old now: http://discovery-grindstone.blogspot.com/
>>>>>>>>>
>>>>>>>>> Hope one of these things helps.
>>>>>>>>>
>>>>>>>>> Regards,
>>>>>>>>>    Alex.
>>>>>>>>>
>>>>>>>>> On 20 July 2018 at 03:54, Amanda Shuman <amanda.shu...@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>>> Hi all,
>>>>>>>>>>
>>>>>>>>>> We have a problem. Some of our historical documents mix simplified and traditional Chinese characters. There seems to be no problem when searching either traditional or simplified separately - that is, if a particular string/phrase is all in traditional or all in simplified, it finds it - but it does not find the string/phrase if the two different characters (one traditional, one simplified) are mixed together in the SAME string/phrase.
>>>>>>>>>>
>>>>>>>>>> Has anyone ever handled this problem before? I know some libraries seem to have implemented something that seems to be able to handle this, but I'm not sure how they did so!
>>>>>>>>>>
>>>>>>>>>> Amanda
>>>>>>>>>> ------
>>>>>>>>>> Dr. Amanda Shuman
>>>>>>>>>> Post-doc researcher, University of Freiburg, The Maoist Legacy Project
>>>>>>>>>> <http://www.maoistlegacy.uni-freiburg.de/>
>>>>>>>>>> PhD, University of California, Santa Cruz
>>>>>>>>>> http://www.amandashuman.net/
>>>>>>>>>> http://www.prchistoryresources.org/
>>>>>>>>>> Office: +49 (0) 761 203 4925
>
> --
> Tomoko Uchida
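A footnote on Alexandre's map-both-scripts-to-Latin idea: ICU also ships a Han-Latin transform that renders both traditional and simplified characters as Pinyin, so mixed-script strings collapse to a single Latin spelling. A minimal sketch (assuming icu4j; note that Pinyin homophones collapse too, which may or may not be acceptable for discovery):

  import com.ibm.icu.text.Transliterator;

  public class HanLatinDemo {
    public static void main(String[] args) {
      Transliterator t = Transliterator.getInstance("Han-Latin");
      // Both spellings should come out as the same Pinyin, e.g. "jiù xiǎo shuō".
      System.out.println(t.transliterate("舊小說"));
      System.out.println(t.transliterate("旧小说"));
    }
  }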