Would ICUNormalizer2CharFilterFactory do? Or at least serve as a template of what needs to be done.
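For context, ICUNormalizer2CharFilterFactory is an existing charfilter in the ICU analysis module, and it plugs in ahead of the tokenizer like this. A minimal sketch (the `name`/`mode` values shown are just the documented defaults, and the tokenizer choice is illustrative):

```xml
<analyzer>
  <!-- charFilters run before the tokenizer, so a transform applied here
       could normalize characters the tokenizer would otherwise mis-split -->
  <charFilter class="solr.ICUNormalizer2CharFilterFactory"
              name="nfkc_cf" mode="compose"/>
  <tokenizer class="solr.HMMChineseTokenizerFactory"/>
</analyzer>
```

A charfilter version of the ICU transforms would presumably occupy the same `<charFilter .../>` slot.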
Regards,
Alex.

On 20 July 2018 at 12:40, Walter Underwood <wun...@wunderwood.org> wrote:
> Looks like we need a charfilter version of the ICU transforms. That
> could run before the tokenizer.
>
> I've never built a charfilter, but it seems like this would be a good
> first project for someone who wants to contribute.
>
> wunder
> Walter Underwood
> wun...@wunderwood.org
> http://observer.wunderwood.org/ (my blog)
>
>> On Jul 20, 2018, at 8:24 AM, Tomoko Uchida
>> <tomoko.uchida.1...@gmail.com> wrote:
>>
>> Exactly. More concretely, the starting point is replacing your analyzer
>>
>> <analyzer class="org.apache.lucene.analysis.cn.smart.SmartChineseAnalyzer"/>
>>
>> with
>>
>> <analyzer>
>>   <tokenizer class="solr.HMMChineseTokenizerFactory"/>
>>   <filter class="solr.ICUTransformFilterFactory" id="Traditional-Simplified"/>
>> </analyzer>
>>
>> and seeing if the results are as expected. Then research other filters
>> if your requirements are not met.
>>
>> Just a reminder: HMMChineseTokenizerFactory does not handle traditional
>> characters, as I noted in a previous post, so ICUTransformFilterFactory
>> is an incomplete workaround.
>>
>> On Sat, Jul 21, 2018 at 0:05, Walter Underwood <wun...@wunderwood.org> wrote:
>>
>>> I expect that this is the line that does the transformation:
>>>
>>> <filter class="solr.ICUTransformFilterFactory" id="Traditional-Simplified"/>
>>>
>>> This mapping is a standard feature of ICU. More info on ICU transforms
>>> is in this doc, though not much detail on this particular transform.
>>>
>>> http://userguide.icu-project.org/transforms/general
>>>
>>> wunder
>>> Walter Underwood
>>> wun...@wunderwood.org
>>> http://observer.wunderwood.org/ (my blog)
>>>
>>>> On Jul 20, 2018, at 7:43 AM, Susheel Kumar <susheel2...@gmail.com> wrote:
>>>>
>>>> I think so.
>>>> I used the exact one from github:
>>>>
>>>> <fieldType name="text_cjk" class="solr.TextField"
>>>>     positionIncrementGap="10000" autoGeneratePhraseQueries="false">
>>>>   <analyzer>
>>>>     <tokenizer class="solr.ICUTokenizerFactory"/>
>>>>     <filter class="solr.CJKWidthFilterFactory"/>
>>>>     <filter class="edu.stanford.lucene.analysis.CJKFoldingFilterFactory"/>
>>>>     <filter class="solr.ICUTransformFilterFactory" id="Traditional-Simplified"/>
>>>>     <filter class="solr.ICUTransformFilterFactory" id="Katakana-Hiragana"/>
>>>>     <filter class="solr.ICUFoldingFilterFactory"/>
>>>>     <filter class="solr.CJKBigramFilterFactory" han="true" hiragana="true"
>>>>         katakana="true" hangul="true" outputUnigrams="true"/>
>>>>   </analyzer>
>>>> </fieldType>
>>>>
>>>> On Fri, Jul 20, 2018 at 10:12 AM, Amanda Shuman
>>>> <amanda.shu...@gmail.com> wrote:
>>>>
>>>>> Thanks! That does indeed look promising... This can be added on top
>>>>> of Smart Chinese, right? Or is it an alternative?
>>>>>
>>>>> ------
>>>>> Dr. Amanda Shuman
>>>>> Post-doc researcher, University of Freiburg, The Maoist Legacy Project
>>>>> <http://www.maoistlegacy.uni-freiburg.de/>
>>>>> PhD, University of California, Santa Cruz
>>>>> http://www.amandashuman.net/
>>>>> http://www.prchistoryresources.org/
>>>>> Office: +49 (0) 761 203 4925
>>>>>
>>>>> On Fri, Jul 20, 2018 at 3:11 PM, Susheel Kumar
>>>>> <susheel2...@gmail.com> wrote:
>>>>>
>>>>>> I think CJKFoldingFilter will work for you. I put 舊小說 in the index
>>>>>> and then each of A, B, C or D in the query, and they seem to be
>>>>>> matching; CJKFoldingFilter is transforming the 舊 to 旧.
>>>>>>
>>>>>> On Fri, Jul 20, 2018 at 9:08 AM, Susheel Kumar
>>>>>> <susheel2...@gmail.com> wrote:
>>>>>>
>>>>>>> I lack Chinese language knowledge, but if you want, I can do a
>>>>>>> quick test for you in the Analysis tab if you give me what to put
>>>>>>> in the index and query windows...
>>>>>>> On Fri, Jul 20, 2018 at 8:59 AM, Susheel Kumar
>>>>>>> <susheel2...@gmail.com> wrote:
>>>>>>>
>>>>>>>> Have you tried CJKFoldingFilter?
>>>>>>>> https://github.com/sul-dlss/CJKFoldingFilter
>>>>>>>> I am not sure whether it covers your use case, but I am using this
>>>>>>>> filter and have had no issues so far.
>>>>>>>>
>>>>>>>> Thnx
>>>>>>>>
>>>>>>>> On Fri, Jul 20, 2018 at 8:44 AM, Amanda Shuman
>>>>>>>> <amanda.shu...@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> Thanks, Alex - I have seen a few of those links but never
>>>>>>>>> considered transliteration! We use Lucene's Smart Chinese
>>>>>>>>> analyzer. The issue is basically what is laid out in the old
>>>>>>>>> blogspot post, namely this point:
>>>>>>>>>
>>>>>>>>> "Why approach CJK resource discovery differently?
>>>>>>>>>
>>>>>>>>> 2. Search results must be as script agnostic as possible.
>>>>>>>>>
>>>>>>>>> There is more than one way to write each word. "Simplified"
>>>>>>>>> characters were emphasized for printed materials in mainland
>>>>>>>>> China starting in the 1950s; "Traditional" characters were used
>>>>>>>>> in printed materials prior to the 1950s, and are still used in
>>>>>>>>> Taiwan, Hong Kong and Macau today.
>>>>>>>>> Since the characters are distinct, it's as if Chinese materials
>>>>>>>>> are written in two scripts.
>>>>>>>>> Another way to think about it: every written Chinese word has at
>>>>>>>>> least two completely different spellings. And it can be
>>>>>>>>> mix-n-match: a word can be written with one traditional and one
>>>>>>>>> simplified character.
>>>>>>>>> Example: Given a user query 舊小說 (traditional for old fiction),
>>>>>>>>> the results should include matches for 舊小說 (traditional) and
>>>>>>>>> 旧小说 (simplified characters for old fiction)"
>>>>>>>>>
>>>>>>>>> So, using the example provided above, we are dealing with
>>>>>>>>> materials produced in the 1950s-1970s that do even weirder
>>>>>>>>> things like:
>>>>>>>>>
>>>>>>>>> A. 舊小說
>>>>>>>>>
>>>>>>>>> can also be
>>>>>>>>>
>>>>>>>>> B. 旧小说 (all simplified)
>>>>>>>>> or
>>>>>>>>> C. 旧小說 (first character simplified, last character traditional)
>>>>>>>>> or
>>>>>>>>> D. 舊小说 (first character traditional, last character simplified)
>>>>>>>>>
>>>>>>>>> Thankfully the middle character was never simplified in recent
>>>>>>>>> times.
>>>>>>>>>
>>>>>>>>> From a historical standpoint, the mixed nature of the characters
>>>>>>>>> in the same word/phrase is because not all simplified characters
>>>>>>>>> were adopted at the same time by everyone uniformly (good
>>>>>>>>> times...).
>>>>>>>>>
>>>>>>>>> The problem seems to be that Solr can easily handle A or B above,
>>>>>>>>> but NOT C or D using the Smart Chinese analyzer. I'm not really
>>>>>>>>> sure how to change that at this point... maybe I should figure
>>>>>>>>> out how to contact the creators of the analyzer and ask them?
>>>>>>>>>
>>>>>>>>> Amanda
>>>>>>>>>
>>>>>>>>> ------
>>>>>>>>> Dr. Amanda Shuman
>>>>>>>>> Post-doc researcher, University of Freiburg, The Maoist Legacy
>>>>>>>>> Project <http://www.maoistlegacy.uni-freiburg.de/>
>>>>>>>>> PhD, University of California, Santa Cruz
>>>>>>>>> http://www.amandashuman.net/
>>>>>>>>> http://www.prchistoryresources.org/
>>>>>>>>> Office: +49 (0) 761 203 4925
>>>>>>>>>
>>>>>>>>> On Fri, Jul 20, 2018 at 1:40 PM, Alexandre Rafalovitch
>>>>>>>>> <arafa...@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>>> This is probably your start, if not read already:
>>>>>>>>>> https://lucene.apache.org/solr/guide/7_4/language-analysis.html
>>>>>>>>>>
>>>>>>>>>> Otherwise, I think your answer would be somewhere around using
>>>>>>>>>> ICU4J, IBM's library for dealing with Unicode:
>>>>>>>>>> http://site.icu-project.org/ (mentioned on the same page above)
>>>>>>>>>> Specifically, transformations:
>>>>>>>>>> http://userguide.icu-project.org/transforms/general
>>>>>>>>>>
>>>>>>>>>> With that, maybe you map both alphabets into Latin. I did that
>>>>>>>>>> once for Thai for a demo:
>>>>>>>>>> https://github.com/arafalov/solr-thai-test/blob/master/collection1/conf/schema.xml#L34
>>>>>>>>>>
>>>>>>>>>> The challenge is to figure out all the magic rules for that.
>>>>>>>>>> You'd have to dig through the ICU documentation and other web
>>>>>>>>>> pages. I found this one, for example:
>>>>>>>>>> http://avajava.com/tutorials/lessons/what-are-the-system-transliterators-available-with-icu4j.html
>>>>>>>>>>
>>>>>>>>>> There is also a 12-part series on Solr and Asian text
>>>>>>>>>> processing, though it is a bit old now:
>>>>>>>>>> http://discovery-grindstone.blogspot.com/
>>>>>>>>>>
>>>>>>>>>> Hope one of these things helps.
>>>>>>>>>>
>>>>>>>>>> Regards,
>>>>>>>>>> Alex.
>>>>>>>>>> On 20 July 2018 at 03:54, Amanda Shuman
>>>>>>>>>> <amanda.shu...@gmail.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> Hi all,
>>>>>>>>>>>
>>>>>>>>>>> We have a problem. Some of our historical documents mix
>>>>>>>>>>> together simplified and traditional Chinese characters. There
>>>>>>>>>>> seems to be no problem when searching either traditional or
>>>>>>>>>>> simplified separately - that is, if a particular string/phrase
>>>>>>>>>>> is all in traditional or all in simplified, it finds it - but
>>>>>>>>>>> it does not find the string/phrase if the two different
>>>>>>>>>>> characters (one traditional, one simplified) are mixed
>>>>>>>>>>> together in the SAME string/phrase.
>>>>>>>>>>>
>>>>>>>>>>> Has anyone ever handled this problem before? I know some
>>>>>>>>>>> libraries seem to have implemented something that can handle
>>>>>>>>>>> this, but I'm not sure how they did so!
>>>>>>>>>>>
>>>>>>>>>>> Amanda
>>>>>>>>>>> ------
>>>>>>>>>>> Dr. Amanda Shuman
>>>>>>>>>>> Post-doc researcher, University of Freiburg, The Maoist Legacy
>>>>>>>>>>> Project <http://www.maoistlegacy.uni-freiburg.de/>
>>>>>>>>>>> PhD, University of California, Santa Cruz
>>>>>>>>>>> http://www.amandashuman.net/
>>>>>>>>>>> http://www.prchistoryresources.org/
>>>>>>>>>>> Office: +49 (0) 761 203 4925
>>
>> --
>> Tomoko Uchida
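As an aside for readers of this thread: the reason the Traditional-Simplified transform fixes the mixed-script cases A-D above is that it is applied per character, at both index and query time, so every variant collapses to the same indexed term. A minimal sketch (the two-entry mapping below is only a toy stand-in for ICU's full Traditional-Simplified transliterator, limited to the characters in the thread's example):

```python
# Toy stand-in for ICU's Traditional-Simplified transform: map each
# traditional character to its simplified form, pass everything else through.
TRAD_TO_SIMP = {"舊": "旧", "說": "说"}  # 小 has no simplified variant


def normalize(text: str) -> str:
    """Fold traditional characters to simplified, character by character."""
    return "".join(TRAD_TO_SIMP.get(ch, ch) for ch in text)


# The four variants of "old fiction" from the thread (A, B, C, D):
variants = ["舊小說", "旧小说", "旧小說", "舊小说"]

# All four collapse to the single form 旧小说, so applying the same
# normalization at index time and query time lets any variant match any other.
print({normalize(v) for v in variants})
```

This is exactly the role ICUTransformFilterFactory with id="Traditional-Simplified" plays in the analyzer chains quoted above, with ICU supplying the complete character mapping.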