Hi Amanda,

> is modifying the settings from smartChinese to the ones you posted here
> all I need to do

Yes, the settings I posted should work for you, at least partially.
If you are happy with the results, that's OK!
But please treat them as a starting point, because they are not perfect.

> Or do I still need to do something with the SmartChineseAnalyzer?

Try the settings; if you then notice something strange and want to know why
and how to fix it, that may be the time to dive deeper. ;)

I cannot explain how analyzers work here... but you should start with
the Solr documentation:
https://lucene.apache.org/solr/guide/7_0/understanding-analyzers-tokenizers-and-filters.html
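
In case it helps, here is a minimal sketch of how that analyzer chain could
sit in your schema.xml. The field type name "text_zh" and the field
"content_zh" are only placeholders I made up, so adjust them to match your
own schema:

<fieldType name="text_zh" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <!-- Unicode normalization before tokenization -->
    <charFilter class="solr.ICUNormalizer2CharFilterFactory"/>
    <!-- HMM-based word segmentation (trained on simplified Chinese text) -->
    <tokenizer class="solr.HMMChineseTokenizerFactory"/>
    <!-- fold traditional characters to simplified after tokenization;
         as noted further down in this thread, this is an incomplete
         workaround for mixed traditional/simplified text -->
    <filter class="solr.ICUTransformFilterFactory" id="Traditional-Simplified"/>
  </analyzer>
</fieldType>

<field name="content_zh" type="text_zh" indexed="true" stored="true"/>

If I remember correctly, the ICU and smartcn factories need the
analysis-extras contrib jars on Solr's classpath, so check that first if the
field type fails to load.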

Regards,
Tomoko



2018年7月24日(火) 21:08 Amanda Shuman <amanda.shu...@gmail.com>:

> Hi Tomoko,
>
> Thanks so much for this explanation - I did not even know this was
> possible! I will try it out, but I have one question: is modifying the
> settings from smartChinese to the ones you posted here all I need to do:
>
> <analyzer>
>   <charFilter class="solr.ICUNormalizer2CharFilterFactory"/>
>   <tokenizer class="solr.HMMChineseTokenizerFactory"/>
>   <filter class="solr.ICUTransformFilterFactory"
> id="Traditional-Simplified"/>
> </analyzer>
>
> Or do I still need to do something with the SmartChineseAnalyzer? I did not
> quite understand this in your first message:
>
> " I think you need two steps if you want to use HMMChineseTokenizer
> correctly.
>
> 1. transform all traditional characters to simplified ones and save to
> temporary files.
>     I do not have clear idea for doing this, but you can create a Java
> program that calls Lucene's ICUTransformFilter
> 2. then, index to Solr using SmartChineseAnalyzer."
>
> My understanding is that with the new settings you posted, I don't need to
> do these steps. Is that correct? Otherwise, I don't really know how to do
> step 1 with the java program....
>
> Thanks!
> Amanda
>
>
> ------
> Dr. Amanda Shuman
> Post-doc researcher, University of Freiburg, The Maoist Legacy Project
> <http://www.maoistlegacy.uni-freiburg.de/>
> PhD, University of California, Santa Cruz
> http://www.amandashuman.net/
> http://www.prchistoryresources.org/
> Office: +49 (0) 761 203 4925
>
>
> On Fri, Jul 20, 2018 at 8:03 PM, Tomoko Uchida <
> tomoko.uchida.1...@gmail.com
> > wrote:
>
> > Yes, while traditional-to-simplified transformation would be out of the
> > scope of Unicode normalization,
> > you would want to add ICUNormalizer2CharFilterFactory anyway :)
> >
> > Let me refine my example settings:
> >
> > <analyzer>
> >   <charFilter class="solr.ICUNormalizer2CharFilterFactory"/>
> >   <tokenizer class="solr.HMMChineseTokenizerFactory"/>
> >   <filter class="solr.ICUTransformFilterFactory"
> > id="Traditional-Simplified"/>
> > </analyzer>
> >
> > Regards,
> > Tomoko
> >
> >
> > 2018年7月21日(土) 2:54 Alexandre Rafalovitch <arafa...@gmail.com>:
> >
> > > Would  ICUNormalizer2CharFilterFactory do? Or at least serve as a
> > > template of what needs to be done.
> > >
> > > Regards,
> > >    Alex.
> > >
> > > On 20 July 2018 at 12:40, Walter Underwood <wun...@wunderwood.org>
> > wrote:
> > > > Looks like we need a charfilter version of the ICU transforms. That
> > > could run before the tokenizer.
> > > >
> > > > I’ve never built a charfilter, but it seems like this would be a good
> > > first project for someone who wants to contribute.
> > > >
> > > > wunder
> > > > Walter Underwood
> > > > wun...@wunderwood.org
> > > > http://observer.wunderwood.org/  (my blog)
> > > >
> > > >> On Jul 20, 2018, at 8:24 AM, Tomoko Uchida <
> > > tomoko.uchida.1...@gmail.com> wrote:
> > > >>
> > > >> Exactly. More concretely, the starting point is: replacing your
> > analyzer
> > > >>
> > > >> <analyzer
> > > class="org.apache.lucene.analysis.cn.smart.SmartChineseAnalyzer"/>
> > > >>
> > > >> to
> > > >>
> > > >> <analyzer>
> > > >>  <tokenizer class="solr.HMMChineseTokenizerFactory"/>
> > > >>  <filter class="solr.ICUTransformFilterFactory"
> > > >> id="Traditional-Simplified"/>
> > > >> </analyzer>
> > > >>
> > > >> and see if the results are as expected. Then research other filters
> > > >> if your requirements are not met.
> > > >>
> > > >> Just a reminder: HMMChineseTokenizerFactory does not handle traditional
> > > >> characters, as I noted in my previous post, so ICUTransformFilterFactory
> > > >> is an incomplete workaround.
> > > >>
> > > >> 2018年7月21日(土) 0:05 Walter Underwood <wun...@wunderwood.org>:
> > > >>
> > > >>> I expect that this is the line that does the transformation:
> > > >>>
> > > >>>   <filter class="solr.ICUTransformFilterFactory"
> > > >>> id="Traditional-Simplified"/>
> > > >>>
> > > >>> This mapping is a standard feature of ICU. More info on ICU
> > transforms
> > > is
> > > >>> in this doc, though not much detail on this particular transform.
> > > >>>
> > > >>> http://userguide.icu-project.org/transforms/general
> > > >>>
> > > >>> wunder
> > > >>> Walter Underwood
> > > >>> wun...@wunderwood.org
> > > >>> http://observer.wunderwood.org/  (my blog)
> > > >>>
> > > >>>> On Jul 20, 2018, at 7:43 AM, Susheel Kumar <susheel2...@gmail.com
> >
> > > >>> wrote:
> > > >>>>
> > > >>>> I think so.  I used the exact settings as in the GitHub repo:
> > > >>>>
> > > >>>> <fieldType name="text_cjk" class="solr.TextField"
> > > >>>> positionIncrementGap="10000" autoGeneratePhraseQueries="false">
> > > >>>> <analyzer>
> > > >>>>   <tokenizer class="solr.ICUTokenizerFactory" />
> > > >>>>   <filter class="solr.CJKWidthFilterFactory"/>
> > > >>>>   <filter
> > > class="edu.stanford.lucene.analysis.CJKFoldingFilterFactory"/>
> > > >>>>   <filter class="solr.ICUTransformFilterFactory"
> > > >>> id="Traditional-Simplified"/>
> > > >>>>   <filter class="solr.ICUTransformFilterFactory"
> > > >>> id="Katakana-Hiragana"/>
> > > >>>>   <filter class="solr.ICUFoldingFilterFactory"/>
> > > >>>>   <filter class="solr.CJKBigramFilterFactory" han="true"
> > > >>>> hiragana="true" katakana="true" hangul="true"
> outputUnigrams="true"
> > />
> > > >>>> </analyzer>
> > > >>>> </fieldType>
> > > >>>>
> > > >>>>
> > > >>>>
> > > >>>> On Fri, Jul 20, 2018 at 10:12 AM, Amanda Shuman <
> > > amanda.shu...@gmail.com
> > > >>>>
> > > >>>> wrote:
> > > >>>>
> > > >>>>> Thanks! That does indeed look promising... This can be added on
> top
> > > of
> > > >>>>> Smart Chinese, right? Or is it an alternative?
> > > >>>>>
> > > >>>>>
> > > >>>>> ------
> > > >>>>> Dr. Amanda Shuman
> > > >>>>> Post-doc researcher, University of Freiburg, The Maoist Legacy
> > > Project
> > > >>>>> <http://www.maoistlegacy.uni-freiburg.de/>
> > > >>>>> PhD, University of California, Santa Cruz
> > > >>>>> http://www.amandashuman.net/
> > > >>>>> http://www.prchistoryresources.org/
> > > >>>>> Office: +49 (0) 761 203 4925
> > > >>>>>
> > > >>>>>
> > > >>>>> On Fri, Jul 20, 2018 at 3:11 PM, Susheel Kumar <
> > > susheel2...@gmail.com>
> > > >>>>> wrote:
> > > >>>>>
> > > >>>>>> I think CJKFoldingFilter will work for you.  I put 舊小說 in the
> > > >>>>>> index and then each of A, B, C, or D in the query, and they seem to
> > > >>>>>> match: CJKFF is transforming the 舊 to 旧
> > > >>>>>>
> > > >>>>>> On Fri, Jul 20, 2018 at 9:08 AM, Susheel Kumar <
> > > susheel2...@gmail.com>
> > > >>>>>> wrote:
> > > >>>>>>
> > > >>>>>>> I lack Chinese language knowledge, but if you want, I can do a
> > > >>>>>>> quick test for you in the Analysis tab if you give me what to put
> > > >>>>>>> in the index and query windows...
> > > >>>>>>>
> > > >>>>>>> On Fri, Jul 20, 2018 at 8:59 AM, Susheel Kumar <
> > > susheel2...@gmail.com
> > > >>>>
> > > >>>>>>> wrote:
> > > >>>>>>>
> > > >>>>>>>> Have you tried to use CJKFoldingFilter
> > > >>>>>>>> (https://github.com/sul-dlss/CJKFoldingFilter)?  I am not sure if
> > > >>>>>>>> this would cover your use case, but I am using this filter and so
> > > >>>>>>>> far have had no issues.
> > > >>>>>>>>
> > > >>>>>>>> Thnx
> > > >>>>>>>>
> > > >>>>>>>> On Fri, Jul 20, 2018 at 8:44 AM, Amanda Shuman <
> > > >>>>> amanda.shu...@gmail.com
> > > >>>>>>>
> > > >>>>>>>> wrote:
> > > >>>>>>>>
> > > >>>>>>>>> Thanks, Alex - I have seen a few of those links but never
> > > considered
> > > >>>>>>>>> transliteration! We use lucene's Smart Chinese analyzer. The
> > > issue
> > > >>> is
> > > >>>>>>>>> basically what is laid out in the old blogspot post, namely
> > this
> > > >>>>> point:
> > > >>>>>>>>>
> > > >>>>>>>>>
> > > >>>>>>>>> "Why approach CJK resource discovery differently?
> > > >>>>>>>>>
> > > >>>>>>>>> 2.  Search results must be as script agnostic as possible.
> > > >>>>>>>>>
> > > >>>>>>>>> There is more than one way to write each word. "Simplified"
> > > >>>>> characters
> > > >>>>>>>>> were
> > > >>>>>>>>> emphasized for printed materials in mainland China starting
> in
> > > the
> > > >>>>>> 1950s;
> > > >>>>>>>>> "Traditional" characters were used in printed materials prior
> > to
> > > the
> > > >>>>>>>>> 1950s,
> > > >>>>>>>>> and are still used in Taiwan, Hong Kong and Macau today.
> > > >>>>>>>>> Since the characters are distinct, it's as if Chinese
> materials
> > > are
> > > >>>>>>>>> written
> > > >>>>>>>>> in two scripts.
> > > >>>>>>>>> Another way to think about it:  every written Chinese word
> has
> > at
> > > >>>>> least
> > > >>>>>>>>> two
> > > >>>>>>>>> completely different spellings.  And it can be mix-n-match:
> a
> > > word
> > > >>>>> can
> > > >>>>>>>>> be
> > > >>>>>>>>> written with one traditional  and one simplified character.
> > > >>>>>>>>> Example:   Given a user query 舊小說  (traditional for old
> > fiction),
> > > >>> the
> > > >>>>>>>>> results should include matches for 舊小說 (traditional) and 旧小说
> > > >>>>>> (simplified
> > > >>>>>>>>> characters for old fiction)"
> > > >>>>>>>>>
> > > >>>>>>>>> So, using the example provided above, we are dealing with
> > > materials
> > > >>>>>>>>> produced in the 1950s-1970s that do even weirder things like:
> > > >>>>>>>>>
> > > >>>>>>>>> A. 舊小說
> > > >>>>>>>>>
> > > >>>>>>>>> can also be
> > > >>>>>>>>>
> > > >>>>>>>>> B. 旧小说 (all simplified)
> > > >>>>>>>>> or
> > > >>>>>>>>> C. 旧小說 (first character simplified, last character traditional)
> > > >>>>>>>>> or
> > > >>>>>>>>> D. 舊小说 (first character traditional, last character simplified)
> > > >>>>>>>>>
> > > >>>>>>>>> Thankfully the middle character was never simplified in
> recent
> > > >>> times.
> > > >>>>>>>>>
> > > >>>>>>>>> From a historical standpoint, the mixed nature of the
> > characters
> > > in
> > > >>>>> the
> > > >>>>>>>>> same word/phrase is because not all simplified characters
> were
> > > >>>>> adopted
> > > >>>>>> at
> > > >>>>>>>>> the same time by everyone uniformly (good times...).
> > > >>>>>>>>>
> > > >>>>>>>>> The problem seems to be that Solr can easily handle A or B
> > above,
> > > >>> but
> > > >>>>>>>>> NOT C
> > > >>>>>>>>> or D using the Smart Chinese analyzer. I'm not really sure
> how
> > to
> > > >>>>>> change
> > > >>>>>>>>> that at this point... maybe I should figure out how to
> contact
> > > the
> > > >>>>>>>>> creators
> > > >>>>>>>>> of the analyzer and ask them?
> > > >>>>>>>>>
> > > >>>>>>>>> Amanda
> > > >>>>>>>>>
> > > >>>>>>>>> ------
> > > >>>>>>>>> Dr. Amanda Shuman
> > > >>>>>>>>> Post-doc researcher, University of Freiburg, The Maoist
> Legacy
> > > >>>>> Project
> > > >>>>>>>>> <http://www.maoistlegacy.uni-freiburg.de/>
> > > >>>>>>>>> PhD, University of California, Santa Cruz
> > > >>>>>>>>> http://www.amandashuman.net/
> > > >>>>>>>>> http://www.prchistoryresources.org/
> > > >>>>>>>>> Office: +49 (0) 761 203 4925
> > > >>>>>>>>>
> > > >>>>>>>>>
> > > >>>>>>>>> On Fri, Jul 20, 2018 at 1:40 PM, Alexandre Rafalovitch <
> > > >>>>>>>>> arafa...@gmail.com>
> > > >>>>>>>>> wrote:
> > > >>>>>>>>>
> > > >>>>>>>>>> This is probably your start, if not read already:
> > > >>>>>>>>>> https://lucene.apache.org/solr/guide/7_4/language-analysis.html
> > > >>>>>>>>>>
> > > >>>>>>>>>> Otherwise, I think your answer would be somewhere around
> using
> > > >>>>> ICU4J,
> > > >>>>>>>>>> IBM's library for dealing with Unicode:
> > > >>>>> http://site.icu-project.org/
> > > >>>>>>>>>> (mentioned on the same page above)
> > > >>>>>>>>>> Specifically, transformations:
> > > >>>>>>>>>> http://userguide.icu-project.org/transforms/general
> > > >>>>>>>>>>
> > > >>>>>>>>>> With that, maybe you map both alphabets into latin. I did
> that
> > > once
> > > >>>>>>>>>> for Thai for a demo:
> > > >>>>>>>>>> https://github.com/arafalov/solr-thai-test/blob/master/collection1/conf/schema.xml#L34
> > > >>>>>>>>>>
> > > >>>>>>>>>> The challenge is to figure out all the magic rules for that.
> > > You'd
> > > >>>>>>>>>> have to dig through the ICU documentation and other web
> > pages. I
> > > >>>>>> found
> > > >>>>>>>>>> this one for example:
> > > >>>>>>>>>> http://avajava.com/tutorials/lessons/what-are-the-system-transliterators-available-with-icu4j.html;jsessionid=BEAB0AF05A588B97B8A2393054D908C0
> > > >>>>>>>>>>
> > > >>>>>>>>>> There is also a 12-part series on Solr and Asian text
> > > >>>>>>>>>> processing, though it is a bit old now:
> > > >>>>>>>>>> http://discovery-grindstone.blogspot.com/
> > > >>>>>>>>>>
> > > >>>>>>>>>> Hope one of these things help.
> > > >>>>>>>>>>
> > > >>>>>>>>>> Regards,
> > > >>>>>>>>>>  Alex.
> > > >>>>>>>>>>
> > > >>>>>>>>>>
> > > >>>>>>>>>> On 20 July 2018 at 03:54, Amanda Shuman <
> > > amanda.shu...@gmail.com>
> > > >>>>>>>>> wrote:
> > > >>>>>>>>>>> Hi all,
> > > >>>>>>>>>>>
> > > >>>>>>>>>>> We have a problem. Some of our historical documents have mixed
> > > >>>>>>>>>>> together simplified and traditional Chinese characters. There
> > > >>>>>>>>>>> seems to be no problem when
> > > >>>>>>>>>>> searching either traditional or simplified separately -
> that
> > > is,
> > > >>>>>> if a
> > > >>>>>>>>>>> particular string/phrase is all in traditional or
> simplified,
> > > it
> > > >>>>>>>>> finds
> > > >>>>>>>>>> it -
> > > >>>>>>>>>>> but it does not find the string/phrase if the two different
> > > >>>>>>>>> characters
> > > >>>>>>>>>> (one
> > > >>>>>>>>>>> traditional, one simplified) are mixed together in the SAME
> > > >>>>>>>>>> string/phrase.
> > > >>>>>>>>>>>
> > > >>>>>>>>>>> Has anyone ever handled this problem before? I know some
> > > >>>>> libraries
> > > >>>>>>>>> seem
> > > >>>>>>>>>> to
> > > >>>>>>>>>>> have implemented something that seems to be able to handle
> > > this,
> > > >>>>>> but
> > > >>>>>>>>> I'm
> > > >>>>>>>>>>> not sure how they did so!
> > > >>>>>>>>>>>
> > > >>>>>>>>>>> Amanda
> > > >>>>>>>>>>> ------
> > > >>>>>>>>>>> Dr. Amanda Shuman
> > > >>>>>>>>>>> Post-doc researcher, University of Freiburg, The Maoist
> > Legacy
> > > >>>>>>>>> Project
> > > >>>>>>>>>>> <http://www.maoistlegacy.uni-freiburg.de/>
> > > >>>>>>>>>>> PhD, University of California, Santa Cruz
> > > >>>>>>>>>>> http://www.amandashuman.net/
> > > >>>>>>>>>>> http://www.prchistoryresources.org/
> > > >>>>>>>>>>> Office: +49 (0) 761 203 4925
> > > >>>>>>>>>>
> > > >>>>>>>>>
> > > >>>>>>>>
> > > >>>>>>>>
> > > >>>>>>>
> > > >>>>>>
> > > >>>>>
> > > >>>
> > > >>>
> > > >>
> > > >> --
> > > >> Tomoko Uchida
> > > >
> > >
> >
> >
> > --
> > Tomoko Uchida
> >
>


-- 
Tomoko Uchida
