I think so. I used the exact same configuration as on GitHub:

<fieldType name="text_cjk" class="solr.TextField" positionIncrementGap="10000" autoGeneratePhraseQueries="false">
  <analyzer>
    <tokenizer class="solr.ICUTokenizerFactory"/>
    <filter class="solr.CJKWidthFilterFactory"/>
    <filter class="edu.stanford.lucene.analysis.CJKFoldingFilterFactory"/>
    <filter class="solr.ICUTransformFilterFactory" id="Traditional-Simplified"/>
    <filter class="solr.ICUTransformFilterFactory" id="Katakana-Hiragana"/>
    <filter class="solr.ICUFoldingFilterFactory"/>
    <filter class="solr.CJKBigramFilterFactory" han="true" hiragana="true" katakana="true" hangul="true" outputUnigrams="true"/>
  </analyzer>
</fieldType>
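If you would rather keep the Smart Chinese tokenizer instead of switching to ICUTokenizerFactory, something like the following might work. This is an untested sketch: it assumes the CJKFoldingFilter jar and the analysis-extras ICU jars are on Solr's classpath, and the field type name is illustrative.

    <fieldType name="text_zh_folded" class="solr.TextField" positionIncrementGap="10000">
      <analyzer>
        <!-- Smart Chinese HMM-based tokenizer (from the analysis-extras contrib) -->
        <tokenizer class="solr.HMMChineseTokenizerFactory"/>
        <!-- Fold variant CJK forms, then map Traditional characters to Simplified -->
        <filter class="edu.stanford.lucene.analysis.CJKFoldingFilterFactory"/>
        <filter class="solr.ICUTransformFilterFactory" id="Traditional-Simplified"/>
        <filter class="solr.LowerCaseFilterFactory"/>
      </analyzer>
    </fieldType>

The Analysis tab in the Solr admin UI is the quickest way to check whether a chain like this produces identical tokens for a traditional input and its simplified equivalent.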
On Fri, Jul 20, 2018 at 10:12 AM, Amanda Shuman <amanda.shu...@gmail.com> wrote:
> Thanks! That does indeed look promising... This can be added on top of
> Smart Chinese, right? Or is it an alternative?
>
> ------
> Dr. Amanda Shuman
> Post-doc researcher, University of Freiburg, The Maoist Legacy Project
> <http://www.maoistlegacy.uni-freiburg.de/>
> PhD, University of California, Santa Cruz
> http://www.amandashuman.net/
> http://www.prchistoryresources.org/
> Office: +49 (0) 761 203 4925
>
> On Fri, Jul 20, 2018 at 3:11 PM, Susheel Kumar <susheel2...@gmail.com> wrote:
> > I think CJKFoldingFilter will work for you. I put 舊小說 in the index
> > and then each of A, B, C, or D in the query, and they seem to match;
> > CJKFF is transforming the 舊 to 旧.
> >
> > On Fri, Jul 20, 2018 at 9:08 AM, Susheel Kumar <susheel2...@gmail.com> wrote:
> > > I lack Chinese language knowledge, but if you want, I can do a quick
> > > test for you in the Analysis tab if you can give me what to put in
> > > the index and query windows...
> > >
> > > On Fri, Jul 20, 2018 at 8:59 AM, Susheel Kumar <susheel2...@gmail.com> wrote:
> > > > Have you tried CJKFoldingFilter?
> > > > https://github.com/sul-dlss/CJKFoldingFilter
> > > > I am not sure if this would cover your use case, but I am using
> > > > this filter and so far no issues.
> > > >
> > > > Thnx
> > > >
> > > > On Fri, Jul 20, 2018 at 8:44 AM, Amanda Shuman <amanda.shu...@gmail.com> wrote:
> > > > > Thanks, Alex - I have seen a few of those links but never
> > > > > considered transliteration! We use Lucene's Smart Chinese
> > > > > analyzer. The issue is basically what is laid out in the old
> > > > > blogspot post, namely this point:
> > > > >
> > > > > "Why approach CJK resource discovery differently?
> > > > >
> > > > > 2. Search results must be as script agnostic as possible.
> > > > >
> > > > > There is more than one way to write each word.
> > > > > "Simplified" characters were emphasized for printed materials in
> > > > > mainland China starting in the 1950s; "Traditional" characters
> > > > > were used in printed materials prior to the 1950s, and are still
> > > > > used in Taiwan, Hong Kong and Macau today.
> > > > > Since the characters are distinct, it's as if Chinese materials
> > > > > are written in two scripts.
> > > > > Another way to think about it: every written Chinese word has at
> > > > > least two completely different spellings. And it can be
> > > > > mix-n-match: a word can be written with one traditional and one
> > > > > simplified character.
> > > > > Example: Given a user query 舊小說 (traditional for old fiction),
> > > > > the results should include matches for 舊小說 (traditional) and
> > > > > 旧小说 (simplified characters for old fiction)"
> > > > >
> > > > > So, using the example provided above, we are dealing with
> > > > > materials produced in the 1950s-1970s that do even weirder things
> > > > > like:
> > > > >
> > > > > A. 舊小說
> > > > >
> > > > > can also be
> > > > >
> > > > > B. 旧小说 (all simplified)
> > > > > or
> > > > > C. 旧小說 (first character simplified, last character traditional)
> > > > > or
> > > > > D. 舊小说 (first character traditional, last character simplified)
> > > > >
> > > > > Thankfully the middle character was never simplified in recent
> > > > > times.
> > > > >
> > > > > From a historical standpoint, the mixed nature of the characters
> > > > > in the same word/phrase is because not all simplified characters
> > > > > were adopted at the same time by everyone uniformly (good
> > > > > times...).
> > > > >
> > > > > The problem seems to be that Solr can easily handle A or B above,
> > > > > but NOT C or D using the Smart Chinese analyzer. I'm not really
> > > > > sure how to change that at this point... maybe I should figure
> > > > > out how to contact the creators of the analyzer and ask them?
> > > > >
> > > > > Amanda
> > > > >
> > > > > ------
> > > > > Dr.
> > > > > Amanda Shuman
> > > > > Post-doc researcher, University of Freiburg, The Maoist Legacy Project
> > > > > <http://www.maoistlegacy.uni-freiburg.de/>
> > > > > PhD, University of California, Santa Cruz
> > > > > http://www.amandashuman.net/
> > > > > http://www.prchistoryresources.org/
> > > > > Office: +49 (0) 761 203 4925
> > > > >
> > > > > On Fri, Jul 20, 2018 at 1:40 PM, Alexandre Rafalovitch <arafa...@gmail.com> wrote:
> > > > > > This is probably your start, if not read already:
> > > > > > https://lucene.apache.org/solr/guide/7_4/language-analysis.html
> > > > > >
> > > > > > Otherwise, I think your answer would be somewhere around using
> > > > > > ICU4J, IBM's library for dealing with Unicode:
> > > > > > http://site.icu-project.org/ (mentioned on the same page above)
> > > > > > Specifically, transformations:
> > > > > > http://userguide.icu-project.org/transforms/general
> > > > > >
> > > > > > With that, maybe you map both alphabets into Latin. I did that
> > > > > > once for Thai for a demo:
> > > > > > https://github.com/arafalov/solr-thai-test/blob/master/collection1/conf/schema.xml#L34
> > > > > >
> > > > > > The challenge is to figure out all the magic rules for that.
> > > > > > You'd have to dig through the ICU documentation and other web
> > > > > > pages. I found this one, for example:
> > > > > > http://avajava.com/tutorials/lessons/what-are-the-system-transliterators-available-with-icu4j.html
> > > > > >
> > > > > > There is also a 12-part series on Solr and Asian text
> > > > > > processing, though it is a bit old now:
> > > > > > http://discovery-grindstone.blogspot.com/
> > > > > >
> > > > > > Hope one of these things helps.
> > > > > >
> > > > > > Regards,
> > > > > >    Alex.
> > > > > >
> > > > > > On 20 July 2018 at 03:54, Amanda Shuman <amanda.shu...@gmail.com> wrote:
> > > > > > > Hi all,
> > > > > > >
> > > > > > > We have a problem.
> > > > > > > Some of our historical documents have mixed together
> > > > > > > simplified and traditional Chinese characters. There seems to
> > > > > > > be no problem when searching either traditional or simplified
> > > > > > > separately - that is, if a particular string/phrase is all in
> > > > > > > traditional or simplified, it finds it - but it does not find
> > > > > > > the string/phrase if the two different characters (one
> > > > > > > traditional, one simplified) are mixed together in the SAME
> > > > > > > string/phrase.
> > > > > > >
> > > > > > > Has anyone ever handled this problem before? I know some
> > > > > > > libraries seem to have implemented something that can handle
> > > > > > > this, but I'm not sure how they did so!
> > > > > > >
> > > > > > > Amanda
> > > > > > > ------
> > > > > > > Dr. Amanda Shuman
> > > > > > > Post-doc researcher, University of Freiburg, The Maoist Legacy Project
> > > > > > > <http://www.maoistlegacy.uni-freiburg.de/>
> > > > > > > PhD, University of California, Santa Cruz
> > > > > > > http://www.amandashuman.net/
> > > > > > > http://www.prchistoryresources.org/
> > > > > > > Office: +49 (0) 761 203 4925
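To summarize the thread for anyone searching later: the core of the fix for mixed traditional/simplified matching is to normalize both scripts to a single form at both index and query time, so all four variants (A-D above) produce the same tokens. A minimal field type along these lines might look as follows; this is an untested sketch that assumes the ICU analysis jars (analysis-extras) are on the classpath, and the field type name is illustrative:

    <fieldType name="text_han_normalized" class="solr.TextField" positionIncrementGap="10000">
      <analyzer>
        <tokenizer class="solr.ICUTokenizerFactory"/>
        <!-- Map each Traditional character to its Simplified form, so that
             舊小說, 旧小说, 旧小說 and 舊小说 should all normalize to the same tokens -->
        <filter class="solr.ICUTransformFilterFactory" id="Traditional-Simplified"/>
        <filter class="solr.CJKBigramFilterFactory" han="true" outputUnigrams="true"/>
      </analyzer>
    </fieldType>

Since the transform runs in both the index and query analyzer, a query in either script (or a mix) should match documents in either script.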