I think so. I used the exact same configuration as on GitHub:

<fieldType name="text_cjk" class="solr.TextField" positionIncrementGap="10000" autoGeneratePhraseQueries="false">
  <analyzer>
    <tokenizer class="solr.ICUTokenizerFactory"/>
    <filter class="solr.CJKWidthFilterFactory"/>
    <filter class="edu.stanford.lucene.analysis.CJKFoldingFilterFactory"/>
    <filter class="solr.ICUTransformFilterFactory" id="Traditional-Simplified"/>
    <filter class="solr.ICUTransformFilterFactory" id="Katakana-Hiragana"/>
    <filter class="solr.ICUFoldingFilterFactory"/>
    <filter class="solr.CJKBigramFilterFactory" han="true" hiragana="true" katakana="true" hangul="true" outputUnigrams="true"/>
  </analyzer>
</fieldType>
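If you would rather keep the Smart Chinese tokenizer instead of switching to ICUTokenizerFactory, something like the following might work. This is an untested sketch: it assumes the CJKFoldingFilter jar and the analysis-extras ICU jars are on Solr's classpath, and the field type name is illustrative.

    <fieldType name="text_zh_folded" class="solr.TextField" positionIncrementGap="10000">
      <analyzer>
        <!-- Smart Chinese HMM-based tokenizer (from the analysis-extras contrib) -->
        <tokenizer class="solr.HMMChineseTokenizerFactory"/>
        <!-- Fold variant CJK forms, then map Traditional characters to Simplified -->
        <filter class="edu.stanford.lucene.analysis.CJKFoldingFilterFactory"/>
        <filter class="solr.ICUTransformFilterFactory" id="Traditional-Simplified"/>
        <filter class="solr.LowerCaseFilterFactory"/>
      </analyzer>
    </fieldType>

The Analysis tab in the Solr admin UI is the quickest way to check whether a chain like this produces identical tokens for a traditional input and its simplified equivalent.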
On Fri, Jul 20, 2018 at 10:12 AM, Amanda Shuman <amanda.shu...@gmail.com> wrote:
> Thanks! That does indeed look promising... This can be added on top of
> Smart Chinese, right? Or is it an alternative?
>
> ------
> Dr. Amanda Shuman
> Post-doc researcher, University of Freiburg, The Maoist Legacy Project
> <http://www.maoistlegacy.uni-freiburg.de/>
> PhD, University of California, Santa Cruz
> http://www.amandashuman.net/
> http://www.prchistoryresources.org/
> Office: +49 (0) 761 203 4925
>
> On Fri, Jul 20, 2018 at 3:11 PM, Susheel Kumar <susheel2...@gmail.com> wrote:
> > I think CJKFoldingFilter will work for you. I put 舊小說 in the index
> > and then each of A, B, C, or D in the query, and they seem to match;
> > CJKFF is transforming the 舊 to 旧.
> >
> > On Fri, Jul 20, 2018 at 9:08 AM, Susheel Kumar <susheel2...@gmail.com> wrote:
> > > I lack Chinese language knowledge, but if you want, I can do a quick
> > > test for you in the Analysis tab if you can give me what to put in
> > > the index and query windows...
> > >
> > > On Fri, Jul 20, 2018 at 8:59 AM, Susheel Kumar <susheel2...@gmail.com> wrote:
> > > > Have you tried CJKFoldingFilter?
> > > > https://github.com/sul-dlss/CJKFoldingFilter
> > > > I am not sure if this would cover your use case, but I am using
> > > > this filter and so far no issues.
> > > >
> > > > Thnx
> > > >
> > > > On Fri, Jul 20, 2018 at 8:44 AM, Amanda Shuman <amanda.shu...@gmail.com> wrote:
> > > > > Thanks, Alex - I have seen a few of those links but never
> > > > > considered transliteration! We use Lucene's Smart Chinese
> > > > > analyzer. The issue is basically what is laid out in the old
> > > > > blogspot post, namely this point:
> > > > >
> > > > > "Why approach CJK resource discovery differently?
> > > > >
> > > > > 2. Search results must be as script agnostic as possible.
> > > > >
> > > > > There is more than one way to write each word.
> > > > > "Simplified" characters were emphasized for printed materials in
> > > > > mainland China starting in the 1950s; "Traditional" characters
> > > > > were used in printed materials prior to the 1950s, and are still
> > > > > used in Taiwan, Hong Kong and Macau today.
> > > > > Since the characters are distinct, it's as if Chinese materials
> > > > > are written in two scripts.
> > > > > Another way to think about it: every written Chinese word has at
> > > > > least two completely different spellings. And it can be
> > > > > mix-n-match: a word can be written with one traditional and one
> > > > > simplified character.
> > > > > Example: Given a user query 舊小說 (traditional for old fiction),
> > > > > the results should include matches for 舊小說 (traditional) and
> > > > > 旧小说 (simplified characters for old fiction)"
> > > > >
> > > > > So, using the example provided above, we are dealing with
> > > > > materials produced in the 1950s-1970s that do even weirder things
> > > > > like:
> > > > >
> > > > > A. 舊小說
> > > > >
> > > > > can also be
> > > > >
> > > > > B. 旧小说 (all simplified)
> > > > > or
> > > > > C. 旧小說 (first character simplified, last character traditional)
> > > > > or
> > > > > D. 舊小说 (first character traditional, last character simplified)
> > > > >
> > > > > Thankfully the middle character was never simplified in recent
> > > > > times.
> > > > >
> > > > > From a historical standpoint, the mixed nature of the characters
> > > > > in the same word/phrase is because not all simplified characters
> > > > > were adopted at the same time by everyone uniformly (good
> > > > > times...).
> > > > >
> > > > > The problem seems to be that Solr can easily handle A or B above,
> > > > > but NOT C or D using the Smart Chinese analyzer. I'm not really
> > > > > sure how to change that at this point... maybe I should figure
> > > > > out how to contact the creators of the analyzer and ask them?
> > > > >
> > > > > Amanda
> > > > >
> > > > > ------
> > > > > Dr.
> > > > > Amanda Shuman
> > > > > Post-doc researcher, University of Freiburg, The Maoist Legacy Project
> > > > > <http://www.maoistlegacy.uni-freiburg.de/>
> > > > > PhD, University of California, Santa Cruz
> > > > > http://www.amandashuman.net/
> > > > > http://www.prchistoryresources.org/
> > > > > Office: +49 (0) 761 203 4925
> > > > >
> > > > > On Fri, Jul 20, 2018 at 1:40 PM, Alexandre Rafalovitch <arafa...@gmail.com> wrote:
> > > > > > This is probably your start, if not read already:
> > > > > > https://lucene.apache.org/solr/guide/7_4/language-analysis.html
> > > > > >
> > > > > > Otherwise, I think your answer would be somewhere around using
> > > > > > ICU4J, IBM's library for dealing with Unicode:
> > > > > > http://site.icu-project.org/ (mentioned on the same page above)
> > > > > > Specifically, transformations:
> > > > > > http://userguide.icu-project.org/transforms/general
> > > > > >
> > > > > > With that, maybe you map both alphabets into Latin. I did that
> > > > > > once for Thai for a demo:
> > > > > > https://github.com/arafalov/solr-thai-test/blob/master/collection1/conf/schema.xml#L34
> > > > > >
> > > > > > The challenge is to figure out all the magic rules for that.
> > > > > > You'd have to dig through the ICU documentation and other web
> > > > > > pages. I found this one, for example:
> > > > > > http://avajava.com/tutorials/lessons/what-are-the-system-transliterators-available-with-icu4j.html
> > > > > >
> > > > > > There is also a 12-part series on Solr and Asian text
> > > > > > processing, though it is a bit old now:
> > > > > > http://discovery-grindstone.blogspot.com/
> > > > > >
> > > > > > Hope one of these things helps.
> > > > > >
> > > > > > Regards,
> > > > > >    Alex.
> > > > > >
> > > > > > On 20 July 2018 at 03:54, Amanda Shuman <amanda.shu...@gmail.com> wrote:
> > > > > > > Hi all,
> > > > > > >
> > > > > > > We have a problem.
> > > > > > > Some of our historical documents have mixed together
> > > > > > > simplified and traditional Chinese characters. There seems to
> > > > > > > be no problem when searching either traditional or simplified
> > > > > > > separately - that is, if a particular string/phrase is all in
> > > > > > > traditional or simplified, it finds it - but it does not find
> > > > > > > the string/phrase if the two different characters (one
> > > > > > > traditional, one simplified) are mixed together in the SAME
> > > > > > > string/phrase.
> > > > > > >
> > > > > > > Has anyone ever handled this problem before? I know some
> > > > > > > libraries seem to have implemented something that can handle
> > > > > > > this, but I'm not sure how they did so!
> > > > > > >
> > > > > > > Amanda
> > > > > > > ------
> > > > > > > Dr. Amanda Shuman
> > > > > > > Post-doc researcher, University of Freiburg, The Maoist Legacy Project
> > > > > > > <http://www.maoistlegacy.uni-freiburg.de/>
> > > > > > > PhD, University of California, Santa Cruz
> > > > > > > http://www.amandashuman.net/
> > > > > > > http://www.prchistoryresources.org/
> > > > > > > Office: +49 (0) 761 203 4925
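To summarize the thread for anyone searching later: the core of the fix for mixed traditional/simplified matching is to normalize both scripts to a single form at both index and query time, so all four variants (A-D above) produce the same tokens. A minimal field type along these lines might look as follows; this is an untested sketch that assumes the ICU analysis jars (analysis-extras) are on the classpath, and the field type name is illustrative:

    <fieldType name="text_han_normalized" class="solr.TextField" positionIncrementGap="10000">
      <analyzer>
        <tokenizer class="solr.ICUTokenizerFactory"/>
        <!-- Map each Traditional character to its Simplified form, so that
             舊小說, 旧小说, 旧小說 and 舊小说 should all normalize to the same tokens -->
        <filter class="solr.ICUTransformFilterFactory" id="Traditional-Simplified"/>
        <filter class="solr.CJKBigramFilterFactory" han="true" outputUnigrams="true"/>
      </analyzer>
    </fieldType>

Since the transform runs in both the index and query analyzer, a query in either script (or a mix) should match documents in either script.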