Re: Question regarding searching Chinese characters

2018-08-14 Thread Christopher Beer
Hi all, Thanks for this enlightening thread. As it happens, at Stanford Libraries we’re currently working on upgrading from Solr 4 to 7 and we’re looking forward to using the new dictionary-based word splitting in the ICUTokenizer. We have many of the same challenges as Amanda mentioned, and th
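A minimal field type using the ICUTokenizer mentioned above might look roughly like the following; the type name is made up for illustration, and solr.ICUTokenizerFactory requires Solr's analysis-extras contrib (the ICU jars) on the classpath:

    <fieldType name="text_cjk_icu" class="solr.TextField" positionIncrementGap="100">
      <analyzer>
        <!-- ICUTokenizer segments CJK text using ICU's dictionary-based break iteration -->
        <tokenizer class="solr.ICUTokenizerFactory"/>
        <!-- typical additions, not from this thread: fold full/half width forms and lowercase Latin text -->
        <filter class="solr.CJKWidthFilterFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
      </analyzer>
    </fieldType>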

Re: Question regarding searching Chinese characters

2018-07-24 Thread Tomoko Uchida
Hi Amanda, > do all I need to do is modify the settings from smartChinese to the ones you posted here Yes, the settings I posted should work for you, at least partially. If you are happy with the results, it's OK! But please take this as a starting point because it's not perfect. > Or do I need
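If it helps to see where such settings plug in, a field in the schema would point at the custom type roughly like this; the field and type names are hypothetical, and a full reindex is needed after changing the analysis chain:

    <field name="text_zh" type="text_zh_trad_simp" indexed="true" stored="true"/>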

Re: Question regarding searching Chinese characters

2018-07-24 Thread Amanda Shuman
Hi Tomoko, Thanks so much for this explanation - I did not even know this was possible! I will try it out but I have one question: do all I need to do is modify the settings from smartChinese to the ones you posted here: Or do I need to still do something with the SmartChineseAnalyzer

Re: Question regarding searching Chinese characters

2018-07-20 Thread Tomoko Uchida
Yes, while traditional-to-simplified transformation would be out of the scope of Unicode normalization, you would still want to add ICUNormalizer2CharFilterFactory anyway :) Let me refine my example settings: Regards, Tomoko On Sat, Jul 21, 2018 at 2:54, Alexandre Rafalovitch wrote: > Would ICUNormalize
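The inline settings did not survive the archive; judging from the surrounding discussion, the refined example presumably combined an ICU normalization char filter with the HMM tokenizer and the Traditional-Simplified transform, roughly along these lines (a sketch, not the exact posting):

    <fieldType name="text_zh_trad_simp" class="solr.TextField" positionIncrementGap="100">
      <analyzer>
        <!-- Unicode normalization (NFKC with case folding) applied to the raw text before tokenization -->
        <charFilter class="solr.ICUNormalizer2CharFilterFactory"/>
        <!-- dictionary-based segmentation for Simplified Chinese -->
        <tokenizer class="solr.HMMChineseTokenizerFactory"/>
        <!-- map Traditional forms to Simplified so both variants match the same terms -->
        <filter class="solr.ICUTransformFilterFactory" id="Traditional-Simplified"/>
      </analyzer>
    </fieldType>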

Re: Question regarding searching Chinese characters

2018-07-20 Thread Alexandre Rafalovitch
Would ICUNormalizer2CharFilterFactory do? Or at least serve as a template of what needs to be done. Regards, Alex. On 20 July 2018 at 12:40, Walter Underwood wrote: > Looks like we need a charfilter version of the ICU transforms. That could run > before the tokenizer. > > I’ve never built a
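For reference, the char filter in question sits at the top of the analyzer and, with its default nfkc_cf form, handles Unicode normalization such as width and case folding rather than the traditional-to-simplified mapping itself (the parameters shown are the documented defaults):

    <analyzer>
      <!-- char filters always run before the tokenizer -->
      <charFilter class="solr.ICUNormalizer2CharFilterFactory" name="nfkc_cf" mode="compose"/>
      <tokenizer class="solr.HMMChineseTokenizerFactory"/>
    </analyzer>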

Re: Question regarding searching Chinese characters

2018-07-20 Thread Walter Underwood
Looks like we need a charfilter version of the ICU transforms. That could run before the tokenizer. I’ve never built a charfilter, but it seems like this would be a good first project for someone who wants to contribute. wunder Walter Underwood wun...@wunderwood.org http://observer.wunderwood.o

Re: Question regarding searching Chinese characters

2018-07-20 Thread Tomoko Uchida
Exactly. More concretely, the starting point is replacing your analyzer and seeing if the results are as expected. Then look into other filters if your requirements are not met. Just a reminder: HMMChineseTokenizerFactory does not handle traditional characters, as I noted in a previous po

Re: Question regarding searching Chinese characters

2018-07-20 Thread Walter Underwood
I expect that this is the line that does the transformation: This mapping is a standard feature of ICU. More info on ICU transforms is in this doc, though not much detail on this particular transform. http://userguide.icu-project.org/transforms/general wunder Walter Underwood wun...@wunde
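The quoted line did not survive the archive; given the configuration being discussed, it was presumably the transform filter declaration, something like:

    <filter class="solr.ICUTransformFilterFactory" id="Traditional-Simplified"/>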

Re: Question regarding searching Chinese characters

2018-07-20 Thread Susheel Kumar
I think so. I used it exactly as in the GitHub repo. On Fri, Jul 20, 2018 at 10:12 AM, Amanda Shuman wrote: > Thanks! That does indeed look promising... This can be added on top of > Smart Chinese, right? Or is it an alternative? > > > -- > Dr. Amanda Shum

Re: Question regarding searching Chinese characters

2018-07-20 Thread Tomoko Uchida
Hi, There is ICUTransformFilter (included in the Solr distribution) which should also work for you. See the example settings: https://lucene.apache.org/solr/guide/7_4/filter-descriptions.html#icu-transform-filter Combine it with HMMChineseTokenizer. https://lucene.apache.org/solr/guide/7_4/langu
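A minimal sketch of that combination (Traditional-Simplified is one of the transform identifiers ICU ships with; the enclosing fieldType is assumed):

    <analyzer>
      <!-- segments Simplified Chinese text into words -->
      <tokenizer class="solr.HMMChineseTokenizerFactory"/>
      <!-- folds Traditional characters to their Simplified equivalents at index and query time -->
      <filter class="solr.ICUTransformFilterFactory" id="Traditional-Simplified"/>
    </analyzer>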

Re: Question regarding searching Chinese characters

2018-07-20 Thread Amanda Shuman
Thanks! That does indeed look promising... This can be added on top of Smart Chinese, right? Or is it an alternative? -- Dr. Amanda Shuman Post-doc researcher, University of Freiburg, The Maoist Legacy Project PhD, University of California, Santa Cru

Re: Question regarding searching Chinese characters

2018-07-20 Thread Susheel Kumar
I think CJKFoldingFilter will work for you. I put 舊小說 in the index and then each of A, B, C, or D in the query, and they seem to match; CJKFF is transforming the 舊 to 旧. On Fri, Jul 20, 2018 at 9:08 AM, Susheel Kumar wrote: > Lack of my chinese language knowledge but if you want, I can do quic

Re: Question regarding searching Chinese characters

2018-07-20 Thread Susheel Kumar
I lack Chinese language knowledge, but if you want, I can do a quick test for you in the Analysis tab if you give me what to put in the index and query windows... On Fri, Jul 20, 2018 at 8:59 AM, Susheel Kumar wrote: > Have you tried to use CJKFoldingFilter https://github.com/sul-dlss/ > CJKFoldingF

Re: Question regarding searching Chinese characters

2018-07-20 Thread Susheel Kumar
Have you tried using CJKFoldingFilter (https://github.com/sul-dlss/CJKFoldingFilter)? I am not sure if this would cover your use case, but I am using this filter and have had no issues so far. Thnx On Fri, Jul 20, 2018 at 8:44 AM, Amanda Shuman wrote: > Thanks, Alex - I have seen a few of those links but
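For anyone following along, usage would be roughly as in that repository's README, layered onto a Smart Chinese analysis chain; the factory class name below is an assumption based on the project's packaging, and the filter's jar has to be added to Solr's classpath, so treat this as a sketch rather than a tested config:

    <fieldType name="text_zh_folded" class="solr.TextField" positionIncrementGap="100">
      <analyzer>
        <tokenizer class="solr.HMMChineseTokenizerFactory"/>
        <!-- folds variant/traditional forms (e.g. 舊 to 旧) so traditional and simplified queries match -->
        <filter class="edu.stanford.lucene.analysis.CJKFoldingFilterFactory"/>
        <filter class="solr.CJKWidthFilterFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
      </analyzer>
    </fieldType>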

Re: Question regarding searching Chinese characters

2018-07-20 Thread Amanda Shuman
Thanks, Alex - I have seen a few of those links but never considered transliteration! We use Lucene's Smart Chinese analyzer. The issue is basically what is laid out in the old blogspot post, namely this point: "Why approach CJK resource discovery differently? 2. Search results must be as scrip

Re: Question regarding searching Chinese characters

2018-07-20 Thread Alexandre Rafalovitch
This is probably your starting point, if you have not read it already: https://lucene.apache.org/solr/guide/7_4/language-analysis.html Otherwise, I think your answer lies somewhere around using ICU4J, IBM's library for dealing with Unicode: http://site.icu-project.org/ (mentioned on the same page above) Specifical

Question regarding searching Chinese characters

2018-07-20 Thread Amanda Shuman
Hi all, We have a problem. Some of our historical documents mix simplified and traditional Chinese characters. There seems to be no problem when searching either traditional or simplified separately - that is, if a particular string/phrase is all in traditional or all in simplified, it finds it - but