Hi all,

Thanks for this enlightening thread. As it happens, at Stanford Libraries we're currently working on upgrading from Solr 4 to 7, and we're looking forward to using the new dictionary-based word splitting in the ICUTokenizer.
We have many of the same challenges Amanda mentioned, and thanks to the advice on this thread, we've taken a stab at a CharFilter to do the traditional -> simplified transformation [1]. It seems promising, and we've sent it out to our subject matter experts for evaluation.

Thanks,
Chris

[1] https://github.com/sul-dlss/CJKFilterUtils/blob/master/src/main/java/edu/stanford/lucene/analysis/ICUTransformCharFilter.java

On 2018/07/24 12:54:35, Tomoko Uchida <t...@gmail.com> wrote:
> Hi Amanda,
>
> > do all I need to do is modify the settings from smartChinese to the ones
> > you posted here
>
> Yes, the settings I posted should work for you, at least partially.
> If you are happy with the results, it's OK!
> But please take this as a starting point, because it's not perfect.
>
> > Or do I need to still do something with the SmartChineseAnalyzer?
>
> Try the settings; then, if you notice something strange and want to know
> why and how to solve it, that may be the time to dive deep in. ;)
> I cannot explain how analyzers work here... but you should start off with
> the Solr documentation:
> https://lucene.apache.org/solr/guide/7_0/understanding-analyzers-tokenizers-and-filters.html
>
> Regards,
> Tomoko
>
> On Tue, Jul 24, 2018 at 21:08, Amanda Shuman <am...@gmail.com> wrote:
> > Hi Tomoko,
> >
> > Thanks so much for this explanation - I did not even know this was
> > possible! I will try it out, but I have one question: do all I need to
> > do is modify the settings from smartChinese to the ones you posted here:
> >
> >   <analyzer>
> >     <charFilter class="solr.ICUNormalizer2CharFilterFactory"/>
> >     <tokenizer class="solr.HMMChineseTokenizerFactory"/>
> >     <filter class="solr.ICUTransformFilterFactory" id="Traditional-Simplified"/>
> >   </analyzer>
> >
> > Or do I still need to do something with the SmartChineseAnalyzer? I did
> > not quite understand this in your first message:
> >
> > "I think you need two steps if you want to use HMMChineseTokenizer
> > correctly.
> > 1. transform all traditional characters to simplified ones and save to
> >    temporary files. I do not have a clear idea for doing this, but you
> >    can create a Java program that calls Lucene's ICUTransformFilter
> > 2. then, index to Solr using SmartChineseAnalyzer."
> >
> > My understanding is that with the new settings you posted, I don't need
> > to do these steps. Is that correct? Otherwise, I don't really know how
> > to do step 1 with the Java program....
> >
> > Thanks!
> > Amanda
> >
> > ------
> > Dr. Amanda Shuman
> > Post-doc researcher, University of Freiburg, The Maoist Legacy Project
> > <http://www.maoistlegacy.uni-freiburg.de/>
> > PhD, University of California, Santa Cruz
> > http://www.amandashuman.net/
> > http://www.prchistoryresources.org/
> > Office: +49 (0) 761 203 4925
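For anyone following along: the analyzer chain Tomoko suggests would normally live inside a fieldType definition in the Solr schema. A possible sketch is below; the field type name `text_zh` is my own placeholder, not something from this thread, and note that the smartcn and ICU factories ship in Solr's analysis-extras contrib, so those jars need to be on the classpath:

```xml
<fieldType name="text_zh" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <!-- Unicode normalization before tokenization -->
    <charFilter class="solr.ICUNormalizer2CharFilterFactory"/>
    <!-- Dictionary/HMM-based word segmentation for Chinese -->
    <tokenizer class="solr.HMMChineseTokenizerFactory"/>
    <!-- Fold traditional characters to simplified after tokenization -->
    <filter class="solr.ICUTransformFilterFactory" id="Traditional-Simplified"/>
  </analyzer>
</fieldType>
```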
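If the single-pass analyzer turns out not to be enough and you do go the two-step route Tomoko describes (convert traditional to simplified offline, then index with SmartChineseAnalyzer), step 1 is conceptually just a character mapping over the input text. A real implementation would use ICU's "Traditional-Simplified" transliterator (e.g. ICU4J's `Transliterator`, or the ICUTransformCharFilter linked above); the toy table below is a hand-picked subset of my own, purely to sketch the idea:

```java
import java.util.Map;

// Toy sketch of the offline traditional -> simplified conversion step.
// A production version would delegate to ICU's "Traditional-Simplified"
// transform rather than a tiny hand-picked table like this one.
public class T2SDemo {
    // Illustrative subset only; the real ICU mapping covers thousands
    // of characters and context-sensitive rules.
    static final Map<Character, Character> T2S = Map.of(
            '體', '体', '國', '国', '學', '学', '馬', '马');

    static String toSimplified(String text) {
        StringBuilder sb = new StringBuilder(text.length());
        for (char c : text.toCharArray()) {
            // Pass unmapped characters (including ASCII) through unchanged.
            sb.append(T2S.getOrDefault(c, c));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(toSimplified("中國體育學")); // prints 中国体育学
    }
}
```

In the real pipeline you would run this over each document, write the converted text to temporary files, and index those with SmartChineseAnalyzer.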