I tracked down an example in a sample Solr config of a CJK setup that uses bigrams and no CJK-specific tokenizer:
<fieldType name="text_cjk" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <!-- normalize width before bigram, as e.g. half-width dakuten combine -->
    <filter class="solr.CJKWidthFilterFactory"/>
    <!-- for any non-CJK -->
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.CJKBigramFilterFactory"/>
  </analyzer>
</fieldType>

Seems like it could be a good approach, but I also saw mention of an
ICUTokenizer that might be well suited to Chinese text, though it may be
intended for multilingual fields
(https://cwiki.apache.org/confluence/display/solr/Tokenizers#Tokenizers-ICUTokenizer).
Does anyone have any familiarity with ICU vs. Standard for a field that
will store only Chinese text? (A rough sketch of what an ICU-based setup
might look like is below the quoted message.)

-Tom

On Fri, Dec 5, 2014 at 5:41 PM, Tom Zimmermann <zimm.to...@gmail.com> wrote:

> Thanks for the links. The dzone link was nice and concise, but
> unfortunately it makes use of the now-deprecated CJK tokenizer. Does
> anyone out there have examples or experience working with the recommended
> replacement for CJK?
>
> Thanks,
> TZ
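For comparison, a rough sketch of what an ICU-based setup might look like.
This is untested; the field name "text_zh" is just a placeholder, and
ICUTokenizerFactory ships in the analysis-extras contrib, so the ICU jars
(lucene-analyzers-icu and icu4j) need to be on Solr's classpath. My
understanding is that ICUTokenizer segments Chinese into words via ICU's
dictionary-based break iterator rather than emitting one token per
character, which is why the bigram filter is dropped here:

<!-- hypothetical "text_zh" field type; assumes the analysis-extras
     contrib jars are on the classpath -->
<fieldType name="text_zh" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <!-- per-script segmentation; dictionary-based word breaks for Chinese,
         as I understand ICU's behavior -->
    <tokenizer class="solr.ICUTokenizerFactory"/>
    <!-- same width normalization and lowercasing as the bigram setup above -->
    <filter class="solr.CJKWidthFilterFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

The trade-off, as I understand it: bigrams tend to give higher recall but
noisier matches, while dictionary segmentation gives cleaner tokens but can
miss words that aren't in the dictionary.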