Thank you, Alex, Kuro and Simon. I've had a chance to look into this a bit more.
I was under the (wrong) belief that the ICUTokenizer splits on individual Chinese characters like the StandardAnalyzer, after (mis)reading these two sources (http://www.hathitrust.org/blogs/large-scale-search/multilingual-issues-part-1-word-segmentation and https://issues.apache.org/jira/browse/LUCENE-2906). However, after some brief experimentation and reading http://www.elasticsearch.org/guide/en/elasticsearch/guide/current/icu-tokenizer.html, I learned that the ICUTokenizer uses dictionary lookup to perform some basic word segmentation. For example, it tokenizes 施瓦辛格生于奧地利施蒂利亞州的塔爾 (roughly, "Schwarzenegger was born in Thal, in the Austrian state of Styria") as:

施 瓦 辛 格 生于 奧地利 施 蒂 利亞 州 的 塔 爾

My initial concern was how this would interact with the CJKBigramFilter. After further experimentation and a look at the test cases, I found that (thanks to Robert Muir) it "just works": even though the ICUTokenizer segments some multi-character words, the CJKBigramFilter returns the same overlapping bigrams whether it is fed by the StandardTokenizer or the ICUTokenizer. (A rough sketch of the comparison I ran is at the very end of this message, below Kuro's note.)

So, I'm left with this as a candidate for the "text_all" field (I'll probably add a stop filter, too):

<fieldType name="text_all" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.ICUTokenizerFactory"/>
    <!-- normalize width before bigram, as e.g. half-width dakuten combine -->
    <filter class="solr.CJKWidthFilterFactory"/>
    <!-- for any non-CJK -->
    <filter class="solr.ICUFoldingFilterFactory"/>
    <filter class="solr.CJKBigramFilterFactory" outputUnigrams="true"/>
  </analyzer>
</fieldType>

Any and all feedback welcome. Again, the goal is to create a field that is as robust as possible against all languages, as a fallback to the language-specific fields.

Thank you.

Best,
Tim

-----Original Message-----
From: T. Kuro Kurosaka [mailto:k...@healthline.com]
Sent: Friday, June 20, 2014 5:38 PM
To: solr-user@lucene.apache.org
Subject: Re: ICUTokenizer or StandardTokenizer or ??? for "text_all" type field that might include non-whitespace langs

On 06/20/2014 04:04 AM, Allison, Timothy B. wrote:
> Let's say a predominantly English document contains a Chinese sentence. If
> the English field uses the WhitespaceTokenizer with a basic
> WordDelimiterFilter, the Chinese sentence could be tokenized as one big token
> (if it doesn't have any punctuation, of course) and will be effectively
> unsearchable...barring use of wildcards.

In my experiment with Solr 4.6.1, both StandardTokenizer and ICUTokenizer generate a token per Han character, so the text is searchable, though precision suffers. But in your scenario Chinese text is rare, so some precision loss may not be a real issue.

Kuro
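
P.S. Here is roughly the kind of throwaway comparison I ran, in case it's useful to anyone else. Treat it as a sketch only: it assumes the Lucene 4.6-era constructors (these signatures changed in later releases), and the class name, helper method, and sample sentence are placeholders of my own, not anything that ships with Lucene or Solr.

// Hypothetical throwaway class; assumes Lucene 4.6-era APIs.
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.cjk.CJKBigramFilter;
import org.apache.lucene.analysis.icu.segmentation.ICUTokenizer;
import org.apache.lucene.analysis.standard.StandardTokenizer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.util.Version;

public class BigramCompare {

  // Drain a TokenStream and collect its terms in order.
  static List<String> dump(TokenStream ts) throws Exception {
    List<String> terms = new ArrayList<String>();
    CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
    ts.reset();
    while (ts.incrementToken()) {
      terms.add(term.toString());
    }
    ts.end();
    ts.close();
    return terms;
  }

  public static void main(String[] args) throws Exception {
    String text = "施瓦辛格生于奧地利施蒂利亞州的塔爾";

    // StandardTokenizer emits one token per Han character...
    Tokenizer std = new StandardTokenizer(Version.LUCENE_46, new StringReader(text));
    List<String> stdTerms = dump(new CJKBigramFilter(std, CJKBigramFilter.HAN, true));

    // ...while ICUTokenizer does dictionary-based segmentation of the same text...
    Tokenizer icu = new ICUTokenizer(new StringReader(text));
    List<String> icuTerms = dump(new CJKBigramFilter(icu, CJKBigramFilter.HAN, true));

    // ...but after CJKBigramFilter (outputUnigrams=true, matching the field type
    // above) the two chains should emit the same terms.
    System.out.println("standard + bigrams: " + stdTerms);
    System.out.println("icu + bigrams:      " + icuTerms);
    System.out.println("identical? " + stdTerms.equals(icuTerms));
  }
}

If the two term lists print out identical, swapping in the ICUTokenizer shouldn't change what actually lands in the index for CJK text; that is what I saw in my quick test described above.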