On 06/20/2014 04:04 AM, Allison, Timothy B. wrote:
Let's say a predominantly English document contains a Chinese sentence. If the
English field uses the WhitespaceTokenizer with a basic WordDelimiterFilter,
the Chinese sentence could be tokenized as one big token (if it doesn't have
any punctuation, of course) and will be effectively unsearchable...barring use
of wildcards.
In my experiment with Solr 4.6.1, both StandardTokenizer and ICUTokenizer
generate a token per Han character. So the Chinese text is searchable,
though precision suffers. But in your scenario Chinese text is rare, so
some precision loss may not be a real issue.
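
If you want to see that behavior directly, here is a minimal sketch that
runs Lucene's StandardTokenizer over mixed English/Chinese input. It is
written against the newer Lucene API where the tokenizer has a no-argument
constructor; the 4.6-era constructor also takes a Version and a Reader, so
adjust for your version. The sample sentence and class name are just for
illustration.

    import java.io.StringReader;
    import org.apache.lucene.analysis.standard.StandardTokenizer;
    import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

    public class TokenizeDemo {
        public static void main(String[] args) throws Exception {
            // Mixed English/Chinese input.
            String text = "search 搜索引擎 engine";

            StandardTokenizer tokenizer = new StandardTokenizer();
            tokenizer.setReader(new StringReader(text));
            CharTermAttribute term = tokenizer.addAttribute(CharTermAttribute.class);

            tokenizer.reset();
            while (tokenizer.incrementToken()) {
                System.out.println(term.toString());
            }
            tokenizer.end();
            tokenizer.close();

            // With per-ideograph segmentation you should see the Han
            // characters emitted as separate tokens (搜, 索, 引, 擎)
            // alongside the English words.
        }
    }

Querying for a multi-character word then matches as a phrase of
single-character tokens, which is why recall is fine but precision drops.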
Kuro