On 06/20/2014 04:04 AM, Allison, Timothy B. wrote:
Let's say a predominantly English document contains a Chinese sentence.  If the 
English field uses the WhitespaceTokenizer with a basic WordDelimiterFilter, 
the Chinese sentence could be tokenized as one big token (if it doesn't have 
any punctuation, of course) and will be effectively unsearchable...barring use 
of wildcards.

In my experiment with Solr 4.6.1, both StandardTokenizer and ICUTokenizer
generate a token per Han character. So the text is searchable, though
precision suffers. But in your scenario Chinese text is rare, so some
precision loss may not be a real issue.
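
For reference, a minimal sketch (not part of the original exchange) against
the Lucene 4.6 StandardTokenizer API showing the one-token-per-Han-character
behavior; the sample text and class name are made up for illustration:

import java.io.StringReader;
import org.apache.lucene.analysis.standard.StandardTokenizer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.util.Version;

public class HanTokenDemo {
    public static void main(String[] args) throws Exception {
        // Mixed English/Chinese input; StandardTokenizer splits the
        // English on word boundaries and emits each Han character alone.
        StandardTokenizer ts = new StandardTokenizer(Version.LUCENE_46,
                new StringReader("search engine 搜索引擎"));
        CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
        ts.reset();
        while (ts.incrementToken()) {
            System.out.println(term.toString());
        }
        // expected output: search, engine, 搜, 索, 引, 擎
        ts.end();
        ts.close();
    }
}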

Kuro
