On 06/20/2014 04:04 AM, Allison, Timothy B. wrote:
Let's say a predominantly English document contains a Chinese sentence. If the
English field uses the WhitespaceTokenizer with a basic WordDelimiterFilter,
the Chinese sentence could be tokenized as one big token (if it doesn't have
any punctuation, of course) and will be effectively unsearchable...barring use
of wildcards.
In my experiment with Solr 4.6.1, both StandardTokenizer and ICUTokenizer
generate a token per Han character. So the Chinese text is searchable,
though precision suffers. But in your scenario Chinese text is rare, so
some precision loss may not be a real issue.
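
If you want to see that behavior directly, here is a minimal sketch that
runs Lucene's StandardTokenizer over mixed English/Chinese input. It is
written against the newer Lucene API where the tokenizer has a no-argument
constructor; the 4.6-era constructor also takes a Version and a Reader, so
adjust for your version. The sample sentence and class name are just for
illustration.

    import java.io.StringReader;
    import org.apache.lucene.analysis.standard.StandardTokenizer;
    import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

    public class TokenizeDemo {
        public static void main(String[] args) throws Exception {
            // Mixed English/Chinese input.
            String text = "search 搜索引擎 engine";

            StandardTokenizer tokenizer = new StandardTokenizer();
            tokenizer.setReader(new StringReader(text));
            CharTermAttribute term = tokenizer.addAttribute(CharTermAttribute.class);

            tokenizer.reset();
            while (tokenizer.incrementToken()) {
                System.out.println(term.toString());
            }
            tokenizer.end();
            tokenizer.close();

            // With per-ideograph segmentation you should see the Han
            // characters emitted as separate tokens (搜, 索, 引, 擎)
            // alongside the English words.
        }
    }

Querying for a multi-character word then matches as a phrase of
single-character tokens, which is why recall is fine but precision drops.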
Kuro