Thank you, Alex, Kuro and Simon.  I've had a chance to look into this a bit 
more.

I was under the (wrong) belief that the ICUTokenizer splits Chinese text into 
individual characters, as the StandardAnalyzer does, after (mis)reading these 
two sources:
http://www.hathitrust.org/blogs/large-scale-search/multilingual-issues-part-1-word-segmentation
and https://issues.apache.org/jira/browse/LUCENE-2906.

However, after some brief experimentation and reading 
http://www.elasticsearch.org/guide/en/elasticsearch/guide/current/icu-tokenizer.html,
I learned that the ICUTokenizer uses dictionary lookup to perform some basic 
word segmentation.  For example, a sentence meaning roughly "Schwarzenegger 
was born in Thal, Styria, Austria" comes out as:

施 瓦 辛 格 生于 奧地利 施 蒂 利亞 州 的 塔 爾
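
In case it's useful to anyone else, the experiment amounted to roughly the 
sketch below.  This is my own quick check against the Lucene 4.x analysis API 
bundled with Solr 4.x (the Tokenizer constructors changed in later releases), 
and the input string is just the sample sentence above, unsegmented:

    import java.io.StringReader;

    import org.apache.lucene.analysis.Tokenizer;
    import org.apache.lucene.analysis.icu.segmentation.ICUTokenizer;
    import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

    public class ICUTokenizerDemo {
      public static void main(String[] args) throws Exception {
        String text = "施瓦辛格生于奧地利施蒂利亞州的塔爾";
        // ICUTokenizer consults ICU's dictionary-based break iterator for
        // CJK, so some multi-character words come back as single tokens.
        Tokenizer tokenizer = new ICUTokenizer(new StringReader(text));
        CharTermAttribute term = tokenizer.addAttribute(CharTermAttribute.class);
        tokenizer.reset();
        while (tokenizer.incrementToken()) {
          System.out.print(term.toString() + " ");
        }
        tokenizer.end();
        tokenizer.close();
      }
    }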
 
My initial concern was how this would play with the CJKBigramFilter.  After 
further brief experimentation and a look at the test cases, I found that 
(thanks to Robert Muir) it "just works."  Even though the ICUTokenizer does 
some word-level segmentation, the CJKBigramFilter returns the same overlapping 
bigrams whether it follows the StandardTokenizer or the ICUTokenizer.
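
For the record, the comparison was essentially the rough sketch below (again 
the Lucene 4.x API, and just printed and eyeballed for equality; this is my ad 
hoc check, not one of the Lucene test cases):

    import java.io.StringReader;
    import java.util.ArrayList;
    import java.util.List;

    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.cjk.CJKBigramFilter;
    import org.apache.lucene.analysis.icu.segmentation.ICUTokenizer;
    import org.apache.lucene.analysis.standard.StandardTokenizer;
    import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
    import org.apache.lucene.util.Version;

    public class BigramComparison {

      // Drain a TokenStream into a list of term strings.
      static List<String> dump(TokenStream ts) throws Exception {
        List<String> terms = new ArrayList<String>();
        CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
        ts.reset();
        while (ts.incrementToken()) {
          terms.add(term.toString());
        }
        ts.end();
        ts.close();
        return terms;
      }

      public static void main(String[] args) throws Exception {
        String text = "施瓦辛格生于奧地利施蒂利亞州的塔爾";

        // StandardTokenizer emits one token per Han character...
        TokenStream viaStandard = new CJKBigramFilter(
            new StandardTokenizer(Version.LUCENE_46, new StringReader(text)));

        // ...while ICUTokenizer pre-groups some words, yet the bigram filter
        // emits the same overlapping Han bigrams either way.
        TokenStream viaIcu = new CJKBigramFilter(
            new ICUTokenizer(new StringReader(text)));

        System.out.println(dump(viaStandard));
        System.out.println(dump(viaIcu));
      }
    }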

So, I'm left with this as a candidate for the "text_all" field (I'll probably 
add a stop filter, too):

    <fieldType name="text_all" class="solr.TextField" positionIncrementGap="100">
      <analyzer>
        <tokenizer class="solr.ICUTokenizerFactory"/>
        <!-- normalize width before bigram, as e.g. half-width dakuten combine -->
        <filter class="solr.CJKWidthFilterFactory"/>
        <!-- for any non-CJK -->
        <filter class="solr.ICUFoldingFilterFactory"/>
        <filter class="solr.CJKBigramFilterFactory" outputUnigrams="true"/>
      </analyzer>
    </fieldType>
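
For anyone who wants to see what that chain emits outside of Solr, the Java 
equivalent is roughly the sketch below (again my approximation against the 
Lucene 4.x API; the flag constants mirror the factory's default scripts, and 
the mixed-script input is made up purely for illustration):

    import java.io.StringReader;

    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.cjk.CJKBigramFilter;
    import org.apache.lucene.analysis.cjk.CJKWidthFilter;
    import org.apache.lucene.analysis.icu.ICUFoldingFilter;
    import org.apache.lucene.analysis.icu.segmentation.ICUTokenizer;
    import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

    public class TextAllChainDemo {
      public static void main(String[] args) throws Exception {
        // Mixed Latin, Han, and half-width katakana input, made up for
        // illustration, so the width and folding filters have work to do.
        String text = "Schwarzenegger 施瓦辛格 ﾃｽﾄ";

        // Same order as the fieldType: tokenize, normalize width, fold
        // case/diacritics, then bigram while keeping unigrams.
        TokenStream ts = new ICUTokenizer(new StringReader(text));
        ts = new CJKWidthFilter(ts);
        ts = new ICUFoldingFilter(ts);
        ts = new CJKBigramFilter(ts,
            CJKBigramFilter.HAN | CJKBigramFilter.HIRAGANA
                | CJKBigramFilter.KATAKANA | CJKBigramFilter.HANGUL,
            true);  // outputUnigrams="true"

        CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
        ts.reset();
        while (ts.incrementToken()) {
          System.out.println(term.toString());
        }
        ts.end();
        ts.close();
      }
    }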

Any and all feedback welcome.  Again, the goal is to create a field that is as 
robust as possible across all languages, as a fallback to the 
language-specific fields.

Thank you.

        Best,

                  Tim

-----Original Message-----
From: T. Kuro Kurosaka [mailto:k...@healthline.com] 
Sent: Friday, June 20, 2014 5:38 PM
To: solr-user@lucene.apache.org
Subject: Re: ICUTokenizer or StandardTokenizer or ??? for "text_all" type field 
that might include non-whitespace langs

On 06/20/2014 04:04 AM, Allison, Timothy B. wrote:
> Let's say a predominantly English document contains a Chinese sentence.  If 
> the English field uses the WhitespaceTokenizer with a basic 
> WordDelimiterFilter, the Chinese sentence could be tokenized as one big token 
> (if it doesn't have any punctuation, of course) and will be effectively 
> unsearchable...barring use of wildcards.

In my experiment with Solr 4.6.1, both StandardTokenizer and ICUTokenizer
generate a token per Han character. So they are searchable, though precision
suffers. But in your scenario, Chinese text is rare, so some precision loss
may not be a real issue.

Kuro
