Re: ICUTokenizer or StandardTokenizer or ??? for "text_all" type field that might include non-whitespace langs

Shawn Heisey Thu, 26 Jun 2014 08:31:34 -0700

On 6/26/2014 7:27 AM, Allison, Timothy B. wrote:
> So, I'm left with this as a candidate for the "text_all" field (I'll probably 
> add a stop filter, too):
>
>     <fieldType name="text_all" class="solr.TextField" positionIncrementGap="100">
>       <analyzer>
>         <tokenizer class="solr.ICUTokenizerFactory"/>
>         <!-- normalize width before bigram, as e.g. half-width dakuten combine -->
>         <filter class="solr.CJKWidthFilterFactory"/>
>         <!-- for any non-CJK -->
>         <filter class="solr.ICUFoldingFilterFactory"/>
>         <filter class="solr.CJKBigramFilterFactory" outputUnigrams="true"/>
>       </analyzer>
>     </fieldType>
>
> Any and all feedback welcome.   Again, the goal is to create a field that is 
> as robust as possible against all languages as a fallback to the language 
> specific fields.


I believe that ICUFoldingFilter does everything that CJKWidthFilter
does, so you can probably remove that filter.  Width folding is
mentioned in the javadocs:

http://lucene.apache.org/core/4_8_0/analyzers-icu/org/apache/lucene/analysis/icu/ICUFoldingFilter.html

If I'm wrong about that, someone please let me know.
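
If the folding filter does cover it, the analyzer could shrink to
something like the following.  This is an untested sketch: it is just
your config with CJKWidthFilter dropped, and it assumes that
ICUFoldingFilter normalizes width before the tokens reach
CJKBigramFilter.

    <fieldType name="text_all" class="solr.TextField" positionIncrementGap="100">
      <analyzer>
        <tokenizer class="solr.ICUTokenizerFactory"/>
        <!-- assumed to fold width as well as case/accents; keep it ahead of the bigram filter -->
        <filter class="solr.ICUFoldingFilterFactory"/>
        <filter class="solr.CJKBigramFilterFactory" outputUnigrams="true"/>
      </analyzer>
    </fieldType>

One way to check the assumption is to run some half-width katakana and
full-width Latin text through both versions of the field type on the
Analysis page in the admin UI and compare the final tokens.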

Thanks,
Shawn
