Re: Unicode processing - Issue with CharStreamAwareWhitespaceTokenizerFactory

Koji Sekiguchi Mon, 05 Jul 2010 15:08:27 -0700

No, all tokenizers can be used with MappingCharFilter.
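
On the ordering question, one way to picture it: a char filter wraps the character stream that the tokenizer reads from, so it has to run before the tokenizer and every token filter, wherever it is declared in the field type. Below is a minimal sketch of that idea in plain Java. It does not use the Lucene or Solr API; MappingReader and tokenize are hypothetical stand-ins, and the mapping is simplified to single characters.

```java
import java.io.FilterReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

public class CharFilterOrderSketch {

    /** Hypothetical stand-in for a mapping char filter: rewrites characters as they are read. */
    static class MappingReader extends FilterReader {
        private final Map<Character, Character> mapping;

        MappingReader(Reader in, Map<Character, Character> mapping) {
            super(in);
            this.mapping = mapping;
        }

        @Override
        public int read() throws IOException {
            int c = super.read();
            if (c < 0) {
                return c; // end of stream
            }
            Character mapped = mapping.get((char) c);
            return mapped != null ? mapped : c;
        }

        @Override
        public int read(char[] buf, int off, int len) throws IOException {
            int n = super.read(buf, off, len);
            // Rewrite only the characters that were actually read (n may be -1).
            for (int i = off; i < off + n; i++) {
                Character mapped = mapping.get(buf[i]);
                if (mapped != null) {
                    buf[i] = mapped;
                }
            }
            return n;
        }
    }

    /** Hypothetical whitespace tokenizer: it only ever sees what the Reader hands it. */
    static List<String> tokenize(Reader input) throws IOException {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        int c;
        while ((c = input.read()) >= 0) {
            if (Character.isWhitespace(c)) {
                if (current.length() > 0) {
                    tokens.add(current.toString());
                    current.setLength(0);
                }
            } else {
                current.append((char) c);
            }
        }
        if (current.length() > 0) {
            tokens.add(current.toString());
        }
        return tokens;
    }

    public static void main(String[] args) throws IOException {
        // Map two accented characters to their unaccented forms.
        Map<Character, Character> mapping = Map.of('\u00e9', 'e', '\u00ef', 'i');
        Reader filtered = new MappingReader(new StringReader("Sa\u00efd caf\u00e9"), mapping);
        System.out.println(tokenize(filtered)); // prints [Said, cafe]
    }
}
```

Because the tokenizer takes a plain Reader, any tokenizer could be swapped in without changing the wrapper, which is the same reason no special *CharStreamAware* tokenizer is needed.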

Koji Sekiguchi from mobile



On 2010/07/06, at 0:32, Saïd Radhouani <r.steve....@gmail.com> wrote:

> Thanks, Koji, for the reply and for updating the wiki. As it's written now in
> the wiki, it sounds (at least to me) like MappingCharFilterFactory works only
> with WhitespaceTokenizerFactory.
> 
> Did you really mean that? Because this filter also works with other
> tokenizers. For instance, in my text type, I'm using StandardTokenizerFactory
> for document processing and WhitespaceTokenizerFactory for query processing.
> 
> I also noticed that, in whatever order you put this filter in the definition 
> of a field type, it's always applied (during text processing) before the 
> tokenizer and all the other filters. Is there a reason for that? Is there a 
> possibility to force the filter to be applied at a certain order among the 
> other filters?
> 
> Thanks,
> -S
> 
> On Jul 5, 2010, at 4:28 PM, Koji Sekiguchi wrote:
> 
>> 
>>> In the same wiki, they say that CharStreamAwareWhitespaceTokenizerFactory
>>> must be used with MappingCharFilterFactory. But when I use this tokenizer
>>> and filter together, I get a severe error saying that the field type
>>> containing this filter and tokenizer is unknown. However, it works when I
>>> use this filter with StandardTokenizerFactory or WhitespaceTokenizerFactory!
>>> 
>>> 
>> The wiki is not correct today. Before Lucene 2.9 (and Solr 1.4),
>> Tokenizers could take only a Reader argument in the constructor. Since then,
>> because they can take a CharStream argument in the constructor,
>> *CharStreamAware* Tokenizers are no longer needed (all Tokenizers
>> are aware of CharStream). I'll update the wiki.
>> 
>> Koji
>> 
>> -- 
>> http://www.rondhuit.com/en/
>> 
>
