Re: Unicode processing - Issue with CharStreamAwareWhitespaceTokenizerFactory

Saïd Radhouani Mon, 05 Jul 2010 08:34:19 -0700

Thanks, Koji, for the reply and for updating the wiki. As it is written now, 
the wiki sounds (at least to me) as if MappingCharFilterFactory works only 
with WhitespaceTokenizerFactory.

Did you really mean that? This filter also works with other tokenizers. 
For instance, in my text type, I'm using StandardTokenizerFactory for document 
processing and WhitespaceTokenizerFactory for query processing.
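
A minimal sketch of such a field type is shown here for illustration. The 
field type name, the positionIncrementGap value, and the mapping file name 
are assumptions, not a copy of my actual schema.xml; only the three factory 
classes come from the setup described above.

```xml
<!-- Sketch only: name, positionIncrementGap and mapping file are assumed values -->
<fieldType name="text" class="solr.TextField" positionIncrementGap="100">
  <!-- Index time: character mapping, then StandardTokenizerFactory -->
  <analyzer type="index">
    <charFilter class="solr.MappingCharFilterFactory"
                mapping="mapping-ISOLatin1Accent.txt"/>
    <tokenizer class="solr.StandardTokenizerFactory"/>
  </analyzer>
  <!-- Query time: same character mapping, then WhitespaceTokenizerFactory -->
  <analyzer type="query">
    <charFilter class="solr.MappingCharFilterFactory"
                mapping="mapping-ISOLatin1Accent.txt"/>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
  </analyzer>
</fieldType>
```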

I also noticed that, wherever you put this filter in the definition of a 
field type, it is always applied (during text processing) before the tokenizer 
and all the other filters. Is there a reason for that? Is it possible to 
force the filter to be applied at a particular position among the other filters?

Thanks,
-S

On Jul 5, 2010, at 4:28 PM, Koji Sekiguchi wrote:

> 
>> In the same wiki, they say that CharStreamAwareWhitespaceTokenizerFactory 
>> must be used with MappingCharFilterFactory. But when I use this tokenizer 
>> and filter together, I get a severe error saying that the field type 
>> containing this filter and tokenizer is unknown. However, when I use this 
>> filter with StandardTokenizerFactory or WhitespaceTokenizerFactory, it works!
>> 
>>   
> The wiki is not correct today. Before Lucene 2.9 (and Solr 1.4),
> Tokenizers took a Reader argument in their constructor. Since then,
> because they can take a CharStream argument in the constructor,
> *CharStreamAware* Tokenizers are no longer needed (all Tokenizers
> are aware of CharStream). I'll update the wiki.
> 
> Koji
> 
> -- 
> http://www.rondhuit.com/en/
>
