This is really difficult to imagine working well. Even if you
do choose the appropriate analysis chain (and it must
be a chain here), and manage to appropriately tokenize
for each language, what happens at query time?

How do you expect to get matches on, say, Ukrainian when
the tokens of the query are in Erse?

This feels like an XY problem; can you explain at a
higher level what your requirements are?

Best
Erick

On Wed, Apr 4, 2012 at 8:29 AM, Prakashganesh, Prabhu
<prabhu.prakashgan...@dowjones.com> wrote:
> Hi,
>      I have documents in different languages and I want to choose the
> tokenizer to use for a document based on the language of the document. The
> language of the document is already known and is indexed in a field. What I
> want to do is, when I index the text of the document, choose the tokenizer
> based on the value of the language field. I want to use one field for the
> text of the document (defining a separate field per language is not an
> option). It seems I can define a tokenizer for a field, so I guess what I
> need to do is write a custom tokenizer that looks at the language field
> value of the document and calls the appropriate tokenizer for that language
> (e.g., StandardTokenizer for English, CJKTokenizer for CJK languages,
> etc.). From what I have read, it seems quite straightforward to write a
> custom tokenizer, but how would this custom tokenizer know the language of
> the document? Is there some way I can pass this value to the tokenizer? Or
> is there some way the tokenizer will have access to other fields in the
> document? It would be really helpful if someone could provide an
> answer.
>
> Thanks
> Prabhu
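
For what it's worth, the selection described above is usually easier to do
in the indexing client than inside a custom Tokenizer, since a Tokenizer
only ever sees the text of the field it is analyzing, not the document's
other fields. Below is a minimal, illustrative Java sketch of that idea;
the class name, field name, and language map are made up, and it targets a
recent Lucene API where StandardAnalyzer and CJKAnalyzer have no-arg
constructors (3.x-era constructors also take a Version argument):

import java.io.IOException;
import java.util.Map;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.cjk.CJKAnalyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

/**
 * Illustrative only: picks an Analyzer per document from an
 * already-known language code, then tokenizes the body text with it.
 * The selection happens in the indexing client, not inside a custom
 * Tokenizer.
 */
public class LanguageAwareTokenization {

    // Hypothetical mapping; extend with whatever languages you index.
    private static final Map<String, Analyzer> ANALYZERS = Map.of(
            "en", new StandardAnalyzer(),
            "zh", new CJKAnalyzer(),
            "ja", new CJKAnalyzer(),
            "ko", new CJKAnalyzer());

    private static final Analyzer FALLBACK = new StandardAnalyzer();

    /** Tokenize text with the analyzer registered for lang. */
    public static void printTokens(String lang, String text) throws IOException {
        Analyzer analyzer = ANALYZERS.getOrDefault(lang, FALLBACK);
        try (TokenStream ts = analyzer.tokenStream("body", text)) {
            CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
            ts.reset();
            while (ts.incrementToken()) {
                System.out.println(term.toString());
            }
            ts.end();
        }
    }

    public static void main(String[] args) throws IOException {
        printTokens("en", "Choosing an analyzer per document");
        printTokens("ja", "日本語のテキスト");
    }
}

Whatever mapping is used here would also have to be applied at query time,
so that query text is analyzed the same way as the documents it is meant to
match; that is exactly the query-time mismatch raised at the top of this
thread.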
