You can only use one tokenizer on a given field, I think. But a tokenizer
isn't in fact the only thing that can tokenize: an ordinary filter can
change tokenization too, so you could use two filters in a row (see the
sketch below).
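To be concrete, a standard filter like WordDelimiterFilterFactory splits
tokens after the tokenizer has already run. A rough sketch of a fieldType
where a filter does some of the tokenization (the fieldType name is made
up):

  <fieldType name="text_split" class="solr.TextField">
    <analyzer>
      <!-- Coarse first pass: split on whitespace only -->
      <tokenizer class="solr.WhitespaceTokenizerFactory"/>
      <!-- This filter re-tokenizes further, e.g. splitting
           "Wi-Fi" into "Wi" and "Fi" -->
      <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1"/>
      <filter class="solr.LowerCaseFilterFactory"/>
    </analyzer>
  </fieldType>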
You could also write your own custom tokenizer that does what you want,
although I'm not entirely sure that turning exactly what you describe into
code would actually do what you want. I think it's more complicated than
that: you'd need a tokenizer that looks for contiguous blocks of UTF-8 CJK
bytes and does one thing to them, and contiguous blocks of non-CJK bytes
and does another thing to them, rather than just "first do one thing to
the whole string and then do another."
Dealing with mixed-language fields is tricky, and I know of no good
general-purpose solutions, in part just because of the semantics involved.
If you have some strings for the field that you know are CJK and others
you know are English, the easiest thing to do is NOT to put them in the
same field, but to put them in different fields and use dismax (for
example) to search both fields at query time; a sketch of that setup
follows. But if you can't even tell at index time which is which, or if
you have strings that themselves include both CJK and English interspersed
with each other, that might not work.
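Roughly, the schema side of that two-field approach looks like the
following. The field and type names are made up, and the analysis chains
are just illustrative:

  <field name="body_en"  type="text_en"  indexed="true" stored="true"/>
  <field name="body_cjk" type="text_cjk" indexed="true" stored="true"/>

  <fieldType name="text_en" class="solr.TextField">
    <analyzer>
      <tokenizer class="solr.WhitespaceTokenizerFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
    </analyzer>
  </fieldType>

  <fieldType name="text_cjk" class="solr.TextField">
    <analyzer>
      <tokenizer class="solr.CJKTokenizerFactory"/>
    </analyzer>
  </fieldType>

If documents come in with a single "body" field, copyField can route the
same text into both:

  <copyField source="body" dest="body_en"/>
  <copyField source="body" dest="body_cjk"/>

Then at query time you'd search both with dismax, something like
defType=dismax with qf=body_en body_cjk.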
For my own case, where everything is just interspersed in the fields and I
don't really know what language it is, here's what I do. It's definitely
not great for CJK, but it's better than nothing:
* As a tokenizer, I use the WhitespaceTokenizer.
* Then I apply a custom filter that looks for CJK chars and re-tokenizes
any run of CJK chars into one token per char. This custom filter was
written by someone other than me; it is open source, but I'm not sure
whether it's actually in a public repo, or how well documented it is. I
can put you in touch with the author to ask. There may also be a more
standard filter than the custom one I'm using that does the same thing?
(A sketch of what this chain looks like in schema.xml follows.)
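Something like this is what I mean, as a fieldType definition. The filter
class name here is hypothetical, just a stand-in for whatever the actual
custom filter is called:

  <fieldType name="text_mixed" class="solr.TextField">
    <analyzer>
      <!-- First pass: split on whitespace only -->
      <tokenizer class="solr.WhitespaceTokenizerFactory"/>
      <!-- Hypothetical stand-in for the custom filter: re-tokenizes
           any CJK characters inside a token into one token per char -->
      <filter class="com.example.analysis.CJKCharSplitFilterFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
    </analyzer>
  </fieldType>

On the "more standard" question: I believe StandardTokenizerFactory
already emits Han characters one per token, but it would change the rest
of the tokenization too, so it's not a drop-in replacement for a
whitespace-plus-filter chain like this.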
Jonathan
On 11/29/2010 5:30 PM, Jacob Elder wrote:
The problem is that the field is not guaranteed to contain just a single
language. I'm looking for some way to pass it first through CJK, then
Whitespace.
If I'm totally off-target here, is there a recommended way of dealing with
mixed-language fields?
On Mon, Nov 29, 2010 at 5:22 PM, Markus Jelsma
<markus.jel...@openindex.io> wrote:
You can use only one tokenizer per analyzer. You'd better use separate
fields + fieldTypes for different languages.
I am looking for a clear example of using more than one tokenizer for a
single source field. My application has a single "body" field which until
recently was all Latin characters, but we're now encountering both English
and Japanese words in a single message. Obviously, we need to be using CJK
in addition to WhitespaceTokenizerFactory.
I've found some references to using copyFields or NGrams but I can't quite
grasp what the whole solution would look like.