You can only use one tokenizer on a given field, I think. But a tokenizer
isn't in fact the only thing that can tokenize: an ordinary filter can
change tokenization too, so you could use two filters in a row (see the
sketch below).
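To be concrete, a standard filter like WordDelimiterFilterFactory splits
tokens after the tokenizer has already run. A rough sketch of a fieldType
where a filter does some of the tokenization (the fieldType name is made
up):

  <fieldType name="text_split" class="solr.TextField">
    <analyzer>
      <!-- Coarse first pass: split on whitespace only -->
      <tokenizer class="solr.WhitespaceTokenizerFactory"/>
      <!-- This filter re-tokenizes further, e.g. splitting
           "Wi-Fi" into "Wi" and "Fi" -->
      <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1"/>
      <filter class="solr.LowerCaseFilterFactory"/>
    </analyzer>
  </fieldType>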
You could also write your own custom tokenizer that does what you want,
although I'm not entirely sure that turning exactly what you describe into
code would actually do what you want. I think it's more complicated than
that: you'd need a tokenizer that looks for contiguous blocks of UTF-8 CJK
bytes and does one thing to them, and contiguous blocks of non-CJK bytes
and does another thing to them, rather than just "first do one thing to
the whole string and then do another."
Dealing with mixed-language fields is tricky, and I know of no good
general-purpose solutions, in part just because of the semantics involved.
If you have some strings for the field that you know are CJK and others
you know are English, the easiest thing to do is NOT to put them in the
same field, but to put them in different fields and use dismax (for
example) to search both fields at query time; a sketch of that setup
follows. But if you can't even tell at index time which is which, or if
you have strings that themselves include both CJK and English interspersed
with each other, that might not work.
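Roughly, the schema side of that two-field approach looks like the
following. The field and type names are made up, and the analysis chains
are just illustrative:

  <field name="body_en"  type="text_en"  indexed="true" stored="true"/>
  <field name="body_cjk" type="text_cjk" indexed="true" stored="true"/>

  <fieldType name="text_en" class="solr.TextField">
    <analyzer>
      <tokenizer class="solr.WhitespaceTokenizerFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
    </analyzer>
  </fieldType>

  <fieldType name="text_cjk" class="solr.TextField">
    <analyzer>
      <tokenizer class="solr.CJKTokenizerFactory"/>
    </analyzer>
  </fieldType>

If documents come in with a single "body" field, copyField can route the
same text into both:

  <copyField source="body" dest="body_en"/>
  <copyField source="body" dest="body_cjk"/>

Then at query time you'd search both with dismax, something like
defType=dismax with qf=body_en body_cjk.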
For my own case, where everything is just interspersed in the fields and I
don't really know what language it is, here's what I do. It's definitely
not great for CJK, but it's better than nothing:
* As a tokenizer, I use the WhitespaceTokenizer.
* Then I apply a custom filter that looks for CJK chars and re-tokenizes
any run of CJK chars into one token per char. This custom filter was
written by someone other than me; it is open source, but I'm not sure
whether it's actually in a public repo, or how well documented it is. I
can put you in touch with the author to ask. There may also be a more
standard filter than the custom one I'm using that does the same thing?
(A sketch of what this chain looks like in schema.xml follows.)
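Something like this is what I mean, as a fieldType definition. The filter
class name here is hypothetical, just a stand-in for whatever the actual
custom filter is called:

  <fieldType name="text_mixed" class="solr.TextField">
    <analyzer>
      <!-- First pass: split on whitespace only -->
      <tokenizer class="solr.WhitespaceTokenizerFactory"/>
      <!-- Hypothetical stand-in for the custom filter: re-tokenizes
           any CJK characters inside a token into one token per char -->
      <filter class="com.example.analysis.CJKCharSplitFilterFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
    </analyzer>
  </fieldType>

On the "more standard" question: I believe StandardTokenizerFactory
already emits Han characters one per token, but it would change the rest
of the tokenization too, so it's not a drop-in replacement for a
whitespace-plus-filter chain like this.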
Jonathan
On 11/29/2010 5:30 PM, Jacob Elder wrote:
The problem is that the field is not guaranteed to contain just a single
language. I'm looking for some way to pass it first through CJK, then
Whitespace.
If I'm totally off-target here, is there a recommended way of dealing with
mixed-language fields?
On Mon, Nov 29, 2010 at 5:22 PM, Markus Jelsma
<markus.jel...@openindex.io> wrote:
You can use only one tokenizer per analyzer. You'd better use separate
fields + fieldTypes for different languages.
I am looking for a clear example of using more than one tokenizer for a
single source field. My application has a single "body" field which until
recently was all Latin characters, but we're now encountering both English
and Japanese words in a single message. Obviously, we need to be using CJK
in addition to WhitespaceTokenizerFactory.
I've found some references to using copyFields or NGrams but I can't quite
grasp what the whole solution would look like.