Re: Good example of multiple tokenizers for a single field

2010-12-01 Thread Jacob Elder
On Tue, Nov 30, 2010 at 10:07 AM, Robert Muir wrote: > On Tue, Nov 30, 2010 at 9:45 AM, Jacob Elder wrote: > > Right. CJK doesn't tend to have a lot of whitespace to begin with. In the > > past, we were using a patched version of StandardTokenizer which treated > > @twitteruser and #hashtag bett

Re: Good example of multiple tokenizers for a single field

2010-12-01 Thread Robert Muir
On Wed, Dec 1, 2010 at 12:25 PM, Jacob Elder wrote:
> What does this mean to those of us on Solr 1.4 and Lucene 2.9.3? Does the current stable StandardTokenizer handle CJK?

Yes.

Re: Good example of multiple tokenizers for a single field

2010-12-01 Thread Jacob Elder
On Wed, Dec 1, 2010 at 11:01 AM, Robert Muir wrote: > (Jonathan, I apologize for emailing you twice, I meant to hit reply-all) > > On Wed, Dec 1, 2010 at 10:49 AM, Jonathan Rochkind > wrote: > > > > Wait, StandardTokenizer already handles CJK and will put each CJK char > into > > its own token?

Re: Good example of multiple tokenizers for a single field

2010-12-01 Thread Robert Muir
(Jonathan, I apologize for emailing you twice, I meant to hit reply-all) On Wed, Dec 1, 2010 at 10:49 AM, Jonathan Rochkind wrote: > > Wait, StandardTokenizer already handles CJK and will put each CJK char into > its own token? Really? I had no idea! Is that documented anywhere, or you > just

Re: Good example of multiple tokenizers for a single field

2010-12-01 Thread Jonathan Rochkind
On 11/29/2010 5:43 PM, Robert Muir wrote: On Mon, Nov 29, 2010 at 5:41 PM, Jonathan Rochkind wrote: * As a tokenizer, I use the WhitespaceTokenizer. * Then I apply a custom filter that looks for CJK chars, and re-tokenizes any CJK chars into one-token-per-char. This custom filter was written b
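The approach quoted above (whitespace tokenization followed by a filter that re-tokenizes CJK characters one per token) can be sketched roughly as follows. This is plain Python for illustration only, not Lucene or Solr code and not the open-source filter the message refers to. The function names and the code point ranges in `is_cjk` are assumptions made for this example.

```python
def is_cjk(ch):
    """Rough CJK check; the ranges below are an assumption and not exhaustive."""
    cp = ord(ch)
    return (
        0x4E00 <= cp <= 0x9FFF      # CJK Unified Ideographs
        or 0x3040 <= cp <= 0x30FF   # Hiragana and Katakana
        or 0xAC00 <= cp <= 0xD7AF   # Hangul syllables
    )


def whitespace_tokenize(text):
    """Stand-in for a whitespace tokenizer: split on runs of whitespace."""
    return text.split()


def cjk_split_filter(tokens):
    """Emit each CJK character as its own token; keep other runs intact."""
    for token in tokens:
        buffer = []
        for ch in token:
            if is_cjk(ch):
                if buffer:
                    # Flush the non-CJK run collected so far.
                    yield "".join(buffer)
                    buffer = []
                yield ch
            else:
                buffer.append(ch)
        if buffer:
            yield "".join(buffer)


def analyze(text):
    """One tokenizer followed by one filter, as in the quoted setup."""
    return list(cjk_split_filter(whitespace_tokenize(text)))
```

Because the first stage splits only on whitespace, tokens such as @twitteruser and #hashtag pass through unchanged, while CJK text embedded in the same field is broken into single characters.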

Re: Good example of multiple tokenizers for a single field

2010-11-30 Thread Robert Muir
On Tue, Nov 30, 2010 at 9:45 AM, Jacob Elder wrote: > Right. CJK doesn't tend to have a lot of whitespace to begin with. In the > past, we were using a patched version of StandardTokenizer which treated > @twitteruser and #hashtag better, but this became a release engineering > nightmare so we swi

Re: Good example of multiple tokenizers for a single field

2010-11-30 Thread Jacob Elder
Right. CJK doesn't tend to have a lot of whitespace to begin with. In the past, we were using a patched version of StandardTokenizer which treated @twitteruser and #hashtag better, but this became a release engineering nightmare so we switched to Whitespace. Perhaps I could rephrase the question a

Re: Good example of multiple tokenizers for a single field

2010-11-30 Thread Jacob Elder
+1 That's exactly what we need, too. On Mon, Nov 29, 2010 at 5:28 PM, Shawn Heisey wrote: > On 11/29/2010 3:15 PM, Jacob Elder wrote: > >> I am looking for a clear example of using more than one tokenizer for a >> single source field. My application has a single "body" field which until >> rece

RE: Good example of multiple tokenizers for a single field

2010-11-30 Thread jan.kurella
We had the same problem with our fields, so we wrote a Tokenizer using the icu4j library. It breaks tokens at script changes and handles them according to the script and the configured BreakIterators. This works out very well, as we also add the "script" information to the token so later filter
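The idea of breaking tokens at script changes and attaching the script to each token can be illustrated with a rough sketch. This is not the icu4j-based Tokenizer described above. It is plain Python that approximates a character's script by the first word of its Unicode character name, which is a crude assumption: punctuation such as "@" gets its own label and would be split off. The names `script_of` and `split_on_script_change` are hypothetical.

```python
import unicodedata
from itertools import groupby


def script_of(ch):
    """Crude script label: first word of the Unicode name (an assumption)."""
    if ch.isspace():
        return None  # whitespace only separates tokens
    name = unicodedata.name(ch, "")
    return name.split(" ", 1)[0] if name else "UNKNOWN"


def split_on_script_change(text):
    """Return (token, script) pairs, starting a new token at each script change."""
    tokens = []
    for script, run in groupby(text, key=script_of):
        if script is None:
            continue  # drop whitespace runs
        tokens.append(("".join(run), script))
    return tokens
```

Carrying the script label alongside each token lets a later stage decide per token how to process it, for example splitting only the CJK runs further.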

Re: Good example of multiple tokenizers for a single field

2010-11-29 Thread Shawn Heisey
On 11/29/2010 3:15 PM, Jacob Elder wrote: I am looking for a clear example of using more than one tokenizer for a single source field. My application has a single "body" field which until recently was all Latin characters, but we're now encountering both English and Japanese words in a single mes

Re: Good example of multiple tokenizers for a single field

2010-11-29 Thread Robert Muir
On Mon, Nov 29, 2010 at 5:41 PM, Jonathan Rochkind wrote: > > * As a tokenizer, I use the WhitespaceTokenizer. > > * Then I apply a custom filter that looks for CJK chars, and re-tokenizes > any CJK chars into one-token-per-char. This custom filter was written by > someone other than me; it is ope

Re: Good example of multiple tokenizers for a single field

2010-11-29 Thread Jonathan Rochkind
You can only use one tokenizer on a given field, I think. But a tokenizer isn't in fact the only thing that can tokenize; an ordinary filter can change tokenization too, so you could use two filters in a row. You could also write your own custom tokenizer that does what you want, although I'm no

Re: Good example of multiple tokenizers for a single field

2010-11-29 Thread Robert Muir
On Mon, Nov 29, 2010 at 5:35 PM, Jacob Elder wrote:
> StandardTokenizer doesn't handle some of the tokens we need, like @twitteruser, and as far as I can tell, doesn't handle Chinese, Japanese or Korean. Am I wrong about that?

It uses the unigram method for CJK ideographs... the CJKTokenizer
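For readers unfamiliar with the term, the unigram method means that each ideograph becomes its own token. A minimal Python sketch follows. The overlapping bigram function is included only for contrast, as a common alternative for CJK text. That contrast is an addition for illustration and is not taken from the truncated message. Both function names are hypothetical.

```python
def unigrams(run):
    """Unigram method: every character of a CJK run is a separate token."""
    return list(run)


def bigrams(run):
    """Overlapping bigrams over a CJK run (shown for contrast only)."""
    if len(run) < 2:
        return [run] if run else []
    return [run[i:i + 2] for i in range(len(run) - 1)]
```

With unigrams a query for a single character matches any text containing it, while bigrams produce fewer such loose matches at the cost of more distinct terms.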

Re: Good example of multiple tokenizers for a single field

2010-11-29 Thread Jacob Elder
StandardTokenizer doesn't handle some of the tokens we need, like @twitteruser, and as far as I can tell, doesn't handle Chinese, Japanese or Korean. Am I wrong about that? On Mon, Nov 29, 2010 at 5:31 PM, Robert Muir wrote: > On Mon, Nov 29, 2010 at 5:30 PM, Jacob Elder wrote: > > The problem

Re: Good example of multiple tokenizers for a single field

2010-11-29 Thread Robert Muir
On Mon, Nov 29, 2010 at 5:30 PM, Jacob Elder wrote: > The problem is that the field is not guaranteed to contain just a single > language. I'm looking for some way to pass it first through CJK, then > Whitespace. > > If I'm totally off-target here, is there a recommended way of dealing with > mixe

Re: Good example of multiple tokenizers for a single field

2010-11-29 Thread Jacob Elder
The problem is that the field is not guaranteed to contain just a single language. I'm looking for some way to pass it first through CJK, then Whitespace. If I'm totally off-target here, is there a recommended way of dealing with mixed-language fields? On Mon, Nov 29, 2010 at 5:22 PM, Markus Jels

Re: Good example of multiple tokenizers for a single field

2010-11-29 Thread Markus Jelsma
You can use only one tokenizer per analyzer. You'd better use separate fields + fieldTypes for different languages. > I am looking for a clear example of using more than one tokenizer for a > single source field. My application has a single "body" field which until > recently was all Latin charac
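The suggestion above (one tokenizer per analyzer, with a separate field and field type per language) can be pictured with a small sketch. This is plain Python, not a Solr schema. The field names `body_en` and `body_ja` and both analyzer functions are hypothetical, chosen only to show that each field is bound to exactly one analysis chain.

```python
def whitespace_analyzer(text):
    """Analyzer for the hypothetical English field: split on whitespace."""
    return text.split()


def per_char_analyzer(text):
    """Analyzer for the hypothetical Japanese field: one token per character."""
    return [ch for ch in text if not ch.isspace()]


# Each field maps to exactly one analyzer, mirroring one tokenizer per field type.
FIELD_ANALYZERS = {
    "body_en": whitespace_analyzer,
    "body_ja": per_char_analyzer,
}


def index_document(fields):
    """Analyze every field of a document with the analyzer bound to that field."""
    return {name: FIELD_ANALYZERS[name](value) for name, value in fields.items()}
```

This arrangement assumes the application can decide which field a piece of text belongs to, which is the difficulty raised elsewhere in the thread for messages that mix languages.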