Re: Good example of multiple tokenizers for a single field

2010-12-01 Thread Jacob Elder
On Tue, Nov 30, 2010 at 10:07 AM, Robert Muir wrote:
> On Tue, Nov 30, 2010 at 9:45 AM, Jacob Elder wrote:
> > Right. CJK doesn't tend to have a lot of whitespace to begin with. In the
> > past, we were using a patched version of StandardTokenizer which treated
> > @twitteruser and #hashtag bett

Re: Good example of multiple tokenizers for a single field

2010-12-01 Thread Robert Muir
On Wed, Dec 1, 2010 at 12:25 PM, Jacob Elder wrote:
> What does this mean to those of us on Solr 1.4 and Lucene 2.9.3? Does the
> current stable StandardTokenizer handle CJK?
yes

Re: Good example of multiple tokenizers for a single field

2010-12-01 Thread Jacob Elder
On Wed, Dec 1, 2010 at 11:01 AM, Robert Muir wrote:
> (Jonathan, I apologize for emailing you twice, I meant to hit reply-all)
>
> On Wed, Dec 1, 2010 at 10:49 AM, Jonathan Rochkind wrote:
> > Wait, StandardTokenizer already handles CJK and will put each CJK char into
> > its own token?

Re: Good example of multiple tokenizers for a single field

2010-12-01 Thread Robert Muir
(Jonathan, I apologize for emailing you twice, I meant to hit reply-all)
On Wed, Dec 1, 2010 at 10:49 AM, Jonathan Rochkind wrote:
> Wait, StandardTokenizer already handles CJK and will put each CJK char into
> its own token? Really? I had no idea! Is that documented anywhere, or you
> just

Re: Good example of multiple tokenizers for a single field

2010-12-01 Thread Jonathan Rochkind
On 11/29/2010 5:43 PM, Robert Muir wrote:
On Mon, Nov 29, 2010 at 5:41 PM, Jonathan Rochkind wrote:
* As a tokenizer, I use the WhitespaceTokenizer.
* Then I apply a custom filter that looks for CJK chars, and re-tokenizes any CJK chars into one-token-per-char. This custom filter was written b

Re: Good example of multiple tokenizers for a single field

2010-11-30 Thread Robert Muir
On Tue, Nov 30, 2010 at 9:45 AM, Jacob Elder wrote:
> Right. CJK doesn't tend to have a lot of whitespace to begin with. In the
> past, we were using a patched version of StandardTokenizer which treated
> @twitteruser and #hashtag better, but this became a release engineering
> nightmare so we swi

Re: Good example of multiple tokenizers for a single field

2010-11-30 Thread Jacob Elder
Right. CJK doesn't tend to have a lot of whitespace to begin with. In the past, we were using a patched version of StandardTokenizer which treated @twitteruser and #hashtag better, but this became a release engineering nightmare so we switched to Whitespace. Perhaps I could rephrase the question a

Re: Good example of multiple tokenizers for a single field

2010-11-30 Thread Jacob Elder
+1 That's exactly what we need, too.
On Mon, Nov 29, 2010 at 5:28 PM, Shawn Heisey wrote:
> On 11/29/2010 3:15 PM, Jacob Elder wrote:
>> I am looking for a clear example of using more than one tokenizer for a
>> single source field. My application has a single "body" field which until
>> rece

RE: Good example of multiple tokenizers for a single field

2010-11-30 Thread jan.kurella
ext Jacob Elder [mailto:jel...@locamoda.com]
>Sent: Monday, November 29, 2010 23:15
>To: solr-user@lucene.apache.org
>Subject: Good example of multiple tokenizers for a single field
>
>I am looking for a clear example of using more than one tokenizer for a
>single source field. My applicati

Re: Good example of multiple tokenizers for a single field

2010-11-29 Thread Shawn Heisey
On 11/29/2010 3:15 PM, Jacob Elder wrote:
I am looking for a clear example of using more than one tokenizer for a single source field. My application has a single "body" field which until recently was all Latin characters, but we're now encountering both English and Japanese words in a single mes

Re: Good example of multiple tokenizers for a single field

2010-11-29 Thread Robert Muir
On Mon, Nov 29, 2010 at 5:41 PM, Jonathan Rochkind wrote:
> * As a tokenizer, I use the WhitespaceTokenizer.
>
> * Then I apply a custom filter that looks for CJK chars, and re-tokenizes
> any CJK chars into one-token-per-char. This custom filter was written by
> someone other than me; it is ope

Re: Good example of multiple tokenizers for a single field

2010-11-29 Thread Jonathan Rochkind
You can only use one tokenizer on a given field, I think. But a tokenizer isn't in fact the only thing that can tokenize; an ordinary filter can change tokenization too, so you could use two filters in a row. You could also write your own custom tokenizer that does what you want, although I'm no
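The idea of one tokenizer followed by a filter that changes tokenization can be sketched outside Solr. The following is only an illustration in Python, not Solr or Lucene code: the function names and the Unicode ranges chosen to mean "CJK" (Hiragana, Katakana, common Han ideographs, Hangul syllables) are assumptions made for this example.

```python
import re

# Assumption: "CJK" here means Hiragana, Katakana, common Han ideographs
# and Hangul syllables. A real analyzer may use a different definition.
CJK_CHAR = re.compile(r"[\u3040-\u30ff\u4e00-\u9fff\uac00-\ud7af]")


def whitespace_tokenize(text):
    """Stage 1: the single tokenizer. Splits on whitespace only."""
    return text.split()


def split_cjk_filter(tokens):
    """Stage 2: a filter that re-tokenizes. Each CJK character becomes its
    own token; runs of non-CJK characters are passed through intact."""
    for token in tokens:
        buffer = []
        for ch in token:
            if CJK_CHAR.match(ch):
                if buffer:
                    yield "".join(buffer)
                    buffer = []
                yield ch
            else:
                buffer.append(ch)
        if buffer:
            yield "".join(buffer)


def analyze(text):
    """Hypothetical analysis chain: one tokenizer, then one filter."""
    return list(split_cjk_filter(whitespace_tokenize(text)))


print(analyze("@jacob likes 東京 #solr"))
```

Because the first stage splits on whitespace only, tokens such as @twitteruser and #hashtag survive unchanged, while CJK text is still broken into single-character tokens by the second stage.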

Re: Good example of multiple tokenizers for a single field

2010-11-29 Thread Robert Muir
On Mon, Nov 29, 2010 at 5:35 PM, Jacob Elder wrote:
> StandardTokenizer doesn't handle some of the tokens we need, like
> @twitteruser, and as far as I can tell, doesn't handle Chinese, Japanese or
> Korean. Am I wrong about that?
It uses the unigram method for CJK ideographs... the CJKTokenizer
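For readers unfamiliar with the term, the "unigram method" means emitting one token per ideograph. A minimal Python sketch, assuming the input is already a run of CJK characters, is shown below. The bigram function is included only for contrast as another common way to index CJK text; it is not a claim about what the truncated sentence above went on to say, and neither function is Lucene code.

```python
def unigrams(run):
    """One token per character, e.g. for a run of CJK ideographs."""
    return list(run)


def bigrams(run):
    """Overlapping two-character tokens, shown only for contrast.
    A run shorter than two characters is returned as a single token."""
    if len(run) < 2:
        return [run] if run else []
    return [run[i:i + 2] for i in range(len(run) - 1)]


print(unigrams("東京都"))
print(bigrams("東京都"))
```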

Re: Good example of multiple tokenizers for a single field

2010-11-29 Thread Jacob Elder
StandardTokenizer doesn't handle some of the tokens we need, like @twitteruser, and as far as I can tell, doesn't handle Chinese, Japanese or Korean. Am I wrong about that?
On Mon, Nov 29, 2010 at 5:31 PM, Robert Muir wrote:
> On Mon, Nov 29, 2010 at 5:30 PM, Jacob Elder wrote:
> > The problem

Re: Good example of multiple tokenizers for a single field

2010-11-29 Thread Robert Muir
On Mon, Nov 29, 2010 at 5:30 PM, Jacob Elder wrote:
> The problem is that the field is not guaranteed to contain just a single
> language. I'm looking for some way to pass it first through CJK, then
> Whitespace.
>
> If I'm totally off-target here, is there a recommended way of dealing with
> mixe

Re: Good example of multiple tokenizers for a single field

2010-11-29 Thread Jacob Elder
The problem is that the field is not guaranteed to contain just a single language. I'm looking for some way to pass it first through CJK, then Whitespace. If I'm totally off-target here, is there a recommended way of dealing with mixed-language fields?
On Mon, Nov 29, 2010 at 5:22 PM, Markus Jels

Re: Good example of multiple tokenizers for a single field

2010-11-29 Thread Markus Jelsma
You can use only one tokenizer per analyzer. You'd better use separate fields + fieldTypes for different languages.
> I am looking for a clear example of using more than one tokenizer for a
> single source field. My application has a single "body" field which until
> recently was all Latin charac
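One way to read the separate-fields suggestion for a mixed-language message is to split the text by script before indexing, so that each part goes to a field with its own analyzer. The Python sketch below is an assumption-laden illustration, not Solr configuration: the field names body_latin and body_cjk are hypothetical, and the Unicode ranges are one simple definition of "CJK".

```python
import re

# Assumption: a "CJK run" is one or more Hiragana, Katakana, common Han
# or Hangul syllable characters in a row.
CJK_RUN = re.compile(r"[\u3040-\u30ff\u4e00-\u9fff\uac00-\ud7af]+")


def split_by_script(text):
    """Build a document with one field per script.
    The field names are hypothetical and chosen for this example."""
    cjk_parts = CJK_RUN.findall(text)
    # Replace CJK runs with spaces, then normalize the whitespace left over.
    latin_text = " ".join(CJK_RUN.sub(" ", text).split())
    return {"body_latin": latin_text, "body_cjk": " ".join(cjk_parts)}


print(split_by_script("I love 東京 and ラーメン"))
```

Each resulting field could then be given a field type suited to its script, at the cost of running the split in the indexing client rather than inside the analysis chain.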

Good example of multiple tokenizers for a single field

2010-11-29 Thread Jacob Elder
I am looking for a clear example of using more than one tokenizer for a single source field. My application has a single "body" field which until recently was all Latin characters, but we're now encountering both English and Japanese words in a single message. Obviously, we need to be using CJK in