On Wed, Dec 1, 2010 at 12:25 PM, Jacob Elder wrote:
>
> What does this mean to those of us on Solr 1.4 and Lucene 2.9.3? Does the
> current stable StandardTokenizer handle CJK?
>
Yes.
On Wed, Dec 1, 2010 at 11:01 AM, Robert Muir wrote:
> (Jonathan, I apologize for emailing you twice, I meant to hit reply-all)
>
> On Wed, Dec 1, 2010 at 10:49 AM, Jonathan Rochkind wrote:
> >
> > Wait, StandardTokenizer already handles CJK and will put each CJK char
> > into its own token? Really? I had no idea! Is that documented anywhere,
> > or you just …
Right. CJK doesn't tend to have a lot of whitespace to begin with. In the
past, we were using a patched version of StandardTokenizer which treated
@twitteruser and #hashtag better, but this became a release engineering
nightmare so we switched to Whitespace.
Perhaps I could rephrase the question a …
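For reference, the difference in question looks roughly like this (illustrative token output; the classic StandardTokenizer treats "@" and "#" as punctuation and drops them):

    Input:               Follow @twitteruser #hashtag
    WhitespaceTokenizer: [Follow] [@twitteruser] [#hashtag]
    StandardTokenizer:   [Follow] [twitteruser] [hashtag]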
+1
That's exactly what we need, too.
On Mon, Nov 29, 2010 at 5:28 PM, Shawn Heisey wrote:
> On 11/29/2010 3:15 PM, Jacob Elder wrote:
>
>> I am looking for a clear example of using more than one tokenizer for a
>> single source field. My application has a single "body" field which until
>> recently …
>From: ext Jacob Elder [mailto:jel...@locamoda.com]
>Sent: Monday, 29 November 2010 23:15
>To: solr-user@lucene.apache.org
>Subject: Good example of multiple tokenizers for a single field
>
>I am looking for a clear example of using more than one tokenizer for a
>single source field. My application …
On Mon, Nov 29, 2010 at 5:41 PM, Jonathan Rochkind wrote:
>
> * As a tokenizer, I use the WhitespaceTokenizer.
>
> * Then I apply a custom filter that looks for CJK chars, and re-tokenizes
> any CJK chars into one-token-per-char. This custom filter was written by
> someone other than me; it is open source …
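For anyone who wants to try the same two-step approach without that code, here is a rough, hypothetical sketch of such a filter against the current Lucene attribute API (the class name and the narrow CJK test are mine, not the original filter's; offsets are left at the original token's values for brevity, and a real version would also need to handle tokens that mix Latin and CJK characters):

    import java.io.IOException;

    import org.apache.lucene.analysis.TokenFilter;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

    /** Splits any all-CJK token from the upstream tokenizer into one token per character. */
    public final class CJKCharSplitFilter extends TokenFilter {
      private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);
      private char[] pending;      // characters of a CJK token still waiting to be emitted
      private int pendingIndex;

      public CJKCharSplitFilter(TokenStream input) {
        super(input);
      }

      @Override
      public boolean incrementToken() throws IOException {
        if (pending != null) {
          emitNextPendingChar();   // still draining the previous CJK token
          return true;
        }
        if (!input.incrementToken()) {
          return false;
        }
        if (termAtt.length() > 1 && isAllCJK(termAtt.buffer(), termAtt.length())) {
          // Buffer the CJK token, then emit it one character at a time.
          pending = new char[termAtt.length()];
          System.arraycopy(termAtt.buffer(), 0, pending, 0, termAtt.length());
          pendingIndex = 0;
          emitNextPendingChar();
        }
        return true;               // non-CJK tokens pass through unchanged
      }

      private void emitNextPendingChar() {
        // Only the term text is rewritten; position increments stay at the
        // upstream value, so each emitted char occupies its own position.
        termAtt.setEmpty().append(pending[pendingIndex++]);
        if (pendingIndex == pending.length) {
          pending = null;
        }
      }

      private static boolean isAllCJK(char[] buf, int len) {
        for (int i = 0; i < len; i++) {
          Character.UnicodeBlock block = Character.UnicodeBlock.of(buf[i]);
          if (block != Character.UnicodeBlock.CJK_UNIFIED_IDEOGRAPHS
              && block != Character.UnicodeBlock.HIRAGANA
              && block != Character.UnicodeBlock.KATAKANA) {
            return false;
          }
        }
        return true;
      }

      @Override
      public void reset() throws IOException {
        super.reset();
        pending = null;
      }
    }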
You can only use one tokenizer on a given field, I think. But a tokenizer
isn't in fact the only thing that can tokenize; an ordinary filter can
change tokenization too, so you could use two filters in a row.
You could also write your own custom tokenizer that does what you want,
although I'm no …
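As a minimal sketch of the "one tokenizer, then filters in a row" idea at the Lucene level, assuming a recent Lucene and the hypothetical CJKCharSplitFilter sketched above (in Solr itself the same chain would go in a fieldType's analyzer section):

    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.LowerCaseFilter;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.Tokenizer;
    import org.apache.lucene.analysis.core.WhitespaceTokenizer;

    public class MixedFieldAnalyzer {
      /** One tokenizer, then filters in a row; the second filter re-tokenizes CJK. */
      public static Analyzer build() {
        return new Analyzer() {
          @Override
          protected TokenStreamComponents createComponents(String fieldName) {
            Tokenizer source = new WhitespaceTokenizer();    // the single tokenizer
            TokenStream chain = new LowerCaseFilter(source); // ordinary filter
            chain = new CJKCharSplitFilter(chain);           // filter that changes tokenization
            return new TokenStreamComponents(source, chain);
          }
        };
      }
    }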
On Mon, Nov 29, 2010 at 5:35 PM, Jacob Elder wrote:
> StandardTokenizer doesn't handle some of the tokens we need, like
> @twitteruser, and as far as I can tell, doesn't handle Chinese, Japanese or
> Korean. Am I wrong about that?
it uses the unigram method for CJK ideographs... the CJKTokenizer uses the
bigram method.
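Concretely, for the input 日本語 (illustrative token output):

    StandardTokenizer (unigram): 日 | 本 | 語
    CJKTokenizer (bigram):       日本 | 本語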
StandardTokenizer doesn't handle some of the tokens we need, like
@twitteruser, and as far as I can tell, doesn't handle Chinese, Japanese or
Korean. Am I wrong about that?
On Mon, Nov 29, 2010 at 5:31 PM, Robert Muir wrote:
> On Mon, Nov 29, 2010 at 5:30 PM, Jacob Elder wrote:
> > The problem …
The problem is that the field is not guaranteed to contain just a single
language. I'm looking for some way to pass it first through CJK, then
Whitespace.
If I'm totally off-target here, is there a recommended way of dealing with
mixed-language fields?
On Mon, Nov 29, 2010 at 5:22 PM, Markus Jelsma wrote: …
You can use only one tokenizer per analyzer. You'd better use separate fields +
fieldTypes for different languages.
> I am looking for a clear example of using more than one tokenizer for a
> single source field. …
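At the Lucene level, the separate-fields approach Markus describes corresponds to something like PerFieldAnalyzerWrapper; here is a minimal sketch against a recent Lucene (the field names are made up, and in Solr itself this would be separate field/fieldType declarations in schema.xml rather than Java):

    import java.util.HashMap;
    import java.util.Map;

    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.cjk.CJKAnalyzer;
    import org.apache.lucene.analysis.en.EnglishAnalyzer;
    import org.apache.lucene.analysis.miscellaneous.PerFieldAnalyzerWrapper;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;

    public class PerLanguageFields {
      public static Analyzer build() {
        Map<String, Analyzer> perField = new HashMap<>();
        perField.put("body_en", new EnglishAnalyzer());   // English: stemming, stopwords
        perField.put("body_cjk", new CJKAnalyzer());      // CJK: bigram tokenization
        // Any other field falls back to the default analyzer.
        return new PerFieldAnalyzerWrapper(new StandardAnalyzer(), perField);
      }
    }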
I am looking for a clear example of using more than one tokenizer for a
single source field. My application has a single "body" field which until
recently was all Latin characters, but we're now encountering both English
and Japanese words in a single message. Obviously, we need to be using CJK
in …