On Wed, Dec 1, 2010 at 12:25 PM, Jacob Elder wrote:
>
> What does this mean to those of us on Solr 1.4 and Lucene 2.9.3? Does the
> current stable StandardTokenizer handle CJK?
>
yes
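A quick way to verify that locally; a minimal sketch against a recent Lucene (the 2.9-era API exposes TermAttribute rather than CharTermAttribute, but the tokenization is the same), showing StandardTokenizer giving each Han character its own token:

    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

    public class CjkUnigramCheck {
      public static void main(String[] args) throws Exception {
        // Mixed English/CJK input, like the "body" field discussed in this thread.
        try (StandardAnalyzer analyzer = new StandardAnalyzer();
             TokenStream ts = analyzer.tokenStream("body", "search 検索 test")) {
          CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
          ts.reset();
          while (ts.incrementToken()) {
            System.out.println(term);  // search / 検 / 索 / test
          }
          ts.end();
        }
      }
    }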
On Wed, Dec 1, 2010 at 11:01 AM, Robert Muir wrote:
(Jonathan, I apologize for emailing you twice, I meant to hit reply-all)
On Wed, Dec 1, 2010 at 10:49 AM, Jonathan Rochkind wrote:
>
> Wait, StandardTokenizer already handles CJK and will put each CJK char into
> its own token? Really? I had no idea! Is that documented anywhere, or you
> just…
On Tue, Nov 30, 2010 at 9:45 AM, Jacob Elder wrote:
Right. CJK doesn't tend to have a lot of whitespace to begin with. In the
past, we were using a patched version of StandardTokenizer which treated
@twitteruser and #hashtag better, but this became a release engineering
nightmare so we switched to Whitespace.
Perhaps I could rephrase the question…
+1
That's exactly what we need, too.
On Mon, Nov 29, 2010 at 5:28 PM, Shawn Heisey wrote:
> On 11/29/2010 3:15 PM, Jacob Elder wrote:
>
>> I am looking for a clear example of using more than one tokenizer for a
>> single source field. My application has a single "body" field which until
>> recently…
We had the same problem for our fields, so we wrote a Tokenizer using the ICU4J
library. It breaks tokens at script changes and handles each run according to
its script and the configured BreakIterators.
This works out very well, as we also add the "script" information to the token
so later filters can act on it.
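For what it's worth, Lucene's ICU module ships a stock tokenizer that works the same way: ICUTokenizer segments per script with ICU BreakIterators and exposes each token's script via ScriptAttribute. A minimal sketch, assuming the analyzers-icu module is on the classpath (this is the stock tokenizer, not the in-house one described above):

    import java.io.StringReader;
    import org.apache.lucene.analysis.icu.segmentation.ICUTokenizer;
    import org.apache.lucene.analysis.icu.tokenattributes.ScriptAttribute;
    import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

    public class ScriptAwareTokens {
      public static void main(String[] args) throws Exception {
        try (ICUTokenizer tok = new ICUTokenizer()) {
          tok.setReader(new StringReader("mixed 日本語 text"));
          CharTermAttribute term = tok.addAttribute(CharTermAttribute.class);
          ScriptAttribute script = tok.addAttribute(ScriptAttribute.class);
          tok.reset();
          while (tok.incrementToken()) {
            // Each token carries the ICU script it was segmented under.
            System.out.println(term + "\t" + script.getName());
          }
          tok.end();
        }
      }
    }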
On 11/29/2010 3:15 PM, Jacob Elder wrote:
I am looking for a clear example of using more than one tokenizer for a
single source field. My application has a single "body" field which until
recently was all Latin characters, but we're now encountering both English
and Japanese words in a single message…
On Mon, Nov 29, 2010 at 5:41 PM, Jonathan Rochkind wrote:
>
> You can only use one tokenizer on a given field, I think. But a tokenizer
> isn't in fact the only thing that can tokenize; an ordinary filter can
> change tokenization too, so you could use two filters in a row.
> You could also write your own custom tokenizer that does what you want,
> although I'm not…
>
> * As a tokenizer, I use the WhitespaceTokenizer.
>
> * Then I apply a custom filter that looks for CJK chars, and re-tokenizes
> any CJK chars into one-token-per-char. This custom filter was written by
> someone other than me; it is open source…
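To make the filter-based re-tokenization concrete, a hypothetical sketch against a recent Lucene (this is not the open-source filter Jonathan mentions; offsets and position increments are left as-is for brevity):

    import java.io.IOException;
    import java.util.ArrayDeque;
    import org.apache.lucene.analysis.TokenFilter;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

    /** Splits each incoming token so that every Han character becomes its own
     *  token, while consecutive non-Han characters stay together. Meant to sit
     *  after WhitespaceTokenizer in the analysis chain. */
    public final class CjkUnigramFilter extends TokenFilter {
      private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);
      private final ArrayDeque<String> pending = new ArrayDeque<>();

      public CjkUnigramFilter(TokenStream input) {
        super(input);
      }

      private static boolean isHan(int cp) {
        return Character.UnicodeScript.of(cp) == Character.UnicodeScript.HAN;
      }

      @Override
      public boolean incrementToken() throws IOException {
        if (!pending.isEmpty()) {
          termAtt.setEmpty().append(pending.poll());  // emit a queued piece
          return true;
        }
        if (!input.incrementToken()) {
          return false;
        }
        String term = termAtt.toString();
        StringBuilder run = new StringBuilder();  // buffers consecutive non-Han chars
        for (int i = 0; i < term.length(); ) {
          int cp = term.codePointAt(i);
          if (isHan(cp)) {
            if (run.length() > 0) { pending.add(run.toString()); run.setLength(0); }
            pending.add(new String(Character.toChars(cp)));  // one token per ideograph
          } else {
            run.appendCodePoint(cp);
          }
          i += Character.charCount(cp);
        }
        if (run.length() > 0) {
          pending.add(run.toString());
        }
        termAtt.setEmpty().append(pending.poll());  // emit the first piece now
        return true;
      }

      @Override
      public void reset() throws IOException {
        super.reset();
        pending.clear();
      }
    }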
On Mon, Nov 29, 2010 at 5:35 PM, Jacob Elder wrote:
> StandardTokenizer doesn't handle some of the tokens we need, like
> @twitteruser, and as far as I can tell, doesn't handle Chinese, Japanese or
> Korean. Am I wrong about that?
it uses the unigram method for CJK ideographs... the CJKTokenizer uses the
bigram method instead.
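For contrast, a sketch of the bigram method; in current Lucene that approach lives in CJKAnalyzer / CJKBigramFilter:

    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.cjk.CJKAnalyzer;
    import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

    public class CjkBigramCheck {
      public static void main(String[] args) throws Exception {
        try (CJKAnalyzer analyzer = new CJKAnalyzer();
             TokenStream ts = analyzer.tokenStream("body", "中文检索")) {
          CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
          ts.reset();
          while (ts.incrementToken()) {
            System.out.println(term);  // overlapping bigrams: 中文 / 文检 / 检索
          }
          ts.end();
        }
      }
    }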
On Mon, Nov 29, 2010 at 5:30 PM, Jacob Elder wrote:
The problem is that the field is not guaranteed to contain just a single
language. I'm looking for some way to pass it first through CJK, then
Whitespace.
If I'm totally off-target here, is there a recommended way of dealing with
mixed-language fields?
On Mon, Nov 29, 2010 at 5:22 PM, Markus Jelsma wrote:
You can use only one tokenizer per analyzer. You'd better use separate fields +
fieldTypes for different languages.
> I am looking for a clear example of using more than one tokenizer for a
> single source field. My application has a single "body" field which until
> recently was all Latin characters…
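In Solr that means separate fields with their own fieldType analyzer chains in schema.xml; the same idea at the raw Lucene level is PerFieldAnalyzerWrapper. A minimal sketch (the body_en/body_ja field names are made up for illustration):

    import java.util.Map;
    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.cjk.CJKAnalyzer;
    import org.apache.lucene.analysis.core.WhitespaceAnalyzer;
    import org.apache.lucene.analysis.miscellaneous.PerFieldAnalyzerWrapper;

    public class PerFieldSetup {
      public static void main(String[] args) {
        // Route each language-specific field to its own analysis chain;
        // everything else falls back to plain whitespace tokenization.
        Analyzer analyzer = new PerFieldAnalyzerWrapper(
            new WhitespaceAnalyzer(),
            Map.of("body_ja", new CJKAnalyzer(),
                   "body_en", new WhitespaceAnalyzer()));
        // Hand this analyzer to IndexWriterConfig and to the query parser.
      }
    }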