Right. CJK text doesn't tend to have much whitespace to begin with. In the
past, we used a patched version of StandardTokenizer that handled
@twitteruser and #hashtag better, but this became a release engineering
nightmare, so we switched to WhitespaceTokenizer.

Perhaps I could rephrase the question as follows:

Is there a literal configuration example of what this wiki article suggests:

http://wiki.apache.org/solr/SchemaXml#Indexing_same_data_in_multiple_fields

Further, could I then use copyFields to get those back into a single field?
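
To make the question concrete, here is roughly what I imagine such a setup
would look like. This is an untested sketch, not a known-good configuration:
the field and type names (body, body_ws, body_cjk, text_ws, text_cjk) are
made up for illustration, and I am assuming the stock
solr.WhitespaceTokenizerFactory and solr.CJKTokenizerFactory are the right
factories to use.

```xml
<!-- schema.xml fragment (sketch). All field and type names are hypothetical. -->
<types>
  <!-- Plain string type for the stored, unanalyzed source field -->
  <fieldType name="string" class="solr.StrField"/>

  <!-- Whitespace analysis: keeps @twitteruser and #hashtag intact -->
  <fieldType name="text_ws" class="solr.TextField" positionIncrementGap="100">
    <analyzer>
      <tokenizer class="solr.WhitespaceTokenizerFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
    </analyzer>
  </fieldType>

  <!-- CJK analysis: bigram tokenization of CJK text -->
  <fieldType name="text_cjk" class="solr.TextField" positionIncrementGap="100">
    <analyzer>
      <tokenizer class="solr.CJKTokenizerFactory"/>
    </analyzer>
  </fieldType>
</types>

<fields>
  <!-- Source field: stored for display, not searched directly -->
  <field name="body" type="string" indexed="false" stored="true"/>
  <!-- The same data indexed two ways -->
  <field name="body_ws" type="text_ws" indexed="true" stored="false"/>
  <field name="body_cjk" type="text_cjk" indexed="true" stored="false"/>
</fields>

<!-- Copy the raw input into each differently analyzed field -->
<copyField source="body" dest="body_ws"/>
<copyField source="body" dest="body_cjk"/>
```

With something like this, a query would presumably have to search body_ws
and body_cjk explicitly, which is why I am asking about the single field.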

On Mon, Nov 29, 2010 at 5:39 PM, Robert Muir <rcm...@gmail.com> wrote:

> On Mon, Nov 29, 2010 at 5:35 PM, Jacob Elder <jel...@locamoda.com> wrote:
> > StandardTokenizer doesn't handle some of the tokens we need, like
> > @twitteruser, and as far as I can tell, doesn't handle Chinese, Japanese
> > or Korean. Am I wrong about that?
>
> it uses the unigram method for CJK ideographs... the CJKTokenizer just
> uses the bigram method, it's just an alternative method.
>
> the whitespace doesn't work at all though, so give up on that!
>



-- 
Jacob Elder
@jelder
(646) 535-3379
