Right. CJK text doesn't tend to have a lot of whitespace to begin with. In the past, we were using a patched version of StandardTokenizer which handled @twitteruser and #hashtag better, but maintaining that patch became a release engineering nightmare, so we switched to the whitespace tokenizer.
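To make the trade-off concrete, here is a rough Python sketch (not the Lucene code, just an illustration) of why splitting on whitespace keeps @twitteruser intact but leaves a run of CJK text as one big token, compared with the unigram and bigram approaches mentioned in the quoted reply below. The function names are made up for this example, and the CJK check only covers the basic CJK Unified Ideographs block (no kana or hangul), which is an assumption for brevity.

```python
def whitespace_tokens(text):
    # Split on runs of whitespace only; punctuation such as @ and # is kept.
    return text.split()


def is_cjk(ch):
    # Simplifying assumption: only the basic CJK Unified Ideographs block.
    return "\u4e00" <= ch <= "\u9fff"


def cjk_unigrams(token):
    # Unigram method: every ideograph becomes its own token.
    return [ch for ch in token if is_cjk(ch)]


def cjk_bigrams(token):
    # Bigram method: every overlapping pair of adjacent ideographs is a token.
    chars = [ch for ch in token if is_cjk(ch)]
    if len(chars) < 2:
        return chars
    return [chars[i] + chars[i + 1] for i in range(len(chars) - 1)]


text = "@twitteruser 我喜欢搜索 #hashtag"
tokens = whitespace_tokens(text)
# The CJK run survives as a single token, so a search for a word
# inside it would not match with whitespace tokenization alone.
cjk_run = tokens[1]
unigrams = cjk_unigrams(cjk_run)
bigrams = cjk_bigrams(cjk_run)
```

With this input, the whitespace split yields three tokens, and the middle one is the whole five-character CJK run; the unigram and bigram functions break that run into five and four tokens respectively.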
Perhaps I could rephrase the question as follows: Is there a literal configuration example of what this wiki article suggests:

http://wiki.apache.org/solr/SchemaXml#Indexing_same_data_in_multiple_fields

Further, could I then use copyFields to get those back into a single field?

On Mon, Nov 29, 2010 at 5:39 PM, Robert Muir <rcm...@gmail.com> wrote:

> On Mon, Nov 29, 2010 at 5:35 PM, Jacob Elder <jel...@locamoda.com> wrote:
> > StandardTokenizer doesn't handle some of the tokens we need, like
> > @twitteruser, and as far as I can tell, doesn't handle Chinese, Japanese or
> > Korean. Am I wrong about that?
>
> it uses the unigram method for CJK ideographs... the CJKtokenizer just
> uses the bigram method, its just an alternative method.
>
> the whitespace doesnt work at all though, so give up on that!

--
Jacob Elder
@jelder
(646) 535-3379