Thanks Shawn for your reply. Yes, I'm looking to see if we can implement a combination of tokenizers and filters.
However, when I tried this before, I found that only one tokenizer can be defined for each fieldType. So is it true that I have to stick to one tokenizer, and that the rest of the processing has to be done either by filters or by customising the tokenizer to achieve what I want?

Regards,
Edwin

On 17 March 2016 at 09:34, Shawn Heisey <apa...@elyograg.org> wrote:

> On 3/16/2016 4:33 AM, Zheng Lin Edwin Yeo wrote:
> > I found that HMMChineseTokenizer will split a string that consists of
> > numbers and characters (alphanumeric). For example, if I have a code that
> > looks like "1a2b3c4d", it will be split to 1 | a | 2 | b | 3 | c | 4 | d
> > This has caused the search query speed to slow quite tremendously (like at
> > least 10 seconds slower), as it has to search through individual tokens.
> >
> > Would like to check, is there any way that we can solve this issue without
> > re-indexing? We have quite a lot of codes in the index which consist of
> > alphanumeric characters, and we have more than 10 million documents in the
> > index, so re-indexing with another tokenizer or pipeline is quite a huge
> > process.
>
> ANY change you make to index analysis will require reindexing.
>
> I have no idea what the advantages and disadvantages are in the various
> tokenizers and filters for Asian characters. There may be a combination
> of tokenizer and filters that will do what you want.
>
> We do have an index for a company in Japan. I'm using ICUTokenizer with
> some of the CJK filters, and in some cases I'm using
> ICUFoldingFilterFactory for lowercasing and normalization. The jars
> required for ICU analysis components can be found in the contrib folder
> in the Solr download.
>
> There are ways to create a whole new index and then move it into place
> to replace your existing index. For SolrCloud mode, you would use the
> collection alias feature. For standalone Solr, you can swap cores.
>
> Thanks,
> Shawn
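
For reference, the "one tokenizer followed by a chain of filters" layout discussed in this thread might look roughly like the sketch below. The field type name `text_cjk_icu`, the choice of filters, and their order are assumptions for illustration, not a configuration taken from either poster. It also assumes the ICU analysis jars from the contrib folder are on the classpath.

```xml
<!-- Hypothetical field type; the name and filter selection are illustrative only. -->
<fieldType name="text_cjk_icu" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <!-- An analyzer takes exactly one tokenizer. -->
    <tokenizer class="solr.ICUTokenizerFactory"/>
    <!-- Everything after the tokenizer is a filter; filters run in the order listed. -->
    <filter class="solr.CJKWidthFilterFactory"/>   <!-- normalizes full-width/half-width forms -->
    <filter class="solr.ICUFoldingFilterFactory"/> <!-- case folding and Unicode normalization -->
    <filter class="solr.CJKBigramFilterFactory"/>  <!-- forms bigrams from CJK characters -->
  </analyzer>
</fieldType>
```

Whether a chain like this keeps a code such as "1a2b3c4d" as a single token is best confirmed on the Analysis screen of the Solr admin UI before reindexing, since the result depends on the tokenizer and the Solr version in use.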