On 3/16/2016 4:33 AM, Zheng Lin Edwin Yeo wrote:
> I found that HMMChineseTokenizer will split a string that consist of
> numbers and characters (alphanumeric). For example, if I have a code that
> looks like "1a2b3c4d", it will be split to 1 | a | 2 | b | 3 | c | 4 | d
> This has caused the search query speed to slow quite tremendously (like at
> least 10 seconds slower), as it has to search through individual tokens.
>
> Would like to check, is there any way that we can solve this issue without
> re-indexing? We have quite alot of code in the index which consist of
> alphanumeric characters, and we have more than 10 million documents in the
> index, so re-indexing with another tokenizer or pipeline is quite a huge
> process.
ANY change you make to index analysis will require reindexing.

I have no idea what the advantages and disadvantages are among the various tokenizers and filters for Asian characters, but there may be a combination of tokenizer and filters that will do what you want. We do have an index for a company in Japan. I'm using ICUTokenizer with some of the CJK filters, and in some cases ICUFoldingFilterFactory for lowercasing and normalization. The jars required for the ICU analysis components can be found in the contrib folder of the Solr download.

There are ways to build a whole new index and then move it into place to replace your existing index. For SolrCloud mode, you would use the collection alias feature. For standalone Solr, you can swap cores.

Thanks,
Shawn
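
P.S. In case it helps, here is a rough sketch of the kind of ICU-based field type I was describing. The fieldType name and the exact filter list here are only my illustration, not a recommendation -- I believe ICUTokenizerFactory keeps a mixed run like "1a2b3c4d" as a single token, but verify that yourself on the Analysis screen in the admin UI before building anything on it:

  <fieldType name="text_icu" class="solr.TextField" positionIncrementGap="100">
    <analyzer>
      <!-- tokenizer that handles CJK scripts; should not split alphanumeric runs -->
      <tokenizer class="solr.ICUTokenizerFactory"/>
      <!-- normalize halfwidth/fullwidth forms -->
      <filter class="solr.CJKWidthFilterFactory"/>
      <!-- lowercasing plus Unicode normalization/folding -->
      <filter class="solr.ICUFoldingFilterFactory"/>
      <!-- index CJK as bigrams -->
      <filter class="solr.CJKBigramFilterFactory"/>
    </analyzer>
  </fieldType>

The ICU jars from contrib must be on the classpath (a <lib> directive in solrconfig.xml, or copied into a lib directory) before that field type will load.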
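
The "move it into place" step looks roughly like this -- the collection and core names (mainindex, newindex) are hypothetical, substitute your own:

  SolrCloud -- point an alias (the name your application queries) at the rebuilt collection:
    http://localhost:8983/solr/admin/collections?action=CREATEALIAS&name=mainindex&collections=newindex

  Standalone -- swap the rebuilt core into place:
    http://localhost:8983/solr/admin/cores?action=SWAP&core=mainindex&other=newindex

Either way, the application keeps querying the same name and never needs to know that the underlying index was replaced.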