I found that WordDelimiterFilterFactory has a parameter called splitOnNumerics, which performs the same kind of split that HMMChineseTokenizer is doing:

- *splitOnNumerics="1"* causes alphabet => number transitions to generate a new part [Solr 1.3]:
  - "j2se" => "j" "2" "se"
  - default is true ("1"); set to "0" to turn off

I suspect that HMMChineseTokenizer has a similar behaviour built in, which splits off the numeric characters, so I guess I will have to check HMMChineseTokenizerFactory to see whether it is possible to disable it, since I believe there is no way to merge the tokens back after they have been split up.
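If I am reading the docs correctly, the parameter would be set on the filter roughly like this (a sketch only; the fieldType name "text_example" and the whitespace tokenizer are just placeholders for illustration):

    <fieldType name="text_example" class="solr.TextField">
      <analyzer>
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <!-- splitOnNumerics="0" stops this filter from splitting at
             letter/number transitions, so "1a2b3c4d" stays one token -->
        <filter class="solr.WordDelimiterFilterFactory" splitOnNumerics="0"/>
      </analyzer>
    </fieldType>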
Regards,
Edwin

On 17 March 2016 at 11:11, Erick Erickson <erickerick...@gmail.com> wrote:

> Yes, there is one and only one tokenizer allowed.
>
> Best,
> Erick
>
> On Wed, Mar 16, 2016 at 7:51 PM, Zheng Lin Edwin Yeo
> <edwinye...@gmail.com> wrote:
> > Thanks Shawn for your reply.
> >
> > Yes, I'm looking to see if we can implement a combination of tokenizers
> > and filters.
> >
> > However, I have tried this before and found that we can only implement
> > one tokenizer for each fieldType. So is it true that I can only stick to
> > one tokenizer, and that the rest of the implementation has to be done
> > either by filters or by customising the tokenizer, in order to possibly
> > achieve what I want?
> >
> > Regards,
> > Edwin
> >
> >
> > On 17 March 2016 at 09:34, Shawn Heisey <apa...@elyograg.org> wrote:
> >
> >> On 3/16/2016 4:33 AM, Zheng Lin Edwin Yeo wrote:
> >> > I found that HMMChineseTokenizer will split a string that consists of
> >> > numbers and letters (alphanumeric). For example, if I have a code
> >> > that looks like "1a2b3c4d", it will be split into 1 | a | 2 | b | 3 |
> >> > c | 4 | d. This has slowed the search query speed quite tremendously
> >> > (at least 10 seconds slower), as it has to search through the
> >> > individual tokens.
> >> >
> >> > I would like to check: is there any way we can solve this issue
> >> > without re-indexing? We have quite a lot of codes in the index which
> >> > consist of alphanumeric characters, and we have more than 10 million
> >> > documents in the index, so re-indexing with another tokenizer or
> >> > pipeline is quite a huge process.
> >>
> >> ANY change you make to index analysis will require reindexing.
> >>
> >> I have no idea what the advantages and disadvantages are in the various
> >> tokenizers and filters for Asian characters. There may be a combination
> >> of tokenizer and filters that will do what you want.
> >>
> >> We do have an index for a company in Japan. I'm using ICUTokenizer with
> >> some of the CJK filters, and in some cases I'm using
> >> ICUFoldingFilterFactory for lowercasing and normalization. The jars
> >> required for ICU analysis components can be found in the contrib folder
> >> in the Solr download.
> >>
> >> There are ways to create a whole new index and then move it into place
> >> to replace your existing index. For SolrCloud mode, you would use the
> >> collection alias feature. For standalone Solr, you can swap cores.
> >>
> >> Thanks,
> >> Shawn
> >>
> >>
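For reference, a fieldType along the lines Shawn describes above might look roughly like this (a sketch only; the fieldType name is a placeholder, and which CJK filters to use depends on the data, so this is just one plausible combination):

    <fieldType name="text_icu" class="solr.TextField">
      <analyzer>
        <tokenizer class="solr.ICUTokenizerFactory"/>
        <!-- normalize full-width/half-width character forms -->
        <filter class="solr.CJKWidthFilterFactory"/>
        <!-- index adjacent CJK characters as overlapping bigrams -->
        <filter class="solr.CJKBigramFilterFactory"/>
        <!-- Unicode normalization plus case folding -->
        <filter class="solr.ICUFoldingFilterFactory"/>
      </analyzer>
    </fieldType>

The ICU factories live in the analysis-extras contrib, so the jars Shawn mentions need to be on the classpath.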
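Likewise, the two index-replacement approaches Shawn mentions correspond to the collection alias and core swap admin APIs; roughly (the host, collection, and core names below are placeholders):

    SolrCloud - point an alias at the newly built collection:
      http://localhost:8983/solr/admin/collections?action=CREATEALIAS&name=mydata&collections=mydata_v2

    Standalone - swap the rebuilt core into place:
      http://localhost:8983/solr/admin/cores?action=SWAP&core=mydata&other=mydata_rebuilt

Clients keep querying the alias (or the original core name), so the cut-over itself does not require any client changes.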