On 3/16/2016 4:33 AM, Zheng Lin Edwin Yeo wrote:
> I found that HMMChineseTokenizer will split a string that consist of
> numbers and characters (alphanumeric). For example, if I have a code that
> looks like "1a2b3c4d", it will be split to 1 | a | 2 | b | 3 | c | 4 | d
> This has caused the search query speed to slow quite tremendously (like at
> least 10 seconds slower), as it has to search through individual tokens.
>
> Would like to check, is there any way that we can solve this issue without
> re-indexing? We have quite alot of code in the index which consist of
> alphanumeric characters, and we have more than 10 million documents in the
> index, so re-indexing with another tokenizer or pipeline is quite a huge
> process.
ANY change you make to index analysis will require reindexing.

I have no idea what the advantages and disadvantages are among the various tokenizers and filters for Asian characters, but there may be a combination of tokenizer and filters that will do what you want. We do have an index for a company in Japan. I'm using ICUTokenizer with some of the CJK filters, and in some cases ICUFoldingFilterFactory for lowercasing and normalization. The jars required for the ICU analysis components can be found in the contrib folder of the Solr download.

There are ways to build a whole new index and then move it into place to replace your existing index. For SolrCloud mode, you would use the collection alias feature. For standalone Solr, you can swap cores.

Thanks,
Shawn
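
P.S. In case it helps, here is a rough sketch of the kind of ICU-based field type I was describing. The fieldType name and the exact filter list here are only my illustration, not a recommendation -- I believe ICUTokenizerFactory keeps a mixed run like "1a2b3c4d" as a single token, but verify that yourself on the Analysis screen in the admin UI before building anything on it:

  <fieldType name="text_icu" class="solr.TextField" positionIncrementGap="100">
    <analyzer>
      <!-- tokenizer that handles CJK scripts; should not split alphanumeric runs -->
      <tokenizer class="solr.ICUTokenizerFactory"/>
      <!-- normalize halfwidth/fullwidth forms -->
      <filter class="solr.CJKWidthFilterFactory"/>
      <!-- lowercasing plus Unicode normalization/folding -->
      <filter class="solr.ICUFoldingFilterFactory"/>
      <!-- index CJK as bigrams -->
      <filter class="solr.CJKBigramFilterFactory"/>
    </analyzer>
  </fieldType>

The ICU jars from contrib must be on the classpath (a <lib> directive in solrconfig.xml, or copied into a lib directory) before that field type will load.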
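
The "move it into place" step looks roughly like this -- the collection and core names (mainindex, newindex) are hypothetical, substitute your own:

  SolrCloud -- point an alias (the name your application queries) at the rebuilt collection:
    http://localhost:8983/solr/admin/collections?action=CREATEALIAS&name=mainindex&collections=newindex

  Standalone -- swap the rebuilt core into place:
    http://localhost:8983/solr/admin/cores?action=SWAP&core=mainindex&other=newindex

Either way, the application keeps querying the same name and never needs to know that the underlying index was replaced.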