Re: HMMChineseTokenizer splits up alphanumeric characters

Zheng Lin Edwin Yeo Sat, 19 Mar 2016 07:32:43 -0700

Thanks Shawn for your reply.

Yes, I'm looking to see if we can implement a combination of tokenizers and
filters.


However, when I tried this before, I found that we can only define one
tokenizer for each fieldType. So is it true that I have to stick to one
tokenizer, and the rest of the implementation has to be done either with
filters or by customising the tokenizer, in order to achieve what I want?

Regards,
Edwin


On 17 March 2016 at 09:34, Shawn Heisey <apa...@elyograg.org> wrote:

> On 3/16/2016 4:33 AM, Zheng Lin Edwin Yeo wrote:
> > I found that HMMChineseTokenizer will split a string that consists of
> > numbers and letters (alphanumeric). For example, if I have a code that
> > looks like "1a2b3c4d", it will be split into 1 | a | 2 | b | 3 | c | 4 | d.
> > This has slowed search queries quite tremendously (by at least 10
> > seconds), as each query has to search through the individual tokens.
> >
> > Is there any way that we can solve this issue without re-indexing? We
> > have quite a lot of codes in the index that consist of alphanumeric
> > characters, and we have more than 10 million documents in the index, so
> > re-indexing with another tokenizer or pipeline is quite a huge process.
>
> ANY change you make to index analysis will require reindexing.
>
> I have no idea what the advantages and disadvantages are of the various
> tokenizers and filters for Asian characters.  There may be a combination
> of tokenizer and filters that will do what you want.
>
> We do have an index for a company in Japan.  I'm using ICUTokenizer with
> some of the CJK filters, and in some cases I'm using
> ICUFoldingFilterFactory for lowercasing and normalization.  The jars
> required for ICU analysis components can be found in the contrib folder
> in the Solr download.
>
> There are ways to create a whole new index and then move it into place
> to replace your existing index.  For SolrCloud mode, you would use the
> collection alias feature.  For standalone Solr, you can swap cores.
>
> Thanks,
> Shawn
>
>
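
For reference, the two "build a new index, then move it into place"
options Shawn mentions can be sketched as plain HTTP requests. This is
only a rough illustration: it builds the request URLs for the CoreAdmin
SWAP action (standalone Solr) and the Collections API CREATEALIAS action
(SolrCloud). The base URL and the core, collection, and alias names are
made-up placeholders, not anything taken from this thread.

```python
from urllib.parse import urlencode

# Hypothetical base URL; adjust host and port for your installation.
SOLR_BASE = "http://localhost:8983/solr"


def swap_cores_url(live_core, rebuilt_core, base_url=SOLR_BASE):
    """Build a CoreAdmin SWAP request URL (standalone Solr).

    After the swap, requests sent to `live_core` are served by the
    freshly rebuilt index, and the old index sits under `rebuilt_core`.
    """
    params = {"action": "SWAP", "core": live_core, "other": rebuilt_core}
    return f"{base_url}/admin/cores?{urlencode(params)}"


def create_alias_url(alias, collection, base_url=SOLR_BASE):
    """Build a Collections API CREATEALIAS request URL (SolrCloud).

    Clients query the alias name; pointing the alias at a newly built
    collection switches them over without changing client config.
    """
    params = {"action": "CREATEALIAS", "name": alias, "collections": collection}
    return f"{base_url}/admin/collections?{urlencode(params)}"


if __name__ == "__main__":
    # Placeholder names: "docs" is the live core or alias, the others
    # are the newly indexed copies built with the changed analysis chain.
    print(swap_cores_url("docs", "docs_rebuild"))
    print(create_alias_url("docs", "docs_v2"))
```

Either request would only be sent once the new core or collection has
been fully indexed with the changed analysis chain; the existing index
keeps serving queries until then.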
