Re: HMMChineseTokenizer splits up alphanumeric characters

Erick Erickson Sat, 19 Mar 2016 18:30:16 -0700

Yes, each analyzer chain allows one and only one tokenizer. Anything else
has to be done with char filters before it or token filters after it.
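
As a rough sketch of what that looks like in the schema (the fieldType name
is made up for illustration, and the tokenizer and filter choice simply
follows the components Shawn mentions below; it is not his actual
configuration and is untested against your data):

```xml
<!-- Hypothetical field type; the name "text_cjk_icu" is an assumption. -->
<fieldType name="text_cjk_icu" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <!-- Exactly one tokenizer per analyzer. -->
    <tokenizer class="solr.ICUTokenizerFactory"/>
    <!-- Any number of token filters may follow; they run in the order listed. -->
    <filter class="solr.CJKWidthFilterFactory"/>
    <filter class="solr.ICUFoldingFilterFactory"/>
    <filter class="solr.CJKBigramFilterFactory"/>
  </analyzer>
</fieldType>
```

Whether a chain like this keeps a code such as "1a2b3c4d" as a single token
is something to check on the Analysis screen in the admin UI before
committing to a reindex.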

Best,
Erick


On Wed, Mar 16, 2016 at 7:51 PM, Zheng Lin Edwin Yeo
<edwinye...@gmail.com> wrote:
> Thanks Shawn for your reply.
>
> Yes, I'm looking to see if we can implement a combination of tokenizers
> and filters.
>
> However, when I tried this before, I could only define one tokenizer for
> each fieldType. So is it true that I have to stick to one tokenizer, and
> that the rest of the implementation has to be done either with filters or
> by customising the tokenizer in order to achieve what I want?
>
> Regards,
> Edwin
>
>
> On 17 March 2016 at 09:34, Shawn Heisey <apa...@elyograg.org> wrote:
>
>> On 3/16/2016 4:33 AM, Zheng Lin Edwin Yeo wrote:
>> > I found that HMMChineseTokenizer will split a string that consists of
>> > numbers and letters (alphanumeric). For example, if I have a code that
>> > looks like "1a2b3c4d", it will be split into 1 | a | 2 | b | 3 | c | 4 | d
>> > This has slowed search queries considerably (by at least 10 seconds),
>> > as each query has to search through the individual tokens.
>> >
>> > I would like to check: is there any way that we can solve this issue
>> > without re-indexing? We have quite a lot of codes in the index which
>> > consist of alphanumeric characters, and we have more than 10 million
>> > documents in the index, so re-indexing with another tokenizer or
>> > pipeline is quite a huge process.
>>
>> ANY change you make to index analysis will require reindexing.
>>
>> I have no idea what the advantages and disadvantages are in the various
>> tokenizers and filters for Asian characters.  There may be a combination
>> of tokenizer and filters that will do what you want.
>>
>> We do have an index for a company in Japan.  I'm using ICUTokenizer with
>> some of the CJK filters, and in some cases I'm using
>> ICUFoldingFilterFactory for lowercasing and normalization.  The jars
>> required for ICU analysis components can be found in the contrib folder
>> in the Solr download.
>>
>> There are ways to create a whole new index and then move it into place
>> to replace your existing index.  For SolrCloud mode, you would use the
>> collection alias feature.  For standalone Solr, you can swap cores.
>>
>> Thanks,
>> Shawn
>>
>>
