Hi,

I'm using Solr 5.4.0 with the HMMChineseTokenizer, and below is my analysis
pipeline.

<fieldType name="text_chinese" class="solr.TextField"
           positionIncrementGap="100" autoGeneratePhraseQueries="false">
  <analyzer type="index">
    <tokenizer class="solr.HMMChineseTokenizerFactory"/>
    <filter class="solr.CJKWidthFilterFactory"/>
    <filter class="solr.CJKBigramFilterFactory"/>
    <filter class="solr.StopFilterFactory"
            words="org/apache/lucene/analysis/cn/smart/stopwords.txt"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true"
            words="stopwords.txt"/>
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1"
            generateNumberParts="1" catenateWords="0" catenateNumbers="0"
            catenateAll="0" splitOnCaseChange="1"/>
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
            ignoreCase="true" expand="false"/>
    <filter class="solr.KStemFilterFactory"/>
    <filter class="solr.EdgeNGramFilterFactory" minGramSize="1"
            maxGramSize="15"/>
  </analyzer>
  <analyzer type="query">
    <!--<tokenizer class="solr.HMMChineseTokenizerFactory"/>-->
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.CJKWidthFilterFactory"/>
    <filter class="solr.CJKBigramFilterFactory"/>
    <filter class="solr.StopFilterFactory"
            words="org/apache/lucene/analysis/cn/smart/stopwords.txt"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true"
            words="stopwords.txt"/>
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="0"
            generateNumberParts="0" catenateWords="0" catenateNumbers="0"
            catenateAll="0" splitOnCaseChange="0"/>
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
            ignoreCase="true" expand="false"/>
    <filter class="solr.KStemFilterFactory"/>
  </analyzer>
</fieldType>

I found that HMMChineseTokenizer splits strings that consist of mixed numbers
and letters (alphanumeric codes). For example, a code like "1a2b3c4d" is
split into 1 | a | 2 | b | 3 | c | 4 | d. This has slowed our queries down
quite tremendously (at least 10 seconds slower), as Solr has to search
through the individual tokens.
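For reference, here is a minimal sketch that reproduces the splitting outside
Solr by running the same tokenizer directly. It assumes
lucene-analyzers-smartcn 5.4.0 is on the classpath; the class name
TokenizerCheck is just for illustration.

import java.io.StringReader;

import org.apache.lucene.analysis.cn.smart.HMMChineseTokenizer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public class TokenizerCheck {
  public static void main(String[] args) throws Exception {
    // Feed an alphanumeric code through HMMChineseTokenizer and print
    // each token it emits, one per line.
    HMMChineseTokenizer tokenizer = new HMMChineseTokenizer();
    tokenizer.setReader(new StringReader("1a2b3c4d"));
    CharTermAttribute term = tokenizer.addAttribute(CharTermAttribute.class);
    tokenizer.reset();
    while (tokenizer.incrementToken()) {
      System.out.println(term.toString());
    }
    tokenizer.end();
    tokenizer.close();
  }
}

Running this prints the single characters rather than the whole code, which
matches what I see in the Solr Analysis screen for the index-time chain.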

Would like to check: is there any way we can solve this issue without
re-indexing? We have quite a lot of these alphanumeric codes in the index,
and more than 10 million documents, so re-indexing with another tokenizer or
pipeline would be quite a huge process.


Regards,
Edwin