Hi,

I'm using Solr 5.4.0 with the HMMChineseTokenizer, and below is my analysis
pipeline.

<fieldType name="text_chinese" class="solr.TextField"
           positionIncrementGap="100" autoGeneratePhraseQueries="false">
  <analyzer type="index">
    <tokenizer class="solr.HMMChineseTokenizerFactory"/>
    <filter class="solr.CJKWidthFilterFactory"/>
    <filter class="solr.CJKBigramFilterFactory"/>
    <filter class="solr.StopFilterFactory"
            words="org/apache/lucene/analysis/cn/smart/stopwords.txt"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true"
            words="stopwords.txt"/>
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1"
            generateNumberParts="1" catenateWords="0" catenateNumbers="0"
            catenateAll="0" splitOnCaseChange="1"/>
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
            ignoreCase="true" expand="false"/>
    <filter class="solr.KStemFilterFactory"/>
    <filter class="solr.EdgeNGramFilterFactory" minGramSize="1"
            maxGramSize="15"/>
  </analyzer>
  <analyzer type="query">
    <!--<tokenizer class="solr.HMMChineseTokenizerFactory"/>-->
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.CJKWidthFilterFactory"/>
    <filter class="solr.CJKBigramFilterFactory"/>
    <filter class="solr.StopFilterFactory"
            words="org/apache/lucene/analysis/cn/smart/stopwords.txt"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true"
            words="stopwords.txt"/>
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="0"
            generateNumberParts="0" catenateWords="0" catenateNumbers="0"
            catenateAll="0" splitOnCaseChange="0"/>
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
            ignoreCase="true" expand="false"/>
    <filter class="solr.KStemFilterFactory"/>
  </analyzer>
</fieldType>

I found that HMMChineseTokenizer splits strings that consist of mixed numbers
and letters (alphanumeric codes). For example, a code like "1a2b3c4d" is
split into 1 | a | 2 | b | 3 | c | 4 | d. This has slowed our queries down
quite tremendously (at least 10 seconds slower), as Solr has to search
through the individual tokens.
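For reference, here is a minimal sketch that reproduces the splitting outside
Solr by running the same tokenizer directly. It assumes
lucene-analyzers-smartcn 5.4.0 is on the classpath; the class name
TokenizerCheck is just for illustration.

import java.io.StringReader;

import org.apache.lucene.analysis.cn.smart.HMMChineseTokenizer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public class TokenizerCheck {
  public static void main(String[] args) throws Exception {
    // Feed an alphanumeric code through HMMChineseTokenizer and print
    // each token it emits, one per line.
    HMMChineseTokenizer tokenizer = new HMMChineseTokenizer();
    tokenizer.setReader(new StringReader("1a2b3c4d"));
    CharTermAttribute term = tokenizer.addAttribute(CharTermAttribute.class);
    tokenizer.reset();
    while (tokenizer.incrementToken()) {
      System.out.println(term.toString());
    }
    tokenizer.end();
    tokenizer.close();
  }
}

Running this prints the single characters rather than the whole code, which
matches what I see in the Solr Analysis screen for the index-time chain.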

Would like to check: is there any way we can solve this issue without
re-indexing? We have quite a lot of these alphanumeric codes in the index,
and more than 10 million documents, so re-indexing with another tokenizer or
pipeline would be quite a huge process.


Regards,
Edwin