Sorry, the correct pipeline which I'm using should be this:

<fieldType name="text_chinese" class="solr.TextField" positionIncrementGap="100" autoGeneratePhraseQueries="false">
<analyzer type="index">
<tokenizer class="solr.HMMChineseTokenizerFactory"/>
<filter class="solr.CJKWidthFilterFactory"/>
<filter class="solr.CJKBigramFilterFactory"/>
<filter class="solr.StopFilterFactory" words="org/apache/lucene/analysis/cn/smart/stopwords.txt"/>
<filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" />
<filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="0" catenateNumbers="0" catenateAll="0" splitOnCaseChange="1"/>
<filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="false"/>
<filter class="solr.KStemFilterFactory"/>
<filter class="solr.EdgeNGramFilterFactory" minGramSize="1" maxGramSize="15"/>
</analyzer>
<analyzer type="query">
<tokenizer class="solr.HMMChineseTokenizerFactory"/>
<filter class="solr.CJKWidthFilterFactory"/>
<filter class="solr.CJKBigramFilterFactory"/>
<filter class="solr.StopFilterFactory" words="org/apache/lucene/analysis/cn/smart/stopwords.txt"/>
<filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" />
<filter class="solr.WordDelimiterFilterFactory" generateWordParts="0" generateNumberParts="0" catenateWords="0" catenateNumbers="0" catenateAll="0" splitOnCaseChange="0"/>
<filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="false"/>
<filter class="solr.KStemFilterFactory"/>
</analyzer>
</fieldType>
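For the alphanumeric codes mentioned below, one option I am still considering (only a sketch, not yet tested on this index) is a separate field type that keeps the whole code as a single token, for example with KeywordTokenizerFactory. The name text_code here is just a placeholder:

<fieldType name="text_code" class="solr.TextField" positionIncrementGap="100">
<analyzer>
<!-- KeywordTokenizer keeps the entire field value, e.g. "1a2b3c4d", as one token -->
<tokenizer class="solr.KeywordTokenizerFactory"/>
<!-- lower-case so the code matches regardless of case -->
<filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
</fieldType>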
Regards,
Edwin

On 16 March 2016 at 18:33, Zheng Lin Edwin Yeo <edwinye...@gmail.com> wrote:

> Hi,
>
> I'm using Solr 5.4.0, with the HMMChineseTokenizer in my Solr, and below
> is my pipeline.
>
> <fieldType name="text_chinese" class="solr.TextField"
> positionIncrementGap="100" autoGeneratePhraseQueries="false">
> <analyzer type="index">
> <tokenizer class="solr.HMMChineseTokenizerFactory"/>
> <filter class="solr.CJKWidthFilterFactory"/>
> <filter class="solr.CJKBigramFilterFactory"/>
> <filter class="solr.StopFilterFactory"
> words="org/apache/lucene/analysis/cn/smart/stopwords.txt"/>
> <filter class="solr.StopFilterFactory" ignoreCase="true"
> words="stopwords.txt" />
> <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1"
> generateNumberParts="1" catenateWords="0" catenateNumbers="0"
> catenateAll="0" splitOnCaseChange="1"/>
> <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
> ignoreCase="true" expand="false"/>
> <filter class="solr.KStemFilterFactory"/>
> <filter class="solr.EdgeNGramFilterFactory" minGramSize="1"
> maxGramSize="15"/>
> </analyzer>
> <analyzer type="query">
> <!--<tokenizer class="solr.HMMChineseTokenizerFactory"/>-->
> <tokenizer class="solr.StandardTokenizerFactory"/>
> <filter class="solr.CJKWidthFilterFactory"/>
> <filter class="solr.CJKBigramFilterFactory"/>
> <filter class="solr.StopFilterFactory"
> words="org/apache/lucene/analysis/cn/smart/stopwords.txt"/>
> <filter class="solr.StopFilterFactory" ignoreCase="true"
> words="stopwords.txt" />
> <filter class="solr.WordDelimiterFilterFactory" generateWordParts="0"
> generateNumberParts="0" catenateWords="0" catenateNumbers="0"
> catenateAll="0" splitOnCaseChange="0"/>
> <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
> ignoreCase="true" expand="false"/>
> <filter class="solr.KStemFilterFactory"/>
> </analyzer>
> </fieldType>
>
> I found that HMMChineseTokenizer will split a string that consists of
> numbers and characters (alphanumeric). For example, if I have a code that
> looks like "1a2b3c4d", it will be split into 1 | a | 2 | b | 3 | c | 4 | d.
> This has slowed the search queries quite tremendously (at least 10 seconds
> slower), as it has to search through the individual tokens.
>
> Would like to check, is there any way that we can solve this issue without
> re-indexing? We have quite a lot of codes in the index which consist of
> alphanumeric characters, and we have more than 10 million documents in the
> index, so re-indexing with another tokenizer or pipeline would be quite a
> huge process.
>
>
> Regards,
> Edwin
>
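One way to wire such a field in (again only a sketch; the field names content and content_code are placeholders for whatever the schema actually uses) would be a copyField from the existing content field, so that codes can be searched against the new field:

<field name="content_code" type="text_code" indexed="true" stored="false"/>
<copyField source="content" dest="content_code"/>

As far as I understand, analyzer and schema changes only take effect for documents indexed after the change, so the documents already in the index would still need to be re-indexed before the new field is populated; I don't think the existing single-character tokens can be fixed purely at query time.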