KeywordTokenizerFactory splits by whitespaces

Vadim Gorlovetsky Wed, 25 Mar 2015 09:27:57 -0700

Hello,

solr.KeywordTokenizerFactory seems to split on whitespace, although according 
to the Solr documentation it should not.



For example, I have the following configuration for the fields "proj_name" and 
"proj_name_sort":

<field name="proj_name" type="sortable_text_general" indexed="true" 
stored="true"/>
<field name="proj_name_sort" type="string_sort" indexed="true" stored="false"/>
......

<copyField source="proj_name" dest="proj_name_sort" />
..................

<fieldType name="string_sort" class="solr.TextField" sortMissingLast="true" 
omitNorms="true">
  <analyzer>
    <!-- KeywordTokenizer does no actual tokenizing, so the entire
         input string is preserved as a single token
     -->
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <!-- The LowerCase TokenFilter does what you expect, which can be
         useful when you want your sorting to be case insensitive
      -->
    <filter class="solr.LowerCaseFilterFactory" />
    <!-- The TrimFilter removes any leading or trailing whitespace -->
    <filter class="solr.TrimFilterFactory" />
  </analyzer>
</fieldType>
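
To make the intent of this analyzer chain concrete, here is a rough Python model of what I expect it to do to a field value. This is only a sketch of the three steps described in the comments above, not Solr code; the function name string_sort_analyze is made up for illustration.

```python
def string_sort_analyze(value: str) -> list:
    """Rough model of the string_sort analyzer chain (hypothetical helper, not Solr code)."""
    # KeywordTokenizer: the whole input string becomes a single token.
    tokens = [value]
    # LowerCaseFilter: lowercase every token.
    tokens = [t.lower() for t in tokens]
    # TrimFilter: strip leading and trailing whitespace from every token.
    tokens = [t.strip() for t in tokens]
    return tokens


for name in ["Test1008", "CR610070 Test1", "CR610070 Another Test2"]:
    print(string_sort_analyze(name))
```

Under this model each value yields exactly one token with its inner whitespace kept, for example "cr610070 another test2".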

There are 3 indexed documents with the following proj_name values:
Test1008
CR610070 Test1
CR610070 Another Test2

Searching on "proj_name_sort" gives the following results:

Query:    proj_name_sort : CR610070 Test1
Expected: CR610070 Test1
Actual:   CR610070 Test1
Comment:  As expected; seems to search on the exact un-tokenized value.

Query:    proj_name_sort : CR610070 Te
Expected: None
Actual:   None
Comment:  As expected; seems to search on the exact un-tokenized value.

Query:    proj_name_sort : CR610070 Te*
Expected: CR610070 Test1
Actual:   CR610070 Test1, Test1008, CR610070 Another Test2
Comment:  Seems to split into tokens on whitespace?

Query:    proj_name_sort : CR610070 An*
Expected: CR610070 Another Test2
Actual:   CR610070 Another Test2
Comment:  As expected; seems to apply the wildcard to the un-tokenized value.

Query:    proj_name_sort : CR610070 Another Te*
Expected: CR610070 Another Test2
Actual:   CR610070 Test1, Test1008, CR610070 Another Test2
Comment:  Seems to split into tokens on whitespace?

Query:    proj_name_sort : CR610070 Another Test1*
Expected: None
Actual:   CR610070 Test1, Test1008, CR610070 Another Test2
Comment:  Seems to split into tokens on whitespace?
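
The whitespace-splitting suspicion can be checked against the results listed above with a toy Python model. It is only a sketch of that hypothesis, not confirmed Solr behaviour, and it rests on assumptions that are not in my configuration as shown: that the query text is split on whitespace before analysis, that only the first term is searched in proj_name_sort, that the remaining terms go to a default field holding the word-tokenized proj_name, and that clauses are combined with OR. The names DOCS, matches and search are made up for illustration.

```python
from fnmatch import fnmatchcase

DOCS = ["Test1008", "CR610070 Test1", "CR610070 Another Test2"]


def matches(term, tokens):
    # Glob-style match: a term without wildcards must equal a token exactly.
    return any(fnmatchcase(tok, term) for tok in tokens)


def search(query_text):
    """Toy model: query split on whitespace, clauses OR-ed together (assumption)."""
    terms = query_text.lower().split()
    first, rest = terms[0], terms[1:]
    hits = []
    for doc in DOCS:
        sort_tokens = [doc.lower().strip()]  # proj_name_sort: one token per value
        text_tokens = doc.lower().split()    # assumed default field: word tokens
        if matches(first, sort_tokens) or any(matches(t, text_tokens) for t in rest):
            hits.append(doc)
    return hits


for q in ["CR610070 Test1", "CR610070 Te", "CR610070 Te*",
          "CR610070 An*", "CR610070 Another Te*", "CR610070 Another Test1*"]:
    print(q, "->", search(q))
```

Under these assumptions the model returns the same document sets as the real results for all six queries, which is at least consistent with the splitting happening in the query parsing step and not in the tokenizer.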


Please advise how to search un-tokenized fields using partial criteria 
and wildcards.

Thanks
Vadim

