KeywordTokenizerFactory splits the string for the exclamation mark

Romani Rupasinghe Thu, 15 May 2014 18:17:19 -0700

Hi All

I have the following field settings in my Solr schema:


<field name="Exact_Word" omitPositions="true" termVectors="false"
omitTermFreqAndPositions="true" compressed="true" type="string_ci"
multiValued="false" indexed="true" stored="true" required="false"
omitNorms="true"/>

<field name="Word" compressed="true" type="email_text_ptn"
multiValued="false" indexed="true" stored="true" required="false"
omitNorms="true"/>

<fieldtype name="string_ci" class="solr.TextField" sortMissingLast="true"
omitNorms="true">
  <analyzer>
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldtype>

<copyField source="Word" dest="Exact_Word"/>

As you can see, Exact_Word uses KeywordTokenizerFactory, which should
treat the whole string as a single token.

Following is my responseHeader. As you can see, I am searching for my string
only in the field Exact_Word and expect it to return the Word field and
the score.

"responseHeader":{
    "status":0,
    "QTime":14,
    "params":{
      "explainOther":"",
      "fl":"Word,score",
      "debugQuery":"on",
      "indent":"on",
      "start":"0",
      "q":"[email protected]",
      "qf":"Exact_Word",
      "wt":"json",
      "fq":"",
      "version":"2.2",
      "rows":"10"}},


But when I enter an email with the following string, "d![email protected]",
it splits the string in two. I was under the impression that
KeywordTokenizerFactory would treat the string as it is.

Following is the query debug result. There you can see it has split the word:
 "parsedquery":"+((DisjunctionMaxQuery((Exact_Word:d))
-DisjunctionMaxQuery((Exact_Word:[email protected])))~1)",

Can someone please tell me why it produces this parsed query?

If I enter a string without the "!" sign, the parsed query is as below:
 "parsedquery":"+DisjunctionMaxQuery((Exact_Word:[email protected]))",
This is what I expected Solr to produce even with the "!" mark. With a "_"
character it won't split the string and treats it as it is.

I thought that if KeywordTokenizerFactory is applied, it should return
the exact string as it is.

Please help me to understand what is going wrong here
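
One possible explanation, offered as an assumption rather than a confirmed
diagnosis: the parsed query above contains a prohibited clause (the leading
"-"), which suggests the query parser read "!" as its NOT operator before the
field's analyzer ever ran. If so, KeywordTokenizerFactory is not doing the
splitting, and escaping "!" with a backslash in the query string may keep the
value intact. A minimal Python sketch of such escaping follows;
escape_query_chars and the sample address are hypothetical, and the character
list is an assumption based on the documented Lucene query syntax.

```python
# Hypothetical helper, not part of Solr: backslash-escape characters that
# Lucene-style query parsers may treat as syntax (assumed list).
SPECIAL_CHARS = set('\\+-!():^[]"{}~*?|&;/')


def escape_query_chars(text):
    """Return text with each special or whitespace character backslash-escaped."""
    return "".join(
        "\\" + ch if ch in SPECIAL_CHARS or ch.isspace() else ch
        for ch in text
    )


# Made-up sample value; "!" is escaped, while "@", "." and "_" are left alone.
print(escape_query_chars("d!user@example.com"))  # prints d\!user@example.com
```

The escaped value would then be sent as the q parameter. SolrJ users may
prefer the library's own ClientUtils.escapeQueryChars over a hand-written
helper like this one.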
