Re: KeywordTokenizerFactory splits by whitespaces

Erick Erickson Wed, 25 Mar 2015 09:36:43 -0700

This is a _very_ common thing we all had to learn: what you're seeing
is the result of the _query parser_, not the analysis chain. Anything
like proj_name_sort:term1 term2 gets split at the query parser level,
before the field's analysis chain ever runs. Attaching &debug=query to
the URL should show, down in the "parsed query" section, something like:


proj_name_sort:term1 default_search_field:term2
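
If you'd rather script that check than edit the URL by hand, here is a minimal Python sketch. The host, port and core name are placeholders I made up, so substitute your own; the only thing taken from the above is the debug=query parameter.

```python
from urllib.parse import urlencode

# Assumed base URL: host, port and core name are placeholders, not from this thread.
SOLR_SELECT = "http://localhost:8983/solr/collection1/select"

def debug_query_url(q, base=SOLR_SELECT):
    """Return a select URL that asks Solr to include query debug output."""
    # urlencode percent-encodes the query string (":" and "*" included)
    # and turns spaces into "+", so the raw query can be passed as typed.
    params = {"q": q, "debug": "query"}
    return base + "?" + urlencode(params)

# The unescaped query from the original mail; inspect the parsed query
# in the response to see how the parser split it.
print(debug_query_url("proj_name_sort:CR610070 Te*"))
```

Open the printed URL in a browser or fetch it with any HTTP client, then look at the parsed query in the debug section of the response.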

To get the whole value through the query parser, enclose it in double
quotes, escape the space, and so on. That gets the terms to the analysis
chain for that field _as a single token_, where the behavior will be
what you expect.
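
As a rough illustration of the two options, the Python sketch below builds the query strings. The helper names are hypothetical, and it only escapes whitespace; other query-parser special characters are deliberately left alone. For the wildcard cases the escaped form is the one to try, since I'd expect the standard parser to treat a "*" inside double quotes literally.

```python
import re

def escape_whitespace(term):
    """Backslash-escape each whitespace character in term.

    Hypothetical helper: it handles whitespace only, not the other
    characters the query parser treats as special.
    """
    return re.sub(r"(\s)", r"\\\1", term)

def phrase_query(field, term):
    """Exact-value query: wrap the whole value in double quotes."""
    return f'{field}:"{term}"'

def escaped_query(field, term, wildcard=False):
    """Escaped-space query, optionally with a trailing wildcard."""
    q = f"{field}:{escape_whitespace(term)}"
    return q + "*" if wildcard else q

print(phrase_query("proj_name_sort", "CR610070 Test1"))
print(escaped_query("proj_name_sort", "CR610070 Te", wildcard=True))
```

With &debug=query attached, either form should show up in the parsed query as a single clause on proj_name_sort rather than one clause per word.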

Best,
Erick

On Wed, Mar 25, 2015 at 9:26 AM, Vadim Gorlovetsky <vadim...@amdocs.com> wrote:
> Hello,
>
> solr.KeywordTokenizerFactory seems to split on whitespace, although according
> to the Solr documentation it shouldn't do that.
>
>
> For example I have the following configuration for the fields "proj_name" and 
> "proj_name_sort":
>
> <field name="proj_name" type="sortable_text_general" indexed="true" 
> stored="true"/>
> <field name="proj_name_sort" type="string_sort" indexed="true" 
> stored="false"/>
> ......
>
> <copyField source="proj_name" dest="proj_name_sort" />
> ..................
>
> <fieldType name="string_sort" class="solr.TextField" sortMissingLast="true" 
> omitNorms="true">
>   <analyzer>
>     <!-- KeywordTokenizer does no actual tokenizing, so the entire
>          input string is preserved as a single token
>      -->
>     <tokenizer class="solr.KeywordTokenizerFactory"/>
>     <!-- The LowerCase TokenFilter does what you expect, which can be useful
>          when you want your sorting to be case insensitive
>       -->
>     <filter class="solr.LowerCaseFilterFactory" />
>     <!-- The TrimFilter removes any leading or trailing whitespace -->
>     <filter class="solr.TrimFilterFactory" />
>   </analyzer>
> </fieldType>
>
> There are 3 indexed documents having the respective field values:
> proj_name:
> Test1008
> CR610070 Test1
> CR610070 Another Test2
>
> Searching on "proj_name_sort" gives me the following results:
>
> Query:    proj_name_sort : CR610070 Test1
> Expected: CR610070 Test1
> Real:     CR610070 Test1
> Comments: As expected; seems to search the exact un-tokenized value
>
> Query:    proj_name_sort : CR610070 Te
> Expected: None
> Real:     None
> Comments: As expected; seems to search the exact un-tokenized value
>
> Query:    proj_name_sort : CR610070 Te*
> Expected: CR610070 Test1
> Real:     CR610070 Test1, Test1008, CR610070 Another Test2
> Comments: Seems to split into tokens on whitespace?
>
> Query:    proj_name_sort : CR610070 An*
> Expected: CR610070 Another Test2
> Real:     CR610070 Another Test2
> Comments: As expected; seems to apply the wildcard to the un-tokenized value
>
> Query:    proj_name_sort : CR610070 Another Te*
> Expected: CR610070 Another Test2
> Real:     CR610070 Test1, Test1008, CR610070 Another Test2
> Comments: Seems to split into tokens on whitespace?
>
> Query:    proj_name_sort : CR610070 Another Test1*
> Expected: None
> Real:     CR610070 Test1, Test1008, CR610070 Another Test2
> Comments: Seems to split into tokens on whitespace?
>
>
> Please advise how to search on un-tokenized fields using partial
> criteria and wildcards.
>
> Thanks
> Vadim
>
>
