RE: [MARKETING] Re: KeywordTokenizerFactory splits by whitespaces

Vadim Gorlovetsky Wed, 25 Mar 2015 10:07:22 -0700

Thanks for the quick response.

It is a bit confusing that a "query"-type analyzer configured to use 
KeywordTokenizerFactory does not keep the query criteria as a single token.
I guess whitespace is the special case because it separates clauses in a 
query and that split happens before analysis runs.


Actually, I am already handling queries the way you recommend:
double quotes for exact matching, and escaped whitespace for values with 
wildcards (double quotes do not work there, probably because the "*" wildcard 
is then treated as a literal part of the criteria value).

Thanks
Vadim

-----Original Message-----
From: Erick Erickson [mailto:erickerick...@gmail.com] 
Sent: Wednesday, March 25, 2015 6:34 PM
To: solr-user@lucene.apache.org
Subject: [MARKETING] Re: KeywordTokenizerFactory splits by whitespaces

This is a _very_ common thing we all had to learn: what you're seeing is the 
result of the _query parser_, not the analysis chain. Anything like
proj_name_sort:term1 term2 gets split at the query parser level. Attaching 
&debug=query to the URL should show, down in the "parsed query" section, 
something like:

proj_name_sort:term1 default_search_field:term2
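
For anyone who wants to check this from a script, here is a minimal sketch of 
building such a debug request in Python. The host, port and core name 
(localhost:8983, collection1) are placeholders and not values from this 
thread; only the q and debug=query parameters come from the discussion above.

```python
from urllib.parse import urlencode

# Hypothetical endpoint: host, port and core name are assumptions.
SOLR_SELECT = "http://localhost:8983/solr/collection1/select"

def debug_url(query, base=SOLR_SELECT):
    """Return a select URL that asks Solr to include query debug output."""
    # urlencode percent-encodes the query, so the raw space survives transport
    # and it is Solr's query parser (not the HTTP layer) that splits on it.
    params = {"q": query, "debug": "query"}
    return base + "?" + urlencode(params)

# The unquoted, unescaped form from the original question.
print(debug_url("proj_name_sort:CR610070 Te*"))
```

Fetching that URL and reading the debug section of the response should show 
which field each term was attached to.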

To get the whole value through the query parser, enclose it in double quotes, 
escape the space, or the like. That'll get the terms _as a single token_ to the 
analysis chain for that field, where the behavior will be what you expect.
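
A rough sketch of those two options in Python, in case it helps. The helper 
names (exact_clause, prefix_clause) are made up for illustration, and the 
escaping is deliberately minimal: it handles only whitespace, backslashes and 
double quotes, not the other special characters of the query syntax.

```python
import re

def exact_clause(field, value):
    # Quoted form: the whole phrase goes to the field's analyzer in one piece.
    # Inside the quotes, only backslash and double quote are escaped here.
    escaped = value.replace("\\", "\\\\").replace('"', '\\"')
    return '{}:"{}"'.format(field, escaped)

def prefix_clause(field, prefix):
    # Escaped form: each whitespace character gets a backslash so the parser
    # keeps a single term; the trailing * remains a wildcard.
    escaped = re.sub(r"(\s)", r"\\\1", prefix)
    return "{}:{}*".format(field, escaped)

print(exact_clause("proj_name_sort", "CR610070 Test1"))
# prints: proj_name_sort:"CR610070 Test1"
print(prefix_clause("proj_name_sort", "CR610070 Te"))
# prints: proj_name_sort:CR610070\ Te*
```

The quoted form suits exact matches; the escaped form is the one to use when 
the value ends in a wildcard.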

Best,
Erick

On Wed, Mar 25, 2015 at 9:26 AM, Vadim Gorlovetsky <vadim...@amdocs.com> wrote:
> Hello,
>
> solr.KeywordTokenizerFactory seems to split on whitespace, though according 
> to the Solr documentation it shouldn't do that.
>
>
> For example I have the following configuration for the fields "proj_name" and 
> "proj_name_sort":
>
> <field name="proj_name" type="sortable_text_general" indexed="true" stored="true"/>
> <field name="proj_name_sort" type="string_sort" indexed="true" stored="false"/>
> ......
>
> <copyField source="proj_name" dest="proj_name_sort" /> 
> ..................
>
> <fieldType name="string_sort" class="solr.TextField" sortMissingLast="true" 
> omitNorms="true">
>   <analyzer>
>     <!-- KeywordTokenizer does no actual tokenizing, so the entire
>          input string is preserved as a single token
>      -->
>     <tokenizer class="solr.KeywordTokenizerFactory"/>
>     <!-- The LowerCase TokenFilter does what you expect, which can be
>          useful when you want your sorting to be case insensitive
>       -->
>     <filter class="solr.LowerCaseFilterFactory" />
>     <!-- The TrimFilter removes any leading or trailing whitespace -->
>     <filter class="solr.TrimFilterFactory" />
>   </analyzer>
> </fieldType>
>
> There are 3 indexed documents having the respective field values:
> proj_name:
> Test1008
> CR610070 Test1
> CR610070 Another Test2
>
> Searching on "proj_name_sort" gives me the following results:
>
> Query:    proj_name_sort : CR610070 Test1
> Expected: CR610070 Test1
> Real:     CR610070 Test1
> Comments: As expected; seems to search on the exact un-tokenized value
>
> Query:    proj_name_sort : CR610070 Te
> Expected: None
> Real:     None
> Comments: As expected; seems to search on the exact un-tokenized value
>
> Query:    proj_name_sort : CR610070 Te*
> Expected: CR610070 Test1
> Real:     CR610070 Test1, Test1008, CR610070 Another Test2
> Comments: Seems to split into tokens on whitespace ?????
>
> Query:    proj_name_sort : CR610070 An*
> Expected: CR610070 Another Test2
> Real:     CR610070 Another Test2
> Comments: As expected; seems to apply the wildcard to the un-tokenized value
>
> Query:    proj_name_sort : CR610070 Another Te*
> Expected: CR610070 Another Test2
> Real:     CR610070 Test1, Test1008, CR610070 Another Test2
> Comments: Seems to split into tokens on whitespace ?????
>
> Query:    proj_name_sort : CR610070 Another Test1*
> Expected: None
> Real:     CR610070 Test1, Test1008, CR610070 Another Test2
> Comments: Seems to split into tokens on whitespace ?????
>
>
> Please advise how to search on un-tokenized fields using partial 
> criteria and wildcards.
>
> Thanks
> Vadim
>
>
> This message and the information contained herein is proprietary and 
> confidential and subject to the Amdocs policy statement, you may review at 
> http://www.amdocs.com/email_disclaimer.asp

