RE: including a minus sign "-" in the token

Phil Scadden Sun, 11 Jun 2017 16:33:07 -0700

Looking at the Classic tokenizer, I notice that it does not split on a hyphen 
if there is a number in the word, which is pretty much exactly what I want. 
What are the downsides to using Classic?
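
For reference, a minimal sketch of what that swap might look like, assuming the 
same field type as in the quoted message below with only the tokenizer class 
changed. The name "text_classic" is hypothetical, and the result should be 
checked on the Solr admin Analysis screen against inputs such as "wainui-8", 
"Wainui-10A" and "inter-montane". One documented difference to weigh: the 
Classic tokenizer does not use the Unicode UAX#29 word boundary rules that the 
Standard tokenizer uses.

```xml
<!-- Sketch only: "text_classic" is a made-up name; the filters are copied
     from the existing text_general definition. -->
<fieldType name="text_classic" class="solr.TextField"
           positionIncrementGap="100" multiValued="true">
  <analyzer type="index">
    <!-- Documented behaviour: splits on hyphens unless the word contains a number -->
    <tokenizer class="solr.ClassicTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" words="stopwords.txt" ignoreCase="true"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.ClassicTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" words="stopwords.txt" ignoreCase="true"/>
    <filter class="solr.SynonymFilterFactory" expand="true"
            ignoreCase="true" synonyms="synonyms.txt"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```

Changing the tokenizer changes the tokens stored in the index, so existing 
documents would need to be reindexed.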

-----Original Message-----
From: Shawn Heisey [mailto:apa...@elyograg.org]
Sent: Monday, 12 June 2017 2:44 a.m.
To: Phil Scadden <p.scad...@gns.cri.nz>
Subject: Re: including a minus sign "-" in the token

On 6/9/2017 8:12 PM, Phil Scadden wrote:
> So, the field I am using for search has type of:
>   <fieldType name="text_general" class="solr.TextField" 
> positionIncrementGap="100" multiValued="true">
>     <analyzer type="index">
>       <tokenizer class="solr.StandardTokenizerFactory"/>
>       <filter class="solr.StopFilterFactory" words="stopwords.txt" 
> ignoreCase="true"/>
>       <filter class="solr.LowerCaseFilterFactory"/>
>     </analyzer>
>     <analyzer type="query">
>       <tokenizer class="solr.StandardTokenizerFactory"/>
>       <filter class="solr.StopFilterFactory" words="stopwords.txt" 
> ignoreCase="true"/>
>       <filter class="solr.SynonymFilterFactory" expand="true" 
> ignoreCase="true" synonyms="synonyms.txt"/>
>       <filter class="solr.LowerCaseFilterFactory"/>
>     </analyzer>
>   </fieldType>
>
> You are saying "wainui-8" will be indexed as one token? But I should add a 
> WordDelimiterFilter to the analyzer to prevent it being split? Or I guess the 
> WordDelimiterGraphFilter.

No, I was saying that the query parser won't look at the hyphen in
wainui-8 and treat it as a "NOT" operator.

Whatever you've got for index/query analysis will still take effect after that 
-- and it will do that even if you escape characters with a backslash.

Your index and query analysis are almost the same, but query analysis does 
synonym replacement.  The StandardTokenizerFactory will split "wainui-8" into 
two tokens and remove the hyphen, even if you escape it at query time.

> Ideally I want "inter-montane", say, to be treated as hyphenated, but a hyphen 
> followed by a number NOT to be treated as hyphenated. That would mean 
> catenateWords:1 but catenateNumbers:0???
> What would it do with Wainui-10A?

I'm not sure that there is any single built-in analysis component that will do 
what you want.  Your index analysis includes StandardTokenizerFactory, so it is 
going to remove hyphens and split tokens at those locations, whether or not 
they are followed by numbers.

You're going to need to switch to the whitespace tokenizer and add a filter 
(like the word delimiter filter) to do further splitting.  The 
"splitOnNumerics" setting for the word delimiter filter *might* do it, but I'm 
not sure.  It might take a combination of filters.
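
As a starting point only, the whitespace tokenizer plus word delimiter filter 
combination could be sketched like this. The field type name "text_ws_wdf" is 
hypothetical and the attribute values are untested guesses, not a confirmed 
answer to the "wainui-8" case. preserveOriginal="1" is included so that the 
original hyphenated token is kept alongside whatever parts the filter 
generates. Verify the output on the Solr admin Analysis screen before relying 
on it.

```xml
<!-- Sketch only: name and attribute values are assumptions to experiment with. -->
<fieldType name="text_ws_wdf" class="solr.TextField"
           positionIncrementGap="100" multiValued="true">
  <analyzer type="index">
    <!-- Splits on whitespace only, so hyphens reach the filter below -->
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.WordDelimiterGraphFilterFactory"
            generateWordParts="1" generateNumberParts="1"
            catenateWords="1" catenateNumbers="0"
            splitOnNumerics="0" preserveOriginal="1"/>
    <!-- Graph filters need flattening at index time -->
    <filter class="solr.FlattenGraphFilterFactory"/>
    <filter class="solr.StopFilterFactory" words="stopwords.txt" ignoreCase="true"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.WordDelimiterGraphFilterFactory"
            generateWordParts="1" generateNumberParts="1"
            catenateWords="1" catenateNumbers="0"
            splitOnNumerics="0" preserveOriginal="1"/>
    <filter class="solr.StopFilterFactory" words="stopwords.txt" ignoreCase="true"/>
    <filter class="solr.SynonymFilterFactory" expand="true"
            ignoreCase="true" synonyms="synonyms.txt"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```

Note that the whitespace tokenizer leaves adjacent punctuation attached to 
tokens, so the preserved original may carry a trailing comma or full stop.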

Thanks,
Shawn

