RE: including a minus sign "-" in the token

Phil Scadden Fri, 09 Jun 2017 19:13:02 -0700

So, the field I am using for search has type of:
  <fieldType name="text_general" class="solr.TextField" positionIncrementGap="100" multiValued="true">
    <analyzer type="index">
      <tokenizer class="solr.StandardTokenizerFactory"/>
      <filter class="solr.StopFilterFactory" words="stopwords.txt" ignoreCase="true"/>
      <filter class="solr.LowerCaseFilterFactory"/>
    </analyzer>
    <analyzer type="query">
      <tokenizer class="solr.StandardTokenizerFactory"/>
      <filter class="solr.StopFilterFactory" words="stopwords.txt" ignoreCase="true"/>
      <filter class="solr.SynonymFilterFactory" expand="true" ignoreCase="true" synonyms="synonyms.txt"/>
      <filter class="solr.LowerCaseFilterFactory"/>
    </analyzer>
  </fieldType>

You are saying "wainui-8" will be indexed as one token? But should I add a 
WordDelimiterFilter to the analyser to prevent it being split? Or, I guess, the 
WordDelimiterGraphFilter.

Ideally I want "inter-montane", say, to be treated as hyphenated, but a hyphen 
followed by a number NOT to be treated as hyphenated. That would mean 
catenateWords:1 but catenateNumbers:0?
What would it do with Wainui-10A?
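
For reference, the catenateWords / catenateNumbers combination asked about above could be tried with a field type along these lines. This is only a sketch under assumptions: the field type name text_hyphen_wdgf is made up, a whitespace tokenizer is swapped in for the standard tokenizer so that the hyphen actually reaches the filter, and preserveOriginal="1" is added so the unsplit token is kept as well. The stop and synonym filters from the field type above are left out for brevity.

```xml
<!-- Hypothetical field type; the name and option choices are assumptions. -->
<fieldType name="text_hyphen_wdgf" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <!-- Break on whitespace only, so "inter-montane" reaches the filter intact. -->
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <!-- Join adjacent word parts, do not join number parts, keep the original token. -->
    <filter class="solr.WordDelimiterGraphFilterFactory"
            generateWordParts="1" generateNumberParts="1"
            catenateWords="1" catenateNumbers="0" catenateAll="0"
            preserveOriginal="1"/>
    <!-- Graph filters need to be flattened before the tokens are indexed. -->
    <filter class="solr.FlattenGraphFilterFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.WordDelimiterGraphFilterFactory"
            generateWordParts="1" generateNumberParts="1"
            catenateWords="1" catenateNumbers="0" catenateAll="0"
            preserveOriginal="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```

As far as I understand the filter, inter-montane should then produce inter, montane and the catenated intermontane alongside the original (all lowercased by the last filter), while Wainui-8 should produce Wainui, 8 and the original, with no catenated form. Because splitOnNumerics defaults to 1, Wainui-10A would probably also be split into Wainui, 10 and A. The Analysis screen in the Solr admin UI is the place to confirm what the chain really emits.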

-----Original Message-----
From: Shawn Heisey [mailto:apa...@elyograg.org]
Sent: Saturday, 10 June 2017 12:43 a.m.
To: solr-user@lucene.apache.org
Subject: Re: including a minus sign "-" in the token

On 6/8/2017 8:39 PM, Phil Scadden wrote:
> We have important entities referenced in indexed documents which follow the
> naming convention geographicname-number, e.g. Wainui-8. I want the tokenizer 
> to treat it as Wainui-8 when indexing, and when I search I want a q of 
> Wainui-8 (must it be specified as Wainui\-8 ??) to return docs with Wainui-8 
> but not with Wainui-9 or plain Wainui.
>
> Docs are PDFs, and I am using Tika to extract text.
>
> How do I set up solr for queries like this?

At indexing time, Solr does not treat the hyphen as a special character like it 
does at query time.  Many analysis components do, though.  If your analysis 
chain includes certain components (the standard tokenizer, the ICU tokenizer, 
and WordDelimiterFilter are on that list), then the hyphen may be treated as a 
word break character and the analysis could remove it.
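
One way to stop the hyphen being removed is to use a tokenizer that only breaks on whitespace. The following is a minimal sketch, not a tested configuration: the field type name text_ws_hyphen is hypothetical, and the stop word file is carried over from the field type quoted at the top of the thread.

```xml
<!-- Hypothetical field type; the name is an assumption for illustration. -->
<fieldType name="text_ws_hyphen" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <!-- Splits on whitespace only, so "Wainui-8" stays a single token. -->
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" words="stopwords.txt" ignoreCase="true"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <!-- Same tokenizer at query time, so the query term matches the indexed token. -->
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" words="stopwords.txt" ignoreCase="true"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```

With this chain Wainui-8 should be indexed as the single token wainui-8, so a query for Wainui-8 would not match Wainui-9 or plain Wainui. The cost is that punctuation stays attached: "Wainui-8," at the end of a clause keeps its trailing comma unless another filter strips it.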

At query time, a hyphen in the middle of a word is not treated as a special 
character.  It would need to be at the beginning of the query text or after a 
space for the query parser to treat it as a negation.
So Wainui-8 would not be a problem, but -7 would, and you'd need to specify it 
as \-7 for it to work like you want.

Thanks,
Shawn

