So, the field I am using for search has type of:

<fieldType name="text_general" class="solr.TextField" positionIncrementGap="100" multiValued="true">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" words="stopwords.txt" ignoreCase="true"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" words="stopwords.txt" ignoreCase="true"/>
    <filter class="solr.SynonymFilterFactory" expand="true" ignoreCase="true" synonyms="synonyms.txt"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
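
One possible direction, given here as an untested sketch for comparison: swap the tokenizer for solr.WhitespaceTokenizerFactory, which splits only on whitespace, so a term such as Wainui-8 reaches the later filters as a single token. The field type name text_hyphen is hypothetical, and the stopword and synonym files are assumed to be the same ones used in the definition above. A known caveat of whitespace tokenization is that adjacent punctuation stays attached, so "Wainui-8," at the end of a clause would keep its comma unless another filter strips it.

```xml
<!-- Hypothetical field type: same filters as text_general, different tokenizer. -->
<fieldType name="text_hyphen" class="solr.TextField" positionIncrementGap="100" multiValued="true">
  <analyzer type="index">
    <!-- Splits on whitespace only, so "Wainui-8" is not broken at the hyphen. -->
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" words="stopwords.txt" ignoreCase="true"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <!-- The query side uses the same tokenizer so that query terms match indexed terms. -->
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" words="stopwords.txt" ignoreCase="true"/>
    <filter class="solr.SynonymFilterFactory" expand="true" ignoreCase="true" synonyms="synonyms.txt"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```

Documents would need to be reindexed after a change like this, and the Analysis screen in the Solr admin UI can show how Wainui-8, Wainui-10A and inter-montane are actually tokenized by each chain.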
You are saying "wainui-8" will be indexed as one token? But I should add a WordDelimiterFilter to the analyser to prevent it being split? Or I guess the WordDelimiterGraphFilter.

Ideally I want "inter-montane", say, to be treated as hyphenated, but a hyphen followed by a number NOT to be treated as hyphenated. That would mean catenateWords=1 but catenateNumbers=0? What would it do with Wainui-10A?

-----Original Message-----
From: Shawn Heisey [mailto:apa...@elyograg.org]
Sent: Saturday, 10 June 2017 12:43 a.m.
To: solr-user@lucene.apache.org
Subject: Re: including a minus sign "-" in the token

On 6/8/2017 8:39 PM, Phil Scadden wrote:
> We have important entities referenced in indexed documents which follow a
> naming convention of geographicname-number, e.g. Wainui-8. I want the tokenizer
> to treat it as Wainui-8 when indexing, and when I search I want a q of
> Wainui-8 (must it be specified as Wainui\-8?) to return docs with Wainui-8
> but not with Wainui-9 or plain Wainui.
>
> Docs are PDFs, and I am using Tika to extract text.
>
> How do I set up Solr for queries like this?

At indexing time, Solr does not treat the hyphen as a special character like it does at query time. Many analysis components do, though. If your analysis chain includes certain components (the standard tokenizer, the ICU tokenizer, and WordDelimiterFilter are on that list), then the hyphen may be treated as a word break character and the analysis could remove it.

At query time, a hyphen in the middle of a word is not treated as a special character. It would need to be at the beginning of the query text or after a space for the query parser to treat it as a negation. So Wainui-8 would not be a problem, but -7 would, and you'd need to specify it as \-7 for it to work like you want.

Thanks,
Shawn