Thank you, Upayavira.

I'm trying to figure out what will make Solr stem on "multi" in the word 
"multicad" so that any attempt to search on "multicad", "Multi-CAD" or 
"multiCAD" will return results. The WordDelimiterFilterFactory helps with the 
case of multi followed by a dash or a capital letter, but I'm not sure how to 
get Solr to tokenize the word "multi". Should I look at ngram configurations? 
Or is there a filter which promotes (rather than protects) words from being 
stemmed? (In other words, could I configure in a text file that "multi" should 
be stemmed?)

Just to reiterate, I am not getting any results when I search for the word 
"multicad", even though it appears many times in the text as "multiCAD" and 
"Multi-CAD".

Here is my configuration:

<analyzer>
  <tokenizer class="solr.StandardTokenizerFactory"/>
  <filter class="solr.StopFilterFactory" ignoreCase="true"
          words="stopwords_en.txt" enablePositionIncrements="true"/>
  <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
          ignoreCase="true" expand="true"/>
  <filter class="solr.WordDelimiterFilterFactory"
          generateWordParts="1" generateNumberParts="1" catenateWords="1"
          catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
  <filter class="solr.LowerCaseFilterFactory"/>
  <filter class="solr.SnowballPorterFilterFactory" language="English"
          protected="protwords.txt"/>
</analyzer>

-----Original Message-----
From: Upayavira [mailto:u...@odoko.co.uk] 
Sent: Monday, September 30, 2013 1:45 PM
To: solr-user@lucene.apache.org
Subject: Re: Searching on (hyphenated/capitalized) word issue

You need to look at your analysis chain. The stuff you're talking about there 
is all configurable.

There are different tokenisers available to split your fields differently; then 
you might use the WordDelimiterFilterFactory to split existing tokens further 
(e.g. WiFi might become "wi", "fi" and "WiFi"). So really, you need to craft 
your own analysis chain to fit the kind of data you are working with.
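
As one possible illustration of crafting the chain, here is a minimal sketch 
that reuses the filters from the configuration quoted in this thread and swaps 
only the tokenizer. The reasoning is that StandardTokenizer breaks "Multi-CAD" 
at the hyphen before the WordDelimiterFilter ever sees it, so catenateWords has 
nothing to join, whereas WhitespaceTokenizer passes "Multi-CAD" through as one 
token. This is an untested assumption about this particular index, not a 
confirmed fix.

```xml
<!-- Sketch only: same filters as the chain quoted in this thread, with the
     tokenizer swapped. Not verified against the poster's data. -->
<analyzer>
  <!-- Assumption: splitting on whitespace only keeps "Multi-CAD" intact, so the
       WordDelimiterFilter below can split it and also emit the joined form. -->
  <tokenizer class="solr.WhitespaceTokenizerFactory"/>
  <filter class="solr.StopFilterFactory" ignoreCase="true"
          words="stopwords_en.txt" enablePositionIncrements="true"/>
  <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
          ignoreCase="true" expand="true"/>
  <!-- catenateWords="1" is what should add "MultiCAD" alongside "Multi" and "CAD". -->
  <filter class="solr.WordDelimiterFilterFactory"
          generateWordParts="1" generateNumberParts="1" catenateWords="1"
          catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
  <filter class="solr.LowerCaseFilterFactory"/>
  <filter class="solr.SnowballPorterFilterFactory" language="English"
          protected="protwords.txt"/>
</analyzer>
```

Two caveats with this sketch: the stop and synonym filters now run on tokens 
that may still carry punctuation, so they may match less often, and any change 
to index-time analysis needs a reindex. The Analysis screen in the Solr admin 
UI shows the tokens each stage emits for a given input, which is the way to 
check whether "multicad" actually ends up in the index.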

Upayavira

On Mon, Sep 30, 2013, at 06:50 PM, Van Tassell, Kristian wrote:
> I have a search term "multi-CAD" being issued on tokenized text.  The 
> problem is that you cannot get any search results when you type 
> "multicad" unless you add a hyphen (multi-cad) or type "multiCAD"
> (omitting the hyphen, but correctly adding the CAPS into the spelling).
> 
> However, for the similar but unhyphenated word AutoCAD, you can type 
> "autocad" and get hits for AutoCAD, as you would expect. You can type 
> "auto-cad" and get the same results.
> 
> The query seems to get parsed as separate words (resulting in hits) 
> for multi-CAD, multiCAD, autocad, auto-cad and AUTOCAD, but not for multicad.
> In other words, the search terms  become "multi cad" and "auto cad" 
> for all cases except for when the term is "multicad".
> 
> I'm guessing this may be due in part to "auto" being a more common word 
> prefix, but I may be wrong. Can anyone provide some clarity (and maybe 
> point me towards a potential solution)?
> 
> Thanks in advance!
> 
> 
> Kristian Van Tassell
> Siemens Industry Sector
> Siemens Product Lifecycle Management Software Inc.
> 5939 Rice Creek Parkway
> Shoreview, MN  55126 United States
> Tel.      :+1 (651) 855-6194
> Fax      :+1 (651) 855-6280
> kristian.vantass...@siemens.com
> www.siemens.com/plm
> 
