Re: Is there a way to stop some hyphenated terms from being tokenized

Michael Sokolov Wed, 05 Nov 2014 17:21:42 -0800

You didn't describe your analysis chain, but maybe you are using WordDelimiterFilter to break up hyphenated words? If so, it has a protwords.txt feature (the "protected" attribute on the filter factory) that lets you specify exceptions.
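
For illustration, here is a rough sketch of what such a chain could look like in schema.xml. The field type name and the flag values are made up for the example, not taken from the original poster's setup. It also assumes a whitespace tokenizer, since the standard tokenizer would already have split on the hyphen before WordDelimiterFilter ever sees the term.

```xml
<!-- Hypothetical field type; the name and flag values are illustrative only. -->
<fieldType name="text_hyphen_protected" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <!-- Split on whitespace only, so "e-cigarette" reaches the filter as one token. -->
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <!-- Tokens listed in protwords.txt are passed through without being split. -->
    <filter class="solr.WordDelimiterFilterFactory"
            generateWordParts="1"
            generateNumberParts="1"
            catenateWords="0"
            catenateNumbers="0"
            catenateAll="0"
            splitOnCaseChange="0"
            protected="protwords.txt"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```

protwords.txt would then list one term per line (e-cigarette, e-cig, I-pad). The entries need to match the token as it arrives at the filter, so with lowercasing placed after the filter, as in this sketch, the original case probably matters.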


-Mike


On 11/5/2014 5:36 PM, Michael Della Bitta wrote:

Pretty sure what you need is called KeywordMarkerFilterFactory.
<filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt" />
On 11/5/14 17:24, Tang, Rebecca wrote:
Hi there,
For some hyphenated terms, I want them to stay as is instead of being tokenized. For example: e-cigarette, e-cig, I-pad. I don't want them to be split into e and cig or I and pad, because the single letters e and I produce too many false positive matches.
Is there a way to tell the standard tokenizer to skip tokenizing some terms?
Rebecca Tang
Applications Developer, UCSF CKM
Legacy Tobacco Document Library <legacy.library.ucsf.edu/>
E: rebecca.t...@ucsf.edu
