Re: About solr.HyphenatedWordsFilter

Shawn Heisey Tue, 25 Aug 2020 23:57:24 -0700

On 8/26/2020 12:05 AM, Kayak28 wrote:

I would like to tokenize the following sentence. I want tokens that retain their hyphens. So, for example:

original text: This is a new abc-edg and xyz-abc is coming soon!
desired output tokens: this/is/a/new/abc-edg/and/xyz-abc/is/coming/soon/!

Is there any way to keep hyphens in the tokens? I thought HyphenatedWordsFilter had similar functionality, but it gets rid of hyphens.


I doubt that filter is what you need. It rejoins words that were hyphenated across a line break, which is why the hyphen disappears. It is fully described in the Javadocs:

https://lucene.apache.org/core/8_6_0/analyzers-common/org/apache/lucene/analysis/miscellaneous/HyphenatedWordsFilter.html

Your requirement is a little odd. Are you SURE that you want to preserve hyphens like that?

I think that you could probably achieve it with a carefully configured WordDelimiterGraphFilter. This filter can be highly customized with its "types" parameter. This parameter refers to a file in the conf directory that can change how the filter recognizes certain characters. I think that if you used the whitespace tokenizer along with the word delimiter filter, and put the following line into the file referenced by the "types" parameter, it would do most of what you're after:


- => ALPHA

What that config would do is cause the word delimiter filter to treat the hyphen as an alpha character -- so it will not use it as a delimiter. One thing about the way it works -- the exclamation point at the end of your sentence would NOT be emitted as a token as you have described. If that is critically important, and I cannot imagine that it would be, you're probably going to want to write your own custom filter. That would be very much an expert option.
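As a rough sketch of the setup described above, a field type along these lines might work. The field type name "text_keep_hyphen" and the file name "wdftypes.txt" are placeholders I made up for illustration, and the attribute values are only one plausible starting point, so check the result on the Analysis screen of the admin UI before relying on it:

```xml
<!-- Hypothetical field type: the name and the types file name are placeholders. -->
<fieldType name="text_keep_hyphen" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <!-- Split on whitespace only, so "abc-edg" reaches the filter intact. -->
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <!-- conf/wdftypes.txt is assumed to contain the single line:  - => ALPHA -->
    <filter class="solr.WordDelimiterGraphFilterFactory"
            types="wdftypes.txt"
            generateWordParts="1"
            generateNumberParts="1"
            splitOnCaseChange="0"/>
    <!-- Optional: matches the lowercase tokens in the desired output. -->
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- Graph filters should be flattened at index time. -->
    <filter class="solr.FlattenGraphFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.WordDelimiterGraphFilterFactory"
            types="wdftypes.txt"
            generateWordParts="1"
            generateNumberParts="1"
            splitOnCaseChange="0"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```

With the hyphen mapped to ALPHA, "abc-edg" and "xyz-abc" should come through as single tokens, while the trailing "!" on "soon!" is still treated as a delimiter and dropped rather than emitted as its own token.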


Thanks,
Shawn
