On 8/26/2020 12:05 AM, Kayak28 wrote:
I would like to tokenize the following sentence so that hyphens are kept inside the tokens. For example:

original text: This is a new abc-edg and xyz-abc is coming soon!
desired output tokens: this/is/a/new/abc-edg/and/xyz-abc/is/coming/soon/!

Is there any way to keep the hyphens in the tokens? I thought HyphenatedWordsFilter had similar functionality, but it gets rid of hyphens.

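I doubt that filter is what you need.  It is meant for rejoining words that were hyphenated and broken across line ends, which is why it drops the hyphen.  It is fully described in the Javadocs: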

https://lucene.apache.org/core/8_6_0/analyzers-common/org/apache/lucene/analysis/miscellaneous/HyphenatedWordsFilter.html

Your requirement is a little odd.  Are you SURE that you want to preserve hyphens like that?

I think that you could probably achieve it with a carefully configured WordDelimiterGraphFilter.  This filter can be highly customized with its "types" parameter.  This parameter refers to a file in the conf directory that can change how the filter recognizes certain characters.  I think that if you used the whitespace tokenizer along with the word delimiter filter, and put the following line into the file referenced by the "types" parameter, it would do most of what you're after:

- => ALPHA

What that config would do is cause the word delimiter filter to treat the hyphen as an alpha character -- so it will not use it as a delimiter.  One thing about the way it works -- the exclamation point at the end of your sentence would NOT be emitted as a token as you have described.  If that is critically important, and I cannot imagine that it would be, you're probably going to want to write your own custom filter.  That would be very much an expert option.
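
As a rough, untested sketch of that setup, the schema entry could look something like the following.  The fieldType name "text_keep_hyphen" and the file name "wdftypes.txt" are placeholders I made up, and the attribute values are just one plausible choice, not something checked against your data.  The file named in "types" would sit in the conf directory and contain the "- => ALPHA" line shown above.

```xml
<!-- Sketch only: "text_keep_hyphen" and "wdftypes.txt" are hypothetical names. -->
<fieldType name="text_keep_hyphen" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <!-- Split on whitespace only, so "abc-edg" arrives as one token. -->
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <!-- The types file remaps "-" to ALPHA, so it is not used as a delimiter. -->
    <filter class="solr.WordDelimiterGraphFilterFactory"
            types="wdftypes.txt"
            generateWordParts="1"
            generateNumberParts="1"
            splitOnCaseChange="0"
            splitOnNumerics="0"/>
    <!-- Recommended after graph-producing filters at index time. -->
    <filter class="solr.FlattenGraphFilterFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.WordDelimiterGraphFilterFactory"
            types="wdftypes.txt"
            generateWordParts="1"
            generateNumberParts="1"
            splitOnCaseChange="0"
            splitOnNumerics="0"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```

With a chain like this I would expect "abc-edg" and "xyz-abc" to come through intact and "soon!" to come out as "soon", but the Analysis screen in the admin UI is the place to confirm what each stage actually emits for your text.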

Thanks,
Shawn
