On 8/26/2020 12:05 AM, Kayak28 wrote:
I would like to tokenize the following sentence so that the tokens
keep their hyphens. For example:

original text: This is a new abc-edg and xyz-abc is coming soon!
desired output tokens: this/is/a/new/abc-edg/and/xyz-abc/is/coming/soon/!

Is there any way to keep the hyphens in the tokens? I thought
HyphenatedWordsFilter had similar functionality, but it gets rid of
the hyphens.

I doubt that filter is what you need. It rejoins words that were split
by a hyphen at a line break, which is a different problem. It is fully
described in the Javadocs:
https://lucene.apache.org/core/8_6_0/analyzers-common/org/apache/lucene/analysis/miscellaneous/HyphenatedWordsFilter.html
Your requirement is a little odd. Are you SURE that you want to
preserve hyphens like that?
I think that you could probably achieve it with a carefully configured
WordDelimiterGraphFilter. This filter can be highly customized with its
"types" parameter. This parameter refers to a file in the conf
directory that can change how the filter recognizes certain characters.
I think that if you used the whitespace tokenizer along with the word
delimiter filter, and put the following line into the file referenced by
the "types" parameter, it would do most of what you're after:
- => ALPHA
What that config would do is cause the word delimiter filter to treat
the hyphen as an alpha character, so it will not use it as a
delimiter. One caveat: the exclamation point at the end of your
sentence would NOT be emitted as a token as you have described. The
whitespace tokenizer produces "soon!", and the filter still treats the
exclamation point as a delimiter and drops it, leaving "soon". If a
separate "!" token is critically important, and I cannot imagine that
it would be, you're probably going to want to write your own custom
filter. That would be very much an expert option.
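
For reference, here is a rough sketch of what the field type described
above might look like in the schema. Treat it as a starting point, not
a tested config: the field type name "text_keep_hyphen" and the file
name "wdgf-hyphen-types.txt" are made up for illustration, and the file
is assumed to live in the conf directory and contain the single
"- => ALPHA" line. I have also added a lowercase filter, since your
desired output is lowercased, and a flatten graph filter on the index
side, which is the usual recommendation when a graph filter is used at
index time.

```xml
<!-- Hypothetical field type. The name and the types file name are
     placeholders for illustration, not something Solr ships with. -->
<fieldType name="text_keep_hyphen" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <!-- Split on whitespace only, so "abc-edg" arrives as one token -->
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <!-- The types file remaps the hyphen to ALPHA so it is not a delimiter -->
    <filter class="solr.WordDelimiterGraphFilterFactory"
            types="wdgf-hyphen-types.txt"/>
    <!-- Assumption: lowercase output is wanted ("This" becomes "this") -->
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- Graph filters should be flattened before indexing -->
    <filter class="solr.FlattenGraphFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.WordDelimiterGraphFilterFactory"
            types="wdgf-hyphen-types.txt"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```

All other word delimiter options are left at their defaults here, so I
believe the filter will still split on case changes and on
letter/digit transitions. The Analysis screen in the admin UI is the
quickest way to check what tokens your sample sentence actually
produces before you reindex.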
Thanks,
Shawn