Re: About solr.HyphenatedWordsFilter

Erick Erickson Wed, 26 Aug 2020 05:02:12 -0700

Another option is to suggest from a copyField with a very simple analysis
chain. Say:


PatternReplaceCharFilterFactory - to remove everything you don’t want to keep.
WhitespaceTokenizerFactory
LowerCaseFilterFactory - maybe
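
To make the idea concrete, here is a rough Python simulation of that three-step chain. It is not Solr code, only a sketch of the behavior I would expect. The character class (keep ASCII letters, digits, hyphens and whitespace) and the name analyze are assumptions for illustration; your real pattern depends on what your product IDs look like.

```python
import re

# Hypothetical pattern: drop everything except ASCII letters, digits,
# hyphens and whitespace. Adjust it to whatever you need to keep.
UNWANTED = re.compile(r"[^0-9A-Za-z\s-]")

def analyze(text):
    """Mimic: pattern-replace char filter -> whitespace tokenizer -> lowercase."""
    cleaned = UNWANTED.sub("", text)        # char filter step
    tokens = cleaned.split()                # whitespace tokenizer step
    return [t.lower() for t in tokens]      # lowercase filter step

print(analyze("This is a new abc-edg and xyz-abc is coming soon!"))
```

With this pattern the hyphenated IDs survive as single tokens and the trailing "!" is dropped before tokenization.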

I think you missed Shawn’s point about the exclamation point. If you used
only, say, WhitespaceTokenizerFactory, then your original example would
produce the token “soon!” (including the exclamation point).
WordDelimiterGraphFilterFactory and a number of the usual tokenizers remove
punctuation automatically, so when you build your own chain, you have to be
sure to test edge cases like that.
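
A quick way to see the problem outside Solr is to split on whitespace only and run a naive prefix match over the result. This is a sketch in Python, not the suggester itself; whitespace_tokens and suggest are made-up helper names.

```python
def whitespace_tokens(text):
    # Split on whitespace only; punctuation stays attached to the words.
    return text.split()

def suggest(prefix, tokens):
    # Naive case-insensitive prefix match, keeping first-seen order.
    seen = []
    for t in tokens:
        if t.lower().startswith(prefix.lower()) and t not in seen:
            seen.append(t)
    return seen

tokens = whitespace_tokens("This is a new abc-edg and xyz-abc is coming soon!")
print(tokens[-1])             # the last token keeps its punctuation
print(suggest("soo", tokens))
print(suggest("abc", tokens))
```

The hyphenated ID comes back intact for "abc", but a user typing "soo" would be offered "soon!" with the exclamation point attached.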

You have to think very carefully about all kinds of input. What about
contractions, “don’t” for instance? Do your part numbers contain anything
except hyphens that might be confusing? Do you want to conditionally remove
hyphens if your input is, say, a hyphenated last name (smith-jones)?
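
One cheap way to explore these questions is to push a handful of awkward inputs through a stand-in for the chain and look at what comes out. The sketch below is Python, not Solr, and it assumes a pattern that keeps only ASCII letters, digits, hyphens and whitespace. The inputs and the name strip_unwanted are hypothetical.

```python
import re

# Assumed pattern: keep ASCII letters, digits, hyphens and whitespace only.
KEEP_HYPHEN = re.compile(r"[^0-9A-Za-z\s-]")

def strip_unwanted(text):
    # Remove unwanted characters, lowercase, then split on whitespace.
    return KEEP_HYPHEN.sub("", text).lower().split()

cases = ["don’t", "smith-jones", "abc-efg", "well - maybe", "A/B-123"]
for case in cases:
    print(repr(case), "->", strip_unwanted(case))
```

With this particular pattern the contraction collapses to "dont", the hyphenated name is kept whole, a free-standing hyphen becomes a token of its own, and "A/B-123" turns into "ab-123". Whether any of that is acceptable depends on your data.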

Best,
Erick


> On Aug 26, 2020, at 6:27 AM, Kayak28 <kaya.ota....@gmail.com> wrote:
> 
> Hello, Shawn
> 
> Thank you for your response.
> 
> Yes. I am sure that I need to preserve "-" in the words.
> What I want to do is not actually search; it is suggestion.
> "abc-efg" is a dummy sample of our product ID.
> There are several product IDs, such as abc-efg, abc-hij, abc-klm, and so
> on.
> When a user types "abc", I would like to suggest the above candidates, with
> hyphens.
> But I would also like to do the usual suggestions, for example:
> when a user types "com", I would like to suggest "coming" as well.
> (Probably my first example sentence is not good ...)
> 
> I apologize for confusing you, but "!" is not important at all.
> 
> I will consider WordDelimiterGraphFilter.
> 
> Again, thank you for your response.
> 
> Sincerely,
> Kaya Ota
> 
> 
> On Wed, Aug 26, 2020 at 15:57, Shawn Heisey <apa...@elyograg.org> wrote:
> 
>> On 8/26/2020 12:05 AM, Kayak28 wrote:
>>> I would like to tokenize the following sentence so that the tokens
>>> keep their hyphens. For example:
>>>   original text: This is a new abc-edg and xyz-abc is coming soon!
>>>   desired output tokens:
>>>   this/is/a/new/abc-edg/and/xyz-abc/is/coming/soon/!
>>> Is there any way to avoid dropping the hyphens from tokens? I thought
>>> HyphenatedWordsFilter had similar functionality, but it gets rid of
>>> the hyphens.
>> 
>> I doubt that filter is what you need.  It is fully described in Javadocs:
>> 
>> 
>> https://lucene.apache.org/core/8_6_0/analyzers-common/org/apache/lucene/analysis/miscellaneous/HyphenatedWordsFilter.html
>> 
>> Your requirement is a little odd.  Are you SURE that you want to
>> preserve hyphens like that?
>> 
>> I think that you could probably achieve it with a carefully configured
>> WordDelimiterGraphFilter.  This filter can be highly customized with its
>> "types" parameter.  This parameter refers to a file in the conf
>> directory that can change how the filter recognizes certain characters.
>> I think that if you used the whitespace tokenizer along with the word
>> delimiter filter, and put the following line into the file referenced by
>> the "types" parameter, it would do most of what you're after:
>> 
>> - => ALPHA
>> 
>> What that config would do is cause the word delimiter filter to treat
>> the hyphen as an alpha character -- so it will not use it as a
>> delimiter.  One thing about the way it works -- the exclamation point at
>> the end of your sentence would NOT be emitted as a token as you have
>> described.  If that is critically important, and I cannot imagine that
>> it would be, you're probably going to want to write your own custom
>> filter.  That would be very much an expert option.
>> 
>> Thanks,
>> Shawn
>> 
>> 
> 
> -- 
> 
> Sincerely,
> Kaya
> github: https://github.com/28kayak
