Re: StandardTokenizerFactory doesn't split on underscore

Rahul Goswami Sat, 09 Jan 2021 13:58:25 -0800

Nope. The underscore is already preserved in the tokenizer's output, before
the text reaches any filters. You can check this by choosing the field type
"text_general" and running an index-time analysis on the "Analysis" page of
the Solr Admin UI.
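
For anyone who wants to see the difference outside Solr, here is a rough
sketch. StandardTokenizer follows the Unicode word-break rules (UAX #29),
where the underscore is a connector that joins adjacent letters and digits,
while the hyphen is a break point. The Python below is NOT Lucene's code: it
assumes that a regex over word characters (\w, which also includes the
underscore) is a close enough stand-in for these two inputs. The function
names are hypothetical, and other cases (periods, apostrophes, and so on)
will differ from the real tokenizer.

```python
import re


def rough_tokens(text):
    """Approximate word tokenization: runs of letters, digits and underscores.

    This is only an illustration; it is not the UAX #29 algorithm that
    StandardTokenizer implements.
    """
    return re.findall(r"\w+", text)


def rough_tokens_split_underscore(text):
    """Hypothetical workaround: turn underscores into spaces before tokenizing,
    similar in spirit to rewriting the input ahead of the tokenizer."""
    return rough_tokens(text.replace("_", " "))


print(rough_tokens("hello-world"))                   # hyphen is a break point
print(rough_tokens("hello_world"))                   # underscore joins the parts
print(rough_tokens_split_underscore("hello_world"))  # underscore replaced first
```

If splitting on underscores is what you want, the replacement has to happen
before tokenization (or in a later token filter that splits tokens), since
the tokenizer itself keeps 'hello_world' whole.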


Thanks,
Rahul

On Sat, Jan 9, 2021 at 8:22 AM xiefengchang <fengchang_fi...@163.com> wrote:

> Did you configure PatternReplaceFilterFactory?
>
> At 2021-01-08 12:16:06, "Rahul Goswami" <rahul196...@gmail.com> wrote:
> >Hello,
> >I was recently debugging a problem on Solr 7.7.2 where a query wasn't
> >returning the desired results. It turned out that the indexed terms were
> >underscore-separated, but the query terms were not. I was under the
> >impression that terms separated by an underscore are also split by
> >StandardTokenizerFactory, but it turns out that's not the case. E.g.,
> >'hello-world' is tokenized into 'hello' and 'world', but
> >'hello_world' is treated as a single token.
> >Is this a bug or intended behavior?
> >
> >If this is by design, it would be helpful to mention this behavior in
> >the documentation, since it is similar to the documented behavior for periods.
> >
> >https://lucene.apache.org/solr/guide/6_6/tokenizers.html#Tokenizers-StandardTokenizer
> >"Periods (dots) that are not followed by whitespace are kept as part of the
> >token, including Internet domain names."
> >
> >Thanks,
> >Rahul
>
