It is expected that the StandardTokenizer does not break on underscores.
The StandardTokenizer follows the word-boundary rules of Unicode UAX #29
<https://unicode.org/reports/tr29/#Word_Boundaries>, which classify the
underscore as ExtendNumLet (an "extender"). Rule WB13a
<https://unicode.org/reports/tr29/#WB13a>, together with WB13b, says not to
break before or after an extender that sits between letters or digits.
This is why xiefengchang suggested using a PatternReplaceFilterFactory
after the StandardTokenizer to further split on underscores, if that is
your use case.
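
To illustrate the difference, here is a minimal Python sketch. It is not
Lucene or Solr code: it assumes the regex `\w+` as a rough stand-in for the
tokenizer (since `\w` also treats the underscore as a word character), and
the function names are hypothetical.

```python
import re


def standard_like_tokens(text):
    # Rough stand-in for the tokenizer: \w+ keeps underscores inside a
    # token but breaks on hyphens and whitespace.
    return re.findall(r"\w+", text)


def split_on_underscores(tokens):
    # Post-processing step: split each token on underscores and drop
    # any empty pieces left by leading, trailing, or doubled underscores.
    result = []
    for token in tokens:
        result.extend(part for part in token.split("_") if part)
    return result


tokens = standard_like_tokens("hello-world hello_world")
print(tokens)
print(split_on_underscores(tokens))
```

The first list still contains 'hello_world' as one token; only the second
step separates it. In Solr that second step would have to be a component
in the field type's analysis chain.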

On Sat, Jan 9, 2021 at 2:58 PM Rahul Goswami <rahul196...@gmail.com> wrote:

> Nope. The underscore is preserved right after tokenization even before it
> reaches any filters. You can choose the type "text_general" and try an
> index time analysis through the "Analysis" page on Solr Admin UI.
>
> Thanks,
> Rahul
>
> On Sat, Jan 9, 2021 at 8:22 AM xiefengchang <fengchang_fi...@163.com>
> wrote:
>
> > Did you configure PatternReplaceFilterFactory?
> >
> > At 2021-01-08 12:16:06, "Rahul Goswami" <rahul196...@gmail.com> wrote:
> > >Hello,
> > >So recently I was debugging a problem on Solr 7.7.2 where the query wasn't
> > >returning the desired results. It turned out that the indexed terms were
> > >underscore-separated, but the query terms weren't. I was under the
> > >impression that terms separated by underscores are also split by
> > >StandardTokenizerFactory, but it turns out that's not the case. E.g.,
> > >'hello-world' would be tokenized into 'hello' and 'world', but
> > >'hello_world' is treated as a single token.
> > >Is this a bug or designed behavior?
> > >
> > >If this is by design, it would be helpful if this behavior were included in
> > >the documentation, since it is similar to the behavior with periods.
> > >
> > >https://lucene.apache.org/solr/guide/6_6/tokenizers.html#Tokenizers-StandardTokenizer
> > >"Periods (dots) that are not followed by whitespace are kept as part of the
> > >token, including Internet domain names."
> > >
> > >Thanks,
> > >Rahul
> >
>


-- 
Adam Walz
