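It is expected that the StandardTokenizer will not break on underscores. The StandardTokenizer follows the Unicode UAX #29 word boundary rules <https://unicode.org/reports/tr29/#Word_Boundaries>, which give the underscore the Word_Break property ExtendNumLet, and rule WB13a <https://unicode.org/reports/tr29/#WB13a> (together with WB13b) says not to break between a letter or digit and an ExtendNumLet character. This is why xiefengchang was asking about PatternReplaceFilterFactory after the StandardTokenizer. Note that a token filter only rewrites the text of each token and does not create new tokens, so if your use case is to actually split on underscores, you would either replace them before tokenization with a PatternReplaceCharFilterFactory or add a WordDelimiterGraphFilterFactory after the tokenizer.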
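To make the difference concrete, here is a rough Python sketch. It is not Lucene code: the regex `\w+` is only an ASCII-level approximation of the UAX #29 behavior (letters, digits and underscore stay together), and the function names are hypothetical, chosen just for this illustration.

```python
import re

# Simplified stand-in for UAX #29 word segmentation: letters, digits and
# underscore stay together (\w), anything else is treated as a boundary.
# Assumption: this only approximates StandardTokenizer for plain ASCII text.
WORD = re.compile(r"\w+")


def tokenize(text):
    """Split text into tokens, keeping underscores inside a token."""
    return WORD.findall(text)


def tokenize_splitting_underscores(text):
    """Emulate a char filter: rewrite '_' to ' ' BEFORE tokenizing."""
    return tokenize(text.replace("_", " "))


def replace_after_tokenizing(text):
    """Emulate a token filter: rewrite each token AFTER tokenizing.

    The token count does not change, so nothing is split.
    """
    return [token.replace("_", " ") for token in tokenize(text)]


if __name__ == "__main__":
    print(tokenize("hello-world"))                        # ['hello', 'world']
    print(tokenize("hello_world"))                        # ['hello_world']
    print(tokenize_splitting_underscores("hello_world"))  # ['hello', 'world']
    print(replace_after_tokenizing("hello_world"))        # ['hello world']
```

The last line is the important one: replacing the underscore after tokenization still leaves a single token, which is why the replacement has to happen before the tokenizer (or a filter that is designed to split tokens has to be used) if separate terms are wanted.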
On Sat, Jan 9, 2021 at 2:58 PM Rahul Goswami <rahul196...@gmail.com> wrote:

> Nope. The underscore is preserved right after tokenization even before it
> reaches any filters. You can choose the type "text_general" and try an
> index time analysis through the "Analysis" page on Solr Admin UI.
>
> Thanks,
> Rahul
>
> On Sat, Jan 9, 2021 at 8:22 AM xiefengchang <fengchang_fi...@163.com> wrote:
>
> > did you configured PatternReplaceFilterFactory?
> >
> > At 2021-01-08 12:16:06, "Rahul Goswami" <rahul196...@gmail.com> wrote:
> >
> > >Hello,
> > >So recently I was debugging a problem on Solr 7.7.2 where the query wasn't
> > >returning the desired results. Turned out that the indexed terms had
> > >underscore separated terms, but the query didn't. I was under the
> > >impression that terms separated by underscore are also tokenized by
> > >StandardTokenizerFactory, but turns out that's not the case. Eg:
> > >'hello-world' would be tokenized into 'hello' and 'world', but
> > >'hello_world' is treated as a single token.
> > >Is this a bug or a designed behavior?
> > >
> > >If this is by design, it would be helpful if this behavior is included in
> > >the documentation since it is similar to the behavior with periods.
> > >
> > >https://lucene.apache.org/solr/guide/6_6/tokenizers.html#Tokenizers-StandardTokenizer
> > >"Periods (dots) that are not followed by whitespace are kept as part of the
> > >token, including Internet domain names. "
> > >
> > >Thanks,
> > >Rahul

--
Adam Walz