StandardTokenizerFactory doesn't split on underscore

Rahul Goswami Thu, 07 Jan 2021 20:16:35 -0800

Hello,
I was recently debugging a problem on Solr 7.7.2 where a query wasn't
returning the desired results. It turned out that the indexed terms were
underscore-separated, but the query terms weren't. I was under the
impression that StandardTokenizerFactory also splits on underscores, but
that turns out not to be the case. E.g.,
'hello-world' is tokenized into 'hello' and 'world', but
'hello_world' is treated as a single token.
Is this a bug or intended behavior?
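
To make the difference concrete, here is a rough Python sketch that mimics
the behavior described above. It is not Lucene's implementation
(StandardTokenizer follows the Unicode UAX #29 word-break rules); it only
relies on the fact that a regex \w class, like those rules, treats the
underscore as part of a word. The helper name approx_tokens is hypothetical.

```python
import re

# Rough stand-in for the observed tokenization, NOT Lucene's code.
# \w matches letters, digits, and '_', so '_' stays inside a token
# while '-' ends one.
WORD = re.compile(r"\w+")


def approx_tokens(text):
    """Return the word-like runs in text (illustrative helper only)."""
    return WORD.findall(text)


print(approx_tokens("hello-world"))  # ['hello', 'world']
print(approx_tokens("hello_world"))  # ['hello_world']
```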


If this is by design, it would be helpful to mention it in the
documentation, since it is similar to the documented behavior for periods:

https://lucene.apache.org/solr/guide/6_6/tokenizers.html#Tokenizers-StandardTokenizer
"Periods (dots) that are not followed by whitespace are kept as part of the
token, including Internet domain names. "

Thanks,
Rahul
