"I thought that the StandardTokenizer always split on punctuation, ..."
Proving that you haven't read my book! The section on the standard tokenizer
details the rules that the tokenizer uses (along with extensive
examples). That's what I mean by "deep dive."
-- Jack Krupansky
-----Original Message-----
From: Shawn Heisey
Sent: Wednesday, August 21, 2013 10:41 PM
To: [email protected]
Subject: Re: How to avoid underscore sign indexing problem?
On 8/21/2013 7:54 PM, Floyd Wu wrote:
When I use StandardAnalyzer to tokenize the string "Pacific_Rim", I get:
ST
text         raw_bytes                           start  end  type        position
pacific_rim  [70 61 63 69 66 69 63 5f 72 69 6d]  0      11   <ALPHANUM>  1
How can I get this string tokenized into the two tokens "Pacific" and
"Rim"?
Should I set _ as a stopword?
Please kindly help with this.
Many thanks.
Interesting. I thought that the StandardTokenizer always split on
punctuation, but apparently that's not the case for the underscore
character.
You can always use the WordDelimiterFilter after the StandardTokenizer.
http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.WordDelimiterFilterFactory
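For reference, here is a minimal sketch of what such an analyzer chain might look like in schema.xml. The field type name text_split_underscore is made up for illustration, and the attribute values are just one plausible combination, so treat this as a starting point rather than a tested configuration:

```xml
<!-- Hypothetical field type name; pick whatever fits your schema. -->
<fieldType name="text_split_underscore" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <!-- StandardTokenizer keeps "Pacific_Rim" as a single token. -->
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <!-- WordDelimiterFilter then splits on intra-word delimiters such as "_". -->
    <filter class="solr.WordDelimiterFilterFactory"
            generateWordParts="1"
            generateNumberParts="1"
            catenateWords="0"
            catenateNumbers="0"
            catenateAll="0"
            splitOnCaseChange="1"
            preserveOriginal="0"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```

With a chain like this, "Pacific_Rim" should end up indexed as the two terms "pacific" and "rim". The Analysis page in the Solr admin UI is a quick way to confirm that against your own schema.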
Thanks,
Shawn