Hi Robert,

The StandardTokenizer implements the word boundary rules from UAX#29 <http://unicode.org/reports/tr29/#Word_Boundaries>, discarding anything between boundaries that is exclusively non-alphanumeric (e.g. punctuation). In particular, under UAX#29 a comma that sits between two digits is not a word boundary (the MidNum rules), which is why `B-6,000009XP12133407` comes through as a single token while `B-A,000006KB09029932` is split at the comma.
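If you want to watch the tokenizer directly, a minimal sketch along these lines prints every token it emits (this isn't from Solr itself; the class name and the excerpted input string are just for illustration, and the constructor shown is the current Lucene API, whereas the 3.6 constructor takes a Version and a Reader):

import java.io.StringReader;
import org.apache.lucene.analysis.standard.StandardTokenizer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public class StandardTokenizerDemo {
    public static void main(String[] args) throws Exception {
        // Excerpt of the indexed content from the question below.
        String input = "B-A,000006KB09029932,PASS,B-6,000009XP12133407";

        StandardTokenizer tokenizer = new StandardTokenizer();
        tokenizer.setReader(new StringReader(input));
        CharTermAttribute term = tokenizer.addAttribute(CharTermAttribute.class);

        tokenizer.reset();
        while (tokenizer.incrementToken()) {
            // Print each token the UAX#29 word-boundary rules produce.
            System.out.println("|" + term + "|");
        }
        tokenizer.end();
        tokenizer.close();
    }
}

The Analysis screen in the Solr admin UI shows the same token stream for a configured field type without writing any code, so that's usually the quicker way to check.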
--
Steve
www.lucidworks.com

> On May 24, 2017, at 3:05 PM, Robert Hume <rhum...@gmail.com> wrote:
>
> I have a Solr 3.6 deployment I inherited.
>
> The schema.xml specifies the use of StandardTokenizerFactory like so ...
>
>   <fieldType name="text_general" class="solr.TextField" positionIncrementGap="100">
>     ...
>     <tokenizer class="solr.StandardTokenizerFactory"/>
>     ...
>
> According to this reference guide (https://home.apache.org/~ctargett/RefGuidePOC/jekyll/Tokenizers.html), the StandardTokenizer will treat punctuation as delimiters.
>
> However, here is my content that gets indexed:
>
> "IOM-1:BA9ATS0FAB,\"Company Name Module\",8.1.0.16.0.2,B-A,000006KB09029932,PASS,,0,0,0,Y:0,0,0,0,0:BA9AUT0FAB,\"Company CM Rear Module\",B-6,000009XP12133407,"
>
> This piece `B-A,000006KB09029932` gets tokenized into two words ... `|B-A|` and `|000006KB09029932|`.
>
> But this piece `B-6,000009XP12133407` gets tokenized into one word ... `|B-6,000009XP12133407|`.
>
> What I've observed is that the comma is not considered a delimiter when it is preceded by a digit ... almost like it considers "6,000" to be currency or something?
>
> QUESTION: Is this a bug in StandardTokenizer, or do I misunderstand how commas are used as delimiters?
>
> Rob