Hi Robert,

Two possibilities come to mind:

1. Use a char filter factory (runs before the tokenizer) to convert commas between digits to spaces, e.g. PatternReplaceCharFilterFactory <https://cwiki.apache.org/confluence/display/solr/CharFilterFactories#CharFilterFactories-solr.PatternReplaceCharFilterFactory>.

2. Use WordDelimiterFilterFactory <https://cwiki.apache.org/confluence/display/solr/Filter+Descriptions#FilterDescriptions-WordDelimiterFilter>.

--
Steve
www.lucidworks.com

> On May 24, 2017, at 4:19 PM, Robert Hume <rhum...@gmail.com> wrote:
>
> Hi,
>
> Following up on my last email question ... I've learned more and I
> simplified my question ...
>
> I have a Solr 3.6 deployment. Currently I'm using
> solr.StandardTokenizerFactory to parse tokens during indexing.
>
> Here are two example streams that demonstrate my issue:
>
> Example 1: `bob,a-z,000123,xyz` produces tokens ... `|bob|a-z|000123|xyz|`
> ... which is good.
>
> Example 2: `bob,a-6,000123,xyz` produces tokens ... `|bob|a-6,000123|xyz|`
> ... which is not good, because users can't search by "000123".
>
> It seems StandardTokenizerFactory treats the "6,000" differently (like it's
> currency or a product number, maybe?) so it doesn't tokenize at the comma.
>
> QUESTION: How can I enhance StandardTokenizer to do everything it's doing
> now plus produce a couple of additional tokens like this ...
>
> `bob,a-6,000123,xyz` produces tokens ... `|bob|a-6,000123|xyz|a-6|000123|`
>
> ... so users can search by "000123"?
>
> Thanks!
> Rob
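P.S. A rough, untested sketch of what the two options could look like in schema.xml (Solr 3.6-era syntax; the field type name "text_split" and the exact attribute values are placeholders you'd tune for your data):

```xml
<fieldType name="text_split" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <!-- Option 1: replace any comma sandwiched between digits with a space
         BEFORE tokenizing, so StandardTokenizer no longer sees "6,000123"
         as a single numeric token. -->
    <charFilter class="solr.PatternReplaceCharFilterFactory"
                pattern="(\d),(\d)" replacement="$1 $2"/>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <!-- Option 2 (alternative or in addition): split tokens on intra-token
         punctuation AFTER tokenizing, while keeping the original token.
         preserveOriginal="1" is what gives you "a-6,000123" alongside the
         subtokens, so existing searches keep working and "000123" becomes
         searchable too. -->
    <filter class="solr.WordDelimiterFilterFactory"
            generateWordParts="1"
            generateNumberParts="1"
            preserveOriginal="1"/>
  </analyzer>
</fieldType>
```

Note that WordDelimiterFilter will also split on the hyphen, so "a-6,000123" would yield subtokens like "a", "6", and "000123" in addition to the preserved original; check the output in the analysis page of the admin UI before reindexing.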