Hi Robert,

Two possibilities come to mind:

1. Use a char filter factory (runs before the tokenizer) to convert commas 
between digits to spaces, e.g. PatternReplaceCharFilterFactory 
<https://cwiki.apache.org/confluence/display/solr/CharFilterFactories#CharFilterFactories-solr.PatternReplaceCharFilterFactory>.
2. Use WordDelimiterFilterFactory 
<https://cwiki.apache.org/confluence/display/solr/Filter+Descriptions#FilterDescriptions-WordDelimiterFilter>, 
which splits tokens on intra-word punctuation (commas, hyphens, etc.) and 
can keep the original token as well via preserveOriginal.
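
For option 1, a minimal sketch of what the analyzer chain might look like 
in schema.xml (the field type name is illustrative; the regex uses 
lookarounds so only commas between two digits are replaced, and the `<` in 
the lookbehind has to be XML-escaped inside the attribute):

```xml
<fieldType name="text_split_num" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <!-- Replace a comma that sits between two digits with a space,
         so the tokenizer sees "a-6 000123" instead of "a-6,000123" -->
    <charFilter class="solr.PatternReplaceCharFilterFactory"
                pattern="(?&lt;=\d),(?=\d)"
                replacement=" "/>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```

You can check the resulting token stream for sample input in the 
Analysis page of the Solr admin UI before reindexing.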
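
For option 2, an untested sketch (the field type name is illustrative, 
and you'll want to tune the parameters to your data):

```xml
<fieldType name="text_wdf" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <!-- Split tokens on punctuation: "a-6,000123" becomes "a", "6",
         "000123"; preserveOriginal also keeps "a-6,000123" itself -->
    <filter class="solr.WordDelimiterFilterFactory"
            generateWordParts="1"
            generateNumberParts="1"
            preserveOriginal="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```

Note that a filter runs after the tokenizer, so this approach re-splits 
whatever tokens StandardTokenizer produces, rather than changing how 
the stream is tokenized in the first place.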

--
Steve
www.lucidworks.com

> On May 24, 2017, at 4:19 PM, Robert Hume <rhum...@gmail.com> wrote:
> 
> Hi,
> 
> Following up on my last email question ... I've learned more and
> simplified my question ...
> 
> I have a Solr 3.6 deployment.  Currently I'm using
> solr.StandardTokenizerFactory to parse tokens during indexing.
> 
> Here are two example streams that demonstrate my issue:
> 
> Example 1: `bob,a-z,000123,xyz` produces tokens ... `|bob|a-z|000123|xyz|`
> ... which is good.
> 
> Example 2: `bob,a-6,000123,xyz` produces tokens ... `|bob|a-6,000123|xyz|`
> ... which is not good because users can't search by "000123".
> 
> It seems StandardTokenizerFactory treats the "6,000" differently (like it's
> currency or a product number, maybe?) so it doesn't tokenize at the comma.
> 
> QUESTION: How can I enhance StandardTokenizer to do everything it's doing
> now plus produce a couple of additional tokens like this ...
> 
> `bob,a-6,000123,xyz` produces tokens ... `|bob|a-6,000123|xyz|a-6|000123|`
> 
> ... so users can search by "000123"?
> 
> Thanks!
> Rob
