Re: Small Tokenization issue

2018-01-05 Thread Rick Leir
Nawab, look at ClassicTokenizer. It is a good choice if you have part numbers with hyphens. It is the second tokenizer on this page: https://lucene.apache.org/solr/guide/6_6/tokenizers.html Cheers -- Rick On 01/03/2018 04:52 PM, Shawn Heisey wrote: On 1/3/2018 1:56 PM, Nawab Zada Asad Iqbal wrote

Re: Small Tokenization issue

2018-01-03 Thread Shawn Heisey
On 1/3/2018 1:56 PM, Nawab Zada Asad Iqbal wrote: Thanks Emir, Erick. What I want to do is remove empty tokens after WordDelimiterGraphFilter. Is there any option in WordDelimiterGraphFilter to not generate empty tokens? I use LengthFilterFactory with a minimum of 1 and a maximum of 512
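
A minimal sketch of what a length filter with a minimum of 1 and a maximum of 512 does to a token list. This is a plain-Python toy, not Solr code: the function name `length_filter` and the sample tokens are made up for illustration, and the sketch ignores token positions, which a real analysis chain tracks.

```python
def length_filter(tokens, min_len=1, max_len=512):
    """Keep only tokens whose length lies in [min_len, max_len].

    Toy stand-in for a length-based token filter; positions are ignored.
    """
    return [t for t in tokens if min_len <= len(t) <= max_len]


# Hypothetical token list containing one empty token.
sample_tokens = ["P", "", "N", "HSC0424PP"]
kept = length_filter(sample_tokens)
print(kept)
```

With a minimum of 1, the empty string is the only token dropped here; the upper bound only matters for very long tokens.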

Re: Small Tokenization issue

2018-01-03 Thread Erick Erickson
WordDelimiterGraphFilterFactory is a new implementation, so it's also quite possible that the behavior just changed. I just took a look and indeed it did. WordDelimiterFilterFactory (run on "p / n whatever") produces token: p n whatever position: 1 2 3 whereas WordDelimiterGraphFilt

Re: Small Tokenization issue

2018-01-03 Thread Nawab Zada Asad Iqbal
Thanks Emir, Erick. What I want to do is remove empty tokens after WordDelimiterGraphFilter. Is there any option in WordDelimiterGraphFilter to not generate empty tokens? This index field is intended for unusual strings, e.g. part numbers: P/N HSC0424PP The benefit of removing the emp

Re: Small Tokenization issue

2018-01-03 Thread Emir Arnautović
Hi Nawab, The reason you do not get the shingle is that there is an empty token: after the tokenizer you have 3 tokens, ‘abc’, ‘-’ and ‘def’, so the tokens that you are interested in are not next to each other and cannot form a shingle. What you can do is apply a char filter before tokenization to re
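
The point about non-adjacent tokens can be sketched in plain Python. This is a toy model under stated assumptions, not the Lucene implementation: `strip_delimiters` is a hypothetical stand-in for a filter that removes delimiter characters but keeps the token slot, and `bigram_shingles` simply joins neighbouring tokens.

```python
import re


def whitespace_tokenize(text):
    """Split on runs of whitespace, as a whitespace tokenizer would."""
    return text.split()


def strip_delimiters(tokens):
    """Hypothetical filter: drop non-alphanumeric characters from each token
    but keep the token's slot, so '-' becomes an empty token."""
    return [re.sub(r"[^0-9A-Za-z]", "", t) for t in tokens]


def bigram_shingles(tokens):
    """Join each token with its immediate neighbour."""
    return [f"{a} {b}" for a, b in zip(tokens, tokens[1:])]


tokens = strip_delimiters(whitespace_tokenize("abc - def"))
shingles = bigram_shingles(tokens)
print(tokens)
print(shingles)
```

Because the empty token sits between ‘abc’ and ‘def’, neither bigram in this model is ‘abc def’.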

Re: Small Tokenization issue

2018-01-03 Thread Erick Erickson
If it's regular, you could try using a PatternReplaceCharFilterFactory (PRCFF), which gets applied to the input before tokenization (note, this is NOT PatternReplaceFilterFactory, which gets applied after tokenization). I don't really see how you could make this work though. WhitespaceTokenizer wil
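
To illustrate why the order matters, here is a rough plain-Python sketch of the same regex applied before versus after whitespace tokenization. It is not Solr code: the pattern `\s+-\s+` and both function names are assumptions chosen for the ‘abc - def’ example from this thread.

```python
import re

# Hypothetical pattern: a hyphen with whitespace on both sides.
PATTERN = re.compile(r"\s+-\s+")


def char_filter_then_tokenize(text):
    """Apply the replacement to the raw input, then split on whitespace
    (the order a char filter runs in)."""
    return PATTERN.sub(" ", text).split()


def tokenize_then_token_filter(text):
    """Split on whitespace first, then apply the replacement to each token
    (the order a token filter runs in)."""
    return [PATTERN.sub(" ", tok) for tok in text.split()]


before = char_filter_then_tokenize("abc - def")
after = tokenize_then_token_filter("abc - def")
print(before)
print(after)
```

Applied to the raw input, the pattern sees the surrounding spaces and removes the hyphen. Applied per token, the whitespace is already gone, so the pattern never matches and the ‘-’ token survives.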

Small Tokenization issue

2018-01-03 Thread Nawab Zada Asad Iqbal
Hi, I have a string for indexing: abc - def (notice the space on either side of the hyphen), which is being processed with this filter-list:- I get two shingle tokens at the e