Re: Small Tokenization issue

Nawab Zada Asad Iqbal Wed, 03 Jan 2018 12:56:21 -0800

Thanks Emir, Erick.

What I want to do is remove the empty tokens after WordDelimiterGraphFilter.
Is there an option in WordDelimiterGraphFilter to not generate empty
tokens?


This index field is intended for unusual strings, e.g. part numbers:
P/N HSC0424PP
The benefit of removing the empty tokens is that if someone unintentionally
puts a space around the '/' (in the example above), this field is still able
to match.

In the previous Solr version, ShingleFilter worked fine with empty
positions and made shingles across the empty space. It is possible,
though, that I have learned to rely on a bug.
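
For reference, here is a minimal sketch of where a char filter along the
lines of Emir's suggestion (quoted below) might sit in the index analyzer.
The character class [-/] is an assumption that extends his pattern to cover
the '/' in the part-number example. The field type name is hypothetical, the
remaining filters from the original chain are left out for brevity, and this
has not been tested against a specific Solr version.

    <fieldType name="shingle_sketch" class="solr.TextField"
               positionIncrementGap="100">
      <analyzer type="index">
        <!-- Assumption: collapse a '-' or '/' plus any surrounding
             whitespace into a single space before tokenization, so the
             whitespace tokenizer never emits a punctuation-only token. -->
        <charFilter class="solr.PatternReplaceCharFilterFactory"
                    pattern="\s*[-/]\s*" replacement=" "/>
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.ShingleFilterFactory" maxShingleSize="2"
                outputUnigrams="false"/>
      </analyzer>
    </fieldType>

With a chain like this, "P/N HSC0424PP" and "P / N HSC0424PP" should both
reach the tokenizer as "P N HSC0424PP", so the two spellings would produce
the same shingles. The trade-off is that every hyphen or slash is treated as
a word break, including ones inside a single part number.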






On Wed, Jan 3, 2018 at 12:23 PM, Emir Arnautović <
emir.arnauto...@sematext.com> wrote:

> Hi Nawab,
> You do not get the shingle because there is an empty token in between:
> after the tokenizer you have three tokens, 'abc', '-' and 'def', so the
> tokens that you are interested in are not next to each other and cannot
> form a shingle.
> What you can do is apply a char filter before tokenization to remove the
> '-', something like:
>
> <charFilter class="solr.PatternReplaceCharFilterFactory"
>              pattern="\s*-\s*" replacement=" "/>
>
> Regards,
> Emir
> --
> Monitoring - Log Management - Alerting - Anomaly Detection
> Solr & Elasticsearch Consulting Support Training - http://sematext.com/
>
>
>
> > On 3 Jan 2018, at 21:04, Nawab Zada Asad Iqbal <khi...@gmail.com> wrote:
> >
> > Hi,
> >
> > So, I have a string for indexing:
> >
> > abc - def (notice the space on either side of hyphen)
> >
> > which is being processed with this filter list:
> >
> >
> >    <fieldType name="shingle" class="solr.TextField"
> > positionIncrementGap="100">
> >      <analyzer type="index">
> >        <charFilter
> > class="org.apache.lucene.analysis.icu.ICUNormalizer2CharFilterFactory"
> > name="nfkc" mode="compose"/>
> >        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
> >        <filter class="solr.WordDelimiterGraphFilterFactory"
> > generateWordParts="1" generateNumberParts="1" catenateWords="0"
> > catenateNumbers="0" catenateAll="0" preserveOriginal="0"
> > splitOnCaseChange="1" splitOnNumerics="1" stemEnglishPossessive="0"/>
> >        <filter class="solr.FlattenGraphFilterFactory"/>
> >        <filter class="solr.PatternReplaceFilterFactory"
> > pattern="^(\p{Punct}*)(.*?)(\p{Punct}*)$" replacement="$2"/>
> >        <filter class="solr.LowerCaseFilterFactory"/>
> >        <filter class="solr.ASCIIFoldingFilterFactory"/>
> >        <filter class="solr.ShingleFilterFactory" maxShingleSize="2"
> > outputUnigrams="false" fillerToken=""/>
> >        <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
> >        <filter class="solr.LimitTokenCountFilterFactory"
> > maxTokenCount="10000" consumeAllTokens="false"/>
> >        <filter class="solr.LengthFilterFactory" min="1" max="255"/>
> >      </analyzer>
> >    </fieldType>
> >
> >
> > I get two shingle tokens at the end:
> >
> > "abc" "def"
> >
> > I want to get "abc def" . What can I tweak to get this?
> >
> >
> > Thanks
> > Nawab
>
>
