Re: WordDelimiterFilter Leading & Trailing Special Character

Sathiya N Sundararajan Tue, 21 Jul 2015 16:46:08 -0700

Upayavira,

Thanks for the helpful suggestion; that works. However, I was looking for an
option to turn off or circumvent that particular WordDelimiterFilter behavior
completely. Since our indexes are hundreds of terabytes, reloading all the
cores every time we find a term that needs to be added will be a cumbersome
process.
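
For the archive, a minimal sketch of how the protected-words suggestion could
be applied to the field type quoted below. The file name protwords.txt and its
contents are assumptions for illustration, not the exact setup in use, and a
single analyzer is shown because the index and query analyzers were identical.

```xml
<!-- Sketch only: the quoted field type with a protected-words file added.
     "protwords.txt" is an assumed file name; it would sit in the core's
     conf directory and list one term per line, for example:
         Yahoo!
         yahoo!
-->
<fieldType name="text_wdf" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.WordDelimiterFilterFactory"
            protected="protwords.txt"
            splitOnCaseChange="0" generateWordParts="0" generateNumberParts="0"
            catenateWords="0" catenateNumbers="0" catenateAll="0"
            preserveOriginal="1"
            types="specialchartypes.txt"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```

Both case variants are listed because, as far as I can tell, the
protected-words lookup is case-sensitive and here it runs before
LowerCaseFilterFactory. Every new term still means editing this file and
reloading the cores, which is the cost described above.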



thanks

On Tue, Jul 21, 2015 at 12:57 AM, Upayavira <[email protected]> wrote:

> The javadoc for WordDelimiterFilterFactory suggests this config:
>
>  <fieldType name="text_wd" class="solr.TextField" positionIncrementGap="100">
>    <analyzer>
>      <tokenizer class="solr.WhitespaceTokenizerFactory"/>
>      <filter class="solr.WordDelimiterFilterFactory"
>              protected="protectedword.txt"
>              preserveOriginal="0" splitOnNumerics="1"
>              splitOnCaseChange="1"
>              catenateWords="0" catenateNumbers="0" catenateAll="0"
>              generateWordParts="1" generateNumberParts="1"
>              stemEnglishPossessive="1"
>              types="wdfftypes.txt" />
>    </analyzer>
>  </fieldType>
>
> Note the protected="xxxxx" attribute. I suspect if you put Yahoo! into a
> file referenced by that attribute, it may survive analysis. I'd be
> curious to hear whether it works.
>
> Upayavira
>
> On Tue, Jul 21, 2015, at 12:51 AM, Sathiya N Sundararajan wrote:
> > Question about WordDelimiterFilter. The search behavior that we get with
> > WordDelimiterFilter works well for us, except when there is a special
> > character at either the leading or the trailing end of the term.
> >
> > For instance:
> >
> > *‘d&b’*  ->  Works as expected. Finds all docs with ‘d&b’.
> > *‘p!nk’*  ->  Works fine as above.
> >
> > But when there is a special character at the trailing end of the term,
> > as in ‘Yahoo!’:
> >
> > *‘yahoo!’* -> Turns out to be a search for just *‘yahoo’*, with the
> > special character *‘!’* stripped out. This WordDelimiterFilter behavior is
> > documented at:
> >
> > http://lucene.apache.org/core/4_6_0/analyzers-common/index.html?org/apache/lucene/analysis/miscellaneous/WordDelimiterFilter.html
> >
> > What I would like is for the search to be performed without stripping out
> > the leading and trailing special characters. Is there a way to achieve this
> > behavior with WordDelimiterFilter?
> >
> > This is the current config that we have for the field:
> >
> > <fieldType name="text_wdf" class="solr.TextField" positionIncrementGap="100">
> >     <analyzer type="index">
> >         <tokenizer class="solr.WhitespaceTokenizerFactory" />
> >         <filter class="solr.WordDelimiterFilterFactory"
> >                 splitOnCaseChange="0" generateWordParts="0" generateNumberParts="0"
> >                 catenateWords="0" catenateNumbers="0" catenateAll="0"
> >                 preserveOriginal="1"
> >                 types="specialchartypes.txt"/>
> >         <filter class="solr.LowerCaseFilterFactory" />
> >     </analyzer>
> >     <analyzer type="query">
> >         <tokenizer class="solr.WhitespaceTokenizerFactory" />
> >         <filter class="solr.WordDelimiterFilterFactory"
> >                 splitOnCaseChange="0" generateWordParts="0" generateNumberParts="0"
> >                 catenateWords="0" catenateNumbers="0" catenateAll="0"
> >                 preserveOriginal="1"
> >                 types="specialchartypes.txt"/>
> >         <filter class="solr.LowerCaseFilterFactory" />
> >     </analyzer>
> > </fieldType>
> >
> >
> > thanks
>
