Re: Indexing word with plus sign

Erick Erickson Tue, 23 May 2017 10:42:34 -0700

You need to distinguish between

PatternReplaceCharFilterFactory


and

PatternReplaceFilterFactory

The first one is applied to the entire input _before_ tokenization.
The second is applied _after_ tokenization to individual tokens, and by
that time it's too late: "i+d" has already been split into "i" and "d".

It's an easy thing to miss.
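
A minimal sketch of what a working index-time chain might look like, with
the replacement moved into a charFilter ahead of the tokenizer. The
fieldType name and the single-token replacement are assumptions made up for
illustration, and the (?i) flag is there because LowerCaseFilter only runs
later. The other factories and file names are the ones from the config
quoted below. Untested:

```xml
<!-- Sketch only: "text_es_plus" and "investigacionydesarrollo" are
     hypothetical names chosen for this example. -->
<fieldType name="text_es_plus" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <!-- charFilters run on the raw text, before the tokenizer sees it -->
    <charFilter class="solr.MappingCharFilterFactory" mapping="mapping-FoldToASCII.txt"/>
    <charFilter class="solr.PatternReplaceCharFilterFactory"
                pattern="(?i)\bi\+d\b"
                replacement="investigacionydesarrollo"/>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <!-- token filters run afterwards, one token at a time -->
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
  </analyzer>
</fieldType>
```

A single replacement token is used here rather than a phrase so that the
tokenizer cannot split it again.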

And at query time you'll have to be careful to keep the + sign from
being interpreted as an operator.
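
For the query side, one possible arrangement is to mirror the charFilter in
the query analyzer and make sure the plus sign reaches it as a literal
character. The pattern and replacement token are made up for illustration,
and the comments assume the standard lucene query parser. Untested:

```xml
<!-- Sketch only: the pattern and the replacement token are hypothetical
     and must match whatever the index-time analyzer uses. -->
<analyzer type="query">
  <charFilter class="solr.MappingCharFilterFactory" mapping="mapping-FoldToASCII.txt"/>
  <charFilter class="solr.PatternReplaceCharFilterFactory"
              pattern="(?i)\bi\+d\b"
              replacement="investigacionydesarrollo"/>
  <tokenizer class="solr.StandardTokenizerFactory"/>
  <filter class="solr.LowerCaseFilterFactory"/>
  <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
</analyzer>
<!--
  Ways to send the term so the plus sign stays literal:
    q=i\+d       backslash-escape the plus sign
    q="i+d"      or quote the term
  In a URL a bare + is decoded as a space, so encode it as %2B:
    q=%22i%2Bd%22    (the URL-encoded form of q="i+d")
-->
```

A leading + used as the "required" operator (as in +foo) is handled by the
query parser before analysis, so a charFilter like this one should not
interfere with it.
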
Best,
Erick

On Tue, May 23, 2017 at 10:12 AM, Fundera Developer
<funderadevelo...@outlook.com> wrote:
> I have also tried this option, using a PatternReplaceFilterFactory like this:
>
> <filter class="solr.PatternReplaceFilterFactory" pattern="i\+d" replacement="investigación y desarrollo"/>
>
> but it gets processed AFTER the Tokenizer, so by the time it executes there
> is no longer an "i+d" token, just two independent tokens, "i" and "d".
>
> Is there a way I could make the filter execute before the Tokenizer? I have
> tried to place it first in the Analyzer definition like this:
>
>      <analyzer type="index">
>        <charFilter class="solr.MappingCharFilterFactory" mapping="mapping-FoldToASCII.txt"/>
>        <filter class="solr.PatternReplaceFilterFactory" pattern="i\+d" replacement="investigación y desarrollo"/>
>        <tokenizer class="solr.StandardTokenizerFactory"/>
>        <filter class="solr.LowerCaseFilterFactory"/>
>        <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" />
>      </analyzer>
>
> But I had no luck.
>
> Are there any other approaches I could be missing?
>
> Thanks!
>
>
> On 22/05/17 at 20:50, Rick Leir wrote:
>
> Fundera,
> You need a regex which matches a '+' with non-blank chars before and after.
> It should not replace a '+' preceded by white space, because that one is
> significant in Solr. This is not a perfect solution, but might improve
> matters for you.
> Cheers -- Rick
>
> On May 22, 2017 1:58:21 PM EDT, Fundera Developer
> <funderadevelo...@outlook.com> wrote:
>
>
> Thank you Zahid and Erick,
>
> I was going to try the CharFilter suggestion, but then I had second
> thoughts. I can see how the indexing process would handle an occurrence
> of 'i+d', but what happens at query time? If I use the same filter, I
> could remove '+' chars that the user adds to mark compulsory tokens in
> the search results, couldn't I? However, if I do not use the CharFilter,
> I would not be able to match the 'i+d' search tokens...
>
> Thanks all!
>
>
>
> On 22/05/17 at 16:39, Erick Erickson wrote:
>
> You can also use any of the other tokenizers. WhitespaceTokenizer for
> instance. There are a couple that use regular expressions. Etc. See:
> https://cwiki.apache.org/confluence/display/solr/Tokenizers
>
> Each one has its own considerations. WhitespaceTokenizer won't, for
> instance, separate out punctuation, so you might then have to use a
> filter to remove it. Regexes can be tricky to get right ;). Etc....
>
> Best,
> Erick
>
> On Mon, May 22, 2017 at 5:26 AM, Muhammad Zahid Iqbal
> <zahid.iq...@northbaysolutions.net> wrote:
>
>
> Hi,
>
>
> Before applying the tokenizer, you can replace your special symbols with
> some phrase to preserve them, and after tokenizing you can replace it back.
>
> For example:
> <charFilter class="solr.PatternReplaceCharFilterFactory" pattern="(\+)" replacement="xxx" />
>
>
> Thanks,
> Zahid iqbal
>
> On Mon, May 22, 2017 at 12:57 AM, Fundera Developer
> <funderadevelo...@outlook.com> wrote:
>
>
>
> Hi all,
>
> I am a bit stuck on a problem that I feel must be easy to solve. In
> Spanish it is usual to find the term 'i+d'. We are working with Solr 5.5,
> and StandardTokenizer splits it into 'i' and 'd'. Because our index holds
> documents in both Spanish and Catalan, and 'i' is a frequent word in
> Catalan, a user who searches for 'i+d' sometimes gets Catalan documents
> as results.
>
> I have tried to use the SynonymFilter, with something like:
>
> i+d => investigacionYdesarrollo
>
> But it does not seem to change anything.
>
> Is there a way I could set an exception to the Tokenizer so that it does
> not split this word?
>
> Thanks in advance!
>
>
>
>
>
