Re: Indexing word with plus sign

Walter Underwood Tue, 23 May 2017 11:47:58 -0700

That was on Solr 1.3, so I’m pretty sure it was the whitespace tokenizer.


The synonym substitution for “+/-” was done in client code and indexing code, 
outside of Solr. We also sanitized queries to remove all query syntax 
characters. 
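
A minimal sketch of that kind of client-side preprocessing, written in Python purely for illustration. The function name `normalize_for_search`, the `SPECIAL_TERMS` table, and the exact list of query syntax characters are assumptions; the only details taken from the description above are the “+/-” to “plusminus” substitution and the removal of query syntax characters. The same function would be applied to text before indexing and to user queries before they reach Solr.

```python
import re

# Hypothetical table of terms that must survive tokenization.
# Only the "+/-" -> "plusminus" entry comes from the thread.
SPECIAL_TERMS = {
    "+/-": "plusminus",
}

# Characters treated as Lucene/Solr query syntax (an assumed list).
QUERY_SYNTAX = re.compile(r'[+\-!(){}\[\]^"~*?:\\/|&]')


def normalize_for_search(text: str) -> str:
    """Substitute special terms, then strip query syntax characters."""
    for term, replacement in SPECIAL_TERMS.items():
        text = text.replace(term, replacement)
    # Replace each syntax character with a space so words stay separated.
    text = QUERY_SYNTAX.sub(" ", text)
    # Collapse the runs of whitespace left behind.
    return " ".join(text.split())
```

The substitution has to run before the sanitizing step, otherwise the "+", "/" and "-" are stripped first and the special term never matches.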

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)


> On May 23, 2017, at 11:21 AM, Fundera Developer 
> <funderadevelo...@outlook.com> wrote:
> 
> Thanks Walter!!
> 
> For the sake of curiosity, do you remember which Tokenizer you were using in 
> that case?
> 
> Thanks!
> 
> 
> On 23/05/17 at 20:02, Walter Underwood wrote:
> 
> Years ago at Netflix, I had to deal with a DVD from a band named “+/-”. I 
> gave up and translated that to “plusminus” at index and query time.
> 
> http://plusmin.us/
> 
> Luckily, “.hack//Sign” and other related dot-hack anime matched if I just 
> deleted all the punctuation. And everyone searched for “[•REC]²” as “rec2”. 
> The middot is supposed to be red. Movie studios are clueless about searchable 
> strings.
> 
> wunder
> Walter Underwood
> wun...@wunderwood.org
> http://observer.wunderwood.org/  (my blog)
> 
> 
> 
> 
> On May 23, 2017, at 10:41 AM, Erick Erickson 
> <erickerick...@gmail.com> wrote:
> 
> You need to distinguish between
> 
> PatternReplaceCharFilterFactory
> 
> and
> 
> PatternReplaceFilterFactory
> 
> The first one is applied to the entire input _before_ tokenization.
> The second is applied _after_ tokenization to individual tokens; by
> that time it's too late, because "i+d" has already been split.
> 
> It's an easy thing to miss.
> 
> And at query time you'll have to be careful to keep the + sign from
> being interpreted as an operator.
> Best,
> Erick
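
The ordering difference described above can be simulated outside Solr. This is only a sketch in Python, not Solr code: `tokenize` is a crude stand-in for a tokenizer that splits on punctuation, and the single-word replacement `investigacionydesarrollo` is an assumption chosen so that the result stays one token.

```python
import re
from typing import List


def tokenize(text: str) -> List[str]:
    """Crude stand-in for a tokenizer that splits on punctuation such as '+'."""
    return re.findall(r"\w+", text)


def replace_after_tokenizing(text: str) -> List[str]:
    """Token-filter order: the pattern only ever sees individual tokens."""
    return [re.sub(r"i\+d", "investigacionydesarrollo", tok)
            for tok in tokenize(text)]


def replace_before_tokenizing(text: str) -> List[str]:
    """Char-filter order: the pattern sees the raw input, '+' included."""
    return tokenize(re.sub(r"i\+d", "investigacionydesarrollo", text))


print(replace_after_tokenizing("proyectos de i+d"))
print(replace_before_tokenizing("proyectos de i+d"))
```

In the first call the pattern never matches, because no single token contains a "+" by the time the replacement runs. In the second, the replacement happens on the raw string, so the tokenizer receives one unbroken word.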
> 
> On Tue, May 23, 2017 at 10:12 AM, Fundera Developer
> <funderadevelo...@outlook.com> wrote:
> 
> 
> I have also tried this option, by using a PatternReplaceFilterFactory, like 
> this:
> 
> <filter class="solr.PatternReplaceFilterFactory" pattern="i\+d" 
> replacement="investigación y desarrollo"/>
> 
> but it gets processed AFTER the Tokenizer, so when it executes there is no 
> longer an "i+d" token, only two independent tokens, "i" and "d".
> 
> Is there a way I could make the filter execute before the Tokenizer? I have 
> tried to place it first in the Analyzer definition like this:
> 
>    <analyzer type="index">
>      <charFilter class="solr.MappingCharFilterFactory" 
> mapping="mapping-FoldToASCII.txt"/>
>      <filter class="solr.PatternReplaceFilterFactory" pattern="i\+d" 
> replacement="investigación y desarrollo"/>
>      <tokenizer class="solr.StandardTokenizerFactory"/>
>      <filter class="solr.LowerCaseFilterFactory"/>
>      <filter class="solr.StopFilterFactory" ignoreCase="true" 
> words="stopwords.txt" />
>    </analyzer>
> 
> But I had no luck.
> 
> Are there any other approaches I could be missing?
> 
> Thanks!
> 
> 
> On 22/05/17 at 20:50, Rick Leir wrote:
> 
> Fundera,
> You need a regex which matches a '+' with non-blank chars before and after. 
> It should not replace a '+' preceded by white space, because that form is 
> significant in Solr query syntax. This is not a perfect solution, but might 
> improve matters for you.
> Cheers -- Rick
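
One way to write such a regex, sketched in Python with lookbehind and lookahead assertions. The function name `protect_embedded_plus` and the replacement string `plus` are assumptions for illustration; the same pattern syntax should also be usable in a Java-based regex setting, but that has not been checked here.

```python
import re

# '+' with a non-whitespace character on both sides, e.g. "i+d".
# A '+' at the start of the query or after a space is left alone,
# since Solr reads that as the "required term" operator.
EMBEDDED_PLUS = re.compile(r"(?<=\S)\+(?=\S)")


def protect_embedded_plus(query: str, replacement: str = "plus") -> str:
    """Replace only the '+' signs that sit inside a word."""
    return EMBEDDED_PLUS.sub(replacement, query)


print(protect_embedded_plus("+solr i+d"))
```

As noted above, this is not perfect: a term such as "c++" has its first "+" rewritten and its trailing one left in place.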
> 
> On May 22, 2017 1:58:21 PM EDT, Fundera Developer 
> <funderadevelo...@outlook.com> wrote:
> 
> 
> Thank you Zahid and Erick,
> 
> I was going to try the CharFilter suggestion, but then I had doubts. I see
> the indexing process, and how the appearance of 'i+d' would be handled,
> but what happens at query time? If I use the same filter, I could
> remove '+' chars that are added by the user to identify compulsory
> tokens in the search results, couldn't I? However, if I do not use the
> CharFilter I would not be able to match the 'i+d' search tokens...
> 
> Thanks all!
> 
> 
> 
> On 22/05/17 at 16:39, Erick Erickson wrote:
> 
> You can also use any of the other tokenizers. WhitespaceTokenizer for
> instance. There are a couple that use regular expressions. Etc. See:
> https://cwiki.apache.org/confluence/display/solr/Tokenizers
> 
> Each one has its own considerations. WhitespaceTokenizer won't, for
> instance, separate out punctuation, so you might then have to use a
> filter to remove it. Regexes can be tricky to get right ;). Etc....
> 
> Best,
> Erick
> 
> On Mon, May 22, 2017 at 5:26 AM, Muhammad Zahid Iqbal
> <zahid.iq...@northbaysolutions.net> wrote:
> 
> 
> Hi,
> 
> 
> Before applying the tokenizer, you can replace your special symbols with
> some phrase to preserve them, and after tokenizing you can replace it back.
> 
> For example:
> <charFilter class="solr.PatternReplaceCharFilterFactory" pattern="(\+)"
> replacement="xxx" />
> 
> 
> Thanks,
> Zahid Iqbal
> 
> On Mon, May 22, 2017 at 12:57 AM, Fundera Developer
> <funderadevelo...@outlook.com> wrote:
> 
> 
> 
> Hi all,
> 
> I am a bit stuck on a problem that I feel must be easy to solve. In
> Spanish it is usual to find the term 'i+d'. We are working with Solr 5.5,
> and StandardTokenizer splits it into 'i' and 'd'. Our index holds documents
> in both Spanish and Catalan, and 'i' is a frequent word in Catalan, so
> when a user searches for 'i+d' they sometimes get Catalan documents as
> results.
> 
> I have tried to use the SynonymFilter, with something like:
> 
> i+d => investigacionYdesarrollo
> 
> But it does not seem to change anything.
> 
> Is there a way I could set an exception to the Tokenizer so that it does
> not split this word?
> 
> Thanks in advance!
> 
> 
