You need to distinguish between

PatternReplaceCharFilterFactory
and
PatternReplaceFilterFactory.

The first one is applied to the entire input _before_ tokenization.
The second is applied _after_ tokenization to individual tokens; by
that time it's too late.

It's an easy thing to miss.

And at query time you'll have to be careful to keep the + sign from
being interpreted as an operator.

Best,
Erick

On Tue, May 23, 2017 at 10:12 AM, Fundera Developer
<funderadevelo...@outlook.com> wrote:
> I have also tried this option, by using a PatternReplaceFilterFactory, like
> this:
>
> <filter class="solr.PatternReplaceFilterFactory" pattern="i\+d"
> replacement="investigación y desarrollo"/>
>
> but it gets processed AFTER the Tokenizer, so when it executes there is no
> longer an "i+d" token, but two independent tokens, "i" and "d".
>
> Is there a way I could make the filter execute before the Tokenizer? I have
> tried to place it first in the Analyzer definition like this:
>
> <analyzer type="index">
> <charFilter class="solr.MappingCharFilterFactory"
> mapping="mapping-FoldToASCII.txt"/>
> <filter class="solr.PatternReplaceFilterFactory" pattern="i\+d"
> replacement="investigación y desarrollo"/>
> <tokenizer class="solr.StandardTokenizerFactory"/>
> <filter class="solr.LowerCaseFilterFactory"/>
> <filter class="solr.StopFilterFactory" ignoreCase="true"
> words="stopwords.txt" />
> </analyzer>
>
> But I had no luck.
>
> Are there any other approaches I could be missing?
>
> Thanks!
>
>
> On 22/05/17 at 20:50, Rick Leir wrote:
>
> Fundera,
> You need a regex which matches a '+' with non-blank chars before and after.
> It should not replace a '+' preceded by white space; that is important in
> Solr. This is not a perfect solution, but might improve matters for you.
> Cheers -- Rick
>
> On May 22, 2017 1:58:21 PM EDT, Fundera Developer
> <funderadevelo...@outlook.com> wrote:
>
>
> Thank you Zahid and Erick,
>
> I was going to try the CharFilter suggestion, but then I had doubts.
> I see the indexing process, and how the appearance of 'i+d' would be
> handled, but what happens at query time? If I use the same filter, I could
> remove '+' chars that are added by the user to identify compulsory
> tokens in the search results, couldn't I? However, if I do not use the
> CharFilter I would not be able to match the 'i+d' search tokens...
>
> Thanks all!
>
>
>
> On 22/05/17 at 16:39, Erick Erickson wrote:
>
> You can also use any of the other tokenizers. WhitespaceTokenizer for
> instance. There are a couple that use regular expressions. Etc. See:
> https://cwiki.apache.org/confluence/display/solr/Tokenizers
>
> Each one has its considerations. WhitespaceTokenizer won't, for
> instance, separate out punctuation so you might then have to use a
> filter to remove those. Regexes can be tricky to get right ;). Etc....
>
> Best,
> Erick
>
> On Mon, May 22, 2017 at 5:26 AM, Muhammad Zahid Iqbal
> <zahid.iq...@northbaysolutions.net> wrote:
>
>
> Hi,
>
>
> Before applying the tokenizer, you can replace your special symbols with
> some phrase to preserve them, and after tokenizing you can replace it back.
>
> For example:
> <charFilter class="solr.PatternReplaceCharFilterFactory" pattern="(\+)"
> replacement="xxx" />
>
>
> Thanks,
> Zahid iqbal
>
> On Mon, May 22, 2017 at 12:57 AM, Fundera Developer
> <funderadevelo...@outlook.com> wrote:
>
>
>
> Hi all,
>
> I am a bit stuck at a problem that I feel must be easy to solve. In
> Spanish it is usual to find the term 'i+d'.
> We are working with Solr 5.5, and StandardTokenizer splits it into 'i'
> and 'd'. As we have documents in both Spanish and Catalan in the index,
> and 'i' is a frequent word in Catalan, a user who searches for 'i+d'
> sometimes gets Catalan documents as results.
>
> I have tried to use the SynonymFilter, with something like:
>
> i+d => investigacionYdesarrollo
>
> But it does not seem to change anything.
>
> Is there a way I could set an exception to the Tokenizer so that it does
> not split this word?
>
> Thanks in advance!
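
For reference, the distinction drawn at the top of the thread could be applied to the index analyzer quoted above roughly as follows. This is only a sketch: the (?i) flag, the \b boundaries, and the single-token replacement investigacionydesarrollo are assumptions for illustration, not something posted in the thread. The point it shows is that the replacement moves from a <filter> to a <charFilter>, so it runs on the raw text before StandardTokenizer splits on the '+'.

```xml
<analyzer type="index">
  <!-- charFilters see the whole input string, before any tokenization -->
  <charFilter class="solr.MappingCharFilterFactory"
              mapping="mapping-FoldToASCII.txt"/>
  <!-- Assumption: rewrite only a standalone "i+d" (any case) to one
       placeholder token; the pattern and replacement are illustrative -->
  <charFilter class="solr.PatternReplaceCharFilterFactory"
              pattern="(?i)\bi\+d\b"
              replacement="investigacionydesarrollo"/>
  <tokenizer class="solr.StandardTokenizerFactory"/>
  <!-- token filters run after the tokenizer, on individual tokens -->
  <filter class="solr.LowerCaseFilterFactory"/>
  <filter class="solr.StopFilterFactory" ignoreCase="true"
          words="stopwords.txt"/>
</analyzer>
```

Because the pattern only matches a '+' with 'i' directly before it and 'd' directly after, a '+' preceded by white space is left alone. The query analyzer would presumably need the same charFilter, and the caveat above still applies: the '+' in the user's query has to reach the analyzer without being interpreted as an operator.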