Years ago at Netflix, I had to deal with a DVD from a band named “+/-”. I gave up and translated that to “plusminus” at index and query time.
http://plusmin.us/

Luckily, “.hack//Sign” and other related dot-hack anime matched if I just deleted all the punctuation.

And everyone searched for “[•REC]²” as “rec2”. The middot is supposed to be red.

Movie studios are clueless about searchable strings.

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/ (my blog)

> On May 23, 2017, at 10:41 AM, Erick Erickson <erickerick...@gmail.com> wrote:
>
> You need to distinguish between
>
> PatternReplaceCharFilterFactory
>
> and
>
> PatternReplaceFilterFactory
>
> The first one is applied to the entire input _before_ tokenization.
> The second is applied _after_ tokenization to individual tokens, by
> that time it's too late.
>
> It's an easy thing to miss.
>
> And at query time you'll have to be careful to keep the + sign from
> being interpreted as an operator.
>
> Best,
> Erick
>
> On Tue, May 23, 2017 at 10:12 AM, Fundera Developer
> <funderadevelo...@outlook.com> wrote:
>> I have also tried this option, by using a PatternReplaceFilterFactory, like
>> this:
>>
>> <filter class="solr.PatternReplaceFilterFactory" pattern="i\+d"
>>         replacement="investigación y desarrollo"/>
>>
>> but it gets processed AFTER the Tokenizer, so when it executes there is no
>> longer an "i+d" token, but two "i" and "d" independent tokens.
>>
>> Is there a way I could make the filter execute before the Tokenizer? I have
>> tried to place it first in the Analyzer definition like this:
>>
>> <analyzer type="index">
>>   <charFilter class="solr.MappingCharFilterFactory"
>>               mapping="mapping-FoldToASCII.txt"/>
>>   <filter class="solr.PatternReplaceFilterFactory" pattern="i\+d"
>>           replacement="investigación y desarrollo"/>
>>   <tokenizer class="solr.StandardTokenizerFactory"/>
>>   <filter class="solr.LowerCaseFilterFactory"/>
>>   <filter class="solr.StopFilterFactory" ignoreCase="true"
>>           words="stopwords.txt" />
>> </analyzer>
>>
>> But I had no luck.
>>
>> Are there any other approaches I could be missing?
>>
>> Thanks!
>>
>>
>> On 22/05/17 at 20:50, Rick Leir wrote:
>>
>> Fundera,
>> You need a regex which matches a '+' with non-blank chars before and after.
>> It should not replace a '+' preceded by white space; that is important in
>> Solr. This is not a perfect solution, but might improve matters for you.
>> Cheers -- Rick
>>
>> On May 22, 2017 1:58:21 PM EDT, Fundera Developer
>> <funderadevelo...@outlook.com> wrote:
>>
>>
>> Thank you Zahid and Erick,
>>
>> I was going to try the CharFilter suggestion, but then I had doubts. I see
>> the indexing process, and how the appearance of 'i+d' would be handled,
>> but what happens at query time? If I use the same filter, I could
>> remove '+' chars that are added by the user to identify compulsory
>> tokens in the search results, couldn't I? However, if I do not use the
>> CharFilter I would not be able to match the 'i+d' search tokens...
>>
>> Thanks all!
>>
>>
>>
>> On 22/05/17 at 16:39, Erick Erickson wrote:
>>
>> You can also use any of the other tokenizers. WhitespaceTokenizer for
>> instance. There are a couple that use regular expressions. Etc. See:
>> https://cwiki.apache.org/confluence/display/solr/Tokenizers
>>
>> Each one has its considerations. WhitespaceTokenizer won't, for
>> instance, separate out punctuation so you might then have to use a
>> filter to remove those. Regexes can be tricky to get right ;). Etc.
>>
>> Best,
>> Erick
>>
>> On Mon, May 22, 2017 at 5:26 AM, Muhammad Zahid Iqbal
>> <zahid.iq...@northbaysolutions.net> wrote:
>>
>>
>> Hi,
>>
>>
>> Before applying the tokenizer, you can replace your special symbols with
>> some phrase to preserve them, and after tokenizing you can replace it back.
>>
>> For example:
>> <charFilter class="solr.PatternReplaceCharFilterFactory" pattern="(\+)"
>>             replacement="xxx" />
>>
>>
>> Thanks,
>> Zahid Iqbal
>>
>> On Mon, May 22, 2017 at 12:57 AM, Fundera Developer
>> <funderadevelo...@outlook.com> wrote:
>>
>>
>>
>> Hi all,
>>
>> I am a bit stuck on a problem that I feel must be easy to solve. In
>> Spanish it is usual to find the term 'i+d'. We are working with Solr 5.5,
>> and StandardTokenizer splits it into 'i' and 'd'. Because our index holds
>> documents in both Spanish and Catalan, and 'i' is a frequent word in
>> Catalan, a user who searches for 'i+d' sometimes gets Catalan documents
>> as results.
>>
>> I have tried to use the SynonymFilter, with something like:
>>
>> i+d => investigacionYdesarrollo
>>
>> But it does not seem to change anything.
>>
>> Is there a way I could set an exception to the Tokenizer so that it does
>> not split this word?
>>
>> Thanks in advance!
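
Pulling the thread's advice together, the index-time analyzer might look something like the sketch below. It swaps the PatternReplaceFilterFactory for a PatternReplaceCharFilterFactory, so the rewrite happens on the raw text before StandardTokenizer can split on the '+'. The `(?i)` flag, the `\b` word boundaries, and the single-token replacement `investigacionydesarrollo` are assumptions for illustration and were not tested by anyone in the thread; the remaining elements are copied from the analyzer quoted above.

```xml
<analyzer type="index">
  <charFilter class="solr.MappingCharFilterFactory"
              mapping="mapping-FoldToASCII.txt"/>
  <!-- Assumption: rewrite the raw text before tokenization, so that
       StandardTokenizer never sees the '+'. The pattern, the (?i) flag and
       the replacement token are illustrative, not tested in the thread. -->
  <charFilter class="solr.PatternReplaceCharFilterFactory"
              pattern="(?i)\bi\+d\b"
              replacement="investigacionydesarrollo"/>
  <tokenizer class="solr.StandardTokenizerFactory"/>
  <filter class="solr.LowerCaseFilterFactory"/>
  <filter class="solr.StopFilterFactory" ignoreCase="true"
          words="stopwords.txt"/>
</analyzer>
```

Because the pattern only matches a '+' with an 'i' directly before it and a 'd' directly after it, a '+' that follows white space is left alone, which is the behaviour Rick asks for. The same charFilter would presumably be needed in the query analyzer too, and Erick's caveat still applies: the '+' has to reach the analyzer without being interpreted as an operator.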