On 24 November 2011 15:18, Tomasz Wegrzanowski
<tomasz.wegrzanow...@gmail.com> wrote:
> On 22 November 2011 14:28, Jan Høydahl <jan....@cominvent.com> wrote:
>> Why do you need spaces in the replacement?
>>
>> Try pattern="\+" replacement="plus" - it will cause the transformed
>> charstream to contain as many tokens as the original and avoid the
>> highlighting crash.
>
> I tried that, it still crashes.
>
> Replacing it with single character, including single non-ASCII
> character, doesn't cause a crash.
>
> I'm sort of tempted to just reuse some CJK character, and synonym filter
> it to mean "plus".

In case anybody else runs into this problem, I found a solution.

The only thing that works and doesn't seem to crash Solr is CJK expansions:

    <!-- they're not random, that's just what these characters mean -->
    <charFilter class="solr.PatternReplaceCharFilterFactory" pattern="\+" replacement="加"/>
    <charFilter class="solr.PatternReplaceCharFilterFactory" pattern="&amp;" replacement="和"/>

Followed by un-CJK-ing in the synonym filter:

    # General rules
    加 => plus
    和 => and
    # And any special synonyms you want:
    r and d, r 和 d => r and d, research and development
    s and p, s 和 p => s and p, standard and poor's
    at and t, at 和 t => at and t, american telephone and telegraph

The user never sees these CJK characters; they only exist for a brief time
within the Solr pipeline to make the tokenizer happy.

I also tried private-use Unicode characters, but they're ignored by the
tokenizer.
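
For anyone wiring this into schema.xml, here is a rough sketch of how the char filters and the synonym filter might sit together in one field type. The field type name (text_plus), the synonyms file name (synonyms.txt), the choice of StandardTokenizerFactory and the lowercase filter are all my assumptions for illustration, not something I have verified against every Solr version. Note that a literal ampersand has to be written as &amp;amp; inside an XML attribute.

```xml
<!-- Sketch only: "text_plus" and "synonyms.txt" are hypothetical names. -->
<fieldType name="text_plus" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <!-- Char filters run before the tokenizer: swap + and & for
         single CJK placeholder characters. -->
    <charFilter class="solr.PatternReplaceCharFilterFactory"
                pattern="\+" replacement="加"/>
    <charFilter class="solr.PatternReplaceCharFilterFactory"
                pattern="&amp;" replacement="和"/>

    <!-- Assumption: the tokenizer emits each CJK character as its own
         token, so "r+d" would come out as r / 加 / d. -->
    <tokenizer class="solr.StandardTokenizerFactory"/>

    <filter class="solr.LowerCaseFilterFactory"/>

    <!-- Maps the placeholders back to words, using the rules listed
         above (加 => plus, 和 => and, plus any special synonyms). -->
    <filter class="solr.SynonymFilterFactory"
            synonyms="synonyms.txt" ignoreCase="true"/>
  </analyzer>
</fieldType>
```

Using one analyzer for both indexing and querying keeps the placeholder mapping symmetric; if you split index and query analyzers, the char filters would presumably need to appear in both.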