On 24 November 2011 15:18, Tomasz Wegrzanowski
<tomasz.wegrzanow...@gmail.com> wrote:
> On 22 November 2011 14:28, Jan Høydahl <jan....@cominvent.com> wrote:
>> Why do you need spaces in the replacement?
>>
>> Try pattern="\+" replacement="plus" - it will cause the transformed 
>> charstream to contain as many tokens as the original and avoid the 
>> highlighting crash.
>
> I tried that, it still crashes.
>
> Replacing it with single character, including single non-ASCII
> character, doesn't cause a crash.
>
> I'm sort of tempted to just reuse some CJK character, and synonym filter
> it to mean "plus".

In case anybody else runs into this problem, I found a solution.

The only thing that works and doesn't seem to crash Solr is replacing
"+" and "&" with single CJK characters:

  <!-- they're not random, that's just what these characters mean -->
  <charFilter class="solr.PatternReplaceCharFilterFactory"
              pattern="\+" replacement="加"/>
  <charFilter class="solr.PatternReplaceCharFilterFactory"
              pattern="&amp;" replacement="和"/>

Followed by un-CJK-ing in synonym filter:

# General rules
加 => plus
和 => and
# And any special synonyms you want:
r and d, r 和 d => r and d, research and development
s and p, s 和 p => s and p, standard and poor's
at and t, at 和 t => at and t, american telephone and telegraph

The user never sees these CJK characters; they only exist briefly
inside the Solr pipeline to keep the tokenizer happy.
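
For anyone who wants to see the idea outside Solr, here is a rough Python
sketch of the two stages (char filter, then synonym mapping). It is only a
toy simulation, not Solr code: the analyze() function, the regex tokenizer,
and the lowercasing step are my own assumptions, and I'm assuming the
tokenizer emits each CJK ideograph as its own token.

```python
import re

# Stage 1: stand-ins for the two PatternReplaceCharFilterFactory rules.
CHAR_FILTERS = [(r"\+", "加"), (r"&", "和")]

# Stage 2: the "general rules" from the synonyms file.
SYNONYMS = {"加": "plus", "和": "and"}

# Assumed tokenizer behaviour: each Han ideograph is its own token,
# any other run of word characters is one token.
TOKEN_RE = re.compile(r"[\u4e00-\u9fff]|[^\W\u4e00-\u9fff]+")


def analyze(text):
    """Hypothetical helper mimicking char filters -> tokenizer -> synonyms."""
    for pattern, replacement in CHAR_FILTERS:
        text = re.sub(pattern, replacement, text)
    tokens = TOKEN_RE.findall(text.lower())
    return [SYNONYMS.get(token, token) for token in tokens]


print(analyze("AT&T"))
print(analyze("C++"))
```

The point is that each replaced symbol becomes exactly one character and one
token, and the synonym step then turns it back into a plain word.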

I also tried Unicode private-use characters, but the tokenizer ignores them.
