Re: Solr, ICUTokenizer with Latin-break-only-on-whitespace

Shawn Heisey Thu, 20 Jun 2013 12:41:27 -0700

On 6/20/2013 1:26 PM, Jonathan Rochkind wrote:

I want, for instance, "C++ Language" to be tokenized into "C++", "Language". But the ICUTokenizer, even with rulefiles="Latn:Latin-break-only-on-whitespace.rbbi" and the rbbi file from the Solr 4.3 source [1], is still stripping the punctuation and tokenizing that into "C", "Language".

This screenshot is using branch_4x downloaded and compiled a couple of hours ago, with the rbbi file you mentioned copied to the conf directory:


https://dl.dropboxusercontent.com/u/97770508/icutokenizer-whitespace-only.png

It shows that the ++ is maintained by the ICU tokenizer. It also illustrates a UI bug that I will have to show to steffkes, where the ++ is lost from the input field after analysis.
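
For reference, a field type along these lines is roughly the kind of setup being tested. This is only a sketch, not the exact schema behind the screenshot: the field type name text_icu_ws and the positionIncrementGap value are made up for illustration, and it assumes the ICU jars from the analysis-extras contrib are on the classpath. Only the rulefiles value comes from this thread.

```xml
<!-- Hypothetical field type name; only the rulefiles value is taken from this thread. -->
<fieldType name="text_icu_ws" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <!-- "Latn:" applies the custom break rules to Latin-script text.
         The .rbbi file is assumed to sit in the core's conf directory. -->
    <tokenizer class="solr.ICUTokenizerFactory"
               rulefiles="Latn:Latin-break-only-on-whitespace.rbbi"/>
  </analyzer>
</fieldType>
```

With no filters after the tokenizer, the analysis screen shows exactly what the tokenizer emits, which makes it easier to tell whether punctuation is being dropped by the tokenizer or by a later filter.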


Thanks,
Shawn
