On 6/20/2013 1:26 PM, Jonathan Rochkind wrote:
I want, for instance, "C++ Language" to be tokenized into "C++", "Language". I am using the ICUTokenizer with rulefiles="Latn:Latin-break-only-on-whitespace.rbbi", with the rbbi file from the Solr 4.3 source [1].

But the ICUTokenizer, even with that rulefile, is still stripping the punctuation and tokenizing that into "C", "Language".

This screenshot is using branch_4x downloaded and compiled a couple of hours ago, with the rbbi file you mentioned copied to the conf directory:

https://dl.dropboxusercontent.com/u/97770508/icutokenizer-whitespace-only.png

It shows that the ICU tokenizer preserves the ++. It also illustrates a UI bug, which I will have to show to steffkes: the ++ is lost from the input field after analysis.
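
For reference, the kind of analyzer setup described above can be sketched as a field type like the one below. This is only an illustration, not necessarily the exact config behind the screenshot: the field type name "text_icu_ws" is hypothetical, and it assumes the rbbi file sits in the core's conf directory and that the ICU analysis jars (contrib/analysis-extras) are on the classpath.

```xml
<!-- Hypothetical field type; the name "text_icu_ws" is illustrative only. -->
<fieldType name="text_icu_ws" class="solr.TextField">
  <analyzer>
    <!-- rulefiles takes script:file pairs; the script is a four-letter
         ISO 15924 code (Latn = Latin) and the file is resolved from conf/. -->
    <tokenizer class="solr.ICUTokenizerFactory"
               rulefiles="Latn:Latin-break-only-on-whitespace.rbbi"/>
  </analyzer>
</fieldType>
```

With no filters after the tokenizer, the analysis screen shows the tokenizer output directly, which makes it easier to tell whether the punctuation is dropped by the tokenizer or by a later filter in the chain.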

Thanks,
Shawn
