On 6/20/2013 1:26 PM, Jonathan Rochkind wrote:
I want, for instance, "C++ Language" to be tokenized into "C++",
"Language". But the ICUTokenizer, even with
rulefiles="Latn:Latin-break-only-on-whitespace.rbbi" and the rbbi
file from the Solr 4.3 source [1], is still stripping the
punctuation, and tokenizing that into "C", "Language".
This screenshot is using branch_4x downloaded and compiled a couple of
hours ago, with the rbbi file you mentioned copied to the conf directory:
https://dl.dropboxusercontent.com/u/97770508/icutokenizer-whitespace-only.png
It shows that the ++ is maintained by the ICU tokenizer. It also
illustrates a UI bug, which I will have to show to steffkes, where
the ++ is lost from the input field after analysis.
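
For comparison, here is a minimal sketch of a field type that points
the ICU tokenizer at that rulefile. It assumes the .rbbi file has been
copied into the core's conf directory, as described above; the
fieldType name "text_icu_ws" is made up for illustration and is not
necessarily what was used for the screenshot.

```xml
<!-- Hypothetical field type name; only the tokenizer line matters here. -->
<fieldType name="text_icu_ws" class="solr.TextField">
  <analyzer>
    <!-- rulefiles maps a four-letter script code to an RBBI rules file,
         resolved relative to the conf directory. -->
    <tokenizer class="solr.ICUTokenizerFactory"
               rulefiles="Latn:Latin-break-only-on-whitespace.rbbi"/>
  </analyzer>
</fieldType>
```

If the ++ still disappears with a definition like this, it is worth
checking whether a later filter in the analysis chain is removing it
rather than the tokenizer itself.
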
Thanks,
Shawn