Hi there,

The documents I am trying to index with Solr contain characters from CJK Extension B, which was added to Unicode in version 3.1 (March 2001). Unfortunately, it seems that Solr (and maybe Lucene) does not yet support these characters.
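For reference, here is a minimal Java sketch of what makes these characters different from the rest of CJK: they lie outside the Basic Multilingual Plane, so Java stores each one as a surrogate pair of two char values. The class name ExtensionBCheck and the method containsExtensionB are made up for this illustration and are not part of Solr or Lucene; the sketch only uses java.lang methods available since Java 1.5. I do not know whether surrogate handling is actually where things go wrong.

```java
public class ExtensionBCheck {
    // CJK Unified Ideographs Extension B block: U+20000..U+2A6DF
    private static final int EXT_B_START = 0x20000;
    private static final int EXT_B_END = 0x2A6DF;

    // Hypothetical helper: true if the string contains at least one
    // code point from the Extension B block.
    public static boolean containsExtensionB(String s) {
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);        // reads a full surrogate pair if present
            if (cp >= EXT_B_START && cp <= EXT_B_END) {
                return true;
            }
            i += Character.charCount(cp);     // advance by 1 or 2 char units
        }
        return false;
    }

    public static void main(String[] args) {
        // U+20000 is the first character of Extension B.
        String s = new String(Character.toChars(0x20000));
        System.out.println(s.length());                       // 2 (UTF-16 code units)
        System.out.println(s.codePointCount(0, s.length()));  // 1 (one character)
        System.out.println(containsExtensionB(s));            // true
    }
}
```

Code that walks a string one char at a time sees two separate surrogate values here rather than one character, which is one possible way for such text to be damaged on output.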
Solr seems to accept the documents without problem, but when I retrieve them, there are strange placeholders like #0; etc. in place of these characters. Might this be a configuration issue?

While most of the characters in this range are very rare, the latest mapping tables between Unicode and the Japanese JIS coded character sets mean that some characters in everyday use in Japan are now encoded in this area. It therefore seems highly desirable that this problem gets solved.

I am testing this on a Mac OS X 10.5.2 system, with Java 1.5.0_13 and Solr 1.2.0.

Any hints appreciated,

Christian Wittern

-- 
Christian Wittern
Institute for Research in Humanities, Kyoto University
47 Higashiogura-cho, Kitashirakawa, Sakyo-ku, Kyoto 606-8265, JAPAN