Re: no support for CJK characters from Extension B in Solr

2008-02-28 Thread Christian Wittern
Erik Hatcher wrote: How are you POSTing the documents to Solr? What content-type are you using with the HTTP header? And what encoding are you using with the XML (file?) being POSTed, and is that encoding specified in the XML file itself? For these tests I used the script post.sh from the ex

Re: no support for CJK characters from Extension B in Solr

2008-02-28 Thread Christian Wittern
Ken Krugler wrote: What was the actual format of the Extension B characters in the XML being posted? I tried both a binary (UTF-8) format and numeric character representation of the type 𠀀 -- the results were the same. Christian -- Christian Wittern Institute for Research in Humanitie
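
For readers comparing the two input formats mentioned above, here is a minimal Java sketch of how a numeric character reference for an Extension B character can be produced. The class and method names are hypothetical and not part of Solr or Lucene. The point it illustrates is that the string has to be walked by code point: walking it by 16-bit char would emit two surrogate references, which are not legal XML characters.

```java
class NcrSketch {
    // Hypothetical helper: replaces every non-ASCII code point with a
    // hexadecimal numeric character reference such as &#x20000;
    static String toNumericCharRefs(String s) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);          // reads a full surrogate pair if present
            if (cp > 0x7F) {
                sb.append("&#x")
                  .append(Integer.toHexString(cp).toUpperCase())
                  .append(';');
            } else {
                sb.appendCodePoint(cp);
            }
            i += Character.charCount(cp);       // advance by 1 or 2 chars
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // U+20000 is the first code point of CJK Extension B.
        String extB = new String(Character.toChars(0x20000));
        System.out.println(toNumericCharRefs("a" + extB));
    }
}
```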

Re: no support for CJK characters from Extension B in Solr

2008-02-28 Thread Ken Krugler
Thanks to all for clearing this up. It seems we are still quite far away from full Unicode support:-( As to the questions about the encoding in previous messages, all of the other characters in the documents come through without a glitch, so there is definitely no other issue involved. What w

Re: no support for CJK characters from Extension B in Solr

2008-02-28 Thread Erik Hatcher
On Feb 28, 2008, at 6:56 PM, Christian Wittern wrote: Thanks to all for clearing this up. It seems we are still quite far away from full Unicode support:-( As to the questions about the encoding in previous messages, all of the other characters in the documents come through without a glitc

Re: no support for CJK characters from Extension B in Solr

2008-02-28 Thread Christian Wittern
Thanks to all for clearing this up. It seems we are still quite far away from full Unicode support:-( As to the questions about the encoding in previous messages, all of the other characters in the documents come through without a glitch, so there is definitely no other issue involved. Ch

Re: no support for CJK characters from Extension B in Solr

2008-02-28 Thread Erik Hatcher
Wow - great stuff Steve! As for StandardTokenizer and Java version - no worries there really, as Solr itself requires Java 1.5+, so when such a tokenizer is made available it could be used just fine in Solr even if it isn't built into a core Lucene release for a while. Erik On

RE: no support for CJK characters from Extension B in Solr

2008-02-28 Thread Steven A Rowe
On 02/28/2008 at 11:26 AM, Ken Krugler wrote: > And as Erik mentioned, it appears that line 114 of > StandardTokenizerImpl.jflex: > > http://www.krugle.org/kse/files/svn/svn.apache.org/lucene/java/trunk/src/java/org/apache/lucene/analysis/standard/StandardTokenizerImpl.jflex > > needs to be updat

Re: no support for CJK characters from Extension B in Solr

2008-02-28 Thread Erik Hatcher
To elaborate: StandardTokenizer comes into play for indexing and querying (and only if you have that configured for that field in schema.xml). But the original issue seems to be with actually parsing the content properly and storing it in the Lucene index, which is separate from the tok

Re: no support for CJK characters from Extension B in Solr

2008-02-28 Thread Ken Krugler
Hi Christian, The documents I am trying to index with Solr contain characters from the CJK Extension B, which had been added to Unicode in version 3.1 (March 2001). Unfortunately, it seems to be the case that Solr (and maybe Lucene) do not yet support these characters. Solr seems to accept the

Re: no support for CJK characters from Extension B in Solr

2008-02-28 Thread Ken Krugler
Hi Christian, The documents I am trying to index with Solr contain characters from the CJK Extension B, which had been added to Unicode in version 3.1 (March 2001). Unfortunately, it seems to be the case that Solr (and maybe Lucene) do not yet support these characters. Solr seems to accept the

Re: no support for CJK characters from Extension B in Solr

2008-02-28 Thread Erik Hatcher
Christian, This bit of trivia is probably useful to you as well. Lucene's StandardTokenizer uses these Unicode ranges for CJK characters: KOREAN = [\uac00-\ud7af\u1100-\u11ff] // Chinese, Japanese CJ = [\u3040-\u318f\u3100-\u312f\u3040-\u309F\u30A0-\u30FF \u31F0-\u31FF\u3300-\u
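
As a hedged illustration of why ranges written with four-digit \uXXXX escapes cannot cover Extension B: those characters live outside the Basic Multilingual Plane, so Java stores each one as a surrogate pair of two 16-bit chars. The sketch below is not the Lucene grammar. The class name, the helper methods, and the single sample range (0x3040 to 0x318F, taken from the visible start of the CJ definition above) are assumptions made for illustration only.

```java
class ExtensionBSketch {
    // True if the code point is in the CJK Unified Ideographs Extension B block.
    static boolean isExtensionB(int codePoint) {
        return Character.UnicodeBlock.of(codePoint)
                == Character.UnicodeBlock.CJK_UNIFIED_IDEOGRAPHS_EXTENSION_B;
    }

    // Hypothetical BMP-only check on a single 16-bit char, using one sample
    // range (0x3040..0x318F) rather than the full grammar definition.
    static boolean inSampleBmpRange(char c) {
        return c >= 0x3040 && c <= 0x318F;
    }

    // True if any individual 16-bit char of the string falls in the sample range.
    static boolean anyCharInSampleBmpRange(String s) {
        for (int i = 0; i < s.length(); i++) {
            if (inSampleBmpRange(s.charAt(i))) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        String extB = new String(Character.toChars(0x20000));
        System.out.println("UTF-16 units: " + extB.length());
        System.out.println("code points:  " + extB.codePointCount(0, extB.length()));
        System.out.printf("surrogates:   %04X %04X%n",
                (int) extB.charAt(0), (int) extB.charAt(1));
        System.out.println("Extension B:  " + isExtensionB(extB.codePointAt(0)));
        System.out.println("matched by sample BMP range: " + anyCharInSampleBmpRange(extB));
    }
}
```

For U+20000 the two units are D840 and DC00, neither of which falls in a CJK range of the Basic Multilingual Plane, so a matcher that works one char at a time never sees the character as CJK.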

Re: no support for CJK characters from Extension B in Solr

2008-02-28 Thread Erik Hatcher
Christian, Is this an issue with the encoding used when adding the documents to the index? There are two encodings that need to be gotten right, the one for the XML content POSTed to Solr, and also the HTTP header on that POST request. If you are getting mangled content back from a st
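
The two encodings mentioned above can be sketched in Java as follows. This is only an illustration under stated assumptions: the class name, the field names "id" and "text", and the helper methods are hypothetical, and nothing is actually sent to a server. It shows the XML declaration and the Content-Type header both naming UTF-8, and the body bytes being produced with that same charset.

```java
import java.nio.charset.StandardCharsets;

class SolrPostEncodingSketch {
    // Value intended for the HTTP Content-Type header of the POST request.
    static final String CONTENT_TYPE = "text/xml; charset=UTF-8";

    // Minimal escaping of XML markup characters in field values.
    static String escape(String value) {
        return value.replace("&", "&amp;")
                    .replace("<", "&lt;")
                    .replace(">", "&gt;");
    }

    // Builds an <add> message; the field names here are assumptions.
    static String buildXml(String id, String text) {
        return "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n"
                + "<add><doc>"
                + "<field name=\"id\">" + escape(id) + "</field>"
                + "<field name=\"text\">" + escape(text) + "</field>"
                + "</doc></add>";
    }

    // Encodes the message with the charset named in both the XML
    // declaration and the Content-Type header.
    static byte[] buildBody(String id, String text) {
        return buildXml(id, text).getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String extB = new String(Character.toChars(0x20000));
        byte[] body = buildBody("1", extB);
        System.out.println("Content-Type: " + CONTENT_TYPE);
        System.out.println("body length in bytes: " + body.length);
    }
}
```

If the bytes were produced with one charset while the header or the XML declaration named another, a character such as U+20000 (four bytes in UTF-8) would not survive the round trip, which is one way to get mangled content back.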

Re: no support for CJK characters from Extension B in Solr

2008-02-27 Thread Christian Wittern
Leonardo Santagada wrote: On 28/02/2008, at 00:23, Christian Wittern wrote: The documents I am trying to index with Solr contain characters from the CJK Extension B, which had been added to Unicode in version 3.1 (March 2001). Just to give more information, does Java support this? I believe

Re: no support for CJK characters from Extension B in Solr

2008-02-27 Thread Leonardo Santagada
On 28/02/2008, at 00:23, Christian Wittern wrote: The documents I am trying to index with Solr contain characters from the CJK Extension B, which had been added to Unicode in version 3.1 (March 2001). Just to give more information, does Java support this? I believe they don't support cha