Erik Hatcher wrote:
How are you POSTing the documents to Solr? What content-type are you
using with the HTTP header? And what encoding are you using with the
XML (file?) being POSTed, and is that encoding specified in the XML
file itself?
For these tests I used the script post.sh from the ex
Ken Krugler wrote:
What was the actual format of the Extension B characters in the XML
being posted?
I tried both a binary (UTF-8) format and a numeric character
reference of the form &#x20000; -- the results were the same.
Christian
--
Christian Wittern
Institute for Research in Humanities
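For reference, &#x20000; is U+20000, the first code point of CJK Extension B, which
lies outside the Basic Multilingual Plane. A quick self-contained check (plain JDK,
written here just for illustration, not code from the thread) that the numeric
character reference and the raw UTF-8 form should both come out of an XML parser
as the same single code point:

  import java.io.ByteArrayInputStream;
  import javax.xml.parsers.DocumentBuilderFactory;
  import org.w3c.dom.Document;

  public class NcrCheck {
      public static void main(String[] args) throws Exception {
          // Numeric character reference form; the raw UTF-8 form should parse identically.
          String xml = "<?xml version=\"1.0\" encoding=\"UTF-8\"?><field>&#x20000;</field>";
          Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                  .parse(new ByteArrayInputStream(xml.getBytes("UTF-8")));
          String text = doc.getDocumentElement().getTextContent();
          System.out.println(Integer.toHexString(text.codePointAt(0))); // 20000
          System.out.println(text.length()); // 2: one code point, two UTF-16 code units
      }
  }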
Thanks to all for clearing this up. It seems we are still quite far
away from full Unicode support:-(
As to the questions about the encoding in previous messages, all of
the other characters in the documents come through without a glitch,
so there is definitely no other issue involved.
What w
Wow - great stuff Steve!
As for StandardTokenizer and Java version - no worries there really,
as Solr itself requires Java 1.5+, so when such a tokenizer is made
available it could be used just fine in Solr even if it isn't built
into a core Lucene release for a while.
Erik
On 02/28/2008 at 11:26 AM, Ken Krugler wrote:
> And as Erik mentioned, it appears that line 114 of
> StandardTokenizerImpl.jflex:
>
> http://www.krugle.org/kse/files/svn/svn.apache.org/lucene/java/trunk/src/java/org/apache/lucene/analysis/standard/StandardTokenizerImpl.jflex
>
> needs to be updated
To elaborate, StandardTokenizer comes into play for indexing and
querying (and only if you have it configured for that field in
schema.xml). But the original issue seems to be with actually
parsing the content properly and storing it in the Lucene index,
which is separate from the tokenization.
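To make that distinction concrete, here is a rough sketch against the Lucene API
of that era (the class and field names are mine, just for illustration): a stored
field value is written and read back verbatim, regardless of what the analyzer
does with it for search.

  import org.apache.lucene.analysis.standard.StandardAnalyzer;
  import org.apache.lucene.document.Document;
  import org.apache.lucene.document.Field;
  import org.apache.lucene.index.IndexReader;
  import org.apache.lucene.index.IndexWriter;
  import org.apache.lucene.store.RAMDirectory;

  public class StoredVsTokenized {
      public static void main(String[] args) throws Exception {
          String extB = new String(Character.toChars(0x20000)); // a CJK Extension B character
          RAMDirectory dir = new RAMDirectory();
          IndexWriter writer = new IndexWriter(dir, new StandardAnalyzer(), true);
          Document doc = new Document();
          doc.add(new Field("text", extB, Field.Store.YES, Field.Index.TOKENIZED));
          writer.addDocument(doc);
          writer.close();
          IndexReader reader = IndexReader.open(dir);
          String stored = reader.document(0).get("text");
          // If this prints 20000, the stored value survived; any loss happened earlier
          // (XML/HTTP encoding) or would only affect tokenized terms, not the stored field.
          System.out.println(Integer.toHexString(stored.codePointAt(0)));
          reader.close();
      }
  }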
Hi Christian,
The documents I am trying to index with Solr contain characters from the CJK
Extension B, which had been added to Unicode in version 3.1 (March 2001).
Unfortunately, it seems that Solr (and maybe Lucene) does not yet
support these characters.
Solr seems to accept the
Christian,
This bit of trivia is probably useful to you as well. Lucene's
StandardTokenizer uses these Unicode ranges for CJK characters:
KOREAN = [\uac00-\ud7af\u1100-\u11ff]
// Chinese, Japanese
CJ = [\u3040-\u318f\u3100-\u312f\u3040-\u309F\u30A0-\u30FF\u31F0-\u31FF\u3300-\u
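Note that every range in those classes is a \uXXXX range, i.e. inside the Basic
Multilingual Plane, while CJK Extension B occupies U+20000 through U+2A6DF. A tiny
check (illustration only, not from the thread) that an Extension B code point can
never fall into such a range:

  public class RangeCheck {
      public static void main(String[] args) {
          int extB = 0x20000; // first code point of CJK Extension B
          // Expected: CJK_UNIFIED_IDEOGRAPHS_EXTENSION_B (block names per Java 1.5+)
          System.out.println(Character.UnicodeBlock.of(extB));
          // Any \uXXXX character class tops out at U+FFFF, so this is always false:
          System.out.println(extB <= 0xFFFF);
      }
  }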
Christian,
Is this an issue with the encoding used when adding the documents to
the index? There are two encodings that need to be right: the one
for the XML content POSTed to Solr, and the one declared in the HTTP
header on that POST request. If you are getting mangled content back from
a st
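A minimal sketch of getting both right at once (plain JDK; the URL and field names
are only placeholders): declare charset=UTF-8 in the Content-type header and make
sure the bytes you write really are UTF-8.

  import java.io.OutputStream;
  import java.net.HttpURLConnection;
  import java.net.URL;

  public class PostUtf8 {
      public static void main(String[] args) throws Exception {
          String xml = "<add><doc><field name=\"id\">1</field>"
                     + "<field name=\"text\">&#x20000;</field></doc></add>";
          HttpURLConnection conn = (HttpURLConnection)
                  new URL("http://localhost:8983/solr/update").openConnection();
          conn.setRequestMethod("POST");
          conn.setDoOutput(true);
          conn.setRequestProperty("Content-type", "text/xml; charset=UTF-8");
          OutputStream out = conn.getOutputStream();
          out.write(xml.getBytes("UTF-8")); // the bytes must match the declared charset
          out.close();
          System.out.println(conn.getResponseCode()); // expect 200 from the update handler
      }
  }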
Leonardo Santagada wrote:
On 28/02/2008, at 00:23, Christian Wittern wrote:
The documents I am trying to index with Solr contain characters from
the CJK Extension B, which had been added to Unicode in version 3.1
(March 2001).
Just to give more information, does Java support this? I believe they
don't support cha
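For what it's worth, Java does represent these characters: they are stored as
UTF-16 surrogate pairs, and since Java 1.5 the code-point APIs handle them
directly. A small illustration (not from the thread):

  public class CodePointIteration {
      public static void main(String[] args) {
          String s = "a" + new String(Character.toChars(0x20000)) + "b";
          // Iterate by code point rather than by char, so the surrogate pair
          // for U+20000 is seen as one character instead of two.
          for (int i = 0; i < s.length(); ) {
              int cp = s.codePointAt(i);
              System.out.println("U+" + Integer.toHexString(cp).toUpperCase());
              i += Character.charCount(cp);
          }
          // Prints U+61, U+20000, U+62 -- three code points from four chars.
      }
  }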