Right, the charset must agree with the charset of the program that wrote the
file.
-- Jack Krupansky
-----Original Message-----
From: Shawn Heisey
Sent: Monday, July 08, 2013 7:43 PM
To: solr-user@lucene.apache.org
Subject: Re: Indexing fails for docs with high Latin1 chars
On 7/8/2013 4:43 PM, John Randall wrote:
I'm new to Solr, so I'm probably missing something. So far I've
successfully indexed .xml docs with low ASCII chars. However, when I try to
add a doc that has Latin1 chars with diacritics, it fails. I've tried
using the Jetty exampledocs post.jar, as well as using curl and directly
from a browser. All three of the following methods work fine when the docs
contain ASCII 32-126:
From a browser:
http://localhost:8080/solr/update/?stream.file=c:/solr/tml/exampledocs/57917486.xml&stream.contentType=application/xml
Using cURL:
curl "http://localhost:8080/solr/update/?commit=true&stream.file=c:/solr/tml/exampledocs/57917486.xml&stream.contentType=application/xml"
Using post.jar from exampledocs directory
java -jar -Durl=http://localhost:8080/solr/update post.jar 57917486
java -jar -Durl=http://localhost:8080/solr/update post.jar 57917486.xml
I've tried other things: e.g., I've added the following line to the Tomcat
server.xml file, <Connector .../> section.
URIEncoding="UTF-8"
I've also copied some characters out of the utf8-example.xml file that
came with the Jetty app. It still fails. I also changed the offending
characters to their unicode equivalent: e.g., N with tilde to Ñ and
Ñ without success. For N with tilde and e with acute I get the
following message:
HTTP Status 400 - Invalid UTF-8 middle byte 0x4f (at char #159, byte #37)
________________________________
type Status report
message Invalid UTF-8 middle byte 0x4f (at char #159, byte #37)
description The request sent by the client was syntactically incorrect.
________________________________
Apache Tomcat/7.0.40
The file I am trying to add is as follows:
<?xml version="1.0" encoding="UTF-8"?>
<add>
<doc>
<field name="id">57917486</field>
<field name="descrip_fw">NIÑO VOLANTE YOUNG FLYER</field>
</doc>
</add>
If I use your xml file (copy/paste from your email), changing the field
names so it's compatible with my index, it works with Solr 4.4-SNAPSHOT:
[root@bigindy5 ~]# java "-Durl=http://localhost:8982/solr/s0live/update/" -jar /index/src/branch_4x/solr/example/exampledocs/post.jar input.xml
SimplePostTool version 1.5
Posting files to base url http://localhost:8982/solr/s0live/update/ using content-type application/xml..
POSTing file input.xml
1 files indexed.
COMMITting Solr index changes to http://localhost:8982/solr/s0live/update/..
Time spent: 0:00:01.385
One thing to note: Solr requires UTF-8 for its input. If anything in
the chain (text editor, software outputting the XML, etc.) is using
Latin1 rather than UTF-8, that could explain the problem.
The hex representation of the UTF-8 character for the N with a tilde
accent is C3 91 -- two bytes. I have verified that this is what is in
my XML file.
I am betting that your file actually contains the Latin1 representation,
which is a single byte (hex D1). When that byte is interpreted as UTF-8,
it looks like the first byte of a two-byte sequence, so Java expects the
next byte to finish out the character. The next byte is a capital O, or
hex 4F, which is not a valid continuation byte and matches your error
message.
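To illustrate the byte-level difference, here is a small Python sketch (Python is just a convenient choice here, it is not part of Solr or post.jar). It encodes the same text both ways and then tries to read the Latin1 bytes as UTF-8. The variable names are made up for the example.

```python
# Illustrative only: the same text encoded two ways.
text = "NI\u00d1O"  # \u00d1 is N with tilde

utf8_bytes = text.encode("utf-8")      # N with tilde becomes two bytes: C3 91
latin1_bytes = text.encode("latin-1")  # N with tilde becomes one byte: D1

print(utf8_bytes.hex(" "))
print(latin1_bytes.hex(" "))

# Decoding the Latin1 bytes as UTF-8 fails at the D1 byte, because the
# byte that follows it (4F, capital O) is not a valid continuation byte.
decode_error = None
try:
    latin1_bytes.decode("utf-8")
except UnicodeDecodeError as err:
    decode_error = err
    print(err)
```

Python words its error differently from the Tomcat page above, but the cause is the same kind of mismatch.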
The entities that you are trying, like "&Ntilde;", are HTML entities.
Those entities do not work in XML. XML has a very restricted list of
valid entities, including &lt; which is the < character.
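A quick way to see the difference is to feed small snippets to an XML parser. The sketch below uses Python's standard library parser, not the parser Solr uses, so the exact error text will differ. The helper name `parses` and the `<field>` snippets are made up for the example.

```python
import xml.etree.ElementTree as ET


def parses(xml_text):
    """Return the parsed element text, or None if the XML is rejected."""
    try:
        return ET.fromstring(xml_text).text
    except ET.ParseError:
        return None


# An HTML-only named entity is undefined in plain XML, so parsing fails.
print(parses("<field>NI&Ntilde;O</field>"))

# A predefined XML entity is accepted.
print(parses("<field>a &lt; b</field>"))

# A numeric character reference is part of XML itself and is accepted.
print(parses("<field>NI&#209;O</field>"))
```

Even where a numeric reference parses, any raw Latin1 bytes elsewhere in the file would still trip the UTF-8 decoder.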
Perhaps if you used Jack's advice, but told it that it was Latin1
instead of UTF-8, it would convert the character to UTF-8 for you.
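Another option, different from declaring the charset on the request, is to re-encode the file before posting it. This is only a rough sketch in Python and it assumes the file really is Latin1. The function name and file names are hypothetical.

```python
from pathlib import Path


def latin1_to_utf8(src, dst):
    """Re-encode a file from Latin1 to UTF-8.

    Assumes src really is Latin1. Every byte sequence is valid Latin1,
    so a wrong guess produces wrong characters rather than an error.
    """
    text = Path(src).read_bytes().decode("latin-1")
    Path(dst).write_bytes(text.encode("utf-8"))
```

For example, latin1_to_utf8("57917486.xml", "57917486-utf8.xml") would write a copy whose bytes match the encoding="UTF-8" declaration already in the file.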
Thanks,
Shawn